Usage

RECIPE

REsilient ClassifIcation Pipeline Evolution


Running

After installation is complete, you can run the algorithm to generate the best pipeline for the input dataset. To do so, execute the following command from the root folder of the source code:

python2 exec.py -dTr DATATRAIN -dTe DATATEST

The project you downloaded contains a folder named datasets. For example, to run the algorithm on the iris dataset:

python2 exec.py -dTr ./datasets/iris/iris-Training0.csv -dTe ./datasets/iris/iris-Test0.csv

Note

The input data must be in CSV format, regardless of the file's extension.

RECIPE offers other arguments that can be set by the user:

| Argument | Parameter | Valid Values | Effect |
|---|---|---|---|
| -s or --seed | SEED | Positive Integer | Sets the seed of the algorithm for reproducibility. Default: 1 |
| -c or --config | CONFIG | String | Path to a configuration file that defines the parameters of the GP. Default: 'config/gecco2015-cfggp.ini' |
| -dTr | DATATRAIN | String | Path to the file containing the data used to train the pipeline methods |
| -dTe | DATATEST | String | Path to the file containing the data used to test the pipeline methods |
| -nc | NUMBER OF CORES | Positive Integer | Number of cores to use during the algorithm's execution. Default: 1 |
| -ft | FULL TIMEOUT | Positive Integer | Total time budget (in seconds) for the whole evolutionary process. Default: 0 (i.e., only the generation count is considered) |
| -t | TIMEOUT | Positive Integer | Time budget (in seconds) for evaluating each individual (i.e., a pipeline) of the GP. Default: 300 |
| -mr | MUTATION RATE | Positive Float | Mutation rate of the evolutionary algorithm (max=1.0). Default: 0.1 |
| -cr | CROSSOVER RATE | Positive Float | Crossover rate of the evolutionary algorithm (max=1.0). Default: 0.9 |
| -ps | POPULATION SIZE | Positive Integer | Size of the initial population of the evolutionary algorithm. Default: 30 |
| -gc | GENERATION COUNT | Positive Integer | Maximum number of generations of the evolutionary algorithm. Default: 100 |
| -gr | GRAMMAR | String | Grammar used by RECIPE during its evolutionary process. Default: 'bnf/new_ml.bnf' |
| -en | EXPORT_NAME | String | Name of the file to which the pipeline is exported. Default: 'pipeline.py' |
| -v | VERBOSITY | Positive Integer | Verbosity level of the output: 3 (full), 2 (intermediate), 1 (basic) |
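
The optional arguments can be combined freely with the two dataset arguments. As a minimal sketch, the helper below assembles such a command line as a list of strings. The function name `build_recipe_command`, the `options` dictionary, and the chosen values (seed 42, population 50, 20 generations, export name `iris_pipeline.py`) are illustrative assumptions, not part of RECIPE; only the flags themselves come from the list of arguments.

```python
def build_recipe_command(train_path, test_path, options=None, python="python2"):
    """Return the argument list for one RECIPE run.

    `options` maps flags from the argument list (e.g. "-s", "-ps") to values.
    This helper is hypothetical; RECIPE itself only provides exec.py.
    """
    command = [python, "exec.py", "-dTr", train_path, "-dTe", test_path]
    # Sort the flags so the resulting command is deterministic.
    for flag, value in sorted((options or {}).items()):
        command.extend([flag, str(value)])
    return command


# Illustrative values only; adjust them to your own experiment.
command = build_recipe_command(
    "./datasets/iris/iris-Training0.csv",
    "./datasets/iris/iris-Test0.csv",
    options={"-s": 42, "-ps": 50, "-gc": 20, "-en": "iris_pipeline.py"},
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.call(command)` from the root folder of the source code, or the printed line can be pasted into a shell.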

Configuring GP

The program comes with a configuration file (in the config folder) that can be used to set the parameters with which the GP is executed. This file defines the mutation and crossover rates, population size, number of generations, and elitism.
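
To check which values a configuration file sets before a run, it can be loaded and printed. The sketch below assumes the file is a standard INI file, as the .ini extension of the default configuration suggests. It makes no assumption about section or key names and simply returns whatever the file contains; `load_config` is a hypothetical helper, not part of RECIPE.

```python
try:
    import configparser  # Python 3
except ImportError:
    import ConfigParser as configparser  # Python 2


def load_config(path):
    """Return {section: {key: value}} for an INI file, with values as strings.

    Assumption: the configuration file uses standard INI syntax.
    """
    # RawConfigParser avoids '%' interpolation surprises in values.
    parser = configparser.RawConfigParser()
    if not parser.read(path):
        raise IOError("could not read config file: %s" % path)
    return {section: dict(parser.items(section)) for section in parser.sections()}
```

For example, `load_config("config/gecco2015-cfggp.ini")` called from the root folder of the source code would return the settings of the default configuration, grouped by section.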

Results

The program generates 3 files:

  1. Evolution-Training: Data on the evolution of individuals on the training data. It is found in the directory 'evolution' and contains the following comma-separated measures: the current generation, the fitness (i.e., f1-weighted) of the worst individual in the population, the average fitness of the population, and the fitness of the best individual in the population. Example of the evolutionary file.
  2. Tracking all individuals: A file mapping each evaluated individual to its fitness (i.e., the evaluated measure). A pipe ("|") separates the individual from its fitness. It is found in the directory 'fit_map'. Example of the tracking file.
  3. Results: Final file containing the best individual found and the values of its metrics. It is found in the directory 'results' and contains the following comma-separated measures: accuracy, precision, recall, and f1 on the learning set; the same four metrics on the validation set, then on the training set, then on the test set; the seed used; and the string representing the best pipeline. The learning and validation sets come from the training set, and all metrics are weighted. Example of the result file.
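
The three file layouts described above can be read back with a few lines of Python. This is a minimal sketch under stated assumptions: each record sits on one line, the files have no header row, the generation is written as an integer, and the pipeline string is the last field of a result line (so it may itself contain commas). All function names and the sample lines are hypothetical and only illustrate the formats.

```python
METRICS = ("accuracy", "precision", "recall", "f1")
SPLITS = ("learning", "validation", "training", "test")


def parse_evolution_line(line):
    """Parse 'generation,worst,average,best' from the evolution file."""
    generation, worst, average, best = line.strip().split(",")
    return {
        "generation": int(generation),
        "worst": float(worst),
        "average": float(average),
        "best": float(best),
    }


def parse_tracking_line(line):
    """Parse 'individual|fitness'; split on the LAST pipe so that an
    individual containing '|' is kept intact (an assumption)."""
    individual, fitness = line.rstrip("\n").rsplit("|", 1)
    return individual.strip(), float(fitness)


def parse_result_line(line):
    """Parse 16 metrics, the seed, and the pipeline string from a result line."""
    fields = line.strip().split(",", 17)  # at most 18 fields
    if len(fields) != 18:
        raise ValueError("expected 18 comma-separated fields, got %d" % len(fields))
    # Order follows the description: four metrics per split, split by split.
    names = ["%s_%s" % (metric, split) for split in SPLITS for metric in METRICS]
    record = {name: float(value) for name, value in zip(names, fields[:16])}
    record["seed"] = int(fields[16])
    record["pipeline"] = fields[17]
    return record


# Made-up sample lines, for illustration only.
print(parse_evolution_line("3,0.50,0.75,0.90"))
print(parse_tracking_line("some individual|0.8"))
```

Splitting the result line with a maximum of 17 splits keeps any commas inside the pipeline string from shifting the numeric fields.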