Usage
RECIPE
REsilient ClassifIcation Pipeline Evolution
Running
After the installation process is complete, you can run the algorithm and generate the best pipeline for the input dataset. To run it, execute the following command from the root folder of the source code:
python2 exec.py -dTr DATATRAIN -dTe DATATEST
The project you downloaded contains a folder named datasets. An example of how to use the algorithm with the iris dataset is:
python2 exec.py -dTr ./datasets/iris/iris-Training0.csv -dTe ./datasets/iris/iris-Test0.csv
Note
The input data must be in CSV format, regardless of the file's extension.
RECIPE offers other arguments that can be set by the user.
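Because the format matters rather than the extension, a quick check before launching a long run can save time. The following is a minimal sketch, not part of RECIPE: the helper name `looks_like_csv` and the sampling limit are assumptions made for illustration. It only verifies that the first rows split on commas into a constant number of columns.

```python
import csv


def looks_like_csv(path, sample_rows=50):
    """Return True if the first rows parse as comma-separated values
    with the same number of columns (more than one) on every row."""
    widths = set()
    with open(path, newline="") as handle:
        for index, row in enumerate(csv.reader(handle)):
            if index >= sample_rows:
                break
            if row:  # ignore blank lines
                widths.add(len(row))
    # A single observed width greater than 1 suggests comma-delimited data.
    return len(widths) == 1 and widths.pop() > 1
```

A file that fails this check (for example, one that is space- or tab-separated) would need converting before being passed to -dTr or -dTe.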
Argument | Parameter | Valid Values | Effect
---|---|---|---
-s or --seed | SEED | Positive Integer | Sets the seed of the algorithm for reproducibility. Default: 1
-c or --config | CONFIG | String | Path to a configuration file that defines the parameters of the GP. Default: 'config/gecco2015-cfggp.ini'
-dTr | DATATRAIN | String | Path to the file containing the data used to train the pipeline methods
-dTe | DATATEST | String | Path to the file containing the data used to test the pipeline methods
-nc | NUMBER OF CORES | Positive Integer | Number of cores to use during the algorithm's execution. Default: 1
-ft | FULL TIMEOUT | Positive Integer | Total time budget, in seconds, for the whole evolutionary process. Default: 0 (i.e., only the generation count limits the run)
-t | TIMEOUT | Positive Integer | Time budget, in seconds, for evaluating each individual (i.e., a pipeline) of the GP. Default: 300
-mr | MUTATION RATE | Positive Float | Mutation rate for the evolutionary algorithm (max=1.0). Default: 0.1
-cr | CROSSOVER RATE | Positive Float | Crossover rate for the evolutionary algorithm (max=1.0). Default: 0.9
-ps | POPULATION SIZE | Positive Integer | Size of the initial population for the evolutionary algorithm. Default: 30
-gc | GENERATION COUNT | Positive Integer | Maximum number of generations for the evolutionary algorithm. Default: 100
-gr | GRAMMAR | String | Grammar to be used by RECIPE during its evolutionary process. Default: 'bnf/new_ml.bnf'
-en | EXPORT_NAME | String | File name used to export the pipeline. Default: 'pipeline.py'
-v | VERBOSITY | Positive Integer | Verbosity level of the output (3-Full, 2-Intermediate, 1-Basic)
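When several of these arguments are combined, it can help to assemble the command programmatically. The sketch below is an illustration only: the function `build_recipe_command` is a hypothetical helper, not something RECIPE provides. It uses only the short flags listed in the table and rejects anything else.

```python
# Short flags taken from the argument table; the helper itself is hypothetical.
KNOWN_FLAGS = {"s", "c", "nc", "ft", "t", "mr", "cr", "ps", "gc", "gr", "en", "v"}


def build_recipe_command(train, test, python="python2", **options):
    """Return the argument list for one RECIPE run.

    `options` maps a short flag without its dash (e.g. s=42) to its value.
    """
    command = [python, "exec.py", "-dTr", train, "-dTe", test]
    for flag, value in options.items():
        if flag not in KNOWN_FLAGS:
            raise ValueError("unknown RECIPE flag: -" + flag)
        command += ["-" + flag, str(value)]
    return command


command = build_recipe_command(
    "./datasets/iris/iris-Training0.csv",
    "./datasets/iris/iris-Test0.csv",
    s=42, nc=4, gc=50,
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the root folder of the source code.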
Configuring GP
The program ships with a configuration file (in the config folder) that sets the parameters used to run the GP. This file defines the mutation and crossover rates, population size, number of generations, and elitism.
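To see which values a configuration file actually sets, it can be loaded and printed. The following is a rough sketch that assumes the .ini file can be parsed by Python's standard `configparser`; that assumption is based only on the file extension. The helper name `read_config` is hypothetical, and no section or key names are assumed.

```python
import configparser


def read_config(path):
    """Return {section: {key: value}} for an INI-style configuration file.

    Assumes the file is parseable by configparser; values stay as strings.
    """
    parser = configparser.ConfigParser()
    with open(path) as handle:
        parser.read_file(handle)
    return {section: dict(parser[section]) for section in parser.sections()}
```

Calling `read_config("config/gecco2015-cfggp.ini")` from the source root would then list every section with its keys, which is a convenient starting point before editing a copy and passing it with -c.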
Results
The program generates 3 files:
- Evolution-Training: Data on the evolution of individuals using the training data. It is found in the directory 'evolution' and contains the following comma-separated measures: the current generation, the fitness of the worst individual in the population, the average fitness of the population, and the fitness of the best individual in the population. Fitness is the weighted F1 score (f1-weighted) in all three cases. Example of the evolutionary file.
- Tracking all individuals: A file map containing each evaluated individual and its fitness (i.e., the evaluated measure). A pipe ("|") separates the individual from its fitness. It is found in the directory 'fit_map'. Example of the tracking file.
- Results: The final file, containing the best individual found and the values of its metrics. It is found in the directory 'results' and contains the following comma-separated measures, in order: accuracy, precision, recall, and f1 on the learning set; accuracy, precision, recall, and f1 on the validation set; accuracy, precision, recall, and f1 on the training set; accuracy, precision, recall, and f1 on the test set; the seed used; and the string representing the best pipeline. The learning and validation sets come from the training set, and all metrics are weighted. Example of the result file.
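The three layouts described above can be read with a few lines of Python. This is a sketch under stated assumptions rather than RECIPE's own tooling: the function names are hypothetical, fitness values are assumed to be written as plain numbers, and a pipeline string is assumed to possibly contain commas or pipes itself, so the result line is split at most 17 times and the tracking line is split at its last pipe.

```python
import csv

METRICS = ("accuracy", "precision", "recall", "f1")
SPLITS = ("learning", "validation", "training", "test")


def read_evolution(handle):
    """Return [(generation, worst, average, best), ...] from an evolution file."""
    rows = []
    for row in csv.reader(handle):
        try:
            generation = int(float(row[0]))
            worst, average, best = (float(value) for value in row[1:4])
        except (ValueError, IndexError):
            continue  # skip headers, blank lines, or malformed rows
        rows.append((generation, worst, average, best))
    return rows


def read_fit_map(handle):
    """Return {individual: fitness}, splitting each line at its last pipe."""
    fitness_by_individual = {}
    for line in handle:
        individual, separator, fitness = line.rstrip("\n").rpartition("|")
        if separator:  # ignore lines without a pipe
            fitness_by_individual[individual.strip()] = float(fitness)
    return fitness_by_individual


def read_result_line(line):
    """Return (metrics, seed, pipeline) from one line of a results file.

    `metrics` maps each data split to its four weighted scores.
    """
    fields = line.rstrip("\n").split(",", 17)  # keep commas inside the pipeline
    if len(fields) != 18:
        raise ValueError("expected 16 metrics, a seed, and a pipeline string")
    values = [float(value) for value in fields[:16]]
    metrics = {
        split: dict(zip(METRICS, values[4 * i:4 * i + 4]))
        for i, split in enumerate(SPLITS)
    }
    return metrics, int(fields[16]), fields[17].strip()
```

With these helpers, `read_result_line(line)[0]["test"]["f1"]` gives the weighted F1 score of the best pipeline on the test set, and the rows from `read_evolution` can be plotted to inspect convergence across generations.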