2. Medidas preventivas, correctoras y compensatorias de los proyectos
2.5 Justificación de la alternativa seleccionada
Data points (268) in the training set dictate a maximum of thirteen descriptors, temperature, five for donor, three for acceptors, two for activator and two for solvents respectively to avoid any overfitting to get statistically meaningful results.107 The donors, acceptors, activators and solvent, which were included in the training set are shown in Figure 4.10. The training set is comprised of experimental conditions (reaction conditions, yield, and stereoselectivity) as well as the numerical descriptors calculated for donors, acceptors, activators and solvents by using DFT calculation in SPARTAN software as described in the Sections 4.2 and 4.3. The reaction condition was kept constant, having previously been found to have no impact on stereoselectivity (Scheme 4.1). Entries 1-106, 116-134, 136- 141, 143-196, 220-302 in Table 6.1 (Chapter 6) were included in 268 training data points set.
Donor
70
Activator
Solvent
Figure 4.10:The compounds included in the training set.
4.6 Machine learning: ‘under the hood’
The machine learning methodology used in the research could be broadly categorized into five individual sections:
1) Data input and preconditioning section 2) Model input section
3) Machine learning and data processing section 4) Prediction section
5) Data output section
1) Data input and preconditioning section
Data input and preconditioning section consists of codes to import the training set data having both descriptor input data and response data into different arrays using the ‘xlsread’ function of MATLAB. Different arrays were also created for other functions, such as storing the model output data in the preconditioning section. Validation and experimental data were also stored into separate arrays to be used for prediction and validation purposes.
2) Model input section
After the import of the training data, the software generates the learner function, which is based on a template regression model tree. Model tree was generated using
71
the “templatetree” function in the MATLAB code. The template tree function could be invoked as shown below:
t = templateTree('NumPredictorsToSample', 'PredictorSelection','Prune','Surrogate',);
The templatetree function generates decision trees and in this case, regression trees based on a number of nested functions including “NumPredictorToSample” which selects the appropriate number of predictors to the random sampling of data.
3) Machine learning and data processing section
The optimal template tree as described in the previous section is used in the Random Forest algorithm as learners. The trees were randomly selected and grown using algorithm based on “Bagging” and “LSBoost” type algorithm.
3.1) Bagging algorithm
Bagging algorithm is an ensemble learning technology called Bootstrap Aggregation. With this technology, random replica models or decision trees are grown on the all the samples in the training set. With this technique, many replica models could be generated from the same training set. For generating the splits in the decision tree, the predictor is randomly selected. This random selection of predictors leads to what is called Random Forest.
3.2) LSBoost
This algorithm is used here for regression based ensemble learning. The least square boosting is done to fit the regression trees with the observed data. At every iteration, the algorithm works by fitting a new learner with the observed difference between prediction from the model and observed data.108 This fitting is achieved by
minimization of the mean-square-error (MSE).
3.3) Tuning of hyperparameters
The learning performance of machine learning algorithms can be enhanced quite significantly by choosing proper hyper-parameters. However, this optimization is still empirical in nature and often depends on the dataset being optimized. As an alternative, automated hyperparameter tuning is becoming increasingly important.106
72
Here we have used the algorithm. “Expected-improvement-plus” in MATLAB for automated tuning of hyperparameters.
4) Prediction section
Upon completion of the training, the model output is stored in a variable ‘Mdl’. The prediction algorithm ‘predict’ can now be used to along with the trained model stored in ‘Mdl’ to predict new experimental results based on the trained model. Once the prediction is completed, the predictor importance is calculated by summation of the variation of Mean square errors (MSE) which generates from split of each predictor and dividing this quantity by the total number of branch nodes. Separate arrays are created for storing the prediction results and the predictor importance.
5) Data output section
The arrays generated for storing the model output in the form of prediction data and predictor importance along with R2 are exported and converted into table data types in
this section. These tables are then written to Microsoft Excel datasheets using the ‘xlswrite’ function of MATLAB.
4.7 Quantifying of accuracy and benchmarking model with R2 and RMSE
In all the results that follow, the quantification of accuracy and benchmarking of Machine Learning models are demonstrated by R2 (coefficient of determination) and Root
Mean Square Error (RMSE) values. The ideal value for the coefficient of determination is 1 and RMSE is 0. These are calculated with Experimental (X) and Predicted (Y) values by the equations given below and it is worth mentioning here that both of these values need to be taken into account when judging the prediction ability of the model. Utilizing only one of these values is not enough, as demonstrated in Figure 4.11. The dashed line and the solid line represents prediction values and experimental values simultaneously. Figure 4.11a represents a model for which the RMSE is 9.7, however the R2 is 0.2 which is poor. Therefore, in this case the model performs poorly, compared to the model represented by Figure 4.11b. The increase in RMSE is negligible here compared to the model in Figure 4.11a and R2 is 1 which is ideal. However, the model represented by Figure 4.11b always
73
case scenario. Both the RMSE and R2 values are ideal, however, this type of model accuracies are seldom observed in practical situations.
Figure 4.11: (a)-(c) represents three different models and quantifies the model performance with the help of RMSE and R2 values.