A total of 385 constitutional, geometrical, topological, electrostatic, and quantum-chemical
descriptors were considered for the study. The dataset was divided into training (n=55),
validation (n=18), and test (n=18) sets using the Kennard-Stone algorithm (R.W.Kennard, 1969).
This split maximizes coverage of the chemical space during model training and allows accurate
testing within that space.
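The selection step can be sketched as follows. This is a minimal illustration of the Kennard-Stone procedure on Euclidean distances, not the code used in this work. The descriptor matrix `X_demo` is random placeholder data, and drawing the validation set by a second Kennard-Stone pass over the remaining compounds is an assumption, since the text does not say how the three-way split was obtained.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by the Kennard-Stone procedure."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the remaining sample that is farthest from the selected set.
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Placeholder data: 91 compounds, 5 descriptors (hypothetical values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(91, 5))

train_idx = kennard_stone(X_demo, 55)
rest = [k for k in range(91) if k not in train_idx]
# Assumption: second pass on the remainder gives the validation set.
val_idx = [rest[k] for k in kennard_stone(X_demo[rest], 18)]
test_idx = [k for k in rest if k not in val_idx]
print(len(train_idx), len(val_idx), len(test_idx))
```

Because each new compound is the one farthest from those already chosen, the training set tends to span the edges of the descriptor space, leaving interior compounds for validation and testing.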
The DE-BPSO algorithm is used to select the individual descriptors. Machine learning models are
then developed from the descriptor set selected by each individual particle. The fitness
(𝑐𝑜𝑠𝑡) of the selected descriptors is measured from the RMSE, the sample size (𝑚) of both the training
(𝑡) and validation (𝑣) sets, the number of descriptors in the model (𝑛), and a parsimony penalty
factor.
Over-fitting vs. Under-fitting
One of the major goals of any machine learning method is to provide solutions that perform well
not only on the cases used for learning but also on cases never seen before. This ability is known as
generalization, that is, how well a model performs on new data, and failure to generalize is called
over-fitting. Over-fitting occurs when a solution performs well on the training cases but poorly on the
testing cases. When over-fitting starts to occur, the search process is stopped. Over-fitting indicates
that the underlying relationships of the whole data were not learned; instead, the model learned a set of
relationships that exist only in the training cases and have no
correspondence over the whole set of known cases.
In simple terms, over-fitting is when the model is too complex and test errors are large although
training errors are small. There should always be a balance between good classification of the
training set and good classification of future objects (generalization performance). Over-fitting
means fitting the training data too closely, which reduces generalization performance. This is
especially important in high dimensions or with complex non-linear classifiers. On the other hand,
under-fitting is when the model is too simple and both training and test errors are large. Both
over-fitting and under-fitting lead to poor predictions on new data sets.
Understanding these two phenomena makes it possible to thread the needle between the two
extremes. It is in this gap that a model with predictive power on the
validation set lies.
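The contrast between the two extremes can be illustrated with a small synthetic example. The sketch below fits polynomials of increasing degree to noisy samples of a sine curve; the data, noise level, and degrees are arbitrary choices for illustration and are unrelated to the QSAR data set.

```python
import numpy as np

rng = np.random.default_rng(42)

def target(x):
    """Underlying relationship that the models try to learn."""
    return np.sin(2 * np.pi * x)

# Small noisy training set and a separate test set from the same process.
x_train = np.linspace(0, 1, 15)
y_train = target(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.02, 0.98, 50)
y_test = target(x_test) + rng.normal(scale=0.2, size=x_test.size)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

results = {}
for degree in (1, 4, 12):
    # Least-squares polynomial fit of the given degree.
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    results[degree] = (rmse(y_train, model(x_train)), rmse(y_test, model(x_test)))
    print(f"degree {degree:2d}: train RMSE = {results[degree][0]:.3f}, "
          f"test RMSE = {results[degree][1]:.3f}")
```

The training error can only decrease as the degree grows, so it cannot by itself reveal over-fitting; the comparison with the held-out error is what separates a too-simple model (both errors large) from a too-complex one (small training error, larger test error).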
The role of the penalty factor is to control the balance between over-fitted and under-fitted
models; its value must be tuned to suit a given dataset. Previous trial-and-error studies
on this data set found a value of 3.3 to be suitable for the EA to
obtain predictive QSAR models.
To evaluate the MLR models generated from the descriptors selected by the EA, a fitness
function is used. The fitness function is designed to prevent over-fitting and to minimize the
number of descriptors in the model.
The top-ranked models with the lowest cost are analyzed and interpreted to understand the
physicochemical properties of dimeric aryl diketo acids that are conducive to biological activity.
In our simulations we found the optimal parameter values ( = 0.004, F = 0.7, CR = 0.7; Table 1, Averaged Fitness Values of
DE-BPSO Parameters) and used them to develop models for the analysis of aryl β-diketo acids.
The population size in the particle swarm is 50 individuals, and the DE-BPSO feature selection algorithm was run for 1000 generations.
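For orientation on what F and CR control, here is a minimal sketch of one generic differential evolution step (the classic DE/rand/1/bin scheme) on a continuous population. It is not the hybrid DE-BPSO update used in this work, whose details are not given in this section; the function name and the random population are illustrative assumptions.

```python
import numpy as np

def de_rand_1_bin(population, F=0.7, CR=0.7, rng=None):
    """Build one trial vector per individual with generic DE/rand/1/bin."""
    rng = np.random.default_rng() if rng is None else rng
    pop = np.asarray(population, dtype=float)
    n_individuals, n_dims = pop.shape
    if n_individuals < 4:
        raise ValueError("DE/rand/1 needs at least 4 individuals")
    trials = np.empty_like(pop)
    for i in range(n_individuals):
        # Three distinct individuals, all different from the current one.
        candidates = [k for k in range(n_individuals) if k != i]
        a, b, c = rng.choice(candidates, size=3, replace=False)
        # Mutation: F scales the difference vector.
        mutant = pop[a] + F * (pop[b] - pop[c])
        # Binomial crossover: CR is the per-component probability of taking the mutant.
        cross = rng.random(n_dims) < CR
        cross[rng.integers(n_dims)] = True  # always inherit at least one mutant component
        trials[i] = np.where(cross, mutant, pop[i])
    return trials

# Hypothetical population: 50 individuals, one component per descriptor (385).
rng = np.random.default_rng(1)
population = rng.random((50, 385))
trials = de_rand_1_bin(population, F=0.7, CR=0.7, rng=rng)
print(trials.shape)
```

A larger F makes the mutation steps longer, and a larger CR copies more components from the mutant into each trial vector.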
Specifically, each model is evaluated in terms of the coefficient of determination (R²), mean
squared error (MSE), and root-mean-squared error (RMSE). To calculate R² for all models, we
use the following definition, where yᵢ is the observed value, ŷᵢ the predicted value, and
ȳ the target's mean value:
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
It should be noted that this version of R² is negative if the predictions are poorer than
always forecasting the mean. In essence, R² represents the proportion of the variation of the
dependent variable that is explained by the model, so it is considered a single measure of
predictive accuracy.
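The three metrics can be computed as in the sketch below. It assumes the usual definition R² = 1 - SS_res / SS_tot, which is consistent with the remark that the value becomes negative when predictions are worse than the mean; the function name and the example values are illustrative only.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, MSE and RMSE for observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual_ss = np.sum((y_true - y_pred) ** 2)        # SS_res
    total_ss = np.sum((y_true - y_true.mean()) ** 2)    # SS_tot, around the observed mean
    mse = residual_ss / y_true.size
    return {
        "r2": float(1.0 - residual_ss / total_ss),
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
    }

y_obs = [5.0, 6.0, 7.0]                                  # hypothetical pIC50 values
perfect = regression_metrics(y_obs, [5.0, 6.0, 7.0])     # R^2 = 1
mean_only = regression_metrics(y_obs, [6.0, 6.0, 6.0])   # R^2 = 0
reversed_fit = regression_metrics(y_obs, [7.0, 6.0, 5.0])  # worse than the mean: R^2 = -3
print(perfect["r2"], mean_only["r2"], reversed_fit["r2"])
```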
MLR models with high correlation (R² > 0.6), high predictive correlation with the validation
(R²_v > 0.5) and test (R²_test > 0.5) sets, and cross-validated quality of fit (Q² > 0.5) are
considered for analysis.
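These acceptance criteria can be sketched in code as follows. The text does not define Q², so the sketch assumes the common leave-one-out form Q² = 1 - PRESS / SS_tot for an ordinary least-squares model with intercept; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def q2_leave_one_out(X, y):
    """Leave-one-out cross-validated Q^2 for an MLR model with intercept (assumed definition)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept column
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        # Refit without sample i, then predict the left-out sample.
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    return float(1.0 - press / np.sum((y - y.mean()) ** 2))

def passes_thresholds(r2_train, r2_val, r2_test, q2):
    """Apply the cut-offs quoted in the text."""
    return bool(r2_train > 0.6 and r2_val > 0.5 and r2_test > 0.5 and q2 > 0.5)

# Synthetic, exactly linear data: Q^2 should be essentially 1.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(20, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1]
q2_demo = q2_leave_one_out(X_demo, y_demo)
print(round(q2_demo, 6))
```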
The model with the lowest cost has R²_train = 0.8967.
The model has 5 descriptors:
pIC50 = 5.134 + 0.272 X1 - 0.789 X2 + 0.452 X3 + 0.558 X4 + 0.284 X5
Model performance:
Variable   Coefficient   Descriptor
X1          0.272        Relative number of aromatic bonds
X2         -0.789        Max atomic state energy for a H atom
X3          0.452        Relative number of C atoms
X4          0.558        Relative number of single bonds
X5          0.284        ESP-RNCG Relative negative charge (QMNEG/QTMINUS)
Table 3: Selected Descriptors in the β-Diketo Acid QSAR Model, 91 β-diketo acids (DE-BPSO-MLR)
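As a worked illustration, the reported equation can be applied to descriptor values as in the sketch below. The coefficients and signs are taken from the equation as written; the input values are made up, and whether the descriptors were scaled before fitting is not stated in the text, so the inputs here are assumed to be on the same scale as those used for fitting.

```python
import numpy as np

INTERCEPT = 5.134
# Coefficients for X1..X5, with signs as in the reported equation.
COEFFICIENTS = np.array([0.272, -0.789, 0.452, 0.558, 0.284])

def predict_pic50(descriptors):
    """Evaluate the reported MLR equation for one or more compounds."""
    X = np.atleast_2d(np.asarray(descriptors, dtype=float))
    if X.shape[1] != COEFFICIENTS.size:
        raise ValueError("expected 5 descriptor values per compound (X1..X5)")
    return INTERCEPT + X @ COEFFICIENTS

# Hypothetical descriptor rows: all zeros returns the intercept,
# all ones returns the intercept plus the sum of the coefficients.
predictions = predict_pic50([[0, 0, 0, 0, 0], [1, 1, 1, 1, 1]])
print(predictions)
```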
Based on this QSAR model, the descriptors with the highest influence on the biological activity
of aryl β-diketo acids are the relative number of single bonds and the relative number of C atoms.
Model Accuracy plot:
Figure 11: Results of using the DE-BPSO algorithm and Multiple Linear Regression (DE-BPSO-MLR) to develop the QSAR model (91 aryl β-diketo acids)
Comparison of our work with previous research:
Since DE-BPSO is an entirely new hybrid algorithm that was introduced only recently, no
significant work has previously been done using it. The only earlier study
available for comparison was on 37 dimeric aryl β-diketo acids (training set = 23, validation
set = 7, and test set = 7), using a linear model (MLR) and the DE-BPSO algorithm as the feature
selection method (GeneKo,2012). We therefore compare the result of that study with our
result. The comparison between this work (the data set of 91 dimeric aryl β-diketo acids) and
the previous one (the data set of 37 dimeric aryl β-diketo acids) shows an improvement in the
value of R² for the training, validation, and test sets.
The final QSAR model and R² values obtained from the previous research are:
pIC50 = 5.134 - 0.360 X1 + 0.628 X2 + 0.619 X3 + 0.157 X4 + 0.242 X5
R² = 0.886 (n = 23), R²_v = 0.765 (n = 7), R²_test = 0.722 (n = 7)
Table 4: Descriptors used in QSAR Model (Data set of 37 β-Diketo Acids)
Based on this QSAR model, hydrophobicity of the compounds (X2) and partial positive charges
on the hydrogen atoms on the molecular surface (X3) have the highest significance in the
biological activities (GeneKo,2012).
Although MLR is a common modelling method for QSAR development, comparing the results of
the linear model with non-linear models such as Random Forest would lead to a more accurate
analysis and comparison.