
A total of 385 constitutional, geometrical, topological, electrostatic, and quantum-chemical descriptors were considered for the study. The dataset was divided into a training set (n = 55), a validation set (n = 18), and a test set (n = 18) using the Kennard-Stone algorithm (R.W.Kennard, 1969). This selection spreads the training compounds across the chemical space, so the model is trained on the full range of that space and tested at points within it.
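The Kennard-Stone selection can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes Euclidean distance on the descriptor matrix, the random matrix `X` is a placeholder for the real 91 x 385 descriptor table, and reusing the same rule to pick the validation set is an assumption, since the text does not say how the non-training compounds were divided.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Seed the selection with the two most distant samples.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the sample that is farthest from everything selected so far.
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(91, 5))  # placeholder descriptors (hypothetical data)

train_idx = kennard_stone(X, 55)
rest = [k for k in range(len(X)) if k not in train_idx]
# Assumption: the validation set is picked by the same rule from the rest.
val_idx = [rest[k] for k in kennard_stone(X[rest], 18)]
test_idx = [k for k in rest if k not in val_idx]
```

Because each new compound is the one farthest from those already chosen, the training set covers the extremes of the descriptor space first and fills in the interior afterwards.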

The DE-BPSO algorithm is used to select the individual descriptors. Machine learning models are then developed from the descriptor set selected by each individual particle. The fitness (𝑐𝑜𝑠𝑡) of the selected descriptors is measured from the RMSE and the sample size (𝑚) of both the training (𝑡) and validation (𝑣) sets, the number of descriptors in the model (𝑛), and a parsimony penalty factor.

Over-fitting vs. Under-fitting

One of the major goals of any machine learning method is to provide solutions that perform well not only on the cases used for learning but also on cases never seen before. This ability is known as generalization (how well a model performs on new data), and the failure to achieve it is called over-fitting. Over-fitting occurs when a solution performs well on the training cases but poorly on the testing cases. When over-fitting starts to occur, the search process is stopped. Over-fitting indicates that the underlying relationships of the whole data were not learned; instead, the model learned relationships that exist only in the training cases and do not hold across all known cases.

In simple terms, over-fitting is when the model is too complex and test errors are large although training errors are small. There should always be a balance between good classification of the training set and good classification of future objects (generalization performance). Over-fitting means fitting the training data too closely, which reduces generalization performance. This is especially important in high dimensions or with complex non-linear classifiers. Under-fitting, on the other hand, is when the model is too simple and both training and test errors are large. Both over-fitting and under-fitting lead to poor predictions on new data sets.
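The contrast between the two failure modes can be illustrated with a small experiment. The sketch below is an assumption-laden toy example, unrelated to the QSAR data: it fits polynomials of increasing degree to noisy samples of a sine curve (a regression task rather than classification) and reports training and test error for each degree.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic data (hypothetical): a sine curve plus Gaussian noise.
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

errors = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit on the training cases only.
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),  # training error
        mse(y_test, np.polyval(coeffs, x_test)),    # test error
    )
    print(f"degree={degree} train={errors[degree][0]:.4f} test={errors[degree][1]:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree polynomial contains the lower-degree ones as special cases. The test error is the quantity that reveals whether the extra flexibility is learning the curve or the noise.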

Understanding these two phenomena makes it possible to thread the needle and work in the space between the two extremes. It is in this gap that a model with predictive power on the validation set lies.

The role of the penalty factor is to control the balance between over-fit and under-fit models, and it must be tuned to find a value suited to a given dataset. Previous research established by trial and error that, for this data set, a penalty factor of 3.3 is suitable for the EA to obtain predictive QSAR models.

To evaluate the MLR models generated from the descriptors selected by the EA, we use a fitness function. The fitness function is designed to prevent over-fitting and to minimize the number of descriptors in the model.
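To make the idea of a parsimony-penalized cost concrete, here is a hypothetical sketch. The formula is an assumption chosen only to show how the listed ingredients (RMSE and sample size of the training and validation sets, number of descriptors, penalty factor) can interact; it is not the cost expression used in this work, and the function and argument names are invented for illustration.

```python
import math

def penalized_cost(rmse_t, rmse_v, m_t, m_v, n, penalty):
    """Hypothetical parsimony-penalized cost (NOT the source's formula).

    rmse_t, rmse_v: RMSE on the training and validation sets
    m_t, m_v:       sample sizes of the training and validation sets
    n:              number of descriptors in the model
    penalty:        parsimony penalty factor
    """
    # Pooled squared error over training and validation samples.
    sse = m_t * rmse_t ** 2 + m_v * rmse_v ** 2
    # Each descriptor consumes `penalty` effective degrees of freedom.
    dof = m_t + m_v - penalty * n - 1
    if dof <= 0:
        return math.inf  # too many descriptors for the available samples
    return math.sqrt(sse / dof)
```

Under this assumed form, two models with the same errors are ranked by size: the one with more descriptors gets the higher cost, and a larger penalty factor makes each added descriptor more expensive.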

The top-ranked models with the lowest cost are analyzed and interpreted to understand the physicochemical properties of dimeric aryl diketo acids that are conducive to biological activity.

In our simulations we found the optimal parameter values ( = 0.004, F = 0.7, CR = 0.7) and used them to develop models for the analysis of aryl β-diketo acids (Table 1: Averaged Fitness Values of DE-BPSO Parameters). The population size of the particle swarm was 50 individuals, and the DE-BPSO feature selection algorithm ran for 1000 generations.

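Specifically, each model is evaluated in terms of the coefficient of determination (R²), the mean squared error (MSE), and the root-mean-squared error (RMSE). To calculate R² for all models, we use the following definition, where yᵢ are the observed values, ŷᵢ the predicted values, and ȳ represents the target's mean value:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$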

It should be noted that this version of R² is negative when predictions are poorer than always forecasting the mean. In essence, R² represents the proportion of the variation in the dependent variable that is explained by the model, so it is used as a single summary measure of predictive accuracy.
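The three metrics can be computed in a few lines. The sketch below assumes the usual definition R² = 1 - SS_res / SS_tot, which matches the remark that R² turns negative when predictions are worse than the mean; the function name is illustrative.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, MSE and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mse = float(np.mean(residual ** 2))
    rmse = float(np.sqrt(mse))
    # Residual and total sums of squares; the total is taken about the mean.
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mse": mse, "rmse": rmse}
```

Predicting the mean for every compound gives R² = 0 under this definition, so any negative value signals a model that does worse than that trivial baseline.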

MLR models with high correlation (R² > 0.6), high predictive correlation with the validation (R²_v > 0.5) and test (R²_test > 0.5) sets, and cross-validated quality of fit (Q² > 0.5) are considered for analysis.
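These acceptance criteria can be expressed as a simple filter. The sketch below is illustrative only: it assumes Q² is computed by leave-one-out cross-validation as 1 - PRESS / SS_tot, which is a common convention but is not stated in the text, and both function names are hypothetical.

```python
import numpy as np

def q2_leave_one_out(X, y):
    """Leave-one-out Q^2 for an MLR model (assumed cross-validation scheme)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        # Refit the linear model without sample i, then predict sample i.
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def passes_thresholds(r2_train, r2_val, r2_test, q2):
    """Apply the cut-offs quoted in the text."""
    return r2_train > 0.6 and r2_val > 0.5 and r2_test > 0.5 and q2 > 0.5
```

A model is kept for interpretation only if all four statistics clear their cut-offs, so a high training R² alone is not enough.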

The model with the lowest cost function has R²_train = 0.8967.

The model has 5 descriptors:

pIC50 = 5.134 + 0.272 X1 - 0.789 X2 + 0.452 X3 + 0.558 X4 + 0.284 X5
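Applying the reported equation to a compound is a weighted sum of its five descriptor values. The sketch below copies the intercept and coefficients from the equation; the function name is illustrative, and it assumes the descriptor values are supplied on the same scale that was used when the model was fitted, which the text does not specify.

```python
# Intercept and coefficients copied from the reported MLR equation (X1..X5).
INTERCEPT = 5.134
COEFFS = [0.272, -0.789, 0.452, 0.558, 0.284]

def predict_pic50(x):
    """Predict pIC50 from the five descriptor values X1..X5."""
    if len(x) != len(COEFFS):
        raise ValueError("expected 5 descriptor values")
    return INTERCEPT + sum(c * v for c, v in zip(COEFFS, x))
```

With all descriptors at zero the prediction equals the intercept, and each coefficient gives the change in predicted pIC50 per unit change in its descriptor.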

Model performance:

Variable  Coefficient  Descriptor
X1         0.272       Relative number of aromatic bonds
X2        -0.789       Max atomic state energy for a H atom
X3         0.452       Relative number of C atoms
X4         0.558       Relative number of single bonds
X5         0.284       ESP-RNCG Relative negative charge (QMNEG/QTMINUS)

Table 3: Selected descriptors in the β-diketo acid QSAR model for 91 β-diketo acids (DE-BPSO-MLR)

Based on this QSAR model, the descriptors with the highest influence on the biological activity of aryl β-diketo acids are the relative number of single bonds and the relative number of C atoms.

Model Accuracy plot:

Figure 11: Results of using the DE-BPSO algorithm with multiple linear regression (DE-BPSO-MLR) to develop the QSAR model (91 aryl β-diketo acids)

Comparison of our work with previous research:

Since DE-BPSO is an entirely new hybrid algorithm that was introduced only recently, no significant work has previously been done with it. The only earlier research available for comparison was on 37 dimeric aryl β-diketo acids (training set = 23, validation set = 7, and test set = 7), using a linear model (MLR) with the DE-BPSO algorithm as the feature selection method (GeneKo, 2012). We therefore compare the result from that research with our own. The comparison between this work (the data set of 91 dimeric aryl β-diketo acids) and the previous one (the data set of 37 dimeric aryl β-diketo acids) shows an improvement in the value of R² for the training, validation, and test sets.

The final QSAR model and R² values obtained from the previous research are:

pIC50 = 5.134 - 0.360 X1 + 0.628 X2 + 0.619 X3 + 0.157 X4 + 0.242 X5

R² = 0.886 (n = 23), R²_v = 0.765 (n = 7), R²_test = 0.722 (n = 7)

Table 4: Descriptors used in QSAR Model (Data set of 37 β-Diketo Acids)

Based on this QSAR model, the hydrophobicity of the compounds (X2) and the partial positive charges on the hydrogen atoms on the molecular surface (X3) are the most significant for biological activity (GeneKo, 2012).

Although MLR is a common modelling method for QSAR development, comparing the results of the linear model with those of non-linear models such as Random Forest would lead to a more accurate analysis and comparison.
