

The last step of the model development process is the estimation of its applicability domain. The applicability domain (AD) (see Figure 2.10) is defined as “the response and chemical structure space in which the model makes predictions with a given reliability” [75]. The chemical space is a multidimensional space in which each dimension represents a structural, physical, chemical or biological property of a chemical compound. The applicability domain determines the boundary of the chemical sub-space in which a model is reliable, and it also supports the controlled extrapolation of the model into the wider chemical space. A QSAR model can therefore be used for chemicals that fall into its applicability domain, although membership of the domain does not by itself guarantee high predictivity. Applying a model to chemicals outside its applicability domain increases the likelihood of inaccurate predictions.

The process of AD estimation is model-dependent and based on the training set domain; moreover, there is a relation between AD estimation and variable selection techniques [128]. Thus, there is no universal method for AD estimation. As shown in [53], different approaches produce different applicability domains. The choice of a particular AD estimation method depends on its requirements for the data distribution in the training set and on the dimensionality of the model. Currently, there are four main techniques, and research into more efficient methods is still ongoing.

Figure 2.10: Example of applicability domain estimation for a model predicting log Kow using the acceptor delocalisability descriptor [75].

The most common technique for applicability domain estimation uses range-based methods [75]. Each chemical descriptor is described by the range of its values in the training set, and together these ranges generate a hyper-rectangle that defines the applicability domain. Unfortunately, this method does not detect the intersection of the hyper-planes and does not take into account the correlation between descriptors. Another common technique for assessing the model applicability domain is Principal Component Analysis (PCA) [57]. It is based on a rotation of the descriptor variables X that corrects for the correlation between them.
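
The difference between a plain range check and a PCA-rotated range check can be illustrated with a minimal sketch. It is restricted to two descriptors so that the principal axis can be found in closed form; the training values, the query compound and all function names are hypothetical and are not taken from [57] or [75].

```python
import math

def bounding_box(points):
    """Per-descriptor (min, max) ranges of a set of points."""
    dims = range(len(points[0]))
    return [(min(p[d] for p in points), max(p[d] for p in points)) for d in dims]

def in_box(box, point):
    """True if every coordinate lies inside its range."""
    return all(lo <= x <= hi for (lo, hi), x in zip(box, point))

def pca_rotation_2d(points):
    """Return (mean, angle of the first principal axis) for 2-D data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form orientation of the leading eigenvector of a 2x2 covariance.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), angle

def rotate(point, mean, angle):
    """Centre the point and express it in principal-axis coordinates."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    c, s = math.cos(angle), math.sin(angle)
    return (c * dx + s * dy, -s * dx + c * dy)

# Hypothetical, strongly correlated training descriptors.
train = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.1), (4.0, 4.0)]
# Hypothetical query: inside both raw ranges, but far from the diagonal trend.
query = (0.5, 3.5)

raw_box = bounding_box(train)
mean, angle = pca_rotation_2d(train)
pca_box = bounding_box([rotate(p, mean, angle) for p in train])

inside_raw = in_box(raw_box, query)                       # plain range check
inside_pca = in_box(pca_box, rotate(query, mean, angle))  # range check after rotation
print(inside_raw, inside_pca)
```

In this toy example the query passes the plain range check but fails the rotated one, because the rotated box follows the correlation between the two descriptors. For more than two descriptors the rotation would normally come from an eigendecomposition of the covariance matrix rather than from a single angle.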

The convex hull calculation is another example of applicability domain estimation. This method estimates the coverage of the n-dimensional set of variables. In two-dimensional space, the hull is a polygon whose interior defines the model applicability space. This approach is well known in computational geometry [17], and there are a few efficient algorithms for two- and three-dimensional problems. Unfortunately, the complexity of the convex hull calculation increases with the number of descriptors used to generate the model. Additionally, this method does not detect empty regions in the descriptor space. As shown in Figure 2.10, the data evenly cover the space for log Kow < 5 and acceptor < 0.110. The other regions of the convex hull contain few data points (triangle point) or are empty.
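
A small two-dimensional sketch may help to show why a convex hull can include empty regions. It uses Andrew's monotone chain algorithm, chosen here only for brevity and not necessarily the algorithm discussed in [17]; the L-shaped training set and the query points are hypothetical.

```python
import math

def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def in_hull(hull, q):
    """True if q is inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], q) >= 0 for i in range(n))

def nearest_distance(points, q):
    """Euclidean distance from q to the closest training point."""
    return min(math.hypot(p[0] - q[0], p[1] - q[1]) for p in points)

# Hypothetical L-shaped training set: data only along the two axes.
train = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (0, 1), (0, 2), (0, 3), (0, 4)]
hull = convex_hull(train)

empty_region_query = (1.5, 1.5)  # inside the hull, but no training data nearby
outside_query = (3.0, 3.0)       # outside the hull

print(hull)
print(in_hull(hull, empty_region_query), nearest_distance(train, empty_region_query))
print(in_hull(hull, outside_query))
```

The first query is accepted by the hull test although its nearest training compound is more than 1.5 units away, which is the kind of empty region the method cannot detect.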

Distance-based techniques are among the most efficient approaches to estimating the applicability domain. These methods calculate the distance from a query point to the training dataset. There are several variants: the distance to the mean of the dataset, the average of all distances between the query point and the dataset, or the maximum distance. The Euclidean distance is the most frequently used measure; however, the Mahalanobis and city-block distances are used as well [53,75]. Together, they are the most common methods for assessing the similarity of chemical compounds. For a given model, an applicability domain threshold is defined. For any chemical whose average distance to the training dataset is greater than this threshold, the model may give an inaccurate prediction.
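
The average-distance variant can be sketched as follows. The text does not say how the threshold is chosen, so the rule used here (the largest mean distance of any training compound to the remaining training compounds) is purely an assumption for illustration, as are the data and function names. The Mahalanobis distance is omitted because it needs the inverse covariance matrix.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """City-block (Manhattan) distance between two descriptor vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mean_distance(train, query, metric=euclidean):
    """Average distance from the query point to every training compound."""
    return sum(metric(p, query) for p in train) / len(train)

def leave_one_out_threshold(train, metric=euclidean):
    """ASSUMPTION: threshold = largest mean distance of any training
    compound to the remaining training compounds."""
    worst = 0.0
    for i, p in enumerate(train):
        others = train[:i] + train[i + 1:]
        worst = max(worst, mean_distance(others, p, metric))
    return worst

def in_domain(train, query, threshold, metric=euclidean):
    """True if the query is no farther from the data than the threshold."""
    return mean_distance(train, query, metric) <= threshold

# Hypothetical training descriptors on the corners of a unit square.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
threshold = leave_one_out_threshold(train)

near = in_domain(train, (0.5, 0.5), threshold)  # centre of the training data
far = in_domain(train, (3.0, 3.0), threshold)   # well away from the training data
print(threshold, near, far)
```

Passing `metric=city_block` to the same functions switches the distance measure without changing the rest of the procedure.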

The last method of estimating the applicability domain is based on density estimation. This method involves determining the highest density region. There are two approaches: parametric and non-parametric. Parametric methods assume that the data distribution is close to a standard distribution such as the normal distribution (e.g. Gaussian Process). Non-parametric methods (e.g. kernel density estimation) make no assumption about the data distribution. Calculating the highest density regions becomes more complex as the dimensionality of the chemical space grows; thus, providing a fast and efficient algorithm remains a challenge. Recent studies [9] show that the random forest classifier is comparable with the well-known Gaussian Process regression [91] for applicability domain estimation. The authors provided a generic machine schema for class probability estimators.
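
The non-parametric case can be illustrated with a minimal Gaussian kernel density estimate. The fixed bandwidth, the two-cluster training data and the cut-off rule (the lowest density observed at any training compound) are assumptions made for this sketch only; they are not the procedures of [9] or [91].

```python
import math

def gaussian_kde(train, query, bandwidth):
    """Kernel density estimate at `query` using an isotropic Gaussian kernel."""
    d = len(query)
    norm = (bandwidth * math.sqrt(2.0 * math.pi)) ** d
    total = 0.0
    for p in train:
        sq_dist = sum((x - y) ** 2 for x, y in zip(p, query))
        total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
    return total / (len(train) * norm)

def density_cutoff(train, bandwidth):
    """ASSUMPTION: the domain is the region whose density is at least the
    lowest density found at any training compound."""
    return min(gaussian_kde(train, p, bandwidth) for p in train)

# Hypothetical training set made of two well-separated clusters.
train = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2),
         (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
bandwidth = 0.5  # hypothetical smoothing parameter
cutoff = density_cutoff(train, bandwidth)

def in_domain(query):
    """True if the estimated density at `query` reaches the cut-off."""
    return gaussian_kde(train, query, bandwidth) >= cutoff

in_cluster = in_domain((0.1, 0.1))   # inside the first cluster
between = in_domain((2.5, 2.5))      # empty gap between the clusters
print(cutoff, in_cluster, between)
```

The point between the two clusters lies inside both the descriptor ranges and the convex hull of the training data, yet it is rejected because the estimated density there is close to zero. The cost of the estimate grows with the number of training compounds and descriptors, which matches the complexity concern raised above.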

The applicability domain can be of two types: global and local. The global applicability domain defines a broad chemical space using all pre-calculated chemical compound descriptors, whereas the local applicability domain is defined by the descriptors selected in the model generation process. The breadth of the applicability domain influences the model predictivity: the narrower the applicability domain, the higher the predictivity of the model. The applicability domain is often used to validate a predictive model. Elements within the boundary of the applicability domain, as well as elements outside it, are used for quantitative assessment of the model robustness and its predictive power [114]. The next section discusses the external validation methods.