4. RESULTS AND DISCUSSION
4.4 Qualitative evaluation of HBB maintenance
The Random Forest (RF) algorithm is another popular method that has been used extensively in many areas of research and predictive analytics. It is a form of ensemble learning: it builds a series of weak predictive models and combines their results to form an accurate model. Examples of its use to identify complex genetic relationships are given in the literature review at the end of this chapter.
The RF algorithm grew out of the use of Decision Trees in classification tasks. These methods look for dichotomous splits in the features based on information criteria such as the Gini Index and the Gain Ratio. A full description of these criteria is not given here, but can be found in Rokach and Maimon (2005). The resulting algorithm takes the form: “If feature x is greater than a set threshold, move to branch one; if not, move to branch two”, and so on. This is repeated across the features, each time choosing the feature to split on according to the information criterion. Once the tree has been built, a new data point is classified by traversing the branches, following the features and thresholds, until it reaches the end of a branch, which returns the classification.
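The split search described above can be sketched in a few lines. This is a minimal illustration, assuming the Gini Index as the criterion and midpoints between consecutive feature values as candidate thresholds; the function names and the toy data are hypothetical and are not taken from the experiments in this thesis.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Search every feature and candidate threshold for the split that
    minimises the weighted Gini impurity of the two child nodes."""
    n = len(rows)
    best = None  # (weighted impurity, feature index, threshold)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

# Toy data (illustrative only): feature 0 separates the classes, feature 1 does not.
rows = [[2.0, 7.0], [3.0, 1.0], [6.0, 8.0], [7.0, 2.0]]
labels = [0, 0, 1, 1]
score, feature, threshold = best_split(rows, labels)
print(score, feature, threshold)  # 0.0 0 4.5
```

A full tree applies the same search recursively to each child node until the nodes are pure or a stopping rule is met.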
While decision trees can be effective models, they face two main criticisms. First, they often cannot deal with complex feature interactions, as they can only consider a threshold break point in one feature at a time. Second, and more seriously, they tend to overfit to the data used to train them and generalise poorly to new data. These issues have been addressed by applying ensemble methods to the decision trees, hence the terminology that the trees are turned into forests.
The first forests were built in the 1990s, based on the idea of random subspace sampling (Ho, 1995, 1998). The idea is to build separate decision trees on different samples of the features, so that no tree has access to all of the input features when building its branches. Breiman extended this idea of sub-sampling in 2001, when he expanded the forest to incorporate his earlier bagging algorithm, short for “bootstrap aggregating” (Breiman, 1996). This method takes not only sub-samples of the features for the different trees but also bootstrapped collections of the samples: for each tree, a collection of samples is drawn with replacement from the training data and used to build that tree. This effectively results in an ensemble of weak predictors, the individual trees. However, when the results from the different trees are aggregated, the ensemble has been shown in many situations to achieve a high level of predictive performance while avoiding overfitting. Also, although the models cannot take explicit interactions between the features into account, because the binary decisions are always made on individual features, building a series of predictors that use different combinations of the features makes these methods capable of solving complex categorisation tasks.
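The combination of bootstrapped samples and random feature subsets can be sketched as follows. This is a simplified illustration, not the implementation used in this thesis: each "tree" is a single-split stump chosen by training error rather than a full tree grown with an information criterion, and the function names, the toy data and the parameter values are all assumptions made for the example.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, features):
    """One-split tree restricted to `features`; returns a predict function."""
    best = None  # (errors, feature, threshold, left label, right label)
    for j in features:
        # Skip the largest value so that both sides of the split are non-empty.
        for t in sorted(set(r[j] for r in rows))[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    if best is None:  # no usable split: fall back to a constant prediction
        constant = majority(labels)
        return lambda x: constant
    _, j, t, l_lab, r_lab = best
    return lambda x: l_lab if x[j] <= t else r_lab

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: rows drawn with replacement
        feats = rng.sample(range(d), n_features)     # random subspace: a subset of the features
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict(forest, x):
    """Aggregate the individual trees by majority vote."""
    return majority([tree(x) for tree in forest])

# Toy data (illustrative only): two well-separated clusters.
rows = [[1, 1], [2, 2], [1, 2], [2, 1], [8, 8], [9, 9], [8, 9], [9, 8]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict(forest, [0, 0]), predict(forest, [10, 10]))
```

No single stump sees all of the rows or all of the features, yet the vote across the ensemble recovers the class structure of the data.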
In addition to reports of good predictive performance, the RF methods have some other very useful properties. When the threshold-based decision splits are made for all of the branches of all of the trees in the ensemble, the information criteria used to make them, as well as the features chosen for the splits, are stored. This means that when the final ensemble model is built, additional information is available on the feature importance of each input. While the SVM can provide this information when a linear decision boundary is used, this ability is lost with complex non-linear kernels. In genetic studies that look for important SNPs while taking interactions into account, Variable Importance Measures (VIMs) have been assigned to the variants by a range of techniques, including Random Forests (Nicodemus et al., 2007; Nicodemus and Malley, 2009; Nicodemus et al., 2010).
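One common way of turning the stored split information into a feature importance is to sum, for each feature, the impurity decrease of every split made on it, weighted by the fraction of samples reaching that node. The sketch below does this for a single greedily grown tree, assuming the Gini Index as the criterion; a forest would average these values over its trees. This is an illustration of the general idea only and is not necessarily the VIM used in the studies cited above; the function names and toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted child impurity, feature, threshold), or None if no split exists."""
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def grow(rows, labels, importance, n_total):
    """Grow a tree recursively, crediting each split's impurity decrease to its feature."""
    parent = gini(labels)
    split = best_split(rows, labels) if parent > 0 else None
    if split is None:
        return  # pure node, or no threshold left to try
    score, j, t = split
    importance[j] += len(rows) / n_total * (parent - score)
    for keep_left in (True, False):
        subset = [(r, y) for r, y in zip(rows, labels) if (r[j] <= t) == keep_left]
        grow([r for r, _ in subset], [y for _, y in subset], importance, n_total)

# Toy data (illustrative only): class 1 requires feature 0 > 5 and feature 1 > 4.
rows = [[1, 6], [2, 7], [3, 8], [4, 9], [6, 6], [7, 1], [8, 7], [9, 2]]
labels = [0, 0, 0, 0, 1, 0, 1, 0]
importance = [0.0, 0.0]
grow(rows, labels, importance, len(rows))
total = sum(importance)
normalised = [v / total for v in importance]
print(normalised)
```

In this toy example the root split is made on feature 0 and the second split on feature 1, so both features receive a non-zero importance.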
Another advantage of using an RF is that it is not sensitive to the scaling of the features. This means that if the raw data contains information from different sources with vastly different ranges in their distributions of values, these inputs can be used in the models directly; no additional scaling is required in the pre-processing steps.
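The reason for this insensitivity is that a threshold split depends only on the ordering of the values within a feature, and that ordering is unchanged by any positive rescaling. The short sketch below illustrates this with a hypothetical feature expressed in two different units; the function name and the values are assumptions made for the example.

```python
def candidate_partitions(values):
    """All left-hand index sets that can be produced by thresholding one feature."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    parts = set()
    for k in range(1, len(order)):
        # A threshold can only be placed between two distinct values.
        if values[order[k - 1]] < values[order[k]]:
            parts.add(frozenset(order[:k]))
    return parts

# Hypothetical feature measured in metres, then rescaled to millimetres.
feature_m = [1.62, 1.80, 1.75, 1.58]
feature_mm = [v * 1000 for v in feature_m]

same = candidate_partitions(feature_m) == candidate_partitions(feature_mm)
print(same)  # True
```

Because the available partitions are identical, and the information criterion depends only on the labels on each side of the split, the same tree structure is chosen regardless of the units.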
Why the Support Vector Machine was chosen
At the start of the project, serious consideration was given to using the RF algorithm, due to its proven effectiveness in many different fields and its two advantages of reporting the importance of the features and being resilient to their different scalings. Unfortunately, all of the pilot trials of these algorithms on the genetic datasets yielded models that were no better than random chance; the AUC scores were always in the region of 0.5. This was the case for all of the trials carried out in the experimental chapters. It is not fully known why this occurred, but a few suggestions are presented here.
It is possible that the ensemble approach to finding interactions between the genetic features was not sufficient to identify the possibly very weak interaction effects that could be occurring. The kernel-based SVMs have an advantage here, as they take these interactions into account explicitly when building their decision boundaries, albeit with a loss in the interpretability of the models.
However, the RFs still did not perform as well as the linear SVMs, and this cannot be explained by the different methods of analysis of the interactions. The only way to get an understanding of this is to examine the different types of boundary which are drawn when using the RF algorithm on similar examples to those seen already.
In figure 2.20, the boundary lines for an ensemble of ten trees in a forest can be seen. The plot has been designed so that all of the different boundaries are laid on top of each other and the overall pattern can be seen. It is immediately noticeable that all of the boundary lines are parallel to one of the axes; this is because each binary split is based on a single feature. The overall pattern tends to follow a similar direction to the optimum hyperplane in the SVM.
Figure 2.20.: An example of the decision boundary made by a Random Forest algorithm on data that can be reasonably separated with a soft margin linear classifier, using ten different bootstrapped estimators. Each different boundary made has been shaded in turn to create the final aggregation, which would be used for the classification of the data points.
A possible reason why the RFs did not perform as well is the highly ambiguous nature of the genetic datasets used in the experiments. For all of the trials, the prediction metrics were not particularly high, so there will have been a lot of overlap between the points of the two classes. Looking at the image in figure 2.20, it is reasonable to think that there could be some confusion in areas with a great deal of overlap; the cut-off point is not as clear as it is with the single line in the SVM.
The decision boundary for the XOR example can be seen in figure 2.21. It is clear from this image that the RF classifies the points successfully.
Figure 2.21.: An example of the decision boundary made by a Random Forest algorithm on the XOR example, using ten different bootstrapped estimators. Each different boundary made by these has been shaded in turn to create the final aggregation, which would be used in the final classification of the data points.
It is possible that the ambiguous nature of the classification tasks in the genetic datasets used in this thesis proved to be unsuitable for the RF algorithm, as many of the points will be crossed over in the manner seen near the decision boundary in figure 2.20. In situations like those seen in figure 2.21, it is clear that RFs are capable of dealing with interactions in the datasets, despite looking at the feature inputs individually.
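The point about interactions can be checked on the four corner points of the XOR problem. The sketch below is an illustration under stated assumptions: the points, the threshold of 0.5 and the hand-written depth-two tree are hypothetical and are not the fitted models shown in figure 2.21. It compares a single threshold split with two nested splits.

```python
XOR_POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR_LABELS = [0, 1, 1, 0]

def stump_accuracy(points, labels, feature, threshold):
    """Accuracy of a single threshold split, using the best label on each side."""
    correct = 0
    for side in (True, False):
        side_labels = [y for p, y in zip(points, labels)
                       if (p[feature] <= threshold) == side]
        if side_labels:
            # The best constant prediction for a side is its most common label.
            correct += max(side_labels.count(c) for c in set(side_labels))
    return correct / len(points)

def two_level_tree(point):
    """A depth-two tree: split on feature 0, then on feature 1 within each branch."""
    if point[0] <= 0.5:
        return 0 if point[1] <= 0.5 else 1
    return 1 if point[1] <= 0.5 else 0

best_stump = max(stump_accuracy(XOR_POINTS, XOR_LABELS, j, 0.5) for j in (0, 1))
tree_accuracy = sum(two_level_tree(p) == y
                    for p, y in zip(XOR_POINTS, XOR_LABELS)) / len(XOR_POINTS)
print(best_stump, tree_accuracy)  # 0.5 1.0
```

A single split on either feature is no better than chance, whereas two nested single-feature splits classify every point correctly. Each decision still looks at one feature at a time, but the second threshold is applied conditionally on the first, which is how the interaction is captured.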
2.2.8. Comparing the performance of the Support Vector Machine and