Boosting, Bagging and Ensembles in the Real World: An Overview, some
3.4 Bagging
3.4.1 What Bagging is in a Nutshell
As with boosting, bagging is usually based on ‘trees’, that is, binary recursive partitioning. It is also a technique that summarizes many ‘trees’, which classifies it per se as an ensemble (Breiman 2001b). As a scheme, however, bagging could, just like boosting, be applied with many algorithms. Bagging differs from boosting in two respects: i) it subsets rows as well as columns (similar to bootstrapping), and ii) it has a specific procedure to average out the trees for ‘the best’ result. Both of these steps are sophisticated, and they are reasons why bagging is such a success.
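The row subsetting named under i) can be illustrated in a few lines of Python. This is only a minimal sketch using the standard library; the function name `bootstrap_indices`, the seed and the sample size of 1000 rows are assumptions made for illustration and are not taken from any of the packages discussed in this chapter. It draws one bootstrap sample (rows drawn with replacement) and reports how many distinct rows end up in it.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(42)  # fixed seed so that the sketch is repeatable
n_rows = 1000            # hypothetical number of rows in the data set

in_bag = bootstrap_indices(n_rows, rng)
unique_in_bag = set(in_bag)

# Rows that were never drawn are 'out-of-bag' for this particular sample.
out_of_bag = [i for i in range(n_rows) if i not in unique_in_bag]

frac_unique = len(unique_in_bag) / n_rows
print(f"distinct rows in the bootstrap sample: {frac_unique:.3f}")
print(f"rows left out (out-of-bag):            {len(out_of_bag) / n_rows:.3f}")
```

In expectation, roughly 63% of the rows are drawn at least once, so each tree sees a somewhat different version of the data, and the rows left out can serve as a built-in test set for that tree.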
Once more, the performance depends on how those details are actually implemented (Table 3.1). Breiman (1996, 2001b) presented a version of bagging called random forests, which seems to be among the top classifiers worldwide (Fernández-Delgado et al. 2014). A base version was released online (https://www.stat.berkeley.edu/~breiman/RandomForests/) and subsequently in R (by A. Liaw; https://cran.r-project.org/web/packages/randomForest/randomForest.pdf), and a commercial version was implemented by Salford Systems Ltd. Details are provided in the next section.
3.4.2 Short History of Bagging
Bagging in ‘trees’ is credited to the work of Leo Breiman (1996, 2001a, b) and his former PhD student Adele Cutler (https://www.stat.berkeley.edu/~breiman/RandomForests/; see also Cutler et al. 2007). It builds, however, on the earlier work on CART by Breiman et al. (1984). Many of the publicly available random forests algorithms use the code presented by Breiman and Cutler (Breiman 2001a, b). The publicly available packages in R and Python essentially wrap or re-implement that code, but they leave the relevant questions of fine-tuning and ultimate inference to the coder and user (Table 3.1). This is a big flaw in those public implementations of random forests: most users lack the testing and understanding, and thus much of the award-winning performance that is found in random forests is left out. Consequently, several random forests implementations and applications in the literature are substandard; see Table 3.2 for a discussion.
3.4.3 Why Bagging is so Powerful
Bagging involves drawing random samples of rows with replacement (bootstrapping), which as such is a relatively simple and old procedure (Efron and Tibshirani 1993). A tree gets built from each of those subsamples, and the trees are then summarized. The truly innovative part, however, is the random draw of columns (=predictors). Users usually did not do that, because they pre-selected and then ‘hugged’ their predictors at all costs and wanted to keep all of them for the ultimate inference. But when ‘bootstrapping predictors’ one obtains more robust information from them! That way, random forests ‘rarely overfits’ a model (that means, it always uses fewer predictors at a time than the data would allow). Only a subset of the predictors is ever used at once, and the best overall result is eventually inferred from the many trees built that way.
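The two random draws described above (rows and predictors) can be sketched as follows. This is a simplified illustration under stated assumptions, not the Breiman and Cutler code: each ‘tree’ is reduced to a single split (a decision stump), the data are made up, the trees are summarized by a plain majority vote, and all function names are hypothetical. In Breiman's random forests the predictor subset is drawn anew at every split; for a stump with only one split this coincides with one draw per tree.

```python
import random
from collections import Counter


def fit_stump(rows, labels, columns):
    """Find the single split (column, threshold, left_label) with the fewest errors."""
    best = None
    for col in columns:
        for threshold in sorted({r[col] for r in rows}):
            for left_label in (0, 1):
                errors = sum(
                    (left_label if r[col] <= threshold else 1 - left_label) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, col, threshold, left_label)
    return best[1:]


def predict_stump(stump, row):
    col, threshold, left_label = stump
    return left_label if row[col] <= threshold else 1 - left_label


def fit_bagged_stumps(rows, labels, n_trees, n_cols_per_tree, rng):
    n_rows, n_cols = len(rows), len(rows[0])
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n_rows) for _ in range(n_rows)]  # bootstrap the rows
        cols = rng.sample(range(n_cols), n_cols_per_tree)     # random draw of predictors
        ensemble.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], cols)
        )
    return ensemble


def predict_bagged(ensemble, row):
    """Summarize the individual stumps by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in ensemble)
    return votes.most_common(1)[0][0]


def make_toy_data(n_rows, rng):
    """Made-up data: four predictors in [0, 1), only the first two carry signal."""
    rows = [[rng.random() for _ in range(4)] for _ in range(n_rows)]
    labels = [1 if r[0] + r[1] > 1.0 else 0 for r in rows]
    return rows, labels


rng = random.Random(0)
train_rows, train_labels = make_toy_data(150, rng)
test_rows, test_labels = make_toy_data(300, rng)

# An odd number of trees avoids tied votes for a two-class problem.
ensemble = fit_bagged_stumps(train_rows, train_labels,
                             n_trees=25, n_cols_per_tree=2, rng=rng)
predictions = [predict_bagged(ensemble, r) for r in test_rows]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(f"held-out accuracy of the bagged stumps: {accuracy:.2f}")
```

No single stump ever sees all four predictors, yet the vote across stumps still recovers most of the signal; real implementations grow full trees and expose the number of predictors per split as a setting to be tuned.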
All of it is fine-tuned for optimization. There is another ‘trick’ in bagging: it gets optimized for the best prediction. Linear regressions, as an example, do not work that way. Their optimization is essentially a) relatively primitive (based, at best, on ‘least squares’, an approach that is over a century old), and b) based on minimizing the unexplained variance (expressed as r2; Zar 2010). In bagging, by contrast, an overall optimization on the predictions (=metrics such as ROC/AUC; Fielding and Bell 1997) allows for a higher level of generalization and thus for robust inference (Breiman 2001a). The emphasis is thus put on prediction for inference, and that is where the power sits; r2 values have less meaning and relevance in that discussion (the author knows of models that have a relatively low r2 but a high ROC/AUC and thus perform rather well when tested with alternative evidence). The often-demanded practice of computing a pseudo-r2 in Machine Learning should be dismissed, because r2 is derived from a linear concept (Zar 2010) and is virtually impossible to mimic in Machine Learning algorithms, e.g. in Neural Networks. It should thus be left alone and used only for linear regressions (which tend to perform poorly anyway and are thus not a real option) (Textbox 3.1).
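The remark about models with a low r2 but a high ROC/AUC can be made concrete with a small, made-up example. The sketch below is an illustration only: the labels and scores are invented, the function names are hypothetical, AUC is computed as the share of presence/absence pairs that are ranked correctly, and r2 is computed with one common definition (1 minus residual over total sum of squares), which is an assumption about how a pseudo-r2 would be requested.

```python
def auc(labels, scores):
    """Area under the ROC curve: share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0  # ties count as half
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up presence (1) / absence (0) data with cautious, mid-range model scores.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.40, 0.42, 0.45, 0.48, 0.52, 0.55, 0.58, 0.60]

auc_value = auc(labels, scores)
r2_value = r_squared(labels, scores)
print(f"AUC = {auc_value:.2f}, r2 = {r2_value:.2f}")
```

Every presence is scored above every absence here, so the ranking, and hence the AUC, is perfect, while the r2 stays low (about 0.23) because the scores sit far from 0 and 1. The model discriminates well even though it ‘explains little variance’ in the linear sense.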
Table 3.2 A selection of random forests algorithm implementations and their details (note that many more, using C++, Fortran etc., are described in Fernández-Delgado et al. 2014)

| Name of the package | Software | Source citation | Performance assessment | Specific feature |
| --- | --- | --- | --- | --- |
| Random forests | Fortran | Breiman (2001b) | The original raw code with relevant settings | Raw code |
| SPM8 Random forests | C++ | https://www.salford-systems.com/products/randomforest | An optimized commercial version of the initial Breiman (2001b) algorithm | GUI, weighting, optimized settings |
| randomForest | R | Liaw and Wiener (2002) https://cran.r-project.org/web/packages/randomForest/randomForest.pdf | The raw code in R (R port), leaving relevant performance enhancements to the user | Regression (added), get tree, combine |
| Biomod2 | R | Thuiller et al. https://cran.r-project.org/web/packages/biomod2/index.html | Use of the raw code (from Breiman and Cutler via Liaw and Wiener in R). Not making use of the relevant fine-tuning options to show the true performance of random forests | Adds other algorithms in parallel as an overall ensemble model |
| Scikit learn | Python | http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html | Similar to the above, with an emphasis on ensembles | All relevant settings |
Textbox 3.1 On the suggested ‘best’ use of competing landcover, altitude