Boosting, Bagging and Ensembles in the Real World: An Overview, some
Textbox 3.1 On the suggested ‘best’ use of competing landcover, altitude and climate layers for better inference with Machine Learning: Being
3.8 Synthesis and Outlook
Machine Learning is a wide field offering hundreds, if not thousands, of algorithms and implementations with which to tackle a conservation problem holistically. Tree-based methods such as boosting, bagging and ensembles are a big part of this approach; they are known to be extremely powerful and certainly convenient.
What will the future hold for those methods and for the Earth? Arguably, we are all moving with Machine Learning and Data Mining into Artificial Intelligence (AI) and Deep Learning; that is specifically true for robots and associated drone applications and inferences. In conservation and resource management, institutions and people are meant to be at the core of the decision-making process. This is achieved when people and institutions decide on resource management and conservation planning by democratic principles, with the assistance of computing and up-to-date decision-making tools, including considerate ethics, instead of relying on expert opinions that are not based on the latest tools.
People are thus assisted by computing and aided through decision-making tools, all while we increasingly see traditional experts fail (Perera et al. 2010). It is here that Machine Learning can help in a good way (e.g. Huettmann 2007). While not perfect, boosting and bagging have reached a level of maturity; accuracies of over 80%, and sometimes well over 95%, have been observed for global models. How much accuracy do we really need? It is unlikely that major changes will occur in boosting and bagging any time soon. The methods are there, mature and stable! However, a few things are increasingly developing: better embedding and workflows, a rise in computing power, more applications overall, and an emphasis on a decision-making process that has Machine Learning, Deep Learning and Artificial Intelligence at its core.
Our current conservation problem is no longer defined so much by which (single) algorithm and model to choose, but by being on the Machine Learning platform overall: online, with all data freely available and with free, open-access, robust software, and then implementing the obtained predictions proactively before more damage occurs to the Earth and its atmosphere. While Machine Learning methods are perfected further, the real culprit by now is policy, governance, and the human aspect and its role in the sustainable management of the Earth and universe. Apart from major ethical questions, that is where the biggest effort must now be placed for conservation management of natural resources worldwide. Boosting and bagging are here to stay and thus are to be embraced for the best-possible global sustainable outcomes.
Acknowledgement I thank Profs R. O’Connor and A.W. (Tony) Diamond for an early workshop on statistics with ACWERN at UNB, Canada, which introduced me in the late 1990s to tree-based techniques (CART) and multivariate analysis. I thank Dan Steinberg and Salford Systems Ltd. for a workshop with U.S. IALE at Snowbird, Utah, as well as with The Wildlife Society, Alaska Chapter, for a wider debate and introduction of tree-based methods, boosting and bagging. I am indebted to U.S. IALE, the Global Primate Network in Kathmandu, Nepal, Medical University Taipeh, Taiwan, and the Wildlife Institute of India in Dehradun for their workshop promotion and support. Thanks to S. Linke, I. Presse, B. Walter, G. Regmi, M. Suwal, R. Lama, C. Cambu, H. Hera, S. Sparks, Y. Subaru, H. Berrios and the many members of the -EWHALE lab- at UAF for their discussions and, in part, support. This is EWHALE lab publication #187.
References
Aggarwal C (2015) Data mining: the textbook. Springer
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Automat Contr 19:716–723
Alexander JC (2013) The dark side of modernity. Polity Press, Cambridge
Anderson DR, Burnham KP, Thompson WL (2000) Null hypothesis testing: problems, prevalence, and an alternative. J Wildl Manag 64:912–923
Araujo MB, New M (2007) Ensemble forecasting of species distributions. Trends Ecol Evol 22:42–47
Arnold TW (2010) Uninformative parameters and model selection using Akaike’s information criterion. J Wildl Manag 74:1175–1178
Baltensperger AP, Huettmann F (2015) Predicted shifts in small mammal distributions and biodiversity in the altered future environment of Alaska: an open access data and Machine Learning. PLoS One. https://doi.org/10.1371/journal.pone.0132054
Berthold P (2016) Mein Leben fuer die Voegel. Kosmos Publisher, Berlin
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Breiman L (1998) Arcing classifier (with discussion and a rejoinder by the author). Ann Stat 26(3):801–849. https://doi.org/10.1214/aos/1024691079
Breiman L (2001a) Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci 16:199–231
Breiman L (2001b) Random forests. Mach Learn 45:5–32
Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press, Boca Raton
Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical information-theoretic approach. Springer, New York
Cai T, Huettmann F, Guo Y (2014) Using stochastic gradient boosting to infer stopover habitat selection and distribution of hooded cranes Grus monacha during spring migration in Lindian, Northeast China. PLoS One 9. https://doi.org/10.1371/journal.pone.0097372
Chunrong M, Huettmann F, Guo Y (2016) Climate envelope predictions indicate an enlarged suitable wintering distribution for great bustards (Otis tarda dybowski) in China for the 21st century. PeerJ 4:e1630. https://doi.org/10.7717/peerj.1630
Chunrong M, Huettmann F, Guo Y, Han X, Wen L (2017) Why choose random forest to predict rare species distribution with few samples in large undersampled areas? Three Asian crane species models provide supporting evidence. PeerJ 5:e2849. https://doi.org/10.7717/peerj.2849
Cockburn A (2013) A colossal wreck: a road trip through political scandal, corruption and American culture. Verso Publishers, New York
Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, Gibson J, Lawler JJ (2007) Random forests for classification in ecology. Ecology 88:2783–2792. https://doi.org/10.1890/07-0539.1
Czech B, Krausman PR, Devers PK (2000) Economic associations among causes of species endangerment in the United States. Bioscience 50:593–601
De’ath G (2007) Boosted trees for ecological modeling and prediction. Ecology 88:243–251
De’ath G, Fabricius K (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81:3178–3192. https://doi.org/10.1890/0012-9658(2000)081[3178:CARTAP]2.0.CO;2
Dhar V (1998) Data mining in finance: using counterfactuals to generate knowledge from organizational information systems. Inf Syst 23:423–437
Drew CA, Wiersma Y, Huettmann F (eds) (2011) Predictive species and habitat modeling in landscape ecology. Springer, New York
Drucker H, Schapire R, Simard P (1993) Boosting performance in neural networks. Int J Pattern Recognit Artif Intell 7:705–771
Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman & Hall/CRC Monographs, New York
Elder JF (2003) The generalization paradox of ensembles. J Comput Graph Stat 12:853–864
Elith J, Graham CH, Anderson RP, Dudík M, Ferrier S, Guisan A, Hijmans RJ, Huettmann F, Leathwick JR, Lehmann A, Li J, Lohmann LG, Loiselle BA, Manion G, Moritz C, Nakamura M, Nakazawa Y, Overton J, Peterson AT, Phillips SJ, Richardson K, Scachetti-Pereira R, Schapire RE, Soberón J, Williams S, Wisz MS, Zimmermann NE (2006) Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29:129–151
Evans JS, Cushman S (2009) Gradient modeling of conifer species using random forests. Landsc Ecol 24:673. https://doi.org/10.1007/s10980-009-9341-0
Evans JS, Murphy MA, Holden ZA, Cushman SA (2010) Modeling species distribution and change using random forest. Predictive species and habitat modeling in landscape ecology, pp 139–159
Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need hundreds of classifiers to solve real world classification problems? J Mach Learn Res 15:3133–3181
Fielding A (1999) Machine learning methods for ecological applications. Springer, Boston
Fielding A, Bell Y (1997) A review of methods for the assessment of prediction errors in conservation presence/absence models. Environ Conserv 24:38–49
Forman RTT (1995) Land mosaics: the ecology of landscapes and regions. Cambridge University Press, Cambridge
Fox CH, Huettmann F, Harvey GKA, Morgan KH, Robinson J, Williams R, Paquet PC (2017) Predictions from Machine Learning ensembles: marine bird distribution and density on Canada’s Pacific coast. Mar Ecol Prog Ser 566:199–216
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55:119–139
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38:367–378
Guthery FS, Brennan LA, Peterson MJ, Lusk LL (2005) Information theory in wildlife science: critique and viewpoint. J Wildl Manag 69:457–465
Hardy SM, Lindgren M, Konakanchi H, Huettmann F (2011) Predicting the distribution and ecological niche of unexploited snow crab (Chionoecetes opilio) populations in Alaskan waters: a first open-access ensemble model. Integr Comp Biol 51(4):608–622. https://doi.org/10.1093/icb/icr102
Harrell FE Jr (2001) Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer, New York
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer Series in Statistics
Hegel TSA, Cushman JE, Huettmann F (2010) Current state of the art for statistical modelling of species distributions. Chapter 16. In: Cushman S, Huettmann F (eds) Spatial complexity, informatics and wildlife conservation. Springer, Tokyo, pp 273–312
Herrick KA, Huettmann F, Lindgren MA (2013) A global model of avian influenza prediction in wild birds: the importance of northern regions. Vet Res. https://doi.org/10.1186/1297-9716-44-42
Hilborn R, Mangel M (1997) The ecological detective: confronting models with data. Princeton University Press, Princeton
Hobbs NT, Hooten M (2015) Bayesian models: a statistical primer for ecologists. Princeton University Press, Princeton
Hochachka W, Caruana R, Fink D, Munson A, Riedewald M, Sorokina D, Kelling S (2007) Data mining for discovery of pattern and process in ecological systems. J Wildl Manag 71:2427–2437
Huettmann F (2007) Modern adaptive management: adding digital opportunities towards a sustainable world with new values. Forum on Public Policy: Clim Chang Sustain Dev 3:337–342
Jiao S, Guo Y, Huettmann F, Lei G (2014) Nest-site selection analysis of hooded crane (Grus monacha) in northeastern China based on a multivariate ensemble model. Zool Sci 31:430–437
Johnson DS, Thomas DL, Ver Hoef JM, Christ AD (2008) A general framework for the analysis of animal resource selection from telemetry data. Biometrics 64:968–976
Kampichler C, Wieland R, Calmé S, Weissenberger H, Arriaga-Weiss S (2010) Classification in conservation biology: a comparison of five machine-learning methods. Ecol Inform 5:441–450
Kandel K, Huettmann F, Suwal MK, Regmi GR, Nijman V, Nekaris KAI, Lama ST, Thapa A, Sharma HP, Subedi TR (2015) Rapid multi-nation distribution assessment of a charismatic conservation species using open access ensemble model GIS predictions: red panda (Ailurus fulgens) in the Hindu-Kush Himalaya region. Biol Conserv 181:150–161
Keating KA, Cherry S (2004) Use and interpretation of logistic regression in habitat-selection studies. J Wildl Manag 68:774–789
Kononenko I (2001) Machine learning for medical diagnosis: history, state of the art and perspective. Artif Intell Med 23:89–109
Kurt F (1982) Naturschutz-illusion. Paul Parey Publisher, Berlin Germany
Lawler JJ, White D, Neilson RP, Blaustein AR (2006) Predicting climate-induced range-shifts: model differences and model reliability. Glob Chang Biol 12:1568–1584
Lawler JJ, Yo W, Huettmann F (2011) Designing predictive models for increased utility: using species distribution models for conservation planning, forecasting, and risk assessment. In: Drew CA, Wiersma Y, Huettmann F (eds) Predictive modeling in landscape ecology. Chapter 5. Springer, New York, pp 271–290
Leopold A, Meine C (2013) A sand county almanac & other writings on conservation and ecology. Library of America, New York
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18
Liu J, Dou Y, Batistella M, Challies E, Conno T, Friis C, DA MJ, Parish E, CL R, Bl BS, Triezenber H, Yang H, Zhao Z, Zimmerer KS, Huettmann F, Treglia M, Basher Z, Chung MG, Herzberger A, Lenschow A, Mechiche-Alami A, Newig A, Roch J, Sun J (2018) Spillover systems in a telecoupled Anthropocene: typology, methods, and governance for global sustainability. Curr Opin Environ Sustain 33:58–69. https://doi.org/10.1016/j.cosust.2018.04.009
Loftus GR (1996) Psychology will be a much better science when we change the way we analyze data. Curr Dir Psychol 5:161–171
Mace G, Cramer W, Diaz S, Faith DP, Larigauderie A, Le Prestre P, Palmer M, Perrings C, Scholes RJ, Walpole M, Walter BA, Watson JEM, Mooney HA (2010) Biodiversity targets after 2010. Curr Opin Environ Sustain 2:3–8
MacNally R (2000) Regression and model-building in conservation biology, biogeography and ecology: the distinction between – and reconciliation of – ‘predictive’ and ‘explanatory’ models. Biodivers Conserv 6:655–671
Manly FJ, McDonald LL, Thomas DL, McDonald TL, Erickson WP (2002) Resource selection by animals: statistical design and analysis for field studies, 2nd edn. Kluwer Academic Publishers, Dordrecht
McArdle BH (1988) The structural relationship: regression in biology. Can J Zool 66:2329–2339
Merow C, Silander JA (2014) A comparison of Maxlike and Maxent for modelling species distributions. Methods Ecol Evol 5:215–225
Mueller JP, Massaron L (2016) Machine Learning for dummies. For Dummies Publisher, 435 p
Næss A (1989) Ecology, community and lifestyle: outline of an Ecosophy (trans: Rothenberg D). Cambridge University Press, Cambridge
Nielsen SE, Stenhouse GB, Beyer HL, Huettmann F, Boyce MS (2008) Can natural disturbance-based forestry rescue a declining population of grizzly bears? Biol Conserv 141:2193–2207
O’Connor R, Jones MT, White D, Hunsacker C, Loveland T, Jones B, Preston E (1996) Spatial partitioning of environmental correlates of avian biodiversity in the conterminous United States. Biodivers Lett 3:97–110
Oppel S, Meirinho A, Ramírez I, Gardner B, O’Connell AF, Miller PI, Louzao M (2012) Comparison of five modelling techniques to predict the spatial distribution and abundance of seabirds. Biol Conserv 156:94–104
Perera AH, Drew A, Johnson CJ (2010) Expert knowledge and its application in landscape ecology. Springer, New York
Phillips SJ, Dudik M (2008) Modelling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography 31:161–175
Regmi GR, Huettmann F, Suwal MK, Nijman V, Nekaris KAI, Kandel K, Sharma N, Coudrat C (2018) First open access ensemble climate envelope predictions of Assamese macaque Macaca assamensis in South and South-East Asia: a new role model and assessment of endangered species. Endanger Species Res 36:149–160. https://doi.org/10.3354/esr0088
Reinhart A (2015) Statistics done wrong: the woefully complete guide. No Starch Press, San Francisco
Reich Y, Barai SV (1999) Evaluating Machine Learning models for engineering problems. Artif Intell Eng 13:257–272
Romesburg HC (1989) More on gaining reliable knowledge. J Wildl Manag 53:1177–1180
Schapire RE (1990) The strength of weak learnability. Machine learning, vol 5. Kluwer Academic Publishers, Boston, pp 197–227. https://doi.org/10.1007/bf00116037
Schapire RE (1992) The design and analysis of efficient learning algorithms. MIT Press, USA
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictors. Mach Learn 37:297–336
Silvy NJ (2012) The wildlife techniques manual: research & management, 7th edn, 2 volumes. The Johns Hopkins University Press, Baltimore
Smith BD, Zeder MD (2013) The onset of the Anthropocene. Anthropocene 4:6–13
Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York
Verner J, Morrison ML, Ralph CJ (1986) Wildlife 2000. Modeling habitat relationships of terrestrial vertebrates. University of Wisconsin Press, Madison
Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann Publisher, Amsterdam
Yen P, Huettmann F, Cooke F (2004) Modelling abundance and distribution of marbled murrelets (Brachyramphus marmoratus) using GIS, marine data and advanced multivariate statistics. Ecol Model 171:395–413
Zar JH (2010) Biostatistical analysis, 5th edn. Prentice Hall, Upper Saddle River