Use of Machine Learning (ML)
2.1 Introduction
Over a decade ago, Leo Breiman (2001a) wrote: “There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics.”
“…There is such a thing as being too late. This is no time for apathy or complacency…”
Martin Luther King Jr
F. Huettmann (*) · K. A. Herrick · T. C. Mullet · C. Resendiz · I. Rutzen
EWHALE Lab, Biology and Wildlife Department, Institute of Arctic Biology, University of Alaska-Fairbanks, Fairbanks, AK, USA
e-mail: [email protected]
E. H. Craig
Aquila Environmental, Fairbanks, AK, USA
A. P. Baltensperger
National Park Service, Fairbanks, AK, USA
G. R. W. Humphries
Black Bawks Data Science Ltd., Fort Augustus, Scotland
D. J. Lieske
Department of Geography and Environment, Mount Allison University, Sackville, NB, Canada
K. Miller
Auke Bay Laboratories, Alaska Fisheries Science Center, National Marine Fisheries Service, NOAA, Juneau, AK, USA
S. Oppel
RSPB Centre for Conservation Science, Royal Society for the Protection of Birds, Cambridge, UK
M. S. Schmid
Hatfield Marine Science Center, Oregon State University, Newport, OR, USA
EWHALE Lab, Biology and Wildlife Department, Institute of Arctic Biology, University of Alaska-Fairbanks, Fairbanks, AK, USA
CERC in Remote Sensing of Canada’s New Arctic Frontier, Université Laval, Québec, Canada
M. K. Suwal
Department of Geography, University of Bergen, Bergen, Norway
B. D. Young
Department of Natural Science, Landmark College, Putney, VT, USA
State of Alaska Division of Forestry, Fairbanks, AK, USA
Understanding the complex relationships between animal species and their habitats, and classifying and predicting the responses of species to existing or novel environmental conditions, is one of the primary challenges in ecology and conservation (Wilson 1998; Mac Nally 2000; Strogatz 2001). Machine learning (ML) is based on the principle that computers (‘the machine’) are effective tools for detecting patterns in data and making predictions based on those patterns (Hastie et al. 2009; Strobl et al. 2009). ML comprises many (over 100) algorithms (Fernandez-Delgado et al. 2014), which are even more powerful when ‘ensembled’ (Hastie et al. 2009). The human brain is challenged to grasp the complexities of ecological systems; it can hardly compete with modern computers and ML algorithms for gaining insight into the thousands of dimensions these systems encompass. If the learning process using data from ecological systems is successful (and well tested), the recognized patterns can be generalized and used for classification, prediction, subsequent inference and extrapolation of complex data. These are critical components for achieving science-based conservation management (see Figs. 2.1 and 2.2 for an example and management schema). This approach can be applied to virtually any data, and it eliminates the need to specify, a priori, generally untested and potentially biased assumptions regarding the underlying statistical distribution of the data (Breiman 2001a); as a consequence, ‘self-fulfilling prophecies’ are avoided by design (compare also with Kéry and Schaub 2012). Although available data may lack a traditional research design, ML algorithms can be used to model and provide insight into the complex, nonlinear relationships that are typical of real ecological systems. This presents a paradigm shift affecting not only data treatment and analysis (Breiman 2001a; Hastie et al. 2009; Huettmann 2005, 2007a), but also monitoring schemes (Magness et al. 2008), specifically the understanding and management of natural resources (Huettmann 2007b) and the way institutions carry out and support these endeavors (Huettmann 2012). This is a very relevant topic for sustainability. Arguably, given the very poor state of the world’s biodiversity and habitats (e.g., Mace et al. 2010; Huettmann 2012), it is an ethical requirement to take advantage of the objectivity and power presented by ML approaches.
ML is ‘best available’ science! ML approaches have been successfully applied for decades in disciplines such as genetics, medicine, engineering industry, finance, and in some environmental sciences (Bureau et al. 2005; Cooper et al. 1997; Cutler et al. 2007; Dhar 1998; Galindo and Tamayo 2000; Goldberg and Holland 1988; Kononenko 2001; Kubat et al. 1998; Lee et al. 1996; Reich and Barai 1999; Rosten and Drummond 2006; Shipp et al. 2002), but less so in the wildlife and animal sciences, including behavioral research (e.g. primatology).
Fig. 2.1 (a) Photo of the globally endangered Red Panda (Ailurus fulgens; taken by S. Tashi Lama/Global Primate Network-Nepal during 2009 in the Choyatar Community Forest in Eastern Nepal at an elevation of ~2400 m asl). (b) Machine learning predictions (RandomForest ensemble predictions of the Red Panda ecological niche in the Hindu-Kush Himalaya region; Kamel et al. in review)
Fig. 2.2 Flow chart of machine learning methods and how they can be used in wildlife, biodiversity and habitat analysis, using ‘presence only’ data and others
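As a rough illustration of the ‘ensembling’ idea mentioned above, the following minimal Python sketch combines many simple one-split rules (‘stumps’), each fitted to a bootstrap resample of the data, by majority vote (bagging). It is not the RandomForest workflow used for Fig. 2.1; the data are synthetic, and all names (`fit_stump`, `fit_bagged_stumps`, `predict`) as well as the gradient values and the two noisy records are assumptions made only for this example.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split rule 'predict 1 if x > t' by minimising training errors."""
    distinct = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values
    # (fall back to the single observed value if the sample is degenerate).
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])] or distinct

    def errors(t):
        return sum((x > t) != bool(y) for x, y in zip(xs, ys))

    return min(candidates, key=errors)

def fit_bagged_stumps(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap resample and return the list of thresholds."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n records with replacement
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def predict(thresholds, x):
    """Majority vote over all stumps: 1 = presence, 0 = absence."""
    votes = sum(x > t for t in thresholds)
    return int(2 * votes > len(thresholds))

# Synthetic, hypothetical presence/absence records along one habitat gradient.
xs = [i * 0.5 for i in range(21)]       # gradient values 0.0 .. 10.0
ys = [1 if x > 5 else 0 for x in xs]    # presence above the midpoint of the gradient
ys[8], ys[13] = 1, 0                    # two deliberately 'noisy' records

thresholds = fit_bagged_stumps(xs, ys)
print("low end of gradient :", predict(thresholds, 0.0))
print("high end of gradient:", predict(thresholds, 10.0))
```

Each individual stump depends on which records happen to fall into its resample; the vote across resamples is what stabilises the prediction. Real applications would use many predictors and an established implementation rather than this toy.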
Habitat-species/biodiversity relationship modeling is one of the fastest growing sub-disciplines in ecology, as judged by rising citations of such publications (e.g. Elith et al. 2006). However, wildlife and ecology disciplines still choose to rely heavily on stochastic data models, with their associated use of p-values and the concept of parsimony (Akaike Information Criterion [AIC]), to describe and quantify complex systems (Mac Nally 2000; Mogie 2004; Whittingham et al. 2006; Breiman 2001a). This has translated into the virtually dogmatic application of AIC for inference (e.g. Johnson 1999; Anderson et al. 2000, 2001; Anderson and Burnham 2002; see Guthery et al. 2001 and Stephens et al. 2007 for a discussion). These traditional analyses generally avoid ML or use it in a constrained fashion (Braun 2005; Hochachka et al. 2007) and almost universally advocate the theoretical development of biologically plausible and constrained statistical data models prior to analysis (Burnham and Anderson 2002), before the actual confirmatory test has been completed. This resembles an analysis in which the outcome is known before it has been tested. Arguably, this approach to modeling requires a level of a priori knowledge that is typically absent in real-world ecology studies, particularly for broad-scale analyses. In practice, many real-world situations involve huge datasets where the set of plausible models is hard to identify, largely unknown, or too large to enumerate, and the prior information required to support the formulation of appropriate statistical data models is unavailable or weakly understood (Hochachka et al. 2007).
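To make the enumeration problem concrete, the short Python sketch below counts how many candidate models arise from simply including or excluding each of p predictors (2^p, ignoring interactions and transformations) and shows the standard AIC formula, AIC = 2k − 2 ln(L), together with its common least-squares form. The function names and the residual sums of squares are hypothetical and serve only as an illustration; conventions for counting parameters vary between authors.

```python
import math

def n_candidate_models(n_predictors):
    """Number of distinct predictor subsets (main effects only, incl. the empty model)."""
    return 2 ** n_predictors

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def aic_least_squares(n_obs, rss, n_params):
    """AIC for a least-squares fit with Gaussian errors, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

# The candidate set grows exponentially with the number of predictors.
for p in (5, 10, 20, 30):
    print(f"{p:2d} predictors -> {n_candidate_models(p):,} candidate models")

# Hypothetical residual sums of squares for two candidate models on 100 observations.
simple = aic_least_squares(100, rss=50.0, n_params=2)
richer = aic_least_squares(100, rss=48.0, n_params=5)
print(f"simple model AIC: {simple:.2f}")
print(f"richer model AIC: {richer:.2f}")
```

In this made-up comparison the richer model fits slightly better, but not by enough to offset its parameter penalty, so the simpler model has the lower AIC. The comparison only ever ranks the models the analyst thought to include; with 30 predictors there are already more than a billion main-effect subsets to choose from.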
Even with extensively studied species, previously unknown relationships may be identified that change the theoretical framework and interactions, and which require completely new models and explanations to be developed. The power of ML applications is that they can extract and infer the relevant signals and relationships from complex data without any prior knowledge of the nature and shape of those relationships (Breiman 2001a; Hochachka et al. 2007). This trait makes ML techniques particularly applicable for data exploration, especially in a time of increased availability of powerful computers (Cushman and Huettmann 2010), e.g. ‘cloud computing’, and of an ever-growing supply of data from global online databases. At a minimum, ML applications can intelligently guide the preliminary exploration and analysis of vast amounts of data, and they do so in less time and with greater efficiency than hitherto possible using traditional statistical approaches (see e.g., Huettmann 2007a; Hochachka et al. 2007; Kampichler et al. 2010; Hochachka et al. 2012).
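The point that an algorithmic model can recover the shape of a relationship without being told that shape in advance can be sketched with a deliberately simple learner. The Python example below compares a pre-specified straight-line model with k-nearest-neighbour regression (chosen here only for brevity; it is not one of the methods discussed in this chapter) on a synthetic, hump-shaped response along a single gradient, and evaluates both on held-out points. All data and names are assumptions for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a pre-specified 'data model')."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, k=2):
    """k-nearest-neighbour regression: no functional form is assumed."""
    def predict(x):
        nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
        return sum(ys[i] for i in nearest) / k
    return predict

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def response(x):
    """Synthetic, hypothetical hump-shaped response peaking mid-gradient."""
    return math.exp(-((x - 10.0) ** 2) / 8.0)

# Training points on the integer grid; held-out test points halfway between them.
train_x = [float(i) for i in range(21)]
train_y = [response(x) for x in train_x]
test_x = [i + 0.5 for i in range(20)]
test_y = [response(x) for x in test_x]

line = fit_line(train_x, train_y)
knn = fit_knn(train_x, train_y)
print(f"straight line, held-out MSE: {mse(line, test_x, test_y):.4f}")
print(f"2-NN,          held-out MSE: {mse(knn, test_x, test_y):.4f}")
```

Because the hump is symmetric, the straight line is essentially flat and misses the peak entirely, whereas the neighbour-based model tracks it without any assumption about its form. Evaluating on held-out points mirrors the ‘well tested’ requirement stated earlier.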