

1.6 Data Issues and Availability Related to Data Mining and Machine Learning

One of the biggest criticisms that the machine learning community receives is that these algorithms are ‘black boxes’ (Craig and Huettmann 2008). This suggests that we arbitrarily input data into an algorithm, some sort of ‘voodoo magic’ occurs, and an output is given. This is an unfortunate misnomer and has led to “machine learning” being considered a ‘dirty term’ in some circles. However, we posit here that we use ‘black boxes’ all the time (e.g., cars, computers, mobile phones, social media, etc.). The term itself is quite subjective, particularly at a time when a plethora of easy-to-follow tutorials exist for free on the internet (see for instance https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/ for an in-depth breakdown of machine learning, or Hastie et al. 2009).

Table 1.2 A selection of machine learning algorithms most commonly used in ecology and wildlife biology, with associated descriptions

Algorithm: Description

MAXIMUM ENTROPY: Maxent is a specific software package that has been used almost exclusively for species distribution modeling, and the various wrappers for this algorithm have been designed as such (Phillips et al. 2006; Phillips et al. 2017). The input tends to be a series of presence-only locations and environmental covariates. Maxent calculates a set of constraints on the environmental covariates and then estimates the probability of presence by choosing, among the distributions that satisfy those constraints, the one with maximum entropy (i.e., the most spread out). Maxent works by ‘thinking’ in terms of covariate space as opposed to geographic space (i.e., we guess a species occurs in a certain range of values, say of temperature, in proportion to the availability of that range of values). A full statistical description can be found in Elith et al. (2011).

CLASSIFICATION AND REGRESSION TREES (CART): This is the earliest of the tree-based methods (Breiman et al. 1984) and forms the basis for both boosted regression trees and random forests. CART works by taking each variable in turn and splitting the variable space (e.g., the x-y plane of dependent vs independent variables). The variance of the data on each side of a candidate split is measured, the split that minimizes the variance of the two data ‘clouds’ is kept, and the procedure is then repeated within each side (recursive partitioning). Each split forms a sort of ‘if/then’ conditional statement (e.g., if the data value of variable x > 2, then go down the left branch, otherwise go down the right branch). A series of these branches forms a tree, which can be queried for the most important predictors (independent variables). Predictions for new data are made by applying the rule set to the new data points.

BOOSTED REGRESSION TREES: This algorithm is derived directly from CART, and is essentially a series of iterated trees, where at each iteration the error is minimized through the application of a loss function (e.g., root mean squared error). The point of this method is to begin with poorly performing trees and to re-fit them using the residuals from the previous model. Predictive performance is measured at each iteration, and when performance starts getting worse, the iterating stops. The final model ends up being a linear combination of all the trees and has the lowest amount of error (Friedman 2002; De'ath 2007). Since many trees are used, it can be seen as a sort of ensemble model (although not in the traditional sense, in that there is no model averaging). Some commercial algorithms offer a clever extension and optimization of this approach, which can be linked with ‘bagging’ techniques (see details below). Those latter concepts tend to be among ‘the best’ available (i.e., they make the best predictions and are the easiest to interpret).

RANDOM FORESTS: This method is derived from applying a ‘bagging’ method to CART (Breiman 2001b) and is probably the most successful (or most popular) machine learning algorithm used in ecology (a list of papers would be too extensive, but see Cutler et al. 2007 for an introduction to their use in ecology). It is an ensemble modeling method based on a specific bootstrapping technique. Many large trees (with many branches) are constructed by sampling with replacement, and the final model ends up being an average of all those trees (by votes in classification schemes or by averaging in regression schemes). Random forests improve on plain bagging by considering only a random subset of the predictors at each split, and the out-of-bag data (the observations left out of each bootstrap sample) provide an error estimate without computationally expensive k-fold cross-validation. Many implementations of random forests exist on varying platforms.

GENETIC ALGORITHMS: Genetic algorithms are derived from the process of evolution, wherein competing solutions ‘evolve’ over time until an optimal solution is reached (Holland 1975; Olden et al. 2008; Fernández et al. 2010). These have been used in ecology to a limited degree under the acronym GARP (genetic algorithm for rule set production; Stockwell and Noble 1992). GARP has only really been used in ecological niche modeling to date (Elith et al. 2006; Peterson et al. 2007) but could be extended to other problem sets. Genetic algorithms work by creating a series of random ‘solutions’, which are then ‘mutated’. The best solutions are selected from these and then recombined or re-‘mutated’. However, current implementations of genetic algorithms lag behind other machine learning tools, and traditional statistical techniques can outperform them because of their tendency to over-learn (over-fit) the data (Olden et al. 2008).

BAYESIAN MACHINE LEARNING: Based on Bayes’ theorem (Laplace 1986; one of the oldest statistical concepts), these methods work by building an understanding of systems through the expression of probabilities and the updating of those probabilities with new evidence. In a machine learning context, Bayesian methods can be applied to classification, where the likelihood of membership in each of the classes is calculated for each of our data points, and new data are assigned to the class with the highest likelihood. These methods can give good results with few training data (Kotsiantis et al. 2006) because they use “prior” information on the parameters of the model, which helps inform the outcome. However, a good ecological knowledge of the system at hand is not always possible, and the choice of prior distributions often requires a ‘best guess’.

SUPPORT VECTOR MACHINES: This method is not commonly used in ecology, but it has some merit for both classification and regression. The essence of support vector machines lies in x-y planes, where data are separated by straight lines. Each line is placed so that the margin, the distance between the line and the nearest data points on either side, is as large as possible (see Fig. 6 in Thessen 2016). Data are classified or predicted based on which side of the line they fall. These models are trained in an iterative fashion and can be tuned using other functions in cases where data are ‘messy’ (as in ecological data). See Kotsiantis et al. (2006) for a detailed description of support vector machines.

ARTIFICIAL NEURAL NETWORKS: This ‘machine learning algorithm’ is the basis for much of the image and voice recognition software that currently exists, and also for many of the artificial intelligence algorithms that currently dominate the technological world. With the advent of ‘deep learning’, neural networks have taken center stage in the machine learning community, but they are actually among the first machine learning methods to gain popularity, in the 1970s and again in the early 1990s (back-propagation, etc.). They work by simulating the way the human brain processes information (Recknagel 2001; Hsieh 2009). Input data, in the form of the independent variables, are taken into a series of ‘nodes’. These data are weighted in the links between nodes and then passed to the next level, where information about those data is extracted, weighted, and passed to a further set of nodes. This could be viewed as a sort of ‘conveyor belt’, except that, using back-propagation, errors can be corrected. These algorithms are exceptionally difficult to program (though this is changing rapidly) and are very efficient for high-dimensional data. See Hagan et al. (2014) for a detailed description of artificial neural networks.
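The CART entry in Table 1.2 describes choosing the split that minimizes the variance on either side. A minimal sketch of that single step is given here, assuming one predictor and a numeric response; the function name `best_split` and the toy data are hypothetical, and a full CART implementation would repeat the search over every predictor and then recurse into each branch.

```python
from statistics import pvariance


def best_split(x, y):
    """Return (threshold, score) for the split of x that minimizes the
    size-weighted variance of y on the two sides of the split."""
    pairs = sorted(zip(x, y))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        # n * population variance = sum of squared deviations in the group
        score = len(left) * pvariance(left) + len(right) * pvariance(right)
        if score < best_score:
            # place the threshold midway between neighbouring x values
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score


# Hypothetical toy data: the response jumps once x exceeds 2
x = [0.5, 1.0, 1.5, 2.5, 3.0, 3.5]
y = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
threshold, score = best_split(x, y)
print(f"if x <= {threshold}: one branch, otherwise: the other branch")
```

The returned threshold is exactly the ‘if/then’ rule described in the table; applying the same search again to the data on each side of the threshold is what grows the tree.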

There is no mystery to machine learning algorithms given enough time to study their inner workings. They have been programmed by people with a thorough understanding of machine learning tools and can be decoded and re-traced. We believe the ‘black box’ argument is incorrect as an objective criticism of machine learning. While we appreciate the difficulty in having to decode these methods, particularly without the computational training required to do so (we have all been there, and continue to learn to this day), a fairer thing to say would be “machine learning is a black box to me, and I should strive to learn about it”.
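As one illustration of how such an algorithm can be re-traced, the residual-fitting loop behind boosted regression trees (Table 1.2) fits in a few dozen lines. This is a sketch under simplifying assumptions, not a published implementation: each ‘tree’ is a single split (a stump), the loss is squared error, and the number of trees is fixed rather than chosen by monitoring predictive performance. The names `fit_stump`, `boost`, `mse`, and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split tree: pick the threshold minimizing squared error."""
    best = None
    xs = sorted(set(x))
    for low, high in zip(xs, xs[1:]):
        t = (low + high) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= t else right_mean


def boost(x, y, n_trees=50, learning_rate=0.1):
    """Sum of small trees, each fitted to the residuals of the current model."""
    base = sum(y) / len(y)  # start from the mean of the response
    stumps = []

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    for _ in range(n_trees):
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        stumps.append(fit_stump(x, residuals))
    return predict


def mse(model, x, y):
    """Mean squared error of a fitted model on the given data."""
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical toy data with a step-like response
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.1, 0.9, 3.0, 3.2, 2.9, 6.0, 6.1]
one_tree = boost(x, y, n_trees=1)
many_trees = boost(x, y, n_trees=50)
print(mse(one_tree, x, y), mse(many_trees, x, y))
```

Every quantity in the loop can be printed and inspected at each iteration, which is the sense in which the method can be decoded. Note that the error reported here is on the training data only; judging when to stop iterating would require held-out data.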

Sometimes, the ‘black box’ argument turns into “where is the code?”. The same question could be asked of GAMs, GLMs, GLMMs, LMs, and other frequentist methods. We put inputs in and we read and interpret the output, but few people know the exact inner workings of the code (save for those fluent in programming languages). The only difference is that frequentist methods have been around for so long, and their mathematical equations are taught to us as students, that we claim to understand them better. This would easily change if the inner workings of machine learning were taught the same way. Either way, this is why transparency is key in the sciences. If a machine learning scientist can show the code from input to predictions (from scratch, without using any pre-packaged libraries in R, for example), while someone else comes along and simply uses the ‘lm’ (linear modeling) function in R, which one is a ‘black box’? The point we strive to make here, as stated above, is that the term ‘black box’ is really not objective. Getting past the notion of the ‘black box’ as an objective criticism and turning it into a ‘transparent box’, or at least a grey box, would greatly help ease ecologists into the use of machine learning algorithms.
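To make the comparison concrete, the arithmetic hidden behind a one-predictor linear model can itself be written from scratch. The sketch below uses the textbook closed-form least-squares estimates; it is an illustration in Python rather than R, the function name `fit_line` and the data are hypothetical, and packaged routines generally reach the same estimates through more numerically careful methods.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(x, y)
print(intercept, slope)
```

Seen this way, a single call to a linear modeling function hides about as much from its user as a call to a machine learning library does.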

Open access code and data are a must for data mining and machine learning work, not just to get past the stigma of the ‘black box’, but also for scientific transparency. Science operates (or at least, in our opinion, it should) on the tenet that another scientist should be able to come along, replicate experiments exactly, and get the same results. This ensures that our work remains truthful and honest (e.g., Zuckerberg et al. 2011). To do this, the entire workflow (i.e., code and data) must be available and formatted in a way that someone could come along and easily run the analysis. We recommend that the data sources are adequately referenced and, if possible, that a datafile (either a database or a flat spreadsheet) with all the pertinent data is provided. This is not typically done, however, and certainly not with well-documented metadata. One argument for withholding data has been the fear of ‘scooping’, which is undocumented and very rare in ecology. In some cases where data could be used by malevolent members of the public to harm a species, there is some argument for restricting data but making it available to other scientists (Tulloch et al. 2018).
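Two small habits go a long way toward exact replication: publishing a checksum of the data so that others can confirm they hold the same records, and fixing the random seed of any stochastic step. The sketch below illustrates both under stated assumptions; the function names, the records, and the choice of a bootstrap as the stochastic step are hypothetical examples, not a prescribed workflow.

```python
import hashlib
import random


def data_fingerprint(rows):
    """SHA-256 digest of the records, so others can confirm they hold the same data."""
    text = "\n".join(",".join(str(v) for v in row) for row in rows)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def bootstrap_mean(values, n_resamples, seed):
    """Mean of bootstrap means; the seed fixes the random resampling."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)


# Hypothetical records: (site id, count)
rows = [(1, 12), (2, 7), (3, 9), (4, 15)]
counts = [count for _, count in rows]

first_run = bootstrap_mean(counts, n_resamples=200, seed=42)
second_run = bootstrap_mean(counts, n_resamples=200, seed=42)
print(data_fingerprint(rows), first_run == second_run)
```

With the seed and the data fingerprint reported alongside the code, a second analyst can check both that the inputs match and that the stochastic step reproduces the same numbers.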

The metadata associated with a dataset is the information that describes its format and workflow (i.e., the description of how data and code are applied to create the output), as well as contact information for those involved in its creation (Huettmann 2015). Although this concept has been around for more than 10 years, metadata is rarely included with, or mandated upon, submission to most journals. Although preparing it is a somewhat painstaking process, it is another important part of ensuring transparency in science (Huettmann 2007). A smart budget for any project should take this, and other data management techniques, into account (i.e., by including salary time for preparing metadata and curating code and data sources).
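A metadata record of the kind described above (format, workflow, and contact information) can be kept as a small machine-readable file next to the data. The sketch below is only an illustration: it does not follow any formal metadata standard, and every field name, file name, and value in it is a hypothetical placeholder.

```python
import json

# Every value below is a hypothetical placeholder, not a real dataset.
metadata = {
    "title": "Example occurrence dataset",
    "format": {
        "file": "occurrences.csv",
        "columns": {
            "site_id": "integer site identifier",
            "latitude": "decimal degrees, WGS84",
            "longitude": "decimal degrees, WGS84",
            "count": "individuals observed per visit",
        },
    },
    "workflow": [
        "1. clean raw field sheets into occurrences.csv",
        "2. run the model-fitting script on occurrences.csv",
        "3. run the prediction script to produce the output map",
    ],
    "contact": {"name": "A. Researcher", "email": "a.researcher@example.org"},
}

# Check that the three elements named in the text are present
REQUIRED = ("format", "workflow", "contact")
missing = [key for key in REQUIRED if key not in metadata]

# Serialize to text that can be saved alongside the data and code
record = json.dumps(metadata, indent=2)
print(record if not missing else f"missing fields: {missing}")
```

A simple completeness check like the one on `REQUIRED` is cheap to run before submission and makes it harder for a dataset to be shared without its description.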

Data management in and of itself requires a whole separate book, and its importance in ecology is highly under-stated (Zuckerberg et al. 2011; Huettmann 2015). Within our field, there is not a ‘best practice’ guide, but it is greatly needed. Good data management techniques such as proper computer filing systems, documenting code, documenting spreadsheets, metadata, backing up, redundant hard drives, high quality software and hardware, etc. ensure long-term survival of data and the science we do. It is a form of scientific accounting, which is non-existent in our current system.

In document Machine Learning for Ecology and Sustainable Natural Resource Management (pages 35-39)