From Data Mining with Machine Learning to Inference in Diverse and Highly Complex
4.3 Confront Models with Data: Moving towards an Evidence–Based Analysis
Although rarely done in natural resource management, arguably all data can be expressed and summarized as a (quantitative) model. The data then turn into a single formula (essentially a set of rules!) that ‘clones’ or mimics the training data. It represents an entire experimental setting or, for example, an ecosystem. This is the act of creating a ‘learner’ (Schapire 1990, 1992; Freund and Schapire 1997; Schapire and Singer 1999; Friedman 2002). The beauty here is that you have now captured the data and can express them (i.e., an entire ecological situation) as a formula with metrics to modify them, all done in software on your computer! While virtually any linear regression application does just that, an even better summary can be achieved with the non-linear methods provided through machine learning (Elith et al. 2006; Hastie et al. 2009; Mi et al. 2017).
Often, this does not work out precisely: the model remains an approximation of the data set, and the mismatch shows up as the unexplained variance. Fig. 4.2 shows it in two examples (Fig. 4.3).
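To make the idea of ‘data as a formula’ and its unexplained variance concrete, here is a minimal sketch in Python. It assumes ordinary least squares as the summarizing formula and uses made-up numbers; the names fit_line and unexplained_fraction are illustrative and do not come from the text.

```python
# Illustrative sketch: summarize a small data set as a linear formula and
# report how much of the variance that formula leaves unexplained.
# The observations below are hypothetical and chosen only for illustration.


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def unexplained_fraction(xs, ys, intercept, slope):
    """Residual sum of squares over total sum of squares (1 - R^2)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return ss_res / ss_tot


# Hypothetical observations with a curved (quadratic) response.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
fraction = unexplained_fraction(xs, ys, intercept, slope)
print(f"formula: y = {intercept:.2f} + {slope:.2f} * x")
print(f"variance left unexplained: {fraction:.1%}")
```

In this toy case the straight line captures most of the pattern but leaves roughly 8% of the variance unexplained, because the underlying response is curved. That remainder is the mismatch referred to above.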
Table 4.2 A few methods used to choose ‘the best’ model (and its predictors) [Arguably, ‘the best’ is a subjective term, but in modeling it can be presented well when using performance metrics, e.g. Pearce and Ferrier (2000)]
Model selection
| Method + Ref | Pros + Ref | Cons + Ref | Comments |
|---|---|---|---|
| Visual (eye-balling) | None, but many applications of pure eye-balling exist, e.g. Manly et al. (2002), Johnson and Seip (2008). | Reinhart (2015) | Eye-balling is widely done and even approved of, whereas it is not precise and flips the entire reason and concept of working quantitatively (for more details see Reinhart 2015). |
| Univariate | It’s fast and appears simple (Zar 2010). | Misses reality (McGarigal et al. 2000). | Widely applied, although ecology and nature are known to be multivariate, e.g. Naess (1989), McGarigal et al. (2000). |
| p-values | Shows ‘significant’ explanations and predictors. | Significant-itis (Chia 1997) to be ‘euthanized’ (Anderson and Burnham 2002, Burnham and Anderson 2002). | Arguably, still the dominant approach in the sciences, e.g. Quinn and Keough (2004). |
| AIC (Burnham and Anderson 2002) | Finds ‘one’ (or the best) strong predictor. It implies due diligence and ‘science well done’. | Overfit, biased, poor prediction (Arnold 2010, Guthery et al. 2005, Guthery 2008). | The performance metric of AIC has no biological nor mathematical justification. It is arbitrary. |
| Stepwise (Venables and Ripley 2002) | Fast; implies an exhaustive approach. | Misses reality (Whittingham et al. 2006). Forward and backward approaches are not in good agreement with each other (Venables and Ripley 2002). | This is essentially a first, simplistic and naïve version of data mining but using the wrong (linear) methods. |
| Bayesian * | Use of priors (informed and uninformed ones), WinBugs algorithm etc., e.g. Hobbs and Hooten (2015) | Automated inference and subjective, e.g. Gelman (2008) | An alternative to frequency statistics and its inference, but poor predictions and such inference overall. |
| Maxent^a | Fast, reliable (Elith et al. 2006) | Tends to be point-and-click, low skill and expertise needed; often outperformed (e.g. Mi et al. 2017). For inference discussion see Yackulic et al. (2012) | Among the best prediction, classification and inference tools. A prediction and machine learning approach. |
| Machine Learning: CART^a | Fast, alternative to linear regressions; new insights (Breiman et al. 1984; De’ath and Fabricius 2000) | Usually outperformed by boosting and bagging (Breiman 2001b, Elith et al. 2006). | A strong method from the 1980s. Breaks linearities in data. Often the basis for a great ‘Learner.’ |
| Machine Learning: Random Forest (bagging)^a | Fast, very reliable (Mi et al. 2017), achieving high. Breiman (2001b); Cutler et al. (2007) | Often ignored or not known or favored by investigators and funders. Strobl et al. (2007) report bias. | Among the best prediction, classification and inference tools, often used in ensembles. |
| Machine Learning: Boosting^a | Fast, very reliable (Elith et al. 2006). | Often ignored or not known or favored by investigators and funders. | Among the best prediction, classification and inference tools; often used in ensembles. |
| Machine learning algorithms, e.g. Hastie et al. (2009), Fernandez-Delgado et al. (2014). | Powerful and flexible | Often ignored or not known or favored by investigators and funders. | A large field waiting to be explored further for progress. |
^a While these are concepts and algorithms as such, they have inherent metrics and approaches to model selection, and are frequently used to rank, find and compare model selections. Ultimately, any model selection is a function of the metric and concept/algorithm used
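For orientation, the AIC listed in the table is conventionally written as follows. This is the standard textbook form and is not taken from this chapter; the symbols k (number of estimated parameters) and \hat{L} (maximized likelihood of the model) are introduced here only for illustration.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

Among a set of candidate models fitted to the same data, the one with the lowest AIC is the one selected. The penalty term 2k is what the table's criticism refers to when it calls the metric arbitrary.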
Fig. 4.2 “The Experiment”, the real world, sampling and application
In modern terms, that unexplained ‘noise’ or variance (or deviance, depending on the underlying distribution) is captured as the entropy (i.e., unexplained structure, chaos in the data). One may easily become more engaged in this discussion (see Georgescu-Roegen 1971 for entropy), but for this section we simply stick with the notion of a model that fits the data, with ‘the rest’ left unexplained. This ‘rest’ is a function of what the algorithm is able to capture from the data. Good algorithms capture more, if not almost all, of the data, whereas bad algorithms do not capture the data well, and others might add explanations that are not ‘true to the data’. The latter case is generically referred to as overfitting, a problem that machine learning algorithms can optimize against (see Hastie et al. 2009 and Mueller and Massaron 2016 for identifying local optima and best-fitting approaches for best-possible generalizations from ‘a learner’). These components are the essence of good algorithms and the reason there is such a chase to find ‘the best’ ones (Fernandez-Delgado et al. 2014).
Now, ‘the best’ is usually explained as fitting the data, which is a mainstay of traditional forms of statistics (Hilborn and Mangel 1997; Zar 2010). But beyond the training data, it seems better to fit the data for the best possible prediction, which only then allows for best-possible inference and generalization (sensu Breiman 2001a, b). The focus on best predictions allows us to generalize from real-world situations, instead of just fitting some data in a narrow and traditional way onto a pre-defined theoretical model assumption, which creates bias (McArdle 1988; Zar 2010). Machine learning leads the way in this regard, whereas linear regression has been shown to be less powerful and is usually dismissed when it comes to ecological multivariate interpretations (McGarigal et al. 2000; for tests and applications see Elith et al. 2006; Oppel et al. 2012) and high-performance applications such as predictions (Fox et al. 2017; Han et al. 2018). When re-arranging, combining, optimizing or linking machine learning algorithms with each other (i.e., ensembles), one can get even higher performance (Araujo and New 2007), as with boosting or bagging (essentially a specific form of ensembles; Hastie et al. 2009). Ensembles offer great opportunities because they are less biased, thanks to their ability to make good use of many predictors across algorithms in a computationally powerful way (Elder 2003) (Fig. 4.4).
Fig. 4.3 Linear regression data summary vs machine learning summary
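The contrast between a linear summary, a single tree and a bagged ensemble can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method of any cited study: the data are simulated from a sine curve with noise, the tree is a deliberately tiny one-predictor, CART-style regression tree, and all function names (fit_line, fit_tree, bagged_trees, mse, simulate) are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_tree(xs, ys, depth=3, min_leaf=5):
    """Tiny one-predictor regression tree with squared-error splits."""
    mean_y = sum(ys) / len(ys)
    if depth == 0 or len(xs) < 2 * min_leaf:
        return lambda x: mean_y
    pairs = sorted(zip(xs, ys))
    best = None  # (sum of squared errors, split index)
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, i)
    if best is None:
        return lambda x: mean_y
    i = best[1]
    threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    lo = fit_tree([x for x, _ in pairs[:i]], [y for _, y in pairs[:i]], depth - 1, min_leaf)
    hi = fit_tree([x for x, _ in pairs[i:]], [y for _, y in pairs[i:]], depth - 1, min_leaf)
    return lambda x: lo(x) if x <= threshold else hi(x)


def bagged_trees(xs, ys, n_trees=25, seed=0):
    """Bagging: average trees fitted to bootstrap resamples of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_tree([xs[j] for j in idx], [ys[j] for j in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)


def mse(model, xs, ys):
    """Mean squared prediction error of a model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


data_rng = random.Random(42)


def simulate(n):
    """Hypothetical non-linear 'ecological' response with noise."""
    xs = [data_rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + data_rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


train_x, train_y = simulate(200)
test_x, test_y = simulate(200)  # independent data, never used for fitting

linear_mse = mse(fit_line(train_x, train_y), test_x, test_y)
tree_mse = mse(fit_tree(train_x, train_y), test_x, test_y)
bagged_mse = mse(bagged_trees(train_x, train_y), test_x, test_y)
print(f"linear: {linear_mse:.3f}  tree: {tree_mse:.3f}  bagged: {bagged_mse:.3f}")
```

All three models are scored on data that were not used for fitting, in line with the emphasis on prediction over fit. On a strongly non-linear signal like this one, the tree-based learners should show a clearly lower prediction error than the straight line. Whether bagging also beats the single tree depends on the data and settings.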
But any model that can capture data in ‘good’ terms is helpful for predictions, and we then tend to use it. Again, the prediction is the focus, not the algorithm per se. And why not? One would be ill-advised to ignore such an algorithm that provides progress.
Therefore, any model or algorithm that provides progress is used, and others are usually dismissed (unless one finds value and insight in a comparison, let’s say). What really matters now is to show that the evidence provided is true and generalizable. That is usually done by confronting the model with other data, alternative data and evidence (Hilborn and Mangel 1997). If those match, two lines of evidence are similar and overall progress is provided. It is very difficult to reject such a case and such logic, given the evidence. Ideally, more evidence is located and the model gets compared with it. Multiple lines of evidence are ideal and widely used; see for instance Huettmann et al. (2011) with 5 lines of evidence and Kandel et al. (2015) with 7 lines of evidence. All of those lines of independent evidence are in good agreement with each other, supporting the prediction and thus the generalization, and the method outlined here overall. The model should be used and interpreted then, but only then. This approach nicely follows a meta-analysis scheme (Schmidt and Hunter 2014) and proves to be rather successful for synthesis. That’s because all existing and available information is used and interpreted. This means that no other evidence is left to show different results than those presented, and hence the argument is difficult to reject because it is based on ‘best available’ science.
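Confronting a model with several independent data sets can be sketched as follows. This is a minimal illustration, not the procedure used in the studies cited above: it assumes presence/absence observations, uses the area under the ROC curve (AUC) as the agreement metric, and all data set names and numbers are hypothetical.

```python
# Illustrative sketch: score one model's predictions against several
# independent 'lines of evidence'. All names and values are hypothetical.


def auc(scores, labels):
    """Chance that a random presence outranks a random absence (ties count half)."""
    presences = [s for s, lab in zip(scores, labels) if lab == 1]
    absences = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(
        1.0 if p > a else 0.5 if p == a else 0.0
        for p in presences
        for a in absences
    )
    return wins / (len(presences) * len(absences))


# Each entry: (model-predicted index at the sites, observed presence 1 / absence 0).
evidence = {
    "survey_transects": ([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [1, 1, 1, 0, 1, 0]),
    "telemetry": ([0.85, 0.6, 0.55, 0.5, 0.1], [1, 1, 0, 1, 0]),
    "museum_records": ([0.95, 0.7, 0.65, 0.3, 0.25, 0.15], [1, 1, 0, 0, 0, 0]),
}

results = {name: auc(scores, labels) for name, (scores, labels) in evidence.items()}
for name, value in results.items():
    print(f"{name}: AUC = {value:.3f}")

# An AUC of 0.5 corresponds to chance; here every line of evidence must beat it.
all_agree = all(value > 0.5 for value in results.values())
print("all lines of evidence support the prediction:", all_agree)
```

The threshold of 0.5 is only the chance level. In practice one would set a stricter, study-specific criterion and use whichever metric suits the type of independent data at hand.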
If that is all carried out in an open access framework, this form of reasoning is not only repeatable (Savalei and Dunn 2015) and transparent, but it also provides the products of the latest available reasoning as building blocks for people to use and to work from (Zuckerberg et al. 2011; Greenland 2012). Often, these consist of the best-available data and maps, let’s say. That way, science has truly done its due diligence and decisions are made from it. While well known (e.g. Huettmann 2007), unfortunately this concept is not applied much, and the prime bottleneck still sits in the policy arena (Magness et al. 2008, 2011)! (Fig. 4.5)
Fig. 4.4 Some simplistic and pragmatic reasoning like 1+1 = 2 cannot really be applied to ecology and natural resource management questions