
From Data Mining with Machine Learning to Inference in Diverse and Highly Complex

4.1 Introduction

For a long time, scientists had limited data, and they applied experiments in isolation to test and advance knowledge on a given subject (see Salsburg 2001 and Conner 2005 for an overview). It is a Western tradition that was then applied to modern questions.

It is from this perspective that frequentist statistics evolved (Chamberlin 1890; Berkson 1942). Its methods became established and promoted (Popper 1945), and were then widely taught as ‘good practice’ (Zar 2010), with many specific statistical tests that could be carried out (e.g. Table 4.1 and references within). It culminated in a study template used for simplicity (Anderson et al. 2000; Burnham and Anderson 2002).

F. Huettmann (*)

EWHALE Lab, Biology and Wildlife Department, Institute of Arctic Biology, University of Alaska-Fairbanks, Fairbanks, AK, USA

“The problems posed for biologists by the real world seldom, if ever, have an exact, correct statistical solution. The assumptions of nearly all techniques are violated to a greater or lesser extent by real data.”

McArdle (1988)

“In a Time of Universal Deceit, Telling the Truth Is a Revolutionary Act.”

Attributed to George Orwell

“With this method the dangers of parental affection for a favorite theory can be circumvented.”

T. Chamberlin (1890)

Table 4.1 Overview of some basic statistical tests for common data problems

| Data situation | Recommended test | Reference | Comments |
|---|---|---|---|
| Normal distribution | Goodness of fit tests (Z scores, Shapiro–Wilk, Kolmogorov–Smirnov Goodness-of-Fit Test, Lilliefors and Anderson–Darling tests) | Filliben (1975), Zar (2010), Razali and Wah (2011) | Nature is virtually never normally distributed. Consequently, those tests are purely theoretical without any relevant meaning for real life and conservation management applications (McArdle 1988). |
| Parametric: Test for differences among samples and experiments | Chi-square test (one way, or two way contingency table); widely used | Quinn and Keough (2004), Zar (2010) | Should only be done with a valid hypothesis testing framework. However, this assumes the underlying theory is correct and statistically met. It is possible to use a GLM as well (see below). There are many references that question the validity of the threshold (e.g. 0.05), and others argue to “euthanize” p-values altogether (Anderson et al. 2000, Anderson and Burnham 2002, Concato and Hartigan 2016; see Stang et al. 2010 for the tyranny of p-values). |
| Regression slope | Linear Regression Model (LM) (e.g. Zar 2010) | Zar (2010), McArdle (1988) | The existence of a slope is often seen as an ‘effect’, e.g. when compared to a flat line (no slope). However, these details are widely discussed. Usually LMs require a normal distribution of the errors and thus, they are not so realistic for natural processes. |
| Non-parametric: Test for differences among ordinal samples and experiments | Median Test, Mann-Whitney U-test, Kruskal-Wallis test, Wilcoxon signed-rank test | Venables and Ripley (2002) | There is a wider debate about the power of those tests. Arguably, they are more powerful than the parametric tests, but their predictive performance tends to be poor. And consequently, so is the inference, according to Breiman (2001a). |
| Interval/ratio data | T-test, ANOVA (F-test) | Thompson (2004) | One of the most frequently used statistical test procedures. However, its predictive performance is rather low, and so is the inference according to Breiman (2001a). |
| Multiple comparison tests | Bonferroni, Tukey, Scheffe | Quinn and Keough (2004) | There is a wide debate about the validity of those tests. See Rothman (1990) and Perneger (1998) for what’s wrong with Bonferroni tests. |
| Advanced difference tests | MANOVA, ANCOVA, MANCOVA | Hilborn and Mangel (1997) | Those complicated tests rarely meet the real-world assumptions, unless data match the required research design and are free of interactions etc. (Nature is virtually never free of interactions and lacks such a research design.) |
| Normal distribution of errors (Heteroscedasticity) | White test, modified Breusch-Pagan test | Quinn and Keough (2004), Zar (2010) | There should be no heteroscedasticity (i.e., variance of residuals should not increase with the fitted values of the response variable) and no error bias in linear regressions. Often it gets rectified through a Box Cox transformation (but this affects the original data for inference). |
| Confidence Intervals | Confidence intervals and/or standard error | Gardner and Altman (1986), Fidler and Loftus (2009) | Unless confidence intervals are truly assessed with alternative data, they contribute more to unproven claims (so-called confidence trick; Salsburg 2001, Reinhart 2015). |
| Power of a test and effect size | Alpha and beta levels, simulations | Greenland et al. (2016) | A key question in experimental testing, for sample sizes required and sensitivity setting of a valid inference. |
| Multiple Regressions (many predictors) | Generalized Linear Models (GLM) | Hilborn and Mangel (1997), Venables and Ripley (2002), Quinn and Keough (2004) | The ‘workhorse’ for many statistical applications and in model selection studies. Usually applied in a logistic and parsimonious setting (Burnham and Anderson 2002, Manly et al. 2002). However, GLMs and GAMs are known to predict poorly (Elith et al. 2006), and they have very strong assumptions. |
| Autocorrelation | Moran’s I, Ripley’s | Venables and Ripley (2002) | There is a lot of debate about autocorrelation, e.g. to correct it or use it as a description (Swihart and Slade 1985; Betts et al. 2009). |
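The heteroscedasticity entry of Table 4.1 (variance of residuals should not increase with the fitted values) can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions: the data are simulated, the helper names `ols_fit` and `residual_variance_ratio` are hypothetical, and the check is a crude split-sample variance ratio, not the White or Breusch-Pagan test named in the table.

```python
import random
import statistics


def ols_fit(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_variance_ratio(x, y):
    """Residual variance in the upper half of fitted values over the lower half.

    A ratio near 1 is consistent with constant variance; a ratio well
    above 1 suggests that the spread grows with the fitted values.
    """
    intercept, slope = ols_fit(x, y)
    fitted_and_residual = sorted(
        (intercept + slope * xi, yi - (intercept + slope * xi))
        for xi, yi in zip(x, y)
    )
    half = len(fitted_and_residual) // 2
    low = [res for _, res in fitted_and_residual[:half]]
    high = [res for _, res in fitted_and_residual[half:]]
    return statistics.variance(high) / statistics.variance(low)


# Simulated data (an assumption for illustration, not field data).
rng = random.Random(42)
x = [rng.uniform(1, 10) for _ in range(400)]
# Case 1: error spread is the same everywhere.
y_constant = [2 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
# Case 2: error spread grows with the predictor, as is common in nature.
y_growing = [2 + 0.5 * xi + rng.gauss(0, 0.3 * xi) for xi in x]

ratio_constant = residual_variance_ratio(x, y_constant)
ratio_growing = residual_variance_ratio(x, y_growing)
print(f"constant spread: ratio = {ratio_constant:.2f}")
print(f"growing spread:  ratio = {ratio_growing:.2f}")
```

In this sketch the second data set violates the linear-model assumption even though both share the same underlying slope, which is the kind of violation the table entry warns about.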

Yet, similar to good detective work (e.g. Hilborn and Mangel 1997), reflection should still sit at the core of human inquiry and knowledge, specifically for ecology and natural resource management (Naess 1989; Romesburg 1991; Dodds 2001; Stephens et al. 2007; Silva 2012; Stanton-Geddes et al. 2014), especially where modern computing now allows for more than just ‘simplicity/parsimony’. Instead, modern computing enables us to carry out powerful predictions (Venables and Ripley 2002) for a more holistic approach (McGarigal et al. 2000; Drew et al. 2011) to tackle mankind’s problems.

But complex real-world data situations are unknown and difficult for us to comprehend, although much can be accomplished within those vast datasets and data cubes (McGarigal et al. 2000). This is an inherent characteristic of data from nature (McArdle 1988). It applies even more to multivariate problems when many predictors are employed (McGarigal et al. 2000; Drew et al. 2011). Despite many claims made (Zar 2010; see also Anderson and Burnham 2002; Manly et al. 2002; Silva 2012 or even Chamberlin 1890; Berkson 1942 and Popper 1945), there are no good standard rules to generalize multivariate problems, because every data set tends to be different and unique and requires powerful analytical approaches instead.

Reflection is required. Nature is not symmetrical, not linear and not normally distributed; that reality applies to ecology and natural resources, and it has been known and expressed for decades (e.g. Naess 1989; Yoccoz 1991; Dodds 2001). In human medicine and psychology those details are also well known and have been discussed for a long time (e.g. Salsburg 1985; Loftus 1996; Lambdin 2012; Reinhart 2015). This is also true in education (Thompson 2004; Ziliak and McCloskey 2009).
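The claim that natural data are rarely symmetrical or normally distributed can be illustrated with a short numerical sketch. It assumes simulated data only: a log-normal sample stands in (hypothetically) for a skewed ecological variable such as abundance, and the helper `sample_skewness` is an illustrative name, not taken from any cited source.

```python
import random
import statistics


def sample_skewness(values):
    """Moment-based sample skewness; close to 0 for a symmetric sample."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(1)
# Textbook case: symmetric, normally distributed measurements.
normal_sample = [rng.gauss(10, 2) for _ in range(5000)]
# Assumed stand-in for skewed field data: many small values, a few very large.
skewed_sample = [rng.lognormvariate(0, 1) for _ in range(5000)]

skew_normal = sample_skewness(normal_sample)
skew_skewed = sample_skewness(skewed_sample)
mean_skewed = statistics.fmean(skewed_sample)
median_skewed = statistics.median(skewed_sample)

print(f"skewness, normal sample: {skew_normal:.2f}")
print(f"skewness, skewed sample: {skew_skewed:.2f}")
print(f"skewed sample mean {mean_skewed:.2f} vs median {median_skewed:.2f}")
```

In the skewed sample the mean sits well above the median, so a summary or test that presumes symmetry describes a 'typical' value that few records actually show.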

As early as 1988, McArdle stated clearly that “in the I-wish-it-were-so land of theoretical statistics … the data cannot be made to conform to the assumptions…” Thus, violations and surprises can easily occur in biological data sets (Elith et al. 2006; Hastie et al. 2009), and those data defy many other assumptions, theories and untested expectations, such as parametric ones (Zar 2010). Making those data cases all equal and parametric, and putting them through the same analysis steps without much reflection in a study template (as promoted by Burnham and Anderson 2002), does justice neither to the data nor to the many cases and realities they represent. The peculiarities of the data records can easily be lost that way.

But on the bright side, if one manages to resolve those peculiarities with accepted methods that lead to a valid generalization, then the inventors of those methods are likely to end up in permanent positions (for life!). This is what data miners do for a living (Breiman 2001a, b; Mueller and Massaron 2016) and likely why data science is the ‘hottest’ job of the early 21st century: it is a profession that turns marginal data into defensible information (i.e., extracting the signal from a very complex set of information). See Chap. 1 in this book for the evolution and historic details of this concept of machine learning.

In the following, I will elaborate on the task of data mining in real-life applications and contrast it with the usual approach that is still taught in virtually all universities as the gold standard (Ziliak and McCloskey 2009; Zar 2010; for a natural resource management textbook see, for instance, Silva 2012) (Fig. 4.1).

4.2 Model Selection with Many Predictors as an Analysis

From: Machine Learning for Ecology and Sustainable Natural Resource Management (pages 102-106)