4.4 A Real–World Data and Analysis Workflow Example

to work from (Zuckerberg et al. 2011; Greenland 2012). Often, this consists of the best-available data and maps. That way, science has truly done its due diligence, and decisions can be made from it. While well known (e.g. Huettmann 2007), unfortunately this concept is not applied much, and the prime bottleneck still sits in the policy arena (Magness et al. 2008, 2011) (Fig. 4.5).

Suggestion 2: Mine through the data in full detail (i.e., data exploration)

Justification: This first data mining exercise is less about analysis and quantitative understanding than about getting familiar with the data format and finding where the (technical) ‘speed bumps’ are. Learning how to handle the data obtained, and how to do so efficiently, is key to effective data analysis (and to the time and budget needed to complete the project). Often, these steps need to be repeated, and the more often one does so, the better.

Suggestion 3: Understand the data, each column (i.e., covariate) and each combination you can test and assess

Justification: Many data are not clean and need to be understood first. Often what looks clean is not, and needs a test run first; this applies to columns and their content. That is especially the case with large data sets, where not all content can be inspected right away and problems show up as odd results instead.
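As a minimal sketch of such a column-by-column check, assuming a small comma-separated table, the following Python snippet counts missing entries, distinct values and numeric ranges per column. The table, its column names and the set of missing-value codes (`NA`, `-9999`, and so on) are hypothetical and serve only as illustration.

```python
import csv
import io

# Hypothetical raw survey table; all values are made up for illustration.
RAW = """site,species,count,elevation_m
A1,moose,3,120
A2,moose,NA,135
A3,Moose,2,-9999
A4,caribou,seven,410
A5,,1,398
"""

NA_TOKENS = {"", "NA", "NaN", "-9999"}  # assumed missing-value codes


def profile_columns(text):
    """Return one summary dict per column: size, missing, distinct, numeric range."""
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {}
    for col in rows[0].keys():
        values = [r[col].strip() for r in rows]
        present = [v for v in values if v not in NA_TOKENS]
        numeric = []
        for v in present:
            try:
                numeric.append(float(v))
            except ValueError:
                pass  # non-numeric entry; visible as numeric < present
        profile[col] = {
            "n": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "numeric": len(numeric),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return profile


summary = profile_columns(RAW)
for name, stats in summary.items():
    print(name, stats)
```

In this toy table the profile treats `moose` and `Moose` as separate categories and shows that one `count` entry (`seven`) is not numeric, the kind of ‘speed bump’ that otherwise only surfaces later as an odd result.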

Suggestion 4: Talk to the initial data collector and read all documentation available

Justification: Ideally, such communication should occur when the data are handed over. However, communication with the data collector is more effective after one is ‘into’ the data set and knows it intimately. Arguably, a good head start on the data can be helpful, but most data miners can figure out data sets on their own first. The really crucial content questions come afterwards and continue once the entire data set is analyzed.

Suggestion 5: Address all NAs and data errors.

Justification: According to common wisdom, wrong data tend to produce wrong results. Cleaning data, technically and scientifically, is an essential but big task; there are different philosophies on how to approach it. It can determine the success or failure of such projects. Arguably, one wants ‘good’ data to run models. But ‘life’ is not always clean, and an alternative, effective and often not so problematic approach is to initially ignore the errors, use the raw data, run the analysis, and compare the findings subsequently. In machine learning, ‘majority voting’ can for instance overcome data errors. Different model runs from data cleaned at several levels should also be done for comparison. If that cannot be done on the entire data set, it is worthwhile to run it on a subset or on a certain percentage of the data. The goal here is to find patterns in the data, explain them, and understand how they relate to NAs and errors and what impact ‘cleaning’ has, e.g. to avoid running a self-fulfilling hypothesis with no new information coming from the data analysis.
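The idea that a vote can absorb an isolated data error can be illustrated with a small sketch. It uses a k-nearest-neighbour vote as a simple stand-in for the voting done in ensemble learners; the data set, the deliberate label error and the function name `knn_vote` are assumptions made for illustration only.

```python
from collections import Counter

# Toy 1-D training set: the true rule is "low" for x < 5, "high" otherwise.
# The record at x = 2 carries a deliberate label error.
train = [(0, "low"), (1, "low"), (2, "high"), (3, "low"), (4, "low"),
         (5, "high"), (6, "high"), (7, "high"), (8, "high"), (9, "high")]


def knn_vote(x, data, k):
    """Predict by majority vote among the k nearest training records."""
    nearest = sorted(data, key=lambda rec: abs(rec[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


single = knn_vote(2.1, train, k=1)  # relies on the one erroneous record
voted = knn_vote(2.1, train, k=5)   # four correct neighbours outvote it
print(single, voted)
```

With a single neighbour the erroneous record decides the prediction; with five voters the error is outvoted. Whether this holds for real data depends on how the errors are distributed, which is why comparing raw and cleaned runs remains worthwhile.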

Suggestion 6: Predict the data onto themselves (create a great ‘learner’)

Justification: This step is essential for understanding the inference, the data and the content. Usually, the more closely the training data can be ‘mimicked’, the better. It allows one to generalize beyond the data. Essential questions are what algorithm to use to predict the data onto themselves, what methods to use to describe outliers, and how to measure the variance. While these questions are covered elsewhere (e.g. Fernandez-Delgado et al. 2014), it is suggested to employ different algorithms for a comparison. I highly suggest creating a great ‘Learner’ (as per Breiman 2001a, b; Friedman 2002) instead of adopting a model-fitting mindset (Zar 2010).
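A rough sketch of predicting data onto themselves with more than one algorithm is given below. It fits a straight line and a single-split regression stump to the same toy data and reports how well each reproduces the training values (resubstitution R²). Both learners are deliberately simple stand-ins, not the algorithms of Breiman or Friedman, and the data are invented.

```python
from statistics import mean

# Invented training data with a step-like pattern.
xs = list(range(10))
ys = [1.0, 1.2, 0.9, 1.1, 1.0, 5.0, 5.1, 4.9, 5.2, 5.0]


def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_stump(xs, ys):
    """Single-split regression stump chosen by smallest squared error."""
    best = None
    for cut in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, cut, ml, mr)
    _, cut, ml, mr = best
    return lambda x: ml if x < cut else mr


def r_squared(model, xs, ys):
    """Share of variance in ys reproduced by the model on the same data."""
    my = mean(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


r2_line = r_squared(fit_line(xs, ys), xs, ys)
r2_stump = r_squared(fit_stump(xs, ys), xs, ys)
print(f"line: {r2_line:.3f}  stump: {r2_stump:.3f}")
```

Here the stump mimics the training data more closely than the line because the pattern is a step. How well either would generalize still has to be assessed on data not used for fitting.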

Suggestion 7: Re-run the models without taking out all NAs and data errors (see Suggestion 5).

Justification: As mentioned in Suggestion 5, most data carry errors, and many cannot really be fixed in large data sets, especially when the data carry a legacy and many people were involved. However, to get a sensitivity test of the real impact of cleaning efforts, we suggest running a raw-data model to see how bad the errors and bugs really are. We usually found this a very informative task; it also shows where to put time and effort into data cleanup and how much gain this brings. Doing such runs helps to set priorities. In large data sets, this is not a trivial task because, in our experience, ~10% errors tend to be part of the process in large data holdings with many owners; fuzziness becomes part of the game.
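Such a sensitivity test can be sketched as follows: the same model is fitted once to the raw records and once to the records that pass a cleaning rule, and the two results are compared. The data, the single data-entry error and the plausibility bound `PLAUSIBLE_MAX` are hypothetical; a least-squares slope stands in for whatever model is actually used.

```python
# Toy records following roughly y = 2x; the third y value (60.0) is a
# deliberate data-entry error (6.0 with a shifted decimal point).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 60.0, 8.2, 9.8, 12.1, 14.0, 15.9, 18.2, 20.0]

PLAUSIBLE_MAX = 50.0  # hypothetical plausibility bound used as the cleaning rule


def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Run 1: raw data, errors included.
raw_slope = slope(xs, ys)

# Run 2: only records that pass the cleaning rule.
kept = [(x, y) for x, y in zip(xs, ys) if y <= PLAUSIBLE_MAX]
clean_slope = slope([x for x, _ in kept], [y for _, y in kept])

print(f"raw: {raw_slope:.2f}  cleaned: {clean_slope:.2f}  "
      f"records dropped: {len(xs) - len(kept)}")
```

In this constructed case one bad record out of ten changes the estimate substantially, so cleaning effort would be well spent here. In other data sets the two runs may barely differ, which is exactly the information needed to set priorities.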

Suggestion 8: Think very hard about the hypothesis, and about whether and what it brings to the discussion and to new information

Justification: No doubt, conceptually, hypothesis testing can provide progress if done well and applied where suitable (Cushman and Huettmann 2010; Reinhart 2015). Hypotheses make for a great narrative, appear very sound, and are very appealing to communicate. But this applies primarily to experimental and theoretical settings; nature and reality show us otherwise (McArdle 1988; Reinhart 2015). In the complex multivariate situations of nature, those theoretical concepts are very difficult to apply such that all statistical assumptions are correctly met. Alternatives exist, and we outline a good step forward (see below). This matters essentially for any applied conservation management question.

Suggestion 9: In complex multiple regressions with many predictors, hypotheses probably do not bring much progress, and there are other, better routes to inference.

Justification: Traditionally, the widely used approaches in such cases are p-values, AIC, and perhaps Bayesian approaches. All of them are known to fail on various grounds (as published by Anderson et al. 2000; Guthery et al. 2005; 2008; Whittingham et al. 2006, etc.). Authors like Breiman (2001a, b) and Hastie et al. (2009) have already shown powerful ways forward. See also Fig. 4.4, Strobl et al. 2007 and Drew et al. 2011. The key here is not to apply just a techno-fix but to break out of existing limitations and to embrace new and exciting ways to discover knowledge. Fernandez-Delgado et al. (2014) show options to explore; however, there are many more, and new ones keep appearing and can be found online.

Suggestion 10: Explore any statistical tools and approaches you know of and can get hold of, and test them in parallel.

Justification: There are many ways to carry out an analysis and to get to a conclusion. Many software platforms exist and should be tested and run in parallel. This is not only insightful for one’s own learning but also presents a ‘test’ of how robust current findings and knowledge are. Even more so, many algorithms have different versions and software implementations. For instance, a random number generator in one software package easily differs from that in another, which can affect outcomes. Even for simple and well-known implementations like ordinary least squares regression, many ways exist to carry it out numerically, and not many agree with each other when benchmarked (Sawitzki 1994a, b). Looking then at disciplines like demography and ones that use Distance Sampling and Resource Selection Functions (RSFs), the diversity of tools used remains extremely narrow, not allowing diversity to be tested and to evolve. One should consider that much discussion already exists on dubious inference in those disciplines (Yoccoz 1991; Rexstad et al. 1988; Stephens et al. 2007; Arnold 2010).
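To illustrate how two numerical routes to the same least-squares slope can disagree, the sketch below computes it once from centred data and once from raw sums. The two formulas are algebraically identical; the data (x values with a large offset) are constructed to expose floating-point rounding. This is a generic illustration, not a reproduction of the cited benchmarks.

```python
# Ten records with a large offset in x; the exact relation is
# y = 3 + 2 * (x - 1e8), so the true slope is 2.
xs = [1e8 + i for i in range(10)]
ys = [3.0 + 2.0 * i for i in range(10)]
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n


def slope_centered(xs, ys):
    """Two-pass formula: centre the data first, then accumulate."""
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def slope_raw_sums(xs, ys):
    """One-pass 'textbook' formula on raw sums; same algebra, different rounding."""
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * mx * my
    sxx = sum(x * x for x in xs) - n * mx * mx
    return sxy / sxx


two_pass = slope_centered(xs, ys)
one_pass = slope_raw_sums(xs, ys)
print("centred:", two_pass, " raw sums:", one_pass)
```

The centred version recovers the slope of 2, while the raw-sum version loses digits when subtracting two very large, nearly equal numbers. Running the same analysis through more than one implementation is a cheap way to notice this kind of disagreement.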

Suggestion 11: Leo Breiman (2001a, b) offers a valid philosophy and an insightful, robust approach that should be used and investigated by all means.

Justification: As we promote in this chapter, the works by Breiman (2001a, b) are very thorough, deep and progressive. That holds to this very day, and more options exist and are developing. While progress is not well made in the field of natural resource management, by now this work is well published and offers good software tools to be employed. It has reached a great foundation to work from. A key concept here is to assess and model data (as outlined in the steps above), create a prediction, have alternative data handy to assess the prediction for performance, and then infer accordingly. Ideally, this requires a research design with training and testing data. That will help to overcome many of the problems described here.
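A design with training and testing data can be sketched in a few lines. The example below splits synthetic records into a training and a hold-out set, fits a straight line to the training part only, and measures prediction error on the unseen part. The records, the 15/5 split and the seed are arbitrary assumptions, and the line is a placeholder for any learner.

```python
import math
import random

# Synthetic records (x, y) following y = 2x + 1 plus a small deterministic wobble.
records = [(x, 2 * x + 1 + ((x * 7) % 5 - 2) * 0.1) for x in range(20)]

rng = random.Random(42)  # fixed seed so the split can be repeated
shuffled = records[:]
rng.shuffle(shuffled)
train, test = shuffled[:15], shuffled[15:]


def fit_line(data):
    """Least-squares line fitted to (x, y) pairs; returns (intercept, slope)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    return my - slope * mx, slope


# Fit on the training records only, then assess on the held-out records.
intercept, slope = fit_line(train)
rmse = math.sqrt(
    sum((y - (intercept + slope * x)) ** 2 for x, y in test) / len(test)
)
print(f"hold-out RMSE on {len(test)} unseen records: {rmse:.3f}")
```

The performance figure comes only from records the model never saw, which is what allows inference beyond the training data. Independent ‘alternative’ data, where available, serve the same purpose better than a random split.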

Suggestion 12: Exposing all steps, raw data, raw code and details as metadata and ‘in the open’ supports the buy-in and transparency by the users and the public.

Justification: Since science is to be repeatable and transparent, this suggestion should not come as a surprise, but we find that it is often fully ignored. As Carlson (2011, 2013) and others found, data and project files are not shared, and even less so the actual underlying code (despite the wide use of R, Distance Sampling and MARK software packages, which easily allow for it). Making code publicly available allows for feedback and improvements, and it helps others in their work. It should be seen as a collegial task that serves the wider public good and is part of being a scientist. For any of this, good metadata formats exist (Huettmann 2005, 2009).
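One lightweight way to expose steps, data and code together is a small machine-readable record with checksums, sketched below. The field names are illustrative and do not follow any published metadata standard; the data and code strings are placeholders for real files.

```python
import hashlib
import json

# Hypothetical stand-ins for a raw data file and the analysis script.
raw_data = "site,species,count\nA1,moose,3\nA2,caribou,1\n"
analysis_code = "model = fit(raw_data)  # placeholder for the real script\n"


def sha256_of(text):
    """Return the SHA-256 hex digest of a text, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Field names below are illustrative, not a published metadata standard.
record = {
    "title": "Example workflow run",
    "steps": ["explore", "profile columns", "raw-data model", "cleaned model"],
    "data_sha256": sha256_of(raw_data),
    "code_sha256": sha256_of(analysis_code),
}

metadata_json = json.dumps(record, indent=2, sort_keys=True)
print(metadata_json)
```

Publishing such a record next to the data and code lets others confirm that they are working with exactly the files behind a result. An established metadata format should take its place wherever one applies.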

4.5 Real World Tools and Minimum Approaches to Start

In document Machine Learning for Ecology and Sustainable Natural Resource Management (pages 112-116)