
[5] Conservative quantifiers implicitly restrict the quantification-relevant sets to the quantified noun. For instance, “Half the squares are red.” is equivalent to “Half the squares are red squares.”, since other “red” objects are not relevant to the statement’s interpretation.

The provocative title of this section deliberately alludes to the seminal paper of Ioannidis (2005). Mainly in response to a replication crisis in medical research, this paper identifies high-level mechanisms which lead to a systematic increase in ‘false’ findings, that is, claimed effects which are subsequently refuted. The observations around evaluation practice can be transferred to other scientific fields, and I think it is worth considering the following six corollaries postulated by Ioannidis (2005) in the context of deep learning research in recent years:

1. “The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.”

2. “The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.”

3. “The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true.”

4. “The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.”

5. “The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.”

6. “The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.”
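The first, second and fifth corollaries can be made concrete with the positive predictive value (PPV) that Ioannidis (2005) uses to motivate them. The following is a minimal sketch, assuming the usual setup of that paper: `R` is the pre-study odds of true to not-true relationships, `power` is 1 − β, `alpha` is the type I error rate, and `bias` is the fraction of analyses that would not have been findings but are reported as such. The function and parameter names are my own, and the example settings are purely illustrative, not estimates for any real field.

```python
def ppv(R: float, power: float, alpha: float = 0.05, bias: float = 0.0) -> float:
    """Post-study probability that a claimed finding is true.

    R     -- pre-study odds of true to not-true relationships
    power -- 1 - beta, probability of detecting a true relationship
    alpha -- type I error rate
    bias  -- share of would-be non-findings that get reported as findings
    """
    beta = 1.0 - power
    # True relationships reported as findings: detected ones, plus missed
    # ones that bias turns into reported findings anyway.
    true_findings = power * R + bias * beta * R
    # Null relationships reported as findings: false positives, plus true
    # negatives that bias turns into reported findings.
    false_findings = alpha + bias * (1.0 - alpha)
    return true_findings / (true_findings + false_findings)


if __name__ == "__main__":
    # Hypothetical settings chosen only to show the direction of each effect.
    settings = {
        "well powered, even odds": dict(R=1.0, power=0.8),
        "underpowered (corollaries 1, 2)": dict(R=1.0, power=0.2),
        "low pre-study odds (corollary 3)": dict(R=0.1, power=0.8),
        "low odds plus bias (corollary 5)": dict(R=0.1, power=0.8, bias=0.3),
    }
    for label, kwargs in settings.items():
        print(f"{label}: PPV = {ppv(**kwargs):.2f}")
```

In this sketch, lowering the power, lowering the pre-study odds, or raising the bias each reduces the share of claimed findings that are true, which is the direction stated by the corollaries.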

A few researchers have expressed concern about research practice in machine learning related to these points: Sculley et al. (2018) noted a lack of “empirical rigour”, Lipton and Steinhardt (2018) commented on “troubling trends in ML scholarship”, and Hutson (2018) recently even voiced the question that ‘hangs in the air’: is machine learning facing a replication crisis? Following a literal interpretation of “replication”, one may respond that machine learning is well guarded against such a crisis, given that experiments can easily be repeated, particularly thanks to the increasingly common practice of releasing paper-accompanying code and data online [6]. I propose to look at “replication” from a different angle and thereby, I believe, capture the concerns about machine learning practice more faithfully: the type of replication crisis ML research may be facing is not due to an inability to reproduce the experiment, that is, the performance number of a model on a specific dataset, but to an inability to reproduce the superior capabilities of this same model that the ML paradigm implies or promises. Ioannidis (2005) linked the amount of such spurious improvements to the “prevailing net bias” in the community. Indeed, continued experimental practice despite the range of findings reviewed in this chapter, which report weird model behaviour, transfer/downstream failure, dataset biases and inadequate performance metrics, can only be attributed to a strong belief in the abilities of deep learning.

[6] However, Hutson (2018) rightfully pointed out that: (a) the majority of papers still do not come with open-sourced code; (b) experiments are sensitive to minuscule aspects of the training conditions, down to random seeds (Henderson et al., 2018) and hardware details; (c) the same level of significance may not be replicable due to flawed statistical methods (Szucs and Ioannidis, 2017; Reimers and Gurevych, 2018; Király et al., 2018); and (d) the scale of experiments coming from research groups in industry is simply infeasible for academic researchers to replicate.

This belief is further attested by the common use of anthropomorphising and (deliberately?) imprecise language to describe model behaviour, such as models learning to “understand”, “infer”, “attend” or “recognise”, as opposed to more technical terms related to optimisation (Levesque (2014) and Lipton and Steinhardt (2018) mention the problem of language as well). Anthropomorphising language may have the effect of sustaining this belief in the overall progress of the field, and at times of fooling even more cautious researchers into over-optimism despite doubts about a range of individual experimental results.

How can a “prevailing net bias” cause a replication crisis? Belief in the potential for human-like abilities of deep neural networks lowers the threshold for accepting results that suggest such capabilities. Consider the not infrequent situation where a qualitative analysis based on a few data points reveals both positive and, crucially, negative evidence (“here the model fails to...”), but it is nonetheless concluded that the model performs better thanks to its superior capabilities, as confirmed by an improvement of a few percent on a benchmark. Do we really expect that the superior ability in question would improve performance by only, say, 1-3%? Instead of accepting the hypothesis that the model is superior, one should probably question (referring back to the assumptions underlying the ML paradigm from the introduction of this chapter): (a) whether the dataset really is a good surrogate for the evaluated task; (b) whether an improved performance score is sufficient to support the claim of superior capabilities; and (c) whether solving the test set even requires the respective abilities.
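To give a sense of scale for question (b), the following rough sketch checks whether an accuracy gap of a few percent can even be distinguished from test-set sampling noise. It rests on simplifying assumptions that are mine, not the cited authors': a normal approximation, and the two scores treated as independent binomial proportions measured on `n_test` instances each. The function names and the example numbers are hypothetical.

```python
import math


def accuracy_gap_interval(acc_a: float, acc_b: float, n_test: int,
                          z: float = 1.96) -> tuple[float, float]:
    """Approximate interval for acc_b - acc_a (z = 1.96 gives roughly 95%).

    Assumption: both accuracies are independent binomial proportions
    estimated on n_test test instances each.
    """
    gap = acc_b - acc_a
    std_err = math.sqrt(acc_a * (1.0 - acc_a) / n_test
                        + acc_b * (1.0 - acc_b) / n_test)
    return gap - z * std_err, gap + z * std_err


def gap_is_distinguishable(acc_a: float, acc_b: float, n_test: int) -> bool:
    """True if the approximate interval for the gap excludes zero."""
    low, high = accuracy_gap_interval(acc_a, acc_b, n_test)
    return low > 0.0 or high < 0.0


if __name__ == "__main__":
    # Hypothetical scores: a two-point gap on test sets of different sizes.
    for n_test in (1_000, 10_000):
        low, high = accuracy_gap_interval(0.70, 0.72, n_test)
        print(f"n_test={n_test}: interval for the gap = [{low:+.3f}, {high:+.3f}]")
```

Since both models are normally scored on the same test instances, a paired test such as McNemar's would be the more appropriate tool, and variance across random seeds would have to be added on top. The sketch therefore only addresses sampling noise, and leaves questions (a) and (c) untouched.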

The problematic dominance of benchmarks in evaluation echoes the observation of Ioannidis (2005) that “the high rate of nonreplication [...] is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study”. While benchmarks start off as a useful tool for comparing different approaches to the same task, over time research comes to revolve solely around them as the dominating factor for acceptance of results within the community (Sculley et al., 2018). Once training data and evaluation procedure are standardised, attention shifts primarily to the creation and evaluation of new models for a task or, in other words, to “machine learning for machine learning’s sake” (Wagstaff, 2012), or “mindless comparisons among the performance of algorithms” (Langley, 2011).

Why “mindless”? First, a single performance score provides very limited insights into the relative strengths and weaknesses of a model and, as a consequence, offers little guidance for the most impactful focus of future research (Langley, 2011; Sculley et al., 2018). Second, taking a dataset as the desired objective does not indefinitely reflect and challenge the interesting core abilities of the underlying task in a progressing field (Pinto et al., 2008; Torralba and Efros, 2011; Wagstaff, 2012). Third, comparatively little attention is paid to translating progress on a dataset into corresponding improvements on the real-world application that inspired the benchmark in the first place (Wagstaff, 2012; Chiticariu et al., 2013; Sturm, 2014). In the worst case, systems with improved application performance for certain instance types are not recognised due to the fact that overall dataset performance is not much affected.
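The worst case described above, where a marked improvement on certain instance types is hidden by a nearly unchanged overall score, can be illustrated with a small sketch. The instance types, counts and helper names below are invented for illustration and do not come from any of the cited works.

```python
from collections import defaultdict


def make_records(instance_type: str, n_correct: int, n_total: int):
    """Build (instance_type, is_correct) pairs for a hypothetical evaluation."""
    return ([(instance_type, True)] * n_correct
            + [(instance_type, False)] * (n_total - n_correct))


def overall_accuracy(records) -> float:
    """Single aggregate score, as typically reported on a benchmark."""
    return sum(is_correct for _, is_correct in records) / len(records)


def accuracy_by_type(records) -> dict:
    """Accuracy broken down per instance type."""
    totals, hits = defaultdict(int), defaultdict(int)
    for instance_type, is_correct in records:
        totals[instance_type] += 1
        hits[instance_type] += int(is_correct)
    return {t: hits[t] / totals[t] for t in totals}


if __name__ == "__main__":
    # Hypothetical test set: 900 frequent instances and 20 rare ones.
    baseline = make_records("frequent", 720, 900) + make_records("rare", 4, 20)
    improved = make_records("frequent", 720, 900) + make_records("rare", 16, 20)
    for name, records in (("baseline", baseline), ("improved", improved)):
        print(name, f"overall={overall_accuracy(records):.3f}",
              accuracy_by_type(records))
```

In this made-up setting the overall score moves by little more than one percentage point, while accuracy on the rare instance type changes from 0.2 to 0.8. Only the per-type breakdown makes the difference visible.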

Taken together, these aspects indicate the lack of a ‘regulariser’ for the process of introducing new models: that is, by analogy with the machine learning technique, a mechanism which keeps the model development process in balance by, for instance, requiring a certain degree of robustness, generalisation and transferability, to prevent unhindered community-wide benchmark overfitting. Consequences of the latter can definitely be observed: a large number of task-specific holistic architectures with wildly varying names and seemingly arbitrary variations of all parts of a system (see, for instance, the multitude of visual question answering models mentioned in Section 3.3), as opposed to generic network modules whose beneficial effect is uncontroversial, like batch normalisation (Ioffe and Szegedy, 2015) or residual connections (He et al., 2016). Further symptoms are the importance attached to hyperparameter search instead of robust learning processes, and the perception of model building as a “dark art” rather than as a discipline with a rich set of proven best practices.

As Torralba and Efros (2011) argued, better benchmark datasets are unlikely to let us escape the “vicious cycle” of dataset creation. To overcome the detrimental effect of monolithic benchmarks, interest has to shift from cheap model comparisons as the driving force to detailed evidencing of model capabilities, where benchmark scores play only a minor role as comparative “sanity checks”. Sturm (2014) identifies as the fundamental problem the limited control over the content of evaluation data that a benchmark dataset can offer, and illustrates this point vividly with the example of “Clever Hans”, a horse which supposedly exhibited extensive arithmetical, reasoning and language understanding skills (Pfungst and Rahn, 1911). Driven by scepticism towards the hypothesis of a ‘clever’ horse, a sequence of experiments testing alternative explanations for the apparent evidence under carefully controlled conditions eventually yielded a far more likely explanation of the observations: subconscious, nearly undetectable micro-cues given by the person posing the question to the horse.
