

The purpose of much of the research in this dissertation was to show that the combinatoric results were similar to the empirical results; it was not to obtain the best performance results. Among the main objects of interest were ranked data and the performance of various ranking methods with respect to this data. A central question was “How can it be determined that the combinatoric results are statistically similar to the empirical results?” A complicating matter was that the ranks are ordinal and the data typically did not fit any known distribution; therefore, the use of parametric statistics was generally inappropriate (that topic is discussed in more detail later in this section). So, the question became “Given the nature of the data used in this research and its research goals, how can statistical significance be determined? What are the appropriate significance tests to use for this research?”

The Kolmogorov-Smirnov (K-S) goodness-of-fit test (Conover, 1999) and the Mann-Whitney test (also known as the Wilcoxon rank-sum test) (Conover, 1999) were the two main significance tests used in this research. The K-S test was used for part of RQ #1 (determining the characteristics of a combinatoric-based ASL performance measure) and both tests were used for RQ #2 (determining how well the results predicted by a combinatoric-based ASL match the results obtained from actual document rankings). The example in Section 3.4 provides more information about the context in which this research employed the Kolmogorov-Smirnov test. The remainder of this section discusses general statistical significance issues in IR performance research.
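As a rough illustration of the kind of comparison the K-S test makes, the following minimal sketch computes the two-sample K-S statistic D, the largest vertical gap between two empirical distribution functions. It is not the dissertation's procedure: the two-sample form, the function name `ks_two_sample_statistic`, and the sample values are all assumptions made for illustration, and the sketch stops at the statistic without computing a p-value.

```python
from bisect import bisect_right


def ks_two_sample_statistic(sample_a, sample_b):
    """Return D = max |F_a(x) - F_b(x)| over all observed values x.

    F_a and F_b are the empirical distribution functions of the samples.
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # Empirical CDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        f_a = bisect_right(a, x) / len(a)  # fraction of sample_a <= x
        f_b = bisect_right(b, x) / len(b)  # fraction of sample_b <= x
        d = max(d, abs(f_a - f_b))
    return d


# Hypothetical values standing in for predicted and observed rank positions.
predicted = [1, 2, 2, 3, 4, 5]
observed = [1, 2, 3, 3, 4, 6]
d_statistic = ks_two_sample_statistic(predicted, observed)
print(d_statistic)
```

A small D indicates that the two empirical distributions track each other closely; judging significance would still require comparing D against the critical values tabulated for the sample sizes involved.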

Van Rijsbergen (1979) states that “[o]nce we have our retrieval effectiveness figures we may wish to establish that the difference in effectiveness under two conditions is statistically significant. It is precisely for this purpose that many statistical tests have been designed. Unfortunately, ... there are no known statistical tests applicable to IR. This may sound like a counsel of defeat but let me hasten to add that it is possible to select a test which violates only a few of the assumptions it makes.”

The use in IR experiments of formal statistical methods such as significance tests has been relatively unusual. This gap has to do in part with the difficulty of establishing the validity of particular tests or even of defining a suitable framework for such tests (IR experimental data is notoriously difficult to pin down in any neat statistical model). ... One problem that needs to be addressed when deciding on a statistical significance test, is what (if any) assumptions can be made about the shapes of the distributions. Many tests depend on strong assumptions about these shapes. Unfortunately, IR is notoriously difficult to pin down in this respect. Of course, the actual distribution will depend on which particular variable is being measured as well as the circumstances of measurement; but many authors have pointed to the difficulty of justifying any parametric assumptions. We are therefore lead towards nonparametric tests (Siegel, 1956). (Robertson, 1990)

An earlier article (Robertson, 1981) discusses some of the difficulties.

Harter and Hert (1997) remark that “[t]he role of significance testing and other statistical issues related to retrieval evaluation have not been treated to any great extent in the retrieval literature. In part this has been because the assumptions underlying statistical treatment (independence, random sampling, assumptions of normality and the like) are rarely met by Cranfield instruments . . . .”

One implication of the three paragraphs above is that it may be hard to use parametric tests (e.g., t-test, F-test, analysis-of-variance tests) for significance testing in information retrieval research. Many of the hypothesis-testing procedures used in science and engineering for parametric statistics are based on the assumption that the random samples are selected from normal populations. Many of these tests are still reliable when there are slight deviations from normality, especially when the sample size is sufficiently large. In general, if parametric tests are used, one or more of the statistical assumptions on which they are based may have been violated and, depending on the degree of violation and the robustness of the test, the p-value may have a sizable amount of error. Walpole (2002) remarks that “this is particularly true for the t-test and the F-test.” Depending on the robustness of the technique and other factors, this may or may not be a problem. If it does turn out to be a problem, then researchers often have to resort to using nonparametric (i.e., distribution-free) statistical methods. The primary downside of nonparametric tests is that “they do not utilize all of the information provided by the sample, and thus a nonparametric test will be less efficient than the corresponding parametric test. Consequently, to achieve the same power, a nonparametric test will require a larger sample size than will the corresponding parametric test” (Walpole, 2002).

Van Rijsbergen states, with respect to significance testing in IR, that “[o]n the face of it non-parametric tests might provide the answer” (van Rijsbergen, 1979). He mentions one particular case where there is a single set of queries that is used in different retrieval environments:

Therefore, without questioning whether we have random samples, it is clear that the sample under condition a is related to the sample under condition b. When in this situation a common test to use has been the Wilcoxon Matched-Pairs test. Unfortunately again some important assumptions are not met. The test is done on the difference D_i = Z_a(Q_i) − Z_b(Q_i), but it is assumed that D_i is continuous and that it is derived from a symmetric distribution, neither of which is normally met in IR data.

It seems therefore that some of the more sophisticated statistical tests are inappropriate. There is, however, one simple test which makes very few assumptions and which can be used providing its limitations are noted. This one is known in the literature as the sign test (Siegel [29], page 68 and Conover [30], page 121). It is applicable in the case of related samples. It makes no assumptions about the form of the underlying distribution. It does, however, assume that the data are derived from a continuous variable and that the Z(Q_i) are statistically independent. These two conditions are unlikely to be met in a retrieval experiment. Nevertheless, given that some of the conditions are not met, it can be used conservatively. (van Rijsbergen, 1979)
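The sign test described in the quotation above can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the function name `sign_test`, the per-query scores, the two-sided alternative, and the convention of dropping tied pairs are choices made here, not details taken from van Rijsbergen or from this dissertation. It counts the signs of the paired differences D_i = Z_a(Q_i) − Z_b(Q_i) and computes an exact p-value from the Binomial(n, 1/2) distribution.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Two-sided exact sign test on paired scores.

    Returns (n_plus, n_minus, p_value). Pairs with a zero difference are
    dropped, which is one common convention (an assumption of this sketch).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("samples must be paired (equal length)")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n_plus = sum(1 for d in diffs if d > 0)
    n_minus = sum(1 for d in diffs if d < 0)
    n = n_plus + n_minus  # number of non-tied pairs
    if n == 0:
        return n_plus, n_minus, 1.0
    k = min(n_plus, n_minus)
    # Under the null hypothesis each sign is + or - with probability 1/2,
    # so the smaller count follows a Binomial(n, 1/2) lower tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_plus, n_minus, min(1.0, 2 * tail)


# Hypothetical effectiveness scores for the same eight queries
# under retrieval conditions a and b.
z_a = [0.50, 0.62, 0.41, 0.70, 0.55, 0.48, 0.66, 0.59]
z_b = [0.45, 0.60, 0.43, 0.61, 0.55, 0.40, 0.58, 0.51]
n_plus, n_minus, p_value = sign_test(z_a, z_b)
print(n_plus, n_minus, p_value)  # 6 1 0.125
```

Because only the signs of the differences are used, the magnitudes of the improvements play no role. That is the source of both the test's weak assumptions and its lower power relative to the Wilcoxon matched-pairs test.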

One particular arena of applicability for nonparametric tests in IR research stems from the fact that much of the evaluation of results in that area involves the comparison of ranked (i.e., ordinal scale) results. Parametric tests are ill-equipped to deal with such results because the analysis of ordinal data is an analysis of ranks. This kind of analysis can, however, be handled very naturally by their nonparametric counterparts. Some examples of, or references to, the use of nonparametric tests in the IR literature are the following: the Kolmogorov-Smirnov one-sample test for goodness-of-fit (Moon, 1993), the Wilcoxon-Mann-Whitney test (Keen, 1992), the sign test (Downie et al., 2005), McNemar’s test (Downie et al., 2005), and the Wilcoxon signed ranks test (Downie et al., 2005). These are just a sampling of the tests that were available for possible use in this dissertation. Generally, the tests that are used in a particular situation depend very much on the characteristics of the situation and the researcher’s goals.
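To make the idea of an "analysis of ranks" concrete, the following minimal sketch computes the Mann-Whitney U statistics for two independent samples by pooling the observations, assigning midranks to ties, and summing the ranks of one sample. The helper names `midranks` and `mann_whitney_u` and the data are hypothetical, and the sketch is not the procedure used in this dissertation; it stops at the statistic and leaves the significance lookup (exact tables or a normal approximation) aside.

```python
def midranks(values):
    """Return 1-based ranks for values, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) computed from the rank sum of sample_a."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    ranks = midranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical ordinal observations from two retrieval methods.
u_a, u_b = mann_whitney_u([1, 3, 5], [2, 4, 6])
print(u_a, u_b)  # 3.0 6.0
```

Here U_a equals the number of pairs in which an observation from the first sample exceeds one from the second, with ties counted as one half. Only the ordering of the data matters, which is why the test suits ordinal results.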
