
Remark 1. If it is possible to control more than one factor in the determination of the algorithms, or a stratification of the instances is available, then we may consider a full factorial design. This arises, for example, when it is possible to characterise an algorithm by different parameters, or when instances are grouped, according to size or other features, in sub-groups called strata. The variables which identify the strata are not controlled, but they are supposed to have an effect on the experiment and their identification may be

17 We observed that the confidence intervals produced by permutation tests with the scheme of Algorithm 3.5 on ranks are always slightly larger than those obtained by using Equation 3.19. Apparently, therefore, this algorithm is more conservative.

3.6 Design and analysis of experiments 63

Procedure Compute_all-pairwise_MSD();
    Choose an estimated error $\varepsilon$ related to $B$;
    Start: Choose a positive number MSD;
    forall $(i, j)$ of the $k(k-1)/2$ pairwise comparisons do
        if Compute_pairwise_MSD(i, j) returns false then goto Start;
    end
    Return MSD.

Procedure Compute_pairwise_MSD(i, j);
    1. Subtract $\bar{X}_{hi\cdot} - \bar{X}_{hj\cdot} + \mathrm{MSD}$ from every value of the data group relative to one of the two algorithms, say $i$, obtaining the new vector $X_{hit}(\mathrm{MSD}) = X_{hit} - \bar{X}_{hi\cdot} + \bar{X}_{hj\cdot} - \mathrm{MSD}$, $t = 1, 2, \ldots, r$. This vector is combined with the vector $X_{hjt}$ and constitutes the pool of observations to permute.
    2. Compute the statistic $T(\mathrm{MSD})$ for the observed response: $T_0(\mathrm{MSD}) = \sum_h \left[ \bar{X}_{hi\cdot}(\mathrm{MSD}) - \bar{X}_{hj\cdot}(\mathrm{MSD}) \right]$.
    3. By rearranging the observations $B$ times with synchronised permutations within the $b$ blocks (see Algorithm 3.5), obtain the permutation distribution of the statistic $T^*(\mathrm{MSD}) : \sum_h \left[ \bar{X}^*_{hi\cdot}(\mathrm{MSD}) - \bar{X}^*_{hj\cdot}(\mathrm{MSD}) \right]$.
    4. Return “true” if the condition $\left| \#\{T^*(\mathrm{MSD}) \le T_0(\mathrm{MSD})\}/B - \alpha_{PC}/2 \right| < \varepsilon/2$ is satisfied, else return “false”.

Algorithm 3.6: An algorithm for computing simultaneous MSDs in the case of repeated measures. The confidence level $1 - \alpha_{PC}$ is set equal to $1 - \alpha_{FW}$.

important to reduce the variance of the results. Testing for interaction of strata variables is more meaningful than testing the interaction of single instances because, whenever we are faced with a new instance, we would know to which class it belongs. A full factorial design is typically represented as in Figure 3.7. Let A and B be two factors (or strata) with, respectively, $k_A$ and $k_B$ levels, indexed by i and j, and let b be the number of instances. For each combination of factor levels, r runs are collected. The response $X_{hijt}$ is the result obtained in replicate t by the treatment combination (i, j) on instance h.

The parametric tests for analysing a full factorial design are extensions of those for the two-way design. A series of rules that can be applied to derive each test statistic is reported in Dean and Voss (1999, page 202). However, common computer packages, like the R environment, already implement these methods.

In full factorial designs, interaction plots are particularly helpful for the analysis. They are constructed by drawing, for each level of one factor, a line that joins the average response values at the levels of the other factor, and they are used to gain an idea of how different combinations of factor levels affect the responses. If the lines are not parallel, then a factor level performs differently with different levels of the other factor and an interaction may be hypothesised (see Chapter 4 for examples of such plots, e.g., Figure 4.28).
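The quantities behind an interaction plot can be sketched in a few lines of Python. The fragment below computes the average response for every combination of factor levels and measures how far the resulting lines are from being parallel. The data layout (a dictionary keyed by hypothetical level labels) and the function names `cell_means` and `non_parallelism` are assumptions made for illustration only.

```python
from itertools import combinations
from statistics import mean

def cell_means(responses):
    """Average response for each (level of A, level of B) combination.

    `responses` maps (a_level, b_level) to the list of observed responses;
    this layout is an assumption of the sketch.
    """
    return {cell: mean(values) for cell, values in responses.items()}

def non_parallelism(means):
    """Largest deviation from parallel lines in the interaction plot.

    One line is drawn per level of B across the levels of A. Two lines are
    parallel when their vertical gap is the same at every level of A, so the
    function returns the largest (max gap - min gap) over pairs of lines.
    """
    a_levels = sorted({a for a, _ in means})
    b_levels = sorted({b for _, b in means})
    worst = 0.0
    for b1, b2 in combinations(b_levels, 2):
        gaps = [means[(a, b1)] - means[(a, b2)] for a in a_levels]
        worst = max(worst, max(gaps) - min(gaps))
    return worst

# Hypothetical data: two levels per factor, two runs per combination.
additive = {("a1", "b1"): [1, 3], ("a2", "b1"): [3, 5],
            ("a1", "b2"): [4, 6], ("a2", "b2"): [6, 8]}
crossing = {("a1", "b1"): [1, 3], ("a2", "b1"): [6, 8],
            ("a1", "b2"): [6, 8], ("a2", "b2"): [1, 3]}

print(non_parallelism(cell_means(additive)))  # lines are parallel
print(non_parallelism(cell_means(crossing)))  # lines cross
```

With real data the sample means are never exactly parallel, so a non-zero value only suggests an interaction, which must then be assessed by the tests.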

All-pairwise comparisons depend on the presence or absence of interactions between factors. If there is no interaction, then pairwise comparisons between levels of one single factor may be considered. In the case of interactions, instead, the pairwise comparisons should involve all treatment combinations. This second choice entails a higher number of comparisons and hence requires a stronger adjustment of the α-value.
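The growth in the number of comparisons can be made concrete with a small calculation. The sketch below counts the all-pairwise comparisons in the two situations and applies Bonferroni's rule as one possible adjustment. The numbers of levels and the family-wise level are hypothetical values chosen for illustration.

```python
def n_pairwise(k):
    """Number of all-pairwise comparisons among k treatments: k(k - 1)/2."""
    return k * (k - 1) // 2

def bonferroni_level(alpha_fw, n_comparisons):
    """Per-comparison level under Bonferroni's rule (one possible adjustment)."""
    return alpha_fw / n_comparisons

k_a, k_b = 3, 4      # hypothetical numbers of levels of factors A and B
alpha_fw = 0.05      # hypothetical family-wise level

no_interaction = n_pairwise(k_a)       # compare the levels of factor A only
interaction = n_pairwise(k_a * k_b)    # compare all treatment combinations

print(no_interaction, bonferroni_level(alpha_fw, no_interaction))
print(interaction, bonferroni_level(alpha_fw, interaction))
```

Under these assumed values, comparing the levels of A alone gives 3 comparisons, whereas comparing all 12 treatment combinations gives 66, so each comparison must be tested at a much smaller level.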

Extensions of permutation tests to full factorial designs are currently restricted to only

64 Statistical Methods for the Analysis of Stochastic Optimisers

Block        Factor   Factor
(Instance)   A        B        Observations
1            1        1        $X_{1111}, X_{1112}, \ldots, X_{111r}$
                      ...      ...
                      $k_B$    $X_{11k_B1}, X_{11k_B2}, \ldots, X_{11k_Br}$
2            1        1        $X_{2111}, X_{2112}, \ldots, X_{211r}$
                      ...      ...
                      $k_B$    $X_{21k_B1}, X_{21k_B2}, \ldots, X_{21k_Br}$
...          ...      ...      ...
b            ...      ...      ...
             $k_A$    1        $X_{bk_A11}, X_{bk_A12}, \ldots, X_{bk_A1r}$
                      ...      ...
                      $k_B$    $X_{bk_Ak_B1}, X_{bk_Ak_B2}, \ldots, X_{bk_Ak_Br}$

Figure 3.7.: Full factorial design with complete blocking.

our knowledge, have not been applied to full factorial analysis. In both cases, however, it is possible to consider all combinations of experimental factors and treat them as levels of one single factor, thus reducing the analysis to the known cases. Alternatively, it is possible to focus separately on each single factor and repeat the tests for each factor on the same data. In this case, the family-wise level of confidence must be adjusted for the multiple use of the same data.

Remark 2. So far, we have only considered balanced cases, that is, cases in which the number of observations on an instance is the same for every treatment combination. Extensions to unbalanced cases are possible, although they require the adjustment of the formulas. Yet, collecting the same number of runs per algorithm on all instances should not be a real impediment in experiments with algorithms.

However, an experiment can also be unbalanced on the strata of the blocks. For example, having as strata variable the size of the instance, it may occur that the number of instances available at each size is not the same. If instances come from real life applications or from benchmark libraries, this may be quite likely. This kind of imbalance may lead to a wrong inference because a large number of instances in one class biases the results of the analysis. If this effect is unwanted, because the sample is deemed not to be representative of the application, a possible workaround is given by the bootstrap method. In brief, a balanced design is bootstrapped from the original data preserving factorial combinations, that is, the observations in the ijk-th factorial combination of the bootstrap design are selected with replacement from the ijk-th factorial combination of the original unbalanced design. The resulting balanced design is analysed and the whole procedure is repeated a number of times (indicatively, 100 times should be enough). If the results are consistent with respect to the hypothesis tested in all the instances, then the corresponding conclusions may be drawn. If results are mixed, varying from bootstrap sample to bootstrap sample, then a larger number of instances should be collected in order to draw significant conclusions about interactions.
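The bootstrap procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions: the design is stored as a dictionary from factorial combination to list of observations, the balanced cell size defaults to the smallest original cell, and `analyse` stands for whatever test the experimenter wants to repeat. None of these names or choices come from the text.

```python
import random
from statistics import mean

def bootstrap_balanced(cells, n_per_cell=None, rng=None):
    """Draw one balanced design from an unbalanced one.

    `cells` maps each factorial combination to its list of observations.
    Every cell of the result holds `n_per_cell` observations drawn with
    replacement from the same cell of the original design. Defaulting to
    the smallest original cell size is an assumption of this sketch.
    """
    rng = rng or random.Random()
    if n_per_cell is None:
        n_per_cell = min(len(obs) for obs in cells.values())
    return {combo: [rng.choice(obs) for _ in range(n_per_cell)]
            for combo, obs in cells.items()}

def consistent_over_bootstrap(cells, analyse, n_repeats=100, seed=0):
    """Repeat the analysis on bootstrapped balanced designs.

    `analyse` is a user-supplied function returning the outcome of the test
    of interest (for example True when the null hypothesis is rejected).
    Returns True when every repetition gives the same outcome.
    """
    rng = random.Random(seed)
    outcomes = {analyse(bootstrap_balanced(cells, rng=rng))
                for _ in range(n_repeats)}
    return len(outcomes) == 1

# Hypothetical unbalanced data: (stratum, algorithm) -> observations.
unbalanced = {("small", "alg1"): [10, 11, 12, 13, 14, 15],
              ("small", "alg2"): [20, 21, 22],
              ("large", "alg1"): [30, 31, 32, 33],
              ("large", "alg2"): [40, 41]}

def alg1_better(design):
    """Toy stand-in for a statistical test: alg1 has the lower mean everywhere."""
    return all(mean(design[(s, "alg1")]) < mean(design[(s, "alg2")])
               for s in ("small", "large"))

print(consistent_over_bootstrap(unbalanced, alg1_better))
```

In a real analysis `analyse` would run one of the tests of this chapter on the balanced design; the toy comparison of means is used here only to keep the sketch self-contained.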


tributions of solution quality. In the case of minimisation problems, this distribution is bounded from below, should have a left-end point as close as possible to the unknown minimum, and should be skewed to the right (i.e., the long tail of the distribution is on the right). Therefore, the assumption of normality of data is often not appropriate and non-parametric tests are to be preferred.

The non-normality and asymmetry of the distributions also suggest that descriptive statistics such as the median and other quantiles are more appropriate for summarising, respectively, central tendency and variability. Sample mean and empirical variance are, indeed, efficient estimates only if the underlying distributions are close to normality or for sufficiently large sample sizes. Quantiles are, instead, preferable because they are scale-invariant for many basic transformations (da Fonseca et al., 2001; Sheskin, 2000).
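One reading of this invariance property, taken here as an assumption, is that quantiles commute with monotone increasing transformations of the data, exactly so when the quantile is an observed value (as the median of an odd-sized sample is). The short Python check below illustrates this on a hypothetical right-skewed sample and shows that the mean does not share the property.

```python
import math
from statistics import mean, median

sample = [1, 4, 9, 16, 400]    # hypothetical right-skewed results
transformed = [math.sqrt(x) for x in sample]

# The median commutes with the monotone transformation ...
print(median(transformed), math.sqrt(median(sample)))

# ... whereas the mean does not.
print(mean(transformed), math.sqrt(mean(sample)))
```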

Remark 4. The homoscedasticity of results among different algorithms also seems questionable (consider, for example, the extreme case of a deterministic algorithm compared against a stochastic algorithm on one single instance). In this case, not only the parametric tests presented are inappropriate but also the permutation tests, which require homoscedasticity for the exchangeability of data. Transforming data into ranks removes in this case the influence of outliers, which are unwanted sources of variability within distributions, and rank-based tests therefore appear as the most robust among the methods presented, although they are also based on the assumption of homoscedasticity of the transformed data.

In all-pairwise comparisons with permutation tests, the simultaneous confidence intervals provided by Algorithms 3.4 and 3.6 are very conservative in the case of heteroscedasticity because they are largely affected by the algorithms with higher variance. An alternative procedure to compute the confidence intervals is, however, possible. It consists in using the procedure Compute_pairwise_MSD with a confidence level 1 − α adjusted by Bonferroni's rule on each pairwise comparison, without then checking that the computed interval also satisfies the other pairs through the iterative procedure Compute_all-pairwise_MSD. The advantage of this procedure is that the negative bias of an algorithm with an uncommonly large observed variance remains confined to the comparisons involving that algorithm, without affecting the comparisons in which it is not involved.
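The per-comparison check used by this alternative procedure can be sketched in Python as below. The sketch follows the four steps of Compute_pairwise_MSD, with one loudly stated simplification: observations are reshuffled independently within each block, which replaces the synchronised permutations of Algorithm 3.5 and is not the scheme of the text. The function names, default values and data are hypothetical.

```python
import random
from statistics import mean

def tail_proportion(x_i, x_j, msd, n_perm=2000, seed=0):
    """Proportion of permutation statistics T*(MSD) not exceeding T0(MSD).

    x_i and x_j hold the results of algorithms i and j as lists of blocks,
    one block (a list of r runs) per instance.
    """
    rng = random.Random(seed)
    # Step 1: shift the runs of algorithm i so that, in every block, their
    # mean equals the mean of algorithm j minus MSD.
    shifted = [[x - mean(block_i) + mean(block_j) - msd for x in block_i]
               for block_i, block_j in zip(x_i, x_j)]
    # Step 2: observed statistic, summed over the blocks.
    t0 = sum(mean(s) - mean(block_j) for s, block_j in zip(shifted, x_j))
    # Step 3: permutation distribution. ASSUMPTION: observations are
    # reshuffled independently within each block; this replaces the
    # synchronised permutations of Algorithm 3.5.
    hits = 0
    for _ in range(n_perm):
        t_star = 0.0
        for s, block_j in zip(shifted, x_j):
            pool = s + block_j
            rng.shuffle(pool)
            r = len(s)
            t_star += mean(pool[:r]) - mean(pool[r:])
        hits += t_star <= t0
    return hits / n_perm

def compute_pairwise_msd(x_i, x_j, msd, alpha_pc, eps=0.01, **kwargs):
    """Step 4: accept MSD when the tail proportion is within eps/2 of alpha_pc/2."""
    return abs(tail_proportion(x_i, x_j, msd, **kwargs) - alpha_pc / 2) < eps / 2

# Hypothetical data: 3 instances (blocks), 5 runs per algorithm and instance.
alg_i = [[10.1, 11.3, 9.7, 12.4, 10.9],
         [20.2, 19.5, 21.7, 20.9, 18.8],
         [15.3, 14.1, 16.6, 15.9, 14.8]]
alg_j = [[11.2, 12.9, 10.4, 13.1, 11.8],
         [21.4, 20.3, 22.6, 21.1, 19.9],
         [16.7, 15.2, 17.3, 16.1, 15.6]]

k = 4                                   # hypothetical number of algorithms
alpha_pc = 0.05 / (k * (k - 1) / 2)     # Bonferroni's rule on each comparison
print(compute_pairwise_msd(alg_i, alg_j, msd=1.0, alpha_pc=alpha_pc))
```

The text leaves open how candidate values of MSD are chosen; a caller would typically try several values and keep one for which the function returns true.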

Unfortunately, different MSDs for each pairwise comparison exclude the use of the graphical representation of results shown in Figure 3.3. Hsu (1996) suggests an alternative representation, which allows confidence intervals of different length to be included and maintains the unification of both practical and statistical information in a single plot. This plot is shown in Figure 3.8. It consists of a two dimensional space in which a 45° line represents the points satisfying $\bar{X}_i = \bar{X}_j$. At each point $(\bar{X}_i, \bar{X}_j)$, representing the sample means of two algorithms, a segment is drawn of slope −1, centred in $(\bar{X}_i, \bar{X}_j)$, and of length $\mathrm{MSD}/\sqrt{2}$. Statistical inference is derived by checking whether the line segment crosses the 45° line. The practical assessment of mean differences is preserved, instead, on the x-axis or y-axis. All the $k(k-1)/2$ confidence intervals can be represented by drawing only the segments with $\bar{X}_i > \bar{X}_j$, i.e., only intervals below the 45° line.
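The geometry of this plot can be checked with a short Python sketch. It rests on an interpretation that should be read as an assumption: the quantity MSD/√2 is taken as the extent of the segment on each side of its centre, which is the reading under which the segment crosses the 45° line exactly when the two sample means differ by less than MSD. The function names and the numerical values are hypothetical.

```python
def mean_mean_segment(mean_i, mean_j, msd):
    """End points of the slope -1 segment centred at (mean_i, mean_j).

    ASSUMPTION: the segment extends MSD/sqrt(2) on each side of its centre,
    that is, MSD/2 along each axis, so that its end points correspond to the
    limits (mean_i - mean_j) -/+ MSD of the interval on the difference.
    """
    half = msd / 2
    return (mean_i - half, mean_j + half), (mean_i + half, mean_j - half)

def crosses_diagonal(segment):
    """True when the two end points lie on opposite sides of the 45-degree line."""
    (x1, y1), (x2, y2) = segment
    return (x1 - y1) * (x2 - y2) < 0

# Means differing by 3 with MSD = 2: the segment stays below the diagonal.
print(crosses_diagonal(mean_mean_segment(10.0, 7.0, 2.0)))

# Means differing by 1 with MSD = 2: the segment crosses the diagonal.
print(crosses_diagonal(mean_mean_segment(10.0, 9.0, 2.0)))
```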

Remark 5. Clearly, the set of tests presented is not exhaustive. There may be other test statistics that are more appropriate in particular situations. Two tests worth mentioning compare two distributions in distinct ways.

• The Binomial signed test counts the number of positive and negative differences and uses the binomial distribution to test if the number of positive (or negative) differences is significantly different from an equal distribution. This test is appropriate

Figure 3.8.: Graphical representation of confidence intervals in a two dimensional space. This mean-mean scatter plot, introduced by Hsu (1996), allows the representation of confidence intervals of different width.

to compare matched results of two algorithms when the only thing that matters is which one wins.

• The Kolmogorov-Smirnov two-sample test compares the empirical cumulative distribution functions (CDFs) of two samples. This test is able to detect more kinds of differences than the tests previously introduced, because it is not based on mean values only. The test computes the maximal difference between the two curves, and exact or approximate quantiles for the distribution of this statistic are derived by permutation methods. Besides testing two algorithms, a variant of this test is also used to assess whether a sample comes from a known theoretical distribution.
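Both tests are simple enough to sketch with the Python standard library. The fragment below is an illustration under stated assumptions rather than a reference implementation: the sign test is two-sided and drops ties, and the quantiles of the Kolmogorov-Smirnov statistic are approximated by random permutations of the pooled sample. Function names and data are hypothetical.

```python
import random
from bisect import bisect_right
from math import comb

def sign_test_p_value(results_a, results_b):
    """Two-sided binomial sign test on matched results (ties are dropped)."""
    diffs = [a - b for a, b in zip(results_a, results_b) if a != b]
    n = len(diffs)
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # Probability of k or fewer successes under Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def ks_statistic(sample_1, sample_2):
    """Maximal vertical distance between the two empirical CDFs."""
    s1, s2 = sorted(sample_1), sorted(sample_2)
    return max(abs(bisect_right(s1, x) / len(s1) - bisect_right(s2, x) / len(s2))
               for x in s1 + s2)

def ks_permutation_p_value(sample_1, sample_2, n_perm=2000, seed=0):
    """Share of random relabellings whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(sample_1, sample_2)
    pool = list(sample_1) + list(sample_2)
    n1 = len(sample_1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        hits += ks_statistic(pool[:n1], pool[n1:]) >= observed
    return hits / n_perm

# Hypothetical matched results: algorithm a is lower on 9 instances out of 10.
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [2, 3, 4, 5, 6, 7, 8, 9, 10, 9]
print(sign_test_p_value(a, b))

# Hypothetical samples with completely separated empirical CDFs.
print(ks_statistic([1, 2, 3], [4, 5, 6]))
print(ks_permutation_p_value([1, 2, 3], [4, 5, 6]))
```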

In the rest of the thesis we conform to the experimental designs and the analysis presented here. In particular, we will design experiments according to design B and design C. The choice between the two will depend on the number of instances that are available. If the number is high, as when a random instance generator is available, we will compare algorithms according to the scheme “one single run on various instances”; if the number of instances is not large, we will collect several runs per algorithm on each instance according to the scheme “several runs on various instances”.
