Statement of the objectives of the experiment
The list of questions to be answered by the experiment must be made explicit. The interest may lie merely in the comparison of different algorithms, or it may extend to investigating how instance characteristics affect the algorithms. Another possible interest is the analysis of algorithm components and their effect on algorithm performance.
Identification of sources of variance
Sources of variation are all the components that cause observations to have different numerical values. It is possible to distinguish two types of sources: those that are of particular interest to the experimenter, called treatment factors, and those that are of no interest, called nuisance factors.
Treatment factors are high-level components whose influence on the data is to be studied. They can be defined as numerical or categorical variables; the latter take two or more categories that have no intrinsic ordering. The levels of a treatment factor are the specific values or categories actually used for the treatment factor in the experiment. For instance, the tabu length in a Tabu Search algorithm is known to have an influence on its performance; hence, it may be considered as a treatment factor, while the specific numerical values assigned to it define the levels of the treatment. Besides the tabu length, the aspiration criterion also has an influence; we may then want to consider an experiment with two treatment factors, namely the tabu length and the aspiration criterion. The levels of the aspiration criterion are categorical, i.e., “present” and “not-present”. The combinations of levels are usually called treatment combinations, and an experiment involving two or more treatment factors is called a factorial experiment. In a single factor experiment the term treatment is commonly used to denote a level of the treatment factor.
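The Tabu Search example can be made concrete with a short sketch. The specific levels below are hypothetical, chosen only for illustration; the point is that the treatment combinations of a factorial experiment are the Cartesian product of the levels of the treatment factors.

```python
from itertools import product

# Hypothetical levels, chosen only for illustration.
tabu_lengths = [5, 10, 20]                  # numerical treatment factor
aspiration = ["present", "not-present"]     # categorical treatment factor

# A treatment combination pairs one level of every treatment factor.
treatment_combinations = list(product(tabu_lengths, aspiration))

for tabu_length, criterion in treatment_combinations:
    print(f"tabu length = {tabu_length:2d}, aspiration criterion = {criterion}")
```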
Nuisance factors are factors not explicitly controlled in the experiment. In the context of optimisation, the instances of the problem to be solved constitute a nuisance factor as, for example, in the general case no a priori knowledge is given on their optimal solution or on their hardness. For each treatment combination, an instance must be chosen for
3.6 Design and analysis of experiments 45
testing, thus defining thoroughly the experimental unit. Usually, units of a nuisance factor are distributed at random (randomisation principle). For example, each algorithm could receive a different, randomly chosen instance. A different approach, called blocking, is however more appropriate in the context of comparisons of algorithms. Each single instance is identified as a block and the response of each algorithm is observed the same number of times on each block. This gives rise to a complete block design.¹ Rardin and Uzsoy (2001) point out that well-designed computational experiments for algorithms always block completely on instances.
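The difference between randomisation and blocking can be sketched as two ways of building the list of runs. The algorithm and instance names are hypothetical placeholders, and both functions are illustrative helpers rather than part of any library.

```python
import random

algorithms = ["A", "B", "C"]                       # treatments (hypothetical names)
instances = [f"inst-{i}" for i in range(1, 5)]     # blocks (hypothetical names)

def randomised_design(algorithms, instance_pool, runs_each, rng):
    """No blocking: each algorithm receives its own randomly drawn instances."""
    return [(rng.choice(instance_pool), alg)
            for alg in algorithms
            for _ in range(runs_each)]

def complete_block_design(algorithms, instances, runs_per_cell=1):
    """Blocking: every algorithm is observed the same number of times on every instance."""
    return [(inst, alg)
            for inst in instances
            for alg in algorithms
            for _ in range(runs_per_cell)]

blocked = complete_block_design(algorithms, instances)
unblocked = randomised_design(algorithms, instances, runs_each=4, rng=random.Random(1))
```

In the blocked design every (instance, algorithm) pair occurs exactly once, so differences between algorithms can be read within each instance; the randomised design gives no such guarantee.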
More than one nuisance factor may be identified when two or more sources of variation are supposed to have an effect but cannot be controlled. In this case nuisance factors become stratification variables and blocks arise from their combinations. In the analysis of algorithms, two typical stratification variables are size and structure of the instances. The instances remain the blocks of a single blocking factor but they can be “stratified” by the size and structure variables.²
With stochastic algorithms, successive trials can produce quite different outcomes. To avoid being misled by results that are very different from those expected, it is common practice to collect replicates³ of the observations in each experimental unit, that is, several runs are performed with different random seeds. As we discuss next, replicates may not be necessary if it is possible to have many instances. McGeoch (1992) points out that, in order to reduce the variance of results, a common random seed among algorithms should be used for each replicate.
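A minimal sketch of how replicates with common random seeds might be scheduled is given below. The function name, the dictionary layout and the master seed are assumptions made for illustration: one seed is drawn per (instance, replicate) pair and shared by all algorithms.

```python
import random

def replicate_schedule(algorithms, instances, n_replicates, master_seed=0):
    """List the runs of a complete block design with common seeds per replicate."""
    rng = random.Random(master_seed)
    schedule = []
    for inst in instances:
        for r in range(n_replicates):
            seed = rng.randrange(2**31)      # drawn once per (instance, replicate)
            for alg in algorithms:           # the same seed is reused by every algorithm
                schedule.append({"instance": inst, "replicate": r,
                                 "seed": seed, "algorithm": alg})
    return schedule

schedule = replicate_schedule(["A", "B"], ["inst-1", "inst-2"], n_replicates=3)
```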
¹ More generally, in experimental design the objects to which treatments are applied in the experimental unit are referred to as subjects. In a between-subject design, different subjects serve in each of the experimental conditions, while in a within-subject design each subject serves under all the experimental conditions. A within-subject design corresponds to a complete block design. A design involving matched subjects is also treated as a within-subject design. In a matched-subjects design, each subject is paired with one or more other subjects who are similar with respect to one or more characteristics that are highly correlated with the response variable. In this sense, an instance solved under all algorithmic conditions may be considered a matched subject. A complete block design is only possible when the number of homogeneous subjects that constitute the single block, v, is a multiple of the number of treatments, k. Historically, the case with k = v is called randomised complete block design, while the cases with larger v are called general complete random design. We do not need these distinctions here, since the same instance can be reused on all treatments within the single block.
² In the more general context of experimental design, two nuisance factors may be crossed if the experiments to be carried out are too many and running a complete design is too costly. A design with crossed nuisance factors is represented by a Latin square. In a Latin square, the rows and the columns of the matrix correspond to the two nuisance factors, and each cell, which corresponds to a block, receives a treatment level. The number of resulting experiments is considerably reduced, although some good characteristics are maintained. Indeed, the particularity of the Latin square design is that, if the column headings are ignored (that is, if a blocking factor is removed), the design looks like a randomised complete block design; similarly, if the row headings are ignored, the design with columns as blocks looks like a randomised complete block design. Note that generating Latin squares implies solving a graph colouring problem, as we will see in Chapter 4. In a Latin square design, all treatment and nuisance factors must have the same number of levels. A possible generalisation is given by Youden designs (Dean and Voss, 1999).
³ In the language of experimental design, the number of replicates is given by the number of times an original observation is replicated. Hence, it does not include the original observation itself. In this thesis, however, we assume that the number of replicates corresponds to the number of runs, without distinguishing between original observation and replicate.
46 Statistical Methods for the Analysis of Stochastic Optimisers
Definition of the test instances
Rardin and Uzsoy (2001) distinguish four kinds of instances: real world instances, random variants of real instances, online libraries and randomly generated instances. Real world instances are particularly appropriate for real applications, but usually only a few data are available and the test for the efficacy of the proposed algorithm remains restricted to the particular context, with small general relevance. Random variants of real instances try to maintain the main structure of the instance while varying the details. This possibility has, however, received little attention in practice. The vast majority of the literature measures the efficiency of algorithms on benchmark instances from online libraries. Such libraries are an invaluable tool since they allow immediate comparisons with other published works. However, they may have the following pitfalls: (i) they might not be representative of any real application, (ii) they might have been designed for illustrating pathological behaviours, (iii) they might have been chosen because they are particularly suitable for some algorithms, and (iv) they may induce an over-tuning of algorithms on those instances, reducing the effort to study the behaviour of the algorithms under various situations. Completely random instances are the last alternative when not enough data for the problem under study are available. Random instances have several advantages: (i) they can be generated in large numbers, (ii) their characteristics may vary, thus widening the study of algorithm behaviour, (iii) if generators are well documented, the features of the instances may be known, thus allowing relationships between algorithms and problems to be discovered, and (iv) in some cases, optimal solutions or lower bounds may be known from the construction process. On the other hand, they are not representative of any real application and they may induce an over-tuning of algorithms to the particular generation process.
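As an illustration of a documented random generator, the sketch below produces uniform random graphs. The choice of graphs, the parameter names and the grid of sizes and densities are all assumptions of this example and are not prescribed by the text; the aspects it is meant to show are that characteristics are varied systematically and that every instance can be regenerated from its recorded parameters and seed.

```python
import random

def random_graph(n_vertices, density, seed):
    """Edge list of a uniform random graph: each pair (u, v), u < v, is kept with probability density."""
    rng = random.Random(seed)
    return [(u, v)
            for u in range(n_vertices)
            for v in range(u + 1, n_vertices)
            if rng.random() < density]

# Hypothetical grid of instance characteristics, three instances per cell.
sizes = [50, 100]
densities = [0.1, 0.5]
instances = {}
seed = 0
for n in sizes:
    for p in densities:
        for _ in range(3):
            seed += 1                                  # the seed is stored with the instance
            instances[(n, p, seed)] = random_graph(n, p, seed)
```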
Birattari (2004b) investigates the problem of testing stochastic algorithms on problem instances from a machine learning perspective. He points out that the practical relevance of testing algorithms is to forecast the performance on new future instances. However, he observes that in real applications instances are not all equally likely to appear. He then introduces a formal definition for the concept of classes of instances. This definition is based on a probabilistic model, in which each instance is identified by its probability of appearing. Following this model, the performance of an algorithm over a class of instances is described by a stochastic variable determined by the probability distribution of results on a given instance and the probability distribution of the instances in the class. The expected performance of an algorithm on a class of instances is then the average of the performance attained on each single instance, weighted by the probability of that instance occurring.
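In symbols, the weighted average just described can be written as follows. The notation is an assumption of this sketch and need not coincide with that of Birattari (2004b): I denotes the class of instances, P_I(i) the probability that instance i occurs, and P_C(c | i) the distribution of the result c of one run on instance i.

```latex
\mu \;=\; \mathbb{E}[C]
    \;=\; \sum_{i \in I} P_I(i)\, \mathbb{E}[C \mid i]
    \;=\; \sum_{i \in I} P_I(i) \int c \,\mathrm{d}P_C(c \mid i)
```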
Two important consequences arise from this probabilistic model. First, algorithms should be selected and configured on classes of instances which are representative samples of the distribution of instances that the algorithm will be called to solve in practice. The identification of this representative sample of instances is an important aspect for a realistic assessment of algorithm performance and must be taken into account when “engineering” the algorithm. Secondly, the best experimental setting for estimating the expected performance of a given algorithm, on the basis of a given number of experiments N, can be derived analytically. Contrary to a popular belief, there is no trade-off between the number of runs and the number of instances. The setting “one single run on N different instances” guarantees that the variance of the estimate is minimised. Any other experimental setting is less efficient in terms of the reduction of the variance. This result is proven in Birattari (2004a).
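The intuition behind this result can be sketched with the standard variance decomposition of a two-stage sample mean. This is an illustration and not the proof in Birattari (2004a). It assumes that instances are drawn independently from the class and that runs are independent given the instance; the names var_between (variance across instances of the expected result) and var_within (expected variance across runs on one instance) are introduced here for illustration, and their values are hypothetical.

```python
def estimator_variance(n_instances, runs_per_instance, var_between, var_within):
    """Variance of the mean over n_instances instances with runs_per_instance runs each."""
    return (var_between / n_instances
            + var_within / (n_instances * runs_per_instance))

N = 12  # total number of experiments available
settings = [(k, N // k) for k in range(1, N + 1) if N % k == 0]
variances = {(k, m): estimator_variance(k, m, var_between=4.0, var_within=1.0)
             for k, m in settings}

for (k, m), v in variances.items():
    print(f"{k:2d} instances x {m:2d} runs: variance = {v:.4f}")
```

With N fixed, the second term always equals var_within / N, while the first term shrinks as the number of instances grows, so the smallest variance is obtained with N instances and one run each.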
Selection of the combinations of factor levels to test
Depending on the computational power available and the duration of the experiment, a choice must be made between two alternatives: full factorial experiments and fractional factorial experiments. The former alternative consists in running all combinations of algorithm factors on all the instances. Usually, computer experiments are not too costly in terms of
time and this design is more likely to detect statistical differences. The latter alternative is mainly used in engineering (see Montgomery, 2000). It consists in choosing a subset of factor level combinations such that effects are not confounded. We will not consider this design in this thesis.
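For illustration only, since fractional designs are not used further, the two alternatives can be contrasted on three hypothetical two-level factors coded as -1 and +1. The half fraction below uses the textbook defining relation I = ABC, which is an assumption of this sketch; in it the main effects are not confounded with one another, although each is aliased with a two-factor interaction.

```python
from itertools import product
from math import prod

factors = ["A", "B", "C"]     # hypothetical two-level factors, coded -1 / +1

# Full factorial: every combination of levels, 2^3 = 8 runs.
full = list(product([-1, +1], repeat=len(factors)))

# Half fraction with defining relation I = ABC: keep the runs whose product is +1.
half = [run for run in full if prod(run) == +1]

for run in half:
    print(dict(zip(factors, run)))
```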
Refinement of the experimental design
Running a pilot experiment is good practice and may help to better define the experiment. A pilot experiment also helps to identify ceiling or floor effects. Ceiling effects arise when test instances are insufficiently challenging, so that all algorithms solve them equally well, while floor effects arise in the opposite case of instances that are so hard that all algorithms perform equally poorly on them. In these cases, it is very hard to draw any statistically significant conclusion and those instances may be removed. It may also become clear that some levels of factors do not have any impact on the observations or that the values assigned to some levels must be rescaled. In experiments involving algorithms, pilot experiments are also useful to make sure that no bug is present and that all algorithms work under the same conditions. For example, detecting a correlation between the number of runs and the performance of the algorithm may indicate that memory is not correctly deallocated between repeated runs. Finally, from pilot experiments it is possible to estimate the number of replicates that are necessary to attain a desired level of statistical power.
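As a rough sketch of the last point, the number of replicates per algorithm can be approximated with the textbook normal-approximation formula for comparing two means. This is one common approximation and not necessarily the procedure adopted later in the text. The function name is illustrative, sigma stands for a standard deviation estimated from pilot runs, delta is the smallest difference worth detecting, and the pilot values are hypothetical.

```python
from math import ceil
from statistics import NormalDist, stdev

def replicates_needed(sigma, delta, alpha=0.05, power=0.80):
    """Runs per algorithm for a two-sided test to detect a difference delta
    between two means (normal approximation, common standard deviation sigma)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value of the test
    z_beta = z.inv_cdf(power)             # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

pilot_costs = [101.0, 98.5, 103.2, 99.1, 100.4]    # hypothetical pilot observations
sigma_hat = stdev(pilot_costs)
n_runs = replicates_needed(sigma_hat, delta=1.0)
```

Halving the difference to be detected roughly quadruples the required number of runs, which is why the target difference should be fixed before the main experiment.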
Outline of the analysis
From the definition of the factors involved in the experiment it is possible to hypothesise a model relating the response variable to the sources of variation. Commonly a linear relation is assumed. The simplest model arises in a single factor design and is expressed in the form
Response = constant + effect of treatment + error.
A complete block design is instead expressed in the form
Response = constant + effect of block + effect of treatment + error.
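The terms of the complete block model can be estimated from a table of responses by simple averaging, as in the sketch below. The data are hypothetical, with one observation per (instance, algorithm) cell, and the effects are estimated under the usual convention that block effects and treatment effects each sum to zero.

```python
from statistics import mean

# Hypothetical responses: responses[instance][algorithm] = observed cost.
responses = {
    "inst-1": {"A": 10.0, "B": 12.0, "C": 11.0},
    "inst-2": {"A": 20.0, "B": 23.0, "C": 20.0},
    "inst-3": {"A": 15.0, "B": 16.0, "C": 14.0},
}
algorithms = ["A", "B", "C"]

# constant: grand mean of all observations
constant = mean(y for row in responses.values() for y in row.values())
# effect of block: instance mean minus grand mean
block_effect = {inst: mean(row.values()) - constant
                for inst, row in responses.items()}
# effect of treatment: algorithm mean minus grand mean
treatment_effect = {alg: mean(row[alg] for row in responses.values()) - constant
                    for alg in algorithms}
# error: what remains after removing the three terms above
residual = {(inst, alg): responses[inst][alg] - constant
                         - block_effect[inst] - treatment_effect[alg]
            for inst in responses for alg in algorithms}
```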
The models are used for testing the statistical hypotheses. The outcome of this kind of analysis is an indication of whether the treatments or other factors have a statistically significant influence on the response variable. If the interest is at a finer level of analysis, meant to establish differences among specific treatments, then a different procedure for multiple comparisons is undertaken.
The kind of analysis depends on the assumptions concerning the populations under analysis. If there are reasonable elements to assume a certain probability distribution, then a parametric analysis may be appropriate. In contrast, if those assumptions are uncertain, a non-parametric analysis is safer. This choice may have an impact also on the number of replicates to collect in the experiment. For example, collecting many replicates (more than 30) makes a parametric analysis more practicable.
Once the experiment has been designed and the data have been collected, the results must be analysed. In the following sections we define the statistical methods for a correct analysis.