Legislative framework for the convergence process towards the EHEA


Miller and Thomson [87] reported results in which the minimum computational effort (E) was calculated at generations where very few of the runs had found a solution. They detailed 24 experiments with the artificial ant on the Santa Fe trail; each was run 100 times. When calculating the minimum computational effort statistics, they found that half the values were obtained at generations where fewer than 10 runs had found a solution. They concluded that, were their experiments to be run again, the results were “likely to vary enormously”.

Luke and Panait [85] pointed out the same issue: “changes in the Individuals to be Processed measure and its derived Computational Effort measure are both greatly exaggerated when small changes occur in ideal solution counts [number of hits]”.

Niehaus and Banzhaf [92] showed that as the probability of success decreases, the range of observed values for computational effort increases. Thus the confidence placed in the accuracy of the minimum computational effort should be reduced whenever the probability of success is low.

Unfortunately, some quoted minimum computational effort statistics do not state the number of runs that were successful, or even the generation at which they were obtained (which, along with the population size, would allow the calculation). The omission of this information means that readers are not able to form even a feeling for the confidence they should have in the quoted statistic.
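The sensitivity to the number of hits can be illustrated with a short sketch. It assumes Koza's usual definition, which is not restated in this section: for population size M, cumulative success proportion P(M, i) at generation i and required confidence z (0.99 here, an assumption), I(M, i, z) = M (i + 1) ceil(ln(1 − z) / ln(1 − P(M, i))), and E is the minimum of I over all generations. The function name and the sample data are hypothetical.

```python
import math

def min_computational_effort(success_gens, n_runs, pop_size, max_gen, z=0.99):
    """Return (E, generation, hits at that generation), or None if no run succeeded.

    success_gens: generation of first success for each successful run
                  (failed runs are simply absent from the list).
    Assumes Koza's definition: I(M, i, z) = M * (i + 1) * R(z), where
    R(z) = ceil(ln(1 - z) / ln(1 - P(M, i))).
    """
    best = None
    for gen in range(max_gen + 1):
        hits = sum(1 for g in success_gens if g <= gen)
        if hits == 0:
            continue  # P = 0: the effort is undefined at this generation
        p = hits / n_runs
        # If every run has succeeded, a single run suffices.
        runs_needed = 1 if hits == n_runs else math.ceil(math.log(1 - z) / math.log(1 - p))
        effort = pop_size * (gen + 1) * runs_needed
        if best is None or effort < best[0]:
            best = (effort, gen, hits)
    return best

# Hypothetical experiment: 100 runs, population 500, 50 generations.
two_hits = min_computational_effort([10, 11], 100, 500, 50)
three_hits = min_computational_effort([10, 11, 11], 100, 500, 50)
print(two_hits)    # (1368000, 11, 2)
print(three_hits)  # (912000, 11, 3)
```

In this made-up case a single additional successful run lowers the reported E by a third, which is why a quoted E is hard to interpret without the number of successful runs behind it.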

2.3.5 Number of Runs

Niehaus and Banzhaf [92] also demonstrated the impact of the number of runs. As would be expected, the greater the number of runs, the smaller the range of observed minimum computational efforts. However, they also showed that if the probability of finding a solution is low, a small number of runs can result in an enormous range of observed minimum computational efforts. For 50 runs with a probability of success of 0.2, they showed a range of observed values of more than 5-fold the theoretical minimum computational effort. Doubling the number of runs to 100 resulted in more than halving the observed range. They concluded that “calculating effort based on only 50 [runs] may lead to values quite off the theoretical values, and that even 200 [runs] often are not sufficient.”

2.3.6 Confidence Intervals

As Angeline [8] pointed out, a key problem with Koza’s computational effort statistic is that, as defined, it is a point statistic with no confidence interval. Without a confidence interval, comparisons are inconclusive.

Keijzer et al. [69] used resampling statistics to calculate confidence intervals on two problem domains. They used a bootstrap sample of 10,000 for experiments in which they had executed 100 and 500 runs. However, they did not find the results very useful: “for the Santa-Fe problem . . . the width of the confidence interval (i.e. the uncertainty around the statistic) is nearly as large as the value of the computational effort itself. The confidence intervals clearly show that a straightforward comparison of computational effort, even differing in an order of magnitude, is not possible.”

Methods to generate confidence intervals for minimum computational effort are discussed and studied in chapter 3. The study includes the methods that Keijzer et al. may have used. We also offer methods to produce confidence intervals for the difference and for the ratio of two minimum computational efforts.
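As an illustration of the resampling idea only, the following sketch computes a percentile bootstrap interval for the minimum computational effort. The percentile method, Koza's definition of effort with z = 0.99, the function names and the sample data are all assumptions made for this example; the source does not say which bootstrap variant Keijzer et al. used.

```python
import math
import random

def effort(runs, pop_size, max_gen, z=0.99):
    """Minimum computational effort (Koza's definition assumed).

    runs: one entry per run, the generation of first success or None for a
    failed run. Returns math.inf if no run succeeded.
    """
    n = len(runs)
    counts = [0] * (max_gen + 1)
    for g in runs:
        if g is not None and g <= max_gen:
            counts[g] += 1
    best, hits = math.inf, 0
    for gen, c in enumerate(counts):
        hits += c  # cumulative number of successful runs by this generation
        if hits == 0:
            continue
        r = 1 if hits == n else math.ceil(math.log(1 - z) / math.log(1 - hits / n))
        best = min(best, pop_size * (gen + 1) * r)
    return best

def bootstrap_ci(runs, pop_size, max_gen, alpha=0.05, n_boot=2000, seed=0):
    """Percentile bootstrap interval: resample whole runs with replacement."""
    rng = random.Random(seed)
    n = len(runs)
    stats = sorted(
        effort([rng.choice(runs) for _ in range(n)], pop_size, max_gen)
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Hypothetical experiment: 100 runs, 20 of which succeed between generations 15 and 24.
runs = [15 + (k % 10) for k in range(20)] + [None] * 80
print(effort(runs, 500, 50), bootstrap_ci(runs, 500, 50))
```

Resamples that contain no successful run yield an infinite effort, so with very low success proportions the upper bound itself can be infinite.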

2.4 Mean Fitness

Mean fitness, as a measure of performance, vies with minimum computational effort as the most popular measure in the genetic programming field [85]. It is popular perhaps because the statistical issues surrounding the use of a mean are well understood. The measure and its confidence intervals may well have been introduced to the GP field by Angeline’s 1996 paper [8].

Mean best-of-run fitness is the sum of the fitness scores for the best individual in each run up to a specified generation, divided by the number of runs executed. The statistic is frequently measured at the final generation of each run, but it is possibly most common to see it graphed for every generation. As with success proportion, when the statistic is shown per generation it is important to note that the results are typically not independent across generations.

There are at least three variations of the theme: mean average-of-generation fitness (also called mean population fitness [21]), mean best-of-generation fitness, and mean best-of-run fitness (where all the generations are considered and which we truncate to mean best fitness). Further, the variance of fitness is also a popular measure [13, section 8.4.3].
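The difference between the three variations can be made concrete with a small sketch. It assumes the data are held as fitness[run][generation][individual] and that higher fitness is better (as with a count of hits); the function names and the numbers are hypothetical.

```python
from statistics import mean

def mean_average_of_generation(fitness, gen):
    """Mean over runs of the population's average fitness at generation gen."""
    return mean(mean(run[gen]) for run in fitness)

def mean_best_of_generation(fitness, gen):
    """Mean over runs of the best fitness found in generation gen alone."""
    return mean(max(run[gen]) for run in fitness)

def mean_best_of_run(fitness, gen):
    """Mean over runs of the best fitness found in any generation up to gen."""
    return mean(max(max(pop) for pop in run[: gen + 1]) for run in fitness)

# Two hypothetical runs, three generations, three individuals per generation.
fitness = [
    [[1, 2, 3], [2, 5, 1], [3, 4, 2]],  # run A: its best individual appears at generation 1
    [[0, 1, 2], [1, 2, 3], [2, 4, 6]],  # run B: its best individual appears at generation 2
]
print(mean_average_of_generation(fitness, 2))  # 3.5
print(mean_best_of_generation(fitness, 2))     # 5
print(mean_best_of_run(fitness, 2))            # 5.5
```

Run A shows why the last two measures can differ: its best individual was found at generation 1 and is no longer present at generation 2.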

Mean fitness is also termed “mean number of hits”, as Koza defined “hit” as success in a portion of the given problem [71]. Consequently, mean best fitness may also be termed “mean best number of hits”.

Mean average-of-generation fitness has been shown to converge much more quickly. Christensen showed it to be more than three times as precise as mean best-of-generation fitness [21, page 84]. He also suggested that researchers may have preferred the measure given that “much of population genetics and GA theory refers to the behaviour of the mean fitness of the population”. However, he concluded that “we are usually interested in finding the most successful individuals” and “the behaviour of an auxiliary set of solutions used during the searching process is not normally of great interest” [21]. Finally, he showed that mean average-of-generation fitness is not a good predictor of mean best-of-generation fitness.

2.4.1 Confidence Intervals

Given that all forms of the measure are based on the mean, the same method can be used for the formation of confidence intervals. We will use mean best fitness as an example.

Mean best fitness is normally distributed (from the Central Limit Theorem [24]), but for the small sample sizes available from GP runs, it has been considered more appropriate to use a t-distribution [8, 24]. The parameters of the distribution can be approximated with those observed from the sample. A $1-\alpha$ confidence interval can be obtained with the formula:

$$\mathrm{mean}(f) \pm t_{(n-1,\alpha)} \frac{\mathrm{sd}(f)}{\sqrt{n}}$$

where $\mathrm{mean}(f)$ is the mean best fitness, $t_{(n-1,\alpha)}$ is the t-statistic for the t-distribution with $n-1$ degrees of freedom and a cumulative probability of $1-\alpha$, $\mathrm{sd}(f)$ is the standard deviation of the fitness scores, and $n$ is the number of runs executed.
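A minimal sketch of the calculation follows. To stay within the standard library, the critical value of the t-distribution is passed in as a parameter and would be read from a table. The example assumes the conventional two-sided interval, for which the critical value is taken at a cumulative probability of 1 − α/2 rather than the 1 − α stated in the text (which would correspond to a one-sided bound). The function name and the fitness scores are hypothetical.

```python
import math
from statistics import mean, stdev

def t_confidence_interval(values, t_crit):
    """Return (lower, upper) for mean(values) +/- t_crit * sd(values) / sqrt(n).

    t_crit is the critical value of the t-distribution with n - 1 degrees
    of freedom, supplied by the caller (for example from a table).
    """
    n = len(values)
    centre = mean(values)
    half_width = t_crit * stdev(values) / math.sqrt(n)  # stdev is the sample sd
    return centre - half_width, centre + half_width

# Hypothetical best-of-run fitness scores from n = 10 runs.
best_fitness = [82, 85, 89, 79, 86, 88, 84, 81, 87, 89]
# 2.262 is the tabulated 0.975 quantile of the t-distribution with 9 degrees
# of freedom, giving a two-sided 95% interval.
lower, upper = t_confidence_interval(best_fitness, 2.262)
print(round(lower, 2), round(upper, 2))  # 82.52 87.48
```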

2.4.2 Variations

Luke suggested a method “to calculate the expected maximum best-fitness-of-run for N total runs”; however, he accepted the measure as a point statistic and thus did not offer a method to generate confidence intervals [83, 85].

Finally, just as with success proportion, an interested reader might also like to consider Christensen’s effective mean best fitness which allows comparison of runs with different population sizes [21, chapter 3].

2.5 Mean Generation

When the vast majority of runs complete successfully, mean best fitness is not a useful statistic for differentiating between the performance of two GP variations. In such cases, mean generation has been the preferred statistic [8, 25]. The mean generation is the sum of the generations at which termination occurred (irrespective of success or failure) divided by the number of runs that were executed.

2.5.1 Confidence Intervals

The mean of generation-to-termination follows a normal distribution (from the Central Limit Theorem [24]); however, for sample sizes as small as the typical number of runs in a GP experiment, a t-distribution has been considered more appropriate [8, 24]. An approximate $1-\alpha$ confidence interval can be obtained with the formula:

$$\mathrm{mean}(g) \pm t_{(n-1,\alpha)} \frac{\mathrm{sd}(g)}{\sqrt{n}} \tag{2.9}$$

where $\mathrm{mean}(g)$ is the mean generation, $t_{(n-1,\alpha)}$ is the t-statistic for the t-distribution with $n-1$ degrees of freedom and a cumulative probability of $1-\alpha$, $\mathrm{sd}(g)$ is the standard deviation of the generations-to-termination, and $n$ is the number of runs.

An alternative method to generate confidence intervals for generation-to-termination was used by Clegg et al. [25]. They used a Mann-Whitney U test (also known as a Wilcoxon rank sum test) [24] that effectively ranked the runs by generation. They gave no indication as to why they elected not to use the more traditional normal (or t-distribution) approximation. We do not recommend the use of the rank-sum or U test for this measure (unless the number of runs is very small) as it is statistically less powerful than tests based on the normal or t-distribution [24, page 397].
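For readers unfamiliar with the rank-sum approach, the sketch below shows a generic textbook form of the Mann-Whitney U test applied to generations-to-termination. It is not taken from Clegg et al.: the normal approximation with a tie correction and no continuity correction, the function name and the data are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test; returns (U for sample a, z, p-value).

    Uses average ranks for ties and the normal approximation with a tie
    correction to the variance (no continuity correction).
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = sorted((v, idx) for idx, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[pooled[k][1]] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:  # every observation is tied: no evidence of a difference
        return u1, 0.0, 1.0
    z = (u1 - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, z, p

# Hypothetical generations-to-termination for two GP variations, 8 runs each.
variant_a = [12, 15, 15, 18, 20, 22, 25, 30]
variant_b = [20, 24, 28, 30, 33, 35, 40, 50]
u, z, p = mann_whitney_u(variant_a, variant_b)
print(u, round(z, 2), round(p, 4))
```

Because generations are integers, ties are common, which is why average ranks and the tie correction are included here.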
