in these tasks. Ion has 34 features; this is the largest number of features among the tasks and represents a very large search space of classifiers for GP, given that the maximum program depth of GP classifiers is restricted to 8. Adjusting the evolutionary parameters to allow GP to explore this search space more effectively, such as increasing the maximum GP program depth (e.g., to 12) or the population size (e.g., to 1000), should improve GP performance.

To test this hypothesis, i.e., whether the average AUC performance of GP can be further improved using a maximum program depth of 12 and a population size of 1000, the experiments are repeated for Ion and Spt using these new parameters. Only the GP fitness functions Corr and AucF are considered (for Ion and Spt, respectively) with the new evolutionary parameters, as these fitness functions have the highest average AUC on these tasks in Table 4.7.

New GP Results for Ion and Spt

For the new GP experiments, the average AUC (± standard deviation) for Corr on Ion is 0.91 (± 0.03), and the best AUC is 0.98 (over 50 runs). Likewise, the average AUC for AucF on Spt is 0.82 (± 0.02), and the best AUC is 0.84 (over 50 runs). This shows that, on average, the AUC for GP on Ion is as good as NB and only slightly lower than SVM. Similarly, the best AUC achieved by GP on Ion (over 50 runs) is substantially higher than both NB and SVM. Likewise, the average AUC for GP on Spt is only slightly lower than NB (and much better than SVM), but the best AUC achieved by GP (over 50 runs) is much better than both methods.

These new GP results confirm the hypothesis discussed above that the original, relatively poor AUC results in Table 4.7 are due more to the complexity of these problems than to the class imbalance factor. When the search space is enlarged in GP (by updating the evolutionary parameters), performance improves: the newly evolved classifiers are more competitive with NB and SVM in terms of AUC on these two tasks.

4.6 Results for Weighted-Average Fitness Function

This section investigates whether different configurations of the weighted-average GP fitness function Wave (Eq. 4.4) significantly affect the AUC of the evolved solutions. In other words, we check whether different configurations affect how well the class outputs are separated with respect to each other in the evolved solutions. Table 4.7 shows the average AUC of the evolved GP classifiers using the seven different weighting configurations in Wave on the tasks. The weighting configurations for W are between 0.2 and 0.8 at intervals of 0.1. Recall that in Wave, W specifies the weight for the minority class accuracy and 1−W for the majority class accuracy.

Table 4.7: Average (± standard deviation) AUC for weighted-average fitness function Ave (Eq. 4.2) on the tasks. SR denotes the significance rank (s-rank) for a weight value and Beats denotes other s-rank(s) with a (statistically) significantly poorer AUC.

Weight   Ion                     Spt                     Ped
(W)      AUC        SR  Beats    AUC        SR  Beats    AUC        SR  Beats
0.2      0.83±0.05  1   {2-3}    0.70±0.09  3   -        0.80±0.09  3   -
0.3      0.82±0.05  1   {2-3}    0.73±0.05  1   {4-5}    0.86±0.06  1   {3}
0.4      0.82±0.05  1   {2-3}    0.72±0.05  2   {5}      0.86±0.05  1   {3}
0.5      0.80±0.06  1   {2-3}    0.71±0.05  3   -        0.87±0.04  1   {3}
0.6      0.80±0.05  1   {2-3}    0.69±0.06  4   -        0.86±0.03  1   {3}
0.7      0.76±0.07  2   {3}      0.69±0.06  4   -        0.85±0.05  2   -
0.8      0.71±0.09  3   -        0.68±0.06  5   -        0.82±0.05  3   -
p-value  1.7×10^-26              0.0013                  1.1×10^-9

Weight   Yst1                    Yst2                    Bal
(W)      AUC        SR  Beats    AUC        SR  Beats    AUC        SR  Beats
0.2      0.76±0.07  3   -        0.92±0.05  1   -        0.61±0.13  3   -
0.3      0.78±0.05  2   -        0.92±0.05  1   -        0.72±0.14  1   {3}
0.4      0.80±0.04  1   {3}      0.93±0.04  1   -        0.72±0.13  1   {3}
0.5      0.79±0.03  2   -        0.93±0.04  1   -        0.71±0.15  1   {3}
0.6      0.78±0.05  2   -        0.92±0.04  1   -        0.69±0.14  2   -
0.7      0.77±0.06  3   -        0.93±0.04  1   -        0.67±0.14  2   -
0.8      0.73±0.10  3   -        0.92±0.05  1   -        0.67±0.14  2   -
p-value  3.2×10^-3               0.49                    5.5×10^-5
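As a rough illustration of the weighting described above, the sketch below computes a weighted average of the two class accuracies, with W applied to the minority class and 1−W to the majority class. The function name `weighted_average_fitness`, its count-based arguments and the example counts are assumptions made for this illustration; Eq. 4.4 is not reproduced in this section, so the exact form used in the experiments may differ.

```python
def weighted_average_fitness(min_hits, min_total, maj_hits, maj_total, w=0.5):
    """Weighted average of minority and majority class accuracy.

    Hypothetical helper: `w` weights the minority class accuracy and
    `1 - w` weights the majority class accuracy, as described in the text.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if min_total <= 0 or maj_total <= 0:
        raise ValueError("both classes need at least one example")
    min_acc = min_hits / min_total
    maj_acc = maj_hits / maj_total
    return w * min_acc + (1.0 - w) * maj_acc


# The seven weighting configurations from the text: 0.2 to 0.8 in steps of 0.1.
WEIGHTS = [round(0.2 + 0.1 * i, 1) for i in range(7)]

if __name__ == "__main__":
    # Made-up counts, for illustration only: 30 of 40 minority examples and
    # 150 of 160 majority examples classified correctly.
    for w in WEIGHTS:
        print(w, round(weighted_average_fitness(30, 40, 150, 160, w), 4))
```

With W = 0.5 both class accuracies contribute equally; moving W towards 0.2 or 0.8 rewards solutions that are accurate on only one of the two classes, which is the bias examined in the analysis of the extreme weights.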

Similar to the previous experimental results, an ANOVA F-test is first used to statistically test the null hypothesis (i.e., no difference in AUC for the different W values over 50 runs) at a 5% level of significance. The p-values from the F-test, shown in Table 4.7 for each task, are lower than 0.05 in all tasks except Yst2 (where p is 0.49). The null hypothesis is therefore rejected (at a 5% level of significance) in all tasks except Yst2. In Yst2, all weighting configurations show very similar AUC results (that are not statistically significantly different).

Tukey’s HSD test [166] is also used as the multiple comparisons test, to find the statistically significant differences between AUC values in the tasks (except Yst2). An s-rank is also assigned to each W value in Table 4.7 to summarise which weighting configurations have statistically significantly better AUC values than other configurations. In Table 4.7, SR denotes the s-rank for a given W configuration and Beats denotes other s-rank(s) with a (statistically) significantly poorer AUC. For example, the first line in Table 4.7 for Ion (for “Stat. Test”) shows that W = 0.2 achieves the best s-rank of 1, and that this AUC is significantly better than s-ranks 2 and 3 (W values of 0.7 and 0.8, respectively).
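The F-test described above compares the variation between the W configurations with the variation within each configuration. A minimal sketch of the F statistic for a one-way ANOVA is given below, using only the Python standard library. The helper name `one_way_anova_f` and the toy data are assumptions for illustration and are not the AUC values from the experiments.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of samples, one per W configuration (e.g. the AUC
    values from repeated GP runs). Hypothetical helper for illustration.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and n_total > k")
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Toy data, not taken from the experiments.
    print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

The p-value is the upper tail of the F distribution with these two degrees of freedom. If SciPy is available, `scipy.stats.f_oneway` returns the statistic and p-value directly, and recent SciPy versions also provide `scipy.stats.tukey_hsd` for the follow-up pairwise comparison.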

4.6.1 Analysis of Results

According to Table 4.7, no configuration of W where W ≠ 0.5 shows a statistically significantly better AUC than an equal weighting (W = 0.5) on any of the tasks. In fact, the W configuration with the highest average AUC (on a task-by-task basis) is still not as good as the AUC-based and new fitness functions from Table 4.4 in any of the tasks. This suggests that tweaking the weighting configuration in Wave does not significantly improve the AUC of the evolved solutions on these tasks.

As expected, “extreme” weighting configurations (such as W of 0.2 or 0.8) generally show the worst AUC results. In four tasks (Spt, Ped, Yst1 and Bal), the W configuration with the highest average AUC is statistically significantly better than at least one of these two extreme W values. This is not surprising, as extreme weights in Wave favour biased solutions which have high accuracy rates on one class alone. Only in Ion does the extreme W value of 0.2 show good AUC results, most likely due to the relatively low level of class imbalance in Ion. Weighting configurations slightly favouring majority class accuracy over minority class accuracy (0.3 ≤ W < 0.5) produce slightly better AUC than the opposite case (W > 0.5), but this difference is only statistically significant in one task (Spt).

These results show that the choice of weights in Wave will not significantly affect the AUC in the evolved solutions unless “extreme” weights are selected.

However, while the AUC of the evolved solutions is not statistically significantly different (except for extreme weights), the main advantage of Wave is the frontier produced by the evolved solutions, as shown in Figure 4.5. The panels of this figure show the performance of the evolved solutions on the minority and majority classes (on the test set) when these solutions are evaluated using zero as the class threshold, for three tasks (Ped, Yst2 and Bal). These performances represent the average performance over 50 GP runs for the different W configurations, and the vertical and horizontal axes correspond to the minority and majority class accuracy, respectively. The remaining tasks are omitted due to space constraints, but they show very similar frontiers to Ped and Yst2 (in fact, the Wave frontiers for these tasks are shown and discussed in more detail in the next chapter).

[Figure 4.5 panels: Ped, Yst2 and Bal. Each panel plots Minority Accuracy (vertical axis) against Majority Accuracy (horizontal axis), with one point per W value from 0.2 to 0.8.]

Figure 4.5: Minority and majority class accuracies (on the test sets) for weighting coefficient W in fitness function Wave (axis scopes are different in each figure).
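Evaluating a solution with zero as the class threshold, as described above, can be sketched as follows. The sign convention is an assumption made for this example (an output at or above zero is read as a minority-class prediction), as are the function name `class_accuracies` and the 0/1 label encoding; the source does not state which side of the threshold maps to which class.

```python
def class_accuracies(outputs, labels, threshold=0.0):
    """Return (minority_accuracy, majority_accuracy) as percentages.

    `outputs` are raw numeric program outputs; `labels` use 1 for the
    minority class and 0 for the majority class. ASSUMPTION: an output at
    or above the threshold is read as a minority-class prediction.
    """
    min_hits = min_total = maj_hits = maj_total = 0
    for out, label in zip(outputs, labels):
        predicted_minority = out >= threshold
        if label == 1:
            min_total += 1
            min_hits += predicted_minority
        else:
            maj_total += 1
            maj_hits += not predicted_minority
    if min_total == 0 or maj_total == 0:
        raise ValueError("both classes must be present")
    return 100.0 * min_hits / min_total, 100.0 * maj_hits / maj_total


if __name__ == "__main__":
    # Made-up outputs and labels, for illustration only.
    print(class_accuracies([1.0, -1.0, -2.0, -0.5, 0.3, -4.0], [1, 1, 0, 0, 0, 0]))
```

Averaging these two percentages over the runs for one W value would give one point on a frontier of the kind shown in Figure 4.5.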

However, a major limitation of Wave is that multiple GP runs are required (each with a different W configuration in the fitness function) to produce the frontiers shown in Figure 4.5. This can be a time-consuming process, e.g., each frontier in Figure 4.5 needed a total of 350 GP runs (seven W configurations, assuming 50 GP runs are used for each W configuration). Another limitation of Wave is that there is no guarantee that the points along the frontier (i.e., for the different W values) will be uniformly “spread out” along the two objectives, as seen for Bal in Figure 4.5. Here, W values between 0.5 and 0.8 produce points along the frontier that are very similar in objective space.
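Both limitations can be made concrete with a small sketch: one helper counts the GP runs needed for a frontier, and another measures how far apart consecutive frontier points lie in objective space. The helper names `run_budget` and `frontier_gaps` and the example points are hypothetical and chosen only for illustration; they are not values read from Figure 4.5.

```python
import math


def run_budget(weights, runs_per_weight):
    """Total number of GP runs needed to trace one frontier."""
    return len(weights) * runs_per_weight


def frontier_gaps(points):
    """Euclidean distance between consecutive frontier points.

    `points` is a list of (majority_accuracy, minority_accuracy) pairs, one
    per W value, in order of increasing W. Near-zero gaps mean that several
    W values landed on almost the same trade-off.
    """
    return [math.dist(a, b) for a, b in zip(points, points[1:])]


if __name__ == "__main__":
    weights = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    print(run_budget(weights, 50))  # seven W values with 50 runs each
    # Made-up points, for illustration only.
    print(frontier_gaps([(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)]))
```

A frontier with evenly sized gaps covers the trade-off between the two class accuracies uniformly, whereas a run of very small gaps signals that several W values were spent on nearly identical solutions.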
