We compare AUC, minority-class recall and minority-class precision for each pair of learning
models, and report their win-tie-lose counts based on Student's t-test at the 95% confidence
level over the 15 data sets. For a clearer view, we further calculate, for each model, the
average percentages of win-tie-lose cases against the other models from these paired
comparisons. The full results of the paired comparisons can be found in Appendix C. AUC
describes the general ability of a learning algorithm to separate the minority and majority
classes. Minority-class recall and precision show which aspect of performance on the
minority class is improved or reduced. Separate discussions are given to tree-based and
NN-based models.
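The win-tie-lose bookkeeping described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the procedure used to produce the reported tables: it assumes a paired t statistic over per-run scores, and the names `paired_t`, `win_tie_lose`, `t_crit` and the toy scores are all hypothetical. The critical value is passed in by the caller (2.262 is the two-sided 5% value for 9 degrees of freedom, i.e. 10 runs).

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Paired t statistic for two equal-length lists of per-run scores."""
    diffs = [x - y for x, y in zip(a, b)]
    m, sd = mean(diffs), stdev(diffs)
    if sd == 0.0:
        # No variance in the differences: any non-zero mean is decisive.
        if m == 0.0:
            return 0.0
        return float("inf") if m > 0 else float("-inf")
    return m / (sd / sqrt(len(diffs)))


def win_tie_lose(scores, t_crit):
    """scores[data_set][method] is a list of per-run scores (higher is better).

    Returns {method: (win %, tie %, lose %)} over all comparisons of that
    method against every other method on every data set.
    """
    counts = {}
    for per_method in scores.values():
        for a, b in combinations(per_method, 2):
            ca = counts.setdefault(a, [0, 0, 0])
            cb = counts.setdefault(b, [0, 0, 0])
            t = paired_t(per_method[a], per_method[b])
            if t > t_crit:      # a is significantly better than b
                ca[0] += 1
                cb[2] += 1
            elif t < -t_crit:   # b is significantly better than a
                ca[2] += 1
                cb[0] += 1
            else:               # no significant difference
                ca[1] += 1
                cb[1] += 1
    return {m: tuple(100.0 * c / sum(cnt) for c in cnt)
            for m, cnt in counts.items()}


# Hypothetical toy input: one data set, three methods, 10 runs each.
base = [0.90, 0.91, 0.92, 0.90, 0.91, 0.92, 0.90, 0.91, 0.92, 0.90]
toy = {"d1": {
    "A": base,
    "B": [x - (0.10 if i % 2 == 0 else 0.11) for i, x in enumerate(base)],
    "C": [x + (0.01 if i % 2 == 0 else -0.01) for i, x in enumerate(base)],
}}
result = win_tie_lose(toy, t_crit=2.262)
```

With 15 methods and 15 data sets, each method would accumulate 15 × 14 = 210 comparisons in this scheme, which matches the totals quoted for the tree-based models.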
Tree-Based Models
Table 5.8 shows the average percentages of win-tie-lose cases of the tree-based models over
a total of 210 paired comparisons (15 data sets × 14 opposing learning methods). From the
table, we make the following observations:
Table 5.8: Percentages of win(w)-tie(t)-lose(l) cases (in %) for each experimental method in 210 pairs of comparison using tree base learners, including AUC, minority-class recall and minority-class precision. The results are based on Student's t-test with a confidence level of 0.95. The highest percentage of win for each measure is marked with an asterisk (*).

Method   AUC                       Minority-class recall     Minority-class precision
         w       t       l         w       t       l         w       t       l
OrSg     12      9       79        16      31      53        39      35      26
OvSg     6       8       86        39      22      39        32      29      39
UnSg     18      12      70        74      20      6         12      14      74
OrAda    35      40      25        14      33      53        51      40      9
OvAda    33      43      24        30      29      41        43      38      19
UnAda    44      35      21        78      17      5         17      18      65
OrNC2    52      36      12        8       24      68        60*     30      10
OrNC9    42      39      19        2       7       91        28      20      52
OvNC2    54      34      12        28      25      47        51      40      9
OvNC9    61*     30      9         67      11      22        39      23      38
UnNC2    49      34      17        81*     19      0         19      20      61
UnNC9    36      26      38        80      19      1         16      17      67
SMB      37      43      20        47      17      36        48      33      19
JSB      20      14      66        10      26      64        38      28      34
RAB      32      34      34        12      30      58        44      41      15
1) OvNC9 is superior to the other learning models in terms of AUC with the highest
win rate of 61% and the lowest lose rate of 9%.
To understand how OvNC9 achieves a high AUC score, we examine recall and precision
for the minority class. It is not surprising that models using undersampling always produce
the highest recall values, since part of the majority-class data is discarded. However, their
classification precision is sacrificed greatly: their win rates in precision are all lower than
20%. It means that more majority-class examples are misclassified, even though more
minority-class examples are labelled correctly. Among the models without undersampling,
OvNC9 attains the highest win rate (67%) in minority-class recall without losing too much
precision. Thus, it shows the best AUC. In other words, it balances the between-class
performance very well, with an improved recognition rate of minority-class examples.
2) Regarding the other methods, OvSg presents the worst AUC with the lowest win rate (6%),
which implies that oversampling can reduce the performance of a single tree. JSB does not
perform well either: it is only better than the single tree models and worse than the
others. SMB is better than RAB and JSB, but still worse than the OvNC models.
3) Regarding the training strategy of resampling, we can see that random oversampling
does not really improve the AUC of the single tree or the conventional AdaBoost.
However, AdaBoost.NC integrated with oversampling improves AUC and minority-class
recall greatly, especially when λ is high. This tallies with our results on the artificial data.
It suggests that this combination discriminates the minority class from the majority
class better by aggressively encouraging diversity on the minority class, which lessens the
overfitting problem. Random undersampling does not work well with AdaBoost.NC, because
undersampling by itself already tackles overfitting by introducing some randomness
into the data space. Undersampling tends to cause over-generalization, resulting in low
minority-class precision.
4) For the ten defect prediction data sets in our experiment, it is useful to know
that an average defect detection rate of about 60% was reported at the 2002 IEEE Metrics
panel (Shull et al., 2002) in the software engineering community. A more recent work
(Menzies et al., 2007) reached an average recall of 71% over some PROMISE data sets with
data cleaning and feature selection applied. In our experiments, OvNC9 produces recall for
the defect class higher than 60% in 5 out of 10 data sets, 2 of which exceed 71%. For
space reasons, the raw outputs are omitted here.
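The trade-off between minority-class recall and precision discussed in observation 1) can be made concrete with a small calculation. The sketch below is illustrative only: the function name `recall_precision` and the confusion counts are hypothetical and are not taken from the experiments. It contrasts a conservative classifier with one that predicts the minority class aggressively, as models trained on undersampled data tend to do.

```python
def recall_precision(tp, fp, fn):
    """Minority-class recall and precision from confusion counts.

    tp: minority examples labelled as minority
    fp: majority examples wrongly labelled as minority
    fn: minority examples wrongly labelled as majority
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical test set with 50 minority and 950 majority examples.
conservative = recall_precision(tp=20, fp=10, fn=30)   # few minority predictions
aggressive = recall_precision(tp=45, fp=135, fn=5)     # many minority predictions

print("conservative: recall=%.2f precision=%.2f" % conservative)
print("aggressive:   recall=%.2f precision=%.2f" % aggressive)
```

In this made-up case the aggressive classifier finds far more minority examples, but most of its minority predictions are actually majority examples, which is the pattern of high recall and low precision noted for the undersampling models.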
In summary, when the decision tree is used as the base learner, AdaBoost.NC using
random oversampling is more effective than the other “resampling+ensemble/single” learning
models. It shows better separability between the minority and majority classes and overfits
the minority class less without modifying training data. A large λ is preferable.
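For readers unfamiliar with the two resampling strategies compared above, a minimal sketch is given below. It assumes that both strategies rebalance the classes to a 1:1 ratio and that the majority class is at least as large as the minority class; the target ratio is an assumption made for this illustration, and the function names are hypothetical.

```python
import random


def random_oversample(minority, majority, rng):
    """Duplicate randomly chosen minority examples until the classes balance."""
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return minority + extra, majority


def random_undersample(minority, majority, rng):
    """Keep a random subset of the majority class, the size of the minority."""
    return minority, rng.sample(majority, len(minority))


# Hypothetical data: 5 minority and 20 majority examples (integers as stand-ins).
rng = random.Random(0)
minority = list(range(5))
majority = list(range(100, 120))

over_min, over_maj = random_oversample(minority, majority, rng)
under_min, under_maj = random_undersample(minority, majority, rng)
```

Oversampling keeps every original example and only adds copies, whereas undersampling discards majority examples, which is why the latter can lose data information.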
NN-Based Models
According to Table 5.9, we make the following observations for the NN-based models over a
total of 300 paired comparisons (15 data sets × 20 opposing learning methods):
Table 5.9: Percentages of win(w)-tie(t)-lose(l) cases (in %) for each experimental method in 300 pairs of comparison using NN base learners, including AUC, minority-class recall and minority-class precision. The results are based on Student's t-test with a confidence level of 0.95. The highest percentage of win for each measure is marked with an asterisk (*).

Method   AUC                       Minority-class recall     Minority-class precision
         w       t       l         w       t       l         w       t       l
OrSg     43      44      24        9       18      73        45      25      30
OvSg     39      34      27        56      19      25        43      38      29
UnSg     43      36      21        70      19      11        25      26      49
OrAda    40      35      25        24      23      53        58*     34      8
OvAda    41      38      21        39      23      38        54      33      13
UnAda    39      42      19        76      18      6         24      24      52
OrNC2    38      33      29        26      17      57        49      37      14
OrNC9    28      39      33        8       11      81        24      35      41
OvNC2    54*     29      17        40      18      42        52      31      17
OvNC9    43      33      24        46      19      35        45      27      28
UnNC2    34      37      29        74      16      10        21      22      57
UnNC9    25      32      43        70      20      10        18      21      61
OrCELS   39      28      33        18      18      64        47      29      24
OvCELS   36      39      25        64      17      19        39      28      33
UnCELS   37      35      28        64      20      16        29      25      46
OrNCCD   26      28      46        9       15      76        16      23      61
OvNCCD   19      21      60        41      16      43        33      24      43
UnNCCD   30      31      39        77*     11      12        11      16      73
SMB      23      30      47        20      19      61        36      37      27
JSB      2       8       90        11      16      73        38      29      33
RAB      33      36      31        22      21      57        48      38      14
1) OvNC2 presents the best AUC, with the highest win rate of 54% and the lowest lose
rate of 17% over the 15 data sets.
2) The single NN and the NN-based AdaBoost models show less sensitivity to class
imbalance than their tree-based counterparts. For example, the NN-based OrSg has an
AUC win rate of 43%, whereas the tree-based OrSg only has a rate of 12%. This agrees
with our results on the artificial data sets and the existing findings in the literature
(Japkowicz and Stephen, 2002; Khoshgoftaar et al., 2010).
3) AdaBoost.NC seems more effective in improving AUC on the real-world data sets
than on the artificial ones. In Section 5.2, the single NN is generally better than
AdaBoost.NC, but this is not the case here. As we have explained, although the NN is less
affected by class imbalance than the decision tree, its robustness weakens as the
training data become more complex (Japkowicz, 2000b). Hence, the single NN may be
more suitable for simpler data sets, such as the artificial data we have shown, whereas
AdaBoost.NC is a better choice for real-world problems.
4) As to the CELS and NCCD models, the CELS ones present better AUC than the
NCCD ones, but are still worse than OvNC2 and OvNC9. An unexpected observation
is that OvCELS achieves very good minority-class recall, which is competitive with
UnCELS and better than the other methods without undersampling. This supports our
initial idea of using NCL to better recognize rare examples. CELS is more effective in
improving minority-class recall than AdaBoost.NC. A possible reason is that ensemble
diversity is encouraged through the error function of the neural network, which is
more straightforward for training a NN than modifying the weights of training data in
AdaBoost.NC.
5) The three class imbalance learning solutions, SMB, JSB and RAB, do not show
any advantage in either AUC or minority-class performance. In particular, JSB appears to
be the worst.
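Observation 4) refers to encouraging diversity through the error function of the neural network. As a rough illustration, the sketch below uses the commonly cited negative correlation learning (NCL) form, in which each ensemble member's squared error is augmented by a penalty correlating its deviation from the ensemble mean with that of the other members. This form is an assumption made for illustration and may differ in detail from the CELS and NCCD implementations evaluated here. The name `ncl_errors` and the parameter `lam` (the penalty strength, distinct from the λ of AdaBoost.NC) are hypothetical.

```python
def ncl_errors(outputs, target, lam):
    """Per-member NCL-style error for a single training example.

    outputs: list of ensemble member outputs f_i
    target:  desired output d
    lam:     penalty strength (lam = 0 gives the plain squared error)
    """
    f_bar = sum(outputs) / len(outputs)  # ensemble (mean) output
    errors = []
    for i, f_i in enumerate(outputs):
        others = sum(f_j - f_bar for j, f_j in enumerate(outputs) if j != i)
        # Because the deviations from the mean sum to zero, this penalty
        # equals -(f_i - f_bar) ** 2: moving away from the mean is rewarded.
        penalty = (f_i - f_bar) * others
        errors.append(0.5 * (f_i - target) ** 2 + lam * penalty)
    return errors


# Hypothetical example: three members, target output 1.0.
plain = ncl_errors([0.2, 0.6, 1.0], target=1.0, lam=0.0)
diverse = ncl_errors([0.2, 0.6, 1.0], target=1.0, lam=0.5)
```

With `lam > 0`, a member whose output differs from the ensemble mean receives a lower error than under the plain squared error, so diversity enters the training signal of each network directly rather than through example weights.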
The above observations suggest that AdaBoost.NC using random oversampling can
produce better overall performance than the other methods when the neural network is used
as the base learner. [...] in finding minority class examples, which means that it overfits
the minority class less without removing any data information.