Multiclass classification presents considerably greater accuracy and performance challenges than binary-class problems since the probability of making an error is higher. A common approach has been to decompose multiclass problems into a series of more manageable binary subproblems instead of applying explicit multiclass algorithms directly [94]. There are a number of methods available for achieving this decomposition.
Multiple Binary Classifiers
Traditionally, the most popular decomposition approaches have been the one-against-all (OAA) and one-against-one (OAO) strategies [3, 86]. The OAA approach constructs k classifiers for a k-class problem. Each classifier is trained to distinguish one class i from the remaining classes, in which case a binary-class boosting algorithm suffices. At detection time, the ideal outcome of the combination of classifiers is a unique vote for a single predicted class. Since this is often not the case, ties are either broken arbitrarily [58], or an additional value that signifies the confidence of each classifier is derived and used to resolve conflicting predictions [77]. A known disadvantage of this method is the unbalanced nature of the training sets for the classifiers, which may complicate the induction of accurate class decision boundaries [62, 72, 134].
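The confidence-based resolution described above can be sketched as follows. This is a minimal illustration rather than the procedure of [77]: the function name `oaa_predict` and the convention that a positive score counts as a vote are assumptions made for the example, and the scores are taken as given rather than produced by trained classifiers.

```python
import numpy as np

def oaa_predict(scores):
    """Resolve one-against-all predictions from per-class confidence scores.

    scores[s, c] is the confidence of the (hypothetical) binary classifier
    'class c versus the rest' for sample s; a positive score counts as a vote.
    Returns the predicted class per sample and a flag that is True when
    exactly one classifier voted (the ideal outcome).
    """
    scores = np.asarray(scores, dtype=float)
    n_votes = (scores > 0).sum(axis=1)
    # With a unique vote the positive score is also the row maximum, so
    # argmax covers both the ideal case and the confidence-based tie-break.
    predictions = scores.argmax(axis=1)
    return predictions, n_votes == 1

predictions, unique = oaa_predict([
    [0.9, -0.2, -0.5],   # unique vote for class 0
    [0.3, 0.7, -0.1],    # two votes: resolved by confidence
    [-0.4, -0.1, -0.6],  # no vote: least negative classifier wins
])
```

Breaking ties arbitrarily, as in [58], would replace the `argmax` fallback with a random choice among the voting classes.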
The OAO approach, also known as round robin classification, binarizes the multiclass problem into a series of pairwise classification sub-problems. Each classifier is trained to differentiate between the instances of a single pair of classes. This results in $k(k-1)/2$ classifiers being generated. Although OAO generates a total number of binary classifiers in the order of $k^2$, the actual training runtime is usually not excessive since the training task of differentiating only two classes is much simpler. Because fewer instances are considered, the inducer has much greater freedom in fitting the decision boundary between two classes than between the entire set of classes [50].
The final prediction of OAO classifiers is most commonly achieved through a majority voting scheme [77]. A crucial shortcoming of majority voting in this context is that many votes are generated by classifiers that were trained on pairs of class labels irrelevant to the instance at hand, that is, pairs that do not contain its true class. The votes from these irrelevant classifiers are nevertheless integrated into the final decision, which can adversely influence the result. Platt et al. [117] devise a hierarchical strategy to integrate the decision making process of OAO classifiers through directed acyclic graphs (DAG), which resemble binary decision trees. The DAG-based strategy works by propagating decisions through the graph, executing at each node a classifier that differentiates class i from class j. The decision at each node is concluded as not class i when j is predicted, and vice versa, until the leaf nodes are reached. Platt et al. [117] show that the DAG approach is efficient and accurate at aggregating decisions and demonstrate how an uncompromised accuracy can be achieved without executing all classifiers within an ensemble.
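The elimination logic of the DAG strategy can be sketched with a list of candidate classes, where each node compares the first and last remaining candidates and discards the loser. This is an illustrative sketch, not the implementation of [117]; `ddag_predict`, `oao_pairs` and the toy pairwise rule `nearest_centre` are hypothetical names introduced here.

```python
from itertools import combinations

def oao_pairs(k):
    """All k(k-1)/2 class pairs for which OAO trains a binary classifier."""
    return list(combinations(range(k), 2))

def ddag_predict(x, k, pairwise):
    """Evaluate a decision DAG over k classes.

    pairwise(i, j, x) returns i or j, the class preferred for x.
    Each evaluation removes one candidate, so only k - 1 of the
    k(k-1)/2 classifiers are executed.
    """
    candidates = list(range(k))
    evaluations = 0
    while len(candidates) > 1:
        i, j = candidates[0], candidates[-1]
        winner = pairwise(i, j, x)
        evaluations += 1
        if winner == j:
            candidates.pop(0)   # decision 'not class i'
        else:
            candidates.pop()    # decision 'not class j'
    return candidates[0], evaluations

def nearest_centre(i, j, x):
    """Toy pairwise rule (assumption): class c is centred at position c."""
    return i if abs(x - i) <= abs(x - j) else j

label, used = ddag_predict(2.2, 4, nearest_centre)
```

For four classes the sketch executes three classifiers out of the six that OAO generates.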
2.2. Boosting 17
Error Correcting Output Codes
Dietterich and Bakiri [30] applied principles from coding theory, in the form of error-correcting output codes (ECOC) usually associated with communication across noisy channels, to the problem of multiclass decomposition and classification resolution. ECOCs have since experienced widespread use and success on many applications.
The ECOC method constructs a coding matrix, an example of which can be seen in Table 2.1. The matrix provides a scheme for training classifiers. The scheme specifies the different groupings into positive and negative sets to which the k > 2 classes are reduced. For a k-class problem, k rows are created, one for each class, and n > k columns are generated, where each column represents one classifier $f_i$. The bit values in each column i define a training scheme for relabeling the set of multiclass labels into binary classes. Each row then becomes a unique codeword for a given class label. Given an unseen input x, each binary classifier corresponding to a column is executed and produces an output $f_i(x) \in \{0, 1\}$. Their combined output produces a binary codeword. This output code is then matched against the rows in the coding matrix to resolve the predictions. The class with the closest matching codeword according to the Hamming distance becomes the predicted label.
Table 2.1: Example of a coding matrix for a four class problem.
                 classifiers
class    f1     f2     f3     f4     f5     f6
c1       1      1      1      1      1      1
c2       0      0      0      0      1      1
c3       0      0      1      1      0      0
c4       0      1      0      1      0      1
output   f1(x)  f2(x)  f3(x)  f4(x)  f5(x)  f6(x)
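Decoding against the matrix of Table 2.1 can be sketched as below. The matrix values are those of the table; the names `CODING_MATRIX`, `CLASSES` and `ecoc_decode`, and the example classifier outputs, are assumptions made for illustration.

```python
import numpy as np

# Rows are the codewords of c1..c4, columns the classifiers f1..f6 (Table 2.1).
CODING_MATRIX = np.array([
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1],
])
CLASSES = ["c1", "c2", "c3", "c4"]

def ecoc_decode(outputs, matrix=CODING_MATRIX, classes=CLASSES):
    """Return the class whose codeword is closest in Hamming distance.

    outputs is the vector (f1(x), ..., fn(x)) of binary classifier outputs.
    Ties are resolved in favour of the first row with minimal distance.
    """
    outputs = np.asarray(outputs)
    distances = (matrix != outputs).sum(axis=1)
    return classes[int(distances.argmin())], distances

# The codeword of c3 with the output of f5 flipped by a misclassification.
label, distances = ecoc_decode([0, 0, 1, 1, 1, 0])
```

In this example the corrupted codeword is still decoded as c3, since it lies at distance 1 from the row of c3 and at distance 3 or more from every other row.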
The quality of the coding matrix is crucial to the success of the method. The effectiveness of the coding matrix can be assessed using two criteria. The first is the row separability. The length of the codewords n must be larger than the minimum required to uniquely differentiate between all the classes. The extra redundancy in the coding vectors at each row enables the classes to be more separated from each other according to the Hamming distance. This redundancy and distance allow error correction or compensation to take place in the presence of misclassifications by several classifiers. The second is the diversity of the columns, whereby the columns are required to be as uncorrelated as possible.
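Both criteria can be checked mechanically for a given matrix. The sketch below relies on the standard coding-theory result that a code with minimum Hamming distance d between codewords corrects up to ⌊(d − 1)/2⌋ bit errors; the column check is a crude proxy for diversity that only flags identical or complementary columns. The function names are hypothetical, and the matrix is the one from Table 2.1.

```python
import numpy as np
from itertools import combinations

MATRIX = np.array([
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1],
])

def min_row_distance(matrix):
    """Smallest Hamming distance between any two codewords (rows)."""
    rows = np.asarray(matrix)
    return min(int((a != b).sum()) for a, b in combinations(rows, 2))

def correctable_errors(matrix):
    """Number of classifier errors that decoding is guaranteed to absorb."""
    return (min_row_distance(matrix) - 1) // 2

def has_redundant_columns(matrix):
    """True if two columns are identical or exact complements.

    Such columns define the same binary problem, so their classifiers
    would tend to make perfectly correlated errors.
    """
    cols = np.asarray(matrix).T
    return any((a == b).all() or (a != b).all()
               for a, b in combinations(cols, 2))

d = min_row_distance(MATRIX)
t = correctable_errors(MATRIX)
```

For the matrix of Table 2.1 the minimum row distance is 3, so a single misclassifying classifier can always be compensated.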
Dietterich and Bakiri [30] initially showed how the error correcting properties of codewords could be effectively unified with multiclass training, whereby for each column of the coding matrix, the classifier learns to discriminate between two subsets of all classes. Allwein et al. [3] then generalized the approach further by demonstrating how the two subsets of each bit vector no longer had to represent all possible classes.
Multiclass Boosting
The initial multiclass extension of AdaBoost resulted in AdaBoost.M1 [44]. The primary obstacle to this initial form of multiclass AdaBoost was its requirement that the weak learner generate a weighted error below 1/2 at each round. For most multiclass problems with k > 2, this requirement is much harder to satisfy than merely outperforming random guessing, which achieves an accuracy of 1/k. Consequently, stronger and more computationally intensive weak learners like C4.5 had to be used. As a byproduct, the training runtimes increased. More recently, Eibl and Pfeiffer [35] and Zhu et al. [199] proposed AdaBoost.M1W and SAMME respectively. These modifications enabled AdaBoost to operate with weak learners that are only slightly more accurate than random guessing at 1/k.
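The effect of the relaxed requirement can be illustrated with the hypothesis weights. The sketch assumes the commonly stated forms of the two weights, which are not given in this section: ln((1 − err)/err) for AdaBoost.M1 and the same expression plus ln(k − 1) for SAMME as described by Zhu et al. [199]. The function names are hypothetical.

```python
import math

def adaboost_m1_weight(err):
    """Hypothesis weight in AdaBoost.M1: positive only if err < 1/2."""
    return math.log((1.0 - err) / err)

def samme_weight(err, k):
    """Hypothesis weight in SAMME (assumed form): the extra ln(k - 1) term
    keeps the weight positive whenever err < 1 - 1/k, i.e. whenever the
    weak learner beats random guessing among k classes."""
    return math.log((1.0 - err) / err) + math.log(k - 1)

# A weak learner with 60% error on a four-class problem beats random
# guessing (75% error) but violates the AdaBoost.M1 requirement.
m1 = adaboost_m1_weight(0.6)
samme = samme_weight(0.6, 4)
```

Under these assumed forms the AdaBoost.M1 weight is negative for this learner while the SAMME weight is positive, and for k = 2 the two weights coincide.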
Freund and Schapire [45] also proposed a powerful boosting algorithm called AdaBoost.M2, which is designed to overcome the rigid error demands of its early predecessors. The strength of the algorithm lies in the fact that it not only learns the difficult samples, but also learns from incorrect class labels: a distribution over mislabeled instances is introduced to couple the learning and the boosting process. This is achieved by introducing the idea of a plausible set µ, where µ ⊂ Y and Y is the entire set of class labels. The weak hypothesis can then make a binary prediction for a given sample x, in which it only has to predict whether a sample belongs to the plausible set or not. The weak hypothesis is evaluated by a pseudoloss measure. The hypothesis with the lowest pseudoloss over all possible combinations of the plausible set becomes the hypothesis for round t. In this manner, AdaBoost.M2 guarantees that the best discriminating hypothesis between all classes is selected at each round.
Due to its strength as a multiclass learning algorithm, AdaBoost.M2 is used as one of the control methods in this research. Algorithm 2 lists the steps, where the notation $\llbracket \pi \rrbracket$, used here and subsequently, is defined as 1 if proposition π holds and 0 otherwise.
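Algorithm 2: AdaBoost.M2
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y$
Output: Hypothesis $H_{final}(x) = \arg\max_{\ell \in Y} \sum_{t=1}^{T} \alpha_t \llbracket \ell \in \tilde{h}_t(x) \rrbracket$
Initialize $\tilde{D}_1(i, \ell) = \llbracket \ell \neq y_i \rrbracket / (m(k-1))$ to make $\tilde{D}_1$ uniform over all incorrect labels
1  for $t = 1$ to $T$ do
2      Train weak learner using pseudoloss defined by $\tilde{D}_t$.
3      Get weak hypothesis $\tilde{h}_t : X \to 2^Y$
4      Let $\tilde{\epsilon}_t = \frac{1}{2} \sum_{i=1}^{m} \sum_{\ell \in Y} \tilde{D}_t(i, \ell) \cdot (\llbracket y_i \notin \tilde{h}_t(x_i) \rrbracket + \llbracket \ell \in \tilde{h}_t(x_i) \rrbracket)$
5      Let $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \tilde{\epsilon}_t}{\tilde{\epsilon}_t}\right)$
6      Update $\tilde{D}_{t+1}(i, \ell) = \frac{1}{Z_t} \cdot \tilde{D}_t(i, \ell) \exp(\alpha_t (\llbracket y_i \notin \tilde{h}_t(x_i) \rrbracket + \llbracket \ell \in \tilde{h}_t(x_i) \rrbracket))$
7      where $Z_t$ is the normalization factor so that $\tilde{D}_{t+1}$ will sum to 1.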
The first three steps of Algorithm 2 are tightly bound and represent the enhanced communication between the weak learner and boosting. For each round of boosting, a weak hypothesis is generated for each candidate plausible set $\mu_t \in 2^Y$. The goal of the learner is to minimize the pseudoloss defined as:
$$\tilde{\epsilon}_t = \frac{1}{2} \sum_{i=1}^{m} \sum_{\ell \in Y} \tilde{D}_t(i, \ell) \cdot \left( \llbracket y_i \notin \tilde{h}_t(x_i) \rrbracket + \llbracket \ell \in \tilde{h}_t(x_i) \rrbracket \right)$$
where the loss measure penalizes the hypothesis for not including the correct label $y_i$ belonging to sample $x_i$ in the plausible set, as $y_i \notin \tilde{h}_t(x_i)$, and additionally penalizes it for including an incorrect label ℓ in the plausible set, as $\ell \in \tilde{h}_t(x_i)$. The weight $\alpha_t$ is calculated as in binary AdaBoost, after which the algorithm proceeds to update the distribution using the same penalty terms that are used for calculating the pseudoloss measure. The final hypothesis classifies an instance according to the label which occurs most frequently, weighted by $\alpha_t$, in the plausible sets selected by the weak hypotheses.
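One boosting round of Algorithm 2 can be sketched numerically as follows, with the weak learner left out and its plausible sets supplied directly as a boolean matrix. The function names and the toy data are assumptions made for the example; the update follows the steps of Algorithm 2 as listed.

```python
import numpy as np

def init_mislabel_distribution(y, k):
    """D~_1(i, l): uniform over all incorrect labels, zero on the true label."""
    m = len(y)
    D = np.full((m, k), 1.0 / (m * (k - 1)))
    D[np.arange(m), y] = 0.0
    return D

def m2_round(D, y, plausible):
    """One round of Algorithm 2 for a given weak hypothesis.

    plausible[i, l] is True iff label l is in the plausible set h~_t(x_i).
    Returns the pseudoloss, alpha_t and the normalized D~_{t+1}.
    """
    m = len(y)
    miss = ~plausible[np.arange(m), y]            # [[y_i not in h~_t(x_i)]]
    penalty = miss[:, None].astype(float) + plausible.astype(float)
    eps = 0.5 * float((D * penalty).sum())
    alpha = 0.5 * float(np.log((1.0 - eps) / eps))
    D_next = D * np.exp(alpha * penalty)
    return eps, alpha, D_next / D_next.sum()

y = np.array([0, 1, 2])
plausible = np.array([
    [True, False, False],   # exactly the correct label
    [False, True, True],    # correct label plus one incorrect label
    [True, False, False],   # misses the correct label entirely
])
D1 = init_mislabel_distribution(y, k=3)
eps, alpha, D2 = m2_round(D1, y, plausible)
```

In this toy case the pseudoloss is 1/3, and the updated distribution places the most weight on the third sample, whose plausible set omits the correct label.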
Though experiments with AdaBoost.M2 have indicated that it performs very well, it has some drawbacks. The first is its training runtime [44, 58]. From line 2 in Algorithm 2, it becomes clear that the number of possible plausible sets rises exponentially with the number of class labels in the training set. An exhaustive search over the entire range of possible plausible sets rapidly becomes infeasible for datasets containing more than ten class labels (k > 10). The second disadvantage is encountered when implementing the algorithm. Since the boosting and the weak learner are tightly coupled, the weak learner must be modified to use the pseudoloss measure, which complicates the implementation.
A common criticism of the two-stage decomposition approach of training classifiers independently and then combining their prediction votes, as seen in OAA, OAO and ECOC, is that the binarizing step is done a priori without considering the properties and characteristics of a dataset [3, 72, 94]. Furthermore, Allwein et al. [3] and Jin and Zhang [72] point out that, with respect to ECOC, some artificially created binary sub-problems may be difficult to learn due to complicated multimodal decision boundaries. Finding an optimal coding matrix for ECOC methods has been shown to be NP-hard [27]; however, some heuristics have been proposed. The most popular of these are AdaBoost.OC (output coding) [44] and AdaBoost.ECC (error-correcting code) [58], which have effectively unified boosting with ECOC [134]. Due to their widespread usage and the still ongoing research surrounding them [72, 86, 134, 154], both boosting algorithms are implemented in this research as baseline models.
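Algorithm 3: AdaBoost.OC
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y$
Output: Hypothesis $H_{final}(x) = \arg\max_{\ell \in Y} \sum_{t=1}^{T} \alpha_t \llbracket h_t(x) = \mu_t(\ell) \rrbracket$
Initialize $\tilde{D}_1(i, \ell) = \llbracket \ell \neq y_i \rrbracket / (m(k-1))$ to make $\tilde{D}_1$ uniform over all incorrect labels
1  for $t = 1$ to $T$ do
2      Compute colouring $\mu_t : Y \to \{0, 1\}$
3      Let $U_t = \sum_{i=1}^{m} \sum_{\ell \in Y} \tilde{D}_t(i, \ell) \llbracket \mu_t(y_i) \neq \mu_t(\ell) \rrbracket$
4      Let $D_t(i) = \frac{1}{U_t} \cdot \sum_{\ell \in Y} \tilde{D}_t(i, \ell) \llbracket \mu_t(y_i) \neq \mu_t(\ell) \rrbracket$
5      Train weak learner on examples $(x_1, \mu_t(y_1)), \ldots, (x_m, \mu_t(y_m))$ weighted according to $D_t$
6      Get weak hypothesis $h_t : X \to \{0, 1\}$
7      Let $\tilde{h}_t(x) = \{\ell \in Y : h_t(x) = \mu_t(\ell)\}$
8      Let $\tilde{\epsilon}_t = \frac{1}{2} \sum_{i=1}^{m} \sum_{\ell \in Y} \tilde{D}_t(i, \ell) \cdot (\llbracket y_i \notin \tilde{h}_t(x_i) \rrbracket + \llbracket \ell \in \tilde{h}_t(x_i) \rrbracket)$
9      Let $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \tilde{\epsilon}_t}{\tilde{\epsilon}_t}\right)$
10     Update $\tilde{D}_{t+1}(i, \ell) = \frac{1}{Z_t} \cdot \tilde{D}_t(i, \ell) \exp(\alpha_t (\llbracket y_i \notin \tilde{h}_t(x_i) \rrbracket + \llbracket \ell \in \tilde{h}_t(x_i) \rrbracket))$
11     where $Z_t$ is the normalization factor so that $\tilde{D}_{t+1}$ will sum to 1.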
Both algorithms overcome significant computational drawbacks of AdaBoost.M2. The ECOC-based boosting methods iteratively generate columns of the coding matrix so that the confusion between classes is reduced at each boosting round. Algorithm 3 describes the detailed steps of AdaBoost.OC, while theoretical proofs for it can be found in [44].
AdaBoost.OC proceeds by first generating a coding function that the authors refer to as the colouring µ. The colouring function generates the columns of the coding matrix and maps all class labels Y → {0, 1}. Similar to AdaBoost.M2, AdaBoost.OC maintains an additional distribution $\tilde{D}$ at each round t. By applying the colouring function $\mu_t$ to $\tilde{D}_t$, the value $U_t$ is generated, which represents the error correcting ability of the column $\mu_t$ of the coding matrix. $\mu_t$, $U_t$ and the distribution $\tilde{D}_t$ are then combined to compute the weights of the distribution $D_t$. A binary classifier $h_t$ is subsequently learned on the adjusted distribution $D_t$. This differs from AdaBoost.M2, which learns on $\tilde{D}_t$. The pseudo-error $\tilde{\epsilon}_t$ is then used as in AdaBoost to produce the confidence value $\alpha_t$. Lastly, the distribution $\tilde{D}$ is updated as in AdaBoost.M2 with $\alpha_t$. The final hypothesis H on sample x is computed as the class label ℓ that receives the highest weighted vote, where round t votes for every label whose colour $\mu_t(\ell)$ equals $h_t(x)$.
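The quantities that distinguish AdaBoost.OC from AdaBoost.M2 can be sketched as below: the value $U_t$, the example weights $D_t$, the plausible sets induced by a binary hypothesis, and the final vote. The function names, the chosen colourings and the toy data are assumptions for illustration; how the colouring is chosen is not covered by the sketch.

```python
import numpy as np

def oc_example_weights(D_tilde, y, mu):
    """Compute U_t and D_t from the mislabel distribution and a colouring.

    mu[l] in {0, 1} is the colour of label l. Only label pairs that the
    colouring separates (mu[y_i] != mu[l]) contribute weight.
    """
    y, mu = np.asarray(y), np.asarray(mu)
    separated = mu[y][:, None] != mu[None, :]
    per_example = (D_tilde * separated).sum(axis=1)
    U = float(per_example.sum())
    return U, per_example / U

def plausible_sets(h_out, mu):
    """h~_t(x_i) = {l : h_t(x_i) = mu_t(l)} as a boolean matrix."""
    return np.asarray(h_out)[:, None] == np.asarray(mu)[None, :]

def oc_final(alphas, h_outs, mus):
    """Weighted vote for one sample: round t supports labels coloured h_t(x)."""
    votes = sum(a * (np.asarray(mu) == h)
                for a, h, mu in zip(alphas, h_outs, mus))
    return int(np.argmax(votes))

y = np.array([0, 1, 2])
D_tilde = np.full((3, 3), 1.0 / 6.0)
D_tilde[np.arange(3), y] = 0.0          # uniform over incorrect labels
U, D = oc_example_weights(D_tilde, y, mu=[0, 1, 1])
sets = plausible_sets([0, 1, 0], mu=[0, 1, 1])
label = oc_final([0.5, 0.3], [1, 0], [[0, 1, 1], [0, 0, 1]])
```

The colouring [0, 1, 1] separates label 0 from the other two, so the first sample, whose true label is 0, receives half of the weight in $D_t$. The resulting `sets` matrix can be passed to the same pseudoloss and update computation as in AdaBoost.M2.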
AdaBoost.ECC resembles AdaBoost.OC in many respects (Algorithm 4), but differs in that it does not calculate the $\alpha_t$ value based on the pseudo-error at each round t. Instead, it selects $\alpha_t$ and $\beta_t$ values at each round, which represent the positive and negative votes of the hypothesis $h_t : X \to \{-1, 1\}$ on the binary problem. According to its authors [58], it thus represents a more direct reduction of multiclass learning to the binary form. At classification, the function $g_t(x) \cdot \mu_t(\ell)$ returns the vote for a class label ℓ. The final hypothesis H(x) outputs the class label ℓ with the maximum sum of these votes over all constituent hypotheses $g_t(x)$. The authors maintain that AdaBoost.ECC is a computationally simpler algorithm, which in their research has displayed superior results over AdaBoost.OC.
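Algorithm 4: AdaBoost.ECC
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y$
Output: Hypothesis $H_{final}(x) = \arg\max_{\ell \in Y} \sum_{t=1}^{T} g_t(x) \mu_t(\ell)$
Initialize $\tilde{D}_1(i, \ell) = \llbracket \ell \neq y_i \rrbracket / (m(k-1))$ to make $\tilde{D}_1$ uniform over all incorrect labels
1  for $t = 1$ to $T$ do
2      Compute colouring $\mu_t : Y \to \{-1, 1\}$
3      Let $U_t = \sum_{i=1}^{m} \sum_{\ell \in Y} \tilde{D}_t(i, \ell) \llbracket \mu_t(y_i) \neq \mu_t(\ell) \rrbracket$
4      Let $D_t(i) = \frac{1}{U_t} \cdot \sum_{\ell \in Y} \tilde{D}_t(i, \ell) \llbracket \mu_t(y_i) \neq \mu_t(\ell) \rrbracket$
5      Train weak learner on examples $(x_1, \mu_t(y_1)), \ldots, (x_m, \mu_t(y_m))$ weighted according to $D_t$
6      Get weak hypothesis $h_t : X \to \{-1, 1\}$
7      Compute the weight of positive and negative votes $\alpha_t$ and $\beta_t$ respectively
8      Define: $g_t(x) = \begin{cases} \alpha_t & \text{if } h_t(x) = 1 \\ \beta_t & \text{if } h_t(x) = -1 \end{cases}$
9      Update $\tilde{D}_{t+1}(i, \ell) = \frac{1}{\tilde{Z}_t} \cdot \tilde{D}_t(i, \ell) \exp\{(g_t(x_i)\mu_t(\ell) - g_t(x_i)\mu_t(y_i)) \cdot \frac{1}{2}\}$
10     where $\tilde{Z}_t$ is the normalization factor so that $\tilde{D}_{t+1}$ will sum to 1.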
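The update and the final vote of Algorithm 4 can be sketched as follows. The values of $\alpha_t$ and $\beta_t$ are taken as given, since this section does not state how they are computed, and the sketch assumes that $\beta_t$ carries its own (negative) sign, which is one reading of the definition of $g_t$ above. All function names and toy values are hypothetical.

```python
import numpy as np

def g_values(h_out, alpha, beta):
    """g_t(x) = alpha_t if h_t(x) = 1, beta_t if h_t(x) = -1 (Algorithm 4)."""
    return np.where(np.asarray(h_out) == 1, alpha, beta)

def ecc_update(D_tilde, y, mu, g_vals):
    """Distribution update of Algorithm 4 for one round.

    The exponent is (g_t(x_i) mu_t(l) - g_t(x_i) mu_t(y_i)) / 2.
    """
    y = np.asarray(y)
    mu = np.asarray(mu, dtype=float)
    exponent = 0.5 * g_vals[:, None] * (mu[None, :] - mu[y][:, None])
    D_next = D_tilde * np.exp(exponent)
    return D_next / D_next.sum()

def ecc_predict(g_per_round, mus):
    """Final hypothesis for one sample: argmax over labels of sum_t g_t(x) mu_t(l)."""
    scores = sum(gv * np.asarray(mu, dtype=float)
                 for gv, mu in zip(g_per_round, mus))
    return int(np.argmax(scores))

y = np.array([0, 1, 2])
D_tilde = np.full((3, 3), 1.0 / 6.0)
D_tilde[np.arange(3), y] = 0.0            # uniform over incorrect labels
mu = [1, -1, -1]
# Assumed weak-hypothesis outputs: the third sample is misclassified,
# since mu(y_3) = -1 but h_t(x_3) = 1.
g_vals = g_values([1, -1, 1], alpha=0.8, beta=-0.5)
D_next = ecc_update(D_tilde, y, mu, g_vals)
label = ecc_predict([0.8, -0.5], [[1, -1, -1], [-1, 1, -1]])
```

With these assumed values, the weight of the pair (third sample, label 0) grows because the binary hypothesis confused that sample with the colour of label 0, while the pairs of the correctly classified first sample shrink.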