We use the methodology described by ( ) for the comparison of multiple classifiers on multiple datasets. First, we perform a Friedman test ( , ) to determine whether the classifiers all perform similarly. Let $k$ be the number of tested classifiers and let $r_{ij}$ be the rank of the observation $s_{ij}$ on the $j$-th test case, i.e., with the observations sorted as $s_{\pi_1 j}, \ldots, s_{\pi_k j}$, we have $r_{ij} = x \Leftrightarrow \pi_x = i$ (cf. Eq. ; we assign average ranks in case of ties). Let $r_i = \operatorname{avg}_j r_{ij}$ be the average rank over all $N$ test cases for classifier $h_i$. The null hypothesis states that all classifiers perform equivalently, so their average ranks should be equal. Under this null hypothesis, the Friedman statistic

\[ \chi_F^2 = \frac{12N}{k(k+1)} \sum_i \left( r_i - \frac{k+1}{2} \right)^2 = \frac{12N}{k(k+1)} \left( \sum_i r_i^2 - \frac{k(k+1)^2}{4} \right) \]

has a chi-squared distribution with $k-1$ degrees of freedom when $N$ and $k$ are sufficiently large.
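The computation of the average ranks and of the Friedman statistic can be sketched as follows. This is a minimal illustration and not the implementation used in the evaluation: the function names and the layout `score_table[j][i]` (score of classifier $h_i$ on test case $j$) are assumptions made for the example, as is the convention that higher scores are better.

```python
def average_ranks(scores, higher_is_better=True):
    """Rank the k scores of one test case; rank 1 is best, ties share the average rank."""
    k = len(scores)
    order = sorted(range(k), key=lambda i: scores[i], reverse=higher_is_better)
    ranks = [0.0] * k
    pos = 0
    while pos < k:
        end = pos
        # Extend the block of scores tied with the one at position pos.
        while end + 1 < k and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for p in range(pos, end + 1):
            ranks[order[p]] = avg
        pos = end + 1
    return ranks


def friedman_statistic(score_table):
    """Return (average ranks r_i, Friedman statistic) for score_table[j][i]."""
    n_cases = len(score_table)   # N
    k = len(score_table[0])      # number of classifiers
    per_case = [average_ranks(row) for row in score_table]
    mean_ranks = [sum(r[i] for r in per_case) / n_cases for i in range(k)]
    chi2 = 12 * n_cases / (k * (k + 1)) * sum(
        (r - (k + 1) / 2) ** 2 for r in mean_ranks
    )
    return mean_ranks, chi2


# Hypothetical scores of k = 3 classifiers on N = 4 test cases.
scores = [[0.9, 0.8, 0.7], [0.85, 0.8, 0.6], [0.7, 0.9, 0.5], [0.8, 0.8, 0.4]]
mean_ranks, chi2 = friedman_statistic(scores)  # ranks 1.375, 1.625, 3.0; statistic 6.125
```

The resulting statistic would then be compared against the chi-squared distribution with $k-1$ degrees of freedom; with so few test cases and classifiers the approximation is of course not reliable, the numbers only serve to trace the formula.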
If the null hypothesis is rejected, we can determine which classifiers are significantly better than others with a post-hoc test. ( ) proposes to use the Nemenyi test ( ). The test entails that the performance of two classifiers is significantly different if the difference between their average ranks is at least the critical difference

\[ CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}} \]

where $q_\alpha$ are critical values based on the Studentized range statistic divided by $\sqrt{2}$. ( ) also describes the more powerful Bonferroni-Dunn, Holm, Hochberg and Hommel tests, which are based on pairwise comparisons between classifiers and are used in comparisons against a control classifier. They all rely on the test statistic $z = (r_i - r_j) / \sqrt{k(k+1)/(6N)}$ and a corresponding division of the resulting p-value. However, ( ) legitimate these tests also for performing all pairwise comparisons and, in addition, recommend the more powerful and complex Shaffer and Bergmann-Hommel tests. Details on the computation can be found in the cited publications.
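As a rough sketch of the Nemenyi post-hoc step, the critical difference and the statistic $z$ can be computed as below. The function names are hypothetical, and the critical value $q_\alpha$ is deliberately left as an argument: it has to be looked up in the tables of the cited publications for the chosen $\alpha$ and $k$.

```python
import math


def critical_difference(q_alpha, k, n_cases):
    """CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_cases))


def z_statistic(rank_i, rank_j, k, n_cases):
    """z = (r_i - r_j) / sqrt(k (k + 1) / (6 N))."""
    return (rank_i - rank_j) / math.sqrt(k * (k + 1) / (6 * n_cases))


def significantly_different_pairs(mean_ranks, q_alpha, n_cases):
    """Index pairs (i, j), i < j, whose average ranks differ by at least CD."""
    k = len(mean_ranks)
    cd = critical_difference(q_alpha, k, n_cases)
    return [
        (i, j)
        for i in range(k)
        for j in range(i + 1, k)
        if abs(mean_ranks[i] - mean_ranks[j]) >= cd
    ]
```

Note that the critical difference shrinks with $1/\sqrt{N}$, so the same gap in average ranks may be insignificant on a handful of test cases and significant on many.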
3 Decompositive Approaches to Multilabel Classification
Decompositive approaches transform an original problem into several subproblems of a different type which can hopefully be solved more easily, more efficiently, or more effectively. In multilabel learning, the main purpose is to enable the use of existing state-of-the-art base learners. However, the choice of the decompositive approach certainly has an impact on efficiency, scalability and effectiveness.
The predominant approach in multilabel classification is binary relevance learning (cf. Section ). It tackles a multilabel problem by learning one classifier for each class, using all objects of this class as positive examples and all other objects as negative examples. Pairwise decomposition, in contrast, learns one classifier for each pair of classes. These pairwise classifiers are trained only to distinguish the corresponding two classes (cf. Section ).
( ) and ( ) classify both methods as problem transformation approaches. This also includes approaches which transform the problem into one single "sub"-problem, which is, however, different from the original one. We additionally make the following two distinctions: A transformational (or decompositive) approach is a binarization technique if the resulting subproblem(s) are (is) binary (cf. Section ).
A further interesting property of decompositive approaches is the extent of the resulting subproblems, specifically whether they contain all points in the original data or only a subset. In binary relevance decomposition, for example, the union of the positive and negative examples of a subproblem yields the whole training set. This type of approach is dedicated to separating a subspace from the whole instance space $X$, whereby the subspace is represented by a set of positive examples and all the remaining known examples are assumed to be outside of this particular subspace, i.e. in the inverted subspace. Typically, the presence of a particular property or characteristic is associated with the positive examples. We may hence consider the decomposition into comprehensive subproblems as a formalization of concept learning. This connection is further discussed in Section .
The best-known counter-example, i.e. an approach that generates non-comprehensive subproblems, is pairwise decomposition, where a subproblem contains examples of two classes, and only these. The second class is not the inversion of the first class, as in binary relevance, but itself represents an explicitly given subspace. Hence, the subproblems only cover a subspace of $X$.
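To make the notion of non-comprehensive subproblems concrete, the following sketch builds the pairwise training sets for a small multilabel training set. It rests on an assumption that is not stated above: an example enters the subproblem for the label pair $(i, j)$ only if exactly one of the two labels is relevant for it. The function name and the data layout (each example is a pair of an instance and its set of relevant label indices) are hypothetical.

```python
from itertools import combinations


def pairwise_subproblems(train, n_labels):
    """Return {(i, j): [(x, 1 or 0), ...]} with 1 meaning 'label i rather than label j'."""
    problems = {}
    for i, j in combinations(range(n_labels), 2):
        subset = []
        for x, relevant in train:
            in_i, in_j = i in relevant, j in relevant
            # Assumption: examples with both or neither of the two labels are left out.
            if in_i != in_j:
                subset.append((x, 1 if in_i else 0))
        problems[(i, j)] = subset
    return problems


# Hypothetical training set with m = 4 examples and n = 3 labels.
train = [("a", {0}), ("b", {1}), ("c", {0, 2}), ("d", {2})]
problems = pairwise_subproblems(train, 3)
sizes = {pair: len(subset) for pair, subset in problems.items()}
```

In this toy example every one of the $n(n-1)/2 = 3$ subproblems contains fewer than $m = 4$ examples, whereas each binary relevance subproblem would contain all four.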
Error correcting output codes (ECOC, Section ) produce comprehensive subproblems, while the more general ternary ECOCs do not (cf. Section ). Section discusses tri-class learners and variations which are also non-comprehensive. An overview of the main properties of the different approaches presented in this chapter is given in Table . ( ) and ( ) list some additional transformation methods based on training example omission and repetition which are commonly not employed in practice (anymore). Section and Figure also briefly cover one of these methods (MC) in the context of feature selection.

Table 3.1: Overview of transformation approaches and properties. The total number of labels is denoted by $n$ and the number of examples in the training set by $m$.

transformation approach          number of subproblems   type of subproblems   extent of subproblems
binary relevance decomposition   linear (n)              binary                comprehensive (m)
label powerset transformation    one                     multiclass            comprehensive (m)
error correcting output codes    arbitrary (≥ n)         binary                comprehensive (m)
ternary ECOCs                    arbitrary (≥ n)         binary                subset (< m)
pairwise decomposition           quadratic (n(n−1)/2)    binary                subset (< m)
3.1 Binary Relevance Decomposition
In the binary relevance (BR) decomposition, also known as one-against-all (OAA) or one-against-the-rest/others, particularly for multiclass classification, a multilabel training set with $n$ possible classes is decomposed into $n$ binary training sets of the same size $m = |\mathit{Train}|$ that are then used to train $n$ binary classifiers.
So for each input example $(x_j, y_j)$ in the original training set $\mathit{Train}$, $n$ different examples of the form $(x_j, y_{j,i})$ with $i = 1 \ldots n$ are generated, resulting in the binary sets

\[ \mathit{Train}_i = \langle (x_1, y_{1,i}), \ldots, (x_m, y_{m,i}) \rangle, \quad i = 1 \ldots n \tag{3.1} \]

with the new binary label output space $y_{j,i} \in Y_{bin}$. Note again that all of these $n$ decomposed training sets are of the same size as the original training set. A brief visual description of this technique is available in Figure .
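The decomposition of Eq. 3.1 amounts to a few lines of code. The following is a minimal sketch under the assumption that each training example is given as a pair of an instance and a 0/1 label vector of length $n$; the function name is hypothetical.

```python
def binary_relevance_decompose(train, n_labels):
    """Return the n binary training sets Train_i = [(x_1, y_{1,i}), ..., (x_m, y_{m,i})]."""
    return [[(x, y[i]) for x, y in train] for i in range(n_labels)]


# Hypothetical multilabel training set with m = 3 examples and n = 3 labels.
train = [("x1", [1, 0, 0]), ("x2", [1, 1, 0]), ("x3", [0, 0, 1])]
binary_sets = binary_relevance_decompose(train, 3)
# Each of the three binary sets again contains all m = 3 instances.
```

Every instance is repeated in each of the $n$ binary sets, only its target value changes, which is why the training effort of the base learners grows linearly with the number of labels.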
Hence, $n$ different binary base classifiers $h_i = h_{\mathit{Train}_i}$ are trained in order to determine the relevance of $\lambda_i$, i.e. to recognize whether an instance belongs to their respective class $\lambda_i$. Consequently, the combined prediction of the binary relevance classifier for a test instance $x$ is the output vector

\[ \hat{y} := (h_1(x), \ldots, h_n(x)) \tag{3.2} \]

or alternatively, in set notation,

\[ \hat{P} := \{ \lambda_i \mid h_i(x) = 1 \} \tag{3.3} \]
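Training and prediction according to Eqs. 3.2 and 3.3 might look as follows. The sketch is independent of the base learner; purely for illustration we plug in a hypothetical 1-nearest-neighbour learner over numeric feature tuples, which is an assumption of this example and not a choice made in this work. Labels are represented by their indices.

```python
def train_1nn(binary_train):
    """Hypothetical base learner: 1-nearest-neighbour with squared Euclidean distance."""
    def h(x):
        nearest = min(
            binary_train,
            key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)),
        )
        return nearest[1]
    return h


def train_binary_relevance(train, n_labels, base_learner=train_1nn):
    """Train one binary classifier h_i per label on the i-th binary training set."""
    return [base_learner([(x, y[i]) for x, y in train]) for i in range(n_labels)]


def predict_vector(classifiers, x):
    """Output vector (h_1(x), ..., h_n(x)) as in Eq. 3.2."""
    return [h(x) for h in classifiers]


def predict_set(classifiers, x):
    """Set of indices i with h_i(x) = 1 as in Eq. 3.3."""
    return {i for i, h in enumerate(classifiers) if h(x) == 1}


# Hypothetical training data: two numeric features, three labels.
train = [((0.0, 0.0), [1, 0, 0]), ((1.0, 0.0), [1, 1, 0]), ((0.0, 1.0), [0, 0, 1])]
classifiers = train_binary_relevance(train, 3)
```

Since the $n$ classifiers are trained and queried independently of each other, both steps can be parallelized trivially.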
Figure 3.1: Subproblems in binary relevance classification for multilabel classification: the original three-class problem (green, blue and black classes, shown as overlapping clouds in the left picture) is divided into green vs. rest (second picture), black vs. rest (third) and blue vs. rest two-class subproblems. Separating hyperplanes, denoted by red lines, have to respect all examples (inside the clouds). Clouds of negative examples have dotted lines.

As already pointed out in Section , many base classifiers produce some type of relevance scores, which can be used to compare and rank classes. No assumption has to be made at this point about the value range of the score predictions, though mostly the range is either $[0; 1]$ or $[-1; 1]$. We will denote these predictions with $h' : X \to \mathbb{R}$. The binary prediction $h$ is obtained by means of thresholding at the middle $\theta$ of the range

\[ t_\theta(s) := I[s > \theta], \quad s, \theta \in \mathbb{R} \tag{3.4} \]

and hence, we can define the following relationship for scoring classifiers

\[ h = t_\theta \circ h' \tag{3.5} \]

If the context is clear, we may simply write $h$ instead of $h'$. Consequently, we define a prediction score vector as

\[ \hat{v} := (h'_1(x), \ldots, h'_n(x)) \in \mathbb{R}^n \tag{3.6} \]

and the corresponding sorted label ranking as

\[ r = \langle \pi_1, \ldots, \pi_n \rangle \in \Pi^{n}_{n,n}, \quad \forall \pi_i, \pi_j,\ i < j :\ h'_{\pi_i}(x) > h'_{\pi_j}(x) \tag{3.7} \]
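Equations 3.4 to 3.7 can be illustrated with a short sketch. The function names are hypothetical, the default $\theta = 0.5$ assumes scores in $[0; 1]$, and ties in the ranking are simply kept in label-index order, which is an assumption of the example.

```python
def threshold(score, theta=0.5):
    """t_theta(s) = I[s > theta] as in Eq. 3.4."""
    return 1 if score > theta else 0


def binary_predictions(scores, theta=0.5):
    """Apply h = t_theta o h' (Eq. 3.5) to each entry of the score vector (Eq. 3.6)."""
    return [threshold(s, theta) for s in scores]


def label_ranking(scores):
    """Label indices sorted by decreasing score (Eq. 3.7); ties keep index order."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical score vector for n = 4 labels.
scores = [0.2, 0.9, 0.5, 0.7]
predicted = binary_predictions(scores)  # [0, 1, 0, 1]
ranking = label_ranking(scores)         # [1, 3, 2, 0]
```

The third label, whose score lies exactly on the threshold, is predicted as irrelevant because of the strict inequality in Eq. 3.4.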
Note, however, that $\theta$ need not be $0.5$, although this is the most natural value, since the objective was to decide at that point. In fact, many approaches exist for selecting deviating thresholds, and even for choosing a different $\theta_i$ for each classifier $h'_i$. A review can be found in Section .
23 The use of $>$ or $\geq$ is arbitrary in practice. However, since we predict whether a label is relevant, it is more consistent to interpret a tie as neither relevant nor irrelevant than as relevant and irrelevant, and this is achieved with the present definition.
24 Without loss of generality, we assume that no ties happen in order to keep the more elegant definition of $r$ as a total order. In practice, we may break ties randomly or according to the prior probability of the labels.

There exists a strong connection between binary relevance decomposition and concept learning. As already detailed in Section , concept learning is dedicated to learning the presence or absence of a specific concept among instances. When several target concepts are possible or given for the same set of instances, we formally have a multilabel problem. In fact, there was an early understanding of the multilabel setting in the research field of concept learning. The first multilabel classification system known to the author, namely the Construe topic identification system used by Reuters ( ), followed the paradigm of concept definitions and used a separate rule base for recognizing each label. Binary relevance could be seen as a solution to multilabel learning using concept learning, since it decomposes a multilabel task into several subproblems which can be semantically considered concept learning problems. However, this is not the only valid interpretation. For instance, recall the example of recognizing the color of an object in Section . Let the first (binary) class $y_1$ in $Y$ determine whether an object is red or blue, and let the second class $y_2$ represent either a circular or a rectangular object, etc. The resulting subproblems using binary relevance decomposition can no longer be interpreted as concept learning problems, since there are no clear positive examples. Hence, binary relevance should be seen as a formally defined method for decomposing multilabel problems into binary problems, regardless of the base learner used and of the semantics of the labels and of the resulting binary problems. Decomposition according to concept learning is an instantiation of it which may be used if convenient.
The system of ( ) is among the first works known to the author which explicitly employed the generalized binary relevance decomposition approach with support vector machines for the binary subproblems. This method was later called the binary approach by ( ) and (curiously) cross-training by ( ) until ( ) coined the term used in this work.