In a survey on multilabel classification, ( ) distinguish between two categories of approaches, namely those that employ a transformation into a set of binary classification problems and those that adapt existing methods to handle multilabeled data directly. As it turns out, it is sometimes difficult to separate the two, since many adapted approaches consist in transforming the original problem into a series of easier problems. On the other hand, a series of learners have been developed in the meantime that are, at least conceptually, not directly based on existing approaches but were specifically developed with the multilabel setting in mind.

Thus, we use the following (not exhaustive) coarse division in this work: transformational and holistic approaches. The first group corresponds to the notion of transformation and decomposition techniques, and this work is specifically dedicated to one of the possible techniques. The following section provides a basic description. A thorough explanation and analysis is provided in Chapter . The latter group of learners tries to solve the global problem directly without transforming it into subproblem(s). An overview of existing techniques is given in Section .

As described above, it is sometimes difficult to draw the boundary, and some of the developments described in the following may not fit this categorization perfectly well. The sections after Section review some of the main techniques, older and newer, encountered in multilabel learning. Since many of the cited works combine multiple techniques, some of them could indeed be categorized into several classes (a multilabel categorization). We refer to the discussion sections in the corresponding chapters of this thesis for further relevant and interesting literature, particularly Section for hierarchical learners, for generally applicable thresholding techniques, for particularly scalable methods, for approaches dedicated to label correlations, for stacking, and for pairwise approaches, and Section for a comparison of these with one-against-all decomposition. A further source of information are the excellent surveys of ( ) and ( ) and, for the particularly interested reader, the respective related work sections in the cited literature.

18 Note, however, that the optimal prediction is always defined for all metrics, namely predicting exactly the correct labelset. But a Hamming loss optimized learner will approach the optimum in a different way than an accuracy optimized one. Graphically, on a graph with HAMLOSS on one axis and ACC on the other, one type of learner would approach the optimum in the upper left corner from below the diagonal and the other from above, possibly in the form of a curve.

2.8.1 Transformational Approaches

We focus on three basic transformation schemes, namely binary relevance, label powerset and pairwise decomposition. The main idea of these methods is to transform the problem into non-multilabel tasks, particularly binary tasks (except for label powerset, which requires one further step), which can then be solved separately with existing binary learners. Section , for example, reviews the basics of linear learners such as support vector machines (SVM) and perceptrons. Other learners used in this work include C4.5 decision trees and Naive Bayes. We refer to ( , ) for explanations of these and other approaches. If the transformation results in several subtasks, the approach is called decompositive.

In the binary relevance (BR) or one-against-all (OAA) method, a multilabel problem with n possible classes is decomposed into n binary problems. For each subproblem, a binary classifier is trained to predict the relevance of the corresponding class. In the pairwise decomposition approach, one classifier is trained for each pair of classes, i.e., the original problem is decomposed into $n(n-1)/2$ smaller subproblems. The binary classifiers are trained on examples with a clear preference for one of the two classes. During classification, each base classifier is queried and the prediction is interpreted as a vote for one of its two classes. In the label powerset approach (LP), a meta multiclass problem is constructed where each appearing label combination $P_i$ is interpreted as one separate class. The meta problem is then solved with a normal multiclass algorithm or with the previously presented decomposition methods. More thorough explanations and analyses are given in the next chapter.
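The three transformations can be illustrated with a minimal sketch that only builds the training targets of the subproblems and leaves the choice of base learner open. The function names and the toy data are assumptions made for illustration and do not stem from this work.

```python
from itertools import combinations


def binary_relevance(labelsets, labels):
    """One binary target vector per label: is the label relevant for each instance?"""
    return {lab: [int(lab in ys) for ys in labelsets] for lab in labels}


def pairwise(labelsets, labels):
    """One subproblem per label pair (a, b).

    An instance is used only if exactly one of the two labels is relevant
    (a clear preference); the target is 1 if a is preferred, 0 if b is.
    Each entry is a tuple (instance index, target).
    """
    tasks = {}
    for a, b in combinations(labels, 2):
        tasks[(a, b)] = [(i, int(a in ys))
                         for i, ys in enumerate(labelsets)
                         if (a in ys) != (b in ys)]
    return tasks


def label_powerset(labelsets):
    """Map each distinct label combination to one meta class."""
    classes = {}
    targets = []
    for ys in labelsets:
        targets.append(classes.setdefault(frozenset(ys), len(classes)))
    return targets, classes


# Toy data (hypothetical): four labels, four training instances.
labels = ["a", "b", "c", "d"]
labelsets = [{"a", "b"}, {"c"}, {"a", "b"}, {"b", "d"}]

br = binary_relevance(labelsets, labels)
pw = pairwise(labelsets, labels)
lp_targets, lp_classes = label_powerset(labelsets)
```

With n = 4 labels, the sketch yields 4 BR subproblems, 6 pairwise subproblems and, for this toy data, 3 LP meta classes.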

2.8.2 Holistic Approaches

We consider holistic approaches to be approaches that try to solve a multilabel problem globally and jointly. This often involves solving one single optimization problem to find the decision function(s). Therefore, this strategy is called single-machine by ( ) or all-at-once by ( ). Holistic approaches are often able to learn risk minimizing models with respect to a particular metric.

( ) compares solving n independent optimization problems, each finding a function $h_i$ for class $\lambda_i$, to solving one global problem containing the n functions $h_i$. The findings indicate that the local approach is more advantageous for the particular case of using SVMs on multiclass problems, although this is not directly transferable to the multilabel case (cf. Section ).

The work of ( ) on adapting the SVM algorithm to the multilabel case is probably one of the most frequently cited works in MLC. A global optimization problem is formulated in order to minimize the ranking loss (RANKLOSS), i.e., to ensure that no irrelevant labels are ranked above relevant ones (Rank-SVM). The hypothesis space consists of n hyperplanes in the input space (cf. Section ), just as would be the case for a BR ensemble of SVMs. BP-MLL is a popular neural network algorithm for multilabel data which is trained in order to minimize the number of incorrectly paired labels in the output ranking ( ). To this end, RANKLOSS is reformulated as a differentiable version, which is a prerequisite for being used in back propagation. BP-MLL is similar to MMP, but more than one hidden layer is used, similarly to MLPP (cf. Chapter and Section ). The multiclass multilabel perceptron algorithm (MMP) ( , ) learns, similarly to Rank-SVM, one prototype for each label by globally optimizing an arbitrary ranking loss (cf. Section ). In contrast to Rank-SVM, however, this is done incrementally for each training example (cf. Section ).
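The quantity that Rank-SVM, BP-MLL and MMP aim to reduce can be made concrete with a small sketch of the ranking loss for a single instance. It assumes that the learner outputs one real-valued score per label and that ties count as errors; both the function name and the tie handling are illustrative assumptions.

```python
def rank_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) label pairs that are ordered wrongly.

    scores:   dict mapping each label to its predicted score (higher = more relevant)
    relevant: set of the truly relevant labels
    A tie between a relevant and an irrelevant label is counted as an error
    (an assumption of this sketch).
    """
    pos = [lab for lab in scores if lab in relevant]
    neg = [lab for lab in scores if lab not in relevant]
    if not pos or not neg:
        return 0.0  # no pair to compare
    wrong = sum(1 for p in pos for n in neg if scores[p] <= scores[n])
    return wrong / (len(pos) * len(neg))


# Hypothetical scores: only the pair (b, c) is ordered wrongly, 1 of 4 pairs.
scores = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.1}
loss = rank_loss(scores, relevant={"a", "b"})
```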

For the general case, we can also consider so-called case-based or lazy approaches and rule-based learners such as decision trees as holistic methods, as long as they are not based on binary models.

Case-based

The k-nearest neighbor approach (cf. , Sec. 3.8) to multilabel classification (ML-kNN) is inspired by Bayesian reasoning ( ). It combines the label distributions of the k neighbors and the a priori distribution in order to make a prediction. The concept that is followed in ( ) is closely related to the calibration technique described in Section . Each training instance is assumed to be associated with a tri-partite ranking $P_x \succ \lambda_0 \succ N_x$. The rankings of the k nearest neighbors are aggregated to one predicted ranking by computing the average ranks of the labels among the k tri-partite rankings.
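The rank aggregation over tri-partite rankings can be sketched as follows. The sketch assumes that tied labels receive their mid-rank and that the predicted labelset consists of all labels whose average rank lies above that of the virtual label; neither detail is specified above, and the names `tripartite_ranks`, `aggregate` and `VIRTUAL` are hypothetical.

```python
VIRTUAL = "lambda_0"  # artificial calibration label separating relevant from irrelevant


def tripartite_ranks(relevant, labels):
    """Ranks for the ranking relevant > VIRTUAL > irrelevant (ties get the mid-rank)."""
    p = len(relevant)
    n_irrelevant = len(labels) - p
    ranks = {}
    for lab in labels:
        if lab in relevant:
            ranks[lab] = (p + 1) / 2                       # positions 1..p
        else:
            ranks[lab] = p + 1 + (n_irrelevant + 1) / 2    # positions p+2..n+1
    ranks[VIRTUAL] = p + 1
    return ranks


def aggregate(neighbor_labelsets, labels):
    """Average the ranks over all neighbors and cut the ranking at the virtual label."""
    total = {lab: 0.0 for lab in labels + [VIRTUAL]}
    for ys in neighbor_labelsets:
        for lab, rank in tripartite_ranks(ys, labels).items():
            total[lab] += rank
    k = len(neighbor_labelsets)
    avg = {lab: s / k for lab, s in total.items()}
    predicted = {lab for lab in labels if avg[lab] < avg[VIRTUAL]}
    return avg, predicted


# Labelsets of three hypothetical nearest neighbors.
labels = ["a", "b", "c", "d"]
avg, predicted = aggregate([{"a", "b"}, {"a"}, {"a", "c"}], labels)
```

In this toy example only label a ends up ranked above the virtual label, since b and c are each supported by a single neighbor.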

Rule-based learner

An associative multilabel rule learner with several possible labels in the head of the rules was developed by ( ). These labels are found in the whole training set, while the multilabel lazy associative approach of ( ) generates the rules from the neighborhood of a test instance during prediction. Hoeffding trees were adapted for learning from large multilabel data streams by ( ). These trees are incrementally trainable, and it is proven that they approximate non-incremental decision trees trained with infinitely many training examples. Unfortunately, they are not suitable for streams with concept drift. Both ( ) and ( ) propose to use ensembles of random decision trees (RDTs) for multilabel classification. The main idea of these trees is described in more detail in Section .

2.8.3 Generative Approaches

The early work of ( ) constructs generative models for labelsets, which consist of a mixture of topic based word distributions. Parametric mixture models consist of probabilistic generative models for each label, in the form of class prototypes ( ). The greedy approach successively adds one label to the currently most probable labelset until it finds the maximum. The approach of ( ) assumes that documents are drawn from superpositions of n distributions $X_i$, one for each label. Unlikely label combinations are discarded during the prediction process in order to reduce the $2^n$ comparisons needed by the brute force approach. Latent Dirichlet allocation (LDA) is an unsupervised approach usually applied in order to generalize whole document corpora, which assumes that a document is sampled from a mixture of word distributions, the (virtual) topic models. ( ) estimate these word distributions directly on the training data, adopting the labels as topics. An additional LDA process on top of the labels is applied in order to model dependencies between the labels. More details on this approach can be found in Section .

2.8.4 Ensembles

One of the first multilabel ensemble techniques was boosting: the text and speech classification system Boostexter ( ) is based on the multilabel extensions of AdaBoost described by ( ). Rakel builds an ensemble of m label powerset classifiers ( ). Each LP learner focuses on a randomly chosen subset $L_i \subset L$ with $k = |L_i|$, i.e., the i-th learner receives $P_x \cap L_i$ as training signal for an instance x. Rakel is hence potentially able to model label dependencies from first to k-th order, breaking the limitation of pure LP, which is only able to detect simultaneous absences or presences of the label combinations $\{P_i\}$ given in the training set. Interestingly, the parameter k can be seen as a fader between BR and LP, from fully independent to fully connected classes. ( ) follow a very similar approach with their ensembles of pruned sets (PS). PS uses the LP transformation but prunes away infrequent labelsets P or decomposes them into more frequent subsets $P' \subset P$. The ensemble is created by applying PS to random subsets of the training data, and these models seem to outperform Rakel ensembles that use similar training time. Ensemble variants of probabilistic and simple classifier chains ( , , ) ensure a certain robustness towards different sequences of the models. ( ) further analyzes the combination with bagging, i.e., using random subsets of training instances and features. He finds that around 40% of the instances and features are sufficient in order to obtain comparable accuracy. This also holds for ensembles of BR classifiers and would likely also apply to pairwise ensembles. See Section for a discussion on general classifier chains.
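How the training signal of each Rakel member is derived can be sketched in a few lines. The sketch only constructs the random label subsets and the projected labelsets; training the LP learners and combining their votes at prediction time are omitted. The function name, the seeding and the toy data are assumptions for illustration.

```python
import random


def rakel_targets(labelsets, labels, k, m, seed=0):
    """Build the training targets for m label powerset learners.

    Each learner i draws a random label subset L_i of size k and sees, for
    every instance, only the projection of its labelset onto L_i.
    Returns a list of (L_i, projected labelsets) pairs.
    """
    rng = random.Random(seed)
    members = []
    for _ in range(m):
        subset = frozenset(rng.sample(labels, k))
        targets = [frozenset(ys) & subset for ys in labelsets]
        members.append((subset, targets))
    return members


# Toy data (hypothetical).
labels = ["a", "b", "c", "d"]
labelsets = [{"a", "b"}, {"c"}, {"a", "b"}, {"b", "d"}]

members = rakel_targets(labelsets, labels, k=2, m=3, seed=1)
# k = n recovers plain LP: the single learner sees the full labelsets.
full = rakel_targets(labelsets, labels, k=len(labels), m=1)
```

Setting k = 1 instead gives each learner a single label, which corresponds to the BR end of the spectrum described above.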

The following authors arrange their models as neural networks, but essentially use ensemble techniques in order to determine the hidden layer. ( ) builds a (non-linear) neural network with one hidden layer (ML-RBF). The neurons in this layer are simply a predetermined number k of centroids of each class, computed via k-means clustering (cf. , Sec. 4.7). They achieve consistently better results than another neural network (BP-MLL, see above), boosting approaches and Rank-SVM on scene, yeast and yahoo. The very recent LIFT algorithm similarly selects centroids (by k-means) in the positive and negative examples of each one-against-all problem and then replaces the original features of an instance by the distances to these representatives, separately for each BR subproblem ( ). ( ) trains an ensemble of ML-RBF networks and shuffles the prototypes in the hidden layers using an evolutionary operator, simultaneously optimizing the accuracy and diversity of the predictions. They outperform the baseline, ensembles of classifier chains and Rakel (see above) on yeast, scene and yahoo.

2.8.5 Instance Input Space Transformations

Several approaches rely on enriching or even replacing the input instance attributes by new features that are expected to provide additional information. The hope is often to be able to encode characteristics that help to exploit (conditional or unconditional) label dependencies. An example is the approach in Chapter , which adds features indicating the presence of exceptional local feature-labelset combinations. Stacking, i.e., the use of classifier predictions as features for the main learner at the bottom, is another popular approach in this context, and several techniques will be reviewed in more detail in Section . A special group of stacking approaches organize their learners in chains so that the prediction for a particular class depends on the predictions for the previous classes. Classifier chains (CC, ) are discussed more exhaustively in Section . Probabilistic CC generalize this concept by using probabilistic base classifiers ( ). Their approach ensures a Bayes optimal decision according to the conditional label dependencies, since Eq. is indeed approximated by $P(y|x) \approx h'_1(x) \prod_{i=2}^{n} h'_i(x, y_1, \ldots, y_{i-1})$ in their setting.
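The chain rule factorization behind probabilistic classifier chains can be illustrated by a short sketch that evaluates the product for every candidate labelset and returns the most probable one. The exhaustive search over all $2^n$ labelsets is an assumption of this sketch, as are the hand-written conditional probabilities, which stand in for trained probabilistic base classifiers.

```python
from itertools import product


def joint_probability(x, y, chain):
    """P(y|x) as the product of the chain's conditional probabilities.

    chain[i](x, previous) is assumed to return P(y_i = 1 | x, y_1, ..., y_{i-1}).
    """
    prob = 1.0
    for i, h in enumerate(chain):
        p_one = h(x, y[:i])
        prob *= p_one if y[i] == 1 else 1.0 - p_one
    return prob


def most_probable_labelset(x, chain):
    """Exhaustively search all 2^n label vectors for the joint mode."""
    candidates = product((0, 1), repeat=len(chain))
    return max(candidates, key=lambda y: joint_probability(x, y, chain))


# Hypothetical conditionals for two labels (x is ignored for brevity).
chain = [
    lambda x, prev: 0.4,                              # P(y1 = 1 | x)
    lambda x, prev: 0.95 if prev[0] == 1 else 0.55,   # P(y2 = 1 | x, y1)
]

best = most_probable_labelset(None, chain)
total = sum(joint_probability(None, y, chain) for y in product((0, 1), repeat=2))
```

In this toy example the joint mode is (1, 1) with probability 0.38, although deciding greedily along the chain would set the first label to 0. This is the kind of conditional dependence that the joint decision takes into account.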

Some kernel based approaches are presented in the following: ( ) adds label specific features to the input space and uses a specialized kernel in order to measure the label vector similarities. During prediction, a test example is subsequently enriched with the features for all possible label combinations and the highest scoring labelset is predicted. The authors derive an algorithm which is more efficient than this brute-force approach, but still requires $O(nm^3)$. Similarly, the general framework of SVMs for structured output spaces also supports the MLC setting by defining an appropriate kernel ( ). Here, too, a heuristic is necessary in order to circumvent the prohibitively high costs of enumerating all possible outputs.

2.8.6 Label Output Space Transformations

Transforming the label output space Y into a substitute, lower dimensional space $Y'$ is a relatively new idea. It is based on the observation that the output space is usually very sparse in multilabel problems (cf. Section ) and relies on techniques such as principal component analysis, which "compresses" by projecting to lower dimensional spaces. The objective of these approaches is not the improvement of accuracy, but the reduction of computational costs caused by the often observed direct dependence of training and testing time on the number of possible classes. This dependency is one of the leitmotifs in this work, and hence the development of this topic is followed with special interest by the author. Unfortunately, the works observed so far have to use regression algorithms, since the reduced space $\mathbb{R}^{|Y'|}$ is not binary anymore. Hashing, and particularly semantic hashing such as spectral hashing ( ), is a promising solution for use in the (strictly binary) frame of this work.

Compressed sensing ( ) exploits the sparsity of the output space and uses a compressing linear matrix A, which is filled with Gaussian, Bernoulli or uniformly distributed values. The binary output space is transformed ($A : Y \to \mathbb{R}^k$) into a $k = O(d)$ dimensional continuous space. Reconstruction techniques known from error correcting output codes (cf. Section ) map the prediction back into the original label space.
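The compression step can be sketched with a random Gaussian matrix. For the way back, the sketch uses a brute-force search over sparse label vectors as a simple stand-in for the reconstruction techniques mentioned above; it is not the decoder of the cited work. All names and sizes are assumptions chosen for illustration.

```python
import random
from itertools import combinations


def gaussian_matrix(k, n, seed=0):
    """k x n compression matrix A with standard normal entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]


def compress(A, y):
    """Project a binary label vector y of length n to k real values (A times y)."""
    return [sum(a * v for a, v in zip(row, y)) for row in A]


def reconstruct(A, z, max_labels):
    """Return the label vector with at most max_labels ones whose projection is closest to z."""
    n = len(A[0])
    best, best_err = None, float("inf")
    for size in range(max_labels + 1):
        for idx in combinations(range(n), size):
            y = [1 if j in idx else 0 for j in range(n)]
            err = sum((p - q) ** 2 for p, q in zip(compress(A, y), z))
            if err < best_err:
                best, best_err = y, err
    return best


# Ten labels compressed to four dimensions; the labelset has two relevant labels.
A = gaussian_matrix(k=4, n=10)
y = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
z = compress(A, y)
y_hat = reconstruct(A, z, max_labels=2)
```

In an actual system, z would be the output of k regressors and therefore noisy, so an exact recovery as in this noise-free example is not guaranteed.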

( ) in contrast interpret the label output space as an n-dimensional hypercube with the labelsets at the vertices. A singular value decomposition is used in order to determine new principal directions $e_i$. The $e_i$ hence form the projection matrix A, and the new $y'$ contain the projections of the original y on the $|Y'|$ directions $e_i$. Again, regression algorithms are then used to learn $(x, y')$. ( ) focus on hierarchical problems. The label output space is preprocessed in order to consider the hierarchical information and then compressed via kernel PCA (kernel dependency estimation) to an alternative space of only 50 dimensions. Their experiments on biological datasets with more than 4000 classes showed that their 50 regressors consistently improved time costs and accuracy compared to several hierarchical learners cited in Section .

( ) compare linear dimensionality reduction of the output space applied on the whole problem, solving a single optimization problem, with applying it to each of the pairwise subproblems. It is found that pairwise decomposition is more beneficial in terms of accuracy.

2.8.7 Alternative Structures and Formulations

Known multilabel problems often result from a simplification of more complex original tasks, such as hierarchical classification (cf. Section ). It is often also possible to induce certain structures on multilabel data which may help the learning process. The following approaches have in common that they rely on such an extended (re)formulation. ( ) consider a multilabel task as a hypergraph with the instances as nodes and n-ary edges for each label connecting all associated instances. This representation aims at exploiting label correlations by analyzing the hypergraph spectrum. ( ) see MLC in the context of collaborative filtering and interpret labels as users and documents as items. The label matrix $(y_i)_i$ is hence used in order to compute a probabilistic latent semantic analysis model which encodes the unconditional inter-label dependencies. Feeding a transductive SVM with this additional information outperforms the baseline SVM. ( ) similarly uses kernels on the label output space in order to exploit label correlations on multilabel graph data. They claim that different kernels cover different k-order dependencies and subsequently find that (infinite-order) RBF kernels substantially outperform polynomial kernels of different degrees and the linear kernel. The work of ( ) relies on relational multilabel data. Additional features are assumed that connect instances in the training set, e.g., co-authorship for author objects or common directors for movies. These connections are used to define a vicinity of the instances. The feature set for each training example is then extended by aggregating the neighbors' input and also output features in order to capture intra- and inter-instance label dependencies. During prediction, a greedy process begins with the original features and subsequently adds features from the expanded neighborhood. This collective approach clearly outperforms classifier chains and BR.