In this chapter we have proposed a supervised feature selection approach for multi-class classification problems using frequent subgraphs. Since we use a submodular selection criterion, we can provide optimality guarantees for the set of selected features obtained by greedy forward selection. Additionally, we have explained how to integrate this criterion directly into the subgraph
mining process by exploiting an upper bound for pattern-growth extension miners like gSpan. Moreover, we show how to use this bound on a set of pre-mined subgraphs, allowing for more flexibility in the choice of the type of subgraph used.
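As a rough illustration of greedy forward selection with a class-separation criterion, the following Python sketch repeatedly adds the feature that leaves the fewest cross-class pairs of instances indistinguishable. The function names, the dictionary layout of `features`, and the pair-counting criterion itself are assumptions made for this example; it is a stand-in for a separation-based criterion and is not taken from the chapter's definition of CORK, nor does it include the pruning bound used inside gSpan.

```python
from itertools import combinations


def unresolved_pairs(selected, features, labels):
    """Count cross-class instance pairs that agree on every selected feature.

    `features` maps a (hypothetical) subgraph name to a list of 0/1
    occurrence indicators, one entry per instance.
    """
    count = 0
    for a, b in combinations(range(len(labels)), 2):
        if labels[a] != labels[b] and all(
            features[f][a] == features[f][b] for f in selected
        ):
            count += 1
    return count


def greedy_select(features, labels, k):
    """Greedy forward selection: add the feature that separates most pairs."""
    selected = []
    remaining = set(features)
    current = unresolved_pairs(selected, features, labels)
    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic
        best = min(
            sorted(remaining),
            key=lambda f: unresolved_pairs(selected + [f], features, labels),
        )
        best_score = unresolved_pairs(selected + [best], features, labels)
        if best_score >= current:
            break  # no remaining feature separates any additional pair
        selected.append(best)
        remaining.remove(best)
        current = best_score
    return selected


# Toy data (made up): four graphs, two classes, three candidate subgraphs.
toy_features = {"A": [1, 1, 0, 0], "B": [1, 0, 1, 0], "C": [0, 0, 0, 0]}
toy_labels = [0, 0, 1, 1]
print(greedy_select(toy_features, toy_labels, k=2))
```

In this toy example, subgraph "A" alone separates the two classes, so the loop stops after one step even though up to two features were allowed.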
Similar to information theoretic criteria used for decision trees, CORK measures the quality of a set of features by means of its ability to separate target classes. In our experiments on classification benchmark datasets, the features selected by CORK reach the best accuracies among the filter methods. Among the wrapper methods, CORK outperforms MbT and DT MbT in all but one case. The LAR-LASSO method still achieves a more accurate classification; however, CORK has runtime advantages on pre-mined patterns and large subgraphs.
A strategy to further improve the runtime of our approach is to store the DFS search tree for a set of previously mined frequent subgraphs [154]. When restricting the mining procedure to a fixed minimum support value, this entails much shorter mining times, since gSpan effectively only has to be called once per feature selection step and not several times. Still, the feasibility of this approach obviously depends on the size of the DFS tree that has to be stored. The goal of future research is to find optimality guarantees for the horizontal leap search strategy for pattern mining proposed in [176], and to speed up CORK by employing this search strategy while maintaining its attractive theoretical properties. Another exciting question is whether our results on the optimality of supervised feature selection can be transferred to techniques for unsupervised feature selection on frequent subgraphs [21] (S. Nijssen, personal communication (2008, 2009)).
Finally, with regard to the overall scope of this thesis, we would like to explore imaging applications of the graph theoretic insights gained in this work.
Chapter 3
Similarity Estimation using Bayes Ensembles
Similarity search and data mining often rely on distance or similarity functions. Queries using these functions should detect instances which are considered to be similar on an intuitive level. Mostly, the underlying object representations, e.g. image features or laboratory measurements, do not reflect this intuition when being queried with standard distance measures like Lp norms. This
problem is also called the semantic gap. To bridge this gap between feature representation and object similarity, the distance function has to be adjusted to the current application context or the current user.
In [54], we have therefore proposed a probabilistic framework for estimating a similarity value based on a Bayesian setting. Our framework provides a trainable distance function for real-valued feature vectors. This function consists of an ensemble of weak Bayesian learners, each corresponding to a dimension of an implicit feature space. In order to find this implicit feature space with independent dimensions of maximum meaning for the current context, we apply a space transformation based on eigenvalue decomposition.
In our experiments, our new method shows promising results compared to related Mahalanobis learners on several test datasets with respect to nearest-neighbor classification and precision-recall graphs.
3.1 Introduction
Learning similarity functions is an important task for image retrieval and data mining in general. In data mining, distance measures can be used in various algorithms for classification and clustering. In order to improve classification, learned distance measures can be plugged into any instance-based learner such as a k-NN (k-nearest neighbor) classifier. Though clustering is basically an unsupervised problem, learning a similarity function on a small set of manually annotated objects is often sufficient to guide clustering algorithms into grouping semantically similar objects.
Adaptive similarity measures provide a powerful tool to bridge the semantic gap between object representations and user expectations. In most settings, the similarity between two objects cannot be described by a standardized distance measure fitting all applications. Instead, it is often a matter of application context and personal preference. Thus, two objects might be similar in one context while they are not in another. For example, assume an image collection of various general images of persons, vehicles, animals, and buildings. In this context, a picture showing a red Ferrari will be considered as quite similar to a picture of a red Volkswagen. Now, take the same images and put them into a different context like a catalogue of rental cars. In this more specialized context, both pictures will most likely be considered as dissimilar. An important assumption in this work is that there is no exact value specifying object similarity. Instead, we consider object similarity as the probability that a user would label the objects as similar.
Learning a distance or similarity function requires a general framework for comparing objects. In most established approaches to similarity learning, this framework is provided by using Mahalanobis distances or quadratic forms. In general, a Mahalanobis distance can be considered to be the Euclidean distance in a linear transformation of the original feature space. Thus, Mahalanobis distances are metric distance functions guaranteeing reflexivity, symmetry and the triangle inequality. Furthermore, the computed dissimilarity of two objects is unbounded and can grow arbitrarily large. We are going to argue that these mathematical characteristics are unnecessarily strict and sometimes even against intuition when trying to construct a similarity measure.
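To make the relation between Mahalanobis distances and linear transformations concrete, the following minimal Python sketch checks numerically that the quadratic-form distance sqrt((x − y)^T M (x − y)) with M = A^T A equals the Euclidean distance between Ax and Ay. The matrix A, the vectors, and all function names are hypothetical values chosen only for this example.

```python
import math


def matvec(A, v):
    """Multiply matrix A (list of rows) with vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def gram(A):
    """Return M = A^T A, which is symmetric positive semidefinite."""
    cols = list(zip(*A))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mahalanobis(x, y, M):
    """Quadratic-form distance sqrt((x - y)^T M (x - y))."""
    d = [a - b for a, b in zip(x, y)]
    Md = matvec(M, d)
    return math.sqrt(sum(a * b for a, b in zip(d, Md)))


# Hypothetical linear transformation and feature vectors.
A = [[2.0, 0.0], [1.0, 1.0]]
x = [1.0, 2.0]
y = [3.0, -1.0]

d_quadratic = mahalanobis(x, y, gram(A))
d_transformed = euclidean(matvec(A, x), matvec(A, y))
print(d_quadratic, d_transformed)
```

Both values coincide up to floating-point error. The sketch also shows the unboundedness discussed above: scaling the difference x − y by a factor c scales the distance by the same factor.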
As an example, it is known from cognitive science that humans do not distinguish dissimilar objects to an infinite degree. A human would not care whether object o1 is more dissimilar to the query object q than object o2 after having decided that both objects o1, o2 have nothing in common with the query object q. On the other hand, in most feature transformations, it is possible that two different objects are mapped to the same feature representation. Thus, even if we can guarantee that two objects having a zero distance are represented by the same feature description, we have no guarantee that the corresponding objects should be considered to be maximally similar as well.
Hence, inspired by an approach of [97], we describe similarity in a different way by considering it as the probability that an object o is relevant for a similarity query object q. The core idea of our similarity estimation approach is to consider each feature as evidence for similarity or dissimilarity. Thus, we can express the implication of a certain feature dimension i for the similarity of objects o and q as a probability p(similar(o, q) | o[i] − q[i]). To calculate this probability, we employ a simple one-dimensional Bayes estimate (BE). However, in order to build a statement comprising all available information about object similarity, we do not build the joint probability over all features. We argue that in most applications, considering a single feature is not sufficient to decide either similarity or dissimilarity. Thus, to derive a joint estimate considering all available features, we average the probabilities derived from each BE. Our new estimate is basically an ensemble of weak Bayesian learners. Therefore, we call our new dissimilarity function Bayes Ensemble Distance (BED). A major benefit of BED is that it is very insensitive to outlier values in a single dimension, whereas sensitivity to such outliers is a drawback of classical Lp-norm based measures. The major factors for successfully employing an ensemble of learners are the quality and the independence of the underlying weak classifiers. Therefore, we will introduce a new optimization problem that derives a linear transformation of the feature space, allowing the construction of more descriptive BEs. To summarize, the following sections will provide:
1. A discussion about Lp-norms and Mahalanobis distances for modeling object similarity.
2. A new framework for similarity estimation that is built on an ensemble of Bayes learners.
3. An optimization method for generating a linear transformation of the feature space that is aimed at deriving independent features which are suitable for training high quality weak classifiers.
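The idea of averaging per-dimension Bayes estimates, as outlined before this list, can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-dimension differences of similar and dissimilar pairs are modeled as zero-mean Gaussians with hypothetical standard deviations `sim_stds` and `dis_stds`, the prior `prior_sim` is a free parameter, and the distance is taken as one minus the averaged similarity probability. None of these modeling choices is confirmed by the text above; they merely stand in for whatever one-dimensional estimator is actually trained.

```python
import math


def log_gaussian(x, std):
    """Log density of a zero-mean Gaussian (an assumed model)."""
    return -0.5 * (x / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def bayes_estimate(delta, sim_std, dis_std, prior_sim=0.5):
    """p(similar | delta) for one dimension via Bayes' rule.

    Computed from the log-odds so that very large differences do not
    cause underflow or division by zero.
    """
    log_sim = log_gaussian(delta, sim_std) + math.log(prior_sim)
    log_dis = log_gaussian(delta, dis_std) + math.log(1.0 - prior_sim)
    t = log_sim - log_dis
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def bed(o, q, sim_stds, dis_stds, prior_sim=0.5):
    """Ensemble dissimilarity: one minus the mean per-dimension estimate."""
    probs = [
        bayes_estimate(a - b, s, d, prior_sim)
        for a, b, s, d in zip(o, q, sim_stds, dis_stds)
    ]
    return 1.0 - sum(probs) / len(probs)


# Hypothetical parameters: four dimensions, identical in all but the last one.
sim_stds = [1.0] * 4
dis_stds = [3.0] * 4
o = [0.0, 0.0, 0.0, 0.0]
mild_outlier = [0.0, 0.0, 0.0, 10.0]
extreme_outlier = [0.0, 0.0, 0.0, 1000.0]

print(bed(o, mild_outlier, sim_stds, dis_stds))
print(bed(o, extreme_outlier, sim_stds, dis_stds))
```

Under these assumptions, a single outlier dimension can change the result by at most 1/d for d dimensions, so the two printed values are practically identical, whereas the Euclidean distance between the same pairs differs by a factor of 100. Note also that identical objects do not receive distance zero here, which matches the argument that a zero feature difference need not imply maximal similarity.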
The rest of this chapter is organized as follows. In Section 3.2, we discuss Lp norms and Mahalanobis distances for modeling object similarity. Our framework for modeling object similarity is described in Section 3.3. In Section 3.4, we introduce an optimization problem to derive an affine transformation that allows the training of more accurate Bayes estimates. Section 3.5 briefly reviews related similarity learners. Afterwards, Section 3.6 illustrates the results of our experimental evaluation comparing our new method with related metric learners on several UCI classification datasets and two image retrieval datasets. Finally, Section 3.7 closes the chapter with a summary and some directions for future work.