Because of the many high-throughput techniques that can easily generate biological data, there is a need for efficient techniques that can extract useful knowledge from biological data. Relational learning and data mining techniques provide the strong expressiveness that is necessary to deal with biological tasks involving structured data, but often suffer from efficiency issues. The overall goal of this thesis is to improve the efficiency of these methods, as well as their applicability to real-life applications from biology and chemistry.
We will try to achieve this goal at two different levels. First, we will represent the biological data with graphs, as these have proved a promising alternative to more complex data structures such as logic programs. Molecules are a perfect example, as their atom-bond structure maps naturally onto a graph. Other biological data, such as the secondary structure of mRNA or networks of gene interactions, can also be represented easily by graphs. Moreover, as graphs and their properties have been thoroughly studied in mathematics, much of this research can be used to improve learning algorithms. If certain properties of the graphs that represent biological data can be exploited, more efficient algorithms can be developed. For example, graphs that represent molecules are known to have a bounded degree or to be planar.
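As a small illustration of the graph representation described above, the following sketch encodes a molecule as an undirected graph and checks its maximum vertex degree. The helper names (`build_graph`, `max_degree`) and the atom labels are hypothetical and chosen for this example only; ethanol is used because its structure is simple.

```python
from collections import defaultdict


def build_graph(bonds):
    """Build an undirected adjacency map from (atom, atom) bond pairs."""
    adj = defaultdict(set)
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def max_degree(adj):
    """Return the largest number of neighbours of any vertex."""
    return max(len(neighbours) for neighbours in adj.values())


# Ethanol (CH3-CH2-OH) with explicit hydrogens; the atom names are illustrative.
ethanol = build_graph([
    ("C1", "H1"), ("C1", "H2"), ("C1", "H3"), ("C1", "C2"),
    ("C2", "H4"), ("C2", "H5"), ("C2", "O"),
    ("O", "H6"),
])

print(max_degree(ethanol))
```

Here the maximum degree is set by the valence of carbon, which is the kind of structural bound an algorithm could exploit.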
Second, we will align the learning methods more closely with the biological task under consideration. For example, in gene function prediction, multiple classes need to be predicted and these classes are related to one another, so it makes sense to develop learning methods that take these relationships into account.
The main contributions of this thesis can be summarised as follows. In the first part of the thesis, we focus on the application of functional genomics. Here, the task is to predict the function of unknown genes found in the genome of an organism. It is known that a gene may have multiple functions. Moreover, biologists have organised these functions into hierarchies. This setting is known in machine learning as hierarchical multi-label classification (HMC). It is a variant of classification where instances may belong to multiple classes at the same time and where these classes are organised in a hierarchy. In this context, we focus on decision trees because of their interpretability, since we believe it is important for biologists to find out why certain functions are predicted.
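To make the HMC setting concrete, the sketch below shows the usual hierarchy constraint in a tree-shaped class hierarchy: an instance labelled with a class is also taken to belong to all of that class's ancestors. The toy hierarchy, its class identifiers, and the function names are assumptions made for illustration. A DAG-shaped hierarchy such as the Gene Ontology would need a map from each class to several parents.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {"1": None, "1/1": "1", "1/1/2": "1/1", "2": None, "2/3": "2"}


def ancestors(label, parent=PARENT):
    """Return all ancestors of a class, nearest first."""
    found = []
    current = parent[label]
    while current is not None:
        found.append(current)
        current = parent[current]
    return found


def close_under_hierarchy(labels, parent=PARENT):
    """Extend a label set so that it also contains every ancestor class."""
    closed = set(labels)
    for label in labels:
        closed.update(ancestors(label, parent))
    return closed


# A gene annotated with two functions from different branches of the hierarchy.
print(sorted(close_under_hierarchy({"1/1/2", "2"})))
```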
• The first contribution is the introduction of three different learning approaches for decision trees in the context of HMC, as well as an empirical study of their use in functional genomics. We compare learning a single HMC tree (which makes predictions for all classes together) to two approaches that learn a set of regular classification trees (one for each class). The first approach defines an independent single-label classification task for each class (SC). Obviously, the hierarchy introduces dependencies between the classes. While they are ignored by the first approach, they are exploited by the second approach, named hierarchical single-label classification (HSC). We compare the three approaches on 24 yeast datasets using as classification schemes MIPS’s FunCat (tree structure) and the Gene Ontology (DAG structure). We show that HMC trees outperform HSC and SC trees along three dimensions: predictive performance, model size, and induction time.
• The second contribution is an extensive comparison of the proposed HMC decision tree method to state-of-the-art methods for functional genomics. We show that our HMC trees obtain clearly better predictive performance than the trees found by previously proposed decision tree methods. We also introduce ensembles of HMC trees, which obtain better predictive performance than single trees and are competitive with statistical learning and functional linkage methods. In addition, the ensemble method is computationally efficient and easy to use.
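The difference between the SC and HSC approaches named in the first contribution can be sketched through the training sets they build for a single class. The sketch assumes one common way of exploiting the hierarchy in HSC: the classifier for a class is trained only on the instances that belong to its parent class. The text above does not spell out this detail, so the code should be read as an illustration rather than the exact procedure. The gene identifiers, the toy hierarchy, and the function names are hypothetical.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {"1": None, "1/1": "1", "2": None}

# Hypothetical genes with hierarchy-closed label sets.
DATA = {"g1": {"1", "1/1"}, "g2": {"1"}, "g3": {"2"}, "g4": set()}


def sc_task(cls, data):
    """SC: one binary task per class that uses every instance."""
    return {gene: cls in labels for gene, labels in data.items()}


def hsc_task(cls, data, parent):
    """HSC (assumed variant): keep only the instances of the parent class."""
    p = parent[cls]
    return {
        gene: cls in labels
        for gene, labels in data.items()
        if p is None or p in labels
    }


print(sc_task("1/1", DATA))
print(hsc_task("1/1", DATA, PARENT))
```

Under this assumption the HSC task for a deep class sees far fewer instances than the SC task. A single HMC tree, by contrast, predicts all classes at once.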
In the second part of the thesis, we focus on the application of learning structure-activity relationships (SAR), where the task is to predict properties of molecules based on their atom-bond structure. Since we want to preserve the structural information of the molecules, we will represent them as graphs.
• The third contribution involves the introduction of a polynomial algorithm to compute a maximum common subgraph (MCS) of two outerplanar graphs. By using the BBP subgraph isomorphism, which is especially designed for use in the SAR context, we show that this algorithm is significantly faster than a state-of-the-art MCS algorithm using the general subgraph isomorphism.
• The fourth contribution concerns a metric for structured data that is based on the proposed MCS algorithm. We evaluate the metric as a similarity measure for molecules and we show that it obtains state-of-the-art results for several SAR tasks. More generally, we show that using the BBP subgraph isomorphism as matching operator improves the predictive performance of graph mining methods.
• The fifth contribution is the application of the MCS algorithm to feature generation. Rather than mining for all optimal local patterns, we sample features from the set of pairwise maximum common subgraphs. We show that using simple sampling strategies we obtain significant gains in speed while at the same time improving the quality of the extracted features. We observe a significant increase in predictive performance when using maximum common subgraph features instead of frequent or correlated local patterns on 60 benchmark datasets from NCI. Moreover, with a much smaller set of features, it is possible to reach the same predictive performance as methods that exhaustively enumerate all possible patterns.
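To give a feel for how a distance can be derived from a maximum common subgraph, as in the fourth contribution, the sketch below implements one well-known MCS-based distance, 1 - |mcs(G1, G2)| / max(|G1|, |G2|). This is not necessarily the metric defined in this thesis. The function name is hypothetical, graph sizes are passed in as plain numbers, and the MCS size is assumed to come from a separate MCS algorithm.

```python
def mcs_distance(size_g1, size_g2, size_mcs):
    """Distance in [0, 1] from the sizes of two graphs and of their MCS.

    The sizes can be counted in any consistent way (for example, vertices
    plus edges), as long as the MCS is no larger than either graph.
    """
    if size_mcs > min(size_g1, size_g2):
        raise ValueError("the MCS cannot be larger than either graph")
    largest = max(size_g1, size_g2)
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - size_mcs / largest


# Identical graphs share everything; graphs with no common part share nothing.
print(mcs_distance(5, 5, 5))
print(mcs_distance(4, 6, 0))
```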