In many settings, frequent subgraph mining is followed by a feature-selection step. This step eases subsequent processes such as graph classification [CYH10a] and identifies the most significant features. The different proposals use various objective functions for feature selection. Cheng et al. [CYH10b], among others, have identified this two-step approach of mining and selecting as the computational bottleneck in many graph-mining applications: On the one hand, generating large numbers of frequent subgraphs to choose from is expensive and in certain applications even infeasible. On the other hand, the selection process can be expensive as well.
A number of studies investigate scalable subgraph-mining algorithms [CHS+08, RS09, SKT08, SNK+09, TCG+10, YCHY08]. They deal with the direct mining of subgraphs satisfying an objective function, instead of following the two-step approach. In other words, the subgraph sets mined might be incomplete with regard to the frequency criterion, but contain all (or most) graphs with regard to some other objective function. One can consider these functions to be constraints, as they narrow down the mining results. However, they do not necessarily fall into any of the constraint classes introduced in Section 2.3.3. Objective functions are based either on the ability of subgraphs to discriminate between classes or numerical values associated with the graphs [SKT08, SNK+09, TCG+10] or on some other measure of significance [CHS+08, RS09], or the choice is left to the user by allowing for interchangeable measures [YCHY08]. In the following, we look at the approaches mentioned in a little more detail.
Boosting-Based Approaches
The approach from Saigo et al. [SNK+09], gBoost, builds on a boosting technique with decision-stump classifiers. In each iteration, they search for the most promising classifier, consisting of a single discriminative subgraph. These promising subgraphs are found by repeatedly calculating structural objective functions measuring the discriminativeness. They do so in a pattern-growth search space similar to the one from gSpan [YH02] (see Section 2.3.3). The authors use their discriminativeness measure to refine pruning bounds in the search space in each iteration.
Saigo et al. [SKT08] refine their approach in the gPLS algorithm. It makes use of the same boosting technique and pattern search space, but relies on partial least-squares regression (PLS) to prune the search space and to select the most promising subgraphs.
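To illustrate the selection step of a single boosting round, the following minimal sketch picks the decision stump with the highest weighted gain. It assumes that the occurrence of every candidate subgraph in every graph has already been computed (the table `contains`), whereas gBoost searches the pattern space with pruning bounds instead of enumerating candidates. The function name, the data layout and the exact form of the gain are illustrative assumptions and are not taken from [SNK+09].

```python
from typing import Dict, List, Tuple


def best_stump(
    contains: Dict[str, List[bool]],
    labels: List[int],
    weights: List[float],
) -> Tuple[str, int, float]:
    """Return (pattern, omega, gain) of the stump with the highest weighted gain.

    contains[p][i] states whether candidate pattern p occurs in graph i
    (assumed to be precomputed), labels are +1/-1, and weights are the
    per-graph weights of the current boosting round.
    """
    if not contains:
        raise ValueError("no candidate patterns given")
    best: Tuple[str, int, float] = ("", 1, -1.0)
    for pattern, present in contains.items():
        # A stump outputs +1 if the pattern occurs in the graph and -1 otherwise.
        raw = sum(
            w * y * (1 if p else -1)
            for w, y, p in zip(weights, labels, present)
        )
        # omega flips the stump if the pattern indicates the negative class.
        omega = 1 if raw >= 0 else -1
        gain = abs(raw)
        if gain > best[2]:
            best = (pattern, omega, gain)
    return best
```

In a full boosting loop, the weights would be updated after each round so that the next stump concentrates on graphs that are still misclassified.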
A Leap-Search-Based Approach
Yan et al. [YCHY08] present the LEAP algorithm. It allows for the integration of different kinds of objective functions that are not anti-monotone. The idea of the algorithm is not to prune the search space, but to leap in this space. This is in contrast to performing a (pruned) strict depth-first search as done by algorithms such as gSpan [YH02] (see Section 2.3.3). In doing so, it makes use of the observation that structurally similar subgraphs tend to have similar support values and statistical significance scores. Building on this, the authors rely on a strategy that mines with an exponentially decreasing minimum support threshold. This leads to a fast discovery of (near-)optimal subgraphs. In the evaluation, the authors use the G-test as well as information gain as objective functions. The G-test is a measure of statistical significance, and the information gain measures the discriminativeness of a subgraph (see Definition 2.7). They successfully apply their technique to several datasets from the chemical domain. Furthermore, Cheng et al. [CLZ+09] employ the LEAP algorithm for call-graph-based defect localisation (see Chapter 5).
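Both objective functions can be computed from the support of a subgraph in the two classes. The sketch below uses the textbook definitions: information gain as the reduction of class entropy, and the G-test statistic G = 2 Σ O ln(O/E) on the 2×2 contingency table of occurrence versus class. The exact formulation in [YCHY08] may differ; all function and parameter names are illustrative assumptions.

```python
import math


def entropy(pos: int, neg: int) -> float:
    """Binary class entropy (in bits) of a set with pos/neg members."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(pos_with: int, neg_with: int,
                     pos_total: int, neg_total: int) -> float:
    """Entropy reduction obtained by splitting on 'subgraph occurs or not'."""
    n = pos_total + neg_total
    n_with = pos_with + neg_with
    h_with = entropy(pos_with, neg_with)
    h_without = entropy(pos_total - pos_with, neg_total - neg_with)
    h_after = (n_with / n) * h_with + ((n - n_with) / n) * h_without
    return entropy(pos_total, neg_total) - h_after


def g_test(pos_with: int, neg_with: int,
           pos_total: int, neg_total: int) -> float:
    """G statistic of the 2x2 table (occurrence x class)."""
    n = pos_total + neg_total
    observed = [
        [pos_with, neg_with],
        [pos_total - pos_with, neg_total - neg_with],
    ]
    row_sums = [sum(row) for row in observed]
    col_sums = [pos_total, neg_total]
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = row_sums[i] * col_sums[j] / n
                g += o * math.log(o / expected)
    return 2.0 * g
```

On such a 2×2 table the two scores are proportional (G equals 2 n ln 2 times the information gain in bits), so they rank subgraphs identically for a fixed dataset. Neither score is anti-monotone with respect to subgraph extension, which is why support-based pruning alone does not apply.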
Mining with Optimality Guarantees
Both boosting-based approaches [SNK+09, SKT08] and the LEAP algorithm [YCHY08] have proven to work well in their respective settings and evaluations. However, they do not provide optimality guarantees. Thoma et al. [TCG+10] present an approach, CORK, which integrates an objective function into the pattern-growth-based frequent subgraph miner gSpan [YH02] (see Section 2.3.3). They use this function to greedily prune the search space. The distinctiveness of their approach is that the objective function has the submodularity property, and the authors show that such functions used for pruning ensure near-optimal results. That is, CORK provides the optimality guarantee that almost all discriminative subgraphs useful for classification are found.
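The greedy use of such an objective function can be sketched as follows. The sketch assumes a correspondence-count criterion in the spirit of CORK: a correspondence is a pair of graphs from different classes that the selected features cannot tell apart. It also assumes precomputed binary occurrence features, whereas CORK evaluates its criterion during the gSpan search. All names are illustrative assumptions.

```python
from collections import Counter
from typing import List, Sequence


def correspondences(features: Sequence[Sequence[int]],
                    labels: Sequence[int],
                    selected: Sequence[int]) -> int:
    """Count pairs (positive graph, negative graph) that have identical
    values on all selected feature columns."""
    pos: Counter = Counter()
    neg: Counter = Counter()
    for row, y in zip(features, labels):
        key = tuple(row[j] for j in selected)
        if y == 1:
            pos[key] += 1
        else:
            neg[key] += 1
    return sum(count * neg[key] for key, count in pos.items())


def greedy_select(features: Sequence[Sequence[int]],
                  labels: Sequence[int],
                  max_features: int) -> List[int]:
    """Repeatedly add the feature that removes the most correspondences."""
    n_features = len(features[0]) if features else 0
    selected: List[int] = []
    current = correspondences(features, labels, selected)
    while len(selected) < max_features and current > 0:
        best_j, best_val = None, current
        for j in range(n_features):
            if j in selected:
                continue
            val = correspondences(features, labels, selected + [j])
            if val < best_val:
                best_j, best_val = j, val
        if best_j is None:  # no remaining feature improves the criterion
            break
        selected.append(best_j)
        current = best_val
    return selected
```

Greedy selection of this kind is the setting in which submodularity yields approximation guarantees: the benefit of adding a feature can only shrink as the selected set grows.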
A Partitioning-Based Approach
Ranu and Singh [RS09] investigate a setting that relies on significance (with respect to the statistical p-value measure) rather than on the ability to discriminate between classes. They observe that significant subgraphs might have any support value. In particular, significant subgraphs might have a support that is too low to be mined efficiently. This is because the runtime of frequent-subgraph-mining algorithms grows roughly exponentially as the minimum support value decreases. Based on this observation, they develop the GraphSig technique, which builds on two main steps: In the first step, they partition all graphs into sets such that all graphs in a set are likely to contain a common significant subgraph with a high support. They do so by using a technique similar to a sliding-window approach on the graphs, based on random walks. This generates a set of feature vectors for each graph. The authors then mine closed subfeature vectors which are significant and use each of them to form a group of all graphs containing that subfeature vector. In the second step, the authors make use of these groups of graphs. As these groups are relatively small, they apply a frequent-subgraph-mining technique on every group with a very small minimum support value. This procedure allows for finding significant subgraphs with a low support which cannot be discovered by traditional techniques due to scalability issues. In the evaluation, the authors demonstrate that their significant subgraphs are well-suited for graph-classification applications.
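The grouping at the end of the first step can be sketched as follows. The sketch assumes that a vector is a subfeature vector of another one if it is component-wise less than or equal to it, and that a graph contains a subfeature vector if at least one of its feature vectors dominates it. The significance test and the mining of closed subfeature vectors are omitted. The names and the containment definition are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def is_sub_vector(sub: Sequence[int], vec: Sequence[int]) -> bool:
    """Assumed containment: sub is component-wise <= vec."""
    return len(sub) == len(vec) and all(s <= v for s, v in zip(sub, vec))


def group_graphs(
    graph_vectors: Dict[str, List[Sequence[int]]],
    sub_vectors: List[Sequence[int]],
) -> Dict[Tuple[int, ...], List[str]]:
    """Map every (significant) subfeature vector to the graphs containing it."""
    groups: Dict[Tuple[int, ...], List[str]] = {}
    for sub in sub_vectors:
        groups[tuple(sub)] = [
            graph_id
            for graph_id, vectors in graph_vectors.items()
            if any(is_sub_vector(sub, vec) for vec in vectors)
        ]
    return groups
```

Each resulting group would then be handed to a frequent subgraph miner with a very low minimum support, which is feasible only because the groups are small.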
Mining Representative Subgraphs
Chaoji et al. [CHS+08] measure neither the significance of subgraphs nor their discriminativeness. They are concerned with finding subgraphs that are representative of the complete set of frequent subgraphs, yet not similar to the other graphs in the result set, with regard to the graph structure. To this end, the authors introduce parameter α ∈ [0, 1]: Frequent subgraphs in the result set have to have a similarity to the other graphs in the result set below value α. Furthermore, they introduce parameter β ∈ [0, 1]: For every frequent subgraph that is not part of the result set, there has to be at least one subgraph in the result set having a similarity of at least value β. In the ORIGAMI algorithm, the authors measure the similarity between two graphs by calculating the relative size of their maximum common subgraph. For mining frequent subgraphs which comply with the restrictions defined by the two parameters α, β, the authors mine a set of subgraphs in a first step. Instead of enumerating the complete set of such graphs, they adopt a random-walk approach which enumerates a subset of diverse subgraphs. In a second step, they extract the result set complying with the parameters. They do so by mapping the problem to a maximum-clique problem which they again solve with a randomised algorithm.
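The two conditions imposed by α and β can be checked as in the following sketch. It takes the pairwise similarities as given, since computing a maximum common subgraph is expensive and not shown here. Whether the bounds are strict is not specified above; the sketch assumes non-strict comparisons. All names are illustrative assumptions.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Sequence

Similarity = Callable[[str, str], float]


def similarity_from_table(table: Dict[FrozenSet[str], float]) -> Similarity:
    """Build a symmetric similarity function from precomputed pair values."""
    def sim(a: str, b: str) -> float:
        return 1.0 if a == b else table[frozenset((a, b))]
    return sim


def is_alpha_orthogonal(result: Sequence[str], sim: Similarity,
                        alpha: float) -> bool:
    """All pairs in the result set have similarity of at most alpha."""
    return all(sim(a, b) <= alpha for a, b in combinations(result, 2))


def unrepresented(frequent: Sequence[str], result: Sequence[str],
                  sim: Similarity, beta: float) -> List[str]:
    """Frequent patterns outside the result set for which no pattern in the
    result set reaches similarity beta."""
    chosen = set(result)
    return [
        p for p in frequent
        if p not in chosen and not any(sim(p, r) >= beta for r in result)
    ]
```

A result set satisfies both parameters if `is_alpha_orthogonal` holds and `unrepresented` returns an empty list.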
Subsumption
Various researchers have studied scalable mining of subgraph patterns, with much success. However, they have not taken weights into account. In this dissertation, in particular in Chapter 8, we use measures building on edge weights as objective functions to decide which graphs are significant. The usage of weights allows for a more detailed analysis than one based on the graph structure alone. Like the previous approaches, ours does not necessarily produce graph sets which are complete with regard to frequency or some other hard constraint.