FSG was initially proposed in 2001 by Kuramochi and Karypis (2004). It implements Algorithm 2.1 (i.e., it is a levelwise algorithm) for mining all frequent subgraphs in graph transaction databases. To compute the support of a candidate pattern, FSG stores the support set of each frequent pattern and intersects the support sets of the parent patterns. This reduces the number of explicit subgraph isomorphism tests that must be evaluated for a candidate pattern: the downward closure property ensures that a candidate can only be subgraph isomorphic to those graphs in which all of its subpatterns are present, and hence only such graphs must be explicitly evaluated using the embedding operator. This method requires space proportional to the size of the support sets of the frequent patterns in two consecutive levels of the pattern lattice (the current and the previous level). The authors disclose neither the implementation details of their embedding operator nor a reference for it. Nor do they mention additional storage requirements for storing embeddings explicitly, which might indicate that they use an algorithm that does not require such knowledge. The authors evaluate FSG on chemical and artificial graph datasets. They do not describe how exactly the artificial graphs are generated. Their graph database generator, however, is used by several other authors to evaluate their approaches (e.g., Yan and Han, 2002; Zhao and Yu, 2008). On some graph databases, the performance of FSG decreases drastically (compare Chapter 4). Notably, the algorithm was used by Deshpande et al (2005) to first show the impressive predictive performance of frequent subgraph based learners on chemical graph datasets.
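The support set intersection described above can be illustrated with a minimal sketch. It is not the FSG implementation, whose embedding operator is not disclosed. All names are hypothetical, and the toy subgraph test (edge set inclusion on graphs with unique edge names) is merely a stand-in for a real subgraph isomorphism test.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Set

# Assumption: a toy graph is a frozenset of uniquely named edges.
ToyGraph = FrozenSet[str]


def toy_is_subgraph(pattern: ToyGraph, graph: ToyGraph) -> bool:
    """Stand-in for the embedding operator (NOT a real isomorphism test)."""
    return pattern <= graph


def candidate_support_set(
    parent_support_sets: Iterable[Set[int]],
    database: Dict[int, ToyGraph],
    candidate: ToyGraph,
    is_subgraph: Callable[[ToyGraph, ToyGraph], bool] = toy_is_subgraph,
) -> Set[int]:
    """Return the ids of all transactions that contain the candidate.

    By downward closure, only transactions that occur in the support set
    of every parent pattern can contain the candidate, so the expensive
    test is evaluated on the intersection of the parent support sets only.
    """
    parent_sets = [set(s) for s in parent_support_sets]
    # Without known parents, every transaction has to be tested.
    possible = set.intersection(*parent_sets) if parent_sets else set(database)
    return {tid for tid in possible if is_subgraph(candidate, database[tid])}
```

In this sketch, a candidate whose parents have support sets {0, 1, 2} and {0, 2} triggers only two explicit tests, regardless of the size of the database.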
3.2. Algorithms for the FCSM Problem
Borgelt et al propose MoSS, a frequent subgraph miner specifically suited for chemical graph databases (Borgelt and Berthold, 2002; Borgelt et al, 2005). Their algorithm incorporates special domain knowledge (e.g., handling of aromatic bonds) and is a depth-first search over a pattern space that can be “seeded” with a chemically meaningful core pattern that will be contained in all frequent patterns to be found. The authors use embedding lists to compute the support count; their approach, however, lacks a graph canonicalization scheme. Hence patterns are enumerated multiple times (and their support is computed multiple times). Without giving details, the authors claim that multiple output of equivalent patterns can be suppressed (which would require deciding the isomorphism problem for pairs of patterns). The authors show experiments in which they qualitatively analyze the patterns found by their approach on the NCI-HIV dataset.
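To illustrate how embedding lists yield support counts, the following sketch extends all stored embeddings of a pattern by one new vertex. It is a simplified illustration under stated assumptions, not the MoSS implementation: edges are unlabeled, only extensions by a new vertex are handled, and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumption: a transaction graph is (vertex labels, adjacency sets).
Graph = Tuple[Dict[int, str], Dict[int, Set[int]]]
# Pattern vertex i is mapped to the graph vertex embedding[i].
Embedding = Tuple[int, ...]
EmbeddingLists = Dict[int, List[Embedding]]


def extend_embeddings(
    embedding_lists: EmbeddingLists,
    database: Dict[int, Graph],
    anchor: int,
    new_label: str,
) -> EmbeddingLists:
    """Attach a new vertex with label new_label to pattern vertex anchor.

    Each stored embedding is extended in every possible way; transactions
    without any extended embedding drop out of the result.
    """
    extended: EmbeddingLists = {}
    for tid, embeddings in embedding_lists.items():
        labels, adjacency = database[tid]
        new_embeddings = [
            emb + (v,)
            for emb in embeddings
            for v in sorted(adjacency[emb[anchor]])
            if v not in emb and labels[v] == new_label
        ]
        if new_embeddings:
            extended[tid] = new_embeddings
    return extended


def support(embedding_lists: EmbeddingLists) -> int:
    """Support count: number of transactions with at least one embedding."""
    return len(embedding_lists)
```

No subgraph isomorphism test is needed in this scheme, at the price of storing every embedding. A symmetric pattern is stored once per automorphism, which hints at the memory cost of the approach.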
gSpan by Yan and Han (2002) mines frequent subgraphs using a depth-first traversal of the pattern space. To avoid multiple enumeration of the same candidate pattern (up to isomorphism), it applies an inclusion-exclusion principle on frequent edges. That is, a pattern is extended with an ever shrinking set of frequent edges. To compute the support of a candidate pattern, the algorithm recursively works on the support sets of the patterns being extended, resulting in a reduced number of calls to the embedding operator. Though the authors neither cite nor mention it in the paper, the acknowledgments suggest that gSpan uses the subgraph isomorphism algorithm by Cordella et al (1999). They run experiments on the datasets used by Kuramochi and Karypis (2004) and show that their algorithm outperforms FSG. In (Yan and Han, 2003) the authors extend their algorithm to mine closed frequent subgraphs.
Huan et al (2003) propose FFSM, an algorithm that also mines frequent subgraphs using a depth-first traversal of the pattern space. They use a novel canonical representation of arbitrary graphs that has size O(n^2) for a graph on n vertices and propose extension and join operators that generate all frequent patterns. However, these operators may generate patterns multiple times, not necessarily in canonical form. Without giving details, the authors claim to be able to decide whether a representation is canonical, and hence that the algorithm is correct (i.e., each pattern is printed exactly once up to isomorphism). They use embedding lists to store all possible embeddings of the frequent patterns in canonical form and show how their extension and join operators can use the embedding lists to output only frequent patterns. The authors later extend their work to maximal frequent subtrees, resulting in the SPIN algorithm (Huan et al, 2004).
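The idea of a canonical representation of size O(n^2) can be illustrated by a brute-force sketch: among all vertex orderings, choose the lexicographically largest code of the lower triangular adjacency matrix. This is not the procedure of FFSM, which is not detailed above. The sketch takes time exponential in n, and all names as well as the placeholder "0" for a missing edge are assumptions.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Tuple


def canonical_code(
    labels: Dict[int, str],
    edges: Dict[FrozenSet[int], str],
) -> Tuple[str, ...]:
    """Return a code that is identical for isomorphic labeled graphs.

    labels maps vertices to vertex labels; edges maps frozenset({u, v})
    to an edge label. Edge labels must differ from the placeholder "0".
    The code has n * (n + 1) / 2 entries for a graph on n vertices.
    """
    best: Tuple[str, ...] = ()
    for order in permutations(labels):
        code = []
        for i, u in enumerate(order):
            # Row i of the lower triangle: edges to all earlier vertices ...
            for v in order[:i]:
                code.append(edges.get(frozenset((u, v)), "0"))
            # ... followed by the vertex label on the diagonal.
            code.append(labels[u])
        best = max(best, tuple(code))
    return best
```

Two patterns are isomorphic if and only if their codes coincide, so a miner can compare codes instead of testing pairs of patterns for isomorphism.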
Gaston (Nijssen and Kok, 2004, 2005) was found to be the fastest frequent subgraph mining system on chemical graph databases in the comparison by Wörlein et al (2005). The algorithm mines frequent patterns in three stages: First, all frequent paths are generated. In the second stage, tree candidates are grown from the frequent paths. Finally, frequent cyclic graphs are grown from the frequent trees and frequent paths by adding edges between existing vertices. Hence Gaston can be seen as both a specialized frequent subtree mining algorithm and a frequent subgraph mining algorithm: the generation of cyclic graphs can be avoided, without overhead, by stopping after the tree generation step. Candidate generation is based on an efficient canonical representation of graphs that is based on depth-first sequences; only extensions of patterns that are in canonical form are further expanded. This property can be checked in constant time for trees and paths, yielding a very fast enumeration of candidate patterns; for cyclic graphs, however, this property is more difficult to check. Gaston traverses the pattern space in a nonstandard postorder: the support of all extensions of a frequent pattern in canonical form is evaluated before the search function is called recursively for the first (frequent) extension. In this way, the number of allowed extension operations can be restricted efficiently, yielding a smaller number of candidate extensions in subsequent steps. There are two variants of Gaston that differ in their support counting subroutine. The first variant uses embedding lists; the second computes the subgraph isomorphisms “from scratch” for each candidate pattern. The authors are not very specific about the details of the latter. They describe it as a backtracking algorithm whose worst-case running time is exponential in the size of the pattern and the transaction graphs involved. They evaluate their algorithm on an artificial tree dataset and on three large molecular datasets.
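The three stages correspond to three classes of connected patterns. The following sketch, which is not taken from Gaston and uses hypothetical names, classifies a pattern accordingly. It assumes that the pattern is a connected simple graph, in which case it is a tree if and only if it has exactly one edge fewer than vertices.

```python
from collections import Counter
from typing import Iterable, Tuple


def pattern_class(num_vertices: int, edges: Iterable[Tuple[int, int]]) -> str:
    """Classify a connected simple pattern as path, tree, or cyclic graph."""
    edge_list = list(edges)
    # A connected graph with more than n - 1 edges contains a cycle.
    if len(edge_list) != num_vertices - 1:
        return "cyclic graph"
    degree: Counter = Counter()
    for u, v in edge_list:
        degree[u] += 1
        degree[v] += 1
    # A tree in which no vertex has more than two neighbors is a path.
    if all(d <= 2 for d in degree.values()):
        return "path"
    return "tree"
```

Adding an edge between two existing vertices of a tree or path always produces a cyclic graph, which is why the third stage never returns to the first two classes.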
Horváth and Ramon (2010) propose an algorithm that mines all frequent connected subgraphs in transaction databases consisting of graphs of bounded tree-width. Impressively, their algorithm runs in incremental polynomial time, although the embedding operator by itself is NP-complete (compare Section 3.1). That is, the SubgraphIsomorphism problem is NP-complete for transaction graphs with tree-width at most some constant k unless the vertex degree of the pattern is bounded by a constant as well. To the best of our knowledge, this is the only existing result that describes an efficient algorithm for a problem in the upper left quadrant of Figure 2.2: The HamiltonianPath problem can be solved in polynomial time due to the result of Matoušek and Thomas (1992), as paths have vertex degree at most two. Their algorithm identifies a polynomially sized subset of non-redundant iso-quadruples that are stored for each frequent subgraph and each transaction. Such iso-quadruples represent partial subgraph isomorphisms but, in contrast to explicitly storing all possible embeddings of the patterns into the transaction graphs, a single iso-quadruple may represent multiple embeddings of the pattern that are in some sense equivalent. Their embedding operator extends ideas from (Hajiaghayi and Nishimura, 2007) to the case that the vertex degree of the pattern is unbounded. Interestingly, the approach of Horváth and Ramon requires a breadth-first traversal of the pattern space to result in an efficient algorithm. They show that almost all (>99.9%) of the graphs in a large chemical graph database have tree-width at most 3, and hence that their result is practically relevant, but they do not give any empirical evaluation of their algorithm. Horváth et al (2013) extend these techniques to mine all frequent induced subgraphs in transaction databases consisting of bounded tree-width graphs with unbounded vertex degree in incremental polynomial time.