Several algorithms have been defined for finding frequent subgraphs and their embeddings in one large graph or in a data set of graphs. We distinguish algorithms for ‘graph data set mining’, which work on a data set of graphs, from algorithms for ‘large graph mining’, which discover frequent motifs in one large graph. While frequent subgraph algorithms that work on large graphs can be applied directly to data sets of graphs, the other direction is more complicated. However, by splitting a large graph into subgraphs, one can still use a graph data set mining algorithm for frequent subgraph discovery on the large graph, although some subgraphs might be lost in the splitting.
8.2.1 Graph Data Set Mining
For the graph data set mining task, approaches can be broadly divided into two classes: apriori-based approaches and pattern-growth based approaches. AGM (Apriori-based Graph Mining) [86] determines subgraphs G′ in a data set DS of graphs that occur in at least minsup percent of all graphs in the data set. AGM works on graphs with edge and node labels. In principle, AGM uses the well-known apriori principle [1] of iterated candidate generation and candidate evaluation. Candidate generation means that candidate subgraphs are created by joining subgraphs that have been shown to be frequent in earlier iterations. In the candidate evaluation phase, these candidates are tested, i.e. it is checked whether their frequency is at least minsup. The whole process is iterated until all frequent patterns have been found. AGM uses a canonical form and a normal form to represent subgraphs, which reduces the runtime cost of subgraph isomorphism checking.
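The level-wise loop of candidate generation and candidate evaluation can be sketched as follows. This is a minimal illustration, not AGM itself: it assumes that every node label is unique across the data set, so that a graph can be stored as a set of edges and subgraph containment reduces to a subset test (no isomorphism check or canonical form is needed). The function names `support` and `apriori_edge_sets` are hypothetical.

```python
from itertools import combinations

def support(pattern, dataset):
    """Fraction of graphs that contain every edge of the pattern.

    Assumption: node labels are unique, so containment is an edge-subset test.
    """
    return sum(pattern <= g for g in dataset) / len(dataset)

def apriori_edge_sets(dataset, minsup):
    """Level-wise mining of frequent edge sets (hypothetical helper)."""
    all_edges = {e for g in dataset for e in g}
    # Level 1: frequent single-edge patterns.
    level = {frozenset([e]) for e in all_edges}
    level = {p for p in level if support(p, dataset) >= minsup}
    frequent = {}
    while level:
        for p in level:
            frequent[p] = support(p, dataset)
        # Candidate generation: join two frequent k-edge patterns
        # that differ in exactly one edge into a (k+1)-edge candidate.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Apriori pruning: every k-edge sub-pattern must be frequent.
        candidates = {c for c in candidates
                      if all(c - {e} in level for e in c)}
        # Candidate evaluation: keep candidates that reach minsup.
        level = {c for c in candidates if support(c, dataset) >= minsup}
    return frequent
```

Unlike a real graph miner, this sketch does not require patterns to be connected and does not handle repeated labels; those cases are exactly where canonical forms and isomorphism tests become necessary.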
Similar to AGM, FSG (Frequent SubGraph Discovery) [100] uses a canonical labeling based on the adjacency matrix. Canonical labeling, candidate generation and evaluation are sped up in FSG by using graph invariants and the Transaction ID principle, which stores the IDs of the transactions a subgraph appeared in. This speed-up is paid for by reducing the class of subgraphs discovered to connected subgraphs, i.e. subgraphs where a path exists between all pairs of nodes.
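The benefit of storing transaction IDs can be illustrated with a small sketch. A candidate joined from two frequent parents can only occur in transactions that contain both parents, so intersecting the parents' ID sets gives an upper bound on its support and restricts the graphs that must be tested. The code below is an illustration under simplifying assumptions, not FSG's implementation: graphs are edge sets with unique node labels (so containment is a subset test), and the names `tid_list` and `evaluate_candidate` are hypothetical.

```python
def tid_list(pattern, dataset):
    """IDs of the transactions (graphs) that contain the pattern.

    Assumption: unique node labels, so containment is an edge-subset test.
    """
    return {tid for tid, g in enumerate(dataset) if pattern <= g}

def evaluate_candidate(candidate, tids_a, tids_b, dataset, min_count):
    """Evaluate a candidate using the TID lists of its two parents.

    Returns the candidate's TID list, or None if it is infrequent.
    """
    # The candidate can only occur where both parents occur.
    possible = tids_a & tids_b
    if len(possible) < min_count:
        return None  # pruned without any containment test
    # Only the transactions in the intersection need to be checked.
    tids = {t for t in possible if candidate <= dataset[t]}
    return tids if len(tids) >= min_count else None
```

In this sketch, `min_count` is an absolute number of graphs rather than a percentage, which avoids floating-point comparisons.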
The most well-known member of the class of pattern-growth algorithms, gSpan (graph-based Substructure pattern mining), discovers frequent substructures efficiently without candidate generation [175]. Tree representations of graphs are encoded using a Depth First Search (DFS) code, amongst which a minimum DFS code is chosen according to some lexicographic order. Pre-order DFS-tree search is then conducted to find the complete set of frequent subgraphs in a set of graphs. gSpan is efficient, both w.r.t. runtime and memory requirements, making it one of the best state-of-the-art algorithms for graph data set mining.
CloseGraph [176] extends gSpan by limiting the search to closed frequent graphs, i.e. frequent subgraphs that have no supergraph with the same support, thereby increasing the efficiency of mining substantially.
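The definition of a closed pattern can be made concrete with a small post-hoc filter. Note that this only illustrates the definition: CloseGraph prunes during the search rather than filtering afterwards. The sketch assumes patterns are stored as frozensets of edges (so "supergraph" reduces to "proper superset") with absolute support counts; the function name `closed_patterns` is hypothetical.

```python
def closed_patterns(frequent):
    """Keep patterns that have no proper super-pattern with equal support.

    `frequent` maps frozenset-of-edges patterns to integer support counts
    (an assumed representation, not CloseGraph's data structure).
    """
    closed = {}
    for pattern, sup in frequent.items():
        has_equal_super = any(
            pattern < other and sup == other_sup
            for other, other_sup in frequent.items()
        )
        if not has_equal_super:
            closed[pattern] = sup
    return closed
```

For example, if the edge B-C always occurs together with the edge A-B, the single-edge pattern {B-C} is not closed, because the two-edge pattern has the same support and carries strictly more information.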
8.2.2 Large Graph Mining
Unlike graph data set mining, large graph mining aims to find subgraphs that have at least minsup embeddings in one large graph.
Milo et al. [118] were the first to define network motifs and to search for them. Because it exhaustively enumerates all subgraphs, this method is limited to subgraphs with three or four nodes.
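To see why exhaustive enumeration only scales to very small subgraphs, consider the following sketch of a three-node census. It is a simplified illustration rather than the method of [118]: it assumes an undirected, unlabeled graph given as an edge list, counts induced subgraphs only, and omits any comparison against randomized networks. The function name `three_node_census` is hypothetical.

```python
from itertools import combinations

def three_node_census(edges):
    """Count connected induced 3-node subgraphs in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {"path": 0, "triangle": 0}
    # Visit every node triple: O(n^3) triples for n nodes.
    for a, b, c in combinations(sorted(adj), 3):
        k = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if k == 3:
            counts["triangle"] += 1
        elif k == 2:
            counts["path"] += 1  # two edges on three nodes form a path
    return counts
```

The number of node subsets to inspect grows as n choose k, which is why this brute-force strategy quickly becomes infeasible as the subgraph size k increases.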
SUBDUE [36] tries to minimize the description length of a graph, following the minimum description length (MDL) principle, by compressing frequent subgraphs. Frequent subgraphs are replaced by one single node and the description length of the remaining graph is then determined. Those subgraphs whose compression minimizes the description length are considered frequent patterns in the input graph. The candidate graphs are generated starting from single nodes and growing to subgraphs with several nodes, using a computationally constrained beam search.
GREW [102] and SUBDUE are greedy heuristic approaches to frequent graph mining that trade completeness of the solution for speed. GREW iteratively joins frequent pairs of nodes into one super-node and determines disjoint embeddings of connected subgraphs by a maximal independent set algorithm. Similarly, hSIGRAM and vSIGRAM [101] find subgraphs that are frequently embedded within a large sparse graph, using “horizontal” (h) breadth-first search and “vertical” (v) depth-first search, respectively. They employ efficient algorithms for candidate generation and candidate evaluation that exploit the sparseness of the graph.
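The role of the maximal independent set can be sketched as follows. Two embeddings conflict if they share a node; choosing a set of pairwise non-conflicting embeddings is an independent set in the corresponding overlap graph. The sketch below uses a simple greedy pass, which is an assumption for illustration and not necessarily the algorithm used by GREW; the function name `disjoint_embeddings` is hypothetical, and embeddings are assumed to be given as frozensets of node IDs.

```python
def disjoint_embeddings(embeddings):
    """Greedily pick a maximal set of node-disjoint embeddings.

    The result is maximal (no further embedding can be added) but
    not necessarily maximum (the largest possible such set).
    """
    chosen, used = [], set()
    # A fixed processing order makes the result deterministic.
    for emb in sorted(embeddings, key=lambda e: sorted(e)):
        if used.isdisjoint(emb):
            chosen.append(emb)
            used |= emb
    return chosen
```

The number of chosen embeddings then gives a count of non-overlapping occurrences, which is a lower bound on the largest possible number of disjoint embeddings.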
In [106], different sampling methods for a single large graph are compared. These algorithms sample subgraphs from a large network and estimate their frequency rather than resorting to heuristics.
Kashtan et al. [93] presented a sampling method for subgraph counting that picks a random edge from the network and then expands the corresponding subgraph by successively picking random neighboring edges. In contrast to picking only one single edge, Zhou et al. [182] presented a method for frequent subgraph mining using ‘random areas selection sampling’, which randomly selects one area of the large graph. FANMOD, the algorithm by [172], uses a randomized enumeration strategy for sampling subgraphs. If one wants neither to resort to heuristics nor to miss any embeddings of a frequent subgraph, one may employ this enumeration strategy to enumerate all subgraphs exhaustively rather than sampling randomly from the enumeration.
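The edge-expansion idea attributed to Kashtan et al. above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the graph is undirected and given as an edge list, the function name `sample_subgraph` is hypothetical, and the sketch omits any correction for the fact that different subgraphs are reached with different probabilities, which an actual frequency estimator would need to account for.

```python
import random

def sample_subgraph(edges, n, rng):
    """Sample the node set of a connected n-node subgraph.

    Starts from a random edge and repeatedly adds a random edge that
    has exactly one endpoint inside the current node set.
    Returns None if the subgraph cannot be grown to n nodes.
    """
    edges = [tuple(e) for e in edges]
    nodes = set(rng.choice(edges))
    while len(nodes) < n:
        # Edges that would add exactly one new node.
        frontier = [e for e in edges if (e[0] in nodes) != (e[1] in nodes)]
        if not frontier:
            return None  # connected component is too small
        nodes.update(rng.choice(frontier))
    return frozenset(nodes)
```

Calling `sample_subgraph(edges, 3, random.Random(0))` repeatedly and tallying the resulting node sets by subgraph type yields raw sample counts; turning these into frequency estimates requires the bias correction that this sketch leaves out.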