We now discuss the significance of our results in a broader context and outline some directions for future work. For a more detailed overview of the technical contributions, we refer the reader to Section 1.2.
7.1. Discussion
In this thesis we have proposed a system that allows distance-based learning methods to be applied to arbitrary graph databases. This has been achieved by considering frequent subtrees as patterns and by relaxing the completeness requirement on both the mining process and the embedding operator. In particular, we have defined probabilistic frequent subtrees and shown that they can be mined with polynomial delay in arbitrary graph databases. As a complementary contribution, we have shown how to quickly compute feature vectors for arbitrary graphs, given a set of tree patterns that span the feature space.
With these two steps, we have provided the tools required to apply probabilistic frequent subtrees in real-life learning scenarios. First, a suitable feature representation of an unknown graph distribution can be learned from a sample using the results from Chapters 4 and 5. Second, a model can be learned that is based on a suitable similarity measure on this feature representation. Finally, the model can be applied to new, unseen graphs by first computing a feature representation using the results from Chapter 6 and then feeding it to the model.
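The three steps above can be illustrated with a minimal sketch. It assumes binary feature vectors, a Jaccard similarity, and a 1-nearest-neighbor model; none of these choices is prescribed by the text. The names `feature_vector`, `jaccard`, `predict_1nn`, and the callback `contains` are hypothetical. In particular, `contains` merely stands in for an embedding operator, and the toy graphs are plain sets of pattern names rather than real graphs.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Vector = Tuple[int, ...]


def feature_vector(graph, patterns: Sequence, contains: Callable) -> Vector:
    """Entry i is 1 iff the (assumed) embedding operator reports pattern i in graph."""
    return tuple(1 if contains(p, graph) else 0 for p in patterns)


def jaccard(a: Vector, b: Vector) -> float:
    """Jaccard similarity of two binary vectors (an assumed similarity measure)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


def predict_1nn(query: Vector, training: List[Tuple[Vector, Hashable]]) -> Hashable:
    """Return the label of the most similar training vector."""
    return max(training, key=lambda item: jaccard(query, item[0]))[1]


# Toy data: each "graph" is a set of pattern names it contains (a stand-in only).
patterns = ["p1", "p2", "p3"]
contains = lambda pattern, graph: pattern in graph

training = [
    (feature_vector({"p1", "p2"}, patterns, contains), "A"),
    (feature_vector({"p1", "p3"}, patterns, contains), "B"),
]
query = feature_vector({"p2"}, patterns, contains)
prediction = predict_1nn(query, training)
```

Any other similarity measure or distance-based learner could be substituted for `jaccard` and `predict_1nn` without changing the overall workflow.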
Our methods do not assume any structural or other restriction on the graph databases at hand. That is, the guarantee of polynomial delay holds for arbitrary graph databases and makes it possible to mine probabilistic frequent subtrees even in graph databases where state-of-the-art exact frequent subgraph and frequent subtree mining algorithms fail to produce any output in practically feasible time. Most of these algorithms were developed for chemical applications and stop working for even slightly more complex graph databases. Our method hence makes it possible to compute frequent subtrees for graph databases where these patterns could not be computed previously. Furthermore, the efficient computation of feature representations for arbitrary graphs presented in this thesis makes it possible not only to inspect such patterns qualitatively, but also to use them in real-world machine learning applications.
Recall that the FCSM and FTM problems are computationally intractable. Hence any practical algorithm has to trade off speed, correctness, and general applicability. We have decided to maintain general applicability and speed (in the sense of computational complexity) by giving up the correctness of the algorithm. Interestingly, as a byproduct of our methods, we obtain a positive result on the complexity of exact frequent subtree mining for locally easy graphs. That is, we obtain a result that maintains the correctness and speed properties, but gives up general applicability.
Locally easy graphs restrict the number of spanning trees in certain subgraphs of a graph without assuming any global structure. Their definition restricts only the block degree of the transaction graphs, and allows an arbitrary number of bridges to be incident to any vertex in the pattern and the transaction graph. Hence, we obtain the first positive result on the SubtreeIsomorphism problem that we are aware of that allows unbounded vertex degree of the pattern tree for transaction graphs beyond forests. The vertex degree of the pattern is an important parameter of the complexity of the SubgraphIsomorphism problem (cf. Marx and Pilipczuk, 2014). With this result, we conjecture that we are very close to the border between tractable and intractable restrictions of the FTM as well as the SubtreeIsomorphism problem.
7.2. Outlook
We have proposed two embedding operators: one that samples global spanning trees, and the boosted algorithm, which samples local spanning trees. Both guarantee polynomial delay mining of frequent tree patterns, but they differ in their runtime and recall behavior. It remains an open problem to decide which of the two algorithms should be chosen for a given graph database, or possibly even individually for each graph in a database. This idea can be extended even further: While we have shown that our methods are superior to exact frequent subgraph miners on complex graph databases, Gaston and similar systems are superior on chemical graphs (and probably on some other simple graph databases as well). It is possible to combine the potentially inefficient exact embedding computation with our method to get the best of both worlds: Introducing a new parameter, we allow a mining algorithm to store at most a certain number of embedding lists per graph. If a candidate pattern results in too many embeddings for a given graph, we discard them and switch to our probabilistic embedding operator for this graph. As the number of embeddings of a pattern is polynomial in the number of embeddings of its predecessor for any given graph, this can be implemented efficiently. This adaptive algorithm could preserve the speed of Gaston on chemical graph databases and that of our algorithm on other graph databases (with a small overhead). As this is mainly an engineering problem, we leave it for future work.
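The switching rule of such an adaptive algorithm could be sketched as follows. This is only an illustration under stated assumptions, not an implementation of Gaston or of the embedding operators of this thesis. The function `match_adaptively`, the cap `max_embeddings` (interpreted here as a bound on the number of stored embeddings per graph), and the callbacks `extend` and `fallback` are hypothetical names. `fallback` stands in for the probabilistic embedding operator, which is not implemented here.

```python
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

Embedding = Tuple[Hashable, ...]


def match_adaptively(
    parent_embeddings: Optional[List[Embedding]],
    extend: Callable[[Embedding], Iterable[Embedding]],
    fallback: Callable[[], bool],
    max_embeddings: int,
) -> Tuple[bool, Optional[List[Embedding]]]:
    """Decide whether a candidate pattern matches one graph.

    Returns (matched, embeddings). embeddings is None once the exact
    embeddings were discarded; the graph then stays on the fallback path.
    """
    if parent_embeddings is None:
        # Embeddings were discarded at an earlier level for this graph.
        return fallback(), None
    collected: List[Embedding] = []
    for embedding in parent_embeddings:
        for child in extend(embedding):
            collected.append(child)
            if len(collected) > max_embeddings:
                # Too many embeddings: discard them and switch operators.
                return fallback(), None
    return bool(collected), collected


# Toy usage: extending single vertices to edges in a star graph.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}


def extend_path(embedding: Embedding) -> List[Embedding]:
    """Append one unused neighbor of the last vertex (toy exact extension)."""
    return [embedding + (w,) for w in star[embedding[-1]] if w not in embedding]


single_vertices = [(v,) for v in star]
exact = match_adaptively(single_vertices, extend_path, lambda: False, max_embeddings=10)
capped = match_adaptively(single_vertices, extend_path, lambda: True, max_embeddings=4)
```

In the toy run, `exact` keeps all six ordered edge embeddings, whereas `capped` exceeds the limit of four, drops the embeddings, and returns whatever the stand-in fallback reports.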
Another line of investigation would be to adapt the mining algorithm to the pattern class at hand. We have opted to consider only tree patterns in this thesis. Other classes of patterns might, however, allow faster algorithms: Using depth-first search, there is an immediate O(|V(H)| ⋅ |V(G)|) time algorithm to decide whether a pattern path H is subgraph isomorphic to a forest G, improving on the runtime of the algorithm for tree patterns. This implies that probabilistic frequent subpaths can be found faster than probabilistic frequent subtrees. Regarding an extension of our work in the other direction, there might be other simpler pattern classes for which probabilistic frequent pattern mining can be solved efficiently.
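One way to reach a bound of this form for a path pattern in a forest is sketched below. Note that this is a level-wise dynamic program over directed edges rather than the depth-first search mentioned above. It rests on the fact that in a forest every walk that never immediately turns back is a simple path. The sketch assumes vertex labels only and no edge labels; the function name `path_in_forest` and the input encoding are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def path_in_forest(
    pattern_labels: Sequence[Hashable],
    forest_adj: Dict[Hashable, List[Hashable]],
    forest_labels: Dict[Hashable, Hashable],
) -> bool:
    """Decide whether a vertex-labelled path occurs as a subgraph of a forest.

    forest_adj must describe an undirected forest (symmetric, acyclic).
    """
    k = len(pattern_labels)
    if k == 0:
        return True
    starts = {v for v in forest_adj if forest_labels[v] == pattern_labels[0]}
    if k == 1:
        return bool(starts)
    # live holds arcs (u, v) such that some embedding of the current
    # pattern prefix ends with the vertices u, v in this order.
    live = {
        (u, v)
        for u in starts
        for v in forest_adj[u]
        if forest_labels[v] == pattern_labels[1]
    }
    for i in range(2, k):
        incoming = defaultdict(int)  # number of live arcs ending in each vertex
        for _, u in live:
            incoming[u] += 1
        extended = set()
        for u, count in incoming.items():
            for v in forest_adj[u]:
                if forest_labels[v] != pattern_labels[i]:
                    continue
                # Need a live arc (w, u) with w != v, i.e. no turning back.
                if count - ((v, u) in live) > 0:
                    extended.add((u, v))
        live = extended
    return bool(live)


# Toy forest: a star with center 1 and leaves 0, 2, 3, plus isolated vertex 4.
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1], 4: []}
labels = {0: "a", 1: "b", 2: "a", 3: "c", 4: "a"}
found = path_in_forest(["a", "b", "a"], adj, labels)
```

Each of the |V(H)| levels touches every arc and every adjacency list entry at most once, so with hash-based sets the expected running time is proportional to |V(H)| ⋅ |V(G)| on a forest, where the number of edges is smaller than the number of vertices.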
Other directions for future work include, of course, novel application areas of frequent subtree mining in nontrivial graph classes that were previously inaccessible to frequent subgraph mining. For example, it might be possible to infer new nontrivial connections between input and output variables of a learning problem by finding frequent patterns in several neural networks that were trained independently for the same learning problem. For instance, multiple echo state networks (Jaeger, 2002) can be trained easily using the same training data. Echo state networks have a fixed set of input and output vertices and a random hidden layer; only the weights of the edges containing vertices from the output layer are trained. Frequent tree patterns that contain output as well as input vertices might indicate relevant connections between the input and output variables. To this end, the algorithms presented in this thesis most likely need to be adapted to process continuous edge labels instead of only discrete edge labels.
Finally, the computational complexity of other restricted FCSM and FTM problems should be explored. While the complexities of the various pattern mining problems are closely related to the complexities of the corresponding embedding operators, there remains an interesting gap in our knowledge: For graph classes where the HamiltonianPath problem can be solved in polynomial time, but the SubgraphIsomorphism (resp. SubtreeIsomorphism) problem is NP-complete, it is not clear whether a particular corresponding mining problem can be solved with polynomial delay, in incremental polynomial time, or if it cannot be solved in output polynomial time, unless P = NP. It would be interesting to see whether there are additional parameters (apart from the complexity of the HamiltonianPath problem and the complexity of the embedding operator) that influence the computational complexity of pattern mining. Further results, both negative and positive, would be important for a deeper understanding of the computational difficulties of frequent pattern mining.