Weighted graphs are ubiquitous in the real world. Consider transportation networks, where numerical weights attached to edges might stand for the load, the average speed, the time, the distance, etc. Software call graphs, as investigated in this dissertation for defect localisation, can likewise be annotated with weights: edge weights might represent call frequencies or abstractions of the dataflow. However, we are aware of only a few studies that analyse weighted graphs with frequent subgraph mining. Most of these studies focus on a specific analysis problem rather than proposing general weighted-subgraph-mining techniques. In the following, we review some work based on discretisation, and we discuss approaches building on the concept of weighted support.
Discretisation-Based Approaches
Logistic Networks. Jiang et al. [JVB+05] investigate frequent subgraph mining in logistic networks where edges represent single transports and are annotated with several weights, such as the distance between two nodes and the weight of the load. With each weight, a different weighted graph can be constructed. In order to derive labels suited for graph mining from the edge weights, the authors use a binning strategy: the value range of each weight is partitioned into intervals of equal size, giving a few (7 to 10) distinct labels. This binning strategy may curb result accuracy, for two reasons: (1) The scheme does not take the distribution of values into account. Thus, close values may be assigned to different bins. (2) The discretisation leads to a number of ordered (ordinal) intervals, but the authors treat them as unordered categorical values. For example, the information that ‘medium’ lies between ‘low’ and ‘high’ is lost.
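The first shortcoming can be illustrated with a minimal sketch of equal-width binning. The function name, the value range and the example distances are hypothetical and chosen for illustration only; they are not taken from [JVB+05].

```python
def equal_width_bin(value, lo, hi, n_bins):
    """Map a numeric weight to a bin index in [0, n_bins - 1] using equal-width ranges."""
    if not lo <= value <= hi:
        raise ValueError("value outside [lo, hi]")
    width = (hi - lo) / n_bins
    # The maximum value would otherwise fall into bin n_bins, so clamp it.
    return min(int((value - lo) / width), n_bins - 1)


# Hypothetical distances (km), discretised into 7 equal-width bins over [0, 700].
weights = [99.0, 101.0, 199.0]
labels = [equal_width_bin(w, 0.0, 700.0, 7) for w in weights]
print(labels)  # [0, 1, 1]
```

The values 99 and 101 differ by only 2 but receive different labels, whereas 101 and 199 differ by 98 and share a label. If the resulting labels are then treated as categorical, a mining algorithm also has no way of knowing that bin 0 is adjacent to bin 1.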
Image Analysis. Nowozin et al. [NTU+07] also discretise weights before applying frequent subgraph mining. They study image-analysis problems in which images are represented as weighted graphs. The authors represent each point of interest by one vertex and connect all vertices. They assign each edge a vector consisting of image-analysis-specific measures. Then they discretise the weights, but with a method more sophisticated than binning: the weight vectors are clustered, resulting in categorical edge labels that group similar weight vectors. However, the risk of losing potentially important information by discretisation is not eliminated: (1) It might still happen that close points in an n-dimensional space fall into different clusters. (2) Even though value distributions are considered, the authors do so in the context of the original graphs. When frequent subgraph mining is applied afterwards, the distributions within the different subgraphs can be very different, and other discretisations could be more appropriate.
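A minimal sketch of point (1), assuming that the clustering yields centroids and that each weight vector is labelled with its nearest centroid. The source does not name the clustering algorithm used in [NTU+07]; the centroids, the vectors and the function name below are hypothetical.

```python
import math


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to the vector (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))


# Hypothetical two-dimensional weight vectors and two assumed cluster centroids.
centroids = [(0.0, 0.0), (10.0, 0.0)]
a, b, c = (4.9, 0.0), (5.1, 0.0), (0.0, 0.0)
labels = [nearest_centroid(v, centroids) for v in (a, b, c)]
print(labels)  # [0, 1, 0]
```

The vectors a and b are only 0.2 apart but lie on opposite sides of the cluster boundary, so they receive different labels, while a and c are 4.9 apart and share one.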
Subsumption. In this dissertation, we deal with weighted software call graphs. For the analysis of these graphs, we propose two kinds of approaches that differ from discretisation: In Chapters 5–7 we investigate a postprocessing approach, and in Chapter 8 a constraint-based mining approach. Both proposals avoid the shortcomings of discretisation mentioned above, as they analyse numerical weights instead of discrete intervals.
Weighted-Frequent Subgraph Mining
The Approaches by Jiang et al. Jiang et al. [JCSZ10] deal with a text-classification task, formulated as a weighted-frequent-subgraph-mining problem. This is based on the concept of weighted support formulated by the authors. The concept builds on the assumption that certain edges within a graph are more significant than others, and that this significance is reflected in the edge-weight values (i.e., a significant edge displays a high value).¹ Concretely, the authors calculate the weighted support $\mathrm{wsup}$ of a subgraph $g$ as follows:

\[
\mathrm{wsup}(g) := \mathrm{sup}(g) \cdot \sum_{e \in E(g)} w(e)
\]
That is, the weighted support of a subgraph is high when it has a high support and contains edges with high weight values. Correspondingly, weighted-frequent-subgraph mining as defined by the authors discovers subgraphs satisfying a user-defined minimum-weighted-support threshold. However, the minimum-weighted-support criterion is not anti-monotone: a subgraph can fail the threshold while one of its supergraphs satisfies it. The criterion can therefore not be used to prune the search space in pattern-growth-based frequent-subgraph-mining algorithms. Instead, the authors make use of an alternative but weaker concept to prune the search space and implement their technique as an extension of gSpan [YH02] (see Section 2.3.3). In [JCZ10], the authors present variations of the approach, including two further weight-based criteria that are anti-monotone.
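The weighted support defined above, and the reason why a threshold on it is not anti-monotone, can be sketched as follows. This is a simplified illustration and not the implementation of [JCSZ10]: graphs are modelled as sets of edge identifiers, subgraph containment is reduced to a subset test (no isomorphism checking), support is taken as a relative frequency, and every edge identifier carries one global weight. All of these choices, as well as the names and data, are assumptions.

```python
def support(pattern, database):
    """Fraction of database graphs that contain every edge of the pattern."""
    return sum(pattern <= graph for graph in database) / len(database)


def weighted_support(pattern, database, weight):
    """wsup(g) = sup(g) * sum of w(e) over the edges e of g."""
    return support(pattern, database) * sum(weight[e] for e in pattern)


# Hypothetical database of four graphs, each given as a set of edge identifiers.
database = [
    frozenset({"ab", "bc"}),
    frozenset({"ab", "bc"}),
    frozenset({"ab"}),
    frozenset({"ab", "cd"}),
]
weight = {"ab": 1.0, "bc": 5.0, "cd": 1.0}  # assumed global edge weights

small = frozenset({"ab"})
large = frozenset({"ab", "bc"})  # a supergraph of `small`

print(weighted_support(small, database, weight))  # 1.0
print(weighted_support(large, database, weight))  # 3.0
```

With a minimum weighted support of 2.0, the pattern `small` is rejected although its extension `large` qualifies. Pruning the search at `small` would therefore miss a valid result, which is why a different pruning criterion is needed.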
Using their approaches, the authors achieve good results not only in the text-classification application [JCSZ10], but also in (medical) image-analysis problems [ECJ+10, JCSZ08] and certain problems from logistics [JCZ10].
The Approaches by Shinoda et al. Shinoda et al. [SOO09] present an approach similar to the ones by Jiang et al. [JCSZ10, JCZ10]. They consider graphs with weighted nodes and edges (referred to as internal weights), and the graphs themselves are assigned a weight as well (referred to as external weights). They define the internal weighted support $\mathrm{wsup}_{\mathrm{int}}$ similarly to Jiang et al., but they normalise by the total internal weight of the graph database $D$ (in the denominator):

\[
\mathrm{wsup}_{\mathrm{int}}(g) := \frac{\text{sum of all internal weights of } g \text{ in all graphs } d \in D \text{ where } g \subseteq d}{\text{sum of all internal weights of all graphs in } D}
\]
If there are several embeddings of $g$ in a graph $d$, the one with the maximum weight is chosen.
¹ Jiang et al. consider only weighted edges, but claim that their concepts can be easily transferred to
The authors define the external weighted support $\mathrm{wsup}_{\mathrm{ext}}$ similarly as follows:

\[
\mathrm{wsup}_{\mathrm{ext}}(g) := \frac{\text{sum of the external weights of all graphs } d \in D \text{ where } g \subseteq d}{\text{sum of all external weights of all graphs in } D}
\]
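Finally, they define a general weighted support $\mathrm{wsup}_{\mathrm{gen}}$, based on a user-defined parameter $\lambda$ ($0 \le \lambda \le 1$):

\[
\mathrm{wsup}_{\mathrm{gen}}(g) := \lambda \cdot \mathrm{wsup}_{\mathrm{ext}}(g) + (1 - \lambda) \cdot \mathrm{wsup}_{\mathrm{int}}(g)
\]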
Based on a user-defined minimum general-weighted-support value and the parameter λ, the authors define the general-weighted-subgraph-mining problem. Their solution to this problem is similar to the one by Jiang et al. [JCSZ10]: As the minimum general-weighted-support criterion is not anti-monotone, they rely on a weaker pruning criterion for mining with a pattern-growth-based subgraph-mining algorithm. The authors also propose a related problem, mining external weighted subgraphs under internal weight constraints, which is solved similarly within the same framework. In their experiments, Shinoda et al. [SOO09] achieve good results with synthetic data, communication graphs and chemical compound graphs.
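The three support measures by Shinoda et al. can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation of [SOO09]: each graph is a dictionary with one external weight and a mapping from element identifiers (nodes or edges) to internal weights, and containment is a subset test on identifiers. Under this simplification every pattern has at most one embedding per graph, so the maximum-weight-embedding rule is not exercised. All names and data are hypothetical.

```python
def contains(graph, pattern):
    """Simplified containment: every pattern element occurs in the graph."""
    return pattern.issubset(graph["internal"])


def wsup_int(pattern, database):
    """Internal weight of the pattern in its supporting graphs / total internal weight."""
    num = sum(
        sum(g["internal"][x] for x in pattern)
        for g in database
        if contains(g, pattern)
    )
    den = sum(sum(g["internal"].values()) for g in database)
    return num / den


def wsup_ext(pattern, database):
    """External weight of the supporting graphs / total external weight."""
    num = sum(g["external"] for g in database if contains(g, pattern))
    den = sum(g["external"] for g in database)
    return num / den


def wsup_gen(pattern, database, lam):
    """Convex combination of external and internal weighted support, 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * wsup_ext(pattern, database) + (1.0 - lam) * wsup_int(pattern, database)


# Hypothetical database of three weighted graphs.
database = [
    {"external": 2.0, "internal": {"ab": 1.0, "bc": 3.0}},
    {"external": 1.0, "internal": {"ab": 2.0, "cd": 2.0}},
    {"external": 1.0, "internal": {"cd": 2.0}},
]
pattern = frozenset({"ab"})

print(wsup_int(pattern, database))       # 3.0 / 10.0
print(wsup_ext(pattern, database))       # 3.0 / 4.0
print(wsup_gen(pattern, database, 0.5))  # midpoint of the two values above
```

Setting `lam` to 1 reduces the general measure to the external weighted support, and setting it to 0 reduces it to the internal one.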
Subsumption. While mining for weighted frequent subgraphs (or mining using the variations by Shinoda et al.) is adequate for certain applications, it relies on the assumption that high weight values identify significant components. This does not hold in every domain. For instance, in software-defect localisation, high (or low) edge-weight values are in general not related to defects. Therefore, weighted-frequent-subgraph mining cannot be used for every problem, and it offers less flexibility than constraints on arbitrary measures as investigated in Chapter 8 of this dissertation. Furthermore, to our knowledge, the weighted-frequent-subgraph-mining techniques presented in this section have neither been evaluated systematically nor compared to alternative approaches.