SemRep uses a form of breadth-first search to find paths from a to b. This approach is motivated by the fact that short paths are generally more relevant and more likely to be correct than longer paths. Besides, shorter paths can be calculated much faster than longer paths, as the number of paths increases exponentially with the path length.
The repository uses a constraint l that defines the maximum path length calculated by SemRep, with l ∈ {1, 2, 3, 4}; that is, SemRep can calculate paths up to a length of 4. Such a constraint is necessary to keep the execution time low, since calculating all possible paths from a specific node would take too long, given the huge number of nodes and edges in the repository.
If valid paths from a to b are found at a distance of d, there seems to be no reason to calculate further paths of length d + 1, even if (d + 1) ≤ l. This default configuration is called First Paths: the breadth-first search stops as soon as paths of length d are found. Another configuration is All Paths, in which all paths up to a length of l are calculated. The two configurations are referred to as termination modes, as they specify when the path search has to stop.
Let us assume that the configuration is l = 3 and that the relation type between CPU and PC has to be determined, as shown in Fig. 9.5. Starting from the concept CPU, SemRep calculates all paths of length 1. Since the target node PC is not found, SemRep calculates all paths of length 2. There are now two results:
p1 = CPU part-of Laptop is-a PC
p2 = CPU part-of Computer inverse is-a PC
If the configuration is First Paths, SemRep stops at this point. Only if the configuration is All Paths does it continue to calculate paths of length 3, finding the following:
p3 = CPU part-of Laptop equal Notebook is-a PC
p4 = CPU part-of Computer inverse is-a Notebook is-a PC
Figure 9.5: Sample query for (CPU, PC).
SemRep stops at this point because of the restriction l = 3. If l were 4, SemRep would go on to calculate paths of length 4.
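The two termination modes can be illustrated with a small sketch. It is not SemRep's implementation: the function name `find_paths`, the dictionary-based graph and the `all_paths` flag are assumptions made for illustration, and the toy graph contains only the edges named in the CPU/PC example.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]   # (relation type, target concept)
Steps = List[Edge]       # a path, written as the steps taken from the start

# Toy graph (assumption): only the edges mentioned in the CPU/PC example.
GRAPH: Dict[str, List[Edge]] = {
    "CPU": [("part-of", "Laptop"), ("part-of", "Computer")],
    "Laptop": [("is-a", "PC"), ("equal", "Notebook")],
    "Computer": [("inverse is-a", "PC"), ("inverse is-a", "Notebook")],
    "Notebook": [("is-a", "PC")],
    "PC": [],
}


def find_paths(graph: Dict[str, List[Edge]], start: str, target: str,
               max_len: int = 3, all_paths: bool = False) -> List[Steps]:
    """Level-wise breadth-first search from start to target.

    all_paths=False mimics First Paths: stop after the first level that
    yields a result. all_paths=True mimics All Paths: search up to max_len.
    """
    results: List[Steps] = []
    frontier: List[Tuple[str, Steps]] = [(start, [])]
    for _length in range(1, max_len + 1):
        next_frontier: List[Tuple[str, Steps]] = []
        found: List[Steps] = []
        for node, steps in frontier:
            visited = {start} | {concept for _, concept in steps}
            for relation, neighbour in graph.get(node, []):
                if neighbour in visited:
                    continue  # cyclic paths are ignored
                new_steps = steps + [(relation, neighbour)]
                if neighbour == target:
                    found.append(new_steps)
                else:
                    next_frontier.append((neighbour, new_steps))
        results.extend(found)
        if found and not all_paths:
            break  # First Paths: no need to look at longer paths
        frontier = next_frontier
    return results


def format_path(start: str, steps: Steps) -> str:
    """Render a path in the notation used in the text."""
    return " ".join([start] + [f"{rel} {node}" for rel, node in steps])


for steps in find_paths(GRAPH, "CPU", "PC", max_len=3, all_paths=True):
    print(format_path("CPU", steps))
```

With `all_paths=False` the search returns only the two paths of length 2; with `all_paths=True` and `max_len=3` it also returns the two paths of length 3.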
Implementation
The classic breadth-first search, as applied in the above example, would start at node A and calculate all paths of length 1, 2, ..., n until it finds the target node B. Such an implementation is not optimal, though, because of the generally high node degrees. Common concepts found in mappings have a node degree G of about 100, although more general concepts can have some thousands of outgoing edges and less general concepts just a few. Let us assume that all paths up to length p = 4 have to be calculated. Theoretically, the number of paths is G^p = 100^4 (100 million). In practice, this number is much lower, because cyclic paths are ignored and longer paths can quickly reach more specific areas of the repository where the average node degree is lower. Still, the number of paths of length 4 can be several millions, depending on the input concept, which results in intolerably long execution times.
Therefore, the breadth-first search was extended to a bidirectional search. Instead of calculating all paths of length p from A to B, SemRep calculates all paths of length p/2 from A and from B, resulting in two sets of paths P_A and P_B. Subsequently, for each p ∈ P_A and p′ ∈ P_B, it is checked whether the target node of p is also the target node of p′; such a shared node is called a connector node. If this is the case, the two paths p and p′ together express a path from A to B.
An example of this approach is illustrated in Fig. 9.6. Let us assume that a path from node A to node E is sought and that l = 4. Thus, A is the start node and E the target node.
Starting from both nodes, paths of length 2 are calculated, resulting in p = A − B − C and p′ = E − D − C, as illustrated in sketch a). Node C is the connector node that allows these two paths to be combined into the final path p_final = A − B − C − D − E. In order to obtain a correct path object, all relation types in path p′ have to be inverted, which is indicated by the double arrows in sketch b).
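The combination step can be sketched as follows. The triple-based path representation, the function names and the `INVERSE` table are assumptions for illustration; in particular, treating has-a as the inverse of part-of is not stated in the text, and the relation types on the A to E example are made up, since the figure does not name them.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]   # (source concept, relation type, target concept)

# Assumed inverse of each relation type (hypothetical table).
INVERSE = {
    "is-a": "inverse is-a",
    "inverse is-a": "is-a",
    "part-of": "has-a",
    "has-a": "part-of",
    "equal": "equal",
}


def invert(path: List[Triple]) -> List[Triple]:
    """Reverse the direction of a path and invert every relation type."""
    return [(target, INVERSE[rel], source)
            for source, rel, target in reversed(path)]


def combine(p: List[Triple], p_prime: List[Triple]) -> List[Triple]:
    """Join p (from A) and p_prime (from B) at their shared connector node."""
    if p[-1][2] != p_prime[-1][2]:
        raise ValueError("paths do not end in the same connector node")
    return p + invert(p_prime)


# p = A - B - C and p' = E - D - C, with made-up relation types.
p = [("A", "is-a", "B"), ("B", "part-of", "C")]
p_prime = [("E", "is-a", "D"), ("D", "equal", "C")]
for triple in combine(p, p_prime):
    print(triple)
```

The second half of the combined path runs from C to E, so each of its triples is the mirror image of a triple in p′.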
Using this technique, all possible paths from a to b can be determined with only 2 × G^(p/2) calculations, which would reduce the number of comparisons from some millions to 2 × 100^2 = 20,000. However, since the connector nodes have to be determined, it has to be checked for each p ∈ P_A and each p′ ∈ P_B whether (p, p′) share such a node. Assuming |P_A| = |P_B| = 10,000, this step requires 10,000^2 (100 million) comparisons, which equals the theoretical number of comparisons for the original breadth-first search. In this case, the bidirectional search would not reduce the effort at all.
Figure 9.6: Path combination after bidirectional breadth-first search.
To circumvent this issue, the target nodes of all p ∈ P_A are stored in a hash set H. To find the connector nodes, the target nodes of all p′ ∈ P_B are iterated, and for each of them it is only checked whether H contains it. If this is the case, the respective path object is retrieved and the full path is built. In the above example, this means that 10,000 contains operations have to be carried out.
The contains operation of a hash set is extremely fast. Experiments showed that it requires only fractions of a millisecond to determine whether H contains a specific concept, even if H contains tens of thousands of concepts. As a consequence, this adaptation makes the bidirectional search much faster than the original approach.
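The difference between the pairwise check and the hash-based lookup can be made concrete with a short sketch. Paths are represented here simply as tuples of node names, and a dictionary keyed by target node stands in for the hash set H so that the matching path objects can be retrieved; both choices, like the function names, are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

NodePath = Tuple[str, ...]   # a path as a sequence of node names


def connectors_pairwise(paths_a: List[NodePath], paths_b: List[NodePath]):
    """Naive check: compare the target node of every pair (p, p')."""
    comparisons = 0
    matches = []
    for p in paths_a:
        for q in paths_b:
            comparisons += 1
            if p[-1] == q[-1]:
                matches.append((p, q))
    return matches, comparisons


def connectors_hashed(paths_a: List[NodePath], paths_b: List[NodePath]):
    """Hash-based check: one contains operation per path in paths_b."""
    index: Dict[str, List[NodePath]] = defaultdict(list)
    for p in paths_a:
        index[p[-1]].append(p)       # key: target node of p
    lookups = 0
    matches = []
    for q in paths_b:
        lookups += 1
        if q[-1] in index:           # hash lookup instead of a full scan
            matches.extend((p, q) for p in index[q[-1]])
    return matches, lookups


paths_a = [("A", "B", "C"), ("A", "X", "Y")]
paths_b = [("E", "D", "C"), ("E", "Z", "W")]
print(connectors_pairwise(paths_a, paths_b))
print(connectors_hashed(paths_a, paths_b))
```

Both functions find the same connector pairs, but the number of checks grows with |P_A| × |P_B| in the first case and only with |P_B| in the second.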
In detail, the algorithm to find paths up to length 4 works as follows:
1. All direct paths of concept A are calculated and stored in a set P_A. The target concept of each path is stored in a hash set H(A). If B ∈ H(A), a path of length 1 was found.
2. All direct paths of concept B are calculated and stored in P_B. The target concept of each path is stored in a hash set H(B). If there is a concept C for which C ∈ H(A) and C ∈ H(B), there is a path of length 2.
3. All outgoing paths of length 2 are calculated for A and added to P_A. The target concept of each path is stored in a hash set H′(A). If there is a concept C for which C ∈ H′(A) and C ∈ H(B), there is a path of length 3.
4. Finally, H′(B) is calculated analogously from the paths of length 2 starting at B. If there is a concept C for which C ∈ H′(B) and C ∈ H′(A), there is a path of length 4.
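The four steps above can be sketched on node sets alone. This is a simplification under stated assumptions: the graph is given as an adjacency dictionary that contains every edge in both directions, relation types and path objects are left out, and the function only reports the length of the shortest connection. All names (`undirected`, `two_step_targets`, `connection_length`) are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def undirected(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an adjacency dictionary holding every edge in both directions."""
    graph: Dict[str, Set[str]] = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def two_step_targets(graph: Dict[str, Set[str]], start: str,
                     first_level: Set[str]) -> Set[str]:
    """Target nodes of all length-2 paths from start, ignoring cycles."""
    targets: Set[str] = set()
    for mid in first_level:
        targets |= graph.get(mid, set())
    targets.discard(start)  # a path back to the start node is cyclic
    return targets


def connection_length(graph: Dict[str, Set[str]], a: str, b: str) -> Optional[int]:
    """Length (1 to 4) of the shortest connection between a and b, else None."""
    h_a = set(graph.get(a, set()))            # step 1: H(A)
    if b in h_a:
        return 1
    h_b = set(graph.get(b, set()))            # step 2: H(B)
    if h_a & h_b:
        return 2
    h2_a = two_step_targets(graph, a, h_a)    # step 3: H'(A)
    if h2_a & h_b:
        return 3
    h2_b = two_step_targets(graph, b, h_b)    # step 4: H'(B)
    if h2_a & h2_b:
        return 4
    return None


chain = undirected([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")])
for target in "BCDEF":
    print(target, connection_length(chain, "A", target))
```

Each step expands only one side by one level, so the more expensive length-2 expansions are reached only when the cheaper checks have failed.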
Figure 9.7: Overview of the different path types and the terminology used.