
To evaluate the proposed matchers, we compare the matching results produced by each matcher against a set of reference alignments. In this study, the reference alignments are predefined and were manually prepared by domain experts. For each alignment, we determine the true positive, false positive, and false negative matches. The values of precision, recall, and f-measure are then computed according to Equations 6.6, 6.7, and 6.8, respectively. Each measure ranges between zero and one; the higher the value, the better the result.

\[
\text{Precision} = \frac{|\{\text{ReferenceAlignments}\} \cap \{\text{MatchResults}\}|}{|\{\text{MatchResults}\}|} \tag{6.6}
\]

Precision and recall are widely accepted measures for evaluating results in information retrieval. Precision measures correctness: how many of the matches discovered by the matcher are correct in comparison to the total number of matches retrieved. A higher precision means fewer wrong matches.

\[
\text{Recall} = \frac{|\{\text{ReferenceAlignments}\} \cap \{\text{MatchResults}\}|}{|\{\text{ReferenceAlignments}\}|} \tag{6.7}
\]

Recall, on the other hand, measures completeness: how many correct matches are retrieved in comparison to the total number of correct matches in the reference alignment. A higher recall means that more of the correct matches have been found.

\[
f\text{-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{6.8}
\]

In practice, maximizing precision could lead to fewer retrieved matches (lower recall), while maximizing recall may result in poor precision. To get a better picture of the quality of the different matchers, the values of precision and recall should be combined. One way to combine them is the f-measure, which is the harmonic mean of precision and recall and provides a global measure of the matching quality. However, the f-measure gives equal weight to precision and recall, and weighting the two equally for a tool is not correct. As described by Berry [15], the human effort to find a true positive manually, which is what recall measures, differs from the effort to reject a false positive, which is what precision measures. If the human effort to find a missing match is β times the effort to vet a matching result, then recall should be weighted β times as much as precision when calculating the f-measure. The weighted f-measure (fβ) can be calculated using Equation 6.9.

\[
f_{\beta} = (1 + \beta^{2}) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^{2} \cdot \text{precision}) + \text{recall}} \tag{6.9}
\]

In Liberate, the effort of finding a missing match is approximately 10 times the effort of vetting a match. Accordingly, the evaluation uses the weighted f-measure fβ with β = 10.
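As a minimal sketch of Equations 6.6 to 6.9, the following Python function computes precision, recall, and the weighted f-measure from two sets of matches. The function name `evaluate`, the concept pairs, and the convention of returning zero when a denominator is zero are assumptions made for illustration; they are not taken from the evaluation itself.

```python
def evaluate(reference, matches, beta=1.0):
    """Return (precision, recall, f_beta) for a set of match results.

    Assumption: a measure is reported as 0.0 when its denominator is zero.
    """
    reference, matches = set(reference), set(matches)
    true_positives = len(reference & matches)

    # Equation 6.6: correct matches over all retrieved matches.
    precision = true_positives / len(matches) if matches else 0.0
    # Equation 6.7: correct matches over all reference alignments.
    recall = true_positives / len(reference) if reference else 0.0

    # Equation 6.9; with beta = 1 it reduces to Equation 6.8.
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    f_beta = (1 + beta ** 2) * precision * recall / denominator
    return precision, recall, f_beta


# Hypothetical alignments: pairs of (source concept, target concept).
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
matches = {("a1", "b1"), ("a2", "b2"), ("a5", "b9")}

p, r, f1 = evaluate(reference, matches, beta=1)
_, _, f10 = evaluate(reference, matches, beta=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} f10={f10:.3f}")
```

In this invented example the precision is 2/3 and the recall is 1/2. The f1 value (4/7) lies between the two, whereas the f10 value (202/403, about 0.501) sits almost exactly on the recall, which is the effect of weighting recall more heavily with β = 10.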

Experiment 1: Web vs. Linguistic Matchers

Figure 6.8 shows the results of applying the WSM, WN and NS matchers to public and private cloud schemas respectively.

Figure 6.8: Experiment 1 - Results. (A) Public Schemas (Azure vs. GAE); (B) Private Schemas (StratusML vs. TOSCA).

Figure 6.8A shows that the WSM approach tends to provide higher precision and recall. The value of the f10-measure is also higher than for the NS and WN matchers for public clouds. In particular, the precision increases as the evidence threshold increases. The precision of the NS matcher starts higher than that of the WSM. It increases until the evidence threshold reaches 20%, then hits zero at a 30% evidence threshold. This sudden change can be explained by the recall plot, which shows very poor recall at 20% that drops to zero at 30% evidence. The number of matches found is inversely related to the evidence threshold. This is expected, because when the evidence threshold is low the matcher retrieves more results, both correct and incorrect. The WSM has the highest recall because it is a statistical measure. Given the large number of pages on the Internet, the value of the WSM is in most cases greater than zero, which indicates some statistical relatedness. This is not the case for linguistic-based similarities, where the similarity can easily hit zero.
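The interplay between the evidence threshold, the number of retrieved matches, and the resulting precision and recall can be illustrated with a small sketch. The similarity scores, concept pairs, and the helper name `sweep` below are invented for illustration and do not reproduce any of the matchers; precision is assumed to be reported as zero when no matches are retrieved.

```python
def sweep(scored_matches, reference, thresholds):
    """For each threshold, keep the pairs scoring at or above it and
    return rows of (threshold, retrieved count, precision, recall)."""
    rows = []
    for t in thresholds:
        retrieved = {pair for pair, score in scored_matches.items() if score >= t}
        correct = len(retrieved & reference)
        # Assumption: precision is 0.0 when nothing is retrieved.
        precision = correct / len(retrieved) if retrieved else 0.0
        recall = correct / len(reference) if reference else 0.0
        rows.append((t, len(retrieved), precision, recall))
    return rows


# Hypothetical similarity scores for candidate pairs.
scored = {
    ("a1", "b1"): 0.9,
    ("a2", "b2"): 0.6,
    ("a3", "b7"): 0.5,
    ("a4", "b8"): 0.2,
    ("a5", "b5"): 0.1,
}
reference = {("a1", "b1"), ("a2", "b2"), ("a5", "b5")}

rows = sweep(scored, reference, [0.0, 0.3, 0.6, 0.95])
for threshold, count, precision, recall in rows:
    print(f"{threshold:.2f}  retrieved={count}  P={precision:.2f}  R={recall:.2f}")
```

With these made-up scores, raising the threshold shrinks the retrieved set from five pairs to none. Precision rises while wrong pairs are filtered out, then falls to zero together with recall once nothing is retrieved, which mirrors the pattern described for the NS matcher.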

Based on the f10-measure results, the NS matcher is the worst performer when applied to public cloud schemas, while the WSM is the best. The f10-measure results in Figure 6.8A show that the WSM results increase with the evidence threshold up to 0.8. At this point, the f10-measure value drops because of the sharp drop in recall (i.e., very few matches are retrieved, only some of which are correct mappings).

Figure 6.8B shows the results of repeating the same experiment for private cloud schemas. This time, the WSM scored the worst in comparison to both linguistic matchers (i.e., NS and WN). The reason might be the limited online content that refers to the concepts in these schemas. The WSM produced very poor matching precision, and a high recall that falls sharply to zero at a 30% evidence threshold. On the other hand, the behavior of the NS and WN matchers seems to be consistent across public and private cloud schemas: in both cases, they performed poorly. This is expected, as neither matcher incorporates domain knowledge in the matching process, and the concepts in both public and private cloud schemas do not share a linguistic similarity. The f10-measure confirms that none of the matchers is suitable for private clouds, with the linguistic approaches being more favorable.

Experiment 2: Domain Knowledge Matchers

Figure 6.9 shows the results of evaluating the performance of the WSM, GBM and DKM matchers as they are applied to public and private cloud schemas respectively.

Figure 6.9A shows that the WSM is still the dominant matcher for public cloud schemas. The WSM performs better than both the GBM and the DKM. In particular, the GBM shows poor precision, moderate recall, and a relatively lower f10-measure when the evidence threshold is below 0.6. Nevertheless, the matching outcome of the GBM still outperforms the results obtained by the pure linguistic matchers used in the first experiment.

Figure 6.9: Experiment 2 - Results. (A) Public Schemas (Azure vs. GAE); (B) Private Schemas (StratusML vs. TOSCA).

As expected, the results obtained by the DKM lie between those obtained by the GBM and the WSM. This is because the DKM combines the results of both matchers. However, since more weight was given to the GBM, the DKM results are closer to the GBM results.

Figure 6.9B shows the results of applying the same matchers to private cloud schemas. As in the first experiment, the WSM fails as a matcher for private cloud schemas. On the other hand, the GBM shows promising results when applied to private cloud schemas. The precision of the GBM increases as the evidence threshold increases and reaches its peak at a threshold of 0.6. The value then drops to zero as the number of retrieved matches, and with it the recall, drops to zero at an evidence threshold of 0.7. The f10-measure score shows that the GBM outperforms both the DKM and the WSM. On average, the GBM results were 8% higher than the DKM results and 70% higher than the WSM results when applied to the TOSCA and StratusML schemas.
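The observation that the DKM results fall between those of the GBM and the WSM, and closer to the GBM, is what a weighted combination of two scores would produce. The text does not state the combination formula or the weights, so the weighted average, the function name `combine`, and the weight of 0.7 below are assumptions chosen only to illustrate the behavior.

```python
def combine(gbm_score, wsm_score, gbm_weight=0.7):
    """Weighted average of two similarity scores in [0, 1].

    Assumption: the weighted-average form and the default weight are
    hypothetical; the source only says the GBM received more weight.
    """
    if not 0.0 <= gbm_weight <= 1.0:
        raise ValueError("gbm_weight must be between 0 and 1")
    return gbm_weight * gbm_score + (1.0 - gbm_weight) * wsm_score


# Hypothetical scores for one candidate pair.
combined = combine(gbm_score=0.2, wsm_score=0.8)
print(round(combined, 2))
```

For these invented inputs the combined score is about 0.38, which lies between 0.2 and 0.8 and nearer to the GBM score because the GBM carries the larger weight.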

Experiment 3: Structural Matcher

The third experiment aims at examining the effect of structural similarity. Figure 6.10 shows the results of applying SF++ to public and private cloud schemas. The results are compared with the DKM results obtained from the second experiment.

Figure 6.10: Experiment 3 - Results. (A) Public Schemas (Azure vs. GAE); (B) Private Schemas (StratusML vs. TOSCA).

The observations from Figures 6.10A and 6.10B are as follows:

(i) The public cloud schemas used in this experiment are structurally dissimilar. This is clear from the zero values of precision, recall, and f10-measure.

(ii) The private cloud schemas used in this experiment exhibit some structural similarity, but overall they still have significant structural differences. As shown in the precision results in Figure 6.10B, the SF++ improved the precision of the DKM, particularly at a higher evidence threshold (e.g., 0.6).

(iii) The SF++ seems to have a negative effect on recall. While this was not that obvious for private clouds, it was the main factor that brought the precision and f10-measure values to zero in public clouds.

The SF++ can have a negative impact on the initial matching results when the schemas do not exhibit structural similarity. This is due to the fixpoint computation, which changes the similarity magnitude of the initial matches. Degrading the initial similarity magnitude reduces the number of true positives, which is the reason behind the poor recall values.

Discussion

In this subsection we revisit our questions based on the experimental results.

Q1. How will the proposed Web Similarity Matcher perform when applied to public and private schemas?

The first experiment shows that the WSM has higher precision, recall, and f10-measure values than the pure linguistic approaches when applied to public cloud schemas. Unfortunately, all the matchers used in the first experiment, including the WSM, performed badly when applied to private cloud schemas.

Q2. How will the proposed Domain Knowledge Matcher perform in comparison to the WSM approach?

The results of the second experiment show that the WSM works better at detecting and retrieving matches for the public cloud schemas, while the GBM works better for private clouds. To combine the benefits of both matchers, the DKM measure can be used in both cases and produces decent results. The DKM is more favorable in scenarios where there is limited knowledge about the schemas, for example when we are not sure how much online content is available or when we have incomplete glossaries.

Q3. How will structure-based matching affect the results?

As shown in the results of the third experiment, the SF++ did not add value to the matching process for the selected schemas. However, using the SF++ can reveal important knowledge about how much two schemas deviate structurally, which can serve as an indicator of the manual effort required for schema alignment and transformation template generation. Moreover, the SF++ can be used in cases of schema evolution (i.e., to discover mismatches between different versions of the same schema).

It is worth noting that, although the conclusions were the same, the results reported in this thesis, which were obtained using the weighted f10-measure, were stronger than those obtained using the f1-measure. This confirms Berry’s [15] observation.