5.2.1 Overview

Fig. 5.2 depicts at a high level the general workflow of STROMA. As already illustrated, a match tool calculates a mapping between two given ontologies in step 1, which is the input for the mapping enrichment conducted by STROMA (step 2). The workflow of mapping enrichment consists of two phases: relation type detection and selection (mapping repair). The two phases are carried out consecutively, i.e., the selection phase only commences once the type detection phase has been completed. In this thesis, the main focus is on relation type detection, while mapping repair is a subordinate issue.

In the first phase, type detection, STROMA iterates through all correspondences, and each correspondence is passed to six different strategies that independently try to determine the relation type of the correspondence at hand. Each strategy returns either a specific relation type, like is-a, or undecided if no type can be determined. By default, all strategies are enabled, though it is generally possible to disable strategies (e.g., to only use background knowledge). Since the six strategies can return different types, the results have to be analyzed and a final relation type has to be determined, which is carried out by the Type Computation component (see Section 5.2.2). The final type is then verified for plausibility, i.e., different techniques drawing on contextual information are used to corroborate the type or to reject it (see Section 6.7). The correspondence is then denoted by one specific relation type or by undecided.

Figure 5.2: Illustration of the two-step approach and the mapping enrichment workflow.

The second phase is the selection phase, in which all correspondences in the mapping are once again iterated and checked for linguistic plausibility. If the relation type of a correspondence with an already low score could not be calculated, or if there is enough linguistic evidence that a correspondence is not true, it will be removed from the mapping.

In both steps and in all phases, background knowledge like dictionaries, thesauri or domain-specific ontologies can be applied. While this is optional in Step 1, depending on the tool that is used for the initial mapping, STROMA practically exploits background knowledge in all phases and sub-steps it performs.

5.2.2 Type Computation

Let S be the set of strategies used by STROMA, and R the set of relation types STROMA is able to determine (in this subsection we treat undecided as a type, too). Each strategy s ∈ S returns exactly one relation type r ∈ R.

Internally, the result for a specific correspondence is represented as an |S| × |R| matrix, because each strategy can return any of the 7 relation types. To determine the final relation type, it appears natural to use the type returned by the majority of strategies. Though STROMA is generally based on this notion, this approach is too simple, because the different strategies differ slightly in their general reliability. For instance, Compound and Background Knowledge are quite reliable strategies that normally return satisfactory results. The strategy Word Frequency is more heuristic and seems less reliable, though, and its results should have less impact on the overall result determination. Therefore, each strategy s has a specific weight w(s), which specifies how much reliability is assigned to it. By default, STROMA uses the weights 1.0 for Compound and Itemization, 0.9 for Background Knowledge, 0.8 for Multiple Linkage, 0.7 for Structure and 0.6 for Word Frequency. A sample result matrix is depicted in Table 5.1. Compound and Background Knowledge both return is-a, and according to the strategy weights, scores of 1.0 and 0.9, respectively, are achieved for this type. The Structure strategy returns equal, but according to its weight, only a score of 0.7 is achieved for this type.

Eventually, the overall score for each type is the sum of the scores produced by each strategy. For undecided, no specific score is calculated, because a type is assigned to a correspondence as soon as it has an overall score above 0, which makes scores for the result type undecided unnecessary. In the given example, STROMA would eventually decide on the type is-a, because it obtained the highest score.
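The weighted aggregation described above can be sketched as follows. This is a minimal illustration and not STROMA's actual implementation: the names WEIGHTS and aggregate_scores are hypothetical, and the three strategies not mentioned in the example are assumed to return undecided.

```python
from collections import defaultdict

# Default strategy weights as listed in the text.
WEIGHTS = {
    "Compound": 1.0,
    "Itemization": 1.0,
    "Background Knowledge": 0.9,
    "Multiple Linkage": 0.8,
    "Structure": 0.7,
    "Word Frequency": 0.6,
}

UNDECIDED = "undecided"


def aggregate_scores(results, weights=WEIGHTS):
    """Sum the strategy weights per returned relation type.

    `results` maps a strategy name to the relation type it returned.
    Strategies that return undecided contribute no score.
    """
    scores = defaultdict(float)
    for strategy, rel_type in results.items():
        if rel_type != UNDECIDED:
            scores[rel_type] += weights[strategy]
    return dict(scores)


# Example from the text. Assumption: the remaining strategies are undecided.
results = {
    "Compound": "is-a",
    "Background Knowledge": "is-a",
    "Structure": "equal",
    "Itemization": UNDECIDED,
    "Multiple Linkage": UNDECIDED,
    "Word Frequency": UNDECIDED,
}
scores = aggregate_scores(results)
best = max(scores, key=scores.get)  # type with the highest overall score
print(scores, best)
```

Under these assumptions, is-a collects the weights of Compound and Background Knowledge (1.0 + 0.9), while equal only receives the weight of Structure (0.7), so is-a is selected.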

Strategy    equal   is-a   inv. is-a   part-of   has-a   related   undecided
Compound            1.0

Table 5.1: Sample matrix for a type result of a processed correspondence.

Using a different weight for each strategy reduces the risk of draws, in which two relation types achieve the same score. Still, a draw is theoretically possible, e.g., Background Knowledge and Word Frequency together achieve a score of 1.5, which is also achieved by Structure and Multiple Linkage. In this rare case where a draw between two types occurs, the type with the highest preference is applied. In STROMA, the preference of types is defined as follows (descending): equal, is-a, inverse is-a, part-of, has-a, related. This order is based on the distribution of relation types in most mappings. The types equal, is-a and inverse is-a normally predominate, while the other types occur less often. Therefore, STROMA assigns the type which is more likely to hold.

Occasionally, none of the six strategies is able to calculate any relation type, i.e., each strategy returns undecided and each type has a score of 0. STROMA allows two configurations in this case: it can denote the correspondence with the label undecided and let the user decide on the correct relation type (manual interaction), or it can denote the correspondence with equal by default. Since many match tools tend to detect equivalence relations, and since this type is often the most frequently occurring one in mappings, equal seems to be the most likely type to hold if no type could be calculated.

By default, this second configuration is applied, which we call undecided-as-equal in the evaluation of mappings. The opposite configuration is undecided-as-false, i.e., a correspondence not having any type assignment is treated as falsely typed. A list of all weights and default configurations used in STROMA is also provided in Appendix A.
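The draw handling and the fallback for correspondences without any typed result can be sketched together as follows. This is an illustrative sketch only: the function final_type and its parameter undecided_as_equal are hypothetical names, and rounding the scores before comparison is an assumption made here so that floating-point noise does not hide a genuine draw.

```python
# Preference order of relation types (descending), as given in the text.
PREFERENCE = ["equal", "is-a", "inverse is-a", "part-of", "has-a", "related"]


def final_type(scores, undecided_as_equal=True):
    """Pick the final relation type from the per-type overall scores.

    `scores` maps a relation type to its summed score; types that no
    strategy returned may simply be absent.
    """
    positive = {t: s for t, s in scores.items() if s > 0}
    if not positive:
        # No strategy returned a type: apply the configured fallback.
        return "equal" if undecided_as_equal else "undecided"
    # Highest score wins; on a draw, the type earlier in PREFERENCE wins.
    return min(
        positive,
        key=lambda t: (-round(positive[t], 9), PREFERENCE.index(t)),
    )


print(final_type({"equal": 0.7, "is-a": 1.9}))
print(final_type({"is-a": 0.9 + 0.6, "part-of": 0.7 + 0.8}))
print(final_type({}))
print(final_type({}, undecided_as_equal=False))
```

The second call reproduces the draw mentioned in the text (1.5 against 1.5); the third and fourth calls show the two configurations for the case in which every strategy returns undecided.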

5.2.3 Selection

STROMA can use mapping repair techniques to remove false correspondences, which can lead to a better mapping precision. However, since STROMA works on a given mapping, there is practically no chance to achieve a better recall. Therefore, it is advisable to use relaxed match configurations in step 1, which generally leads to larger mappings containing more correspondences. Such mappings will have a better recall, but a lower precision, which STROMA tries to improve by means of different repair techniques.

Figure 5.3: Sample mapping with the two thresholds θ and θ0.

The match tool COMA 3.0 was used in the initial phase for the mappings processed with STROMA. It normally uses a selection threshold θ = 0.4, which means that correspondences between two concepts must achieve a score of 0.4 or higher in order to be accepted for the mapping. In the context of this research, a lower threshold θ0 = 0.2 is used, which consequently results in larger mappings with a better recall, but worse precision. However, correspondences with a score between 0.2 and 0.39 are only treated as "conditionally accepted". If there is enough linguistic evidence that a correspondence from this range is correct, it is accepted to the final mapping, otherwise it is rejected.

This approach is also illustrated in Fig. 5.3. The initial mapping produced with COMA contains 4 correspondences above the threshold θ (3 of them are correct, while the third one is false), which will certainly appear in the enriched mapping. The 3 yellow correspondences following below are the conditionally accepted correspondences that are further verified for correctness and are possibly kept in the enriched mapping. They have scores between θ and θ0, and apparently two of them are valid, while one is false (the third one).

Finally, the red correspondences having a score below θ0 are generally rejected. They will not be part of the enriched mapping. Note that such correspondences are usually not part of the input mapping in the first place and are only shown for illustration. However, there would be no problem in using such an input mapping, as STROMA automatically identifies the correspondences that are below θ0 and removes them instantly.
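The three score ranges can be expressed as a small classification function. This is a sketch for illustration; the function name classify and the returned labels are assumptions, while the threshold values 0.4 and 0.2 are the ones stated in the text.

```python
def classify(confidence, theta=0.4, theta0=0.2):
    """Assign a correspondence to one of the three ranges shown in Fig. 5.3.

    theta  -- regular selection threshold of the match tool
    theta0 -- relaxed lower threshold used for mapping repair
    """
    if confidence >= theta:
        return "accepted"
    if confidence >= theta0:
        return "conditionally accepted"
    return "rejected"


for c in (0.55, 0.39, 0.2, 0.19):
    print(c, classify(c))
```

If both thresholds are set to the same value, the middle range is empty and every correspondence is either accepted or rejected.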

In the example, it can be seen that the lower threshold leads to two further correct correspondences, which increases the mapping recall. There is also a further incorrect correspondence (Carpets, Computers), which STROMA is able to detect, though. Thus, in this simplified example two further correct correspondences are found, but no further false correspondence would be part of the enriched mapping. The relaxed configuration thus leads to a better recall without impairing the precision. On the contrary, the precision is even higher now, as 5 of the 6 correspondences in the final mapping are correct (83.3 %), compared to 3 of 4 (75 %) if no lower threshold is used.

The threshold θ0 is freely adjustable. If θ0 = θ, no additional correspondences are considered and no mapping repair is carried out. Consider a correspondence with a confidence c such that θ0 ≤ c < θ. This correspondence is finally accepted if the following two conditions are fulfilled:

1. STROMA could determine a specific relation type for this correspondence, i.e., the result is not undecided.

2. There is no evidence that the correspondence is a result of sloppy lexicographic matching.

The first point is relatively simple, as it seems natural to only accept critical correspondences if a specific relation type could be determined. Unfortunately, a type can also be found for irrelevant correspondences, especially by using background knowledge. If there is a correspondence (chair, table), this might be a false correspondence, because (chair, seat) would be the correct correspondence. Still, STROMA could determine a type between chair and table, which might be related. Therefore, a score of at least 1.0 has to be reached to finally accept this correspondence, which entails that either two strategies have to return a specific type for this correspondence, or that the type is returned by Compound or Itemization (which are quite reliable strategies). This score is called the minimal type confidence for acceptance.
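The acceptance rule for a conditionally accepted correspondence can be summarized in a few lines. This is a hedged sketch: accept_conditional and its parameters are hypothetical names, and the result of the lexicographic check (condition 2) is simply passed in as a boolean rather than computed.

```python
def accept_conditional(rel_type, type_score, sloppy_lexical_match,
                       min_type_confidence=1.0):
    """Decide whether a correspondence with theta0 <= c < theta is kept.

    rel_type             -- final relation type, or "undecided"
    type_score           -- overall score of that type (sum of strategy weights)
    sloppy_lexical_match -- True if the match looks purely lexicographic
    """
    if rel_type == "undecided":
        return False  # condition 1: a specific type is required
    if type_score < min_type_confidence:
        return False  # minimal type confidence for acceptance
    return not sloppy_lexical_match  # condition 2


# (chair, table): "related" found by Background Knowledge alone (0.9).
print(accept_conditional("related", 0.9, False))
# A type returned by Compound alone (1.0) reaches the minimum.
print(accept_conditional("is-a", 1.0, False))
# Two weaker strategies agreeing (0.7 + 0.6) also reach it.
print(accept_conditional("is-a", 0.7 + 0.6, False))
```

With the default weights, any two agreeing strategies sum to at least 1.3, which is why a single type returned by two strategies always passes the minimum of 1.0.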

The second point is more intrinsic. As already mentioned in the introduction of this thesis, most match tools are based upon lexicographic matchers and often disregard linguistic laws. As we have already explained, a similar spelling of words is no hint at all for any semantic relatedness, as long as there is no linguistic evidence like inflection, compounding, derivation or shortening. In all these cases, changes only refer to the end of words, e.g., (Computers, Computing) or (Book Shelves, Book Shelve Ladders). Therefore, if the first fragments of the matching concepts do not overlap, they are most likely unrelated, as in (stable, table) or (page, cage). This mapping repair technique checks whether the first letters of the matching words overlap. If this is the case, the correspondence is

the two concepts have to overlap to accept the word. Thus, (telephone, television) would be rejected (only the first 4 letters overlap), while (computer, computing) will be accepted (6 letters overlap). As in many cases, this default configuration is based on experience and can be freely adjusted in STROMA. If words are of length 4 or less, they are usually simplex words consisting of only one morpheme. The arbitrariness of language suggests that those words are normally not in any relevant relation and can thus be rejected.

It would be generally possible to extend this second technique by more complex techniques, e.g., by also handling similarly spelled compound words like (city map, city council) (mismatch) or (city hall, town hall) (match), which is quite a difficult undertaking, though. Currently, open and hyphenated compound words are never rejected, as there is a considerable likelihood that the two matching concepts are somehow related. Since mapping repair is not the primary focus of this thesis, as it is a much too complex research field, only the basic techniques described above are applied in STROMA. As we will show in the evaluation, they still allow further improvements of the mapping quality.
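The prefix-overlap check described in this section can be sketched as follows. The function names are hypothetical, and the default min_overlap=5 is an assumption: the text only establishes that an overlap of 4 letters is rejected and an overlap of 6 letters is accepted. Detecting open and hyphenated compounds by looking for a space or hyphen is likewise a simplification made for this sketch.

```python
def common_prefix_length(a, b):
    """Number of leading characters shared by a and b (case-insensitive)."""
    n = 0
    for x, y in zip(a.lower(), b.lower()):
        if x != y:
            break
        n += 1
    return n


def is_plausible(a, b, min_overlap=5, min_word_length=5):
    """Return False if (a, b) looks like a sloppy lexicographic match.

    min_overlap is an assumed default, not a value taken from the text.
    """
    # Open and hyphenated compounds are never rejected.
    if any(sep in word for word in (a, b) for sep in (" ", "-")):
        return True
    # Words of length 4 or less are treated as unrelated simplex words.
    if min(len(a), len(b)) < min_word_length:
        return False
    return common_prefix_length(a, b) >= min_overlap


for pair in [("telephone", "television"), ("computer", "computing"),
             ("stable", "table"), ("page", "cage"), ("city hall", "town hall")]:
    print(pair, is_plausible(*pair))
```

Because the check only inspects word beginnings, it accepts pairs that differ through inflection or derivation at the end of the word and rejects pairs such as (stable, table) whose similarity lies elsewhere.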