Chapter VII: Strategic Implementation
7.7 Change Management
Given a free morpheme (simplex word), the arbitrariness of language suggests that there is practically no correlation between its representation (spelling or pronunciation) and its meaning [10, 54, 112]. This implies that the meaning of a morpheme cannot be predicted from its representation. For example, nothing in the simplex word tree indicates any relation to the actual concept tree, nor does the construct tree resemble any tree. The different words used in other languages corroborate this theory: Baum in German and arbre in French refer to the same object, but are spelled and pronounced completely differently. Therefore, simplex words appear to be completely arbitrary and only complex words carry semantics, like blackboard, which is a compound of the two arbitrary words black and board, or politely, which is the adverb form of the arbitrary word polite.
Knowledge of this arbitrariness can help in ontology enhancement, especially in the discovery of false matches. Many match algorithms assume that concepts are related if they have a notable overlap in spelling, and most lexicographic strategies like Edit Distance, Trigram or Jaccard depend on this assumption. Yet the arbitrariness of language suggests that there is generally no relation between two similarly spelled simplex words: there is no semantic relation between cable, fable, gable, stable, or table, though they are similarly spelled and pronounced. Words that are in a semantic relation, like chair and seat or house and building, normally have quite different representations and cannot be determined by lexicographic strategies alone. Only if at least one word consists of more than one morpheme can a similar spelling hint at some relatedness, though it does not have to be equality, as in (cookbook, book) or (database, data record).
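The effect described above can be reproduced with a minimal sketch of two of the named measures. The function names, the unpadded trigram tokenization and the normalization of the edit distance by the longer word are assumptions made for illustration, not the definitions used by any particular match tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1] by the longer word (an assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def trigrams(word: str) -> set:
    """Set of character trigrams, without padding (an assumed tokenization)."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def trigram_jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


if __name__ == "__main__":
    for pair in [("cable", "table"), ("chair", "seat")]:
        print(pair, edit_similarity(*pair), trigram_jaccard(*pair))
```

Under these assumptions the unrelated pair (cable, table) scores far higher than the related pair (chair, seat) on both measures, which is exactly the weakness the paragraph points out.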
In the field of ontology matching, classic lexicographic strategies can discover the following equivalence relations:
1. Relations between words that have different spellings, like color and colour.
2. Relations between words with different inflections, like computer, computers and computing.
3. Relations between shortenings or abbreviations like lab and laboratory.
4. Relations between words that have a high lexicographic overlap and are indeed related, like hotel and hostel (though such pairs are rather rare).
They can also discover typos (like employee and employe), but such spelling errors usually do not appear in well-developed schemas or ontologies. In cases 1 and 2, differences typically appear in the middle or at the end of the two words, not at the beginning. Also, most shortenings and abbreviations start with the same letter or sequence of letters as their counterpart (case 3). It can thus be assumed that if two words are very similar in spelling but do not start with the same letter, they are probably mismatches, just as in the sample correspondence (stable, table). In such a case, the correspondence can be seriously doubted and possibly removed to increase the mapping quality.
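The first-letter heuristic above could be applied as a post-filter on a mapping, sketched below. The function name is_suspicious and the threshold of 0.7 are hypothetical, and the ratio of Python's difflib.SequenceMatcher is used only as a stand-in for whichever lexicographic measure produced the correspondence.

```python
from difflib import SequenceMatcher


def is_suspicious(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag a correspondence whose words are spelled similarly
    but do not start with the same letter.

    The threshold is an illustrative assumption, not a tuned value.
    """
    a, b = a.lower(), b.lower()
    if not a or not b:
        return False
    similar = SequenceMatcher(None, a, b).ratio() >= threshold
    return similar and a[0] != b[0]


if __name__ == "__main__":
    candidates = [("stable", "table"), ("cable", "fable"),
                  ("color", "colour"), ("lab", "laboratory")]
    for a, b in candidates:
        print(a, b, "doubtful" if is_suspicious(a, b) else "keep")
```

Pairs that share their first letter are never flagged by this check, so it says nothing about cases such as furniture and furnace.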
However, if two concepts start with the same sequence of characters, no unique answer to the question of relatedness can be given, and the correspondence could be either correct (as in the four examples above) or wrong (as in furniture and furnace). Examining the two words in more detail, though, leads to the conclusion that this correspondence is rather a mismatch. The difference between furniture and furnace is too large for case 1 (different spelling). Since neither of the two concepts ends with an English inflection (case 2) and neither is obviously a shortening of the other (case 3), it appears very likely that the correspondence is false. As in case 4, however, this is an assumption that does not generally hold, and there is a chance that the two words are ultimately related.
Classic lexicographic strategies can also discover is-a relations like (high-school, school) and part-of relations like (bed, bedroom), but they are unable to determine the semantic type (and may simply assume that all relations are of type equal).
The arbitrariness of language is a significant aspect that is very often ignored in schema and ontology matching. However, this linguistic law only refers to morphemes, not to words in general. For this reason, arbitrariness of language does not mean that lexicographic and linguistic match strategies are generally unsuitable or futile. It only means that the mere spelling comparison of two words is too simple to reliably decide whether they are in any relation or not. Knowing how words and morphemes are put together, and how to interpret the meaning of such combinations, can considerably improve matching and semantic mapping enrichment.
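The case analysis applied to furniture and furnace above can be written down as a rough sketch. Everything in it is an assumption made for illustration: the name classify_pair, the tiny list of endings (chosen only to cover the examples computer, computers and computing), the minimum stem length of 3, and the rule that a spelling variant differs by an edit distance of at most 1.

```python
# Hypothetical list of endings; a real system would use a morphological analyzer.
INFLECTION_ENDINGS = {"", "s", "es", "ed", "ing", "er", "ers"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def common_prefix(a: str, b: str) -> str:
    """Longest sequence of characters that both words start with."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return a[:n]


def classify_pair(a: str, b: str, min_stem: int = 3, max_variant_distance: int = 1) -> str:
    """Assign a pair of similarly spelled words to one of the cases from the text."""
    a, b = a.lower(), b.lower()
    if a == b:
        return "identical"
    stem = common_prefix(a, b)
    rest_a, rest_b = a[len(stem):], b[len(stem):]
    # Case 2: both words are the shared stem plus a known ending.
    if len(stem) >= min_stem and rest_a in INFLECTION_ENDINGS and rest_b in INFLECTION_ENDINGS:
        return "inflection"
    # Case 1: the words differ by at most one edit operation.
    if edit_distance(a, b) <= max_variant_distance:
        return "spelling variant"
    # Case 3: one word is a prefix of the other.
    if a.startswith(b) or b.startswith(a):
        return "shortening"
    return "probable mismatch"


if __name__ == "__main__":
    for pair in [("color", "colour"), ("computer", "computing"),
                 ("lab", "laboratory"), ("furniture", "furnace")]:
        print(pair, "->", classify_pair(*pair))
```

The sketch cannot separate case 4 from accidental overlap: a pair like hotel and hostel is labeled a spelling variant merely because its edit distance is 1, and lab and label would pass as a shortening.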
| Word formation | Relevance for OM? | Mapping example and counterexample | Remark |
|---|---|---|---|
| Derivations | Yes | section ↔ intersection | Can help to discover false matches (antonyms) or pseudo-compounds. |
| Compounds | Yes | bike ↔ mountain bike; butterfly ≠ fly | Can help to discover is-a relations. |
| Shortenings | Partly | lab ↔ laboratory; lab ≠ label | Can help in some cases to discover equal relations, but can also be misleading. |
| Acronyms, abbreviations | Yes | EU ↔ European Union; EU ≠ Essen University | Can help to discover equal relations. |
| Blends | No | motel ↔ hotel; motel ≠ cartel | Blends cannot be unequivocally determined. |
| Conversion | No | (professional (adj.) ↔ professional (noun)) | Word classes other than nouns are irrelevant for mappings. |
| Loan words | No | (doppelganger ↔ lookalike) | Loan words often have no relevant lexicographic overlap (require handling by dictionary). |
Table 3.1: Overview of the different forms of word formation and their relevance for ontology mapping.
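The compound row of the table can be illustrated with a small sketch that relies on the fact that English compounds are usually right-headed: if the longer term ends with the shorter one, the shorter term is a candidate head and the pair a candidate is-a relation. The function name compound_relation, its return labels and the minimum part length of 3 are hypothetical choices, not part of any existing matcher.

```python
def compound_relation(a: str, b: str, min_part: int = 3):
    """Guess the relation between a compound and one of its parts.

    Returns "is-a candidate" if the shorter term is the head (right part)
    of the longer one, "modifier match" if it is the left part, else None.
    """
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    if short == long_ or len(short) < min_part:
        return None
    if long_.endswith(short):
        # What remains on the left must itself be long enough to be a word.
        modifier = long_[:len(long_) - len(short)].rstrip(" -")
        if len(modifier) >= min_part:
            return "is-a candidate"
    if long_.startswith(short):
        head = long_[len(short):].lstrip(" -")
        if len(head) >= min_part:
            return "modifier match"
    return None


if __name__ == "__main__":
    pairs = [("mountain bike", "bike"), ("high-school", "school"),
             ("bedroom", "bed"), ("butterfly", "fly"), ("stable", "table")]
    for a, b in pairs:
        print(a, "/", b, "->", compound_relation(a, b))
```

The sketch also reports butterfly and fly as an is-a candidate, which is the pseudo-compound counterexample from the table: a purely lexicographic check cannot tell a real compound from an accidental one without a dictionary.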