9.2.1 Overview
SemRep is organized as a simple graph whose nodes are concepts and whose edges are the semantic relations between them (see Fig. 9.1). Edges are directed, but as each relation type has a well-defined inverse type, the graph can be traversed in either direction. Each node and edge has a few attributes which are necessary for path calculation and confidence measurement. Node attributes comprise the concept name and the resources in which the concept appears. Edge attributes comprise the semantic relation type and the resources in which the relation appears.
SemRep does not distinguish between the different senses of a concept, nor between different languages. For any given word w, there is exactly one node. For example, mouse is represented by one node, although it has different meanings (an animal, an input device, a small person, etc.). The word gift (German: Gift) exists in both German and English, but has a different meaning in each language. Still, this word is represented only once in the repository (words are case-insensitive). This means that homonyms cannot be distinguished in SemRep, which can impair the quality of query execution. We will discuss this point in detail in Chapter 10.
Figure 9.2: Repository infrastructure in UML notation (class diagram).
9.2.2 Implementation Details
There are different possibilities to build a repository like SemRep. Intuitively, it seems reasonable to use a graph database, because of the repository's graph structure. In an early implementation, the graph database system Neo4j [73] was used to store the millions of relations from the different resources. Neo4j makes it relatively easy to import those relations and provides different techniques to find semantic paths between nodes (query execution). However, determining paths between two nodes was excessively slow and could take up to 30 seconds, which was an unacceptable execution time for STROMA, as it needs to process hundreds or thousands of correspondences within a mapping.
For this reason, SemRep is a tailored implementation that utilizes a Java-based hash map structure and runs in main memory. The basic structure is illustrated in the UML model shown in Fig. 9.2. The central class Repository contains a repository hash map as its primary element, which is a set of concept entries. The hash key is based on the concept name, allowing fast access to a given concept. A concept entry has a name, a list of resources where it appears and a list of relation entries. A relation entry has a list of resources where it appears, a relation type (encoded by an internal number) and a target concept, which is another concept entry. Finally, a resource has a name (e.g., WordNet), a language (e.g., English) and a resource-specific confidence threshold used for path scoring (see Section 9.3.4).
As an example, consider the two relations (car, automobile) and (car, vehicle). Let us assume that the first relation was provided by WordNet, and the second by Wikipedia. There is one concept car with two relations to automobile and vehicle, which can be visualized in an object diagram as shown in Fig. 9.3.
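The structure described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the original SemRep source: the class SemRepSketch, all method and field names, and the placeholder confidence value 1.0 are hypothetical, and the relation type codes (0 = equal, 1 = is-a) are borrowed from the import format of Section 9.2.3 rather than from the internal encoding, which the text does not specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the repository structure; not the original SemRep source. */
class SemRepSketch {

    /** An import resource, e.g. WordNet (English). */
    static final class Resource {
        final String name;
        final String language;
        final double confidence; // resource-specific confidence value (placeholder)

        Resource(String name, String language, double confidence) {
            this.name = name;
            this.language = language;
            this.confidence = confidence;
        }
    }

    /** A directed relation from its owning concept to a target concept. */
    static final class RelationEntry {
        final int type; // assumed codes: 0 = equal, 1 = is-a
        final ConceptEntry target;
        final List<Resource> resources = new ArrayList<>();

        RelationEntry(int type, ConceptEntry target) {
            this.type = type;
            this.target = target;
        }
    }

    /** One node per word, regardless of sense or language. */
    static final class ConceptEntry {
        final String name;
        final List<Resource> resources = new ArrayList<>();
        final List<RelationEntry> relations = new ArrayList<>();

        ConceptEntry(String name) {
            this.name = name;
        }
    }

    static final class Repository {
        private final Map<String, ConceptEntry> concepts = new HashMap<>();

        /** Words are case-insensitive, so the hash key is the lower-cased name. */
        private static String key(String word) {
            return word.toLowerCase(Locale.ROOT);
        }

        /** Returns the concept entry for a word, or null if it is unknown. */
        ConceptEntry lookup(String word) {
            return concepts.get(key(word));
        }

        private ConceptEntry getOrCreate(String word, Resource resource) {
            ConceptEntry entry = concepts.computeIfAbsent(key(word), ConceptEntry::new);
            if (!entry.resources.contains(resource)) {
                entry.resources.add(resource);
            }
            return entry;
        }

        /** Adds a relation, or records a further resource for an existing one. */
        void addRelation(String source, String target, int type, Resource resource) {
            ConceptEntry from = getOrCreate(source, resource);
            ConceptEntry to = getOrCreate(target, resource);
            for (RelationEntry relation : from.relations) {
                if (relation.type == type && relation.target == to) {
                    if (!relation.resources.contains(resource)) {
                        relation.resources.add(resource);
                    }
                    return;
                }
            }
            RelationEntry relation = new RelationEntry(type, to);
            relation.resources.add(resource);
            from.relations.add(relation);
        }
    }

    public static void main(String[] args) {
        Resource wordNet = new Resource("WordNet", "English", 1.0);
        Resource wikipedia = new Resource("Wikipedia", "English", 1.0);
        Repository repository = new Repository();
        repository.addRelation("car", "automobile", 0, wordNet);
        repository.addRelation("car", "vehicle", 1, wikipedia);
        for (RelationEntry relation : repository.lookup("Car").relations) {
            System.out.println("car -> " + relation.target.name + " (type " + relation.type
                    + ", from " + relation.resources.get(0).name + ")");
        }
    }
}
```

The main method rebuilds the car example: one concept entry car with two relation entries, each pointing to the resource that provided it. Looking up "Car" finds the same entry because the key is lower-cased.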
Figure 9.3: Repository infrastructure in UML notation (object diagram).
[73] http://neo4j.com/
In SemRep, relations are always directed. Given two concepts x, y, only one relation is stored in the repository. The opposite direction can be easily calculated, because any relation type r has a well-defined inverse relation type r⁻¹. For this reason, there is a (1 : ∗) cardinality between ConceptEntry and RelationEntry, as there could be concepts that have no outgoing relations to other concepts (which makes them leaf concepts in the repository). By contrast, the associations from ConceptEntry and RelationEntry to Resource are mandatory, i.e., each concept entry and each relation entry must occur in at least one import resource.
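The inverse relation type can be derived with a simple lookup. The following sketch assumes the five importable relation types and their numeric codes from Section 9.2.3, and it assumes that is-a pairs with inverse is-a, has-a pairs with part-of, and equal is its own inverse. The enum RelationType and its method inverse() are hypothetical names, not part of the original implementation.

```java
/** Hypothetical relation-type enum; codes follow the import format of Section 9.2.3. */
enum RelationType {
    EQUAL(0), IS_A(1), INVERSE_IS_A(2), HAS_A(3), PART_OF(4);

    final int code;

    RelationType(int code) {
        this.code = code;
    }

    /** Returns the type obtained when the relation is read in the opposite direction. */
    RelationType inverse() {
        switch (this) {
            case IS_A:
                return INVERSE_IS_A;
            case INVERSE_IS_A:
                return IS_A;
            case HAS_A:
                return PART_OF;
            case PART_OF:
                return HAS_A;
            default:
                return EQUAL; // equal is symmetric
        }
    }

    public static void main(String[] args) {
        for (RelationType type : values()) {
            System.out.println(type + " <-> " + type.inverse());
        }
    }
}
```

With such a mapping, a stored relation (x, y, r) also answers queries for (y, x) by returning r.inverse(), so the second direction never has to be stored.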
As there is only one node for one term, a node may represent concepts from different languages. For instance, the term oldtimer exists both in German and English, but it has a completely different meaning in each of the two languages. In English, an oldtimer refers to a veteran or an elderly person in general, while it refers to a vintage automobile in German. Thus, from the node oldtimer there are both German and English relations leading to other (German and English) nodes. However, such an implementation does not pose any problem so far, because the language of a concept entry and relation entry can be easily determined by the resource objects it contains (see Fig. 9.2). The language of the repository is English by default, but it can be changed at any time. For example, if German mappings are to be processed, the language has to be switched to German first. Then, if the concept oldtimer is queried, SemRep will only follow German relations and ignore relations of all other languages. Thus, there is no impairment by having one node for concepts originating from different languages and SemRep will always make sure that only paths of the selected language are used. Of course, this is a very simple implementation, which could be improved by an advanced repository that creates two nodes for terms of two different languages. For the general purpose of schema and ontology mapping, the current data structure seems sufficient, though.
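The language restriction can be illustrated with a small filter over the relations of a node. This is a simplified sketch under assumptions: the class LanguageFilterSketch, the method relationsFor and the example targets ("veteran", "auto") are hypothetical, and each relation carries the languages of its resources directly instead of referencing resource objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of language-restricted relation lookup. */
class LanguageFilterSketch {

    /** A relation together with the languages of the resources that provide it. */
    static final class Relation {
        final String target;
        final List<String> languages;

        Relation(String target, List<String> languages) {
            this.target = target;
            this.languages = languages;
        }
    }

    /** Keeps only relations that occur in at least one resource of the active language. */
    static List<Relation> relationsFor(List<Relation> all, String activeLanguage) {
        List<Relation> result = new ArrayList<>();
        for (Relation relation : all) {
            if (relation.languages.contains(activeLanguage)) {
                result.add(relation);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Relations of the single node "oldtimer"; the targets are illustrative only.
        List<Relation> oldtimer = Arrays.asList(
                new Relation("veteran", Arrays.asList("English")),
                new Relation("auto", Arrays.asList("German")));
        for (Relation relation : relationsFor(oldtimer, "German")) {
            System.out.println("oldtimer -> " + relation.target);
        }
    }
}
```

Because the filter is applied when relations are followed, switching the repository language changes which paths are reachable without changing the stored graph.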
9.2.3 Data Import
Each resource is a set of triples (word1, word2, type), with the type being encoded by a single digit: 0 = equal, 1 = is-a, 2 = inverse is-a, 3 = has-a and 4 = part-of. The type related (co-hyponymy) is not supported for data import, because a co-hyponym relation is an indirect relation of the form X is-a Y inverse is-a Z, yet the repository only accepts direct relations. The triples are stored in a simple CSV-like text file, with two colons (::) separating the fields of a triple (see Listing 9.1). This format is very similar to the STROMA import format (see Listing 5.1). There are more advanced formats to store triples, like RDF, but as the files solely serve for data import, there was no need for such a more complex format.
car::vehicle::1
car::automobile::0
mountain bike::bike::1
bike ring::handlebars::4
Listing 9.1: Sample excerpt from an import file
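Reading such a file amounts to splitting each line into three fields. The following is a minimal sketch, assuming one triple per line, a double colon as field separator (as read from Listing 9.1) and type codes 0 to 4. The class ImportFileSketch and its methods are hypothetical names and do not reflect the original import code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the word1::word2::type import format. */
class ImportFileSketch {

    static final class Triple {
        final String word1;
        final String word2;
        final int type; // 0 = equal, 1 = is-a, 2 = inverse is-a, 3 = has-a, 4 = part-of

        Triple(String word1, String word2, int type) {
            this.word1 = word1;
            this.word2 = word2;
            this.type = type;
        }
    }

    /** Parses one line; throws IllegalArgumentException on malformed input. */
    static Triple parseLine(String line) {
        String[] fields = line.split("::", -1);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected word1::word2::type but got: " + line);
        }
        // NumberFormatException is a subclass of IllegalArgumentException.
        int type = Integer.parseInt(fields[2].trim());
        if (type < 0 || type > 4) {
            throw new IllegalArgumentException("Unsupported relation type: " + type);
        }
        return new Triple(fields[0].trim(), fields[1].trim(), type);
    }

    /** Parses all non-blank lines of an import file. */
    static List<Triple> parse(List<String> lines) {
        List<Triple> triples = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                triples.add(parseLine(line));
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        Triple triple = parseLine("mountain bike::bike::1");
        System.out.println(triple.word1 + " | " + triple.word2 + " | " + triple.type);
    }
}
```

Multi-word concepts such as mountain bike need no special handling, because only the double colon acts as a separator.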
Figure 9.4: Querying workflow.
The simple structure of the import file makes it easy to extend SemRep with further resources. The Wikipedia relations had already been built in this input format, so they could be imported into SemRep directly. WordNet, UMLS, OpenThesaurus and ConceptNet had to be converted into this list of triples, though, which was more laborious. For example, WordNet uses a graph-like structure to organize its synsets, and each synset S comprises a set of synonymous words. To convert WordNet into a set of (1:1)-relations, the WordNet tree has to be traversed from top to bottom, and all words w1 ∈ S1, w2 ∈ S2 have to be put into a relationship if S1 is in any direct semantic relation to S2. Besides, all w1, w2 within a synset have to be put into an equal relation if w1 ≠ w2.
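The two conversion rules can be sketched as follows. The example makes several assumptions: synsets are given as plain word lists, the output uses the import format with :: as separator, and each unordered pair of synonyms is emitted only once, since the opposite direction can be derived from the inverse type. The class SynsetConversionSketch, its methods and the simplified example synsets are hypothetical and do not reproduce actual WordNet content or the original converter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of turning synset relations into word-level triples. */
class SynsetConversionSketch {

    /** Relates every word of synset s1 to every word of synset s2 with the given type code. */
    static List<String> crossRelations(List<String> s1, List<String> s2, int type) {
        List<String> triples = new ArrayList<>();
        for (String w1 : s1) {
            for (String w2 : s2) {
                triples.add(w1 + "::" + w2 + "::" + type);
            }
        }
        return triples;
    }

    /** Puts all pairs of different words within one synset into an equal relation (type 0). */
    static List<String> equalRelations(List<String> synset) {
        List<String> triples = new ArrayList<>();
        for (int i = 0; i < synset.size(); i++) {
            for (int j = i + 1; j < synset.size(); j++) {
                String w1 = synset.get(i);
                String w2 = synset.get(j);
                if (!w1.equals(w2)) {
                    triples.add(w1 + "::" + w2 + "::0");
                }
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        List<String> s1 = Arrays.asList("car", "automobile"); // simplified synset
        List<String> s2 = Arrays.asList("vehicle");           // simplified synset
        List<String> triples = new ArrayList<>(crossRelations(s1, s2, 1)); // S1 is-a S2
        triples.addAll(equalRelations(s1));
        for (String triple : triples) {
            System.out.println(triple);
        }
    }
}
```

A synset relation between S1 and S2 thus yields |S1| · |S2| word-level triples, which is one reason why the converted resources are much larger than the number of synset relations suggests.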