We start our investigation by selecting a set of 10 relevant nouns: Konzern (corporation), Tochter (daughter), Unternehmen (company), Umsatz (business volume), Industrie (industry), Bank (bank), textil (textile), Branche (branch), Firma (firm), Versicherung (insurance). We chose these nouns because we considered them relevant for the economic domain. For each noun we check whether it occurs alone or as part of a compound word and, in the latter case, whether it appears as the prefix or the suffix of the compound. For example, the German noun Konzern (corporation) can appear in the following compounds:
(6) Der größte deutsche Chemiekonzern
    the largest German chemical corporation
(7) PKI erstellte erstmals einen Konzernabschluss
    PKI generated for the first time a corporation report
(8) Der 75jährige Konzernchef
    the 75 year old head of the corporation
(9) beim amerikanischen Johnson-Konzern
    with the American Johnson corporation
From these examples, based on our observations, we can already extract a lot of information that can be used as the basis for an ontology:
- the compounded sequence named_entity hyphen noun leads to the definition of an instanceOf of an ontology class that could have Konzern (corporation) as its label (or an alias);

5.1. TEXT-BASED LAYER 78

- the multi-word expression Konzern followed by a noun leads to a relation associated with the ontology class that could have Konzern (or an alias) as its label: Konzern genericRelation Chef;
- the multi-word expression noun followed by Konzern leads to a subClassOf relation between the expression itself and the class having Konzern (or an alias) as its label: Chemiekonzern (chemical corporation) is a subClassOf of the class Konzern (corporation).
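The three observations above can be sketched as a small rule-based classifier. This is only an illustrative sketch, not the implementation used in this work: the function name `classify` and the tuple output format are assumptions, and linking elements between compound parts (such as a linking -s-) are deliberately ignored.

```python
def classify(token, noun="Konzern"):
    """Map a token containing `noun` to a candidate ontology statement.

    Returns a (subject, relation, object) tuple, or None if no rule applies.
    The function name and output format are hypothetical.
    """
    low, n = token.lower(), noun.lower()
    if low == n:
        return None  # the noun occurs alone, nothing to derive
    # Rule 1: named_entity hyphen noun -> instanceOf
    if low.endswith("-" + n):
        return (token, "instanceOf", noun)
    # Rule 2: noun followed by another noun -> genericRelation
    if low.startswith(n):
        rest = token[len(noun):].lstrip("-").capitalize()
        return (noun, "genericRelation", rest)
    # Rule 3: another noun followed by noun -> subClassOf
    if low.endswith(n):
        return (token, "subClassOf", noun)
    return None


for word in ["Johnson-Konzern", "Konzernchef", "Chemiekonzern"]:
    print(classify(word))
```

Checking the hyphenated pattern before the plain suffix pattern matters here, because a token such as Johnson-Konzern would otherwise also satisfy the subClassOf rule.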
As attractive as this very simple approach might appear, its first and most obvious drawback is that it allows us to extract possible ontology classes and relations only for nouns defined by the user. In this way we achieve high precision but very low recall. In addition, the extraction applies only to isolated words and does not take any textual context into account. In the following paragraphs we present a generalized approach for ontology extraction from plain text. Based on extraction rules, we show how ontological knowledge can be extracted from plain text using linguistic knowledge alone.
In order to develop a more general method (one that does not depend on user-defined nouns) for the extraction of ontological knowledge, we decided to start by extracting all noun compounds from the corpus. This decision is based on the assumption that ontological knowledge can be extracted from noun compounds. The assumption is supported by grammarians who investigate the specificities of the German language (Fleischer and Barz, 1995; Lohde, 2006; Motsch, 2006). In their view, in most cases a noun compound² is built from two or more words which can also stand alone in the text and which are semantically connected³ to each other (Duden, 2006). Based on this, the elements of a compound are, for our task, potential ontology classes, and the relations between the elements of the compound are potential ontological properties. Furthermore, a determinative compound is a hyponym of the second element of the compound (Erben, 1993; Donalies, 2007). As described in Chapter 4, we assume that all extracted compounds are determinative compounds.

² We deal here with the specific case of determinative noun-noun compounds. More on this aspect in Section 5.1.2.
³ Semantically connected means that the components of the compound are connected to each
To attain our aim, we implemented a pattern-based algorithm which exploits specific characteristics of the German language. Since noun compounding in German involves at least one noun, and German nouns start with a capital letter, we first selected from the corpus all words starting with a capital letter. As a simplification, we assumed that all words starting with a capital letter are nouns.
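A minimal sketch of such a capital-letter pattern is given below. The regular expression and the function name `noun_candidates` are assumptions made for illustration and are not the pattern actually used in this work. Sentence-initial function words such as Der are matched as well and would have to be filtered out in a separate step.

```python
import re

# A capitalised word of at least two letters, optionally continued by
# hyphenated capitalised parts (e.g. "Johnson-Konzern").
# Hypothetical pattern, not taken from the original implementation.
CAPITALIZED = re.compile(
    r"\b[A-ZÄÖÜ][A-Za-zÄÖÜäöüß]+(?:-[A-ZÄÖÜ][A-Za-zÄÖÜäöüß]+)*\b"
)


def noun_candidates(text):
    """Return all capitalised words in `text`, in order of appearance."""
    return CAPITALIZED.findall(text)


print(noun_candidates("Der größte deutsche Chemiekonzern"))
print(noun_candidates("beim amerikanischen Johnson-Konzern"))
```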
Key            Frequency     Key            Frequency
Mark           797           Ende           140
Prozent        653           Deutschen      128
Unternehmen    340           Zeit           119
Jahr           305           Branche        99
Millionen      295           Bank           96
Milliarden     264           Markt          95
Jahren         223           Dollar         94
Deutschland    171           Umsatz         88
Jahre          141           USA            86

Table 5.1: The top 20 nouns and their frequencies.
The pattern-based extraction of all possible candidates for compounding (which are also ontology class candidates) has shown that, out of a total of 200107 words in the corpus, 19292 words are possible candidates for appearing as part of a compound, and are therefore ontology class candidates. Table 5.1 lists the top 20 nouns and their frequencies in the corpus.
Note that articles, prepositions, pronouns and particles such as der, für, es, doch have already been filtered out and do not appear in the list. Note also that at this processing stage the counted candidate nouns are simply tokens that may be used later in the process of ontology extraction. It would be ideal to count only types, because different forms of the same word yield a single relation type. Not accounting for morphological variation does not introduce real errors, but it does introduce redundancy. We intend to reduce this redundancy in a later step, when we use morphological information to define the classes or the labels of classes.
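The redundancy can be illustrated with the forms Jahr, Jahren and Jahre, which appear as separate entries in Table 5.1. The sketch below merges their counts under a single lemma. The lemma map is a hand-written, hypothetical stand-in for the morphological information mentioned above, and the function name `merge_by_lemma` is likewise an assumption.

```python
from collections import Counter

# Surface-form counts copied from Table 5.1.
surface_counts = Counter({"Jahr": 305, "Jahren": 223, "Jahre": 141, "Bank": 96})

# Hypothetical hand-written lemma map; a real system would obtain this
# from a morphological analyser.
LEMMA = {"Jahren": "Jahr", "Jahre": "Jahr"}


def merge_by_lemma(counts, lemma_map):
    """Sum the counts of all surface forms that share a lemma."""
    merged = Counter()
    for form, n in counts.items():
        merged[lemma_map.get(form, form)] += n
    return merged


merged = merge_by_lemma(surface_counts, LEMMA)
print(merged)
```

With this map, the three surface forms collapse into one class candidate, Jahr, whose count is the sum of the three table entries.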