
Just as with POS tagging, the task of semantic tagging can also be broadly subdivided into two phases (Garside & Rayson, 1997, p. 188):

1) Tag assignment: All potential semantic tags are attached to each word.

2) Tag disambiguation: From this set of potential semantic tags, the contextually appropriate one is selected.

If a word in a text is included in the semantic lexical resources, has only one sense, and is not part of a MWE, tagging it correctly is straightforward for a semantic tagger. Otherwise the task becomes far more difficult, since successful semantic tagging entails both recognizing whether a word stands alone or is part of a MWE and, if the word has more than one sense, identifying which sense is appropriate in the given context.

There are seven procedures which the EST can use for semantic tag disambiguation, in other words, for finding the correct semantic tag for the given sense (Garside & Rayson, 1997, pp. 190–192; Piao, Rayson, Archer, Wilson, & McEnery, 2003):

1) POS tag

The first disambiguation method is POS tagging, already introduced in section 2.2, which takes place prior to semantic tagging and is carried out by the CLAWS POS tagger. By way of illustration, "address" can be either a singular common noun or the base form of a lexical verb:

address NN1 H4 Q2.2

address VV0 Q1.2 Q2.2 A1.1.1

If CLAWS determines that the tag NN1 representing a singular common noun is the relevant grammatical tag, this simplifies the task of the semantic tagger by leaving it with only two candidate semantic tags to choose from: the tag H4 representing the category "Residence" and the tag Q2.2 representing the category "Speech Acts".
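The narrowing effect of the POS tag can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name `candidate_tags` are assumptions made here, not the EST's actual data structures. Only the two "address" entries come from the text above.

```python
# Hypothetical miniature lexicon: (word, POS tag) -> semantic tags in lexicon order.
# The two entries are the ones quoted in the text; the layout is an assumption.
LEXICON = {
    ("address", "NN1"): ["H4", "Q2.2"],
    ("address", "VV0"): ["Q1.2", "Q2.2", "A1.1.1"],
}


def candidate_tags(word, pos_tag):
    """Return the semantic tags still in play once the POS tag is known."""
    return LEXICON.get((word.lower(), pos_tag), [])


noun_candidates = candidate_tags("address", "NN1")
verb_candidates = candidate_tags("address", "VV0")
```

Once the POS tag is fixed, the tags listed under the other POS tag drop out of consideration, so the later disambiguation steps only have to choose among the remaining candidates.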

2) General likelihood ranking for single word and MWE tags

The senses in the semantic lexicon entries have been arranged in frequency order according to information obtained from frequency-based dictionaries, past tagging experience, and the intuition of the compilers. The most frequent and thus most likely semantic tag is placed first, the second most frequent second, and so on. As a consequence, if no other disambiguation method can be applied, the safest choice is the first tag, since it represents the most common sense and is thus most likely to be correct. By way of illustration, the lexicon entry for the noun "mouse" contains the following tags:

mouse NN1 L2mfn Y2 S1.2.3-/S2mf

The tag NN1 is a POS tag assigned by the CLAWS component and indicates a singular common noun. The POS tag is followed by the relevant semantic tags. The first semantic tag, L2, represents the category "Living Creatures Generally", so the first and thus the most common sense is that of a rodent. The second semantic tag, Y2, represents the category "Information Technology and Computing", so here it refers to the pointing device for the computer. The third and the least likely sense is that of a quiet or timid person, which is represented by the semantic tag S1.2.3-/S2mf.
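As a minimal sketch, the fallback described here amounts to taking the first element of the ranked tag list. The names `MOUSE_TAGS` and `default_tag` are hypothetical and chosen only for this illustration.

```python
# Semantic tags for "mouse" (NN1) in lexicon order, most frequent sense first.
MOUSE_TAGS = ["L2mfn", "Y2", "S1.2.3-/S2mf"]


def default_tag(ranked_tags):
    """Fall back on the first-listed (most likely) tag; None if the list is empty."""
    return ranked_tags[0] if ranked_tags else None


fallback = default_tag(MOUSE_TAGS)
```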

3) Overlapping idiom resolution

Normally, MWEs take priority over single word tagging. In other words, the semantic tagger first matches the text against the MWE templates, and if it finds words which match a template and thus together form a MWE, it tags these words as a unit with a single sense. If no suitable MWE template is found, a word is treated as a single word and tagged individually. In some cases, however, MWE templates overlap, so that more than one tagging is possible for the same set of words. To resolve such situations, a set of rules has been developed to determine which of the competing MWE templates is the most likely one and should therefore be favoured. The rules take account of both the length and the span of the MWEs and of how much of the template is matched in each case.
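The text names the criteria (length, span, and how much of the template is matched) but not how they are ordered or combined. The sketch below therefore assumes one possible ordering purely for illustration: more matched words first, then a tighter span, then fuller template coverage. The class `MweMatch`, its fields, and the template names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MweMatch:
    template: str         # hypothetical template name
    start: int            # index of the first matched token
    end: int              # index one past the last matched token
    matched_words: int    # tokens actually matched by the template
    template_length: int  # number of elements in the template

    @property
    def span(self):
        """Number of text tokens covered, including any gaps."""
        return self.end - self.start

    @property
    def coverage(self):
        """Fraction of the template that was matched."""
        return self.matched_words / self.template_length


def preference_key(match):
    # Assumed ordering (not taken from the source): longer match,
    # then tighter span, then fuller template coverage.
    return (match.matched_words, -match.span, match.coverage)


def resolve_overlap(matches):
    """Pick the preferred match among overlapping MWE candidates."""
    return max(matches, key=preference_key)


short = MweMatch("two_word_template", 0, 2, 2, 2)
long_ = MweMatch("three_word_template", 0, 3, 3, 3)
winner = resolve_overlap([short, long_])
```

Expressing the preference as a sort key keeps the criteria in one place, so a different ordering can be tried by changing a single tuple.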

4) Domain of discourse

If the domain or topic of discourse in a given text is known, this information can be used to "weight" tags, in other words, to alter the order of semantic tags in the single word lexicon and MWE lexicon for a particular domain. Taking the noun "mouse" again as an example:

mouse NN1 L2mfn Y2 S1.2.3-/S2mf

If the topic of discourse in the text dealt with computing, it would be sensible to weight the category Y2 ("Information Technology and Computing") to automatically raise its likelihood, since this would be the most likely sense in this context.
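One simple way to picture this weighting is as a reordering that moves the domain's tags to the front of the list while leaving the remaining order intact. The function `reweight_for_domain` and the set `COMPUTING` are assumptions for this sketch. The source does not say how the EST implements the weighting.

```python
def reweight_for_domain(ranked_tags, domain_tags):
    """Move tags belonging to the known domain to the front.

    sorted() is stable, so tags with the same key keep their lexicon order.
    """
    return sorted(ranked_tags, key=lambda tag: tag not in domain_tags)


MOUSE_TAGS = ["L2mfn", "Y2", "S1.2.3-/S2mf"]
COMPUTING = {"Y2"}  # hypothetical tag set for a computing domain

reordered = reweight_for_domain(MOUSE_TAGS, COMPUTING)
```

With the reordered list, the first-tag fallback of procedure 2 would now select Y2 for a text about computing.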

5) Text-based disambiguation

Gale, Church, and Yarowsky (1992, pp. 233–237) carried out experiments with polysemous words to support their hypothesis that well-written discourses tend to avoid multiple senses of polysemous words. Indeed, they discovered that this tendency was as strong as 98%. One of their test words was "sentence", and the same sense repeatedly appeared both in texts which deal with grammar and in texts which deal with the law. If this hypothesis continued to hold in other cases, it would represent an important addition to the methods for determining word senses. This approach has not, as yet, been implemented in the EST, but it resembles the above-mentioned procedure number 4 with the exception that, while in procedure 4 the weighting is adjusted manually, in this approach the weighting would be determined by the program.
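Since this procedure has not been implemented in the EST, the following is only a sketch of how the idea could work: once some occurrences of a word in a text have been tagged with confidence, the dominant tag is propagated to every occurrence of that word in the same text. The function name and the sense labels "GRAMMAR" and "LAW" are placeholders, not real semantic tags.

```python
from collections import Counter


def one_sense_per_discourse(provisional_tags):
    """Assign the dominant sense to every occurrence of a word in one text.

    provisional_tags holds one entry per occurrence; None marks an
    occurrence that earlier procedures could not disambiguate.
    """
    decided = [tag for tag in provisional_tags if tag is not None]
    if not decided:
        # Nothing to propagate; leave the occurrences as they are.
        return list(provisional_tags)
    dominant, _count = Counter(decided).most_common(1)[0]
    return [dominant] * len(provisional_tags)


# Occurrences of "sentence" in a hypothetical legal text.
occurrences = ["LAW", None, "LAW", "GRAMMAR", None]
resolved = one_sense_per_discourse(occurrences)
```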

6) Template rules

The same type of template rules that are written for the identification of MWEs can also be used for detecting certain senses of words. For instance, when the noun "account" occurs in a sequence such as "someone’s account of something", it is very likely to mean "narrative explanation" and not "bank account".

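The "account" example can be approximated with a regular expression. This is a stand-in for illustration only: the source does not show the EST's template syntax, and the pattern below works on raw text rather than on POS-tagged tokens. The names `ACCOUNT_NARRATIVE` and `account_sense` are hypothetical.

```python
import re

# Hypothetical pattern: a possessive noun (straight or curly apostrophe)
# followed by "account of". Possessive pronouns such as "his" are not covered.
ACCOUNT_NARRATIVE = re.compile(r"\b\w+[’']s\s+account\s+of\b", re.IGNORECASE)


def account_sense(sentence):
    """Return the sense suggested by the template, or 'undecided' if it does not fire."""
    if ACCOUNT_NARRATIVE.search(sentence):
        return "narrative explanation"
    return "undecided"


narrative = account_sense("We read the witness's account of the accident.")
other = account_sense("She opened an account at the bank.")
```

When the template does not fire, the word is left for the other procedures to disambiguate, and no default sense is forced on it.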
7) Local probabilistic disambiguation

It is generally supposed that the local surrounding context determines the correct semantic tag for a given word. The surrounding context can be identified in terms of a) the words themselves, b) their grammatical tags, c) their semantic tags, or d) some combination of all three. An application of this method, named the “Domain Detection System”, was developed in the Benedict project, where the most probable sense of a word was calculated by making use of information about the other words in the same sentence. The Domain Detection System is described in more detail in Löfberg et al. (2004).
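The following sketch illustrates option c), using the semantic tags of neighbouring words. It is not the Domain Detection System, whose details are not given here. It is a crude count-based stand-in for a probability estimate, and it assumes that the first letter of a tag identifies its top-level semantic field. The context tag lists are invented for the example.

```python
def field(tag):
    """Top-level semantic field, assumed here to be the tag's first letter."""
    return tag[0]


def pick_by_context(candidates, context_tag_lists):
    """Choose the candidate whose field is shared by the most context words.

    context_tag_lists holds one list of candidate tags per neighbouring word.
    """
    def score(tag):
        return sum(
            any(field(ctx_tag) == field(tag) for ctx_tag in ctx_tags)
            for ctx_tags in context_tag_lists
        )

    # max() returns the first maximal item, so ties fall back on lexicon order.
    return max(candidates, key=score)


MOUSE_TAGS = ["L2mfn", "Y2", "S1.2.3-/S2mf"]
# Invented context: two neighbouring words, one of them with a computing tag.
chosen = pick_by_context(MOUSE_TAGS, [["Y2"], ["A1.1.1"]])
```

Because ties resolve to the earliest candidate, the sketch degrades gracefully to the general likelihood ranking of procedure 2 when the context offers no evidence.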
