POSSIBLE METHODS FOR DETECTING SLIVER POLYGONS IN CARTOGRAPHY
3. APPLICATION OF THE INDICES TO DIFFERENT POLYGON SHAPES
Using a combination of diverse types of lexical chains, we develop a text document representation that can be used for semantic document retrieval. Four different semantic representation models are proposed: (i) Best SynsetID (BSID), (ii) Flexible Lexical Chains (FLC), (iii) Flexible to Fixed Lexical Chains (F2F) and (iv) Fixed Lexical Chains (FXLC). For now, let us define a synset of a word as a set of synonyms for that word, and a hypernym of a word as a set of more general synonyms for that word [48].
In (i) we propose an extension of WSD techniques in which we extract the best semantic representation of a word by considering the influence of its immediately neighboring words. The motivation for this technique is to prevent a word from ending up with an inadequate representation, given its multiple synonyms and the effects of its neighbors.
The second type (ii) uses the previous representation to build variable-sized lexical chains that delineate all concepts in a document. Although the algorithm has its complexities, the underlying idea is simple. Proceeding linearly through the text, we convert each successive word to its semantic representation using (i). As long as succeeding synsets share some semantic similarity, they are placed in the same set (chain); otherwise a new chain is created to capture a new idea. To illustrate the FLC algorithm, consider the sentence “the dog and the cat run with the child and her mom in the park this Summer”. After cleaning the data and applying the Best Synset Disambiguation (BSD) algorithm, we keep only the list {dog, cat, child, mom, park, summer}. The chain starts with BSID(dog) as both its first element and the ID of the chain under construction (the current chain). Next, BSID(cat) is evaluated; it shares the hypernym carnivore with the chain ID, so BSID(cat) is incorporated into the current chain and BSID(carnivore) becomes the chain ID. The chain ID (BSID(carnivore)) is then evaluated against BSID(child), with which it shares the hypernym organism. BSID(child) is incorporated into the current chain and BSID(organism) becomes the new chain ID. Since BSID(mom) also has a hypernym in common (organism) with the current chain, BSID(mom) is incorporated and the chain ID remains unchanged. BSID(park) and BSID(summer) are not incorporated into the chain, as they share no hypernym with it other than WN’s root itself (i.e. entity). They also have no hypernym in common with each other, so each forms its own single-synset chain. The resulting structure is {{dog, cat, child, mom}, {park}, {summer}}, where {organism}, {park} and {summer} are the IDs of the respective flexible chains.
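The walk-through above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `PARENT` table is a hand-made toy hierarchy standing in for WordNet, and the names `path_to_root`, `lowest_common_hypernym` and `flexible_chains` are hypothetical. It assumes that a word joins the current chain whenever it shares any hypernym other than the root with the chain ID.

```python
# Toy hypernym hierarchy (child -> parent). A simplified, hypothetical
# stand-in for WordNet, chosen only to reproduce the example sentence.
PARENT = {
    "dog": "carnivore", "cat": "carnivore", "carnivore": "animal",
    "animal": "organism", "child": "person", "mom": "person",
    "person": "organism", "organism": "entity",
    "park": "location", "location": "entity",
    "summer": "season", "season": "entity",
}
ROOT = "entity"


def path_to_root(synset):
    """Return [synset, parent, ..., ROOT]."""
    path = [synset]
    while path[-1] != ROOT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_hypernym(a, b):
    """Most specific node on both hypernym paths (ROOT if nothing else)."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    return ROOT


def flexible_chains(bsids):
    """Group a linear sequence of synsets into (chain_id, members) pairs."""
    chains = []
    for synset in bsids:
        if chains:
            chain_id, members = chains[-1]
            common = lowest_common_hypernym(chain_id, synset)
            if common != ROOT:
                # Shared hypernym: extend the chain and update its ID.
                members.append(synset)
                chains[-1] = (common, members)
                continue
        # No shared hypernym below the root: start a new chain.
        chains.append((synset, [synset]))
    return chains


print(flexible_chains(["dog", "cat", "child", "mom", "park", "summer"]))
# [('organism', ['dog', 'cat', 'child', 'mom']), ('park', ['park']), ('summer', ['summer'])]
```

The chain ID moves from dog to carnivore to organism exactly as in the example, because each new member is compared with the current chain ID rather than with the previous word.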
In (iii), we develop an algorithm to transform FLC (ii) into chains of fixed size. We want to mitigate the problem of two or more long flexible chains being separated by single-synset-chain occurrences. Every flexible chain in this step has an ID (FLCID) that is assigned to all BSIDs belonging to that chain. For example, consider the flexible chains {{dog, cat, puppy}, {park}, {summer}, {dog, cat, puppy}}, represented by the synset IDs {{animal}, {park}, {summer}, {animal}}. These IDs are propagated to the BSIDs of the original chains, resulting in {{animal, animal, animal}, {park}, {summer}, {animal, animal, animal}}, which is then processed into fixed structures. For this task, we divide the BSIDs, represented by their FLCIDs, into sets of four units. In our example, the new fixed chains have the structure {{animal, animal, animal, park}, {summer, animal, animal, animal}}. Both the first and second chains have the synset animal as the dominant interpretation, so the IDs for these fixed chains are readjusted to {{animal}, {animal}}. In our experiments (Section 6.3), a set size of four provides the most diverse set of chains.
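The F2F conversion described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name `flexible_to_fixed` is hypothetical, "dominant" is read as "most frequent within the chunk", ties are broken by first occurrence, and a final chunk shorter than four is kept as it is. None of these details are specified in the text.

```python
from collections import Counter


def flexible_to_fixed(flexible_chains, size=4):
    """Turn (chain_id, members) pairs into fixed-size chunks of chain IDs.

    Returns the chunks and one dominant ID per chunk.
    """
    # Step 1: propagate each flexible chain's ID to all of its members.
    propagated = [cid for cid, members in flexible_chains for _ in members]
    # Step 2: cut the flat ID sequence into chunks of `size` units.
    chunks = [propagated[i:i + size] for i in range(0, len(propagated), size)]
    # Step 3: the most frequent ID in a chunk becomes the chunk's ID
    # (assumption: ties go to the ID that appears first).
    ids = [Counter(chunk).most_common(1)[0][0] for chunk in chunks]
    return chunks, ids


example = [
    ("animal", ["dog", "cat", "puppy"]),
    ("park", ["park"]),
    ("summer", ["summer"]),
    ("animal", ["dog", "cat", "puppy"]),
]
chunks, ids = flexible_to_fixed(example)
print(chunks)
# [['animal', 'animal', 'animal', 'park'], ['summer', 'animal', 'animal', 'animal']]
print(ids)
# ['animal', 'animal']
```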
Finally, in (iv), we investigate how fixed lexical structures can be derived directly from a document’s semantic representation (i). In this algorithm we divide the BSIDs of every document into chunks of size n (cn) and determine which synset best represents each chunk. As in the previous approach (iii), a size of four synsets is chosen so that both techniques can be compared more directly. For each chain cn, we extract all hypernyms (including the initial synsets) from all the BSIDs in the chunk and select the dominant synset to represent the entire chain. If there is no dominant BSID, we select the deepest one in the chain, using the root of WordNet as the starting point. If more than one synset qualifies, one is selected at random, since any of them could represent the given chain. Note that hypernyms beyond a certain threshold are not considered in our approach. The closer to the root we get, the more common our synsets become, contributing poorly to the semantic diversity of our chains. Therefore, hypernyms with depth below five [71] are discarded.
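A minimal sketch of this chunk-labelling step is given below. It is an interpretation, not the authors' implementation: the toy `PARENT` hierarchy and all function names are hypothetical, "dominant" is read as "most frequent among the pooled synsets and hypernyms", and the depth filter is applied to every pooled node, including the initial synsets. Because the toy hierarchy is very shallow, the example uses `min_depth=1` instead of the threshold of five used on WordNet.

```python
import random
from collections import Counter

# Toy hypernym hierarchy (child -> parent); a hypothetical stand-in for WordNet.
PARENT = {
    "dog": "carnivore", "cat": "carnivore", "carnivore": "animal",
    "animal": "organism", "child": "person", "mom": "person",
    "person": "organism", "organism": "entity",
    "park": "location", "location": "entity",
    "summer": "season", "season": "entity",
}
ROOT = "entity"


def path_to_root(synset):
    """Return [synset, parent, ..., ROOT]."""
    path = [synset]
    while path[-1] != ROOT:
        path.append(PARENT[path[-1]])
    return path


def depth(synset):
    """Number of edges between the synset and the root (root has depth 0)."""
    return len(path_to_root(synset)) - 1


def chunk_id(chunk, min_depth, rng):
    """Pick the synset that represents one chunk of BSIDs."""
    counts = Counter()
    for synset in chunk:
        for node in path_to_root(synset):
            if depth(node) >= min_depth:  # drop nodes too close to the root
                counts[node] += 1
    if not counts:
        return None
    top = max(counts.values())
    candidates = [n for n, c in counts.items() if c == top]
    # No single dominant synset: prefer the deepest, then pick at random.
    deepest = max(depth(n) for n in candidates)
    candidates = [n for n in candidates if depth(n) == deepest]
    return rng.choice(candidates)


def fixed_chains(bsids, size=4, min_depth=1, seed=0):
    """Label each chunk of `size` BSIDs with its representative synset."""
    rng = random.Random(seed)
    return [chunk_id(bsids[i:i + size], min_depth, rng)
            for i in range(0, len(bsids), size)]


labels = fixed_chains(["dog", "cat", "child", "mom", "park", "summer"])
print(labels)  # first label is 'organism'; the second is 'park' or 'summer'
```

In the first chunk organism is the only node shared by all four synsets, so it dominates. In the second chunk every node occurs once, so the choice falls to the deepest candidates (park and summer) and is made at random.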
All these approaches are used to construct high-dimensional vectors corresponding to the document’s semantic structure, which are compared with traditional techniques, such as BOW and tf-idf. In the next sections, all cases related to the proposed algorithms will be discussed in detail.
6.2.1 Best Synset Disambiguation Algorithm (BSD)
Prior to constructing lexical chains, we need to capture the most adequate representation for the meaning of words in a document. This is done using and extending WSD algorithms. The product of this task provides what we call a BSID, a higher level of abstraction for all the words in the document, which will be used to build our lexical chains. For this, we follow the definition of lexical chains [118]. The terms used to build our lexical chains are represented through the most suitable semantic value of a word, also known as BSID.
The semantic representation of words is obtained using a lexical database, which in our case is WordNet [48]. WordNet provides a complex structure for words and their relationships through several different semantic hierarchies. The following is a brief summary of the WordNet definitions needed to understand our work on WSD and lexical chains:
• Lemma - the lowercase word found in WordNet structure. The base form of a word;
• Gloss - consists of a brief definition or sentence use of a synset;
• Synset - a set of cognitive synonyms (one or more) of a given word that share common meaning;
• Synset ID - a unique ID that represents the entire synset;
• Sense - the elements in each synset;
• Hypernym - a more general abstraction of a synset, corresponding to an a-kind-of relationship. A human is a kind of mammal, so mammal is the hypernym;
• Hyponym - a more specific abstraction of a synset, the opposite of a hypernym;
• Meronym - constitutes a “part of” relationship. A hand is part of an arm;
• Least Common Subsumer - the most specific synset in the hypernym hierarchy that is an ancestor of the given synsets;
• Root - initial synset in WN, called entity.
Most publications in the lexical chains field try to build these structures considering only the words within the document; some use an auxiliary annotated corpus for learning, while others use the most common synset for each word (i.e. the first synset in WordNet for each word). Our approach considers the effects of the immediate neighbors of each evaluated term wi, using all synsets available in the structure and their hypernyms. For each word wi, with i = 1, 2, . . . , n, there are zero or more synsets available in WN. In our approach, only nouns within WordNet are considered, so the remaining words are discarded. The version of WordNet used in this project (3.0) has approximately 117,000 synsets, divided into four major categories: 81,000 noun synsets, 13,600 verb synsets, 19,000 adjective synsets, and 3,600 adverb synsets. Since noun synsets make up almost 70% of all synsets in WordNet, we choose to work with this category.
In addition, nouns allow us to explore interesting relationships between synsets, such as: hypernyms, hyponyms, meronyms, etc.
We represent the BSID of a word wi by analyzing the effects of its predecessor (wi−1) and successor (wi+1). The resulting synsets are called the Former SynsetID, FSID(wi), and the Latter SynsetID, LSID(wi), respectively. FSID(wi) and LSID(wi) are selected based on the scores obtained over all possible combinations of the available synsets in the pairs (wi, wi−1) and (wi, wi+1). The synsets of wi with the highest similarity value in comparison with wi−1 and wi+1 become FSID(wi) and LSID(wi), respectively. Algorithm 5 illustrates how FSID(wi) and LSID(wi) are extracted.
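The selection described in this paragraph can be illustrated with a small sketch. It is not a reproduction of Algorithm 5: the sense inventory `SENSES`, the toy hierarchy `PARENT`, and the path-based `similarity` function are all hypothetical stand-ins, since the similarity measure is not fixed in this passage. Returning `None` for a missing neighbor at the start or end of the text is also an assumption.

```python
from itertools import product

# Hypothetical sense inventory: word -> candidate synsets.
SENSES = {
    "river": ["river"],
    "bank": ["bank_river", "bank_finance"],
    "loan": ["loan"],
}
# Toy hypernym hierarchy (child -> parent); a stand-in for WordNet.
PARENT = {
    "river": "stream", "stream": "geo_formation",
    "bank_river": "slope", "slope": "geo_formation",
    "geo_formation": "entity",
    "bank_finance": "institution", "institution": "finance",
    "loan": "debt", "debt": "finance", "finance": "entity",
}
ROOT = "entity"


def path_to_root(synset):
    """Return [synset, parent, ..., ROOT]."""
    path = [synset]
    while path[-1] != ROOT:
        path.append(PARENT[path[-1]])
    return path


def similarity(a, b):
    """Toy path similarity: 1 / (1 + edges on the shortest path a..b)."""
    dist_a = {node: i for i, node in enumerate(path_to_root(a))}
    for j, node in enumerate(path_to_root(b)):
        if node in dist_a:
            return 1.0 / (1 + dist_a[node] + j)
    return 0.0


def best_sense(word, neighbor):
    """Sense of `word` in the highest-scoring (word, neighbor) sense pair."""
    pairs = product(SENSES[word], SENSES[neighbor])
    return max(pairs, key=lambda pair: similarity(*pair))[0]


def fsid_lsid(words, i):
    """Return (FSID, LSID) for words[i]; None where a neighbor is missing."""
    fsid = best_sense(words[i], words[i - 1]) if i > 0 else None
    lsid = best_sense(words[i], words[i + 1]) if i + 1 < len(words) else None
    return fsid, lsid


print(fsid_lsid(["river", "bank", "loan"], 1))
# ('bank_river', 'bank_finance')
```

The example shows why two IDs are kept per word: the predecessor river pulls bank towards its geographic sense, while the successor loan pulls it towards its financial sense.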
Algorithm 5 Former SynsetID (FSID) and Latter SynsetID (LSID) extraction through