1.3. CONTAMINACIÓN Y PÉRDIDA DE SUELOS
1.3.1. PROCESOS DE DEGRADACIÓN
1.3.1.4. Efecto de los metales pesados para los ecosistemas y la salud
According to O’Shea et al. (2013), there are at least 50 measures created between 2004 and 2012, including improvements of existing measures. A number of these measures are proposed based on STASIS (Ferri et al., 2007) and on LSA (Jin and Chen, 2008). Example of these are WordNet based measure (Kennedy and Szpakowicz, 2008; Quarteroni and Manandhar, 2008), or thesaurus-based measures (Inkpen 2007; Kennedy and Szpakowicz, 2008). Other technique measures include: TF*IDF variants (Kimura et al., 2007); a measure based on string similarity (Islam, 2008); other cosine measures (Yeh et al., 2008); Jaccard coefficient and other word overlap measures (Fattah and Ren, 2009); grammar based measures (Achananuparp et al., 2008); graph or tree based measures (Barzilay and McKeown, 2005); concept expansion (Sahami and Heilman, 2006); and a directional relation between text fragments measure (Corley et al., 2007). However, this section is focused on two significant sentence similarity measures: Latent Semantic Analysis (LSA)
31
and Sentence Similarity, based on Semantic Nets and Corpus Statistics (STASIS). Also presented is a review of a non-English sentence similarity measure.
2.3.1
Latent Semantic Analysis (LSA)
Latent Semantic Analysis is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text (Landauer et al., 1998). There are two stages of LSA. First, a matrix of words is created based on the number of times a word appeared in a specific context; the word order in a sentence is not taken into consideration (Landauer et al., 1998). Second, the application of Singular Value Decomposition (SVD) is used to decompose the word matrix to reduce its size. SVD is a mathematical matrix decomposition technique that reduces the dimensional representation of the word matrix by trying to keep the entries that have the strongest relationship between the words and their occurrences in sentences. As LSA does not take the word order and the word with polysemy (the coexistence of many possible meanings for a word), this might cause an inability to analyze the sentence correctly.
2.3.2
Sentence Similarity based on Semantic Nets and Corpus
Statistics (STASIS)
STASIS (Li et al, 2006) uses three elements for the determination of sentence similarity: word similarity; statistical information such as word frequency; and word order similarity. Figure 2.2 shows an overview of STASIS.
32 Construction of the Joint Word Set
Equation 2.3 describes a joint word set T derived from all unique words in two sentences:
T1 and T2.
∪ = , , ⋯ , Equation 2.3
Giving two sentences T1 and T2 a joint word set is formed using Equation 2.3:
: The lion is the king of the jungle. : Lion is a mammal.
A joint word set, T is
T= {The lion is the king of the jungle a mammal}
Formation of the Lexical Semantic Vectors
The vector derived from the joint word set, called the ‘lexical semantic vector’, denoted by
š. š , is derived from the joint word set for each short text, and m equals the number of
words in the joint word set. Each entry, ̌ (where i=1, 2, ..., m) is determined by the semantic similarity of the corresponding word in the joint word set to a word in the sentence.
For each word in the joint set, there are two possible cases to process when the joint set is scanned.
• Case 1: ̌ is set to 1, if wi appears in the sentence,
• Case 2: if is not contained in T1, a semantic similarity score is computed between
and each word in the short text T1, using the word measure described in Li et al.
(2003). The most similar word in T1 to is that with the highest similarity score. If
the highest score exceeds a preset threshold, then ̌ is equal the highest score; if not, ̌ is 0.
Equation 2.4 shows how the words are weighted according to their information content (Resnik, 1999), on the assumption that word frequency influences the contribution of the individual words to the overall similarity. Entropy measures are calculated using the Brown Corpus (Francis, 1979):
33
Given that is a word in the joint word set, and ( is its associated word in the sentence, # is the information content of in the corpus. The value of # can be [0,1] and it is defined as:
# = )*+ , -.
)*+ /0 Equation 2.5
Where 1 is the probability of a word wi, and N is the total number of words in the
corpus, 1 can be calculated as:
1 = /020 Equation 2.6
where n is the word frequency of the word w in the corpus.
Calculation of the Semantic Similarity component
Lastly, the semantic similarity between T1 and T2 (Ss) is calculated using the cosine
similarity measure between two vectors, as shown in Equation 2.7:
S4 = ‖4455‖×‖4×466‖ Equation 2.7 Formation of the Word Order Vectors
Given two sentences T1 and T2, the following sentence pair illustrates the importance of
word order:
T1: The male lion kills the poor tiger.
T2: The male tiger kills the poor lion.
Then, using Equation 2.1 to create joint word set:
T = {The male lion kills the poor tiger}
A unique index number is assigned for each word in T1 and T2 by the order of the word that
appears in the sentence. For instance, in T1 for the word lion, the index number is 3 and 6
for poor. A word vector 8 and 8 is creating based on the joint word set. From the T1 and
T2, the vectors 8 and 8 are produced as:
8 : {1 2 3 4 5 6 7} 8 : {1 2 7 4 5 6 3}
34
Calculation of the Word Order Similarity Component
Word order similarity (Sr) is calculated as shown in Equation 2.7:
S9 1 −‖;‖;5 96‖
5096‖ Equation 2.7
Calculation of Overall Sentence Similarity
Finally, the overall similarity between two sentences S , is calculated using two components !< and !; as in Equation 2.8:
S , = =!<+ 1 − = !; Equation 2.8
where δ is a constant in the range 0.5 <δ < 1 which adjusts the relative contributions of semantic and word order.
2.3.3
Sentence Semantic Similarity Measure in Other Languages
There are a number of non-English sentence measures including in Chinese, Malay, and Arabic. The recent Chinese sentence similarity measure is proposed by Ru Li (2009); the measure calculates the similarity between two Chinese sentences based on Chinese FrameNet (You, 2005) and Chinese Dependency Graph as its knowledge. The measure calculates the similarity by distance between two frames in Chinese FrameNet; also another component of similarity is obtained by looking from a Chinese Dependency Graph. Then, the similarity between two Chinese sentences is combined from those two components.A sentence similarity measure for two Malay sentences is proposed by Noah (2007). The measure uses the method of STASIS (Li et al., 2006) to produce the sentence similarity between two Malay sentences. Although there was no short text benchmark dataset available at that time, Noah (2007) claims that the experiment has shown consistent and encouraging results which indicate the potential use of this modified approach.
An Arabic sentence similarity measure (Almarsoomi, 2013) is under development, and the measure is also derived from STASIS.
35