
The core idea of C-HTS is to use an external knowledge base to enrich text representations so that the semantic relatedness between terms, and thus between sentences, can be measured and then utilised in hierarchical text segmentation. The purpose of measuring semantic relatedness is to allow computers to reason about text. Various approaches have been proposed in the literature to measure the semantic relatedness between terms using an external knowledge source. Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007) is a method that represents meaning in a high-dimensional space of concepts, automatically derived from human-built knowledge repositories such as Wikipedia. ESA defines concepts from Wikipedia articles, e.g. ALBERT EINSTEIN and COMPUTER SCIENCE. A target term is represented as a vector of Wikipedia concepts, weighted by how the term is mentioned in each concept’s article. The relatedness between two target terms is then calculated as the cosine similarity between their vectors (see next section for more details).

Another approach that uses the link structure of Wikipedia to measure semantic relatedness is the Wikipedia Link-based Measure (WLM) (Witten and Milne, 2008). WLM measures the relatedness between two terms using the links found within their corresponding Wikipedia articles rather than using the articles’ textual content.


The rationale for using explicit semantic relatedness is that it relies on a knowledge base that is built and continuously maintained by humans. The knowledge base used in this research is Wikipedia, the largest and fastest growing encyclopaedia in existence. This knowledge base is a collaborative effort that combines the knowledge of hundreds of thousands of people. In this research, ESA is used as the approach for measuring the semantic relatedness between text segments. ESA has been widely used in a variety of tasks, such as semantic relatedness calculation (Gurevych et al., 2007), concept-based information retrieval (Egozi et al., 2011; Jungwirth & Hanbury, 2018) and text classification (Chang et al., 2008). ESA has also been shown to be effective compared with approaches that do not rely on explicit knowledge bases.

4.3.1 How does Explicit Semantic Analysis work?

As mentioned above, ESA relies on a concept space built from a knowledge base, such as Wikipedia, to measure the semantic relatedness between two terms (or text blocks). In Wikipedia-based ESA, a given word is described by a vector which stores the word’s association strengths to Wikipedia-derived concepts. A concept is a Wikipedia article (e.g. ALBERT EINSTEIN). This concept is represented as a vector of the terms which occur in that article. Each term in that vector is assigned a weight using the tf-idf scheme (Salton & McGill, 1986). These weights quantify the strength of association between terms and concepts. After generating terms from the concept article, an inverted index is created that maps each term to a list of concepts in which this term appears. Thus, each word appearing in the Wikipedia corpus can be seen as triggering each of the concepts it points to in the inverted index, with the attached weight representing the degree of association between that word and the concept. The name, Explicit Semantic Analysis, stems from the way vectors are comprised of concepts that are manually defined, as opposed to the mathematically derived contexts used by Latent Semantic Analysis.

The processing of Wikipedia articles and the building of the concept space are depicted in Figure 4.1. In this example, terms such as “market”, “trade” and “property” are extracted from the Wikipedia article Economy. Each of these terms is indexed in a database and mapped to a list of concepts (articles) in which this term appears, along with the tf-idf score of the term in that article. For example, one of the concepts that the term “market” is mapped to is “Bazaar”, with a score of 0.72. This means that the word “market” appears in the article “Bazaar” with a tf-idf score of 0.72.


Figure 4.1 The process of generating an ESA model from Wikipedia articles (Egozi et al., 2011).
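The indexing step described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this research: the three toy "articles", the whitespace tokenisation and the particular tf-idf variant (term frequency normalised by article length, multiplied by the logarithm of the inverse document frequency) are all assumptions made for the example, and the function name build_concept_index is hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_concept_index(articles):
    """Build an inverted index: term -> {concept: tf-idf weight}.

    `articles` maps a concept name to the text of its article.
    """
    tokenised = {concept: text.lower().split() for concept, text in articles.items()}
    n_docs = len(tokenised)

    # Document frequency: in how many concept articles does each term occur?
    doc_freq = Counter()
    for tokens in tokenised.values():
        doc_freq.update(set(tokens))

    index = defaultdict(dict)
    for concept, tokens in tokenised.items():
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)                  # assumed tf variant
            idf = math.log(n_docs / doc_freq[term])   # assumed idf variant
            weight = tf * idf
            if weight > 0:  # a term found in every article carries no signal
                index[term][concept] = weight
    return dict(index)


# Toy stand-ins for Wikipedia articles (invented text, for illustration only).
articles = {
    "ECONOMY": "market trade property market goods",
    "BAZAAR": "market goods street market market",
    "PHYSICS": "energy matter property motion",
}
index = build_concept_index(articles)

# Each term now "triggers" the concepts whose articles mention it.
for concept, weight in sorted(index["market"].items()):
    print(concept, round(weight, 3))
```

In this toy corpus the term "market" is mapped to both ECONOMY and BAZAAR, with the larger weight on BAZAAR because the term makes up a larger share of that article.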

After building such a concept space, each input term in a text processing task (e.g. segmentation) can be represented as a vector of the concepts that the term is associated with, accompanied by the degree of association between the term and each concept. The semantic relatedness between two given terms is measured by computing the cosine similarity between the concept vectors of the two terms. For larger text fragments (a sentence or a paragraph), a concept vector is retrieved for each term in the fragment, and the semantic relatedness between two text fragments is then measured by computing the cosine similarity between the centroids of the vectors representing the two fragments. The centroid vector of a text fragment is built based on ranking all the Wikipedia concepts by their relevance to the fragment (Han and Karypis, 2000). Figure 4.2 illustrates the semantic interpretation of two given texts and how the semantic relatedness between their centroid vectors is measured. Given a text fragment (sentence or paragraph), the fragment is represented as a vector using tf-idf. For each term in this text fragment, a vector of corresponding entries from the inverted index (the concept space) is retrieved. The retrieved vectors are merged into a weighted vector of concepts that represents the given text. Let $S$ be the set of terms in the input text fragment after removing stop words. Let $\vec{t}$ be the vector of weights for concepts associated with term $t$ in the concept space. The centroid vector $\vec{C}$ is defined as:

\[ \vec{C} = \frac{1}{|S|} \sum_{t \in S} \vec{t} \]


where $|S|$ is the number of terms in $S$, which is used for normalisation in order to account for text units of different lengths. The relatedness between the centroid vectors $\vec{C}_i$ and $\vec{C}_j$ of two text fragments is computed using the cosine measure:

\[ \cos(\vec{C}_i, \vec{C}_j) = \frac{\vec{C}_i \cdot \vec{C}_j}{\|\vec{C}_i\| \, \|\vec{C}_j\|} \qquad (4.2) \]
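The centroid and cosine computations can be illustrated with a short Python sketch. It assumes that concept vectors are stored sparsely as dictionaries from concept name to weight; the function names centroid and cosine and the weights in the toy concept index are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from collections import defaultdict


def centroid(terms, concept_index):
    """Average the concept vectors of the terms in a fragment.

    `terms` is the fragment's terms after stop-word removal (the set S);
    `concept_index` maps a term to its {concept: weight} vector.
    """
    term_set = set(terms)
    if not term_set:
        return {}
    total = defaultdict(float)
    for term in term_set:
        # Terms missing from the concept space contribute a zero vector.
        for concept, weight in concept_index.get(term, {}).items():
            total[concept] += weight
    return {concept: weight / len(term_set) for concept, weight in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


# Toy concept space with invented weights (not real tf-idf scores).
concept_index = {
    "market": {"ECONOMY": 0.5, "BAZAAR": 0.7},
    "trade": {"ECONOMY": 0.6},
    "shop": {"BAZAAR": 0.4},
}

c_i = centroid(["market", "trade"], concept_index)
c_j = centroid(["shop"], concept_index)
print(c_i)
print(round(cosine(c_i, c_j), 3))
```

Because the cosine measure is invariant to scaling, the division by the number of terms does not change the relatedness of a single pair of fragments; it keeps the centroid weights comparable across fragments of different lengths.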

Figure 4.2 Semantic interpretation of two text units using ESA (Gabrilovich and Markovitch, 2007)

To elaborate on the notion of semantic relatedness using ESA, consider the two sentences in the example mentioned earlier in section 4.2:

Albert Einstein is a German scientist who was born on the 14th of March 1879.

Mileva Marić was born on December 19, 1875 into a wealthy family in Titel, Serbia.

After applying morphological analysis (see section 4.4.1) to the two sentences, each remaining term in each sentence is mapped to a vector of concepts from the concept space. Each sentence is then represented as the centroid of the vectors of the sentence’s terms (Han & Karypis, 2000). For the first sentence, the centroid of the vectors contains the following concepts (among other concepts):


ALBERT EINSTEIN AWARD

THE EVOLUTION OF PHYSICS

HANS ALBERT EINSTEIN  (second child and first son of Albert Einstein and Mileva Marić)

ELSA EINSTEIN  (the second wife of Einstein)

And the centroid of the vectors of the second sentence contains the following concepts (among other concepts):

MILEVA MARIĆ

HANS ALBERT EINSTEIN

ELSA EINSTEIN

EINSTEIN FAMILY

From these vectors, we can see that the concept vectors of the two sentences have concepts in common, and measuring the cosine similarity between them (Equation 4.2) shows that, although the two sentences are not lexically similar, they are semantically related to each other.
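As a rough numerical illustration of this example, the sketch below assigns weights to the concepts listed for each sentence and applies the cosine measure. The weights are invented purely for illustration; in ESA they would come from the tf-idf scores stored in the concept space, and the full centroids would contain many more concepts than the four shown for each sentence.

```python
import math

# Hypothetical weights for the concepts listed above (not real ESA scores).
first_sentence = {
    "ALBERT EINSTEIN AWARD": 0.4,
    "THE EVOLUTION OF PHYSICS": 0.3,
    "HANS ALBERT EINSTEIN": 0.5,
    "ELSA EINSTEIN": 0.2,
}
second_sentence = {
    "MILEVA MARIĆ": 0.6,
    "HANS ALBERT EINSTEIN": 0.5,
    "ELSA EINSTEIN": 0.3,
    "EINSTEIN FAMILY": 0.4,
}


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Only concepts present in both vectors contribute to the dot product.
shared_concepts = set(first_sentence) & set(second_sentence)
relatedness = cosine(first_sentence, second_sentence)

print(sorted(shared_concepts))
print(round(relatedness, 3))
```

With these assumed weights, the two shared concepts alone yield a relatedness well above zero, which is the effect the example is meant to convey.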