The key idea proposed in C-HTS is to perform the segmentation of text based on the semantic relatedness between its blocks. As discussed in section 4.3, C-HTS uses explicit semantic analysis (ESA) to measure the semantic relatedness between text blocks using Wikipedia as its knowledge base. In this research, the experimental results reported in section 4.5.3 demonstrated the efficiency of C-HTS in building a hierarchical structure out of textual documents and its competitive performance against state-of-the-art approaches.
6 https://en.wikipedia.org/wiki/List_of_Wikipedias [Accessed: April 08, 2018]
7 https://en.wikipedia.org/wiki/German_Wikipedia [Accessed: April 08, 2018]
To validate the efficacy of using Wikipedia as the underlying knowledge base for the conceptual representation of text in C-HTS, an experiment was carried out in which the WordNet thesaurus (Miller, 1995) was used as the underlying knowledge base for the semantic representation of text (phase 2 in C-HTS, section 4.4.2).
Additionally, to validate the efficacy of using the explicit semantic representation of text rather than its lexical representation, another experiment was carried out in which C-HTS measured the lexical similarity, rather than the semantic relatedness, between text constituents.
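To make the baseline of both experiments concrete, the following is a minimal sketch of ESA-style relatedness between two text blocks. It is not the C-HTS implementation: `CONCEPT_INDEX`, its concept names and its weights are made up for illustration, whereas in ESA the concepts are Wikipedia articles and the weights are derived from the article text (typically TF-IDF scores).

```python
import math
from collections import Counter, defaultdict

# Hypothetical word -> {concept: weight} index (illustrative values only).
# In ESA the concepts would be Wikipedia articles.
CONCEPT_INDEX = {
    "car":        {"Automobile": 0.9, "Engine": 0.4},
    "automobile": {"Automobile": 1.0},
    "engine":     {"Engine": 0.9, "Automobile": 0.3},
    "motor":      {"Engine": 0.8},
    "poem":       {"Poetry": 1.0},
    "verse":      {"Poetry": 0.7},
}


def concept_vector(text):
    """Represent a text block as the sum of its words' concept vectors."""
    vector = defaultdict(float)
    for word, count in Counter(text.lower().split()).items():
        for concept, weight in CONCEPT_INDEX.get(word, {}).items():
            vector[concept] += count * weight
    return dict(vector)


def cosine(u, v):
    """Cosine between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relatedness(block_a, block_b):
    """Semantic relatedness of two text blocks in concept space."""
    return cosine(concept_vector(block_a), concept_vector(block_b))
```

With this toy index, `relatedness("car engine", "automobile motor")` is high even though the two blocks share no words, while `relatedness("car engine", "poem verse")` is zero because the blocks map to disjoint concepts.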
4.7.1 Semantic Similarity using WordNet
WordNet⁸ (Miller, 1995) is a broad-coverage lexical network of English words. Nouns, verbs, adjectives, and adverbs are each organised into networks of synonym sets (called synsets) that each represent one underlying lexical concept and are interlinked with a variety of relations (Budanitsky and Hirst, 2006). Over time, different versions of WordNet have been proposed that cover languages other than English, such as EuroWordNet⁹ (Vossen, 1998), which covers several European languages (Italian, Spanish, etc.), and GermaNet¹⁰ (Hamp and Feldweg, 1997), which covers the German language. Various NLP approaches have relied on WordNet as their source for the semantic representation of text (Stokes, Carthy and Smeaton, 2004; Lu et al., 2015). However, as discussed earlier (section 4.1), lexical resources such as WordNet offer limited information about the different word representations. Furthermore, such resources cover only a small fragment of the language lexicon.
To assess this assumption, an experiment was carried out in which WordNet was used as the underlying knowledge base for C-HTS. Four different concept similarity metrics were used in this experiment:
8 https://wordnet.princeton.edu/ [Accessed: March 28, 2018]
9 http://projects.illc.uva.nl/EuroWordNet/ [Accessed: April 08, 2018]
10 http://www.sfs.uni-tuebingen.de/GermaNet/ [Accessed: April 08, 2018]
1- Path similarity (Rada et al., 1989): computes the length, in edges, of the shortest path from one word sense to another in the WordNet hierarchical structure. Using edge counting (section 3.2.2.1), the distance between two disjunctive sets of concepts is defined as the minimum path length from any element of the first set to any element of the second.
2- Leacock-Chodorow similarity (LCH) (Leacock and Chodorow, 1998): uses the same shortest path as Path similarity, but takes the negative logarithm of the path length scaled by twice the maximum depth of the taxonomy.
3- Wu-Palmer similarity (WUP) (Wu and Palmer, 1994): also based on the hierarchy, but instead of counting edges alone it takes into account the depth of the two concepts and of their most specific common ancestor (section 3.2.2.1).
4- Lesk similarity (Lesk, 1986): defines the similarity between two concepts as a function of the overlap between their definitions, as provided by a dictionary such as WordNet.
The WS4J library¹¹ (Shima, 2014) was used in this experiment.
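The four measures can be illustrated on a toy taxonomy. The sketch below is not WS4J and does not query WordNet: the `PARENT` hierarchy and the `GLOSS` definitions are hypothetical, the path score uses the common 1 / (1 + length) convention, and the LCH variant counts nodes rather than edges so that identical concepts do not produce log(0). Real implementations differ in such details.

```python
import math

# Hypothetical is-a taxonomy (child -> parent); illustrative only.
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}

# Hypothetical dictionary definitions for the Lesk overlap.
GLOSS = {
    "dog": "domesticated animal that barks",
    "cat": "domesticated animal that purrs",
    "tree": "tall woody plant",
}


def ancestors(node):
    """Chain from a node up to the root, starting with the node itself."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Number of nodes from the root down to this node (root has depth 1)."""
    return len(ancestors(node))


def lowest_common_subsumer(a, b):
    """Most specific concept that is an ancestor of both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("concepts do not share a root")


def path_length(a, b):
    """Edges on the shortest path between a and b (through their subsumer)."""
    lcs = lowest_common_subsumer(a, b)
    return (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))


def path_similarity(a, b):
    return 1.0 / (1 + path_length(a, b))


MAX_DEPTH = max(depth(node) for node in PARENT)


def lch_similarity(a, b):
    # Negative log of the path length (in nodes) scaled by twice the depth.
    return -math.log((path_length(a, b) + 1) / (2.0 * MAX_DEPTH))


def wup_similarity(a, b):
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


def lesk_overlap(a, b):
    """Number of distinct words shared by the two definitions."""
    return len(set(GLOSS[a].split()) & set(GLOSS[b].split()))
```

In this toy hierarchy all four measures rank the pair (dog, cat) above the pair (dog, tree), because the first pair shares the more specific ancestor "animal" and has overlapping definitions.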
Table 4.2 shows that C-HTS performs better with Wikipedia as its knowledge base than with WordNet, regardless of which relatedness measure is used with WordNet. This suggests that a large knowledge base built from the collaborative work of hundreds of thousands of people serves C-HTS better than a more limited knowledge base such as WordNet.
Table 4.2 Comparison between different similarity measures using WordNet in C-HTS

Measure          Level         Moonstone   Wikipedia
Wikipedia ESA    3 (top)       0.320       0.330
                 2 (middle)    0.507       0.397
                 1 (bottom)    0.488       0.402
WordNet Path     3 (top)       0.393       0.385
                 2 (middle)    0.523       0.412
                 1 (bottom)    0.523       0.421
WordNet LCH      3 (top)       0.393       0.385
                 2 (middle)    0.525       0.410
                 1 (bottom)    0.520       0.422
WordNet WUP      3 (top)       0.397       0.378
                 2 (middle)    0.523       0.412
                 1 (bottom)    0.522       0.424
WordNet Lesk     3 (top)       0.375       0.377
                 2 (middle)    0.508       0.411
                 1 (bottom)    0.536       0.420

11 https://github.com/Sciss/ws4j/ [Accessed: January 22, 2018]
4.7.2 Lexical Representation
Lexical representation has been widely used in the literature on text segmentation (Hearst, 1994; Choi, 2000). As its name suggests, this approach splits text into segments based on the words that the segments share with each other. Lexical cohesion refers to the connectivity between two portions of text in terms of word relationships, and it relies mainly upon endogenous knowledge extracted from the documents themselves. However, text segmentation approaches that rely upon lexical similarity between text blocks fail to recognise related segments that do not share words with each other. Hence, C-HTS employs the semantic relatedness between text blocks to capture more of the meaning behind the text.
To assess the efficacy of using the semantic representation of text in C-HTS, an experiment was carried out in which the lexical similarity between text blocks, computed from their lexical representation, replaced the semantic relatedness between them (second phase in C-HTS, section 4.4.2). Four different lexical similarity measures were used in this experiment:
1- Cosine similarity (Singhal, 2001): a basic measure often used in information retrieval. It weights words according to their term-frequency scores and computes the cosine between the two text vectors.
2- A string distance metric such as the Levenshtein distance (Levenshtein, 1966): measures the similarity between two given strings based on the distance between them. The distance is the number of deletions, insertions, or substitutions required to transform the first string into the second.
3- Monge-Elkan measure (Monge and Elkan, 1996): a simple but effective method for measuring the similarity between two strings containing multiple tokens, using an internal similarity between tokens. For each token of the first string it takes the similarity to the most similar token of the second string, and averages these values.
4- Longest Common Subsequence (LCS) (Allison and Dix, 1986): the longest sequence of characters that two texts have in common when gaps between the characters are allowed.
The DKPro Similarity framework¹² (Bär et al., 2013) was used in this experiment.
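The four lexical measures can be sketched in a few lines each. This is an illustrative sketch rather than the DKPro Similarity implementation: tokenisation is plain whitespace splitting, the cosine uses raw term frequencies, and the inner token similarity for Monge-Elkan (a length-normalised Levenshtein similarity) is an assumption chosen for simplicity.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine between the raw term-frequency vectors of two texts."""
    tf_a, tf_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a)
    norm = (math.sqrt(sum(c * c for c in tf_a.values()))
            * math.sqrt(sum(c * c for c in tf_b.values())))
    return dot / norm if norm else 0.0


def levenshtein(s, t):
    """Minimum number of deletions, insertions and substitutions."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_s != char_t),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s, t):
    """Edit distance mapped to [0, 1]; 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def monge_elkan(a, b, inner=levenshtein_similarity):
    """Average, over tokens of a, of the best inner similarity to a token of b."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(inner(x, y) for y in tokens_b) for x in tokens_a]
    return sum(best) / len(tokens_a)


def lcs_length(s, t):
    """Length of the longest common subsequence (gaps allowed)."""
    previous = [0] * (len(t) + 1)
    for char_s in s:
        current = [0]
        for j, char_t in enumerate(t, start=1):
            if char_s == char_t:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]
```

For example, `cosine_similarity("car engine", "automobile motor")` is zero because the two texts share no words, which is the limitation of lexical similarity discussed above. Note that `monge_elkan` is not symmetric: swapping its arguments can change the result.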
Table 4.3 shows that the explicit semantic representation of text (ESA) outperforms the lexical representation approach across all similarity measures and segmentation levels, and on both datasets. This is not surprising, as lexical representation approaches can process only the information that they can ‘see’, whereas an explicit semantic representation of text allows an NLP task (e.g. segmentation) to reason about text using knowledge extracted from a massive knowledge base such as Wikipedia.
Table 4.3 Comparison between different coherency measures used with C-HTS