This chapter presented an overview of Natural Language Processing (NLP) techniques and a state-of-the-art review of existing approaches to text segmentation as a technique for structuring textual content. It first reviewed the criteria by which the text segmentation task is categorised. From a text representation perspective, text segmentation approaches were categorised into linear and hierarchical approaches. The review of linear segmentation approaches showed that they can only produce a single-level segmentation of a document. However, treating the structure of a document as a sequence of segments is at odds with most theories of textual content structure, where it is more usual to consider documents as trees. Thus, hierarchical text segmentation is seen as a method that can effectively represent a document as a tree-like hierarchy.
The chapter presented a focused review of hierarchical text segmentation approaches and how they process text. The review showed that these approaches are limited by the fact that they can only process the information that they can ‘see’. In other words, they are based on the lexical and/or syntactic representation of text, relying mainly upon the traditional bag-of-words representation to measure similarity (or dissimilarity) between text blocks. However, a representation based solely on the endogenous knowledge in the documents themselves does not reveal much about the meaning of the text.
Building on the review and analysis of the state-of-the-art approaches to text segmentation, the next chapters (Chapter 3 and Chapter 4) present two novel approaches to hierarchical text segmentation that utilise external knowledge resources in order to enrich text and infer more information about text constituents.
The chapter also presented an overview of adaptive systems, as an application for content adaptation, and reviewed their anatomy, their models and in particular their content model. Closed and open corpus content models were reviewed in order to better illustrate how adaptive systems process different types of content. The chapter then presented a review of the different approaches utilised by adaptive systems to discover content according to their users’ needs. Content reusability techniques were also reviewed along with their limitations. Furthermore, a review of current NLP techniques utilised by adaptive systems was undertaken. The aim of this review was to investigate how adaptive systems use NLP techniques in processing content resources and how these techniques contribute to the provision of adaptive experiences to the users of adaptive systems.
Building on the review and analysis of adaptive systems and content discoverability and reusability techniques, Chapter 5 presents a content-supply service (named CROCC) that facilitates the use of the new segmentation approach (Chapter 4) for content discoverability and reusability for adaptive systems. Additionally, Chapter 6 presents a user-based evaluation of the effectiveness of the proposed service in content discoverability and reusability.
3. OntoSeg: A Novel Approach to Text Segmentation using Ontological Similarity
3.1 Introduction
As outlined in Chapter 2, many adaptive systems have relied upon the original structure of content resources (HTML structure or paragraph structure) to produce content fragments, which are then used in content adaptation. Since this structure does not necessarily reflect the needs or preferences of individual users or applications, more recent systems have tried to employ text segmentation techniques in order to build a structure out of content resources based on the text itself, rather than the structure provided by the content author (section 2.7).
Text segmentation is the process of placing boundaries within text to create segments according to some task-dependent criterion. It is considered an essential step for various NLP tasks (Beck et al., 2014; Bokaei et al., 2016). Text segmentation aims to divide text into coherent segments which reflect the sub-topic structure of the text. As outlined in Chapter 2, current approaches to text segmentation are similar in that they all use traditional word-frequency metrics to measure the similarity between two regions of text, so that a document is segmented based on the lexical cohesion between its words (section 2.3.6). However, the relationship between segments may be semantic, rather than lexical or syntactic.
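The limitation of purely lexical similarity can be illustrated with a minimal sketch. It assumes a plain term-frequency bag-of-words representation and cosine similarity, which is one common word-frequency metric and not necessarily the one used by any particular approach reviewed in Chapter 2. The function names and the example sentences are hypothetical and introduced only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty block is treated as similar to nothing
    return dot / (norm_a * norm_b)


def lexical_similarity(block_a, block_b):
    """Word-frequency similarity between two text blocks."""
    return cosine_similarity(bag_of_words(block_a), bag_of_words(block_b))


# Two sentences about the same topic that share no surface words.
related = lexical_similarity(
    "The physician examined the patient.",
    "A doctor treated her illness.",
)
print(related)  # 0.0, because the vocabularies do not overlap
```

Although both sentences describe the same medical situation, the score is zero because no word is shared. This is the kind of semantic relationship that a lexical measure cannot capture.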
Various NLP tasks are now moving towards the semantic web and the use of ontologies. In Information Retrieval (IR), for example, systems that are based on keywords provide limited capabilities to capture the topical interests of users and the topics contained within content. In order to overcome these limitations, the idea of semantic search, based on the semantic meaning of text, has been the focus of a wide body of research and many ontology-based IR systems have been developed (Fernández et al., 2011; Meštrović and Calì, 2017; Selvalakshmi and Subramaniam, 2018). Hence, a need arises for segmenting and representing text based on the semantic (ontological) similarity between its constituents.
This chapter proposes OntoSeg (Bayomi et al., 2015), a novel approach to hierarchical text segmentation based on the semantic similarity between text blocks. In contrast to traditional text segmentation approaches that rely upon a bag-of-words representation of content, OntoSeg uses semantic similarity to explore conceptual relations between text segments and a Hierarchical Agglomerative Clustering (HAC) algorithm to represent the text as a tree-like hierarchy that is conceptually structured. The output is a hierarchical structure of the underlying content that is constructed based on how conceptually similar text blocks (one or more sentences) are to each other.
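The general idea of building a tree from text blocks with agglomerative clustering can be sketched as follows. This is an illustrative sketch under stated assumptions, not the OntoSeg implementation: it assumes that only adjacent blocks may be merged (so that every node covers a contiguous span of text), and it replaces ontological similarity with a toy stand-in, a hand-made word-to-concept table (`TOY_CONCEPTS`) combined with Jaccard overlap. All names here are hypothetical.

```python
import re

# Hypothetical stand-in for an ontology: maps words to a coarse concept.
TOY_CONCEPTS = {
    "physician": "medicine", "patient": "medicine",
    "doctor": "medicine", "medicine": "medicine",
    "striker": "football", "goal": "football",
    "goalkeeper": "football", "ball": "football",
}


def concepts(text):
    """Set of toy concepts mentioned in a text block."""
    words = re.findall(r"[a-z]+", text.lower())
    return {TOY_CONCEPTS[w] for w in words if w in TOY_CONCEPTS}


def concept_similarity(block_a, block_b):
    """Jaccard overlap of concept sets (placeholder for semantic similarity)."""
    a, b = concepts(block_a), concepts(block_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_hierarchy(blocks, similarity):
    """Repeatedly merge the most similar adjacent pair until one tree remains.

    Returns nested 2-tuples whose leaves are the original block strings.
    """
    if not blocks:
        raise ValueError("at least one text block is required")
    # Each node is (subtree, concatenated text of the span it covers).
    nodes = [(block, block) for block in blocks]
    while len(nodes) > 1:
        # Index of the adjacent pair with the highest similarity.
        best = max(
            range(len(nodes) - 1),
            key=lambda i: similarity(nodes[i][1], nodes[i + 1][1]),
        )
        left, right = nodes[best], nodes[best + 1]
        merged = ((left[0], right[0]), left[1] + " " + right[1])
        nodes[best:best + 2] = [merged]
    return nodes[0][0]


blocks = [
    "The physician examined the patient.",
    "A doctor prescribed medicine.",
    "The striker scored a goal.",
    "The goalkeeper missed the ball.",
]
tree = build_hierarchy(blocks, concept_similarity)
print(tree)
```

In this toy example the two medical sentences are merged first, then the two football sentences, and the two resulting segments are joined at the root, even though the sentences within each pair share almost no vocabulary.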
The aim of this chapter is to answer the first research question posed by this thesis (section 1.2):
To what extent can the semantic representation of unstructured textual content be exploited by novel text segmentation approaches to build a document structure?
and to contribute to its second objective (RO 2). The architecture of OntoSeg is presented, followed by a set of experiments carried out to evaluate the performance of OntoSeg using well-known evaluation metrics. The evaluation comprises several experiments, each of which evaluates OntoSeg from a different perspective. The experiments demonstrate that segmenting text based on semantic similarity is feasible and yields a low error rate. The performance of OntoSeg is also compared against a set of state-of-the-art approaches using a dataset widely used in the literature.