
II. MATERIALS AND METHODS

2.3. Research variables

The thesis is organized into five parts. The first part introduces the thesis, reviews related research, and describes feature construction. The second and third parts contain the thesis's main contributions, showing how external repositories, namely WordNet and Wikipedia, can help with summarization. The fourth part describes several related applications that use the developed methodologies and can either help with summarization or test the usability of the extracted features. The fifth part draws the conclusions of the thesis. The five parts are structured as follows:

 • Part I: Introduction, Background and Context

   • Chapter 1: Introduction

   • Chapter 2: Background and Related Work. This chapter provides background on related research and the state of the art in summarization and summarization evaluation.

   • Chapter 3: Features Generation and Selection. This chapter describes the need for feature generation and selection when using external ontologies for automatic summarization. It also gives an overview of the stages involved and of how semantic distance is derived from WordNet and Wikipedia.

 • Part II: Using WordNet for Summarization

   • Chapter 4: Summarization Aided with WordNet. This chapter describes several metrics for computing the similarity between sentences with the aid of WordNet. It also covers the implementation of the summarizer, the evaluations performed, and the improvements added through redundancy checking and diversity enhancement.

 • Part III: Using Wikipedia for Summarization

   • Chapter 5: Summarization Aided with Wikipedia. This chapter gives an overview of how features are extracted and built from Wikipedia for use in different applications. The extracted features are used in the summarizer, and the evaluation results for its performance are reported.

   • Chapter 6: Sentence Simplification for Automatic Summarization. This chapter extends the previous chapter by introducing SSM to further condense the summaries and allow more information to be included.

   • Chapter 8: Using SSM for Summarization. This chapter illustrates the effects of applying SSM to several sample sentences and its usability for summarization.

 • Part IV: Related Applications

   • Chapter 5. The features built from Wikipedia were used in the task of WSD for two reasons: to get a better view of the effectiveness of the extracted features, and to aid the task of automatic summarization. Ambiguous words are often encountered when analyzing documents for summarization, and a module that handles such terms effectively would improve the overall performance of the summarizer.

   • Chapter 7: Classification Aided with Wikipedia. To test the effectiveness of the features built from Wikipedia, text classification was the first application explored. I used the constructed features to build a classifier and evaluated its performance.

 • Part V: Conclusions and Future Work

   • Chapter 9: Conclusions. This chapter draws the conclusions of this thesis and outlines potential future work.

The relationships among the parts and chapters are shown in Figure 1.9. These links indicate the reading order required. For example, a reader interested only in how WordNet was used for summarization in this thesis needs to read Parts I, II and IV, in that order.

Figure 1.9: Relationships among thesis main parts and chapters

In this thesis, I chose WordNet and Wikipedia as the main ontologies because of the abundance of human concepts available in both. The human knowledge contained in both ontologies is made available to machines through the framework and algorithms I describe in this thesis. The proposed algorithms and methods offset the superior inference capability of humans by giving machines access to this abundant human knowledge all at once.

I use WordNet, a hierarchically structured repository that was created by linguistic experts and is rich in explicitly defined lexical relations. With WordNet, algorithms for computing the semantic similarity between terms are proposed and implemented. The relationship between terms, and between composites of terms, is quantified and weighted through new algorithms, which allows terms, phrases and sentences to be grouped by the semantic meaning they carry. These algorithms are especially useful when applied to automatic document summarization, as the evaluation results show. Several novel methods are also adapted to enhance diversity and reduce redundancy in the generated summaries.
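To make the idea of taxonomy-based similarity concrete, the following is a minimal sketch and not the algorithms proposed in this thesis. It assumes a tiny hand-made is-a hierarchy (the `PARENT` table is hypothetical and merely stands in for WordNet's hypernym structure) and uses the well-known Wu-Palmer measure. The sentence-level score, which averages each word's best match, is one simple illustrative choice among many.

```python
# Toy is-a taxonomy (hypothetical; stands in for WordNet's hypernym hierarchy).
# Each term maps to its single parent; the root maps to None.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(term):
    """Return the list of nodes from `term` up to the root, inclusive."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def depth(term):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(term))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node that is also an ancestor of a is the
    # deepest common subsumer, because every node has exactly one parent.
    lcs = next(node for node in path_to_root(b) if node in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def sentence_similarity(words_a, words_b):
    """Average, over the known words of A, of the best match found in B."""
    known_a = [w for w in words_a if w in PARENT]
    known_b = [w for w in words_b if w in PARENT]
    if not known_a or not known_b:
        return 0.0
    best = [max(wu_palmer(a, b) for b in known_b) for a in known_a]
    return sum(best) / len(best)
```

In this toy hierarchy, "dog" and "cat" share the parent "animal" and therefore score higher than "dog" and "car", which meet only at the root. The sentence-level score is asymmetric as written; averaging both directions would make it symmetric.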

I also use Wikipedia, the largest encyclopaedia to date. Because of its openness and structure, three problems had to be handled: extracting knowledge and features from Wikipedia, enriching the representation of text documents with the extracted features, and using them in automatic summarization. First, I show how the structure and content of Wikipedia can be used to build vectors representing human concepts. Second, I illustrate how these vectors can be mapped to text documents and how the semantic relatedness between text fragments is computed. Third, I describe a summarizer I built that uses the features extracted from Wikipedia, and I present its performance.

I apply the methodologies proposed in this thesis to automatic document summarization. To evaluate the effectiveness of the different variations of the implemented summarizer, I participated in the TAC 2008 and TAC 2010 summarization tasks with runs from the WordNet-based and the Wikipedia-based summarizers. In this thesis, I report the results of these evaluations and compare them against several baselines and against the results of the other TAC participants.
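As a rough illustration of mapping text to concept vectors and comparing fragments, the sketch below uses a three-article toy corpus. Everything in it is an assumption made for illustration: the `ARTICLES` data, the raw-count weighting, and the function names are hypothetical and do not reproduce the feature extraction described in this thesis.

```python
import math
from collections import Counter

# Hypothetical miniature "encyclopedia": concept title -> article text.
ARTICLES = {
    "Dog": "dog pet animal bark",
    "Cat": "cat pet animal purr",
    "Car": "car engine road wheel",
}


def build_word_vectors(articles):
    """Map each word to a sparse vector {concept: count of the word in it}."""
    vectors = {}
    for concept, text in articles.items():
        for word, count in Counter(text.split()).items():
            vectors.setdefault(word, {})[concept] = count
    return vectors


def text_vector(text, word_vectors):
    """Represent a text fragment as the sum of its words' concept vectors."""
    total = Counter()
    for word in text.lower().split():
        for concept, weight in word_vectors.get(word, {}).items():
            total[concept] += weight
    return total


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relatedness(text_a, text_b, word_vectors):
    """Semantic relatedness of two fragments in the shared concept space."""
    return cosine(text_vector(text_a, word_vectors),
                  text_vector(text_b, word_vectors))


WORD_VECTORS = build_word_vectors(ARTICLES)
```

With this toy data, `relatedness("dog pet", "cat pet", WORD_VECTORS)` is high because both fragments activate the "Dog" and "Cat" concepts, while "dog pet" and "car engine" share no concept and score zero. A realistic system would replace the raw counts with a weighting such as TF-IDF and prune weak concept associations.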

Chapter 2 Background and Related Work

This chapter presents an overview of text summarization and the stages it generally involves. It then surveys the main automatic summarization systems developed so far and outlines the major techniques in use. Afterwards, the major summarization evaluation methods are described. Finally, the challenges facing automatic text summarization are introduced, and those addressed by this thesis are highlighted.
