Statistical Decision-Making Process
1. Representations of the Utility Function with Applications
Text summarisation has been a domain of research for many years. An early example (c. 1958) can be found in [70] in the context of literature abstracts. Many text summarisation techniques have been reported in the literature; these have been categorised in many ways, according to: (i) the field of study, (ii) factors inherent to the text or (iii) whether they adopt a statistical or a linguistic approach. Three dominant categorisations of text summarisation techniques are those proposed by: (i) Jones et al. [64], (ii) Alonso et al. [4] and (iii) Afantenos et al. [2]. Jones et al. [64] proposed a categorisation dependent on: (i) the input that is received, (ii) the purpose of the summarisation and (iii) the output desired. Afantenos et al. [2] and Alonso et al. [4] based the categorisations presented in their respective surveys on the one previously produced by Jones et al. [64]. The categorisation of Alonso et al. [4] is founded on the “traditional phases” of text summarisation: (i) analysis of the input text (input), (ii) transformation of the input text into the form of a summary (purpose) and (iii) synthesis of the output to produce the desired summary (output). Many of the text summarisation approaches presented in the literature focus on one of these phases. Afantenos et al. [2] categorise techniques according to a number of factors, which are arranged into three groups (following the categorisation by Jones et al. [64]): (i) input, (ii) purpose and (iii) output. The details of each of these groups are shown in Figures 2.4, 2.5 and 2.6.
Input factors: single-document vs. multi-document; text vs. multimedia; language (monolingual, multilingual, cross-lingual)
Figure 2.4: Input factors to be considered for text summarisation
With respect to Figure 2.4, the categorisation by single-document or multi-document is self-explanatory: the input to be summarised can be one document or many documents, respectively. This categorisation can also be found in [71], where it is described briefly, and in [22], where it is used to define the structure of a survey. The text vs. multimedia categorisation concerns the format in which the input and the output are presented: either text or some multimedia format (e.g. image, sound, video). The next categorisation concerns the language of the input to be summarised: in monolingual summarisation a single language is used for both the input and the output; in multilingual summarisation the input and output languages are again the same, but more than one language may be used; in cross-lingual summarisation the input and the output use different languages.
Purpose factors: indicative vs. informative; generic vs. user-oriented; general purpose vs. domain specific
Figure 2.5: Purpose factors to be considered for text summarisation
In the case of the purpose factors (Figure 2.5), a summary can be classified as indicative if it does not replace the original document but indicates its relevant contents, or as informative if it replaces the source document while still covering the information it contains. Generic text summaries are those that give a general view of the text and, as the name suggests, are not intended to fulfil the requirements of a specific type of user. User-oriented summaries, by contrast, are produced according to the interests of a specific audience. Gong and Liu [36] and Mani [71] also use this categorisation in their approaches. Finally, while a general purpose summarisation system can be used in many domains, a domain specific one can only be applied to a domain of interest.
Output factors: completeness; accuracy; coherency
Figure 2.6: Output factors to be considered for text summarisation
The categorisations within the output group are shown in Figure 2.6 and take into account the quality of the output; in other words, they provide a mechanism for measuring whether a summary is complete, accurate and coherent (among other things). Mani [71] also addressed the categorisation of text summarisation techniques according to output factors, distinguishing between the concept of a summary and that of an abstract. In that work a summary is formed by extracting the most important sentences and putting them together, typically to the detriment of readability. An abstract, on the other hand, is formed by processing the extracted sentences, putting them together and using an algorithm to provide a more human-like interpretation.
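The extraction-based notion of a summary described above can be illustrated with a minimal sketch. This is not the method of Mani [71], nor the technique proposed in this thesis: it assumes, purely for illustration, that the importance of a sentence is the mean frequency of its words in the whole text. The function names `split_sentences` and `extractive_summary` are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order.

    Assumption (illustrative only): a sentence's importance is the mean
    frequency, over the whole text, of the words it contains.
    """
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by score, keep the top n, then restore text order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because the selected sentences are simply concatenated, the result can read abruptly, which is the loss of readability noted above; an abstract in the sense of [71] would require a further processing step that this sketch does not attempt.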
In the context of the research described in this thesis the categorisation of Afantenos et al. [2] has been adopted. Thus the summarisation techniques described in this thesis can be categorised as follows:
• Input factors
– Multi-document: because the mining described in this thesis was applied to many questionnaires.
– Text: because the text summarisation was applied to the free text part of the questionnaires.
– Monolingual: because the questionnaires were written in English only.
• Purpose factors
– Indicative: because the generated summary is an indication of the relevant contents in the free text.
– Generic: because the text summarisation of the free text did not have the purpose of fulfilling any specific user requirements.
– General purpose: because the proposed summarisation techniques were intended to be used with respect to questionnaires directed at different fields.
• Output factors
– Complete: because the generated summaries include the most relevant ideas from the free text.
– Accurate: because the generated summaries are accurate with respect to the content of the free text.
– Coherent: because the generated summaries are presented to the user in a readable and understandable way.
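The categorisation listed above can also be written down as a small data structure, which makes the three groups of factors explicit. The following is only a sketch: the class, field and constant names (`SummarisationProfile`, `THESIS_PROFILE` and so on) are hypothetical and do not come from Afantenos et al. [2] or from any existing library.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    SINGLE_DOCUMENT = "single-document"
    MULTI_DOCUMENT = "multi-document"


class Medium(Enum):
    TEXT = "text"
    MULTIMEDIA = "multimedia"


class Language(Enum):
    MONOLINGUAL = "monolingual"
    MULTILINGUAL = "multilingual"
    CROSS_LINGUAL = "cross-lingual"


class Content(Enum):
    INDICATIVE = "indicative"
    INFORMATIVE = "informative"


class Audience(Enum):
    GENERIC = "generic"
    USER_ORIENTED = "user-oriented"


class Scope(Enum):
    GENERAL_PURPOSE = "general purpose"
    DOMAIN_SPECIFIC = "domain specific"


@dataclass(frozen=True)
class SummarisationProfile:
    # Input factors
    source: Source
    medium: Medium
    language: Language
    # Purpose factors
    content: Content
    audience: Audience
    scope: Scope
    # Output factors: quality criteria the summaries are meant to satisfy
    output_criteria: tuple = ("complete", "accurate", "coherent")

    def as_groups(self):
        """Group the factor values under input, purpose and output."""
        return {
            "input": [self.source.value, self.medium.value, self.language.value],
            "purpose": [self.content.value, self.audience.value, self.scope.value],
            "output": list(self.output_criteria),
        }


# The profile stated in the list above, encoded with the hypothetical types.
THESIS_PROFILE = SummarisationProfile(
    source=Source.MULTI_DOCUMENT,
    medium=Medium.TEXT,
    language=Language.MONOLINGUAL,
    content=Content.INDICATIVE,
    audience=Audience.GENERIC,
    scope=Scope.GENERAL_PURPOSE,
)
```

Calling `THESIS_PROFILE.as_groups()` returns the same three groups as the list, so a different summarisation system can be compared against this one factor by factor.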