REGLAMENTO PARA LA CONSTITUCIÓN, ADECUACIÓN Y FUNCIONAMIENTO DE LAS COOPERATIVAS DE

Natural Language Generation (NLG) takes structured data in a knowledge base as input and produces natural language text, tailored to the presentational context and the target reader [Reiter and Dale, 2000]. NLG techniques build user context models and use them

to select appropriate presentation strategies. For example, delivering short summaries to the user’s WAP phone or a longer multimodal text if the user is using their PC desktop. In the context of Semantic Web or knowledge management, NLG can be applied to provide automated documentation of ontologies, knowledge base or metadata summari- sation. Unlike human-written texts, an automatic approach will constantly keep the text up-to-date which is vitally important in the Semantic Web context where knowledge is dynamic and is updated frequently. NLG also allows for generation in multiple languages without the need for human or automatic translation [Aguado et al., 1998]. The advan- tage of automatically producing textual documentation from ontologies, is that it is more readable than the corresponding formal notations. Hence, users, who are not knowledge engineers, can more easily understand and use ontologies. Secondly, a number of applications have now started using ontologies to encode and reason with internally, but this formal knowledge needs to be also expressed in natural language in order to produce reports, letters, etc. In other words, NLG can be used to present structured information in a user-friendly way. There are several advantages to using NLG rather than using fixed templates where the query results are filled in: NLG can use different sentence structures depending on the number of query results, e.g., conjunction vs itemised list. Depending on the user’s profile of their interests, NLG can include different types of information – affiliations, email addresses, publication lists, indications on collaborations (derived from project information). Given this variety of what information from the ontology can be in- cluded and how it can be presented, depending on its type and amount, writing templates will be unfeasible because there will be too many combinations to be covered.

This variation comes from the fact that it is expected that each user of the system will have a profile comprising of user supplied (or system derived) personal information (name, contact details, experience, projects worked on), plus information derived semi- automatically from the user’s interaction with other applications. Therefore, there will be a need to tailor the generated presentations according to user’s profile. NLG systems that are specifically targeted towards semantic web ontologies have started to emerge only

recently. Initial ones were based on templates, verbalising closely the ontology structure. More recent ones generate more fluent reports, oriented towards end-users, not ontology builders. In contrast to these applied NLG approaches, at the other end of the spectrum are sophisticated ones, which offer tailored output based on user models. There is a trade- off between applied approaches, which explore generalities in the domain ontology with low customisation overheads, and, on the other hand, more sophisticated, flexible and expressive systems, which, tend to be difficult to adapt by non-NLG experts. Experience shows that knowledge management and semantic web ontologies tend to evolve over time, so it is essential to have an easy-to-maintain NLG approach.

2.6.1 Generation from Semantic Web Ontologies

The ONTOGENERATION project [Aguado et al., 1998] explored the use of a linguistically oriented ontology (the Generalised Upper Model (GUM) )(Bateman et al. 1995) as an abstraction between generators and their domain knowledge base (chemistry in this case). The Generalised Upper Model (GUM) is a linguistic ontology with hundreds of concepts and relations, e.g., part-whole, spatio-temporal, cause-effect. The types of text that were generated are: concept definitions, classifications, examples, and comparisons of chemical elements. However, the size and complexity of GUM make customisation more difficult for non-experts. On the other hand, the benefit of using GUM is that it encodes all linguistically-motivated structures away from the domain ontology and can act as a mapping structure in multilingual generation systems.

2.6.2 Modern Shallow Generation

Wilcock [Wilcock, 2005] provides an overview of shallow XML-based Natural Language Generation from ontologies. His work provides descriptions of XML based pipelined ar- chitectures for NLG, involving: Text Planning, Micro Planning and Surface Realisation. The research is based on practical experience which applies NLG to a spoken dialog system. The author claims that XML as a generation tool fits into the pipeline model for

NLG. He argues that powerful methods for processing XML already exist. In Wilcock’s approach, XML transformations are performed on text plan trees in order to produce text specification trees using Extensible Stylesheet Language Transformations (XSLT). The author [Wilcock, 2005] emphasises that the role played by XML transformations across text plans as a consequence of using XML templates implies template based text planning and not just template based generation, whereby the text plan is passed through various stages of the NLG pipeline for processing using XSLT. The author maintains that the template based approach to text planning offers an efficient alternative to AI planning techniques traditionally used within Deep NLG, which can be expensive and complex, involving potentially exponential amounts of backtracking. However, Wilcock maintains XSLT alone is not suitable for the surface realisation of all languages, especially where complex morphological processing is required as is the case in Finnish. One can however use XSLT with extension functions for JAVA or reuse existing morphological processing resources. An interesting avenue proposed, is a hybrid combination of shallow XML- based methods in combination with traditional deeper NLG methods such as OpenCCG [Foster and White, 2004, White and Baldridge, 2003, White, 2004] (which in itself com- bines both unification based and statistical approaches). Such a hybrid approach is likely to fulfil needs for producing scalable, efficient and high quality methods of generating texts from ontologies.

2.6.3 NLG in CLEF - Clinical E-Science Framework

CLEF [Rector et al., 2003] focuses on integrating clinical care with biomedical research. It aims to develop methods for managing and using pseudonymised repositories of long- term patient histories, which contain and link information relating to genetic, genomatic and image information. It has applications for clinical research and can be used to support patient care. The project involves involves the use of HLT for IE and NLG from metadata CLEF aims to create an Electronic Patient Repository. In order for the data to be of use to a scientist or clinician, it must be easily accessible and user friendly.

Techniques from language generation are applied to produce reports from integrated information and metadata in the repository and to provide an electronic health care record for the medical specialist. In addition, the NLG contribution to CLEF aims to provide a Natural Language Window into the Knowledge base in order to aid knowledge engineers to build and maintain the knowledge base and furthermore to permit clinical experts to provide quality assurance across the the knowledge base while still being shielded by the underlying complex formalisms. The initial CLEF [Rector et al., 2003] work references WYSIWYM technology and natural language directed feedback [Power et al., 1998a] as the basis for NLG. The paper [Hallett et al., 2006] also focuses on the problems of pre- senting aggregated clinical data, in the form of comprehensible textual reports, from a vast and rich source of varying clinical knowledge. The authors discuss the system requirements gathered from clinicians and summarise the results as different report types, depending on user requirements. Input to the report generator is a user selected type of report or selected events in a timeline. Hence the input from the knowledge base ( implemented using DAML+OIL) for generation is a chronicle, which is a highly structured representation of a patient record stored in semantic nets (no other details are provided with respect to the knowledge representational language). A chronicle contains facts and relations and relations are categorised into three types according to their role: rhetorical relations, ontological relations and attribute relations. The system follows a classical NLG architecture: Content Selector, Content planner and Syntactic realiser. The Content planner is tightly coupled with the content selector as document structure is decided when a user selects an event or collection of events. Sentence Aggregation occurs primarily at the conceptual level at the content planning stage rather than the syntactic stage. The document planner uses a combination of schemas with a bottom-up approach. Rhetorical relations holding between a simple event are often realised as a single sentence. Complex individual events are realised as individual clauses or sentences, which are connected via the appropriate rhetorical relation. Depending on the type of report, the focus may move to different events, which consequently has an effect at the

syntactic realisation stage. Other solutions to aggregation are to restrict information to events that deviate from the norm i.e. abnormal test results.

2.6.4 Summary Generation from Ontologies

This section focuses on the summary generation problem, addressed in the ONTOSUM system [Bontcheva, 2005], developed in the context of the SEKT research project. Sum- mary generation starts off by being given a set of statements regarding instances in the ontology. (i.e., triples), in the form of RDF/OWL. Since there is some repetition, these triples are first pre-processed to remove already said facts. In addition to triples that have the same property and arguments, the system also removes triples involving inverse properties with the same arguments as those of an already verbalised one. The information about inverse properties is provided by the ontology (if supported by the representation formalism). The ONTOSUM system is implemented as a set of compo- nents in the GATE infrastructure [Bontcheva et al., 2004]. In particular, GATE makes use of its ontology support, which provides language-independent access to ontologies. The lexicalisations of concepts and properties in the ontology can be specified by the ontology engineer, be taken to be the same as concept names themselves, or added man- ually as part of the customisation process. ONTOSUM is parameterised at run time by specifying which properties are to be used for building the lexicon. A similar approach was first implemented in a domain- and ontology-specific way in the MIAKT system [Bontcheva and Wilks, 2004]. In ONTOSUM, it is extended towards portability and personalisation. Similar to the PEBA system, summary structuring is done using discourse/text schemas [Reiter, 1994], which are script-like structures which represent discourse patterns. They can be applied recursively to generate coherent multi-sentential text. The schemas are independent of the concrete domain and rely only on a core set of four basic properties – active-action, passive-action, attribute, and part-whole. When a new ontology is connected to ONTOSUM, properties can be defined as a sub-property of one of the four basic ones. Consequently, ONTOSUM will be able to verbalise them

without any modifications to the discourse schemas. However, if more specialised treat- ment of some properties is required, it is possible to enhance the schema library with new patterns that apply only to a specific property. ONTOSUM then performs semantic aggregation, i.e., it joins RDF statements with the same property name and domain as one conceptual graph. Without this aggregation step, there will be three separate sentences instead of one bullet list, resulting in a less coherent text. Finally, ONTOSUM verbalises the statements using the HYLITE+ surface realiser. The output is a textual summary.

In document Licencia de funcionamiento para CAC societarias en proceso de adecuación 1/2 (página 50-53)