Information Extraction (IE) is a specific NLP technique, defined as a text analysis task aimed at extracting targeted information from context (Cowie and Lehnert 1996; Gaizauskas and Wilks 1998; Moens 2006). It is a process in which a textual input is analysed to form a textual output suitable for further manipulation. Such manipulation may be aimed at automatic database population, machine translation, term indexing analysis, text summarisation and other applications.

Hobbs (1993) describes the generic information extraction system as “a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically”.

He recognises that each information extraction system is dictated by its own set of modules; however, he highlights a set of 10 individual modules that contribute to the general architecture of every information extraction system. These are:

Text zone analyser to divide input into segments,

Pre-processor to convert segments into sentences based on part-of-speech recognition,

Filter to discard irrelevant sentences generated in the previous process,

Pre-parser to detect small-scale structures such as noun groups, verb groups and modifiers,

Parser to produce a set of parse tree fragments, possibly complete, that describe the structure of a sentence,

Fragment combiner to complete the parsing of incomplete parse tree fragments into a logical form for the whole sentence,

Semantic Interpreter to generate meaning representation structures from a parse tree or a parse tree fragment,

Lexical Disambiguation to resolve any ambiguities of terms in the logical form,

Coreference Resolution to connect different descriptions of the same entity in different parts of text and

Template generator to generate the final representations of the extracted text.

Information Extraction and Information Retrieval operations are fundamentally different and as such cannot be seen as two competing methods employed to resolve the same problem. They have been described as two complementary methods whose combination promises the creation of powerful new tools in text processing (Allan et al. 2003; Cunningham 2005; Lewis and Jones 1996; Moens 2006; Smeaton 1997; Wilks 2009).
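The cascade of modules that Hobbs describes can be pictured as a chain of functions, each consuming the output of the previous stage. The following is a minimal sketch in Python, not Hobbs's implementation: only the first three of the ten modules are shown, and the function names, the blank-line segmentation, the punctuation-based sentence splitter and the keyword filter are all illustrative assumptions.

```python
import re
from functools import reduce


def text_zone_analyser(text):
    """Divide raw input into segments (assumed here: blank-line separated blocks)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]


def pre_processor(segments):
    """Convert segments into sentences with a naive split on final punctuation."""
    sentences = []
    for seg in segments:
        parts = re.split(r"(?<=[.!?])\s+", seg)
        sentences.extend(p.strip() for p in parts if p.strip())
    return sentences


def make_filter(keywords):
    """Build a filter stage that discards sentences containing no keyword."""
    def sentence_filter(sentences):
        return [s for s in sentences if any(k in s.lower() for k in keywords)]
    return sentence_filter


def run_cascade(text, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda data, stage: stage(data), stages, text)


# Hypothetical scenario: keep only sentences that mention an acquisition.
stages = [text_zone_analyser, pre_processor, make_filter({"acquired"})]
doc = "Acme acquired Widget Ltd. The weather was fine.\n\nShares rose sharply."
relevant = run_cascade(doc, stages)
print(relevant)
```

Each stage adds structure (segments, then sentences) and the filter deliberately loses information judged irrelevant, which mirrors the description quoted from Hobbs. The later modules (parsing, semantic interpretation, coreference, template generation) would be appended to `stages` in the same way.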

The two technologies have different and distinct historical backgrounds. Computational Linguistics and NLP formed the environment within which IE developed, whereas the growth of Information Retrieval was based on Information Theory, Probability Theory and Statistics. The average user would not find it hard to imagine the operation of an IR system, since these kinds of systems are widely used when searching the Web or a local library catalogue. IE systems, on the other hand, arguably could not be described as applications available to the average user, since the operation of such systems is usually closely bound to an application scenario or domain.

2.3.1 The Role of the Message Understanding Conference (MUC)

Over the ten years from 1987 to 1997, the Message Understanding Conference (MUC) made a significant contribution to the growth of the IE field, providing funding and a common ground for evaluation and for sharing knowledge and resources in Information Extraction. The conference adopted precision and recall measures, redefining them to suit the information extraction task and including measures for incorrect and partially correct results (Grishman and Sundheim 1996).
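To make the adapted precision and recall measures concrete, the sketch below scores a set of filled template slots. The text does not give the formulas, so the half credit for partially correct slots is an assumption based on the convention commonly described for MUC scoring; the function name and the example counts are hypothetical.

```python
def muc_style_scores(correct, partial, incorrect, missing, spurious):
    """Return (precision, recall) with half credit for partial matches (assumed)."""
    possible = correct + partial + incorrect + missing   # slots in the answer key
    actual = correct + partial + incorrect + spurious    # slots the system filled
    credit = correct + 0.5 * partial
    precision = credit / actual if actual else 0.0
    recall = credit / possible if possible else 0.0
    return precision, recall


# Made-up counts for illustration only.
precision, recall = muc_style_scores(
    correct=6, partial=2, incorrect=1, missing=1, spurious=3
)
print(round(precision, 3), round(recall, 3))
```

Missing slots lower recall only and spurious slots lower precision only, while incorrect slots count against both, which is why the two measures are reported separately.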

The fourth MUC marked the beginning of the conference's inclusion in the TIPSTER programme. TIPSTER, funded by DARPA and various other US Government agencies, focused on three underlying technologies: Document Detection, Information Extraction and Summarisation. Efforts involved the creation of a standard architecture for information retrieval and extraction systems, while improving the portability and re-usability of information extraction techniques. The programme ran through three development phases from 1991 to 1998 and achieved its purposes under the direction of Ralph Grishman of NYU and through the efforts of the TIPSTER Architecture Working Group.

The sixth MUC gave participants, for the first time, the option to perform one or more of four smaller evaluation tasks: Named Entity Recognition (NER), Coreference Identification, Template Element Filling and Scenario Template. The MUC programme concluded in 1999, an effort that comprised seven conferences and spanned a decade. The extracts and conclusions drawn from the MUCs have influenced the design and development of many information extraction systems since (Cunningham et al. 1996; Gaizauskas and Wilks 1998).

The Automatic Content Extraction (ACE) programme, the successor to MUC in the evaluation of information extraction technology, directed the evaluation effort towards a finer inference analysis of human language. The programme described four evaluation tasks: Recognition of Entities, Recognition of Relations, Event Extraction, and Extraction from Speech and OCR input (Doddington et al. 2004).

The programme's evaluation tasks challenge information extraction methods to operate at the semantic-entity level, beyond the word-term limit. Recognition of entities in text involves Coreference Resolution to identify all instances of an entity, an issue not addressed by NER. The Recognition of Relations and Event Extraction tasks target the detection and categorisation of events and of relations between entities. The latest event in the ACE series, held in April 2008, involved multilingual within-document and cross-document tasks focused on entity and relation recognition in Arabic and English.

2.3.2 Types of Information Extraction Systems

Information extraction systems fall into two distinct categories: Rule-Based (hand-crafted) and Machine Learning systems (Feldman et al. 2002). During the seven MUCs, rule-based information extraction systems were influential. Systems such as TACITUS, FASTUS, PIE and LaSIE-II successfully used hand-crafted rules to address a range of information extraction scenarios set by the conference committee (Lin 1995; Hobbs et al. 1993; Humphreys et al. 1998).

The portability of information extraction systems quickly gained attention. During MUC-4, the AutoSlog tool introduced a semi-automatic technique for defining information extraction patterns as a way of improving a system's portability to new domains and scenarios. An updated and fully automated version of AutoSlog, named CRYSTAL, participated in MUC-5, introducing machine learning information extraction systems to the conference. Although CRYSTAL did not match the performance of hand-crafted rules, it delivered promising results that reached 90% of the performance of rule-based systems (Soderland et al. 1995; Soderland et al. 1997).

2.3.2.1 Rule-based Information Extraction Systems

Rule-based systems consist of cascaded finite-state transducers that process input in successive stages. Driven by a pattern-matching mechanism, such systems are aimed at building abstractions that correspond to specific information extraction scenarios. Hand-crafted rules make use of domain knowledge and domain-independent linguistic syntax in order to negotiate semantics and pragmatics in context and to extract information for a defined problem. Rule-based systems are reported to achieve high levels of precision, between 80% and 90%, when identifying general-purpose entities such as Person, Location and Organisation in financial news documents (Feldman et al. 2002; Lin 1995; Hobbs et al. 1993).
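As an illustration of pattern matching with hand-crafted rules, the sketch below pairs each entity type with one regular expression. It is a toy under stated assumptions and not a reconstruction of TACITUS, FASTUS or any other cited system: the three patterns, the title and suffix lists, and the example sentence are invented for this example.

```python
import re

# Hand-crafted rules: an entity type paired with one pattern (illustrative only).
RULES = [
    ("Person",
     re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.? ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")),
    ("Organisation",
     re.compile(r"\b([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)* (?:Ltd|Inc|Corp|Bank))\b")),
    # Crude assumption: a capitalised word after a locative preposition is a place.
    ("Location",
     re.compile(r"\b(?:in|at|from) ([A-Z][a-z]+)\b")),
]


def extract_entities(text):
    """Apply every rule in turn and collect (entity type, matched text) pairs."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((label, match.group(1)))
    return found


text = "Dr. Jane Smith of Northern Bank met investors in London."
entities = extract_entities(text)
print(entities)
```

The rules encode domain knowledge directly (titles, company suffixes), so moving to a new domain means writing new patterns by hand.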

The definition of hand-crafted rules is a labour-intensive task that requires domain knowledge and a good understanding of the information extraction problem. For this reason, rule-based systems have been criticised as costly and inflexible, with limited portability and adaptability to new information extraction scenarios. However, developers of rule-based systems claim that, depending on the information extraction task, the linguistic complexity can be bypassed and a small number of rules can be used to extract large sets of variant information.

2.3.2.2 Machine Learning Information Extraction Systems

Machine learning has been envisaged as the element that will break through the domain dependencies of rule-based information extraction systems (Moens 2006; Ciravegna and Lavelli 2004). Machine Learning is a discipline that grew out of Artificial Intelligence research and is concerned with the design of algorithms that enable computers to “adapt” to external conditions. The term “learning” obviously does not have the precise meaning that it has in the context of human intelligence. Learning in the artificial intelligence context describes the condition where a computer programme is able to alter its “behaviour”, that is, to alter its structure, data or algorithmic behaviour in response to an input or external information (Nilsson 2005).

Machine learning strategies can support supervised and unsupervised learning activities. Supervised learning is based on the provision of a training data set, from which the machine learning process generalises extraction rules. The general idea of using supervised machine learning in Information Extraction systems is to have human experts annotate a desired set of information fragments in a small corpus of training documents. The training documents are then used in a machine learning process to generalise extraction rules that can be applied at scale to a large corpus. It is believed to be easier to annotate a small corpus of training documents than to create hand-crafted extraction rules, since the latter requires programming expertise and domain knowledge (Moens 2006).
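The supervised workflow (annotate a small corpus, generalise rules, apply them to new text) can be sketched as follows. This is a deliberately simple illustration and not the method of any cited system: the "previous word predicts label" rule form, the `min_count` threshold, the `PER`/`O` labels and the three training sentences are all assumptions.

```python
from collections import Counter, defaultdict


def learn_context_rules(annotated_sentences, min_count=2):
    """Induce 'previous word -> label' rules from a small annotated corpus."""
    counts = defaultdict(Counter)
    for sentence in annotated_sentences:
        previous = "<s>"  # sentence-start marker
        for token, label in sentence:
            counts[previous.lower()][label] += 1
            previous = token
    rules = {}
    for context, label_counts in counts.items():
        label, count = label_counts.most_common(1)[0]
        # Keep only contexts that predict an entity label often enough.
        if label != "O" and count >= min_count:
            rules[context] = label
    return rules


def apply_rules(tokens, rules):
    """Label unseen text with the induced rules; 'O' means no entity."""
    labelled = []
    previous = "<s>"
    for token in tokens:
        labelled.append((token, rules.get(previous.lower(), "O")))
        previous = token
    return labelled


# Hypothetical expert annotations: PER marks a person name, O anything else.
training = [
    [("Dr", "O"), ("Smith", "PER"), ("arrived", "O")],
    [("Dr", "O"), ("Jones", "PER"), ("spoke", "O")],
    [("The", "O"), ("bank", "O"), ("closed", "O")],
]
rules = learn_context_rules(training)
output = apply_rules(["Yesterday", "Dr", "Patel", "resigned"], rules)
print(rules, output)
```

The annotators never wrote a rule: the pattern "a word following Dr is a person" was generalised from two labelled examples and then applied to a name absent from the training set.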

During unsupervised learning, there is no human intervention and the training data set carries no desired output labels. Instead, a probabilistic clustering technique is employed to partition the training data set and to describe the output result, which can then be generalised to a larger collection (Nilsson 2005). Unsupervised information extraction is very challenging, and systems have not been proven to perform at an operational level (Uren et al. 2006; Wilks and Brewster 2009).
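The idea of partitioning unlabelled data can be illustrated with a small clustering sketch. Note the substitution: the text refers to probabilistic clustering, whereas the code below uses a simpler greedy grouping by Jaccard similarity of context words, chosen only because it fits in a few lines. The terms, their context sets and the threshold are invented for the example.

```python
def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_context(contexts, threshold=0.5):
    """Greedily place each term in the first cluster with a similar seed context."""
    clusters = []  # list of (seed context, member terms)
    for term, context in contexts.items():
        for seed, members in clusters:
            if jaccard(seed, context) >= threshold:
                members.append(term)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append((set(context), [term]))
    return [members for _, members in clusters]


# Hypothetical terms with the words observed around them; no labels are given.
contexts = {
    "London": {"in", "visited", "city"},
    "Paris": {"in", "visited", "capital"},
    "Smith": {"mr", "said"},
    "Jones": {"mr", "said", "dr"},
}
groups = cluster_by_context(contexts)
print(groups)
```

The output groups terms that behave alike, but nothing names the groups as places or people; interpreting the partitions is left to whoever uses them, which is part of what makes unsupervised extraction hard to operationalise.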

Supervised information extraction systems are more widely adopted and have delivered successful results at an operational level. However, criticisms of supervised learning methods highlight the dependence of the information extraction results on the quality of the training set, the impact of the type of learned data on the maintainability of the information extraction system, and the difficulty of predicting which learning algorithm will produce the best result (Wilks and Brewster 2009).
