The Solymosi-de Zeeuw Theorem - The Erdős Problem


Performance evaluation of a parser can be divided into three kinds, depending on the purpose each serves.1 The first, intrinsic evaluation, measures the performance of a parsing system in the context of the framework in which it is developed. This kind of evaluation applies to both grammar-based and statistical parsing systems, since it helps system developers and maintainers measure the performance of successive generations of the system. For grammar-based systems, intrinsic evaluation helps identify shortcomings and weaknesses in the grammar and provides a direction for productive development of the grammar. For statistical parsers, intrinsic evaluation provides a measure of the performance of the underlying statistical model and helps identify improvements to the model. Since the evaluation is performed in the context of the framework in which the parsing system is developed, the metrics used for intrinsic evaluation can be made sensitive to the features and output representations of the parsing system.

A second method of evaluating a parsing system is extrinsic evaluation. Extrinsic evaluation is meaningful when a parsing system is embedded in an application; it refers to the evaluation of the parsing system's contribution to the overall performance of that application. Extrinsic evaluation can be used as an indirect method of comparing parsing systems even if they produce different output representations, as long as the output can be converted into a form usable by the application in which the parser is embedded.

1 These evaluation methodologies are applicable to general purpose speech and natural language

A third method of evaluation is comparative evaluation. The objective here is to compare directly the performance of different parsing systems that use different grammar formalisms and different statistical models. Comparative evaluation helps identify the strengths and weaknesses of different systems and suggests possibilities for combining different approaches. However, this evaluation scheme requires a metric that is insensitive to the representational differences in the output produced by different parsers. For this purpose, the metric may have to be sufficiently abstracted away from individual representations to reach a level on which the different representations produced by the parsers agree. As a result of this abstraction, however, the strengths of the representations of certain parsers might be lost completely.

In this chapter we focus on the comparative evaluation scheme. In Section 7.1, we discuss the methods of parser and grammar evaluations that have been suggested and used in the literature. We indicate the limitations of these metrics for the purpose of comparative evaluations in Section 7.2. In Section 7.3, we present our proposal, a Relation-based Model for Parser Evaluation, as an evaluation framework that overcomes the limitations of previous evaluation schemes. In Section 7.4, we present the results of evaluating the Supertagger and Lightweight Dependency Analyzer, using this scheme.

7.1 Methods for Evaluating a Parsing System

A parsing system can be evaluated along different dimensions, ranging from grammatical coverage, to the average number of parses produced, to the average number of correct constituents in a parse produced by the system. Owing to this multi-dimensionality, a variety of metrics have been proposed for evaluating a parsing system. A comprehensive survey of different parsing metrics is provided in [Briscoe et al., 1996]. These metrics can be divided into test suite-based and corpus-based methods. The corpus-based methods are further divided into annotated and unannotated methods, depending on whether or not the corpus is annotated with linguistic information. In the sections that follow, we review each metric and discuss its strengths and weaknesses.

7.1.1 Test suite-based Evaluation

In this traditional method of parsing system evaluation, a list of sentences for each syntactic construction that is covered and not covered by the grammar is maintained as a database [Alshawi et al., 1992; Grover et al., 1993; XTAG-Group, 1995; Oepen et al., forthcoming]. The test suite is used to track improvements and verify consistency between successive generations of the system that result from adding an analysis of a construction to the grammar or altering the analysis of a previously analyzed construction. Although this method of evaluation has mostly been used for hand-crafted grammars, it could also be used to track improvements in the performance of statistical parsers as the underlying statistical model changes. The advantage of this method is that it is relatively easy and straightforward, and the negative information it yields provides a direction for improving the system. The disadvantage is that it does not quantify how the performance of a parsing system would scale up when parsing unrestricted text.
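The tracking of successive generations described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the test-suite layout, the `accepts` callables standing in for two generations of a system, and the names `run_suite` and `regressions` are all hypothetical and are not taken from any of the cited systems.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical test-suite layout: (sentence, expected_to_parse).
# Items marked False are negative examples the grammar should reject.
TestSuite = List[Tuple[str, bool]]


def run_suite(suite: TestSuite, accepts: Callable[[str], bool]) -> Dict[str, bool]:
    """Map each sentence to True if the system's behaviour matches the expectation."""
    return {sentence: accepts(sentence) == expected for sentence, expected in suite}


def regressions(old: Dict[str, bool], new: Dict[str, bool]) -> List[str]:
    """Sentences handled correctly by the old generation but not by the new one."""
    return [s for s, ok in old.items() if ok and not new.get(s, False)]


suite: TestSuite = [
    ("Dogs bark.", True),
    ("The dog barks loudly.", True),
    ("Dog the barks.", False),  # negative example
]

# Toy stand-ins for two generations of a parsing system (assumptions).
generation_1 = lambda s: len(s.split()) <= 2  # accepts only very short sentences
generation_2 = lambda s: True                 # accepts everything

old_results = run_suite(suite, generation_1)
new_results = run_suite(suite, generation_2)
print(regressions(old_results, new_results))
```

In this toy run the second generation gains the longer sentence but starts accepting the negative example, which is the kind of inconsistency between generations that a test suite is meant to expose.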

7.1.2 Unannotated Corpus-based Evaluation

The following methods use unrestricted text as the corpus for evaluating parsing systems. The sentences in the corpus, however, are not annotated with any linguistic information.

Coverage

Coverage is the percentage of sentences in the corpus that a parsing system can assign one or more parses to [Briscoe and Carroll, 1995; Doran et al., 1994]. It is a weak measure, since it does not guarantee that the analysis found is the correct one; the output must be checked manually to determine this.
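A minimal sketch of the coverage computation, assuming a hypothetical `parse` callable that returns the list of analyses a system assigns to a sentence (the toy parser below is purely illustrative and does not correspond to any system cited here):

```python
from typing import Callable, Iterable, List


def coverage(sentences: Iterable[str], parse: Callable[[str], List[object]]) -> float:
    """Return the percentage of sentences that receive at least one parse."""
    total = 0
    parsed = 0
    for sentence in sentences:
        total += 1
        if parse(sentence):  # non-empty list: one or more parses were assigned
            parsed += 1
    if total == 0:
        raise ValueError("coverage is undefined for an empty corpus")
    return 100.0 * parsed / total


# Hypothetical toy parser: "parses" only sentences that end in a period.
def toy_parse(sentence: str) -> List[object]:
    return ["parse"] if sentence.endswith(".") else []


corpus = ["The dog barks.", "Colorless green ideas", "She sleeps.", "Running fast"]
print(coverage(corpus, toy_parse))  # 50.0
```

The function counts a sentence as covered as soon as any analysis is returned, which is exactly why the measure is weak: nothing in the computation checks whether a returned analysis is correct.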
