INTRODUCTION TO THE PROBLEM - ANUARIO DE LA FACULTAD DE DERECHO. UNIVERSIDAD DE EXTREMADURA



4.2.1 Corpus Content

The BLLIP 1987-89 Wall Street Journal (WSJ) Corpus [90] is a pre-parsed newswire corpus containing a complete Penn Treebank II-style [101, 119] parse of the three-year Wall Street Journal archive (provided by Dow Jones, Inc.) from the ACL/DCI (Association for Computational Linguistics/Data Collection Initiative) Corpus of American English. The corpus contains about thirty million words of text. Its parsing and part-of-speech (POS) annotation were produced with statistical methods developed by Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson [90] of the Brown Laboratory for Linguistic Information Processing. All processing for this corpus was carried out by machine and comprised basic parsing, grammatical/functional tag assignment, full noun-phrase co-reference identification, pronoun reference identification, and empty node insertion.

The BLLIP 1987-89 WSJ Corpus both overlaps and supplements the one-million-word 1989 Wall Street Journal section of the Penn Treebank Corpus. To save parsing time, sentences longer than 70 words (counting punctuation marks) were excluded from the corpus. The developers report a parser error in about one news story in a thousand. When such an error occurs, the story is cut short and only a partial parse is recorded.
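The length limit described above can be illustrated with a minimal sketch. It assumes each sentence is available as a Penn Treebank-style bracketed string, counts every leaf as one token (so punctuation is included), and uses the hypothetical helper names `leaves` and `within_limit`. The corpus developers applied their limit before parsing; this sketch only shows how the same count could be checked on parsed output, where empty nodes would also be counted as leaves.

```python
import re

MAX_TOKENS = 70  # limit stated in the text; punctuation marks count as tokens

# A leaf in a Penn Treebank-style bracketing has the form "(TAG word)".
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves(parse):
    """Return the (tag, word) pairs of a bracketed parse, in sentence order."""
    return LEAF.findall(parse)


def within_limit(parse, max_tokens=MAX_TOKENS):
    """True if the sentence has at most max_tokens leaves."""
    return len(leaves(parse)) <= max_tokens


# Hypothetical example sentence, not taken from the corpus.
example = "(S1 (S (NP (DT The) (NN market)) (VP (VBD fell)) (. .)))"
print(leaves(example))
print(within_limit(example))
```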

4.2.2 Tagging Convention

4.2.2.1 The Penn Treebank II Convention

The Penn Treebank II bracketing convention was implemented during the second phase of the Penn Treebank Project at the University of Pennsylvania, U.S.A. The syntactic annotation scheme used is designed to allow the extraction of simple predicate/argument structure.

In addition to the standard syntactic constituent tags (e.g. NP, PP, VP), functional tags are also assigned to constituents under this scheme. These functional tags denote text categories (list markers, titles, headlines and datelines), grammatical functions (surface subjects, logical subjects in passives, true clefts, non-NPs that function as NPs, clausal and NP adverbials, non-VP predicates, topicalized and fronted constituents, closely related adjuncts) and semantic roles (vocatives, direction and trajectory, location, manner, purpose and reason, temporal phrases). For this work, as with other reported work on the WSJ Corpus [5, 6, 7, 98, 99, 100], only the standard syntactic constituent tags are used; this is all that is needed for skeletal syntactic analysis.
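Reducing a parse to its standard constituent tags can be sketched as follows. The sketch assumes that functional tags and indices are appended to a label after a "-" or "=" separator (as in NP-SBJ or PP-LOC-CLR), and that labels wrapped in hyphens, such as -NONE- and -LRB-, must be left untouched. The function names `base_label` and `strip_functional_tags` are hypothetical, and this is not necessarily the procedure used in the work described here.

```python
import re


def base_label(label):
    """Return the bare constituent label, e.g. 'NP-SBJ' -> 'NP'."""
    # Labels such as -NONE-, -LRB- and -RRB- start and end with a hyphen
    # and carry no functional tag, so they are returned unchanged.
    if label.startswith("-") and label.endswith("-"):
        return label
    return re.split(r"[-=]", label, maxsplit=1)[0]


def strip_functional_tags(parse):
    """Apply base_label to every label in a bracketed parse string."""
    # A label is the run of characters that directly follows "(".
    return re.sub(r"\(([^\s()]+)",
                  lambda m: "(" + base_label(m.group(1)),
                  parse)


# Hypothetical example sentence, not taken from the corpus.
tagged = "(S (NP-SBJ (PRP He)) (VP (VBD left) (NP-TMP (NN yesterday))))"
print(strip_functional_tags(tagged))
```

Words are never altered by this substitution, because only the text immediately after an opening bracket is treated as a label.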

4.2.2.2 Exceptions to the Penn Treebank II Convention

All parsing in the BLLIP 1987-89 WSJ Corpus follows the Penn Treebank II conventions, with four exceptions. The first exception is that certain auxiliary verbs (e.g. “have”, “been”) are deterministically labelled AUX or AUXG (e.g. “having”).

The next exception to the Penn Treebank II scheme in this corpus is that root nodes are given the new non-terminal label S1 (as opposed to the empty string in the Penn Treebank).

Another exception is that numbers attached to non-terminals indicating co-reference are preceded by “#” (as opposed to “-” in the Penn Treebank).

The fourth exception is that two new grammatical function tags have been added: PLE (pleonastic) and DEI (deictic), each marking a type of non-coreferential pronoun.
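Two of these exceptions, the S1 root label and the “#” co-reference marker, are purely notational and can be undone mechanically. The sketch below is illustrative only: it assumes a co-reference index is written directly after the label (e.g. NP#1), which is a guess at the surface format, and the names `bllip_to_ptb_label` and `bllip_to_ptb` are hypothetical. AUX and AUXG are left as they are, since mapping them back to verb tags would require inspecting the word form.

```python
import re


def bllip_to_ptb_label(label):
    """Rewrite the co-reference marker '#' as '-' (e.g. 'NP#1' -> 'NP-1')."""
    return re.sub(r"#(\d+)", r"-\1", label)


def bllip_to_ptb(parse):
    """Convert the S1 root and '#' indices of a bracketed parse string."""
    # The root label S1 becomes the Penn Treebank's empty root label.
    parse = re.sub(r"^\(S1\b", "(", parse.strip())
    # Every remaining label has its co-reference marker rewritten.
    return re.sub(r"\(([^\s()]+)",
                  lambda m: "(" + bllip_to_ptb_label(m.group(1)),
                  parse)


# Hypothetical example sentence, not taken from the corpus.
bllip = "(S1 (S (NP#1 (PRP He)) (VP (AUX has) (VP (VBN left)))))"
print(bllip_to_ptb(bllip))
```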

In setting up this corpus, sentences longer than 70 words (counting punctuation marks) were ignored.

4.2.3 The BLLIP 1987-89 WSJ Corpus Vs The Lancaster Parsed Corpus

Like that of the Lancaster Parsed Corpus (LPC), the syntactic part of the Penn Treebank II tag-set (used to tag the Wall Street Journal, WSJ) is based on that of the Brown Corpus. However, the annotation scheme used for the WSJ Corpus is an extended and somewhat modified form of that used for the LPC [119].

Whereas word tags in the LPC are quite detailed and specific to particular lexical items, the Penn Treebank tag-set is designed to eliminate lexical redundancy. For example, the LPC distinguishes five forms of main verbs (VB: base form of lexical verb, i.e. uninflected present tense or infinitive; VBD: past tense of lexical verb; VBG: present participle or gerund of lexical verb; VBN: past participle of lexical verb; VBZ: third person singular of verb). The LPC uses the same paradigm for the word have, whether it is used as a main or an auxiliary verb (HV, HVD, HVG, HVN, HVZ). It also provides tags for three forms of do (DO: base form; DOD: past tense; DOZ: third person singular present) and eight forms of be (BE: be; BED: were; BEDZ: was; BEG: being; BEM: am; BEN: been; BER: are, 're; BEZ: is, 's). In contrast, because the distinctions between the forms of VB on the one hand and the forms of HV, DO and BE on the other are lexically recoverable, they are eliminated in the tag-set for the WSJ; only the five forms of VB are used, as shown in table 4.1 below.

Table 4.1: Elimination of lexically recoverable distinctions in verbs

Word        Word tag
Drink       VB
Drinks      VBZ
Drank       VBD
Drinking    VBG
Drunk       VBN
Be          VB
Is          VBZ
Was         VBD
Being       VBG
Been        VBN
Do          VB
Does        VBZ
Did         VBD
Doing       VBG
Done        VBN
Have        VB
Has         VBZ
Had         VBD
Having      VBG
Had         VBN
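The collapse illustrated in table 4.1 can be written as a lookup from the LPC verb tags listed above to the five WSJ verb tags. This is a minimal sketch: the dictionary `LPC_TO_WSJ_VERB` and the function `collapse_verb_tag` are hypothetical names, the mapping is inferred from the tag descriptions in the text rather than taken from either corpus's documentation, and BEM (am) and BER (are) are deliberately left out because table 4.1 gives no row for them.

```python
# LPC verb tags mapped to the five verb tags used for the WSJ Corpus.
LPC_TO_WSJ_VERB = {
    # Main verbs already follow the five-way paradigm.
    "VB": "VB", "VBD": "VBD", "VBG": "VBG", "VBN": "VBN", "VBZ": "VBZ",
    # have: HV, HVD, HVG, HVN, HVZ
    "HV": "VB", "HVD": "VBD", "HVG": "VBG", "HVN": "VBN", "HVZ": "VBZ",
    # do: base form, past tense, third person singular present
    "DO": "VB", "DOD": "VBD", "DOZ": "VBZ",
    # be: be, were, was, being, been, is
    "BE": "VB", "BED": "VBD", "BEDZ": "VBD",
    "BEG": "VBG", "BEN": "VBN", "BEZ": "VBZ",
    # BEM (am) and BER (are) are omitted: not covered by table 4.1.
}


def collapse_verb_tag(lpc_tag):
    """Return the WSJ verb tag for an LPC verb tag, or None if not covered."""
    return LPC_TO_WSJ_VERB.get(lpc_tag)


print(collapse_verb_tag("BEDZ"))
print(sorted(set(LPC_TO_WSJ_VERB.values())))
```

The lexical item itself (be, do, have) is not stored in the collapsed tag; it remains recoverable from the word, which is the sense in which the LPC distinctions are redundant.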

Another example of the elimination of lexical redundancy in the WSJ Corpus, as opposed to the LPC, is the tagging of words that precede articles in noun phrases. In the LPC, the tags ABL, ABN and ABX denote pre-qualifiers (quite, rather, such), pre-quantifiers (all, half, many, nary) and the word “both”, respectively. In the WSJ Corpus, however, a single tag, PDT, denotes all these words (all categorised as pre-determiners).

Null tags are used in the WSJ Corpus to mark WH-movement and topicalization, to indicate which lexical NP is to be interpreted as the null subject of an infinitive complement clause, and to aid the interpretation of other grammatical structures in which constituents do not appear in their default positions. Null tags are not used in the LPC. In addition, the tags AUX and AUXG are used for auxiliary verbs in the BLLIP WSJ Corpus, whereas auxiliary verbs are not marked as such in the LPC.

Compared to the 184 tags used in the LPC (143 for words and punctuation; 41 for constituents), the BLLIP WSJ Corpus uses 84 (57 for words and punctuation; 27 for constituents, excluding the functional tags).

The Penn Treebank II tags therefore represent coarser syntactic categories than the LPC tags.

The BLLIP WSJ Corpus contains longer sentences than the LPC. Only sentences longer than 70 words (counting punctuation marks) were excluded from the BLLIP WSJ Corpus, whereas most sentences longer than 20-25 words in the LOB corpus were omitted from the LPC.
