One of the foremost attempts at parsing language on a large scale was made in the Discourse Analysis Project [Joshi, 1960; Joshi and Hopely, 1997; Harris, 1962] in the Department of Linguistics at the University of Pennsylvania. The parser was designed and implemented on the UNIVAC-1 during 1958-59. The project was directed by Zellig Harris, and its members included Carol Chomsky, Lila Gleitman, Aravind Joshi, Bruria Kaufman and Naomi Sager. The parser, now called Uniparse [Joshi and Hopely, 1997], has been reconstructed by Phil Hopely and Aravind Joshi from the complete original documentation (Papers #15 to #19 of the Transformation and Discourse Analysis Project (TDAP)). It is not only historically important; it also incorporated what are currently regarded as state-of-the-art techniques for parsing large-scale texts.
The parser was designed as a multi-stage system. It is a cascade of Finite State Transducers (FSTs), except for the last stage, which technically is not an FST but is closer to a Push Down Transducer (PDT). Each word in the input string is "tagged" with the class or classes to which it belongs. If a word is tagged with multiple classes, a series of tests is applied to identify environments in which a class definitely cannot hold. These negative tests may still leave the class ambiguity unresolved.
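The tagging stage and its negative tests can be sketched as follows. This is only an illustration: the toy lexicon, the class symbols (T for article, N for noun, V for verb) and the single test are assumptions made for the example, not the rules used in the original parser.

```python
# Hypothetical toy lexicon: each word maps to the set of classes it may belong to.
LEXICON = {
    "the": {"T"},
    "can": {"N", "V"},
    "rusts": {"V"},
}


def tag(words):
    """Attach every candidate class to each word (unknown words default to N)."""
    return [(w, set(LEXICON.get(w, {"N"}))) for w in words]


def not_verb_after_article(tagged, i):
    """Assumed negative test: a word directly after an article cannot be a verb."""
    return i > 0 and tagged[i - 1][1] == {"T"}


# Each class lists the tests that can rule it out in a given environment.
NEGATIVE_TESTS = {"V": [not_verb_after_article]}


def prune(tagged):
    """Drop classes that some negative test excludes; never drop all of them."""
    result = []
    for i, (word, classes) in enumerate(tagged):
        kept = {
            c for c in classes
            if not any(test(tagged, i) for test in NEGATIVE_TESTS.get(c, []))
        }
        result.append((word, kept or set(classes)))
    return result


print(prune(tag(["the", "can", "rusts"])))
```

In this sketch "can" loses its verb reading after an article, while a sentence-initial "can" keeps both classes, which mirrors the remark that the tests may leave ambiguity unresolved.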
The input string is then segmented into first-order strings based on the class marks associated with the words of the input. First-order strings are typically noun phrases, verb sequences and adjunct phrases; they are minimal structures that do not include any nesting of other first-order strings. Once the first-order strings are identified, their internal structure is not analyzed further. The first-order strings behave like chunks in which there is a principal word and all the other words bear some relation to it. The domain of these relations is "local" and does not extend beyond the substring concerned. Because first-order strings are never nested, they can be recognized by a finite-state computation.
The input is scanned either left to right or right to left, depending on the type of first-order string being recognized. Elementary noun sequences, adjunct phrases and verb sequences are recognized in that order. The chunks recognized in one scan are treated as frozen in subsequent scans. Examples with the first-order strings marked off are shown below. [ ] indicate noun phrases, { } indicate verb sequences and ( ) indicate adjuncts.
(2.5) [Those papers]{may have been published}(in [a hurry]).
(2.6) [I]{may (soon) go}.
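The staged recognition of first-order strings can be approximated with a few regular-expression passes over the class symbols, each pass freezing the chunks it finds. The sketch below makes several simplifying assumptions: the one-letter class symbols (T, A, N, P, M, V) and the three patterns are hypothetical, every pass scans left to right, and adjuncts nested inside a verb sequence, as in (2.6), are not handled.

```python
import re


def scan(items, pattern, new_symbol, left, right):
    """Replace each run of items whose symbols match `pattern` by one frozen chunk.

    `items` is a list of (symbol, text) pairs with one-character symbols, so
    string offsets in the symbol string are also indices into `items`.
    """
    symbols = "".join(sym for sym, _ in items)
    out, pos = [], 0
    for m in re.finditer(pattern, symbols):
        out.extend(items[pos:m.start()])
        text = " ".join(t for _, t in items[m.start():m.end()])
        out.append((new_symbol, left + text + right))
        pos = m.end()
    out.extend(items[pos:])
    return out


def chunk(tagged):
    """Run the three passes in the order given in the text."""
    items = [(cls, word) for word, cls in tagged]
    items = scan(items, r"T?A*N+", "n", "[", "]")  # elementary noun sequences
    items = scan(items, r"Pn", "p", "(", ")")      # adjuncts: preposition + noun chunk
    items = scan(items, r"M?V+", "v", "{", "}")    # verb sequences
    return "".join(text for _, text in items)


sentence = [
    ("Those", "T"), ("papers", "N"), ("may", "M"), ("have", "V"), ("been", "V"),
    ("published", "V"), ("in", "P"), ("a", "T"), ("hurry", "N"),
]
print(chunk(sentence))  # [Those papers]{may have been published}(in [a hurry])
```

Because a recognized chunk is replaced by a single lowercase symbol, later passes cannot reopen it, which is one simple way to model chunks being frozen for subsequent scans.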
Second-order substrings are clausal in nature. They are fixed sequences of first-order strings and can include other second-order substrings. Recognition of a second-order string begins with a left-to-right scan of the input in which the first-order substrings have been replaced by single characters. The input also contains the second-order heads that are not part of any first-order string (for example, complementizers and conjunctions or sentential subjects). The second-order strings are recognized using an automaton similar to a push-down transducer. The sentence is judged well-formed if the subcategorization requirements of the verb are satisfied. The following examples illustrate the clausal annotation. The markers < and > enclose a clause; that-clauses, sentential subjects and objects are enclosed between / and \; and + marks the end of a complement.
(2.7) [Those] <who read [newspapers]> {waste} [their time].
(2.8) (Under [conditions]) (of [dual induction]) , / decreasing: G [the amino-acid concentration]: + \ : N (from [0.2]) (to [0.1 per cent]) {suppressed}: W [beta-galactosidase synthesis]: + (to [a greater extent]) than [nitrate reductase]
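The role of the stack in second-order recognition can be illustrated with a small push-down recognizer over the single-character input. Everything here is an assumption made for the sketch: the symbols (n for a noun chunk, v for a verb chunk, w for a clause head such as "who"), the encoding of subcategorization as a number of required objects, and the rule that an embedded clause closes as soon as its verb is satisfied.

```python
def parse_clauses(tokens):
    """Recognize nested clauses with an explicit stack.

    `tokens` is a list of (symbol, text, objects) triples, where `objects` is
    the number of objects a verb requires (ignored for other symbols).
    Returns (annotated string, well_formed).
    """
    stack = [{"verb": False, "need": 0}]  # bottom frame is the main clause
    out = []

    def close_finished():
        # Pop every embedded clause whose verb has received all its objects.
        while len(stack) > 1 and stack[-1]["verb"] and stack[-1]["need"] == 0:
            stack.pop()
            out[-1] += ">"

    for symbol, text, objects in tokens:
        frame = stack[-1]
        if symbol == "w":  # clause head: open a new frame
            stack.append({"verb": False, "need": 0})
            out.append("<" + text)
            continue
        if symbol == "v":
            if frame["verb"]:
                return " ".join(out), False  # second verb in the same clause
            frame["verb"], frame["need"] = True, objects
        elif symbol == "n" and frame["verb"]:
            if frame["need"] == 0:
                return " ".join(out), False  # object the verb does not license
            frame["need"] -= 1
        out.append(text)
        close_finished()

    main = stack[0]
    well_formed = len(stack) == 1 and main["verb"] and main["need"] == 0
    return " ".join(out), well_formed


tokens = [
    ("n", "[Those]", 0), ("w", "who", 0), ("v", "{read}", 1),
    ("n", "[newspapers]", 0), ("v", "{waste}", 1), ("n", "[their time]", 0),
]
print(parse_clauses(tokens))
```

A finite-state device cannot track arbitrarily deep clause nesting, which is why this stage needs the stack; the well-formedness flag corresponds to the subcategorization check described above.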
Where a stage admits multiple possibilities, one choice is pursued while the alternatives are recorded. The search is similar to a chart-based, depth-first, preference-driven search with the possibility of backtracking.
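The control strategy can be sketched as a recursive depth-first search in which each stage returns its candidates in order of preference. This is a minimal illustration under stated assumptions: the stage functions are hypothetical, and the sketch omits the chart, so work shared between alternatives is recomputed rather than reused.

```python
def search(state, stages):
    """Depth-first, preference-ordered search through a cascade of stages.

    Each stage maps a state to a list of candidate next states, best first.
    Returns the first complete analysis found, or None if every path fails.
    """
    if not stages:
        return state
    first, rest = stages[0], stages[1:]
    for candidate in first(state):  # the preferred choice is tried first
        result = search(candidate, rest)
        if result is not None:
            return result           # success: remaining alternatives are not tried
    return None                     # every alternative failed: backtrack


def choose_class(state):
    """Hypothetical stage: extend the class sequence, preferring V over N."""
    return [state + ("V",), state + ("N",)]


def require_noun_after_article(state):
    """Hypothetical stage: accept only an article followed by a noun."""
    return [state] if state == ("T", "N") else []


print(search(("T",), [choose_class, require_noun_after_article]))
```

Here the preferred verb reading fails at the second stage, so the search backtracks to the recorded noun alternative and succeeds with it.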
2.4 FASTUS
The FASTUS system [Appelt et al., 1993] was developed in the context of the Fourth Message Understanding Conference (MUC) in 1992 as a successor to the TACITUS system. The main motivation for developing FASTUS was that TACITUS was extremely slow at parsing the text of the MUC messages, taking 36 hours to process 100 messages.
The FASTUS system processed the same set of 100 messages in 12 minutes, roughly 180 times faster. The crucial difference between FASTUS and TACITUS is that FASTUS uses a finite-state mechanism to extract partial parses instead of a full parser.
Processing in FASTUS is driven by pattern matching. Patterns are used for name recognition, to determine the relevance of a sentence to the task at hand, and for syntactic and semantic processing. Each pattern is associated with at least one trigger word. The lexicon used for syntactic recognition contains 20000 lexical items with a total of 43000 morphologically inflected forms. The syntactic component consists of non-deterministic finite-state automata that recognize noun groups and verb groups. The noun group recognizer has 37 states and recognizes noun phrases, each with a head noun together with its left modifiers and determiners. The verb group recognizer has 18 states and covers the verb, the auxiliary and any intervening adverbs. The noun and verb groups are used to recognize patterns and output an incident structure. A series of pattern-incident structure pairs has been written for the MUC task. Some patterns skip over relative clauses and prepositional phrases attached to nouns. The incident structures are merged with other incident structures found in the same sentence. Merging is blocked if the types of the incident structures are incompatible.
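The merging step can be sketched as follows. The representation is an assumption made for illustration: incident structures are modeled as dictionaries with a "type" slot, the slot names and type labels are hypothetical, and the rule that conflicting slot fillers also block a merge is not stated in the text, which mentions only incompatible types.

```python
def merge(a, b):
    """Merge two incident structures; return None if the merge is blocked."""
    if a["type"] != b["type"]:
        return None  # incompatible incident types block the merge
    merged = dict(a)
    for slot, value in b.items():
        if merged.get(slot) is None:
            merged[slot] = value  # fill an empty slot from the other structure
        elif value is not None and merged[slot] != value:
            return None  # assumption: conflicting fillers also block the merge
    return merged


# Hypothetical structures extracted from two patterns in the same sentence.
first = {"type": "BOMBING", "target": "embassy", "perpetrator": None}
second = {"type": "BOMBING", "target": None, "perpetrator": "guerrillas"}
other = {"type": "KIDNAPPING", "target": "mayor", "perpetrator": None}

print(merge(first, second))
print(merge(first, other))
```

Merging in this style lets several partial pattern matches in one sentence contribute to a single, more complete incident structure.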
The success of this approach lies in identifying rules for chunking and associating interpretations with the chunks. It would be interesting if these rules could be derived automatically for a given domain from some initial training material in that domain.