8.4.1.1 Shallow Component: SPPC
Shallow preprocessing is performed by SPPC, a rule-based system consisting of a cascade of weighted finite-state components. Its analysis pipeline comprises tokenization, lexico-morphological analysis, part-of-speech filtering, named entity recognition, sentence boundary detection, and chunk and sub-clause recognition. SPPC is described in Piskorski and Neumann (2000) and Neumann and Piskorski (2002).
We will briefly describe those components of SPPC which we integrated with the deep parser.
The SPPC tokenizer first separates words from punctuation symbols and returns a token classification that is relatively fine-grained compared to other tokenizers (52 different token classes), e.g.
<ITEM id="3" type="two_digit_number"/> <ITEM id="4" type="four_digit_number"/>
<ITEM id="6" type="number_percent_compositum"/> <ITEM id="7" type="decimal_number_with_period"/> <ITEM id="8" type="number_dot_compositum"/> <ITEM id="16" type="email_address"/>
<ITEM id="17" type="url_address"/>
<ITEM id="20" type="initial_capital_period"/> <ITEM id="21" type="lowercase_word"/>
<ITEM id="22" type="first_capital_word"/>
<ITEM id="33" type="simple_word_dash_first_capital"/> <ITEM id="48" type="abbreviation"/>
<ITEM id="50" type="word_followed_by_dots"/> <ITEM id="51" type="end_of_paragraph"/>
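The kind of classification shown above can be illustrated with a minimal sketch. It is not the SPPC implementation, which uses finite-state devices: the regular expressions below are assumptions chosen for illustration, only a handful of the 52 classes are covered, and the fallback label "unknown" is hypothetical.

```python
import re

# Hypothetical patterns for a few of the token classes listed above.
# Order matters: the first pattern that matches the whole token wins.
TOKEN_CLASSES = [
    ("two_digit_number", re.compile(r"\d{2}")),
    ("four_digit_number", re.compile(r"\d{4}")),
    ("decimal_number_with_period", re.compile(r"\d+\.\d+")),
    ("email_address", re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
    ("initial_capital_period", re.compile(r"[A-ZÄÖÜ]\.")),
    ("lowercase_word", re.compile(r"[a-zäöüß]+")),
    ("first_capital_word", re.compile(r"[A-ZÄÖÜ][a-zäöüß]+")),
]


def classify_token(token: str) -> str:
    """Return the name of the first class whose pattern matches the token."""
    for name, pattern in TOKEN_CLASSES:
        if pattern.fullmatch(token):
            return name
    return "unknown"  # hypothetical fallback, not an SPPC class
```

Under these assumed patterns, classify_token("1999") yields "four_digit_number" and classify_token("Haus") yields "first_capital_word".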
Tokens identified as potential word forms are then morphologically analyzed. The analysis distinguishes 420 different morphological types, each representable as a set of feature-value pairs, e.g.
<ITEM id="14" gender="M" case="GEN" number="PL"/> <ITEM id="32" person="2" case="NOM" number="SG"/>
<ITEM id="33" person="3" gender="M" case="NOM" number="SG"/> <ITEM id="38" tense="PRES" person="3" number="SG"/>
<ITEM id="70" tense="SUBJUNCT-1" person="3" number="PL"/> <ITEM id="72" form="INFIN"/>
<ITEM id="91" gender="M" case="GEN" number="SG" comp="P" det="INDEF"/>
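As a rough illustration of how such a type inventory could be consumed, the following sketch reads ITEM elements into a table of feature-value pairs using Python's standard XML parser. The enclosing `<types>` element and the function name are assumptions made for this example; the text does not specify how the inventory is stored.

```python
import xml.etree.ElementTree as ET

# Three entries copied from the list above; the <types> wrapper is assumed.
SAMPLE = """<types>
<ITEM id="14" gender="M" case="GEN" number="PL"/>
<ITEM id="38" tense="PRES" person="3" number="SG"/>
<ITEM id="72" form="INFIN"/>
</types>"""


def load_morph_types(xml_text):
    """Map each numeric type id to its feature-value pairs."""
    table = {}
    for item in ET.fromstring(xml_text).iter("ITEM"):
        features = dict(item.attrib)
        type_id = int(features.pop("id"))  # the id is the key, not a feature
        table[type_id] = features
    return table


MORPH_TYPES = load_morph_types(SAMPLE)
```

A lookup such as MORPH_TYPES[72] then returns the feature-value pairs of the infinitive type.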
Lexical information (a list of valid readings including stem, part-of-speech and inflection information) is computed using a full-form lexicon of about 700000 entries that has been compiled from a stem lexicon of about 120000 lemmata. After morphological processing, PoS disambiguation rules are applied which compute a preferred reading for each token (the deep parser, however, can also back off to all readings). SPPC recognizes the following 24 PoS types:
<ITEM id="1" type="N"/> <ITEM id="2" type="V"/> <ITEM id="3" type="AUX"/> <ITEM id="4" type="MODV"/> <ITEM id="5" type="A"/> <ITEM id="6" type="ATTR-A"/> <ITEM id="7" type="DEF"/> <ITEM id="8" type="INDEF"/> <ITEM id="9" type="RELPRON"/> <ITEM id="10" type="PERSPRON"/> <ITEM id="11" type="REFPRON"/> <ITEM id="12" type="POSSPRON"/> <ITEM id="13" type="WHPRON"/> <ITEM id="14" type="ORD"/> <ITEM id="15" type="CARD"/> <ITEM id="16" type="VPREF"/> <ITEM id="17" type="ADV"/> <ITEM id="18" type="WHADV"/> <ITEM id="19" type="COORD"/> <ITEM id="20" type="SUBORD"/> <ITEM id="21" type="INTP"/> <ITEM id="22" type="PART"/> <ITEM id="23" type="PREP"/> <ITEM id="24" type="STOP-WORD"/>
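The interplay between a preferred reading and the back-off to all readings can be sketched as follows. The class and function names are hypothetical and do not reflect the actual SPPC or WHAM data structures; the sketch only shows one plausible way to expose both views of a token.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reading:
    stem: str
    pos: str  # one of the PoS types listed above, e.g. "N" or "V"


@dataclass
class Token:
    surface: str
    readings: List[Reading]
    preferred: Optional[int] = None  # index into readings, set by disambiguation


def readings_for_parser(token: Token, back_off: bool = False) -> List[Reading]:
    """Return only the preferred reading, or all readings when backing off."""
    if back_off or token.preferred is None:
        return list(token.readings)
    return [token.readings[token.preferred]]


# Assumed example: the form "Wagen" is ambiguous between a noun and a verb.
wagen = Token("Wagen", [Reading("Wagen", "N"), Reading("wagen", "V")], preferred=0)
```

Here readings_for_parser(wagen) returns the noun reading only, while readings_for_parser(wagen, back_off=True) returns both readings.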
Named entity recognition is based on simple, string-based pattern matching techniques that recognize, e.g., organizations, persons, locations, temporal expressions and quantities (13 NE types, 24 subtypes):
<ITEM id="1" type="date"/>
<ITEM id="2" type="organization"/> <ITEM id="3" type="location"/> <ITEM id="4" type="monetary"/> <ITEM id="5" type="person"/> <ITEM id="6" type="percentage"/> <ITEM id="7" type="time"/> <ITEM id="8" type="number"/>
<ITEM id="9" type="address"/>
<ITEM id="10" type="person_candidate"/> <ITEM id="11" type="organization_candidate"/> <ITEM id="12" type="location_candidate"/> <ITEM id="13" type="position"/>
Next, NE-specific reference resolution is performed through the use of a dynamic lexicon which stores abbreviated variants of previously recognized named entities. Finally, the system splits the text into sentences by applying only a few, but highly accurate, contextual rules for filtering implausible punctuation signs. These rules benefit directly from NE recognition, which already performs a restricted punctuation disambiguation.
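The idea of a dynamic lexicon for NE reference resolution can be sketched as follows. How SPPC actually derives abbreviated variants is not described here, so the variant rules below (each component word, plus an acronym built from the initials) are assumptions, as are all class and method names.

```python
class DynamicNELexicon:
    """Maps short variants to previously recognized named entities."""

    def __init__(self):
        self._variants = {}

    def register(self, full_name, ne_type):
        """Store assumed variants of a newly recognized named entity."""
        words = full_name.split()
        variants = set(words)  # assumption: each component word is a variant
        if len(words) > 1:
            # assumption: the initials of a multi-word name form an acronym
            variants.add("".join(word[0] for word in words))
        for variant in variants:
            # Keep the earliest entity if two entities share a variant.
            self._variants.setdefault(variant, (full_name, ne_type))

    def resolve(self, mention):
        """Return (full name, NE type) for a known variant, else None."""
        return self._variants.get(mention)
```

After lexicon.register("Hans Müller", "person"), a later mention of "Müller" alone resolves to the full name and its type, while an unseen name resolves to None.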
The output of SPPC is in XML format. WHAM transforms it into the index-sequential format described above for fast random access through the WHAM shallow API.
8.4.1.2 Deep Component: PET
The HPSG parser integrated in the WHITEBOARD system is PET (Callmeier, 2000). Initially, PET was built to experiment with different techniques and strategies for processing unification-based grammars. The resulting system provides efficient implementations of the best known techniques for unification and parsing and is still the fastest parser for HPSG grammars.
While PET is basically a runtime parser for fast processing of HPSG grammars, the grammar source can be developed, tested and debugged with the LKB system (Copestake, 2002), which shares with PET a common subset of the TDL formalism (Krieger and Schäfer, 1994) as well as a compatible type hierarchy and typed feature structure model.
Having been designed as an experimental system, the original PET parser lacked open interfaces for flexible integration with external components. For instance, at the beginning of the WHITEBOARD project, the system only accepted full-form lexica and plain text input.
Bernd Kiefer extended the system in collaboration with Ulrich Callmeier. Instead of single-word input, input items were then allowed to be complex, overlapping and ambiguous, i.e., essentially word graphs. Dynamic creation of atomic type symbols, e.g. to be able to add arbitrary symbols as feature values, has been implemented as well.
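To make the notion of complex, overlapping and ambiguous input items concrete, the following sketch models a small word graph and enumerates the alternative item sequences a parser would have to consider. It is an illustration under assumed names (InputItem, paths), not PET's actual input representation, and it assumes that every item ends at a later vertex than it starts, so the graph is acyclic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InputItem:
    start: int  # vertex where the item begins
    end: int    # vertex where the item ends (assumed to be > start)
    label: str  # a surface form or a type symbol


def paths(items: List[InputItem], start: int, goal: int) -> List[List[InputItem]]:
    """Enumerate all item sequences leading from vertex start to vertex goal."""
    if start == goal:
        return [[]]
    result = []
    for item in items:
        if item.start == start:
            for rest in paths(items, item.end, goal):
                result.append([item] + rest)
    return result


# Assumed example: a two-word name is also available as a single NE item.
ITEMS = [
    InputItem(0, 1, "Hans"),
    InputItem(1, 2, "Müller"),
    InputItem(0, 2, "person"),  # named entity spanning both words
    InputItem(2, 3, "schläft"),
]
```

For this graph, paths(ITEMS, 0, 3) yields two alternatives: the word-by-word sequence and the sequence in which the named entity item covers the first two words.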
Finally, a flexible interface has been implemented that uses API calls to WHAM to integrate the results of morphology, tokenization and named entity recognition. As WHAM is implemented in Java and PET in C++, we defined this interface in JNI (Java Native Interface). Through the object-oriented WHAM API layer, PET could in principle also be integrated with shallow systems other than SPPC. We will discuss some shortcomings of the JNI-based API interface in
The German HPSG grammar in WHITEBOARD is based on a large-scale grammar by Müller (1999), which was further developed in the VERBMOBIL project for translation of spoken language (Müller and Kasper, 2000). It therefore covers many constructions that occur frequently in spontaneous speech. After VERBMOBIL, the grammar was adapted, mainly by Berthold Crysmann, to the requirements of the LKB/PET system (Copestake, 2002; Callmeier, 2000) and to written text, i.e., extended with constructions such as free relative clauses that were irrelevant in the VERBMOBIL scenario.
The grammar consists of a rich hierarchy of 5069 lexical and phrasal types. The core grammar contains 23 rule schemata, 7 special verb movement rules, and 17 domain-specific rules. All rule schemata are unary or binary branching. The lexicon contains 38549 stem entries, of which more than 70% were semi-automatically acquired from the annotated NEGRA corpus (Skut et al., 1998).
A further semi-automatic technique has been applied to acquire semantic types for nouns unknown to the deep lexicon, using information available from GermaNet (Hamp and Feldweg, 1997). The approach is elaborated in Siegel et al. (2001). The semantic types are needed for (syntactic) disambiguation based on semantic information and thus help to reduce ambiguity and restrict the search space for the parser.