RECOMENDACIONES - FACULTAD DE INGENIERÍA Y ARQUITECTURA

In this section, we describe the WHAT-based architecture underlying the integrated system, and then provide some evaluation figures in Section 8.7.6.

The fully online-integrated hybrid WHITEBOARDtopoparser architecture con- sists of the efficient deep HPSG parser PET (Callmeier, 2000) utilizing tokeniza- tion, part-of-speech, morphology, lexical, compound, named entity, phrase chunk

8.7. WHITEBOARD II 169 and topological sentence field analyses from shallow components in a sequential architecture. input sentence APPLICATION phrase chunks SPPC LoPar Chunkie PET TnT NE, morphology topo.brackets PoS tags topo.bin topo.flat topo.chunks D D D V,N D

deep result (typed FS)

D,V,N

Figure 8.6: XSLT-based architecture of the hybrid parser

The simplified diagram in Figure 8.6 depicts the components and places where WHAT comes into play in the hybrid integration of deep and shallow processing components (V, N, D denote the WHAT query types, i.e., XSLT transformations). Solid boxes indicate components that produce annotation, dashed boxes indicate the produced XML annotation.

Solid-line arrows represent the transformations used in the online integration. Dashed-line arrows indicate possible access to the intermediate annotation that could be accessed from an application (bottom box), e.g. the Thistle (Calder, 2000) tree visualizations that show the XML annotation tree structures (Figure 8.7 through 8.9) have been created through WHAT D-queries out of the intermediate topo.* XML trees.

The system takes an input sentence, and runs four shallow systems on it:

• the rule-based shallow SPPC (Piskorski and Neumann, 2000) for named en-

tity recognition, compound analysis for German, and morphology and stemming of words unknown to the HPSG lexicon,

170 CHAPTER 8. WHITEBOARD

• TnT, a statistical PoS tagger (Brants, 2000),

• Chunkie, a statistical chunker based on TnT (Skut and Brants, 1998), • LoPar, a probabilistic context-free parser (Schmid, 2000), which takes PoS-

tagged tokens as input, and produces binary tree representations of sentence fields, e.g. topo.bin in Figure 8.7. For a motivation for binary vs. flat trees cf. Becker and Frank (2002).

The results of these components are multiple XML standoff annotations for the input sentence. Named entity, compound, morphological and stem information from SPPC is used by the deep parser to initialize the chart with prototypical feature structures that are filled with shallow information through V-queries for words unknown to the HPSG lexicon and for named entities.

In addition, preference information on part-of-speech is used for prioritization of the deep parser. Details have been described above in the WHITEBOARD-I de- scription (Section 8.4). PoS tagging from TnT is used as input for Chunkie to produce chunking and as input for the shallow topological PCFG parser (Frank et

al., 2003).

The examples in Figure 8.7 through 8.11 show the analyses of the German sentence

Untergebracht war die Garnison in den beiden Wachlokalen Hauptwache und Konstablerwache. (Located was the garrison at the two guard houses, the main guard house and the Constabler guard house.)

with a fronted verb in topic position, which the topological parser identifies cor- rectly. This macro-sentential information can be used to direct the deep parser’s search space towards the correct (and rather infrequent) construction, avoiding al- ternative exploration of the search space.

The following XML document is the output of the topological parser (topo.bin), graphically represented in Figure 8.7.

<?xml version="1.0"?> <ROOT> <CL fn="V2"> <VF fn="TOPIC"> <RK fn="VPART"> <VVPP> <W id="W0">Untergebracht</W> </VVPP> </RK> </VF> <LK fn="VFIN"> <VAFIN> <W id="W1">war</W> </VAFIN> </LK>

8.7. WHITEBOARD II 171 <MF> <ART> <W id="W2">die</W> </ART> <MF> <NN> <W id="W3">Garnison</W> </NN> <MF> <APPR> <W id="W4">in</W> </APPR> <MF> <ART> <W id="W5">den</W> </ART> <MF> <PIDAT> <W id="W6">beiden</W> </PIDAT> <MF> <NN> <W id="W7">Wachlokalen</W> </NN> <MF> <NN> <W id="W8">Hauptwache</W> </NN> <MF> <KON> <W id="W9">und</W> </KON> <MF> <NN> <W id="W10">Konstablerwache</W> </NN> </MF> </MF> </MF> </MF> </MF> </MF> </MF> </MF> </MF> </CL> </ROOT>

172 CHAPTER 8. WHITEBOARD ROOT CL_V2 VF_TOPIC RK_VPART VVPP W Untergebracht LK_VFIN VAFIN W war MF ART W die MF NN W Garnison MF APPR W in MF ART W den MF PIDAT W beiden MF NN W Wachlokalen MF NN W Hauptwache MF KON W und MF NN W Konstablerwache

Figure 8.7: Result of the topological parser (topo.bin) as binary tree

ROOT CL_V2 VF_TOPIC RK_VPART VVPP W Untergebracht LK_VFIN VAFIN W war MF ART W die NN W Garnison APPR W in ART W den PIDAT W beiden NN W Wachlokalen NN W Hauptwache KON W und NN W Konstablerwache

Figure 8.8: The topological tree after flattening (topo.flat)

of D-queries is applied to flatten the binary topological trees that are output by LoPar (result is topo.flat, Figure 8.8) and merge the tree with shallow chunk information from Chunkie (topo.chunks, Figure 8.9). In a next step, we apply the main D-query which computes bracket information for the deep parser from the merged topological tree and chunks (topo.brackets, Figure 8.11).

In order to communicate the structural constraints from the topological parser to the deep parser, the topological tree is scanned for relevant configurations by the stylesheet code of the D-query, and the span information is extracted for the target HPSG constituents. The resulting ‘map constraints’ (Figure 8.11) encode a

8.7. WHITEBOARD II 173 ROOT CL_V2 VF_TOPIC W_VVPP Untergebracht LK_VFIN VAFIN W war MF CHUNK_NP W_ART die W_NN Garnison CHUNK_PP W_APPR in W_ART den W_PIDAT beiden W_NN Wachlokalen CHUNK_CNP W_NN Hauptwache W_KON und W_NN Konstablerwache

Figure 8.9: The topological tree merged with chunks (topo.chunks)

bracket type name (34 different bracket types are mapped) that identifies the target constituent and its left and right boundary, i.e., the concrete span in the sentence.

The bracket information is used to prioritize chart elements of the deep parser that match the constituent boundaries computed by the shallow parser and chunker. The stylesheet directly generates the names of the appropriate HPSG types (value of the rule attributes in Figure 8.11).

Bernd Kiefer extended the PET parser in such a way that it can exploit computed priorities for the chart elements assigned to the phrasal constraints (brackets) from the topological analysis, in addition to the word-based PoS ranking described in WHITEBOARD-I above.

A related approach can be found in Kiefer et al. (2000) for parsing in a speech translation application, where prosodic boundaries help to structure the search space of an HPSG parser.

Depending on the bracket type and chart edge configuration (left-, right-, and

fully matching brackets), application of corresponding HPSG rules is either penal-

ized or ranked higher. A parser task in the following is a configuration of passive and an active chart edge.

A right-matching bracket may affect the priority of parser tasks whose resulting edge will end at the right bracket of a pair such as, for example, a task that would combine edges C and F or C and D in Figure 8.10. Left-matching brackets work analogously. For fully matching brackets, only tasks that produce an edge that matches the span of the bracket pair can be affected, such as a task that combines edges B and C in Figure 8.10.

If, in addition, specified rule as well as feature structure constraints hold, the task is rewarded if they are positive constraints, and penalized if they are negative ones. All tasks that produce crossing edges, i.e., where one endpoint lies strictly inside the bracket pair and the other lies strictly outside, are penalized, e.g. a task that combines edges A and B in Figure 8.10.

174 CHAPTER 8. WHITEBOARD brx brx A B C D E F

Figure 8.10: An example chart with a bracket pair of type x. The dashed edges are active

et al. (2003).

The priority of a parser task is modified relative to a default priority using two confidence values, one, conf_ent(brx), based on tree entropy of a topological parse (per sentence), and one, conf_pr(brx), based on a measure of expected accuracy for each bracket type. These confidence parameters take into account the fact the stochastic topological parser may deliver (partially) wrong analyses and try to correct them. If confidence is high, the topological brackets are fully considered for prioritization. If it is low, their impact is decreased or completely ignored.

Both confidence values are weighted using a heuristically determined weight factor γ, and all these parameters together are used to either add to or subtract from the default priority, depending on whether the bracket and chart configuration triggers reward or penalty. Thus, the priority p(t) of a task t involving a bracket

brxis computed from the default priority ˜p(t) by:

p(t) = ˜p(t) ∗ (1 ± conf_ent(brx) ∗ confpr(x) ∗γ)

Figure 8.11: The extracted brackets (topo.brackets)

Finally, the modified deep parser PET is started with a chart initialized using V-queries to access lexical (morphology, stemming, compounds, PoS preferences) and named entity information gathered from SPPC. The computed bracket information (Figure 8.11) is accessed through WHAT V and N-queries during parsing in order to prioritize constituent analyses motivated by the topological parser and additional syntactic information.

8.7. WHITEBOARD II 175 The abstraction provided by WHAT facilitates exchange of the shallow input components of PET, e.g. it would be possible to exchange some of the used components without rewriting the deep parser’s code.

The complete input for the deep parser is encoded in a single XML document (per input sentence), e.g.

<?xml version="1.0"?> <WHITEBOARD> <SPPC_XML version="2002-06-24" type="transformable"> <ENCODING_TABLES> ... </ENCODING_TABLES>

<DOCUMENT_STATISTICS tokens="12" lexical_items="12" unknown_words="1" words_found_in_lexicon="11" words_with_prefered_reading="9"

named_entities="0" phrases="4" sentences="1" subclauses="0"/> <PARAGRAPH id="P0">

<W id="W0" tc="22">Untergebracht <READINGS id="D0" pref="R0">

</W>

<W id="W1" tc="21">war

</W> </CHUNK>

<CHUNK id="H1" type="1"> <W id="W2" tc="21">die

</READINGS> </W>

<W id="W3" tc="22">Garnison <READINGS id="D3" pref="R4">

</READINGS> </W>

</CHUNK>

<CHUNK id="H2" type="2"> <W id="W4" tc="21">in

</W>

<W id="W5" tc="21">den

176 CHAPTER 8. WHITEBOARD

</W>

<W id="W6" tc="21">beiden <READINGS id="D6" pref="R8">

<R id="R8" pos="14" stem="beid" infl="334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377" code="159"/> </READINGS> </W> <W id="W7" tc="22">Wachlokalen <READINGS id="D7" pref="R9">

</W> </CHUNK>

<CHUNK id="H3" type="1"> <W id="W8" tc="22">Hauptwache

</SEGMENT>

<R id="R14" pos="5" stem="wach" infl="129 130 131 132 133 134 135 136 137 138 139 140 141 142 143" code="82"/> <R id="R15" pos="2" stem="wach" infl="43 62 63 17"

code="71"/>

<R id="R16" pos="1" stem="wache" infl="18 19 20 21" code="13"/> </READINGS> </SEGMENT> </COMPOUND> </W> <W id="W9" tc="21">und

</W>

<W id="W10" tc="22">Konstablerwache <READINGS id="D12" pref="R18">

</READINGS> </W>

8.7. WHITEBOARD II 177 </CHUNK> <W id="W11" tc="1">.</W> </S> </PARAGRAPH> </SPPC_XML> <TOPO sno="0"> <TOPO2HPSG>

<MAPC id="T1" mapc="chunk_np+det" rule="chunk" left="W2" right="W3"/> <MAPC id="T2" mapc="chunk_pp" rule="chunk" left="W4" right="W10"/> <MAPC id="T3" mapc="v2_cp" rule="vfronted" left="W0" right="W10"/> <MAPC id="T4" mapc="vfronted_vfin-rk" rule="vfronted" left="W1"

right="W1"/> <MAPC id="T5" mapc="vfronted_vfin+vp-rk" rule="vfronted" left="W1"

right="W10"/> <MAPC id="T6" mapc="v2_vf" rule="vfronted" left="W0" right="W0"/> <MAPC id="T7" mapc="v2_vfin_pvp-rk" rule="vfronted" left="W1"

right="W1"/> </TOPO2HPSG>

</TOPO> </WHITEBOARD>

The result of deep parsing including the constructed semantics representation of the analyzed sentence can be accessed through the chart interface of PET as typed feature structures (Figure 8.13), or via subsequent XSLT transformation in other formats, extracting only partial information.

In document FACULTAD DE INGENIERÍA Y ARQUITECTURA (página 93-111)