
As described in Chapter 5, XSLT can be used to transform linguistic markup for visualization purposes. The tree visualizations in Figures 8.8 through 8.12 have been generated through a generic WHAT D-query that transforms the XML-encoded topological parse tree into a Thistle visualization tree (Calder, 2000).

The target SGML format (Thistle arbora DTD) can be generated thanks to the openness of XSLT with respect to output formats. The Thistle editor mode could in principle be used to edit the generated tree representations (Figure 8.14). The resulting, corrected SGML file could be used to improve the underlying stochastic topological parser that originally generated the topological tree.

Figure 8.14: Topological parse tree in Thistle editor mode

8.8 Related Work

WHITEBOARD was the first implemented system that integrated multiple shallow processing components (not only PoS tagging) with an advanced, high-performance deep HPSG-based parser. In addition, WHITEBOARD provides an architecture framework that supports easy integration of other shallow components by means of XML annotation and through XSL transformation. These facts make the architecture superior to other, in most cases ad hoc, integrations of specific systems.


Another NLP architecture also called WHITEBOARD has been developed at ATR Kyoto (Boitet and Seligman, 1994). The focus of that prototypical system, designed for speech translation, was to overcome restrictions of both pipeline and blackboard architectures by postulating a coordinator that would schedule NLP components and mediate between them. However, the ATR WHITEBOARD idea is different from our WHITEBOARD in that access to NLP component results is only possible via the coordinator. Moreover, the assumed and supported data structures are specific to speech processing (time-aligned lattice) and not directly usable for concepts such as abstraction-based deep and annotation-based shallow processing results.

There is very little other work that considers the integration of shallow and deep NLP using an XML-based architecture, most notably Grover and Lascarides (2001) for the HPSG precursor GPSG. However, their integration efforts are largely limited to the level of PoS tag information.

Ad hoc integrations of PoS tagging and specific HPSG grammars have been conducted for Dutch (Prins and van Noord, 2003) and Spanish (Marimon, 2002a). There was also integration work in deep parsing other than HPSG, e.g. Daum et al. (2003) combined PoS tagging and chunking with a dependency parser. Kaplan and King (2003) and Kaplan et al. (2004) combine PoS tagging and finite-state preprocessing with the LFG parser. The common observation from their results is that mainly PoS tagging as preprocessing increases coverage and robustness in a way that the deep frameworks alone could not accomplish economically. The results we have obtained in WHITEBOARD support this observation.

8.9 Summary

In this chapter, we have presented the key architecture concepts of WHITEBOARD, the WHITEBOARD Annotation Machine (WHAM) and the WHITEBOARD Annotation Transformer (WHAT). We have demonstrated an application scenario with highly integrated multiple shallow preprocessors and a deep parser, and shown the advantages of integrating both for increased robustness (recognition of words unknown to the deep lexicon) and search space reduction (shallow pre-shaping of the deep parser's search space).

An evaluation on 5000 sentences of a German newspaper corpus showed that the already high efficiency of deep parsing could be further improved by a factor of 2.25 on average, that lexical coverage increased from 28% to 71%, and that overall parsing coverage (full parses) increased from 12.5% to 22%. It should be noted that these results were obtained at a very early stage of German HPSG grammar development, when the grammar was more elaborated for speech dialog (VERBMOBIL) than for general newspaper text.
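To put the reported figures in relative terms, the following sketch computes the corresponding improvement factors. It assumes that the parsing coverage percentages refer to the 5000 evaluated sentences; the variable and function names are illustrative and not part of the evaluation setup.

```python
# Figures as reported in the text; all names here are illustrative.
SENTENCES = 5000
SPEEDUP = 2.25                               # average efficiency gain
lexical_before, lexical_after = 28.0, 71.0   # lexical coverage in percent
parse_before, parse_after = 12.5, 22.0       # full parses in percent


def relative_gain(before, after):
    """Return the factor by which a percentage grew."""
    return after / before


def full_parses(percent, total=SENTENCES):
    """Number of fully parsed sentences, ASSUMING percent is a share of total."""
    return round(total * percent / 100)


lexical_factor = relative_gain(lexical_before, lexical_after)
parse_factor = relative_gain(parse_before, parse_after)
parses_before = full_parses(parse_before)
parses_after = full_parses(parse_after)
# Fraction of the original parsing time that remains after the speed-up.
remaining_time = 1 / SPEEDUP

print(f"lexical coverage grew by a factor of {lexical_factor:.2f}")
print(f"parsing coverage grew by a factor of {parse_factor:.2f}")
print(f"full parses: {parses_before} -> {parses_after}")
print(f"remaining parsing time: {remaining_time:.1%}")
```

Under the stated assumption, the increase in parsing coverage corresponds to several hundred additional full parses on the evaluation set.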

WHITEBOARD, extended with WHAT, is an open, flexible and powerful infrastructure based on standard XSLT technology for the online and offline combination of natural language processing components, with a focus on, but not limited to, hybrid deep and shallow architectures.

The infrastructure is portable. As the programming language-specific wrapper code is relatively small, the framework can be quickly ported to any programming language that has XSLT support (which holds for most modern programming and scripting languages). XSLT makes the transformation code portable and declarative, which it could not be if it were based on DOM manipulation in an ordinary programming language.

The WHAT framework can easily be extended to new NLP components and document DTDs. This has to be done only once for a component or DTD through XSLT query library definitions, and access will be available immediately in all programming languages for which a WHAT implementation exists.

WHAT can be used to perform computations and complex transformations on XML annotation and to provide uniform XML annotation access that abstracts from component-specific naming and DTD structure. WHAT makes it easier to exchange results between components (e.g. to give non-XML-aware components access to information encoded in XML annotation), and to define application-specific architectures for online and offline processing of NLP XML annotation.
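The idea of uniform access can be illustrated with a minimal sketch. It is not the WHAT implementation: plain Python with ElementTree path expressions stands in for the XSLT query templates, and the two annotation formats, the component names `taggerA` and `taggerB`, and the function `get_pos_tags` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of two shallow components with different naming.
TAGGER_A = '<doc><w pos="NN">Haus</w><w pos="VVFIN">steht</w></doc>'
TAGGER_B = ('<text><token tag="NN" form="Haus"/>'
            '<token tag="VVFIN" form="steht"/></text>')

# Hypothetical query library: one entry per component, defined only once.
# Each entry states where tokens live and how to read word form and PoS tag.
QUERY_LIBRARY = {
    "taggerA": {
        "path": ".//w",
        "form": lambda e: e.text,
        "pos": lambda e: e.get("pos"),
    },
    "taggerB": {
        "path": ".//token",
        "form": lambda e: e.get("form"),
        "pos": lambda e: e.get("tag"),
    },
}


def get_pos_tags(component, xml_text):
    """Return (form, PoS) pairs independent of component-specific naming."""
    entry = QUERY_LIBRARY[component]
    root = ET.fromstring(xml_text)
    return [(entry["form"](e), entry["pos"](e))
            for e in root.findall(entry["path"])]


print(get_pos_tags("taggerA", TAGGER_A))
print(get_pos_tags("taggerB", TAGGER_B))
```

Client code calls `get_pos_tags` without knowing element or attribute names; supporting a further component would only require a new library entry.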

Due to its flexibility, the infrastructure is well suited for rapid prototyping of hybrid NLP architectures as well as for developing NLP applications, and can be used both to access NLP markup from programming languages and to compute or transform it.

Besides the integration within NLP architectures described in this section, the XSLT-based infrastructure (WHAT) could also be used for interfacing applications, e.g. to translate to Thistle (Calder, 2000) for visualization of linguistic analyses and back from Thistle in editor mode, e.g. for manual, graphical correction of automatically annotated texts for training etc.

Because of its unstable standardization and implementation status, we have not yet made use of XQuery, an XML query language discussed in Chapter 5. However, the WHAT framework is open, and it might be worth considering XQuery as a future extension. Which engine to ask, an XSLT or an XQuery processor, could be encoded in each <query> element of the template library using an additional attribute. Similarly, extending the current framework to XSLT 2.0, which among other things supports user-definable functions that can be part of XPath expressions, should be straightforward.
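A per-query engine selection of this kind could look roughly as follows. This is only a sketch: the attribute name `engine`, the library layout, the query names, and the two stub processors are assumptions made for illustration, and no real XSLT or XQuery processor is invoked.

```python
import xml.etree.ElementTree as ET

# Hypothetical template library; the "engine" attribute is an assumption.
LIBRARY = """
<library>
  <query name="getPos" engine="xslt">template body omitted</query>
  <query name="getChunks" engine="xquery">query body omitted</query>
  <query name="getTokens">template body omitted</query>
</library>
"""


def run_xslt(body):
    """Stub standing in for a call to an XSLT processor."""
    return ("xslt", body)


def run_xquery(body):
    """Stub standing in for a call to an XQuery processor."""
    return ("xquery", body)


ENGINES = {"xslt": run_xslt, "xquery": run_xquery}


def dispatch(library_xml, query_name, default_engine="xslt"):
    """Look up a query by name and hand its body to the selected engine."""
    root = ET.fromstring(library_xml)
    for query in root.findall("query"):
        if query.get("name") == query_name:
            # Queries without the attribute fall back to the default engine.
            engine = query.get("engine", default_engine)
            return ENGINES[engine]((query.text or "").strip())
    raise KeyError(query_name)


print(dispatch(LIBRARY, "getChunks"))
print(dispatch(LIBRARY, "getTokens"))
```

Defaulting to XSLT when the attribute is absent would keep an existing template library valid without changes.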

Compared to ad hoc integrations of specific deep parsers with specific PoS taggers, the XML and XSLT-based WHITEBOARD architecture approach offers much more flexibility. This made it easy to also integrate further levels of natural language processing that other integrated systems do not provide, such as named entity recognition or topological parsing.

However, it has to be noted that although there is obviously huge potential in combining many more existing shallow and deep NLP components, and in different ways, through a general architecture such as WHITEBOARD, only some concepts could be tried within the project, and even fewer could be turned into applications utilizing the new approach.

An interesting application of the architecture that was already formulated in the project proposal, but has not been tried in an implementation so far, is to use deep processing to support shallow processing on demand. Using this strategy, e.g. in information extraction or opinion mining, it could be possible both to preserve the high robustness of shallow processing and to achieve high precision on crucial parts of a text that have been identified by shallow methods.
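The control flow of such an on-demand strategy might be sketched as follows. Since the strategy has not been implemented in the project, everything here is hypothetical: the cue words, the token limit that simulates a deep parse failure, and the functions `shallow_analyze`, `deep_parse` and `analyze` are placeholders, not actual components.

```python
TRIGGERS = {"acquired", "merger"}  # hypothetical cue words for crucial parts


def shallow_analyze(sentence):
    """Cheap, robust stand-in for shallow processing; always succeeds."""
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    return {
        "sentence": sentence,
        "level": "shallow",
        "crucial": any(t in TRIGGERS for t in tokens),
    }


def deep_parse(sentence, max_tokens=8):
    """Stand-in for a deep parser; returns None to simulate a failed parse."""
    if len(sentence.split()) > max_tokens:
        return None
    return {"sentence": sentence, "level": "deep"}


def analyze(sentences):
    """Run shallow analysis everywhere and deep parsing only on demand."""
    results = []
    for sentence in sentences:
        shallow = shallow_analyze(sentence)
        deep = deep_parse(sentence) if shallow["crucial"] else None
        # Keep the shallow result whenever deep parsing was skipped or failed.
        results.append(deep if deep is not None else shallow)
    return results


text = [
    "The weather was fine.",
    "Acme acquired Widgets.",
    "After long talks the merger of the two very large companies "
    "was finally announced today.",
]
for result in analyze(text):
    print(result["level"], "|", result["sentence"])
```

Every sentence receives at least a shallow result, so robustness is preserved, while the deep parser is only invoked where the shallow step marks a sentence as crucial.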

Also, mainly because of time and resource limitations, the architecture was not fully instantiated for languages other than German.

A further generalization of WHITEBOARD towards more robust, application-oriented integration of deep and shallow NLP components that is even better suited for high coverage and high precision in restricted domains and Semantic Web-related applications will be presented in the next chapter.

Chapter 9

Heart of Gold

9.1 Introduction and Motivation

In the previous chapter, we have described an integration architecture for deep and shallow natural language processing components called WHITEBOARD. Although WHITEBOARD has been designed for flexible integration of components, more or less a single scenario for German (with and without topological parsing) has been fully implemented. The architecture was successful in the sense that the benefits of integrating deep and shallow approaches to NLP could be shown clearly in a mature and stable implementation that was robust enough to parse German newspaper corpora and other unseen text online.

However, the focus of WHITEBOARD was to demonstrate the feasibility and evaluate the benefits of the hybrid approach from the linguistic, mainly syntactic, perspective. Many aspects that become important when deep-shallow integration is explored in real NLP-based applications could not be addressed during the WHITEBOARD project, one main reason being that the German HPSG grammar at that time did not provide a sufficiently functional semantics construction.

The architectural aspects missing in WHITEBOARD that had to be addressed further comprise (1) true multilinguality (also in parallel), (2) integration support for components implemented in programming languages other than Java and C/C++, (3) a more flexible, configurable processing order of components, (4) a fully networking-enabled architecture, and (5) post-parsing and fallback integration on a semantics representation level.
