INTEGRATED ECONOMIC AND ENVIRONMENTAL ACCOUNTS FOR WATER

Implementation of new international statistical recommendations and standards


City University of Hong Kong, Hong Kong, China

Introduction

Machine translation (MT) is the mechanization and automation of the process of translating from one natural language into another. Translation is a task which needs to tackle the ‘semantic barriers’ between languages using real-world encyclopedic knowledge, and requires a full understanding of natural language. Accordingly, different approaches have been proposed for addressing the challenges involved in automating this task. At present the major approaches include rule-based machine translation (RBMT), which relies heavily on linguistic analysis and representation at various linguistic levels, and example-based machine translation (EBMT) and statistical machine translation (SMT), both of which follow a more general corpus-based approach and make use of parallel corpora as a primary resource.

This chapter presents an overview of the EBMT technology. In brief, EBMT involves extracting knowledge from existing translations (examples) in order to facilitate translation of new utterances. A comprehensive review of EBMT can be found in Somers (2003: 3−57) and the latest developments in Way (2010: 177−208).

After reviewing the history of EBMT and the controversies it has generated over the past decades, we will examine the major issues related to examples, including example acquisition, granularity, size, representation and management. The fundamental stages of translation for an EBMT system will be discussed with attention to the various methodologies and techniques belonging to each stage. Finally the suitability of EBMT will be discussed, showing the types of translation that are deemed suitable for EBMT, and how EBMT interoperates with other MT approaches.

Origin

The idea of using existing translation data as the main resource for MT is most notably attributed to Nagao (1984: 173−180). Around the same time, there were other attempts at similarly exploiting parallel data as an aid to human translation. Kay (1976, 1997: 3−23), for example, introduced the concept of translation memory (TM), which has become an important feature in many computer-aided translation (CAT) systems. TM can be understood as a ‘restricted form of EBMT’ (Kit et al. 2002: 57−78) in the sense that both involve storing and retrieving previous translation examples; nevertheless, in EBMT the translation output is produced by the system, while in TM this is left to human effort. Arthern (1978: 77−108), on the other hand, proposes ‘a programme which would enable the word processor to “remember” whether any part of a new text typed into it had already been translated, and to fetch this part together with the translation’. Similarly, Melby and Warner (1995) mention the ALPS system, one of the earliest commercial MT systems, which dates back to the 1970s and incorporated what they called a ‘Repetition Processing’ tool.
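The retrieval idea shared by TM and EBMT can be illustrated with a minimal sketch. It is not a description of any of the systems cited above: the memory entries, the function name `tm_lookup` and the 0.7 similarity threshold are all assumptions made for illustration, and character-level similarity from Python's standard `difflib` module stands in for whatever matching a real CAT tool would use.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Hypothetical translation memory: (source segment, stored translation).
MEMORY: List[Tuple[str, str]] = [
    ("The meeting is adjourned.", "La séance est levée."),
    ("Please open the door.", "Veuillez ouvrir la porte."),
]


def tm_lookup(
    segment: str,
    memory: List[Tuple[str, str]],
    threshold: float = 0.7,  # assumed cut-off, not taken from the literature
) -> Optional[Tuple[str, str, float]]:
    """Return (stored source, stored translation, score) for the most
    similar stored segment, or None if nothing reaches the threshold."""
    best: Optional[Tuple[str, str]] = None
    best_score = 0.0
    for source, target in memory:
        # ratio() is a similarity in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best, best_score = (source, target), score
    if best is not None and best_score >= threshold:
        return best[0], best[1], best_score
    return None
```

In a TM setting the retrieved translation would be shown to a human translator for editing; an EBMT system would instead pass it on to further automatic processing.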

Conceptually, Nagao’s EBMT attempts to mimic human cognitive behavior in translating as well as language learning:

Man does not translate a simple sentence by doing deep linguistic analysis, rather, man does the translation, first, by properly decomposing an input sentence into certain fragmental phrases … then by translating these phrases into other language phrases, and finally by properly composing these fragmental translations into one long sentence. The translation of each fragmental phrase will be done by the analogy translation principle with proper examples as its reference.

(Nagao 1984: 175)

Nagao (1992) further notes:

Language learners do not learn much about a grammar of a language… They just learn what is given, that is, a lot of example sentences, and use them in their own sentence compositions.

(ibid.: 82)

Accordingly, there are three main components of EBMT: (1) matching source fragments against the examples, (2) identifying the corresponding translation fragments, and then (3) recombining them to give the target output.
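The three components can be made concrete with a deliberately small sketch. Everything in it is an assumption for illustration: the example base is invented, matching is a greedy longest-fragment search, alignment is a plain dictionary lookup because the toy examples are stored as already-aligned fragment pairs, and recombination is simple concatenation. Real EBMT systems use far richer techniques at each stage.

```python
from typing import Dict, List, Tuple

# Hypothetical example base of aligned English-Spanish fragments.
EXAMPLE_BASE: Dict[str, str] = {
    "good morning": "buenos días",
    "my friend": "amigo mío",
    "thank you": "gracias",
}


def match_fragments(sentence: str, examples: Dict[str, str]) -> List[str]:
    """Stage 1 (matching): cover the input, left to right, with the
    longest source fragments found in the example base."""
    tokens = sentence.lower().split()
    fragments: List[str] = []
    i = 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in examples:
                fragments.append(candidate)
                i = j
                break
        else:
            # No example covers this token: keep it as its own fragment.
            fragments.append(tokens[i])
            i += 1
    return fragments


def align_fragments(
    fragments: List[str], examples: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Stage 2 (alignment): pair each source fragment with its stored
    translation; unknown fragments are passed through unchanged."""
    return [(f, examples.get(f, f)) for f in fragments]


def recombine(pairs: List[Tuple[str, str]]) -> str:
    """Stage 3 (recombination): join target fragments in source order."""
    return " ".join(target for _, target in pairs)


def translate(sentence: str, examples: Dict[str, str] = EXAMPLE_BASE) -> str:
    fragments = match_fragments(sentence, examples)
    return recombine(align_fragments(fragments, examples))
```

Calling `translate("Good morning my friend")` decomposes the input into the fragments ‘good morning’ and ‘my friend’ and joins their stored translations. Concatenation in source order is the weakest part of the sketch, since it ignores word-order differences between languages.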

A major advantage of EBMT over RBMT is its ability to handle extra-grammatical sentences, which, though linguistically correct, cannot be accounted for in the grammar of the system. EBMT also avoids the intractable complexity of rule management, which can make it difficult to trace the cause of failure or to predict the domino effect of the addition or deletion of a rule. EBMT addresses such inadequacies by incorporating the ‘learning’ concept for handling the translation of expressions without structural correspondence in another language (Nagao 2007: 153−158), and also by extending the example base simply by adding examples to cover various kinds of language use.

Definition

EBMT offers high flexibility in the use of examples and in the implementation of each of the three components (matching, alignment and recombination), leading to systems with, for instance, rule-based matching or statistical example recombination. The underlying principle for EBMT, according to Kit et al. (2002: 57−78), is to ‘remember everything translated in the past and use everything available to facilitate the translation of the next utterance’ where ‘the knowledge seems to have no overt formal representation or any encoding scheme. Instead … in a way as straightforwardly as text couplings: a piece of text in one language matches a piece of text in another language.’

EBMT implies the application of examples, as the main source of system knowledge, at run-time, as opposed to a pre-trained model where bilingual data are only used for training in advance but not consulted during translation. Examples can be pre-processed and represented in the form of strings (sentences or phrases), templates, tree structures and/or other annotated representations appropriate for the matching and alignment processes.

Examples

Acquisition

As it relates to the source of system knowledge, example acquisition is critical to the success of EBMT. Examples are typically acquired from translation documents, including parallel corpora and multilingual webpages, as well as from TM databases. Multilingual texts from sources such as the European and Hong Kong parliaments constitute high-quality data. The Europarl corpus (Koehn 2005: 79−86), for example, covers twenty language pairings. The BLIS (Bilingual Laws Information System of Hong Kong) corpus (Kit et al. 2003: 286−292, 2004: 29−51, 2005: 71−78) provides comprehensive documentation of the laws of Hong Kong in Chinese−English bilingual versions aligned at the clause level, with 10 million English words and 18 million Chinese characters. Legal texts of this kind are known to be more precise and less ambiguous than most other types of text. In the past decade the growing number of web-based documents has become another major source of parallel texts (Resnik 1998: 72−82; Ma and Liberman 1999: 13−17; Kit and Ng 2007: 526−529).

Possible sources of examples are not limited to such highly parallel bitexts, which, though increasingly available, still remain limited in volume, language and register coverage, especially for certain language pairs. Efforts have also been made to collect comparable non-parallel texts such as multilingual news feeds from news agencies. They are not exactly parallel but convey overlapping information in different languages; hence some sentences, paragraphs or texts can be regarded as meaning-equivalent. Shimohata et al. (2003: 73−80) describe a meaning-equivalent sentence as ‘shar[ing] the main meaning with the input sentence despite lacking some unimportant information. It does not contain information additional to that in the input sentence.’ In order to facilitate the development of an ‘Example-based Rough Translation’ system, they propose a method to retrieve such meaning-equivalent sentences from non-parallel corpora using lexical and grammatical features such as content words, modality and tense. Munteanu and Marcu (2005: 477−504), on the other hand, propose to accomplish the same purpose by means of a machine learning strategy.
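The notion of a meaning-equivalent sentence quoted above can be approximated in a few lines, under strong simplifying assumptions. The sketch below uses content words only and ignores modality and tense; the stopword list, the function names and the 0.6 coverage threshold are invented for illustration and are not taken from Shimohata et al. A candidate qualifies if it adds no content words beyond those of the input (‘no additional information’) and covers a sufficient share of them (‘shares the main meaning’).

```python
import re
from typing import Iterable, Optional, Set

# Hypothetical stopword list; a real system would use a fuller resource
# or part-of-speech tagging to identify content words.
STOPWORDS = frozenset({
    "a", "an", "the", "is", "are", "to", "of", "in", "on", "at",
    "please", "could", "you", "i", "me", "it",
})


def content_words(sentence: str) -> Set[str]:
    """Lower-case the sentence and keep only non-stopword tokens."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}


def find_meaning_equivalent(
    query: str,
    corpus: Iterable[str],
    min_coverage: float = 0.6,  # assumed threshold
) -> Optional[str]:
    """Return the candidate whose content words are a subset of the
    query's and cover the largest share of them, or None."""
    query_words = content_words(query)
    best: Optional[str] = None
    best_coverage = 0.0
    for candidate in corpus:
        cand_words = content_words(candidate)
        # Skip empty candidates and those adding information to the query.
        if not cand_words or not cand_words <= query_words:
            continue
        coverage = len(cand_words) / len(query_words)
        if coverage > best_coverage:
            best, best_coverage = candidate, coverage
    return best if best_coverage >= min_coverage else None
```

For the input ‘Could you please send the invoice to the Hong Kong office today’, the candidate ‘Send the invoice to the Hong Kong office’ is accepted because it only drops ‘today’, whereas ‘Send the parcel to the office today’ is rejected because ‘parcel’ is additional information.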

Apart from gathering available resources, new bitexts can also be ‘created’ by using MT systems to translate monolingual texts into target languages. Gough et al. (2002: 74−83) report on experiments in which they first decomposed sentences into phrases and then translated them with MT systems. The resulting parallel phrases could then be used as examples for an EBMT system. The output quality proved better than that obtained by translating the whole input sentence via online MT systems.
