
A number of representation schemas have been proposed for storing examples. The simplest representation takes the form of text string pairs aligned at various levels of granularity, without additional information. Giza++ (Och and Ney 2003: 19−51) is the most popular choice for implementing word-level alignment. Way and Gough (2003: 421–457) and Gough and Way (2004b: 95−104), on the other hand, discuss an approach based on bilingual phrasal pairs, i.e. a ‘marker lexicon’. Their approach follows the Marker Hypothesis (Green 1979: 481−496), which assumes that every natural language has its own closed set of lexemes and morphemes for marking the boundaries of syntactic structures. Alternatively, the lexical-based clause alignment approach of Kit et al. (2003: 286−292, 2004: 29−51) achieves high alignment accuracy by relying on basic lexical resources.

Examples may also be annotated with various kinds of information. Similar to conventional RBMT systems, early attempts at EBMT stored examples as syntactic tree structures following constituency grammar. This offers the advantage of clear boundary definition, ensuring that example fragments are well-formed constituents. Later works such as Al-Adhaileh and Kong (1999: 244−249) and Aramaki et al. (2001: 27−32, 2005: 219−226) employed dependency structures linking lexical heads and their dependents in a linguistic expression. Planas and Furuse (1999: 331−339) presents a multi-level lattice representation combining typographic, orthographic, lexical, syntactic and other information. Forcada (2002) represents sub-sentential bitexts as a finite-state transducer. In their Data-Oriented Translation model, Way (2001: 66−80, 2003: 443−472) and Hearne and Way (2003: 165−172) use linked phrase-structure trees augmented with semantic information. In Microsoft’s MT system reported in Richardson et al. (2001: 293−298) and Brockett et al. (2002: 1−7), a graph structure ‘Logical Form’ is used for describing labeled dependencies among content words, with information about word order and local morphosyntactic variation neutralized. Liu et al.’s (2005: 25−32) ‘Tree String Correspondence’ structure has only a parse tree in the source language, together with the target string and the correspondences between the leaf nodes of the source tree and the target substrings.

A unique approach to EBMT which does without a parallel corpus is reported in Markantonatou et al. (2005: 91−98) and Vandeghinste et al. (2005: 135−142). Their example base consists only of a bilingual dictionary and monolingual corpora in the target language. In the translation process, a source text is first translated word-for-word into the target language using the dictionary. The monolingual corpora are then used to help determine a suitable translation in case of multiple possibilities, and to guide a correctly ordered recombination of target words. This approach is claimed to be suitable for language pairs without a sufficiently large parallel corpus available.
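The dictionary-plus-monolingual-corpus idea can be sketched in a few lines. The code below is only an illustration under strong assumptions, not the method of the cited systems: the dictionary and corpus are invented toy data, target-language evidence is reduced to raw bigram counts, and every word choice and word order is enumerated exhaustively, which is feasible only for very short inputs.

```python
from collections import Counter
from itertools import permutations, product


def bigram_counts(corpus):
    """Count adjacent word pairs in a target-language corpus."""
    counts = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def score(tokens, counts):
    """Sum of corpus bigram counts over a candidate word sequence."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return sum(counts[pair] for pair in zip(padded, padded[1:]))


def translate(source, dictionary, corpus):
    """Word-for-word lookup, then pick the choice and order best attested
    in the monolingual corpus (exhaustive search, toy scale only)."""
    counts = bigram_counts(corpus)
    options = [dictionary[word] for word in source.split()]
    candidates = (
        order
        for choice in product(*options)      # one translation per source word
        for order in permutations(choice)    # every ordering of those words
    )
    return " ".join(max(candidates, key=lambda cand: score(cand, counts)))


# Invented toy resources (assumptions for illustration).
dictionary = {"casa": ["house", "home"], "blanca": ["white", "blank"]}
corpus = ["the white house is big", "a white house", "white house"]
print(translate("casa blanca", dictionary, corpus))
```

With these toy resources the sketch selects "white house": the corpus resolves both the lexical ambiguity and the word order, even though the word-for-word gloss would be "house white".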

Webster et al. (2002: 79–91) links EBMT with Semantic Web technology, and demonstrates how a flat example base can be developed into a machine-understandable knowledge base. Examples of statutory laws of Hong Kong in Chinese−English parallel version are enriched with metadata describing their hierarchical structures and inter-relationships in Resource Description Framework (RDF) format, thus significantly improving example management and sub-sentential alignment.

In some systems, similar examples are combined and generalized as templates in order to reduce the size of the example base and improve example retrieval performance. Members of equivalence classes such as ‘person’s name’, ‘date’ and ‘city’s name’, as well as linguistic information like gender and number, that appear in examples with the same structure are replaced with variables. For example, the expression ‘John Miller flew to Frankfurt on December 3rd’ can be represented as ‘<PERSON-M> flew to <CITY> on <DATE>’, which can easily be matched with another sentence, ‘Dr Howard Johnson flew to Ithaca on 7 April 1997’ (Somers 2003: 3–57). To a certain extent such example templates can be viewed as ‘a special case of translation rules’ (Maruyama and Watanabe 1992: 173−184) in RBMT. In general this approach improves the recall of example retrieval, though possibly at the cost of precision. Studies of example templates include Malavazos et al. (2000), Brown (2000: 125−131) and McTait (2001: 22−34).
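A minimal sketch of template matching, using the sentence pair quoted above. Everything here is an assumption made for illustration: the regular expressions standing in for the equivalence classes are toy patterns (a real system would use lexicons or named-entity recognition), and the variable is written `<PERSON>` rather than `<PERSON-M>` because gender marking is not modelled.

```python
import re

MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")

# Toy patterns for each equivalence class (hypothetical, for illustration).
CLASS_PATTERNS = {
    "PERSON": r"(?:Dr\s+)?[A-Z][a-z]+\s+[A-Z][a-z]+",
    "CITY": r"[A-Z][a-z]+",
    "DATE": rf"(?:{MONTH}\s+\d{{1,2}}(?:st|nd|rd|th)?"
            rf"|\d{{1,2}}\s+{MONTH}(?:\s+\d{{4}})?)",
}


def compile_template(template):
    """Turn '<PERSON> flew to <CITY> on <DATE>' into a compiled regex."""
    # Splitting on a capturing group alternates literal text and class names.
    parts = re.split(r"<([A-Z]+)>", template)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))                         # literal text
        else:
            pieces.append(f"(?P<{part}>{CLASS_PATTERNS[part]})")   # variable slot
    return re.compile("".join(pieces) + r"$")


def match_template(template, sentence):
    """Return the variable bindings if the sentence fits, else None."""
    m = compile_template(template).match(sentence)
    return m.groupdict() if m else None


template = "<PERSON> flew to <CITY> on <DATE>"
print(match_template(template, "John Miller flew to Frankfurt on December 3rd"))
print(match_template(template, "Dr Howard Johnson flew to Ithaca on 7 April 1997"))
```

Both sentences bind to the same template, which is what lets one stored example serve many inputs. The loose `CITY` pattern also shows where the precision loss comes from: any capitalized word would be accepted in that slot.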

Examples need to be pre-processed before use and properly managed. For instance, Zhang et al. (2001: 247−252) discuss the pre-processing of English−Chinese bilingual corpora for EBMT, including Chinese word segmentation, English phrase bracketing, and term tokenization. They show that a pre-processed corpus improves the quality of language resources acquired from the corpus: the average length of Chinese and English terms increased by around 60 percent and 10 percent respectively, and the coverage of the bilingual dictionary by 30 percent.

When the example base is scaled up, the issue of example redundancy arises. As explained in Somers (2003: 3−57), overlapping examples (source side) may mutually reinforce each other or be in conflict, depending on the consistency of their translations (target side). Whether such redundancy needs to be constrained depends on how the examples are used: it is a prerequisite for systems that rely on frequency for tasks such as similarity measurement in example matching, and a problem to be solved where this is not the case.

Stages

Matching

The first task of EBMT is to retrieve examples which closely match the source sentence. This process relies on a measure of text similarity, and is one of the most studied areas in EBMT. Text similarity measurement is a task common in various applications of natural language processing with many measures available. It is also closely related to how examples are represented and stored, and accordingly can be performed on string pairs or annotated structures. In order to better utilize available syntactic and semantic information, it may be further facilitated by language resources like thesauri and a part-of-speech tagger.

When examples are stored as string pairs at the sentence level, they may first need to be decomposed into fragments to improve example retrieval. In Gough et al. (2002: 74−83) and Gough and Way (2004b: 95−104), example sentences are split into phrasal lexicons with the aid of a closed set of specific words and morphemes to ‘mark’ the boundary of phrases. Kit et al. (2002: 57−78) uses a multi-gram model to select the best sentence decomposition with the highest occurring frequencies in an example base. Roh et al. (2003: 323−329) discusses two types of segmentation for sentences: ‘chunks’ that include proper nouns, time adverbs and lexically fixed expressions, and ‘partitions’ that are selected by syntactic clues such as punctuation, conjunctions, relatives and main verbs.
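Marker-based segmentation can be illustrated as follows. This is a simplified sketch rather than the procedure of the cited papers: the marker set is a small invented sample of English closed-class words, and the rule that a chunk must contain at least one non-marker word before it can be closed is adopted here as a working assumption.

```python
# Toy closed-class marker set (an assumption for illustration); a real
# marker lexicon would cover determiners, prepositions, conjunctions, etc.
MARKERS = {"the", "a", "an", "in", "on", "to", "of", "with", "and", "but"}


def marker_chunks(sentence, markers=MARKERS):
    """Split a sentence into chunks that each begin at a marker word.

    A new chunk starts at a marker only when the current chunk already
    holds a non-marker (content) word, so runs of adjacent markers such
    as 'to the' stay together at the head of one chunk.
    """
    chunks, current, has_content = [], [], False
    for word in sentence.split():
        is_marker = word.lower() in markers
        if is_marker and has_content:
            chunks.append(" ".join(current))
            current, has_content = [], False
        current.append(word)
        if not is_marker:
            has_content = True
    if current:
        chunks.append(" ".join(current))
    return chunks


print(marker_chunks("the man walked to the shop on the corner"))
# ['the man walked', 'to the shop', 'on the corner']
```

Applying the same segmentation to both sides of an aligned sentence pair is one way to obtain candidate phrasal pairs for a sub-sentential lexicon.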

The similarity measure for example matching can be as simple as a character-based one. Two string segments are compared by counting the character modifications, whether additions, deletions or substitutions, needed to make them identical. This is known as edit-distance, which has been widely applied in other applications such as spell-checking, translation memory and speech processing. It offers the advantages of simplicity and language independence, and avoids the need to pre-process the input sentence and examples. Nirenburg et al. (1993: 47−57) extends the basic character-based edit-distance measure to account for the keystrokes needed in editing operations (e.g. deletion = 3 strokes, substitution = 3 strokes). Somers (2003: 3−57) notes that in languages like Japanese certain characters are more discriminatory than others, so the matching process may focus only on these key characters.
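The edit-distance computation can be sketched with the standard dynamic-programming recurrence. The per-operation costs are parameters so that a keystroke-style weighting can be tried; the default insertion cost of 1 is an assumption, since the text above gives figures only for deletion and substitution.

```python
def edit_distance(a, b, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum total cost of turning sequence a into sequence b.

    Works on strings (character-based) or on lists of tokens (word-based).
    """
    # prev[j] = cost of turning the first i-1 items of a into the first j of b.
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, item_a in enumerate(a, start=1):
        curr = [i * del_cost]  # turning i items of a into an empty sequence
        for j, item_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                                  # delete item_a
                curr[j - 1] + ins_cost,                              # insert item_b
                prev[j - 1] + (0 if item_a == item_b else sub_cost), # keep or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))                           # 3
print(edit_distance("the cat sat".split(), "the dog sat".split()))  # 1
print(edit_distance("abc", "abd", del_cost=3, sub_cost=3))          # 3
```

Because the function only compares items for equality, the same code serves character-level and word-level matching; retrieval then amounts to returning the stored examples with the smallest distance to the input.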

Nagao (1984: 173−180) employs word-based matching as the similarity measure. A thesaurus is used for identifying word similarity on the basis of meaning or usage. Matches are then permitted for synonyms and near-synonyms in the example sentences. An early method of this kind was reported in Sumita and Iida (1991: 185−192), where similarity between two words is measured by their distance in a hierarchically structured thesaurus. In Doi et al. (2005: 51−58) this method is integrated with an edit-distance measure. Highlighting an efficiency problem in example retrieval, they note that real-time processing for translation is hard to achieve, especially if an input sentence has to be matched against all examples individually using a large example base. Accordingly they propose the adoption of multiple strategies including search space division, word graphs and the A* search algorithm (Nilsson 1971) to improve retrieval efficiency. In Aramaki et al. (2003: 57−64), example similarity is measured based on different weights assigned to content and function words in an input string that are matched with an example, together with their shared meaning as defined in a dictionary.
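A thesaurus-based word distance can be sketched as below. This is not the formula of any cited paper: the thesaurus entries are invented, and the distance is simply assumed to shrink as two words share a longer path from the root of the hierarchy.

```python
# Toy thesaurus: each word maps to its path of category labels from the
# root. All entries are invented for illustration.
THESAURUS = {
    "car":    ("entity", "artifact", "vehicle"),
    "bus":    ("entity", "artifact", "vehicle"),
    "hammer": ("entity", "artifact", "tool"),
    "dog":    ("entity", "animal", "mammal"),
}


def thesaurus_distance(w1, w2, thesaurus=THESAURUS):
    """Distance in [0, 1]: 0 for identical words or words in the same
    leaf class, 1 for unknown words or words sharing no category."""
    if w1 == w2:
        return 0.0
    p1, p2 = thesaurus.get(w1), thesaurus.get(w2)
    if p1 is None or p2 is None:
        return 1.0
    shared = 0
    for x, y in zip(p1, p2):
        if x != y:
            break
        shared += 1  # one more level of common abstraction
    return 1.0 - shared / max(len(p1), len(p2))


print(thesaurus_distance("car", "bus"))      # same class: 0.0
print(thesaurus_distance("car", "hammer") < thesaurus_distance("car", "dog"))  # True
```

A distance of this kind could replace the 0-or-1 substitution cost in a word-level edit distance, so that swapping a word for a near-synonym is penalized less than swapping it for an unrelated word.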

The availability of annotated examples with linguistic information allows the implementation of similarity measures with multiple features. In the multi-engine Pangloss system (Nirenburg et al. 1994: 78−87), the matching process combines several variously weighted requirements including exact matches, number of word insertions or deletions, word-order differences, morphological variants and parts-of-speech. Chatterjee (2001) discusses the evaluation of sentence similarity at various linguistic levels, i.e. syntactic, semantic and pragmatic, all of which need to be considered in the case of dissimilar language pairs where source and target sentences with the same meaning may vary in their surface structures. A linear similarity evaluation model is then proposed which supports a combination of multiple individually weighted linguistic features.
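A linear combination of weighted features can be written down in a few lines. The feature names, scores and weights below are hypothetical and chosen only to show the arithmetic; how each feature score is computed, and how the weights are set, is left open.

```python
def linear_similarity(features, weights):
    """Weighted average of per-feature similarity scores in [0, 1].

    Features missing from `features` are treated as scoring 0.
    """
    total = sum(weights.values())
    return sum(w * features.get(name, 0.0) for name, w in weights.items()) / total


# Hypothetical scores for one (input, example) pair and hypothetical weights.
features = {"lexical": 1.0, "syntactic": 0.5}
weights = {"lexical": 2.0, "syntactic": 1.0, "semantic": 1.0}
print(linear_similarity(features, weights))  # (2.0 + 0.5 + 0.0) / 4.0 = 0.625
```

Normalizing by the sum of the weights keeps the result in the same range as the individual features, so scores remain comparable when weights are retuned.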

For certain languages the word-based matching process requires pre-processing of both the input sentences and examples in advance. This may include tokenization and word segmentation for languages without clear word boundaries like Chinese and Japanese, and lemmatization for morphologically rich languages such as Arabic.

When examples are stored as structured objects, the process of example retrieval entails more complex tree-matching. Typically it may involve parsing an input sentence into the same representation schema as examples, searching the annotated example base for best matched examples, and measuring similarity of structured representations. Liu et al. (2005: 25−32) presents a measure of syntactic tree similarity accounting for all the nodes and meaning of headwords in the trees. Aramaki et al. (2005: 219−226) proposes a tree matching model, whose parameters include the size of tree fragments, their translation probability, and context similarity of examples, which is defined as the similarity of the surrounding phrases of a translation example and an input phrase.

Recombination

After a set of translation examples has been matched against an input sentence, the most difficult step in the EBMT process is to retrieve their counterpart fragments from the example base and then combine them into a proper target sentence. The problem is twofold, as described by Somers (2003: 3−57): (1) identifying which portion of an associated translation example corresponds to which portion of the source text, and (2) recombining these portions in an appropriate manner. The first is partially solved when the retrieved examples are already decomposed from sentences into finer fragments, either at the beginning when they are stored or at the matching stage. However, when more than one example is retrieved, or multiple translations are available for a source fragment, the question arises of how to decide which alternative is better.

Furthermore, the recombination of translation fragments is not an independent process, but closely related to the representation of examples. How examples are stored determines what information will be available for performing recombination. In addition, as the final stage of EBMT, the performance of recombination is to a large extent affected by the output quality from the previous stages. Errors occurring at the matching stage or earlier are a kind of noise which interferes with recombination. McTait (2001: 22−34) shows how tagging errors resulting from applying part-of-speech analysis to the matching of examples unexpectedly lower both the recall of example retrieval and accuracy of translation output. Further complications occur when examples retrieved do not fully cover the input sentence in question.

The most critical point in recombination is to adjust the fragment order to form a readable, at best grammatical, sentence in the target language. Since each language has its own syntax to govern how sentential structures are formed, it will not work if the translation fragments are simply sequenced in the same order as in the source sentence. However, this is the approach of some EBMT systems such as that reported in Way and Gough (2003: 421−457). In Doi and Sumita (2003: 104−110) it is claimed that such a simple approach is suitable for speech translation, since sentences in a dialog usually do not have complicated structures, and many long sentences can be split into mutually independent portions.

With reference to a text-structured example base, Kit et al. (2002: 57−78) suggests that a probabilistic approach is preferable for recombination. Taking an empirical case-based knowledge engineering approach to MT, they give an example of a tri-gram language model, and point out other considerations such as the insertion of function words for better readability. Techniques from SMT have also been used in the hybrid EBMT-SMT models of Groves and Way (2005a: 301−323; 2005b: 183–190), which use Pharaoh (Koehn 2004: 115−124), a decoder that selects the translation fragment order with the highest probability; and in the MaTrEx system (Du et al. 2009: 95−99), which uses another decoder called Moses (Koehn et al. 2007: 177−180).

For EBMT systems using examples in syntactic tree structures, where the correspondence between source and target fragments is labeled explicitly, recombination is then a task of tree unification. For instance, in Sato (1995: 31−49), possible word-dependency structures of translation are first generated based on the retrieved examples; the best translation candidate is then selected by a number of criteria such as the size and source-target context of examples. In Watanabe (1995: 269−291) where examples are represented as graphs, the recombination involves a kind of graph unification, which they refer to as a ‘gluing process’. In Aramaki et al. (2005: 219−226), the translation examples stored in a dependency structure are first combined, with the source dependency relation preserved in the target structure, and then output with the aid of a probabilistic language model to determine the word-order.

Other systems without annotated example bases may be equipped with information about probable word alignment from dictionaries or other resources that can facilitate the recombination process (e.g. Kaji et al. 1992: 672−678; Matsumoto et al. 1993: 23−30). Some systems like Franz et al. (2000: 1031−1035), Richardson et al. (2001: 293−298), and Brockett et al. (2002: 1−7) rely on rule-based generation engines supplied with linguistic knowledge of target languages. Alternatively, in Nederhof (2001: 25−32) and Forcada (2002), recombination is carried out via a finite state transition network (FSTN), according to which translation generation becomes akin to giving a ‘guided tour’ from the source node to the target node in the FSTN.

A well-known problem in recombination, namely ‘boundary friction’ (Nirenburg et al. 1993: 47−57; Collins 1998), occurs when translation fragments from various examples need to be combined into a target sentence. Grammatical problems often occur because words with different syntactic functions cannot appear next to each other. This is especially true for certain highly inflected languages like German. One solution is to smooth the recombined translation, by adjusting the morphological features of certain words in the translation, or inserting some additional function words, based on a grammar or probabilistic model of the target language. Another proposal from Somers et al. (1994) is to attach each fragment with ‘hooks’ indicating the possible contexts of the fragment in a corpus, i.e. the words and parts-of-speech which can occur before and after. Fragments which can be connected together are shown in this way. Brown et al. (2003: 24−31) puts forth the idea of translation-fragment overlap. They find that examples with overlapping fragments are more likely to be combined into valid translations if there are sentences in an example base that also share these overlapping fragments. Based on their study of the occurrence frequencies of combined fragments from the Internet, Gough et al. (2002: 74−83) finds that valid word combinations usually have much higher occurrence frequencies than invalid ones.
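The ‘hooks’ idea can be illustrated with a small sketch. It is a loose reading of the description above, not the cited implementation: hooks are limited to the single word observed immediately before and after each fragment (parts-of-speech are omitted), and the German corpus and fragments are invented toy data.

```python
def build_hooks(corpus, fragments):
    """Record, for each fragment, the words seen directly before and after it."""
    hooks = {frag: {"before": set(), "after": set()} for frag in fragments}
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for frag in fragments:
            frag_tokens = frag.split()
            n = len(frag_tokens)
            # i stops early enough that tokens[i + n] always exists.
            for i in range(1, len(tokens) - n):
                if tokens[i:i + n] == frag_tokens:
                    hooks[frag]["before"].add(tokens[i - 1])
                    hooks[frag]["after"].add(tokens[i + n])
    return hooks


def can_join(left, right, hooks):
    """Two fragments fit if each has been seen next to the other's edge word."""
    return (right.split()[0] in hooks[left]["after"]
            and left.split()[-1] in hooks[right]["before"])


# Invented toy data: the article must agree in gender with the noun.
corpus = ["der alte Mann kommt", "die alte Frau kommt", "der Mann liest"]
fragments = ["der", "die", "alte Mann", "alte Frau"]
hooks = build_hooks(corpus, fragments)
print(can_join("der", "alte Mann", hooks))  # True
print(can_join("der", "alte Frau", hooks))  # False: gender clash at the boundary
```

Checking only the adjacent word would accept "der alte Frau", since "der alte" is attested; requiring agreement from both fragments' hooks is what rejects the ill-formed combination here.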

Suitability
