
SMT formalizes the idea of producing a translation that is both faithful to the original source text and fluent in the target language. This goal is achieved in SMT by combining probabilistic models that maximize faithfulness (or accuracy) and fluency to select the most probable translation candidate, as in Equation (2.1):

\[
\text{best-translation } \hat{E} = \arg\max_{E} \; \text{faithfulness}(E, F) \cdot \text{fluency}(E) \tag{2.1}
\]

To achieve this, SMT uses the noisy channel model. The intuition behind a noisy channel model is that the original source sentence (F) is a distortion of the target sentence (E) that has been passed through a noisy communication channel. The goal is to model this ‘noise’ in such a way that we can pass the observed ‘distorted’ source sentence through our model and discover the hidden target language sentence (Ê).

More concretely, say we have a French source sentence for which we want to produce an English translation. The noisy channel model assumes that the French sentence is simply a distorted version of an English one. The task is to build a model that generates the French ‘target’ sentence from an English ‘source’ sentence, that is, to discover the noisy channel that distorted the ‘original’ English sentence. Once this channel has been modeled, we take the French sentence, pretend it is the output of an English sentence that has been passed through the channel, and recover the most likely English sentence (Jurafsky and Martin, 2014). An illustration of the noisy channel model can be found in Figure 2.1.

Figure 2.1: The noisy channel model of SMT (Jurafsky and Martin, 2014).

More formally, we want to translate a French sentence F into an English sentence E. To do so, we traverse the search space and find the English sentence Ê that maximizes the probability P(E | F), as in Equation (2.2):

\[
\hat{E} = \arg\max_{E \in \text{English}} P(E \mid F) \tag{2.2}
\]

Rewriting Equation (2.2) with Bayes’ rule results in Equation (2.3). Because P(F) is the same for every candidate E, the denominator introduced by Bayes’ rule can be dropped from the maximization. The resulting noisy channel equation consists of two components: a translation model P(F | E) and a language model P(E) (Brown et al., 1990).

\[
\hat{E} = \arg\max_{E \in \text{English}} P(F \mid E)\, P(E) \tag{2.3}
\]

The language model takes care of the fluency of the output, while the translation model makes sure the translation is adequate with respect to the source. A decoder is needed in order to compute the most likely English sentence Ê given the French sentence F.
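As a rough illustration of Equation (2.3), the following sketch scores a handful of candidate translations with a toy translation model and a toy language model and returns the highest-scoring one. All probabilities, the example sentences, and the names `translation_model`, `language_model`, `noisy_channel_score` and `decode` are invented for this example; a real decoder searches a far larger space and estimates both models from data.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
# translation_model[(f, e)] stands in for P(F | E),
# language_model[e] stands in for P(E).
translation_model = {
    ("la maison bleue", "the blue house"): 0.40,
    ("la maison bleue", "the house blue"): 0.45,
    ("la maison bleue", "blue the house"): 0.10,
}
language_model = {
    "the blue house": 0.020,
    "the house blue": 0.001,
    "blue the house": 0.0005,
}


def noisy_channel_score(f, e):
    """Return log P(F | E) + log P(E), the log of the product in Equation (2.3)."""
    p_f_given_e = translation_model.get((f, e), 0.0)
    p_e = language_model.get(e, 0.0)
    if p_f_given_e == 0.0 or p_e == 0.0:
        # Unseen pairs or sentences get the lowest possible score.
        return float("-inf")
    return math.log(p_f_given_e) + math.log(p_e)


def decode(f, candidates):
    """Pick the candidate E that maximizes P(F | E) * P(E)."""
    return max(candidates, key=lambda e: noisy_channel_score(f, e))


candidates = list(language_model)
best = decode("la maison bleue", candidates)
print(best)
```

In this toy setting the translation model alone slightly prefers the word-for-word rendering, and it is the language model that tips the decision toward the fluent word order, which mirrors the division of labour between the two components.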

Initially, WB-SMT used words (Brown et al., 1990) as fundamental units in order to compute the equations described, but it soon became clear that working with phrases (Zens et al., 2002; Koehn et al., 2003) as well as single words could lead to considerably better translations. One of the major issues with WB-SMT models is the fact that such models do not allow multiple words to be mapped or moved as one unit. In reality, we know that so-called one-to-many and many-to-one mappings are in no way exceptional when dealing with translations (see Figure 2.2). Note that, in PB-SMT, the term phrases is not to be confused with what is called a phrase in linguistics. A phrase in linguistics refers to a group of words that form a unit within the grammatical hierarchy, while the term phrase in PB-SMT refers to consecutive words in a sentence (commonly referred to as n-grams).

[Figure: ‘the day before yesterday’ aligned with ‘avant-hier’]

Figure 2.2: One-to-many relation between the French word ‘avant-hier’ and its English translation that consists of multiple words ‘the day before yesterday’.

Using phrases instead of words did not change the fundamental components of the SMT pipeline (language model, translation model and decoder). However, the decoding process became more complex, as it now had to handle not only single words (unigrams) but also bigrams, trigrams, etc.
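The PB-SMT notion of a phrase as any run of consecutive words can be made concrete with a short sketch. The function name `contiguous_phrases` and the `max_len` parameter are assumptions made for this example and do not correspond to any particular toolkit.

```python
def contiguous_phrases(sentence, max_len=3):
    """List every run of 1..max_len consecutive tokens (n-grams) in a sentence."""
    tokens = sentence.split()
    phrases = []
    for n in range(1, max_len + 1):
        # Slide a window of n tokens over the sentence.
        for start in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[start:start + n]))
    return phrases


phrases = contiguous_phrases("the day before yesterday", max_len=4)
for phrase in phrases:
    print(phrase)
```

Note that a non-contiguous combination such as ‘the before’ is never produced, and that most of the resulting n-grams are not phrases in the linguistic sense.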

As such, Och et al. (2001) propose a more general framework, the log-linear model, which allows for the integration of an arbitrary number of features, to replace the noisy-channel model (described in Equation (2.2)). The most likely translation can now be found by computing Equation (2.4). As in the previous equations, F represents the French source sentence, E the English target sentence and Ê the most likely English translation. Additionally, h_i(F, E) defines the feature functions, M the number of feature functions and λ_i their weights.

\[
\hat{E} = \arg\max_{E} \sum_{i=1}^{M} \lambda_i \, h_i(F, E) \tag{2.4}
\]
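A minimal sketch of Equation (2.4) is given below, assuming three hand-written feature functions over toy probability tables. The tables `TM` and `LM`, the feature functions `h_translation`, `h_language` and `h_length`, and the weight values are all hypothetical; in practice the weights are tuned on held-out data and the candidate set is produced by the decoder's search.

```python
import math

# Hypothetical toy tables, invented for illustration only.
TM = {  # stands in for P(F | E)
    ("la maison bleue", "the blue house"): 0.40,
    ("la maison bleue", "the house blue"): 0.45,
    ("la maison bleue", "blue the house"): 0.10,
}
LM = {  # stands in for P(E)
    "the blue house": 0.020,
    "the house blue": 0.001,
    "blue the house": 0.0005,
}


def h_translation(f, e):
    """Feature: log translation-model probability."""
    return math.log(TM[(f, e)])


def h_language(f, e):
    """Feature: log language-model probability."""
    return math.log(LM[e])


def h_length(f, e):
    """Feature (hypothetical): penalty for a length mismatch in tokens."""
    return -abs(len(f.split()) - len(e.split()))


def log_linear_score(f, e, features, weights):
    """Weighted sum of feature values: sum_i lambda_i * h_i(F, E)."""
    return sum(w * h(f, e) for h, w in zip(features, weights))


def log_linear_decode(f, candidates, features, weights):
    """Return the candidate E with the highest log-linear score."""
    return max(candidates, key=lambda e: log_linear_score(f, e, features, weights))


features = [h_translation, h_language, h_length]
source = "la maison bleue"
candidates = list(LM)

# Weights (1, 1, 0) reduce the model to the noisy channel objective in log space.
best_noisy_channel = log_linear_decode(source, candidates, features, [1.0, 1.0, 0.0])
# Switching off the language model changes which candidate wins.
best_tm_only = log_linear_decode(source, candidates, features, [1.0, 0.0, 0.0])
print(best_noisy_channel)
print(best_tm_only)
```

With the first two features set to log P(F | E) and log P(E) and both weights equal to 1, the score is the logarithm of the noisy channel product, so the noisy channel model can be read as a special case of the log-linear one.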

As our work did not involve changing any of the underlying components of SMT systems, we have only touched upon the technicalities and computations involved in SMT. For a more complete and technical overview of the components involved in language modeling, translation modeling and decoding, we refer the reader to “Statistical Machine Translation” by Koehn (2010).

By 2000, PB-SMT had become the state of the art in MT (Zens et al., 2002; Koehn et al., 2003). Although the PB-SMT approach provides a better way of dealing with the many-to-one and one-to-many mappings that occur in translations, it still has multiple drawbacks. Difficulties with reordering within phrases and with discontinuous phrases, and the inability to learn across phrases (i.e. long-distance dependencies) or across sentences, are just a few of them. Over the years, researchers worked on integrating additional knowledge and features into the existing framework. The integration of specific linguistic information in SMT will be further discussed in Section 2.2.
