6. EL ESTADO Y EL ACCESO AL AGUA POTABLE
6.3 DEFINICION Y PRINCIPIOS FUNDAMENTALES
Each MT approach discussed above has its own advantages and disadvantages. To avoid the limitations and to muster the strengths of all the aforementioned methods, hybrid approaches were proposed to combine the best features of all or a selection of methods.
Much research in MT involves some degree of hybridization, for example by incorporating linguistic knowledge during preprocessing or by purely statistical model combination.
Data preprocessing plays a crucial role in NLP, especially with regard to parsing and MT. NEs and MWEs in particular pose difficulties in terms of identification and translation. In parsing, NE and MWE identification is a chicken-and-egg problem: it is unclear where in the pipeline it fits best, before or after parsing, that is, whether MWE identification is a tokenization problem or a parsing problem. Handling MWEs in SMT involves two challenging tasks: identifying MWEs and incorporating them into state-of-the-art SMT. Much research has been carried out on both MWE extraction and MWE incorporation within SMT, as described below.
Ren et al. (2009) reported a hierarchical reduction algorithm based on the log-likelihood ratio for automatically extracting bilingual MWEs. Venkatapathy and Joshi (2006) reported a discriminative approach that uses the compositionality information of verb-based MWEs to improve word alignment quality. Carpuat and Diab (2010) replaced the binary feature with a count feature representing the number of MWEs in the source-language phrase in SMT. Pal et al. (2013b) and Tan and Pal (2014) used various statistical techniques to extract MWEs from bilingual data and added these bilingual MWEs as additional training material to examine their usefulness in SMT. Pal et al. (2013b) observed the highest improvement with an additional feature that identifies whether or not a bilingual phrase contains bilingual MWE(s). A hybrid approach to identifying MWEs from English–French parallel data was proposed by Bouamor et al. (2012a), who aligned only many-to-many correspondences and dealt with highly correlated MWEs. These MWEs were then integrated into the MOSES SMT system (Bouamor et al., 2012b) in three ways: (a) adding the extracted bilingual MWEs as additional parallel training material, (b) integrating bilingual MWE candidates into the phrase table³, and (c) adding a new feature indicating whether a phrase in the phrase table is an MWE or not. One key difference between Bouamor et al. (2012b) and Pal et al. (2013b) is that Pal et al. (2013b) treated MWEs as single tokens, which ensures that the phrase extraction module never gets a chance to mark a phrase boundary inside an MWE and that MWEs are always treated as a whole. MWEs in SMT were also investigated by Lambert and Banchs (2005) for the Verbmobil corpus. The work on MWE handling in SMT presented in Chapter 4 of this thesis applies multiword NE and MWE knowledge directly to the SMT word alignment and phrase extraction steps. Additionally, and orthogonally, we also investigate how EBMT phrases can provide further improvement in SMT.
³ Bouamor et al. (2012b) use the Jaccard Index to define the translation probabilities in both directions and set the lexical probabilities to 1.
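The single-token treatment of MWEs mentioned above can be illustrated with a minimal preprocessing sketch. It assumes that a list of MWEs has already been extracted by some other means and that joining the component words with an underscore is an acceptable convention; the function name `merge_mwes`, the `joiner` parameter, and the greedy longest-match strategy are illustrative assumptions, not the procedure of Pal et al. (2013b).

```python
from typing import Iterable, List, Sequence, Tuple


def merge_mwes(tokens: Sequence[str],
               mwes: Iterable[Tuple[str, ...]],
               joiner: str = "_") -> List[str]:
    """Replace every known MWE in `tokens` by one joined token.

    Greedy, left-to-right, longest match first (an assumption of this
    sketch). Once merged, a phrase extractor cannot place a phrase
    boundary inside the MWE, because it only sees a single token.
    """
    mwe_set = {tuple(m) for m in mwes}
    max_len = max((len(m) for m in mwe_set), default=0)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, down to length 2.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwe_set:
                merged.append(joiner.join(tokens[i:i + n]))
                i += n
                matched = True
                break
        if not matched:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    sentence = "he kicked the bucket in New York".split()
    known = [("kicked", "the", "bucket"), ("New", "York")]
    print(merge_mwes(sentence, known))
```

In such a setup the same merging would be applied to both sides of the parallel corpus before word alignment, and the joined tokens would be split again after decoding.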
A major characteristic of state-of-the-art PB-SMT is that phrase pairs are extracted solely on the basis of the knowledge contained in the word alignment table (plus some additional heuristics). The extracted phrases in PB-SMT do not respect linguistically motivated phrase boundaries: they may be fragments of linguistically motivated phrases or contain words from neighboring linguistic phrases. Recent research in SMT has investigated how to incorporate syntactic knowledge into PB-SMT systems to improve translation quality, and syntax-based SMT systems have provided promising improvements in recent years. Syntax-based SMT can be divided into two categories: formal syntax-based systems, which do not need any additional parser with a linguistically motivated grammar (Chiang, 2005), and linguistically motivated syntax-based systems, which use PCFG parsers (Liu et al., 2006; Huang, 2006; Mi et al., 2008; Mi and Huang, 2008; Zhang et al., 2009), syntactic word dependency parsers (Ding and Palmer, 2005; Quirk et al., 2005; Shen et al., 2008) or other parsers trained on treebanks (e.g., Wu et al., 2011). Translation rules can be extracted from aligned string-to-string (Chiang, 2005), tree-to-tree (Ding and Palmer, 2005) or tree/forest-to-string (Galley et al., 2004; Mi et al., 2008; Wu et al., 2011) data structures and their corresponding word alignment tables. The approach described in Chiang (2005) for incorporating syntax⁴ into PB-SMT mainly targets phrase reordering. Under this approach, hierarchical phrase translation probabilities are used to handle a range of reordering phenomena. Marcu et al. (2006) present a similar extension of PB-SMT with syntactic structure on the target side. Zollmann and Venugopal (2006) extend the work introduced in Chiang (2005) by augmenting the hierarchical phrase labels with syntactic categories derived from parsing the target side of the parallel corpus. They associate a target parse tree with the corresponding search lattice provided by lexical phrases on the source sentence and assign a syntactic category to phrases which align directly with the parse hierarchy. As in Chiang (2005), a chart-based parser with a limited language model was used.
⁴ This approach is formally syntax-based and uses a synchronous context-free grammar; it is not necessarily linguistically syntax-based, because it induces a grammar from a parallel text without relying on any linguistic annotations or assumptions.
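The alignment-driven phrase extraction that this discussion starts from can be sketched as follows. This is a simplified version of the usual consistency criterion: a source span and the target span it projects to form a phrase pair only if no word inside the target span is aligned to a word outside the source span. The function name `extract_phrases`, the `max_len` limit, and the omission of the common extension over unaligned boundary words are assumptions of this sketch rather than the behavior of any particular toolkit.

```python
from typing import List, Sequence, Set, Tuple


def extract_phrases(src: Sequence[str],
                    tgt: Sequence[str],
                    alignment: Set[Tuple[int, int]],
                    max_len: int = 4) -> List[Tuple[str, str]]:
    """Extract phrase pairs consistent with a word alignment.

    `alignment` holds (source_index, target_index) links. Only the
    alignment is consulted; no linguistic phrase boundaries are used.
    """
    pairs: List[Tuple[str, str]] = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: nothing in [j1, j2] may link outside [i1, i2].
            violated = any(j1 <= j <= j2 and not (i1 <= i <= i2)
                           for (i, j) in alignment)
            if violated:
                continue
            pairs.append((" ".join(src[i1:i2 + 1]),
                          " ".join(tgt[j1:j2 + 1])))
    return pairs


if __name__ == "__main__":
    src = ["la", "maison", "bleue"]
    tgt = ["the", "blue", "house"]
    links = {(0, 0), (1, 2), (2, 1)}
    for pair in extract_phrases(src, tgt, links):
        print(pair)
```

In this toy example the span "la maison" yields no phrase pair, because its target projection would also cover "blue", which is aligned to a word outside the span. The decision depends only on the alignment links, not on any syntactic analysis.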
Systems adopting the same (or different) MT framework usually produce different translations for the same input, due to differences in training data usage, preprocessing methods, alignment strategies, decoding processes, etc. It is therefore beneficial to design a combined framework that takes the output of multiple MT systems and produces better translations than any single system. MT system combination provides an approach to hybrid MT in which output from different MT engines, belonging to the same or different MT paradigms, is considered in a bid either to select the best hypothesis from among the candidate hypotheses or to build a new hypothesis altogether by combining parts of the candidate hypotheses. Many MT system combination approaches have been proposed over the years. These can be roughly grouped into three categories: (i) hypothesis selection (Rosti et al., 2007a; Hildebrand and Vogel, 2010), (ii) re-decoding (He and Toutanova, 2009; Devlin and Matsoukas, 2012), and (iii) confusion network decoding (Matusov et al., 2006; Rosti et al., 2007b). Further gains can be obtained with the lattice decoding model (Feng et al., 2009; Du et al., 2010) and the paraphrasing model (Ma and McKeown, 2015). Our own hybrid architecture is based on confusion network based system combination. Confusion network decoding typically follows four steps:
1. Backbone selection: This step selects a backbone (skeleton) from all the candidate hypotheses. Backbone selection strategies generally follow Minimum Bayes Risk (MBR) decoding (Rosti et al., 2007b; He et al., 2008). Translation edit rate (TER) or a modified BLEU score is often used as the loss function in MBR. Since the backbone determines the word order of the final fused translation, the quality of the combination output depends on which hypothesis is chosen as the backbone.
2. Hypothesis alignment: All words of each hypothesis are aligned against the backbone. Many approaches have been proposed to establish the alignment between a hypothesis and the backbone: the edit distance alignment algorithm (Bangalore et al., 2002), which only allows monotonic alignment; a heuristic-based matching algorithm which allows non-monotonic alignments (Jayaraman and Lavie, 2005); GIZA++ (Matusov et al., 2006); the TER alignment toolkit (Rosti et al., 2007a,b); the ITG-based method (Karakos et al., 2008); the IHMM-based word alignment method (He et al., 2008), in which the parameters are estimated indirectly from a variety of sources; and the systematic comparisons method (Chen et al., 2009; Rosti et al., 2012).
3. Confusion network construction: A confusion network is built from the hypothesis alignments. Hypothesis alignment algorithms produce many-to-one mappings between the hypothesis and the backbone. Since the confusion network is built from one-to-one word alignments, the word alignments need to be normalized to one-to-one alignments by simply removing duplicate links. The hypothesis words also need to be reordered according to the backbone word order.
4. Confusion network decoding: This step chooses the best translation path through the confusion network using a beam-search algorithm with a log-linear combination of a set of feature functions. The chosen path is the one with the highest confidence in the network. The feature functions include a language model, a word penalty, weights on word arcs and n-gram posterior probabilities. The weights of the feature functions are optimized using MERT (Och, 2003).
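The four steps above can be sketched in a deliberately simplified form. The sketch makes several assumptions that depart from the systems cited: plain word-level edit distance stands in for TER (no block shifts), MBR uses uniform posteriors over the hypotheses, the alignment is monotonic, hypothesis words inserted relative to the backbone are dropped, and decoding is a per-slot majority vote instead of a beam search over a log-linear model with a language model and MERT-tuned weights. All function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

EPS = "<eps>"  # empty arc: the hypothesis has no word for this slot


def edit_table(a: Sequence[str], b: Sequence[str]) -> List[List[int]]:
    """d[i][j] = word-level edit distance between a[:i] and b[:j]."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          d[i - 1][j] + 1,
                          d[i][j - 1] + 1)
    return d


def select_backbone(hyps: Sequence[Sequence[str]]) -> int:
    """Step 1: MBR with uniform posteriors and edit distance as loss."""
    def risk(k: int) -> int:
        return sum(edit_table(hyps[k], h)[-1][-1] for h in hyps)
    return min(range(len(hyps)), key=risk)


def align(backbone: Sequence[str],
          hyp: Sequence[str]) -> List[Tuple[Optional[int], str]]:
    """Step 2: monotonic alignment by tracing back the edit table."""
    d = edit_table(backbone, hyp)
    i, j = len(backbone), len(hyp)
    links: List[Tuple[Optional[int], str]] = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
                d[i][j] == d[i - 1][j - 1] + (backbone[i - 1] != hyp[j - 1]):
            links.append((i - 1, hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            links.append((i - 1, EPS))         # backbone word unmatched
            i -= 1
        else:
            links.append((None, hyp[j - 1]))   # insertion w.r.t. backbone
            j -= 1
    links.reverse()
    return links


def build_network(hyps: Sequence[Sequence[str]], b: int) -> List[Counter]:
    """Step 3: one slot of weighted word arcs per backbone position."""
    slots = [Counter() for _ in hyps[b]]
    for hyp in hyps:
        for pos, word in align(hyps[b], hyp):
            if pos is not None:  # insertions are dropped in this sketch
                slots[pos][word] += 1
    return slots


def decode(slots: Sequence[Counter], backbone: Sequence[str]) -> List[str]:
    """Step 4 (simplified): majority vote per slot, ties to the backbone."""
    output = []
    for votes, bword in zip(slots, backbone):
        best = max(votes, key=lambda w: (votes[w], w == bword))
        if best != EPS:
            output.append(best)
    return output


def combine(hyps: Sequence[Sequence[str]]) -> List[str]:
    b = select_backbone(hyps)
    return decode(build_network(hyps, b), hyps[b])


if __name__ == "__main__":
    outputs = ["he go to school today",
               "he goes to school now",
               "she goes to school today"]
    print(" ".join(combine([o.split() for o in outputs])))
```

In the toy input, each system output contains an error that the other two do not share, so the voted path differs from every individual hypothesis. This is the situation in which confusion network decoding can outperform pure hypothesis selection.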