In this section, we review applications of MapReduce to the translation pipeline, as well as techniques for storing SMT models and retrieving information from them.
3.2.1 Applications of MapReduce to SMT
Brants et al. (2007) introduce a new smoothing method called Stupid Backoff (see Section 2.7.4). The Stupid Backoff smoothing scheme is recalled in Equation 3.1:
\[
p_{\text{stupid backoff}}(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
\dfrac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})} & \text{if } c(w_{i-n+1}^{i}) > 0 \\[1ex]
\alpha \, p_{\text{stupid backoff}}(w_i \mid w_{i-n+2}^{i-1}) & \text{if } c(w_{i-n+1}^{i}) = 0
\end{cases}
\tag{3.1}
\]
With respect to the traditional backoff scheme, the Stupid Backoff scheme uses no discounting and simply uses relative frequency for the non-backed-off score, and the scaling parameter α for the backed-off score is independent of the n-gram history. Therefore this scheme does not define a conditional probability distribution over a word given its history.
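As an illustration of Equation 3.1, the following minimal Python sketch computes Stupid Backoff scores from raw n-gram counts. The function names, the toy corpus, the unigram base case (relative frequency over the corpus size) and the value alpha = 0.4 are assumptions made for this example, not details stated above.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; treat it as a tunable constant


def count_ngrams(tokens, max_order):
    """Count all n-grams of order 1..max_order in a list of tokens."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, history, counts, total_tokens, alpha=ALPHA):
    """Score `word` given `history`, a tuple of the preceding words.

    If the full n-gram was seen, return its relative frequency; otherwise
    return alpha times the score with the history shortened by one word.
    The result is a score, not a normalised probability.
    """
    if not history:
        # Base case (an assumption of this sketch): unigram relative frequency.
        return counts[(word,)] / total_tokens
    ngram = history + (word,)
    if counts[ngram] > 0:
        return counts[ngram] / counts[history]
    return alpha * stupid_backoff(word, history[1:], counts, total_tokens, alpha)


tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, max_order=3)
seen_score = stupid_backoff("cat", ("the",), counts, len(tokens))
unseen_score = stupid_backoff("mat", ("on", "cat"), counts, len(tokens))
```

In this toy corpus, the first query is answered directly by a relative frequency, while the second backs off twice and is therefore scaled by alpha squared.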
The Stupid Backoff language model building and application fit the MapReduce framework well. The input to language model building is a large monolingual corpus. The first step is to build a vocabulary. This is done with the canonical MapReduce example of word counting. Counts are needed in order to remove from the vocabulary words that occur fewer times than a threshold. The second step is to obtain n-grams and their counts. This is done again with MapReduce; the map and reduce functions are analogous to the ones defined for word counting, but unigrams are replaced by n-grams. For n-gram counting, the partition, or sharding, function hashes on the first two words of each n-gram. In addition, unigram counts and the total size of the corpus are available in each partition, i.e. to each reducer. This allows relative frequencies to be computed. Brants et al. (2007) also demonstrate that for large amounts of monolingual data, i.e. above 10 billion tokens, Stupid Backoff smoothing and Kneser-Ney smoothing perform comparably. In addition, in their experiments only Stupid Backoff smoothing could be scaled to datasets with more than 31 billion tokens. The scalability of Kneser-Ney smoothing has been improved in recent work (Heafield et al., 2013).4
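The n-gram counting step can be sketched in a single process as follows. This is not a Hadoop program: the map, shuffle and reduce phases are simulated in plain Python, and the number of shards, the use of CRC32 as a hash, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
import zlib

NUM_SHARDS = 4  # hypothetical number of reducers


def map_ngrams(sentence, order):
    """Map step: emit (n-gram, 1) for every n-gram of the given order."""
    words = sentence.split()
    for i in range(len(words) - order + 1):
        yield tuple(words[i:i + order]), 1


def shard_of(ngram, num_shards=NUM_SHARDS):
    """Partition function: hash on the first two words of the n-gram."""
    key = " ".join(ngram[:2]).encode("utf-8")
    return zlib.crc32(key) % num_shards


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each n-gram."""
    totals = Counter()
    for ngram, value in pairs:
        totals[ngram] += value
    return totals


def count_ngrams_mapreduce(corpus, order):
    """Run map, shuffle (grouping by shard) and reduce in one process."""
    shards = defaultdict(list)
    for sentence in corpus:
        for ngram, value in map_ngrams(sentence, order):
            shards[shard_of(ngram)].append((ngram, value))
    return {shard: reduce_counts(pairs) for shard, pairs in shards.items()}


corpus = ["the cat sat", "the cat ran", "a cat sat"]
trigram_shards = count_ngrams_mapreduce(corpus, order=3)
bigram_shards = count_ngrams_mapreduce(corpus, order=2)
```

Because the partition function only looks at the first two words, n-grams such as "the cat sat" and "the cat ran" are always sent to the same reducer.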
Dyer et al. (2008) observe that translation model estimation has become prohibitive on a single core and that existing ad hoc parallelisation algorithms may be more fragile than using an existing framework such as the Hadoop implementation of MapReduce.5 They provide solutions to word alignment model estimation and to translation rule extraction and estimation using MapReduce, and demonstrate the scalability of their method.
The convenience of the MapReduce framework for parallelisation has led to end-to-end toolkits that build entire phrase-based (Gao and Vogel, 2010) and hierarchical phrase-based (Venugopal and Zollmann, 2009) translation models with MapReduce.
4 See also http://kheafield.com/code/kenlm/estimation/, Scalability section.
5 https://hadoop.apache.org/
3.2.2 SMT Models Storage and Retrieval Solutions
We now review techniques appearing in the literature that have been used to store SMT models and to retrieve the information needed in translation from these models. SMT models are usually discrete probabilistic models and can therefore be represented as a set of key-value pairs. To obtain relevant information from a model stored in a certain data structure, a set of keys called a query set is formed; each key in this query set is then sought in that data structure. Strategies include:
• storing the model as a simple data structure in memory
• storing the model in a text file
• storing the model in more complicated data structures such as tries (Fredkin, 1960) (in memory or on disk)
• storing fractions of the entire model
• storing data as opposed to a precomputed model
• storing models in a distributed fashion
Each of these strategies is discussed below.
In some cases, it may be possible to fit a model into RAM. In this case the model can be stored as an in-memory associative array, such as a hash table. In-memory storage allows for faster query retrieval than on-disk storage, but only smaller models will fit in memory. In-memory storage has been used to store model parameters between iterations of expectation-maximisation for word alignment (Dyer et al., 2008; Lin and Dyer, 2010).
For larger models, the set of key-value pairs can be stored as a table in a single text file on local disk. Values for keys in the query set can be retrieved by scanning through the entire file: for each key in the file, its membership in the query set is tested. This is the approach adopted in the Joshua 5.0 decoder (Post et al., 2013)6, which uses regular expressions or n-grams to test membership (see Section 3.5.4). Venugopal and Zollmann (2009) use MapReduce to scan a file concurrently: the map function tests if the vocabulary of a rule matches the vocabulary of a test set.
6 Inferred from the decoder training scripts available at http://joshua-decoder.org/
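The file-scanning strategy can be illustrated with a short Python sketch. The tab-separated "key, value" layout, the exact-match membership test and the function name are assumptions made for this example; they do not reproduce the Joshua file format or its regular-expression filtering.

```python
import io


def scan_model(lines, query_set, separator="\t"):
    """Scan every 'key<separator>value' line and keep pairs whose key is queried."""
    found = {}
    for line in lines:
        key, _, value = line.rstrip("\n").partition(separator)
        if key in query_set:  # membership of the file key in the query set
            found[key] = value
    return found


# An in-memory stand-in for a model stored as a text file on local disk.
model_file = io.StringIO("the cat\t0.5\nthe dog\t0.25\na cat\t0.125\n")
query_set = {"the cat", "a cat", "a bird"}
retrieved = scan_model(model_file, query_set)
```

The cost of one retrieval is proportional to the size of the whole file, regardless of how small the query set is, which is the main drawback of this strategy.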
The model can also be stored using a trie associative array (Fredkin, 1960). A trie is a type of tree where each node represents a shared prefix of a set of keys represented by the child nodes. Each node only stores the prefix it represents. The keys are therefore compactly encoded in the structure of the trie itself. Querying the trie is an O(log(n)) operation, where n is the number of keys in the dataset. The trie may also be small enough to fit in physical memory to further reduce querying time. Zens and Ney (2007) use tries to store a phrase-based grammar. All the source phrases are represented in a trie stored on disk. Only relevant parts of the trie are loaded into memory when a source phrase is sought. Ganitkevitch et al. (2012) extend this approach to store hierarchical phrase-based grammars. Both source sides and target sides are stored in a packed representation of a trie. Packed tries have also been applied for storing language models (Pauls and Klein, 2011; Heafield, 2011).
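To make the prefix sharing concrete, here is a minimal in-memory trie over word sequences in Python. It is a plain pointer-based sketch, not the packed or on-disk representations used in the cited work; the class names and the toy phrase pairs are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next word -> child node
        self.value = None   # payload, set only if a key ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        """Store `value` under `key`, a sequence of words."""
        node = self.root
        for word in key:
            node = node.children.setdefault(word, TrieNode())
        node.value = value

    def lookup(self, key):
        """Return the value stored under `key`, or None if there is none."""
        node = self.root
        for word in key:
            node = node.children.get(word)
            if node is None:
                return None
        return node.value

    def count_nodes(self):
        """Count nodes, including the root, to show how prefixes are shared."""
        stack, total = [self.root], 0
        while stack:
            node = stack.pop()
            total += 1
            stack.extend(node.children.values())
        return total


grammar = Trie()
grammar.insert(("the", "cat"), "le chat")
grammar.insert(("the", "cat", "sat"), "le chat s'est assis")
grammar.insert(("the", "dog"), "le chien")
```

The three keys contain seven words in total but the trie needs only four non-root nodes, because the prefixes "the" and "the cat" are stored once.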
It is also possible to create a much smaller approximate version of the model. Talbot and Osborne (2007a) represent a set of n-grams as a Bloom filter (Bloom, 1970). They first use a standard Bloom filter to define a binary feature that indicates whether an n-gram was seen in a monolingual corpus. They also use a Bloom filter to encode n-grams together with quantised counts in order to define a multinomial feature, that is, a feature with a finite set of possible values (in this case the quantised values). Both these data structures can substantially reduce disk space and memory usage with respect to lossless representations of language models. However, they allow false positives for n-gram membership queries and overestimates of n-gram quantised counts. The authors demonstrate that the features they introduce are useful in translation, despite this lack of exactness. In a related publication (Talbot and Osborne, 2007b), the authors demonstrate how these techniques can be used to represent a smoothed language model. Talbot and Brants (2008) use a Bloomier filter (Chazelle et al., 2004) to represent n-grams and necessary statistics such as probabilities and backoff weights. Unlike Bloom filters, which only support Boolean characteristic functions on a set, Bloomier filters support arbitrary functions, in this case a mapping between an n-gram and its statistics. False positives can still occur, but for true positives, correct statistics are retrieved.
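The binary "was this n-gram seen" feature can be sketched with a standard Bloom filter as follows. The bit-array size, the number of hash functions and the use of seeded SHA-256 digests as hash functions are assumptions of this example, not the configuration used by the cited authors.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Derive num_hashes bit positions from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if all bits are set: never a false negative, but a false
        # positive is possible when other items happen to set the same bits.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen_ngrams = BloomFilter(num_bits=1024, num_hashes=3)
for ngram in ["the cat", "cat sat", "sat on"]:
    seen_ngrams.add(ngram)

# Binary feature: 1 if the n-gram is reported as seen, 0 otherwise.
feature = int("the cat" in seen_ngrams)
```

The filter stores only bits, never the n-grams themselves, which is where the space saving comes from and also why membership answers for unseen n-grams are only probably correct.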
Guthrie and Hepple (2010) propose an extension to previous work on randomised language models (Talbot and Osborne, 2007b) which prevents the random corruption of model parameters but does not stop the random assignment of parameters to unseen n-grams. Levenberg and Osborne (2009) extend randomised language models to stream-based language models. Another way of building a smaller approximate version of a model is to retain items with high frequency counts from a stream of data (Manku and Motwani, 2002). This technique has been applied to language modelling (Goyal et al., 2009) and translation rule extraction (Przywara and Bojar, 2011).
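One way to retain only high-frequency items from a stream is lossy counting, sketched below in Python. The choice of this particular algorithm, the error parameter epsilon and the toy stream are assumptions made for illustration; the sketch is not meant to reproduce the cited systems.

```python
import math


def lossy_count(stream, epsilon):
    """Approximate item frequencies over a stream with bounded memory.

    Each retained entry maps item -> [count, max_undercount]; the true
    frequency of a retained item lies between count and count + max_undercount.
    """
    width = math.ceil(1 / epsilon)  # number of items per bucket
    entries = {}
    seen = 0
    for item in stream:
        seen += 1
        bucket = (seen + width - 1) // width  # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            entries[item] = [1, bucket - 1]
        if seen % width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            entries = {k: v for k, v in entries.items() if v[0] + v[1] > bucket}
    return entries, seen


# Toy stream: "a" is frequent, every other item occurs exactly once.
stream = []
for i in range(10):
    stream += ["a"] * 6 + [f"rare_{i}_{j}" for j in range(4)]

entries, seen = lossy_count(stream, epsilon=0.1)
```

On this stream, every singleton is pruned at the end of the bucket in which it appears, so only the frequent item survives with its exact count.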
Instead of doing some precomputation on a dataset, it is possible to compute the sufficient statistics at query time using a suffix array (Manber and Myers, 1990), so that the model can be estimated only when needed. A suffix array is a sequence of pointers to each suffix in a training corpus. The sequence is sorted with respect to the lexicographic order of the referenced suffixes. Suffix arrays have been used for computing statistics for language models (Zhang and Vogel, 2006), phrase-based systems (Callison-Burch et al., 2005; Zhang and Vogel, 2005), and hierarchical phrase-based systems (Lopez, 2007). Callison-Burch et al. (2005) store both the source side and the target side of a parallel corpus in two suffix arrays. They also maintain an index between a position in the source or target side of the corpus and the sentence number. During decoding, all occurrences of a source phrase are located in the suffix array representing the source side of the corpus. This produces a source marginal count. For each of these occurrences, the corresponding sentence pair and word alignment are retrieved and rules are extracted. This produces a rule count, which is normalised with the source marginal count to produce a source-to-target probability. Lopez (2007) extends this approach to address the grammar extraction and estimation of hierarchical rules. Note that the use of suffix arrays for translation model estimation only supports the computation of source-to-target probabilities.
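The lookup of a source phrase in a suffix array can be sketched in Python as follows. The sketch covers only the location of occurrences and the resulting source marginal count; the retrieval of sentence pairs and alignments and the rule extraction are omitted. The construction by sorting slices is quadratic and only suitable for a toy corpus, and all names are hypothetical.

```python
def build_suffix_array(tokens):
    """Return suffix start positions sorted by lexicographic order of the suffixes."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, suffix_array, phrase):
    """Binary-search the suffix array for all positions where `phrase` starts."""
    phrase = list(phrase)
    n = len(phrase)

    def prefix(k):
        # First n tokens of the k-th suffix in sorted order.
        start = suffix_array[k]
        return tokens[start:start + n]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is >= phrase
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose prefix is > phrase
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


source = "the cat sat on the mat near the cat".split()
suffix_array = build_suffix_array(source)
positions = find_occurrences(source, suffix_array, ("the", "cat"))
source_marginal_count = len(positions)
```

Because all suffixes starting with the same phrase are contiguous in the sorted array, two binary searches are enough to find every occurrence without scanning the corpus.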
Finally, some approaches store language models in a distributed fashion. We saw in Section 3.2.1 how a Stupid Backoff language model can be built. After relative frequencies have been computed, another partition function that hashes on the last two words of an n-gram is applied so that backoff operations can be done within the same partition. At decoding time, the decoder architecture is modified in order to request a batch of n-grams rather than a single n-gram. Stupid Backoff language model building can be scaled to a corpus of 2 trillion tokens, and the distributed language model can be applied in real time.
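The effect of partitioning on the last two words, together with batched requests, can be illustrated with the following Python sketch. The number of shards, the CRC32 hash and the function names are assumptions made for this example, and no actual network communication is shown.

```python
from collections import defaultdict
import zlib

NUM_SHARDS = 8  # hypothetical number of language model servers


def lm_shard(ngram, num_shards=NUM_SHARDS):
    """Partition on the last two words, so that an n-gram and its backed-off
    versions of order two or more are served by the same shard."""
    key = " ".join(ngram[-2:]).encode("utf-8")
    return zlib.crc32(key) % num_shards


def batch_by_shard(ngrams):
    """Group a batch of n-gram queries so that each shard gets one request."""
    requests = defaultdict(list)
    for ngram in ngrams:
        requests[lm_shard(ngram)].append(ngram)
    return dict(requests)


batch = [
    ("the", "cat", "sat"),
    ("cat", "sat"),
    ("a", "cat", "sat"),
    ("the", "dog", "ran"),
]
requests = batch_by_shard(batch)
```

Backing off from "the cat sat" to "cat sat" keeps the last two words unchanged, so both lookups land on the same shard and the backoff needs no extra round trip.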
Zhang et al. (2006) propose a distributed large language model backed by suffix arrays. HBase has also been used to build a distributed language infrastructure (Yu, 2008). The method we propose to use is closely related to the latter, but we use a more lightweight infrastructure than HBase. In addition, our method can also be applied to language model querying (Pino et al., 2012), which demonstrates the flexibility of the infrastructure.
Note that there are many alternative solutions to HBase/HFile for storage of a large number of key-value pairs, such as Berkeley DB7 or Cassandra8. The purpose of this chapter is not to compare these alternatives but rather to compare existing solutions employed in the machine translation community to one of these solutions.