
Most previous work on idiom translation augments machine translation models with features indicating whether the source sentence contains an idiom [Fadaee et al., 2018, Salton et al., 2014]. In this case study, we investigate whether idiom usage information (extracted by our usage recognition model) can improve the machine translation of idioms.

6.2.1 Integrating Usage Information into Machine Translation Model

An important challenge in conducting this study is that it requires a dedicated parallel corpus of reasonable size for learning and evaluating idiom translation. The English-German idiom corpus of [Fadaee et al., 2018] satisfies our need. This corpus is built from the data used in the WMT German-English Shared Task from 2008 to 2016 [Bojar et al., 2017]. Specifically, we perform the English-to-German translation task, and each English sentence in the test data contains at least one idiom listed in the dict.cc online dictionary. The statistics of the dataset are listed in Table 16.

Table 16: Statistics of English-to-German translation dataset.

Number of unique idioms                 132
Training size                           4.5M
Idiomatic sentences in training data    1998
Test size                               1500

Another challenge of this study is to integrate our usage recognition model into modern machine translation models. The full pipeline has to address three problems. First, it needs to locate the potential idioms in the sentence. Second, it has to recognize the usages of the potential idioms. Finally, it has to encode the usage information into the machine translation model. As we have addressed the second problem in the previous chapters (we use the generalized model in this study), only the first and the third problems remain to be addressed here.

For each sentence in the English-to-German translation dataset, the idiom information (e.g., whether there is an idiom and the standard form of the idiom) is provided; we only need to find the position of the given idiom. We employ lexico-syntactic patterns to recognize its occurrences. Specifically, we first use exact string matching to locate the idiom in the text. Exact matching cannot find every occurrence, since many idioms undergo syntactic changes such as inflection. To resolve this problem, we further use regular expressions to recognize the remaining occurrences.

To encode the usage information into machine translation models, a straightforward method is to append a special extra token <fig> to each source sentence containing a figurative usage of an idiom. This simple approach tends to be effective in machine translation systems that employ sequence-to-sequence architectures [Fadaee et al., 2018]. As this method ignores the position of the idiom, we also experiment with another method in which we insert a token <start fig> before the idiom and a token <end fig> after the idiom. We compare these two methods with the conventional setting in which no extra information regarding the usage of the idiom is provided.
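The two steps above (locating the idiom, then marking its figurative usage) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the suffix pattern used to tolerate inflection, and the underscore spelling of the position tokens are all hypothetical, and the figurative/literal decision is passed in as a boolean standing in for the usage recognition model.

```python
import re

# Token spellings are assumptions made for this sketch.
FIG, START_FIG, END_FIG = "<fig>", "<start_fig>", "<end_fig>"

# Hypothetical pattern: lets each word carry a common English inflectional ending.
_SUFFIX = r"(?:s|es|ed|d|ing)?"


def locate_idiom(sentence, idiom):
    """Return the (start, end) character span of the idiom, or None if not found."""
    # Step 1: exact (case-insensitive) string match on the standard form.
    exact = re.search(r"\b" + re.escape(idiom) + r"\b", sentence, re.IGNORECASE)
    if exact:
        return exact.span()
    # Step 2: regular-expression fallback that tolerates simple inflection.
    words = [re.escape(w) + _SUFFIX for w in idiom.split()]
    loose = re.search(r"\b" + r"\s+".join(words) + r"\b", sentence, re.IGNORECASE)
    return loose.span() if loose else None


def tag_sentence(sentence, idiom, is_figurative, mode="wrap"):
    """Mark a figurative usage; literal usages are returned unchanged."""
    if not is_figurative:
        return sentence
    if mode == "append":
        # Sentence-level marker: ignores the position of the idiom.
        return f"{sentence} {FIG}"
    span = locate_idiom(sentence, idiom)
    if span is None:
        return sentence
    start, end = span
    # Position-aware markers placed around the idiom.
    return f"{sentence[:start]}{START_FIG} {sentence[start:end]} {END_FIG}{sentence[end:]}"
```

For example, `tag_sentence("He kicked the bucket last year.", "kick the bucket", True)` wraps the inflected occurrence "kicked the bucket" in the two position tokens, while `mode="append"` only adds the sentence-level token at the end. The suffix pattern is deliberately crude and would miss irregular forms or inserted modifiers.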

The vocabulary is limited to the top 20K most frequent words in both languages. The hyperparameters are summarized in Table 17.

Table 17: Hyperparameters of our machine translation model.

Parameter                         Value
Encoder layer                     4
Encoder LSTM hidden state size    1000
Dropout                           0.1
Epoch                             20
Batch size                        100

We use BLEU to measure the quality of translations. As Table 18 shows, the baseline achieves a BLEU score of 17.2, which is lower than the performance of previously reported models on the standard test set (WMT 2008-2016) [Sennrich et al., 2016]. This suggests that sentences containing idioms are much harder to translate. Further, simply appending the <fig> token to indicate figurative usage yields a BLEU score of 16.6, which is 0.6 below the baseline, whereas marking the idiom with the <start fig> and <end fig> tokens reaches 19.5 and outperforms the baseline by 2.3 BLEU. This suggests that usage information, when combined with the position of the idiom, can help boost the performance of neural machine translation models on idioms.

Table 18: The performance on English-to-German idiom translation test set.

Model                                   BLEU
NMT Baseline                            17.2
with <fig> token                        16.6
with <start fig> <end fig> tokens       19.5
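For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU, assuming a single reference per sentence, whitespace tokenization, uniform weights over 1- to 4-grams, and no smoothing. It is only meant to illustrate what the scores in Table 18 measure; the function names are hypothetical, and the scores reported above were not necessarily produced with this exact configuration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matched[n - 1] += sum((h_counts & r_counts).values())
            total[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with no match gives a score of zero.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalizes output shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)
```

Because the score is a geometric mean over n-gram orders, a translation that renders an idiom word by word can lose several higher-order n-gram matches at once, which is one plausible reason idiomatic sentences score lower.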

6.2.2 Limitations

As mentioned above, the idiom information is provided for each sentence in our study. In real applications, however, we first need to determine whether a sentence contains an idiom at all. One straightforward way is to rely on external idiom resources. For example, we can first build an up-to-date idiom dictionary of broad coverage and high quality (online dictionaries such as thefreedictionary.com and dict.cc are reasonable choices) and then use lexico-syntactic patterns to recognize whether an idiom in the dictionary occurs in the sentence. When external idiom resources are not available, we can alternatively resort to idiom type classification methods to find potential idioms in a sentence [Fazly and Stevenson, 2006, Venkatapathy and Joshi, 2005, Katz and Giesbrecht, 2006].

Another concern is related to the figurative meanings of idioms. We integrate only the usage and position information of an idiom into machine translation models, so we rely on the models to learn the figurative interpretation of idioms from the training data. This is problematic for idioms with low semantic analyzability, especially when they do not have enough figurative instances for training. One solution to this problem is to replace idioms with their figurative meanings in literal English. We have discussed this solution in [Liu and Hwa, 2016] and leave it as future work.
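As a rough illustration of the replacement idea, the sketch below substitutes a figurative idiom with a literal paraphrase before translation. It assumes a hand-built paraphrase table; the table entries, the function name, and the boolean flag standing in for the usage recognition model are all hypothetical and do not come from the corpus or from [Liu and Hwa, 2016].

```python
import re

# Hypothetical paraphrase table; the entries are illustrative only.
PARAPHRASES = {
    "spill the beans": "reveal the secret",
    "kick the bucket": "die",
}


def replace_figurative(sentence, idiom, is_figurative):
    """Replace a figurative idiom with its literal paraphrase, if one is known."""
    if not is_figurative or idiom not in PARAPHRASES:
        return sentence
    paraphrase = PARAPHRASES[idiom]
    # A function is used as the replacement so the paraphrase is inserted
    # verbatim, without backslash or group-reference interpretation.
    return re.sub(re.escape(idiom), lambda _m: paraphrase, sentence,
                  flags=re.IGNORECASE)
```

This version matches only the standard form of the idiom, so inflected occurrences are left untouched, and a real system would also need to carry tense and agreement over to the paraphrase.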
