To allow similarity characteristics to be investigated in detail, it was important to gather this information for document pairs with varying degrees of similarity. One approach is to measure the similarity scores between all interlanguage-linked pairs in Wikipedia and to carry out stratified sampling to purposively include document pairs from different ranges of similarity scores. Since most of the language pairs were under-resourced, it was not possible to use methods that rely on translation resources, as these were mostly unavailable for the 7 under-resourced language pairs. Automatic approaches, such as Google Translate, were available for these language pairs during the study. However, Google Translate had a strict limit on the amount of free translation per day¹ and was therefore infeasible for translating the entire Wikipedia.
Therefore, the author considered the use of language-independent approaches in selecting the evaluation documents. As described in the related work (Chapter 2), the link-based bilingual lexicon method (Adafre & de Rijke, 2006) was shown to identify translated sentences in Wikipedia documents with high accuracy, although it achieved very low recall. Since this was the only method that had been evaluated and shown to work on Wikipedia documents, it was applied in selecting the evaluation documents for the corpus.
¹ During this work (carried out in 2012), Google Translate had a limit of 2M characters to be translated each day (http://developers.google.com/translate/v2/pricing/). By the end of this study (February 2019), Google Translate was a paid service and did not provide any free translation service.
Some adaptations were made to the approach, hereafter referred to as the anchor text and word overlap method (anchor + word method); a detailed description of this method is given in Chapter 6. Firstly, the link-based bilingual lexicon method was adapted to consider both links and word overlap in order to increase recall. Although this adaptation was likely to reduce the accuracy (precision) of the method in finding translated sentences across documents, it was intended to allow the method to perform better in identifying similar (yet non-translated) sentences. Furthermore, since the document selection process required similarity to be measured at the document level, the link-based bilingual lexicon method (originally created to identify translated sentences) was adapted to aggregate the sentence similarity scores to represent similarity at the document level. The anchor + word method was then used to measure similarity across all Wikipedia articles, prior to carrying out stratified sampling to select 100 document pairs with varying similarity scores for each language pair.
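The two adaptations described above can be illustrated with a minimal sketch. It is not the implementation detailed in Chapter 6: the sentence representation, the use of Jaccard overlap, the `link:` prefix, and the aggregation by averaging each source sentence's best match are all assumptions made for illustration, as are the names `sentence_similarity`, `document_similarity`, and `lexicon`.

```python
from typing import Dict, List, Set, Tuple

# Assumed representation: a sentence is (lower-cased words, link target titles).
Sentence = Tuple[Set[str], Set[str]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sentence_similarity(src: Sentence, tgt: Sentence, lexicon: Dict[str, str]) -> float:
    """Overlap of shared words plus links matched through the lexicon.

    `lexicon` is a hypothetical mapping from source-language article titles
    to target-language titles, built from interlanguage links.
    """
    src_words, src_links = src
    tgt_words, tgt_links = tgt
    # Source links with no lexicon entry cannot be matched and are dropped.
    translated = {lexicon[link] for link in src_links if link in lexicon}
    # Prefix links so that a link never collides with an ordinary word.
    src_terms = src_words | {"link:" + t for t in translated}
    tgt_terms = tgt_words | {"link:" + t for t in tgt_links}
    return jaccard(src_terms, tgt_terms)


def document_similarity(src_doc: List[Sentence], tgt_doc: List[Sentence],
                        lexicon: Dict[str, str]) -> float:
    """Aggregate sentence scores: mean of each source sentence's best match."""
    if not src_doc or not tgt_doc:
        return 0.0
    best = [max(sentence_similarity(s, t, lexicon) for t in tgt_doc) for s in src_doc]
    return sum(best) / len(best)


lexicon = {"Berlin": "Berlín"}
src_doc = [({"berlin", "1237"}, {"Berlin"})]
tgt_doc = [({"berlín", "1237"}, {"Berlín"})]
score = document_similarity(src_doc, tgt_doc, lexicon)
```

In this toy example the two sentences share the token `1237` and one lexicon-matched link, while the words `berlin` and `berlín` differ, so the link evidence contributes to the score even though no translation resource is used.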
The use of this method in selecting the evaluation documents might introduce a potential bias. Firstly, the distribution of link overlap or word overlap in the selected documents might differ considerably from the distribution of these features in Wikipedia articles in general. The purpose of this evaluation corpus, however, was not to create a corpus that represents the nature of Wikipedia. Instead, its purpose was to include document pairs with a wide range of similarity, which allowed different approaches to be evaluated against human judgments and the similarity characteristics of Wikipedia documents with different similarity scores to be investigated. The corpus was later shown to achieve this purpose.
Another bias that might have been introduced with this approach is that the anchor + word method was the only method used to pool the evaluation documents, and therefore this method might have an advantage when evaluated on this corpus. The use of more methods in the pooling of evaluation documents would have been preferred. However, at the time of carrying out the document selection task, there was no other language-independent method that had been investigated and shown to work on Wikipedia articles and that could have been applied alongside the anchor + word method. A random sampling of Wikipedia documents was considered; however, due to the large number of non-similar document pairs in Wikipedia (Patry & Langlais, 2011; Tomás et al., 2008), this approach was likely to select a large number of document pairs with low similarity, which would not have been useful for the evaluation corpus. Furthermore, a maximum of 100 document pairs per language pair could be evaluated due to the limited number of annotators. Using multiple methods to gather a larger number of documents, therefore, could not be pursued in this study.
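The contrast between random and stratified selection discussed above can be sketched as follows. This is an illustrative example only: the study does not specify the number of strata or the per-stratum quota, so the equal-width bins, the equal quota per bin, the top-up from leftover pairs, and the name `stratified_sample` are assumptions. Scores are assumed to lie in the range 0 to 1.

```python
import random
from typing import List, Sequence, Tuple


def stratified_sample(scored_pairs: Sequence[Tuple[str, float]],
                      n_total: int = 100, n_bins: int = 5,
                      seed: int = 0) -> List[str]:
    """Select up to n_total pair ids spread across equal-width score bins."""
    rng = random.Random(seed)
    bins: List[List[str]] = [[] for _ in range(n_bins)]
    for pair_id, score in scored_pairs:
        # Clamp so that a score of exactly 1.0 falls into the last bin.
        idx = max(0, min(int(score * n_bins), n_bins - 1))
        bins[idx].append(pair_id)

    quota = n_total // n_bins
    selected: List[str] = []
    leftover: List[str] = []
    for members in bins:
        rng.shuffle(members)
        selected.extend(members[:quota])   # equal share from each stratum
        leftover.extend(members[quota:])

    # Sparse strata may not fill their quota; top up at random from the rest.
    rng.shuffle(leftover)
    selected.extend(leftover[: n_total - len(selected)])
    return selected


# Toy collection: many low-similarity pairs and few high-similarity pairs.
scored = [(f"low{i}", 0.05) for i in range(1000)]
scored += [(f"high{i}", 0.95) for i in range(30)]
sample = stratified_sample(scored, n_total=100, n_bins=5)
n_high = sum(1 for pair_id in sample if pair_id.startswith("high"))
```

With this toy distribution, a purely random draw of 100 pairs would be expected to contain only about three high-similarity pairs, whereas the stratified draw guarantees at least the per-bin quota of 20.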
Despite this possible bias, the use of this approach was later shown to achieve the two purposes of this evaluation corpus. Firstly, it was able to include document pairs with varying degrees of similarity, which allowed similarity characteristics to be investigated in more detail. Furthermore, the corpus also allowed different approaches to be evaluated against human judgments, in order to identify further approaches that can be used to measure similarity in Wikipedia. These approaches should be investigated as future work to improve the selection process and to increase the size of the evaluation corpus.