The author identified two limitations of the Wikipedia similarity dataset: the size of the dataset and the size of the articles.
Firstly, the evaluation set contains only 100 document pairs per language pair. These document pairs were selected by stratified sampling based on the anchor text and word overlap method (described in Chapter 6) so that the evaluation corpus would include 100 document pairs with a wide range of similarity. As a result, this evaluation set may not represent the distribution of similarity in Wikipedia in general. Furthermore, the sampling may bias the evaluation set towards document pairs with a high overlap of links and words, although such document pairs may not occur very frequently in Wikipedia. On the other hand, the proportion of document pairs with low similarity was also shown to be smaller. Future work is required to add more non-similar instances to the evaluation corpus and so make it more balanced.
Secondly, most of the document pairs contain only up to 1,000 words, which means that larger Wikipedia articles are not represented in the evaluation set. This limit was set to reduce assessors’ fatigue. In the current form of the evaluation task, assessors were required to read the document contents before assigning a similarity and comparability score, identifying matching contents and assessing the sentence similarity in the matching contents. These tasks became extremely difficult when the document pairs were too long. Including such document pairs (i.e., documents longer than 1,000 words) might instead introduce inaccuracies into the dataset because of a higher probability of assessors’ fatigue and human error. By limiting the size of documents, assessors were able to spend more time reading the document contents in order to assess them reliably.
Despite these limitations, this evaluation set is still a valuable resource for evaluating similarity methods. It includes interlanguage-linked articles with different degrees of similarity, providing better resources for training and evaluating different approaches. Moreover, the set also captures various issues that affect the similarity of documents. These findings can be used to further understand similarity in Wikipedia articles, to improve automatic methods for measuring similarity and to evaluate those methods automatically.
5.7 Conclusion
This chapter has described the creation of an evaluation corpus specifically for Wikipedia. In this section, the author answers the research questions presented earlier in this chapter.
RQ1. What are the characteristics of similar interlanguage-linked articles in Wikipedia? The evaluation corpus has shown that Wikipedia articles with different scores exhibit different similarity characteristics. Similar document pairs were shown to contain similar structure, overlapping named entities, overlapping fragments, and translated contents. A high proportion of translated contents was only found in documents with the highest similarity scores (Q1 scores of 5). However, the first three characteristics (i.e., similar structure, overlapping named entities and overlapping fragments) could still be found in document pairs with lower similarity scores (Q1 scores of 3 or above). A high proportion of non-similar document pairs (Q1 scores of 2 or below) was shown to contain different information, although a small proportion of these documents may still contain similarity at the sub-document level, e.g., same entities and similar sentences or phrases.
RQ2. Can we create an evaluation benchmark for Wikipedia? I.e., do human assessors agree on Wikipedia similarity? The author has proposed an evaluation scheme to gather human judgments on 800 Wikipedia document pairs in 8 language pairs (100 per language pair). These document pairs were selected using the anchor text and word overlap method and stratified sampling in order to cover a wide range of similarity. This corpus allows Wikipedia characteristics to be investigated in more detail, and future similarity measures to be evaluated against the human judgments. Overall, moderate agreement was achieved between the assessors across all the evaluation questions (mean weighted Cohen’s Kappa between 0.46 and 0.56). Since no specific guidelines were created to define the different scores in the evaluation questions, it was expected that assessors’ answers would differ slightly because of their different points of view on what each score means. This was supported by the increase in weighted Cohen’s Kappa when cases where assessors’ answers differed by one were counted as agreement (mean weighted Cohen’s Kappa between 0.62 and 0.77), showing good agreement between assessors for each evaluation question.
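To make the agreement figures concrete, the following is a minimal sketch of how a weighted Cohen’s Kappa can be computed for two assessors, with and without treating scores that differ by one as agreement. The linear weighting, the `tolerant` weighting and the sample ratings are assumptions chosen for illustration; the exact weighting scheme used in this work is not restated here.

```python
from collections import Counter

def weighted_kappa(ratings_a, ratings_b, categories, weight):
    """Weighted Cohen's Kappa for two assessors.

    weight(i, j) is the disagreement penalty between scores i and j
    (0 means the two scores count as full agreement).
    """
    n = len(ratings_a)
    observed = Counter(zip(ratings_a, ratings_b))
    marginal_a = Counter(ratings_a)
    marginal_b = Counter(ratings_b)
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in categories:
        for j in categories:
            w = weight(i, j)
            observed_disagreement += w * observed[(i, j)] / n
            expected_disagreement += w * (marginal_a[i] / n) * (marginal_b[j] / n)
    if expected_disagreement == 0:
        raise ValueError("Kappa is undefined when no disagreement is expected")
    return 1.0 - observed_disagreement / expected_disagreement

def linear(i, j):
    # Standard linear weights: penalty grows with the score distance.
    return abs(i - j)

def tolerant(i, j):
    # Assumption for illustration: scores differing by one count as agreement.
    return max(0, abs(i - j) - 1)

# Made-up ratings on a 1-5 scale for five document pairs.
a = [1, 2, 3, 4, 5]
b = [2, 3, 4, 5, 5]
scores = range(1, 6)
strict_kappa = weighted_kappa(a, b, scores, linear)
tolerant_kappa = weighted_kappa(a, b, scores, tolerant)
print(strict_kappa, tolerant_kappa)
```

In this toy example the two assessors never differ by more than one point, so the tolerant variant reports perfect agreement while the linear variant does not.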
Future work should explore including more document pairs to increase the size of the corpus, and increasing the size of the evaluation documents to make the similarity corpus more representative of the state of Wikipedia. Extending this corpus is outside the scope of this work, but would be a promising step towards strengthening the current evaluation benchmark.
Related publication
• Paramita, M., Clough, P., Aker, A. and Gaizauskas, R. 2012. Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, pp. 790-797.
Anchor Text and Word Overlap Method
The findings in Chapters 4 and 5 have further confirmed previous work: interlanguage-linked articles in Wikipedia exhibit varying degrees of similarity. Furthermore, different characteristics that contribute to the similarity between a document pair have also been identified. Based on this information and related literature, this thesis investigates four different methods (as shown in Figure 1.1 on page 11) to measure cross-lingual similarity in Wikipedia. This chapter reports the first of four experiments that have been carried out to develop and analyse methods to identify similarity in Wikipedia interlanguage-linked articles. In this experiment, the author developed a method to measure similarity between a document pair using the similarity of Wikipedia links between both articles. Similar information across languages is identified using the interlanguage link information in Wikipedia.
6.1 Background
Information derived from interlanguage links has previously been used to identify similar information across different languages in Wikipedia. One approach is the link-based bilingual lexicon approach (Adafre & de Rijke, 2006), previously described in Section 2.4.2. This approach is language-independent and does not require any translation resources. Instead, it creates its own translation resources (further referred to as a bilingual lexicon) for a language pair by extracting titles of all Wikipedia interlanguage-linked articles in that language pair. This bilingual lexicon is then utilised for identifying similar content in different languages. Adafre and de Rijke (2006) showed that this approach was able to identify similar sentences in Dutch-English with high precision, although low recall was observed. Its performance in other language pairs, especially under-resourced languages, has to date not been studied.
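The idea of a link-based bilingual lexicon can be sketched as follows. This is a minimal illustration rather than the implementation of Adafre and de Rijke (2006): the function names, the lower-casing of titles and the Dutch-English title pairs are assumptions made for the example.

```python
def build_lexicon(interlanguage_title_pairs):
    """Map source-language article titles to their linked target-language titles."""
    lexicon = {}
    for source_title, target_title in interlanguage_title_pairs:
        # Lower-casing is an assumption made here to simplify matching.
        lexicon[source_title.strip().lower()] = target_title.strip().lower()
    return lexicon

def translate_links(link_titles, lexicon):
    """Translate the link titles that have an entry; the rest are dropped."""
    return {lexicon[t.lower()] for t in link_titles if t.lower() in lexicon}

# Hypothetical title pairs taken from interlanguage links (Dutch -> English).
pairs = [
    ("Nederland", "Netherlands"),
    ("Hoofdstad", "Capital city"),
    ("Amsterdam", "Amsterdam"),
]
lexicon = build_lexicon(pairs)

# Links found in a Dutch sentence and in a candidate English sentence.
dutch_links = {"Amsterdam", "Hoofdstad", "Gracht"}
english_links = {"amsterdam", "capital city", "canal"}

translated = translate_links(dutch_links, lexicon)
shared = translated & english_links
print(sorted(shared))
```

Here "Gracht" has no entry in the toy lexicon, so it cannot be matched to "canal". Missing entries of this kind are one plausible source of the low recall mentioned above.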
In this experiment, the author investigated the use of this approach in identifying cross-lingual similarity in 8 different language pairs. This method was selected because it relied only on information within Wikipedia and therefore could be applied to other language pairs (that were available in Wikipedia) without requiring any external linguistic resources. An adaptation of this method is proposed to identify similarity in Wikipedia at the document level.
The proposed method, referred to as the anchor text and word overlap method (or anchor + word), differs from the link-based bilingual lexicon approach (Adafre & de Rijke, 2006) in four ways. Firstly, prior to measuring similarity, Adafre and de Rijke (2006) represented each sentence using the links only, i.e., any words that are not linked to any Wikipedia article are discarded (see example in Table 2.5 on page 48). However, the author suggests that some of these non-linked words (such as numbers or named entities) may appear the same across languages and should be taken into account when measuring similarity. Therefore, the proposed anchor + word method represents each sentence using both the anchor texts (i.e., clickable texts or linked words in the articles) and the remaining words (i.e., non-linked words in the articles). This approach is proposed to increase the recall of the method.
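The two-part sentence representation can be illustrated with a short sketch. It assumes that sentences are available as wikitext with `[[Title]]` or `[[Title|anchor]]` links, and it represents each link by its target title; both choices, the `represent` helper and the example sentences (including the number in them) are made up for illustration and are not taken from the thesis.

```python
import re

# Matches [[Title]] and [[Title|anchor text]] links in wikitext.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def represent(sentence):
    """Split a wikitext sentence into linked titles and non-linked words."""
    links = {m.group(1).strip().lower() for m in LINK.finditer(sentence)}
    remainder = LINK.sub(" ", sentence)
    words = set(re.findall(r"\w+", remainder.lower()))
    return links, words

dutch = "[[Amsterdam]] heeft 821752 inwoners in [[Nederland|het land]]."
english = "[[Amsterdam]] has 821752 inhabitants in the [[Netherlands]]."

links_nl, words_nl = represent(dutch)
links_en, words_en = represent(english)

# Non-linked tokens that are identical in both languages, such as numbers.
shared_words = words_nl & words_en
print(links_nl, shared_words)
```

A links-only representation would discard the number shared by both sentences, whereas keeping the non-linked words lets it contribute to the overlap.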
Secondly, Adafre and de Rijke (2006) carried out sentence alignment by allowing only a one-to-one correspondence between similar sentences. However, as identified in Section 4.4.3, similar contents may appear in both documents of a pair without corresponding one-to-one at the sentence level, i.e., contents described in one sentence in one article may be spread over more than one sentence in the other article. To accommodate this, the anchor + word method allows a many-to-one correspondence when aligning similar sentences.
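A many-to-one alignment can be sketched as below. This is an illustration under stated assumptions, not the method's actual scoring: sentences are assumed to be already reduced to token sets in a shared vocabulary, Jaccard overlap is used as a stand-in similarity measure, and the threshold of 0.3 is hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def align_many_to_one(source_sentences, target_sentences, threshold=0.3):
    """Align each source sentence to its best-matching target sentence.

    Several source sentences may point to the same target sentence,
    which is what distinguishes this from a one-to-one alignment.
    """
    alignment = {}
    for s_idx, s_tokens in enumerate(source_sentences):
        best_idx, best_score = None, 0.0
        for t_idx, t_tokens in enumerate(target_sentences):
            score = jaccard(s_tokens, t_tokens)
            if score > best_score:
                best_idx, best_score = t_idx, score
        if best_idx is not None and best_score >= threshold:
            alignment[s_idx] = (best_idx, best_score)
    return alignment

# Toy token sets: two short source sentences cover one long target sentence.
source = [
    {"amsterdam", "capital"},
    {"amsterdam", "netherlands", "821752"},
    {"weather"},
]
target = [
    {"amsterdam", "capital", "netherlands", "821752", "inhabitants"},
    {"football"},
]
alignment = align_many_to_one(source, target)
print(alignment)
```

Both of the first two source sentences are aligned to the same target sentence, while the third has no match above the threshold and is left unaligned.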
Thirdly, Adafre and de Rijke (2006) also used information extracted from the redirection pages to build the bilingual lexicon. Further research, however, has shown that this significantly decreased the accuracy of the extracted bilingual terms (from 92.3% to 23.1% in German-English as reported in Erdmann et al. (2009)). Therefore, in this study, the author only extracted the titles from the interlanguage-linked articles when building the bilingual lexicon.
Finally, the link-based bilingual lexicon approach identifies similarity at the sentence level. The anchor + word method, on the other hand, identifies similar sentences and further aggregates the information to measure similarity at the document level.
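One simple way to turn sentence-level matches into a document-level score is sketched below. The aggregation shown (the sum of the best-match scores divided by the number of source sentences, so that unaligned sentences count as zero) is an assumption made for illustration; the aggregation actually used by the method is described in Section 6.2.

```python
def document_similarity(aligned_scores, n_source_sentences):
    """Average sentence-level scores over all source sentences.

    aligned_scores maps a source sentence index to the score of its best
    match; sentences without a match are absent and contribute zero.
    """
    if n_source_sentences == 0:
        return 0.0
    return sum(aligned_scores.values()) / n_source_sentences

# Hypothetical sentence-level scores for a three-sentence source document
# in which only the first two sentences were aligned.
aligned_scores = {0: 0.4, 1: 0.6}
doc_score = document_similarity(aligned_scores, 3)
print(round(doc_score, 3))
```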
This experiment aims to answer the third research question:
RQ3. Can language-independent approaches be used to identify cross-lingual similarity in Wikipedia?
(a) How does the method compare to approaches using linguistic resources, such as MT systems?
(b) How does the performance of the approach vary for different language pairs?
(c) What language-independent features are best for measuring cross-lingual similarity in Wikipedia?
First, the author describes the method in Section 6.2 and the experiments in Section 6.3. The method is evaluated against a baseline that utilises an MT system (in this case, Google Translate). The results are reported in Section 6.4. Finally, the author discusses the results and concludes the experiment in Section 6.5 and Section 6.6.