5.1. Overview
5.1.1. Description of the components
After all anchor texts in the source documents have been translated using the bilingual lexicon, similarity is calculated in two stages: first at the sentence level, then at the document level.
...
Večina jih je v bližini [[družina Vesta|asteroidne družine Vesta]].
Imajo podobne [[izsrednost|izsrednosti]], toda njihova [[elipsa|velika polos]] leži v območju od 2,18 [[astronomska enota|a. e.]] do 2,50 a. e. (kjer je [[Kirkwoodova vrzel|Kirkwoodova vrzel]] 3 : 1).
...
(a) Before anchor text translation
...
Večina jih je v bližini [[vesta family]].
Imajo podobne [[eccentricity]], toda njihova [[ellipsis]] leži v območju od 2,18 [[astronomical unit]] do 2,50 a. e. (kjer je [[Kirkwood gap]] 3 : 1).
...
(b) After anchor text translation
Fig. 6.4 Slovenian text of “V-type asteroid”
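The anchor text translation step illustrated in Fig. 6.4 can be sketched as follows. This is a minimal illustration rather than the author's implementation: the function name `translate_anchors`, the toy lexicon entries, the choice to look up the link target (the part before `|`) in lowercase, and the decision to leave unknown links unchanged are all assumptions inferred from the figure.

```python
import re

# Matches wiki links of the form [[target]] or [[target|label]].
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def translate_anchors(text, lexicon):
    """Replace each wiki link whose target is in `lexicon` by its translation.

    Assumptions (not confirmed by the source): the lookup key is the
    lowercased link target, and links without a lexicon entry are kept as-is.
    """
    def replace(match):
        target = match.group(1).strip().lower()
        translation = lexicon.get(target)
        if translation is None:
            return match.group(0)  # no entry: leave the link untouched
        return "[[" + translation + "]]"

    return WIKI_LINK.sub(replace, text)


# Hypothetical two-entry lexicon (source-language title -> English title).
toy_lexicon = {
    "družina vesta": "vesta family",
    "izsrednost": "eccentricity",
}

sentence = "Večina jih je v bližini [[družina Vesta|asteroidne družine Vesta]]."
translated = translate_anchors(sentence, toy_lexicon)
print(translated)
```

Only the linked spans change; the surrounding source-language words are kept, which is why the later scoring step compares a mixture of translated anchors and untranslated words.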
Similarity at the sentence level
Given a document pair $d_1$ and $d_2$ written in languages L1 and L2, respectively, where $\mathit{sentenceCount}(d_1) \le \mathit{sentenceCount}(d_2)$,² all sentences $(s_1, s_2, \ldots, s_m)$ in $d_1$ are paired with all sentences $(t_1, t_2, \ldots, t_n)$ in $d_2$. The similarity between a sentence pair $s_i$ and $t_j$ is calculated using the Jaccard coefficient:³

\[
\mathit{sentAlignScore}(s_i, t_j) = \frac{|\mathit{words}_{s_i} \cap \mathit{words}_{t_j}|}{|\mathit{words}_{s_i} \cup \mathit{words}_{t_j}|} \tag{6.1}
\]

where $\mathit{words}_{s_i}$ and $\mathit{words}_{t_j}$ represent the sets of unique words in sentences $s_i$ and $t_j$, respectively.
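As an illustration of Equation 6.1, the sketch below computes the Jaccard coefficient over the unique words of two sentences. The tokenisation (lowercasing and splitting on whitespace) and the function name `sent_align_score` are assumptions made for this example; the source does not specify how sentences are split into words.

```python
def sent_align_score(source_sentence, target_sentence):
    """Jaccard coefficient between the unique-word sets of two sentences.

    Assumption: words are obtained by lowercasing and whitespace splitting.
    """
    words_s = set(source_sentence.lower().split())
    words_t = set(target_sentence.lower().split())
    union = words_s | words_t
    if not union:
        return 0.0  # two empty sentences: treat as no similarity
    return len(words_s & words_t) / len(union)


# Two shared words out of four distinct words in total.
score = sent_align_score("the vesta family", "vesta family orbit")
print(score)  # 0.5
```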
After all sentence pairs have been scored, similar sentences are identified by aligning each sentence $s_i$ in document $d_1$ with the highest-scoring sentence in document $d_2$:

\[
\mathit{sentAlignScore}(s_i) = \max_{1 \le j \le n} \mathit{sentAlignScore}(s_i, t_j) \tag{6.2}
\]

I.e., for a sentence $s_i$, the highest-scoring sentence $t_j$ is selected as its alignment. The process then continues with the next sentence $s_{i+1}$ in $d_1$, and is repeated until all sentences in $d_1$ have been aligned. As mentioned in the previous section, many-to-one correspondences between sentences are allowed; an example of this is shown in Figure 6.5.

² I.e., $d_1$ has the same number of sentences as $d_2$, or fewer.

...
Večina jih je v bližini [[vesta family]].
Imajo podobne [[eccentricity]], toda njihova [[ellipsis]] leži v območju od 2,18 [[astronomical unit]] do 2,50 a. e. (kjer je [[Kirkwood gap]] 3 : 1).
...
(a) SL article
...
A large proportion have orbital elements similar to those of 4 Vesta, either close enough to be part of the [[vesta family]], or having similar [[eccentricity (orbit)]] and [[inclination]]s but with a [[semi-major axis]] lying between about 2.18 [[astronomical unit]] and the 3:1 [[kirkwood gap]] at 2.50 AU.
...
(b) EN article
Fig. 6.5 Example of SL-EN sentences paired by the anchor+word method
The author implemented a minimum similarity threshold to filter out irrelevant sentence pairs. If the score of the sentence pair is below the minimum threshold, the pairing information between both sentences is discarded. In this experiment, the author used a minimum threshold of 0.1, which was empirically determined by manually evaluating the similarity of sentence pairs in the evaluation corpus scored by this method.
A maximum threshold is also used in this method to reduce the noise caused by document pairs containing the same content. As discussed in Chapter 5, although such duplicated content may be perceived as similar, it does not represent high cross-lingual similarity, nor does it contain valuable cross-lingual resources, because the content is the same in both languages. In this study, the author used a maximum threshold of 1.0, i.e., sentence pairs that match exactly are discarded. The remaining sentence pairs are then used to measure the similarity of the document pair at the document level, described in the next section.
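The alignment and filtering steps described above can be sketched as follows. This is an illustrative reading, not the author's code: sentences are represented directly as word sets, the names `align_sentences`, `MIN_THRESHOLD` and `MAX_THRESHOLD` are hypothetical, and it is assumed that the thresholds are applied to the best-scoring candidate only (scores below 0.1 or equal to 1.0 are dropped).

```python
MIN_THRESHOLD = 0.1  # minimum threshold reported in the text
MAX_THRESHOLD = 1.0  # maximum threshold reported in the text


def jaccard(words_a, words_b):
    """Jaccard coefficient between two word sets."""
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def align_sentences(source, target, min_t=MIN_THRESHOLD, max_t=MAX_THRESHOLD):
    """Map each source sentence index to (best target index, score).

    Assumption: a pairing is kept only if min_t <= score < max_t.
    Several source sentences may map to the same target (many-to-one).
    """
    alignments = {}
    if not target:
        return alignments
    for i, words_s in enumerate(source):
        scores = [jaccard(words_s, words_t) for words_t in target]
        best_j = max(range(len(scores)), key=scores.__getitem__)
        if min_t <= scores[best_j] < max_t:
            alignments[i] = (best_j, scores[best_j])
    return alignments


# Toy word sets (hypothetical data, for illustration only).
source = [
    {"vesta", "family", "close"},             # partial overlap with target 0
    {"same", "text"},                         # identical to target 1
    {"unrelated"},                            # no overlap at all
    {"vesta", "family", "orbit", "similar"},  # also best matched to target 0
]
target = [{"vesta", "family", "orbit"}, {"same", "text"}]

alignments = align_sentences(source, target)
print(alignments)  # {0: (0, 0.5), 3: (0, 0.75)}
```

In this toy input the identical pair and the non-overlapping sentence are both filtered out, and two source sentences align to the same target sentence.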
Similarity at the document level
In the second stage, the scores of the remaining aligned sentence pairs are aggregated to represent the similarity of the document pair at the document level (docSimilarityScore). This section describes how this method was developed.
Firstly, a document pair (d1, d2) containing sentence pairs with higher alignment scores is considered to be more relevant than a document pair (d3, d4) containing the same number of sentence pairs with lower scores. Therefore, similarity can first be estimated by summing the alignment scores of the aligned sentences (sentAlignScore). This sum is referred to as totalSentAlignScores.
Secondly, a method is required to normalise totalSentAlignScores, as Wikipedia article lengths may vary significantly. Normalisation is often performed by taking into account the lengths of both documents. However, as shown in Section 5.3, the Wikipedia articles in a pair may differ in length. Furthermore, the shorter article may still contain content that strongly corresponds (i.e., content that is either highly similar or in a translation relation) to part of the longer article. Normalising by the length of the longer document, or by a combination of the two lengths, would penalise such article pairs. Therefore, in this experiment, the sentence alignment scores are normalised using the length of the shorter article instead.
The document similarity score in this experiment is calculated as follows:

\[
\mathit{docSimilarityScore} = \frac{\mathit{totalSentAlignScores}}{n} = \frac{\sum_{i=1}^{n} \mathit{sentAlignScore}_i}{n} \tag{6.3}
\]

where $\mathit{sentAlignScore}_i$ represents the sentence alignment score for a sentence $s_i$ in the shorter document (or 0 if the sentence is unpaired or has its alignment filtered out), and $n$ is the number of sentences in the shorter document.
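Equation 6.3 amounts to averaging the per-sentence alignment scores over the shorter document. A minimal sketch follows; the function name `doc_similarity_score` and the input scores are hypothetical, and it assumes the caller has already set the score to 0 for unpaired or filtered-out sentences.

```python
def doc_similarity_score(sent_align_scores):
    """Mean sentence alignment score over the shorter document.

    `sent_align_scores` holds one value per sentence of the shorter
    document, with 0.0 for sentences that are unpaired or filtered out.
    """
    n = len(sent_align_scores)
    if n == 0:
        return 0.0  # assumption: an empty document has no similarity
    return sum(sent_align_scores) / n


# Hypothetical shorter document with four sentences, two of them aligned.
score = doc_similarity_score([0.5, 0.0, 0.0, 0.75])
print(score)  # 0.3125
```

Because the divisor is the sentence count of the shorter document, a short article that matches part of a much longer one is not penalised for the longer article's extra content.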
6.3 Experiments
6.3.1 Language selection
In this experiment, the anchor+word method was used to measure similarity on eight language pairs. The source languages were German (DE), Greek (EL), Estonian (ET), Croatian (HR), Lithuanian (LT), Latvian (LV), Romanian (RO) and Slovenian (SL), each paired with English (EN). In the remainder of this chapter, the non-English languages are referred to as the ‘source’ languages, and English is referred to as the ‘target’ language.
6.3.2 Corpus
This experiment utilised the Wikipedia corpus gathered between November 2009 and March 2010. More information about this corpus is given in Section 3.6. Table 6.2 shows the number of interlanguage-linked articles for the eight language pairs used in this study.⁴
Although extracted in the same time period, the numbers of interlanguage-linked articles available in the corpus differed greatly between language pairs. The smallest language pair, LV-EN, has just over 21,000 pairs of interlanguage-linked articles, whilst the largest language pair, DE-EN, contains almost 30 times as many, with 637,382 pairs of interlanguage-linked articles.
⁴ The number of interlanguage-linked article pairs in each language pair was previously shown in Table 3.2 on page 82.
Table 6.2 Comparison of sizes across language pairs

Language pair   Total interlanguage-linked articles   Proportion of same titles   Bilingual lexicon size
DE-EN           637,382                               72%                         181,408
EL-EN           36,752                                23%                         28,294
ET-EN           42,008                                46%                         22,645
HR-EN           51,432                                48%                         26,804
LT-EN           57,954                                28%                         41,497
LV-EN           21,302                                27%                         15,511
RO-EN           97,815                                63%                         35,774
SL-EN           51,332                                51%                         25,101
The author also reports the proportion of interlanguage-linked articles that have the same (duplicate) title in both languages. These duplicate titles were removed prior to creating the bilingual dictionaries. The sizes of the resulting bilingual dictionaries are also shown in Table 6.2.
6.3.3 Evaluation
The anchor+word method was evaluated using two approaches. The first approach compared the anchor+word method to a similar approach utilising a machine translation system. The second approach evaluated the performance of the anchor+word method against a gold standard corpus. These approaches are described below.
The anchor+word method relies only on a bilingual lexicon extracted from Wikipedia to identify similarity across languages. In the first evaluation, the author analysed how well this approach would perform if better translation resources were used. To investigate this, the author developed a similar method that utilised Statistical Machine Translation (SMT) to perform the translation (instead of using Wikipedia as a translation resource). This method is referred to as the translation method. In this method, Google Translate⁵ was used to translate the 800 non-English documents in the evaluation corpus (Chapter 5) into English. After all non-English documents were translated, the similarity score of each document pair was calculated using the similarity identification method described in Section 6.2.4. As in the anchor+word method, similar sentences were aligned, and their scores were aggregated to represent the similarity score of the document pair. The correlation between the two approaches was evaluated using Spearman’s ρ (shown in Section 6.4.1).
In the second evaluation, the author evaluated how well the anchor+word method performs against the gold standard (i.e., the evaluation corpus). As previously described in Chapter 5, each of the 800 document pairs was assessed by two assessors. Q1 scores in the corpus contain the assessors’ answers to the following question: “How similar are the two documents?”, given on a 5-point Likert scale. For this analysis, the Q1 scores given by both assessors were averaged and used to represent the human-annotated score. Spearman’s ρ was then calculated between these human-annotated scores (average Q1 scores) and the scores given by the anchor+word method. As a comparison, the correlation between the gold standard and the translation method is also reported in this evaluation (shown in Section 6.4.2).
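The second evaluation can be illustrated with a small, dependency-free sketch of Spearman's ρ, computed as the Pearson correlation of average ranks (so tied scores share a rank). All data below are invented for illustration and the function names are hypothetical; in practice a library routine such as `scipy.stats.spearmanr` could be used instead.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented Q1 scores (5-point scale) from two assessors for five pairs.
assessor_a = [1, 3, 4, 5, 2]
assessor_b = [2, 3, 4, 5, 2]
human_scores = [(a + b) / 2 for a, b in zip(assessor_a, assessor_b)]

# Invented document similarity scores for the same five pairs.
method_scores = [0.05, 0.20, 0.35, 0.60, 0.10]

rho = spearman_rho(human_scores, method_scores)
print(round(rho, 3))
```

Spearman's ρ depends only on rank order, which suits this comparison: the Likert-scale averages and the similarity scores are on different scales, and only their agreement in ordering is being measured.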