CHAPTER II: METHODOLOGICAL PROPOSAL: TEXTILE TECHNICAL ANALYSIS
II.1. RAW MATERIALS
II.1.3. Processing and use in wefts and warps
Although a large number of studies have aimed to measure similarity between texts (previously described in Section 2.1), there is no consensus on the definition of similarity.
Instead, various studies have defined similarity differently depending on the task. Furthermore, the degrees of similarity have also been defined differently depending on various factors, such as the task (e.g., search tasks or semantic similarity tasks), the domain (e.g., Web articles or news articles), and the granularity of the application (e.g., the document level or the sentence level). In this section, the author reviews literature that has described the concept of ‘similarity’ and how the different levels of similarity have been defined.
In information retrieval tasks, similarity is often measured between a query and a document (Manning et al., 2008). Similarity is therefore defined as the relevance of a document to the given query. Human annotations of relevance have often used binary judgments (e.g., “not relevant” and “relevant”) or graded relevance judgments, such as the 3-point relevance scale used in the TREC Web Track (Voorhees, 2001). Milios, Zhang, He, and Dong (2003) also used a 3-point scale for identifying the relevance of a document to a query: 0 (“not related”), 0.5 (“somewhat related”) and 1 (“related”). In contrast, Paepcke, Garcia-Molina, Rodriguez-Mula, and Cho (2000) argued that the relevance of a document to a query is not the same as the similarity of the document to that query. Instead, relevance should be linked more closely to the “information value” that the document gives to its users.
Definitions of similarity have also been studied extensively in the news domain. For example, in work on topic detection and tracking, news articles were annotated based on their similarity to a particular topic (Allan, Carbonell, Doddington, Yamron, & Yang, 1998). In developing the TDT3 (Topic Detection and Tracking) corpus for this work, annotators were asked to annotate the similarity between 9,000 news articles and 120 topics, specifying whether each article was “on-topic” or “off-topic” for a given topic. News stories annotated as belonging to the same topic were then assumed to be similar and relevant, whilst the rest were considered irrelevant and dissimilar.
Braschler and Schäuble (1998), on the other hand, argued that a more fine-grained scheme is required to identify the similarity between two news articles and proposed five similarity classes to describe the different alignments of retrieved multilingual news documents. The first class, ‘Same story’, represents two documents covering exactly the same story or event (e.g., the presidential election results for the same candidate). Documents covering different yet related events (e.g., election results for two different candidates) are categorised into the second class, ‘Related story’. The third class, ‘Shared aspect’, represents two documents addressing multiple topics but sharing at least one of them (e.g., one document about updates on US politics, and another about the upcoming presidential election). Unrelated documents which share a large number of terms (e.g., one document about the US presidential election and another about the French presidential election) are categorised into the fourth class, ‘Common terminology’. Lastly, the class ‘Unrelated’ represents two documents with no apparent relation (e.g., one about the presidential election and one about vacation traffic in Germany). Like Braschler and Schäuble (1998), M. D. Lee et al. (2005) used a 5-point scale, in their case a Likert scale (1=highly unrelated, 5=highly related), to gather human annotations on the similarity of news articles. However, the different levels were not defined. In work on tracking similar news, Pouliquen et al. (2004) proposed a four-point scheme for defining similarity between news articles: “same news story”, “interlinked news story” (e.g., Madrid bombing vs Spanish decision to pull troops out of Iraq), “loosely connected story” (e.g., documentary on drinking vs alcohol policy), and “wrong link”.
Meanwhile, Barker and Gaizauskas (2012), whose work focused on identifying cross-lingual information between news articles, argued that news articles describing the same event could differ widely in content if they had different focal events (i.e., focus of the story). For example, articles describing a particular flood (the same news event) may have different focal points, such as the flood victims, the rescue efforts, or the disaster aid information. These differences directly affect the amount of content shared across the news texts. To accommodate these issues, Barker and Gaizauskas (2012) created a two-level news relatedness scheme that categorised articles based on both the news events and the focal events. A comparison between this scheme and the two previous schemes is summarised in Table 2.1.
Table 2.1 A comparison of similarity in the news domains

| Allan et al. (1998) | Braschler and Schäuble (1998) | Barker and Gaizauskas (2012) |
| --- | --- | --- |
| Similar and relevant represent documents that discuss the same topic. | Same story represents documents covering exactly the same story or event. | Same news events - same focal events represents documents covering the same news event and the same focus of the story. Same news events - different focal events represents documents covering the same news event but different focus of the story. |
| Dissimilar and irrelevant represent documents that discuss different topics. | Related story represents documents covering different yet related events. | Different news events (same type) - focal events (same type) represents documents describing different topics of the same type (e.g., news about different hurricanes) and the same type of focus of the story. Different news events (same type) - focal events (different type) represents documents describing different topics of the same type but having different focus of the story. |
| | Shared aspects represents documents addressing multiple topics but sharing at least one of them. | Different news events (different type) - related via background represents documents describing different news events but share the same background (e.g., the same previous events, people or places). |
| | Common terminology represents unrelated documents that still share a large number of terms. | |
| | Unrelated represents two documents with different topics and no apparent relation. | Different news events, different type - other represents articles |
ers as it focused specifically on identifying shared content across languages rather than the similarity of the topics discussed in the articles in general. Similar tasks have also been performed to identify similar Web articles for building comparable corpora to enhance resources for under-resourced languages (Maia, 2003; McEnery & Xiao, 2007; Munteanu et al., 2004; Skadiņa et al., 2012). In these tasks, similarity is measured cross-lingually for the purpose of retrieving alignable fragments from bilingual documents (e.g., translated sentences or words). This specific aspect of similarity is often referred to as comparability (Fung & Cheung, 2004; Tomás et al., 2008).¹
In assessing comparability between two documents, terms such as ‘parallel’ and ‘comparable’ have been used to represent the different proportions of translated sentences found in the document pair. ‘Parallel’ documents are defined as a pair of documents which have been translated sentence by sentence (Fung & Cheung, 2004; Skadiņa et al., 2012; Tomás et al., 2008). Fung and Cheung (2004) also used a comparability level named ‘noisy parallel’ to represent parallel documents with insertions or deletions that result in non-aligned sentences.
Documents which are similar yet do not correspond in a sentence-by-sentence translation, meanwhile, are often referred to as ‘comparable’ documents. The definitions of comparable documents, however, vary across studies. Tomás et al. (2008) described comparable documents as a pair of documents which were not parallel but contained some translated sentences. Fung and Cheung (2004), in contrast, defined comparable documents as documents with no aligned sentences but about the same topic. The ACCURAT (Analysis of Comparable Corpora for Under Resourced Languages for machine Translation) project² (Skadiņa et al., 2012) noted that comparable documents could be further categorised into two classes, namely ‘strongly comparable’ and ‘weakly comparable’. Strongly comparable documents were texts containing the same subject and having the same source, while weakly comparable documents represented texts describing different events but still in the same domain.

¹ Although the term ‘comparability’ has mostly been used for cross-lingual tasks, it has also been used to represent monolingually similar documents. For example, Barzilay and Elhadad (2003) referred to a corpus of rewriting examples in the same language as a monolingual comparable corpus.

Table 2.2 A comparison of comparability in Web articles

| Fung and Cheung (2004) | Tomás et al. (2008) | Skadiņa et al. (2012) |
| --- | --- | --- |
| Parallel represents texts which are translated sentence by sentence. | Parallel represents texts which are translated sentence by sentence (preserving the sentence order). | Parallel represents texts which are accurate translations, or approximate translations with some addition or omissions. |
| Noisy parallel represents texts which are mostly parallel but contain non-aligned sentences which may be caused by paragraph insertions or deletions. | Comparable describes texts that contain a noticeable number of translated sentences. | |
| Comparable describes texts which do not contain aligned sentences but are about the same topic. | Unspecified | Strongly comparable represents texts coming from the same source or containing the same subject. |
| Non parallel represents disparate bilingual documents which may or may not be in the same topic. | | Weakly comparable represents texts in the same domain but different events. |
| | | Not comparable |
Different terms have been used to categorise the least similar documents. Skadiņa et al. (2012) proposed a class named ‘not comparable’ to classify documents with no similarity. Meanwhile, Fung and Cheung (2004) named this category ‘non-parallel’ and defined it as dissimilar bilingual documents which may or may not be on the same topic. A comparison of these different comparability levels is shown in Table 2.2.
The terms ‘parallel’ and ‘comparable’ have also been used in representing degrees of similarity of sets of documents in a corpus (or corpora), rather than between two documents. Parallel corpora are identified as sets of parallel texts, i.e., bilingual texts which are translated sentence by sentence (Fung & Cheung, 2004; Skadiņa et al., 2012). Comparable corpora, on the other hand, have been defined differently in various studies. Zanettin (1998) defined comparable corpora as sets of bilingual texts which shared similar criteria of composition, genre and topic; meanwhile, Munteanu and Marcu (2005) defined comparable corpora not by the similarity of topics, but instead as “bilingual texts that, while not parallel in the strict sense, are somewhat related and convey overlapping information” (Munteanu & Marcu, 2005, p. 477).
Identifying similarity based on the proportion of similar sentences between documents has also been explored in the context of identifying near-duplicate monolingual articles. In this work, Cooper et al. (2002, p. 246) defined similar articles as “those in which a large percentage of the sentences, or words in the sentences, are the same”. They also defined duplicate documents as “ones that have essentially the same words in the same sentences and paragraphs” although they were allowed to be “in a somewhat different order” (Cooper et al., 2002, p. 246).
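The idea of judging similarity by the share of sentences two articles have in common can be illustrated with a minimal Python sketch. This is not the method of Cooper et al. (2002): the function names, the naive punctuation-based sentence splitter, and the choice to normalise by the shorter document are all assumptions made purely for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): '.', '!' or '?' ends a sentence.

    Returns a set of lower-cased, whitespace-normalised sentences, so
    sentence order and repeated sentences are ignored.
    """
    parts = re.split(r"[.!?]+", text)
    return {" ".join(p.lower().split()) for p in parts if p.strip()}


def shared_sentence_ratio(doc_a, doc_b):
    """Fraction of the shorter document's sentences found in the other one."""
    a, b = split_sentences(doc_a), split_sentences(doc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    first = "The cat sat. The dog ran. Birds fly."
    second = "Birds fly. The cat sat. Fish swim."
    print(shared_sentence_ratio(first, second))
```

Because the sentences are compared as sets, two documents containing the same sentences “in a somewhat different order” receive the maximum ratio. Any cut-off for what counts as “a large percentage” would have to be chosen separately; none is given in the text above.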
A similar scheme was proposed by Brants and Stolle (2002), who also differentiated the degrees of similarity between two texts based on the amount of syntactical similarity (i.e., overlap of words or sentences) shared between the texts. They referred to this concept as surface similarity. Unlike previous work, their work focused on measuring similarity between troubleshooting manuals for photocopiers for the purpose of reducing redundant information found in search. In this work, a three-point scale was used to describe the different degrees of surface similarity between two documents: ‘same’ to represent identical or almost identical documents, ‘similar’ to represent cases where one document may use different words or synonyms and a different order of sentences, and ‘different’ to represent cases where the texts were different. In the same work, they also proposed another dimension of similarity which took into account the semantic similarity within the documents; they referred to this dimension as conceptual similarity. A four-point scale was used for this dimension: ‘same’ (i.e., documents with (almost) the same contents, which may include paraphrasing), ‘similar’ (i.e., documents with significant overlap of conceptual contents, e.g., those offering different solutions for the same problem), ‘subset’ (i.e., where the content of one document is a subset of the other) and ‘different’ (i.e., conceptually different documents).
Similarity has also been measured at the sub-document level, such as between sentences (Agirre et al., 2012; Negri, Marchetti, Mehdad, Bentivogli, & Giampiccolo, 2012) and between words (Camacho-Collados et al., 2017). Similarity between sentences has been measured to represent the degree of semantic equivalence between the two sentences; this is represented using the term ‘semantic textual similarity’ (STS). Agirre et al. (2012) defined six different classes to represent the different levels of STS; each class and an example sentence pair are shown in Table 2.3.

Table 2.3 Semantic textual similarity levels (Agirre et al., 2012)

| Class | Definition | Example |
| --- | --- | --- |
| 5 | The two sentences are completely equivalent, as they mean the same thing. | 1) The bird is bathing in the sink. 2) Birdie is washing itself in the water basin. |
| 4 | The two sentences are mostly equivalent, but some unimportant details differ. | 1) In May 2010, the troops attempted to invade Kabul. 2) The US army invaded Kabul on May 7th last year, 2010. |
| 3 | The two sentences are roughly equivalent, but some important information differs/missing. | 1) John said he is considered a witness but not a suspect. 2) “He is not a suspect anymore.” John said. |
| 2 | The two sentences are not equivalent, but share some details. | 1) They flew out of the nest in groups. 2) They flew into the nest together. |
| 1 | The two sentences are not equivalent, but are on the same topic. | 1) The woman is playing the violin. 2) The young lady enjoys listening to the guitar. |
| 0 | The two sentences are on different topics. | 1) John went horseback riding at dawn with a whole group of friends. 2) Sunrise at dawn is a magnificent view to take in if you wake up early enough for it. |

Another work, however, focused on identifying similarity at the sentence level from the perspective of ‘textual entailment’ (Negri et al., 2012). Here, the aim is to define the directional relationship between two sentences, i.e., whether one text entails the other. Four different relations were used: “forward”, “backward”, “bidirectional” and “no entailment”.
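The four entailment relations can be read as the combinations of two yes/no judgements, one per direction. The Python sketch below illustrates that reading only; the names are hypothetical, and it assumes that “forward” means the first sentence entails the second and “backward” the reverse, which the text above does not spell out.

```python
from enum import Enum


class Entailment(Enum):
    """The four relation labels named in the text."""
    FORWARD = "forward"
    BACKWARD = "backward"
    BIDIRECTIONAL = "bidirectional"
    NO_ENTAILMENT = "no entailment"


def entailment_relation(first_entails_second, second_entails_first):
    """Combine two directional judgements into one of the four labels.

    Assumption: 'forward' is first -> second, 'backward' is second -> first.
    """
    if first_entails_second and second_entails_first:
        return Entailment.BIDIRECTIONAL
    if first_entails_second:
        return Entailment.FORWARD
    if second_entails_first:
        return Entailment.BACKWARD
    return Entailment.NO_ENTAILMENT


if __name__ == "__main__":
    print(entailment_relation(True, False).value)
```

Unlike a graded similarity scale, this labelling is not ordered: “forward” and “backward” are not more or less similar than each other, they differ only in direction.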
A 5-point Likert scale has been used to describe semantic similarity at the word level, for the purpose of a cross-lingual and multilingual semantic word similarity task across 5 languages (English, Farsi, German, Italian and Spanish) (Camacho-Collados et al., 2017). The different similarity levels proposed in this work are shown in Table 2.4.

Table 2.4 Word similarity levels (Camacho-Collados et al., 2017)

| Class | Definition |
| --- | --- |
| Very similar | The two words are synonyms. |
| Similar | The two words share many of the important ideas of their meaning but include slightly different details (e.g., “lion-zebra” or “firefighter-policeman”). |
| Slightly similar | The two words do not have a very similar meaning but share a common domain (e.g., “house-window” or “airplane-pilot”). |
| Dissimilar | They describe clearly dissimilar concepts but may share some small details, a far relationship or a domain in common. These words are also likely to be found together in a longer document in the same topic (e.g., “software-keyboard” or “driver-suspension”). |
| Totally dissimilar and unrelated | The words do not mean the same thing and are not on the same topic (e.g., “Playstation-monarchy”). |
The literature described in this section has illustrated that there is no consensus on the definition of similarity. Instead, various studies have defined different classes or categories to represent the varying degrees of similarity based on their research aims (Barker & Gaizauskas, 2012; Braschler & Schäuble, 1998; Fung & Cheung, 2004; Skadiņa et al., 2012; Tomás et al., 2008). Although a large number of studies have proposed different similarity schemes for news domains and Web articles, no available schemes have been specifically developed for Wikipedia articles.