Figure 1.1 shows the different components investigated in this study and how they relate to each other. These components can be categorised into two main tasks. The first is to understand similarity in Wikipedia. In this first task, the author reviewed related work in the area (Chapter 2) and carried out an initial study on Wikipedia similarity (Chapter 4). Based on these findings, the author created an annotation task to gather human judgments on similarity (Chapter 5). A pilot study was conducted before the final study, in which human judgments for 800 document pairs in 8 language pairs were gathered. These annotations are referred to as the evaluation corpus.

The second task in this study is to develop approaches to measure cross-lingual similarity. The findings from the first task were used to inform a set of features that are valuable for measuring cross-lingual similarity in Wikipedia. Four different experiments were then carried out to develop approaches using different selections of these features (Chapters 6-9). The performance of these approaches was evaluated against the evaluation corpus. The remainder of the thesis discusses these findings and how they relate to the existing literature (Chapter 10). Finally, the last chapter concludes the work (Chapter 11).

Related Work

Chapter 1 described the motivation for developing methods to compute cross-lingual similarity in Wikipedia. In this chapter, the author reviews previous studies that have been conducted in this area.

First, the author reviewed studies aimed at measuring similarity, in order to identify the different tasks that rely on measuring similarity (Section 2.1) and how similarity was defined in previous work (Section 2.2). Previous literature has also developed approaches to measure similarity between texts written in the same language (monolingual similarity) and between texts written in different languages (cross-lingual similarity). The author reviewed these monolingual and cross-lingual similarity approaches in Section 2.3 and Section 2.4, respectively.

The aim of this thesis is to measure similarity in Wikipedia. To further identify applications that benefit from measuring similarity in Wikipedia, the author reviewed previous work that utilised Wikipedia as a linguistic resource in Section 2.5. Previous studies that specifically analysed the degree of similarity (and dissimilarity) in Wikipedia are then highlighted in Section 2.6. Finally, the gap in the literature that this work aims to fill is identified in Section 2.7.

2.1 Measuring similarity

Measuring similarity between texts is an important task for many fields, such as information retrieval (Manning et al., 2008), plagiarism detection (Maurer et al., 2006), clustering (Bigi, 2003; A. Huang, 2008) and text classification (Wu et al., 2017). These different tasks, however, require similarity to be measured at different granularities. In information retrieval, for example, similarity measures are used to compute the relevance between a query (usually a few words) and a collection of documents in order to retrieve the documents most relevant to the query (Manning et al., 2008). Similarity between sentences is investigated for the purpose of identifying text reuse, both monolingual (Clough et al., 2002; Hoad & Zobel, 2003; Maurer et al., 2006; Shivakumar & Garcia-Molina, 1995) and cross-lingual (Potthast, Barrón-Cedeño, Stein, & Rosso, 2011). Sentence similarity methods have also been investigated for the purpose of identifying paraphrases (Mihalcea, Corley, & Strapparava, 2006), semantic textual similarity between a pair of sentences (Agirre et al., 2012; Bär, Biemann, Gurevych, & Zesch, 2012) and textual entailment (Dagan, Glickman, & Magnini, 2006). The textual entailment task aims to identify whether the information in one sentence can be inferred from the information in another sentence, and has been utilised for content synchronisation between two document versions (Mehdad, Negri, & Federico, 2010; Vilarino, Pinto, Tovar, León, & Castillo, 2012; Wäschle & Fendrich, 2012). Identifying similar sentences is also valuable for summarising text, in order to identify and include diverse information in the summary (Do, Roth, Sammons, Tu, & Vydiswaran, 2009).
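The information retrieval use case described above, in which a short query is scored against a collection of documents, can be illustrated with a minimal sketch. It assumes cosine similarity over raw term counts as the relevance measure. This is an illustrative choice and not necessarily the measure used in the cited works, and the function names and example documents are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return document indices ordered from most to least similar to the query."""
    q = term_counts(query)
    scored = [(cosine(q, term_counts(doc)), i) for i, doc in enumerate(documents)]
    # Sort by descending score; ties keep the original document order.
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical toy collection, for illustration only.
docs = [
    "plagiarism detection compares student essays",
    "cross-lingual similarity of wikipedia articles",
    "clustering news articles by topic",
]
ranking = rank("wikipedia similarity", docs)
```

Here only the second document shares terms with the query, so it is ranked first. The same scoring function could be applied at the sentence level, which is the granularity used in the text reuse and paraphrase tasks mentioned above.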

Meanwhile, tasks such as clustering or classification often measure similarity between documents instead (A. Huang, 2008; Wu et al., 2017). Measuring similarity at the document level has also been carried out in the news domain, for example to track related news articles, both monolingually (M. D. Lee, Pincombe, & Welsh, 2005) and cross-lingually (Pouliquen, Steinberger, Ignat, Käsper, & Temnikova, 2004). Document similarity has also been measured specifically for academic publications (Elsayed, Lin, & Oard, 2008; Lakkaraju, Gauch, & Speretta, 2008; Trivison, 1987) in order to suggest similar publications to readers and to investigate relations between cited and citing articles. Its application in the Web domain has also been researched for the purpose of finding similar documents (Cooper, Coden, & Brown, 2002), near-duplicate documents (Shivakumar & Garcia-Molina, 1995) and translated documents on the Web (Resnik & Smith, 2003).

This last work aimed to create bilingual parallel corpora for use as translation resources. However, in the past decades, research has progressed to measuring cross-lingual similarity for the purpose of finding similar (yet non-parallel) articles across languages in order to build corpora of comparable documents, more frequently referred to as comparable corpora (Maia, 2003; Skadiņa et al., 2012). Like parallel corpora, comparable corpora have also been utilised as translation resources, because they are more widely available than parallel corpora for under-resourced languages and domains.

Approaches to measure similarity (which are further reviewed in Section 2.3 and Section 2.4) can rely on measuring syntactical similarity between two texts, such as by measuring the overlap of words between the texts. However, many have identified limitations of these methods, since similar texts may not use the same words. A large number of studies have aimed at identifying semantic similarity (i.e., similarity of meanings) between words or concepts (J. J. Jiang & Conrath, 1997; Y. Jiang, Zhang, Tang, & Nie, 2015; Kandola, Cristianini, & Shawe-Taylor, 2003; Lakkaraju et al., 2008; Pedersen, Patwardhan, & Michelizzi, 2004; Taieb, Aouicha, & Hamadou, 2014). These approaches often require the use of a lexical database, such as WordNet (Miller, 1995), or a large corpus from which to learn the co-occurrence between related words or concepts (Agirre et al., 2009). These approaches have been further utilised in measuring the semantic similarity between two texts (Wan & Peng, 2005).
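The word-overlap idea and its limitation can be made concrete with a minimal sketch. It assumes the Jaccard coefficient over sets of lowercased word tokens as the overlap measure, which is one common choice and not necessarily the formulation used in the works cited above. The function names and example sentences are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def word_overlap(text_a, text_b):
    """Jaccard coefficient: shared words divided by all distinct words."""
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    union = words_a | words_b
    if not union:
        return 0.0  # both texts are empty
    return len(words_a & words_b) / len(union)


# Near-identical wording: 5 shared words out of 6 distinct words.
same_words = word_overlap("the cat sat on the mat", "the cat sat on a mat")

# A paraphrase with no shared words scores zero despite the similar meaning.
paraphrase = word_overlap("the film was excellent", "a superb movie")
```

The second pair illustrates the limitation noted above: the two sentences have a similar meaning but no word in common, so a purely overlap-based measure assigns them a score of zero. Semantic approaches address this by relating words such as "film" and "movie" through a lexical database or corpus statistics.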