We used three different datasets in the experiments. The first two, Multext and JRC-Acquis, are parallel corpora that are used for mate retrieval. The third one, the TEL dataset, was published as a corpus for an ad-hoc retrieval challenge. In the following, we describe these datasets in detail.
Multext and JRC-Acquis Datasets. We use the following two parallel corpora in our experiments:
IV.5. EXPERIMENTS 91
<div type=RECORD id="FXAC93006ENC.0001.01.00"> ...
<div type="Q"> ...
Subject: The staffing in the Commission of the European Communities
...
Can the Commission say:
1. how many temporary officials are working at the Commission?
2. who they are and what criteria were used in selecting them?
... </div>
<div type="R"> ...
1 and 2. The Commission will send tables showing the number of temporary staff working for the Commission directly to the Honourable Member and to Parliament’s Secretariat.
... </div> </div>
Figure IV.8: Example record of the Multext dataset.
• Multext⁶, consisting of 3,152 question/answer pairs from the Official Journal of the European Community (JOC).
• JRC-Acquis⁷, consisting of 7,745 legislative documents of the European Union.
Both corpora contain manually translated equivalents of each document in English, German, French and Spanish. In our experiments, we applied a preprocessing pipeline commonly used in IR systems, consisting of stop word removal, normalization and stemming (see Section II.1). In particular, we first eliminated stop words (for example "and", "or", "the") and extremely short terms (length < 3). Then we substituted special characters to obtain a more consistent representation of terms (for example ä → a or é → e). Finally, we applied stemming to all words using the Snowball stemmer for the respective language.
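The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the stop word list is a three-word placeholder, lowercasing and the regular-expression tokenizer are assumptions, and the stemmer is passed in as a callable (identity by default) so that a Snowball stemmer for the respective language can be plugged in.

```python
import re
import unicodedata

# Placeholder stop word list (assumption); a real system uses a full list per language.
STOP_WORDS = {"and", "or", "the"}


def strip_diacritics(term):
    """Map accented characters to their base letter, e.g. 'ä' -> 'a', 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text, stop_words=STOP_WORDS, min_length=3, stem=lambda t: t):
    """Apply the steps in the order given in the text.

    1. drop stop words and terms shorter than min_length,
    2. substitute special characters,
    3. stem each remaining term (identity stemmer unless one is supplied).
    """
    tokens = re.findall(r"\w+", text.lower())  # tokenizer is an assumption
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_length]
    normalized = [strip_diacritics(t) for t in kept]
    return [stem(t) for t in normalized]
```

A language-specific stemmer would be supplied through the `stem` parameter, for example `preprocess(text, stem=my_stemmer)`, where `my_stemmer` is a hypothetical callable wrapping a Snowball implementation.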
Figure IV.8 contains the English version of a sample record of the Multext dataset, which is also available in all other languages. Each document consists of
⁶ http://aune.lpl.univ-aix.fr/projects/MULTEXT/ (last accessed April 8, 2011)
⁷ http://langtech.jrc.it/JRC-Acquis.html (last accessed April 8, 2011)
Record  Title or Subject                                               Annotation Terms
1       Strength, fracture and complexity: an international journal.   Fracture mechanics, Strength of materials
2       Studies in the anthropology of North American indians series.  -
3       Lehrbuch des Schachspiels und Einführung in die Problemkunst.  Chess
Table IV.6: Example records of the TEL dataset.
Field        Description                         BL     ONB    BNF
title        The title of the document           1      .95    1.05
subject      Keyword list of contained subjects  2.22   3.06   0.71
alternative  Alternative title                   .11    .50    0
abstract     Abstract of the document            .002   .004   0
Table IV.7: Average frequency of content fields of the TEL library catalog records. Each record may contain several fields of the same type.
questions posted to the European Commission and the answers to these questions.
TEL Dataset. The TEL dataset was provided by the European Library in the context of the CLEF 2008/2009 ad-hoc track. This dataset consists of library catalog records of three libraries: the British Library (BL) with 1,000,100 records, the Austrian National Library (ONB) with 869,353 records and the Bibliothèque Nationale de France (BNF) with 1,000,100 records. While the BL dataset contains a majority of English records, the ONB dataset of German records and the BNF dataset of French records, all collections also contain records in multiple languages.
All of these records consist of content information together with meta information about the publication. The title of the record is the only content information that is available for all records. Some records additionally contain some annotation terms.
This dataset is challenging for IR tasks in different ways. Firstly, the text of the records is very short, only a few words for most records. Secondly, the dataset consists of records in different languages and retrieval methods need to consider relevant documents in all of these languages.
Table IV.6 shows the content information of some records of the BL dataset (the English part of TEL). As can be seen in these examples, each record consists of fields, which in turn may be in different languages. Not all of these fields describe the content of the record; some contain metadata such as the publisher name or the year of publication.
BL                       ONB                        BNF
Lang     Tag    Det      Lang       Tag    Det      Lang     Tag    Det
English  61.8%  76.7%    German     69.6%  80.9%    French   56.4%  77.6%
French    5.3%   4.0%    English    11.9%   8.0%    English  12.9%   8.2%
German    4.1%   2.9%    French      2.8%   2.1%    German    4.1%   3.8%
Spanish   3.1%   2.0%    Italian     1.8%   1.5%    Italian   2.3%   1.4%
Russian   2.7%   1.7%    Esperanto   1.5%   1.5%    Spanish   2.0%   1.4%
Table IV.8: Distribution of the five most frequent languages in each dataset, based on the language tags (Tag) and on the language detection model (Det).
fied all potential content fields. Table IV.7 contains a list of the selected fields and the average count of each field for a record. Further, we reduced additional noise by removing non-content terms like constant prefix or suffix terms from fields, for example the prefix term Summary in abstract fields.
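The field statistics and the prefix cleanup mentioned above can be illustrated with a short sketch. The record representation (a list of `(field_name, value)` pairs, so that a field type may occur several times) and both function names are assumptions made for this example; the actual catalog format is not shown in the text.

```python
from collections import Counter

# Content fields listed in Table IV.7.
CONTENT_FIELDS = ("title", "subject", "alternative", "abstract")


def average_field_counts(records, fields=CONTENT_FIELDS):
    """Average number of occurrences of each content field per record.

    Each record is assumed to be a list of (field_name, value) pairs.
    """
    totals = Counter()
    for record in records:
        for name, _value in record:
            if name in fields:
                totals[name] += 1
    n = len(records)
    return {name: (totals[name] / n if n else 0.0) for name in fields}


def strip_constant_prefix(value, prefixes=("Summary",)):
    """Remove a constant non-content prefix term such as 'Summary' from a field value."""
    words = value.split(None, 1)
    if words and words[0].rstrip(":.-") in prefixes:
        return words[1] if len(words) > 1 else ""
    return value
```

For instance, `strip_constant_prefix("Summary: A study of chess problems")` yields the abstract text without the leading term, while values that do not start with a listed prefix are returned unchanged.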
In order to be able to use the library catalog records as multilingual documents, we also had to determine the language of each field. Our language detection approach is first based on the language tags provided in the dataset. These tags are present for all records of the BL dataset, for 90% of the ONB dataset and for 82% of the BNF dataset. The language distribution in the different datasets based on the language tags is presented in Table IV.8 in the column Tag.
However, there are several problems with language tags in the TEL dataset. Our analysis of the datasets showed that relying merely on the language tags introduces many errors in the language assignment. Firstly, there are records tagged with the wrong language. Secondly, as there is only one tag per record, language detection based on tags is not adequate for records containing fields in different languages. We therefore applied a language classifier to determine the language of each field of the records in the TEL dataset.
In order to identify the language of each field, we use a language detection approach based on character n-gram models. The probability distributions of character sequences of size n are used to classify text into a set of languages. We used a classifier provided by the LingPipe Identification Tool⁸, which was trained on corpora in different languages. As training data, we used the Leipzig Corpora Collection⁹, which contains texts collected from the Web and from newspapers, and the JRC-Acquis dataset, which consists of documents published by the European Union and their translations into various languages.
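To make the idea of character n-gram classification concrete, the following toy classifier scores a text under per-language n-gram counts with add-one smoothing and picks the best-scoring language. It is a simplified sketch written for illustration only: the class and method names are hypothetical, the model treats n-grams as independent events, and it does not reproduce the LingPipe classifier or its API.

```python
import math
from collections import Counter


class CharNGramClassifier:
    """Toy character n-gram language classifier with add-one smoothing."""

    def __init__(self, n=5):
        self.n = n
        self.counts = {}    # language -> Counter of n-grams
        self.totals = {}    # language -> number of n-grams seen in training
        self.vocab = set()  # all n-grams seen in any language

    def _ngrams(self, text):
        # Left-pad with spaces so that the first characters also form n-grams.
        padded = " " * (self.n - 1) + text.lower()
        return [padded[i:i + self.n] for i in range(len(padded) - self.n + 1)]

    def train(self, language, text):
        grams = self._ngrams(text)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.vocab.update(grams)

    def log_score(self, language, text):
        counts, total = self.counts[language], self.totals[language]
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen n-grams
        return sum(math.log((counts[g] + 1) / (total + v)) for g in self._ngrams(text))

    def classify(self, text):
        # Return the language whose n-gram statistics best explain the text.
        return max(self.counts, key=lambda lang: self.log_score(lang, text))
```

In use, one would call `train` once per language with a training text and then `classify` on each field. Very short inputs contain few n-grams and therefore give little evidence, which is why text length matters for accuracy.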
We conducted multiple tests in order to verify the effectiveness of the language detection model. The results showed that a 5-gram model trained on 100,000 characters gave the best results. The classifier achieves an accuracy of more than 97% for text containing more than 32 characters. As this is the case for most fields in the TEL dataset, the classifier is applicable to the language detection task in our framework.
⁸ http://alias-i.com/lingpipe/ (last accessed April 8, 2011)
⁹ http://corpora.uni-leipzig.de/ (last accessed April 8, 2011)
Our language detection model determines the language of each field based on evidence from tags and from text-based classification. Table IV.8 contains the language distribution in the TEL datasets based on the detection model in column Det. There are significant differences compared to the language assignment using only language tags, which clearly motivates the application of the language detection model. This step will probably also help to improve results in the retrieval task, as the retrieval models rely on correct language assignments.
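The text does not specify how tag evidence and classifier evidence are combined. One plausible rule, shown here purely as an assumption, is to trust the classifier when a field is longer than 32 characters (the length above which the reported accuracy exceeds 97%) and to fall back to the record-level tag otherwise. The function name, its parameters and the fallback order are hypothetical.

```python
def detect_field_language(field_text, record_tag, classify, min_chars=32):
    """Choose a language for one field from the record tag and a text classifier.

    Assumed rule, not taken from the text: prefer the classifier for fields
    longer than min_chars, otherwise prefer the record-level tag if present.
    `classify` is any callable that maps a string to a language label.
    """
    text = field_text.strip()
    if len(text) > min_chars:
        return classify(text)   # enough text: rely on the classifier
    if record_tag is not None:
        return record_tag       # short text: fall back to the record tag
    return classify(text) if text else None  # no tag: best effort, or unknown
```

Because the language is decided per field rather than per record, a record tagged as English can still have a German title assigned to German when the title is long enough to classify.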