

The evaluation of information retrieval systems focuses on what is called the processing level [40]: the performance of specific algorithms and techniques is compared in isolation from the larger systems of which they are part and from the users who employ them. These experiments measure the effectiveness of a retrieval algorithm using a test collection, composed of a set of documents, a set of query topics and a set of relevance judgements which map topics to the documents relevant to them. Test collections are the basic tools for conducting repeatable experiments on retrieval algorithms [41].

Relevance judgements for a test collection are difficult to gather for several reasons. First, for modern collections containing millions of documents, it is not feasible to assess every document with respect to each topic. One solution is pooling: a method through which a fraction of the collection is selected for assessment. If the subset examined contains a representative sample of the relevant documents, the pooling method closely approximates the results of assessing the entire collection.

Since retrieval systems attempt to rank documents according to their probability of relevance, the highest-ranking documents produced by effective retrieval systems should be good candidates for inclusion in the assessment pool, even though some relevant documents may be missed [42]. This assumption is the basis of how relevance judgements are created for TREC collections [43].

In TREC, each participating system reports the 1000 top-ranked documents for each topic. Of these, the top 100 from each system are collected into a pool for assessment. In his analysis of the TREC results, Zobel concluded that the results obtained with a pool depth of 100 are reliable even though many relevant documents may be missed. Moreover, the relative performance among systems changed little when the pool depth was limited to 10, although actual precision scores did change for some systems [44].
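The pooling step described above can be illustrated with a minimal sketch. The function name `build_pool`, the system names and the document identifiers are hypothetical and chosen only for illustration; the sketch handles a single topic and is not the actual TREC tooling.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int = 100) -> Set[str]:
    """Return the union of the top-`depth` documents of every run for one topic.

    `runs` maps a (hypothetical) system name to its ranked list of document ids.
    """
    pool: Set[str] = set()
    for ranking in runs.values():
        # Only the first `depth` documents of each ranking enter the pool.
        pool.update(ranking[:depth])
    return pool


# Toy rankings for a single topic (illustrative data, not TREC results).
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d3", "d5", "d1", "d6"],
    "system_c": ["d7", "d1", "d2", "d8"],
}

shallow_pool = build_pool(runs, depth=2)
deep_pool = build_pool(runs, depth=4)
print(sorted(shallow_pool))
print(len(deep_pool))
```

Only the documents in the pool are shown to assessors; documents outside it are treated as not relevant, which is why a shallower pool can miss relevant documents.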

A second challenge in creating relevance judgements is that people can disagree about, or assess differently, the same document. Even though, as Saracevic states, we intuitively understand quite well what relevance means, many factors influence a person’s assessment of relevance [45]. Moreover, even a single individual may be inconsistent in judging relevance. This issue has received attention from the research community. Harter found that few experimental studies had examined assessor disagreement. He also noticed that many studies assume that assessor disagreement has little influence on the relative effectiveness of systems [46].

Analysing three independent sets of relevance judgements from TREC-4 and TREC-6, Voorhees found that, despite a low average overlap between assessment sets and a wide variation in the overlap among particular topics, the relative ranking of systems remained largely unchanged across the different sets of relevance judgements [47]. Furthermore, hybrid sets of judgements were created for the TREC-4 analysis by choosing each possible combination of judgements from the three assessors over all 49 topics. On average, the Kendall’s τ correlation¹ between the rankings of systems obtained with these assessments and the actual TREC ranking was 0.938.
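To make the correlation measure concrete, the following sketch computes Kendall’s τ between two rankings of the same systems as the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. It assumes rankings without ties; the function name and the system names are hypothetical.

```python
from itertools import combinations
from typing import List


def kendall_tau(ranking_a: List[str], ranking_b: List[str]) -> float:
    """Kendall's tau between two tie-free rankings of the same items."""
    if len(ranking_a) < 2 or len(set(ranking_a)) != len(ranking_a):
        raise ValueError("need at least two distinct items")
    if set(ranking_a) != set(ranking_b) or len(ranking_a) != len(ranking_b):
        raise ValueError("both rankings must contain the same items")

    position_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = 0
    discordant = 0
    # combinations() yields pairs (x, y) with x ranked above y in ranking_a.
    for x, y in combinations(ranking_a, 2):
        if position_b[x] < position_b[y]:
            concordant += 1
        else:
            discordant += 1

    n_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return (concordant - discordant) / n_pairs


official = ["s1", "s2", "s3", "s4"]
alternative = ["s1", "s3", "s2", "s4"]  # one adjacent pair swapped
print(kendall_tau(official, alternative))
```

The number of discordant pairs equals the minimum number of adjacent swaps needed to turn one ranking into the other, so identical rankings give 1 and fully reversed rankings give -1.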

To overcome the difficulty of creating relevance judgements, researchers proposed various automatic methods to compare the retrieval effectiveness of IR systems [41], [48]. Chowdhury [48] mined search engine logs to generate queries and used documents from the Open Directory Project (ODP)² to form query-document pairs. Afterwards, the queries were issued to several search engines and the rank at which each engine returned the document was computed. The score for a search engine was the mean reciprocal rank over all query-document pairs. Moreover, the query-document pairs need to be reasonable and unbiased.
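The mean reciprocal rank score can be sketched as follows. The data layout, the function names and the convention of scoring a query as 0 when the engine does not return the target document are assumptions made for this illustration, not details taken from [48].

```python
from typing import Dict, List


def reciprocal_rank(ranking: List[str], target: str) -> float:
    """1 / rank of `target` in `ranking` (1-based), or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id == target:
            return 1.0 / rank
    return 0.0  # assumed convention for a target that was not returned


def mean_reciprocal_rank(results: Dict[str, List[str]],
                         targets: Dict[str, str]) -> float:
    """Average reciprocal rank over all query-document pairs in `targets`."""
    if not targets:
        raise ValueError("at least one query-document pair is required")
    total = sum(reciprocal_rank(results.get(query, []), doc)
                for query, doc in targets.items())
    return total / len(targets)


# Hypothetical query-document pairs and the results of one engine.
targets = {"q1": "d1", "q2": "d5", "q3": "d9"}
engine_results = {
    "q1": ["d1", "d2", "d3"],  # target at rank 1 -> 1.0
    "q2": ["d4", "d5", "d6"],  # target at rank 2 -> 0.5
    "q3": ["d7", "d8"],        # target missing   -> 0.0
}
print(mean_reciprocal_rank(engine_results, targets))
```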

Soboroff asked the following question: how much would system rankings change if relevant documents were chosen randomly from the pool [41]? So, if relevant documents occur in a pool of retrieved documents according to some distribution and if human assessors can disagree widely without affecting relative system performance, can the occurrence of relevant documents be modelled? He concluded that the difference among human judges is not randomly distributed with respect to their impact on precision scores, but instead is concentrated on fringe cases which do not affect many cases.

Furthermore, disagreement about the number of relevant documents does not seem to have a net impact on system rankings at all, probably because having more relevant documents benefits most systems uniformly. Lastly, with simple sampling models and minimal effort, the method provides a good first approximation of the results.
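The idea of ranking systems against randomly chosen "relevant" documents can be sketched as below. This is a deliberately simplified illustration: the fixed sampling fraction, the use of precision at a cutoff, and all names and data are assumptions, and the sampling model in [41] may differ.

```python
import random
from typing import Dict, List, Set


def sample_pseudo_qrels(pool: Set[str], fraction: float,
                        rng: random.Random) -> Set[str]:
    """Mark a random `fraction` of the pooled documents as relevant."""
    size = round(len(pool) * fraction)
    # Sort first so the sample depends only on the seed, not on set ordering.
    return set(rng.sample(sorted(pool), size))


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k documents that are in `relevant`."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


def rank_systems(runs: Dict[str, List[str]], relevant: Set[str],
                 k: int) -> List[str]:
    """Order systems by decreasing precision@k, breaking ties by name."""
    scores = {name: precision_at_k(r, relevant, k) for name, r in runs.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))


# Toy runs for one topic (illustrative data only).
toy_runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d3", "d5", "d1", "d6"],
    "system_c": ["d7", "d1", "d2", "d8"],
}
toy_pool = {doc_id for ranking in toy_runs.values() for doc_id in ranking}

pseudo_qrels = sample_pseudo_qrels(toy_pool, fraction=0.25,
                                   rng=random.Random(0))
print(rank_systems(toy_runs, pseudo_qrels, k=2))
```

Repeating the sampling with different seeds and averaging the scores gives a system ranking that can then be compared with the ranking obtained from human judgements, for example with Kendall’s τ.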

Because academic search was only recently included in the TREC OpenSearch track, there are not many open collections for evaluating the performance of such systems. One attempt to create an academic test collection can be found in [49], where Harpale et al. gathered a set of articles from CiteSeer³ and CiteULike⁴ together with manual personalized queries and relevance judgements. In this chapter we assess whether automatic methods for query generation and relevance judgement can be used to automatically evaluate academic search engines.

¹ Kendall’s τ correlation is a function of the minimum number of pairwise swaps required to turn one ranking into another.
² http://dmoztools.net
³ http://citeseer.ist.psu.edu
⁴ http://www.citeulike.org
