In the previous sections, we introduced various retrieval models for an IR system. In order to compare such models, evaluation metrics of effectiveness need to be defined.
The most common effectiveness measures are precision and recall, which were introduced in the Cranfield study [119]. Given a document collection and a few queries, it is assumed that assessors can provide relevance judgements that map each query to all the relevant documents in the collection, which is practical in small collections. To compare performance, the queries are issued to the competing retrieval models, a result set is retrieved for each model, and every result is either relevant or non-relevant. We denote the set of relevant documents for the query by $A$, the set of all retrieved documents by $B$, the non-relevant set by $\bar{A}$ and the non-retrieved set by $\bar{B}$.
                 Relevant              Non-Relevant
Retrieved        $A \cap B$            $\bar{A} \cap B$
Non-Retrieved    $A \cap \bar{B}$      $\bar{A} \cap \bar{B}$
Table 2.1: Different possible outcomes in the collection.
If we regard Relevant as Positive and Non-Relevant as Negative, $A \cap B$ is the set of true positives (TP), as the retrieved results are relevant; $\bar{A} \cap B$ is the set of false positives (FP), because the retrieved results are non-relevant; $A \cap \bar{B}$ is the set of false negatives (FN), as relevant data is considered non-relevant by the system; and $\bar{A} \cap \bar{B}$ is the set of true negatives (TN), because non-relevant data is correctly recognized by the system. Precision and recall are defined as:
\[ \mathrm{Precision} = \frac{|A \cap B|}{|B|} = \frac{|TP|}{|TP| + |FP|} \tag{2.7} \]
\[ \mathrm{Recall} = \frac{|A \cap B|}{|A|} = \frac{|TP|}{|TP| + |FN|} \tag{2.8} \]
In this scenario, a retrieval model can be viewed as a binary classifier that distinguishes between relevant and non-relevant results from the whole collection. Precision and recall are widely used in practice when evaluating the performance of a binary classifier.
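As a minimal sketch of Eqs. (2.7) and (2.8), the set definitions can be computed directly with Python sets. The function name `precision_recall`, the document ids, and the choice to return 0.0 for an empty set are assumptions made for illustration only.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for one query from the sets A and B."""
    relevant, retrieved = set(relevant), set(retrieved)
    tp = len(relevant & retrieved)  # |A ∩ B|, the true positives
    # Returning 0.0 for an empty set is a convention chosen for this sketch.
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy query: the document ids are made up.
relevant_docs = {"d1", "d2", "d3", "d4"}   # A
retrieved_docs = {"d1", "d2", "d5"}        # B
p, r = precision_recall(relevant_docs, retrieved_docs)
print(p, r)
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents are retrieved, so precision is 2/3 and recall is 1/2.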
For the multi-class problem, microaveraged and macroaveraged metrics are used. The microaveraged precision is defined as:
\[ \mathrm{Micro}_{\mathrm{precision}} = \frac{\sum_{i=1}^{|C|} |TP|_i}{\sum_{i=1}^{|C|} |TP|_i + \sum_{i=1}^{|C|} |FP|_i}, \tag{2.9} \]
The microaveraged recall is defined as:
\[ \mathrm{Micro}_{\mathrm{recall}} = \frac{\sum_{i=1}^{|C|} |TP|_i}{\sum_{i=1}^{|C|} |TP|_i + \sum_{i=1}^{|C|} |FN|_i}, \tag{2.10} \]
where $|C|$ is the number of classes, $|TP|_i$ is the number of true positives for positive class $i$, $|FP|_i$ is the number of false positives for positive class $i$ and $|FN|_i$ is the number of false negatives for positive class $i$.
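The macroaveraged metrics are defined as follows:
\[ \mathrm{Macro}_{\mathrm{precision}} = \frac{\sum_{i=1}^{|C|} \mathrm{Precision}_i}{|C|}, \tag{2.11} \]
\[ \mathrm{Macro}_{\mathrm{recall}} = \frac{\sum_{i=1}^{|C|} \mathrm{Recall}_i}{|C|}, \tag{2.12} \]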
Here the precision and recall of each class $i$ are defined in the usual way. The difference between the two is that microaveraging gives equal weight to each per-document classification decision, whereas macroaveraging gives equal weight to each class. Consequently, microaveraged scores are dominated by the large classes: performance can be poor on classes with few positive examples without affecting the overall numbers much.
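The contrast between the two averaging schemes can be illustrated with a small sketch. The helper `micro_macro` and the per-class counts are hypothetical, and the code assumes every class has at least one retrieved and one relevant item so that no denominator is zero.

```python
def micro_macro(counts):
    """counts holds one (tp, fp, fn) tuple per class.

    Returns (micro_precision, micro_recall, macro_precision, macro_recall).
    Assumes no denominator is zero.
    """
    tp_sum = sum(tp for tp, _, _ in counts)
    fp_sum = sum(fp for _, fp, _ in counts)
    fn_sum = sum(fn for _, _, fn in counts)
    # Microaveraging pools the counts of all classes before dividing.
    micro_p = tp_sum / (tp_sum + fp_sum)
    micro_r = tp_sum / (tp_sum + fn_sum)
    # Macroaveraging computes the metric per class, then takes the mean.
    macro_p = sum(tp / (tp + fp) for tp, fp, _ in counts) / len(counts)
    macro_r = sum(tp / (tp + fn) for tp, _, fn in counts) / len(counts)
    return micro_p, micro_r, macro_p, macro_r


# Hypothetical counts: a large class handled well, a small class handled badly.
counts = [(90, 10, 10), (1, 9, 9)]
micro_p, micro_r, macro_p, macro_r = micro_macro(counts)
print(micro_p, micro_r, macro_p, macro_r)
```

With these made-up counts the microaveraged precision is 91/110 (about 0.83), while the macroaveraged precision is 0.5, because the poorly handled small class counts as much as the large one.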
In certain cases, a single metric is used to summarise the overall performance of the system. The F measure proposed by Jardine and Rijsbergen [91] is the weighted harmonic mean of precision and recall, defined as follows:
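\[ F_\beta = \frac{(\beta^2 + 1) R P}{R + \beta^2 P}, \tag{2.13} \]
where $R$ and $P$ represent recall and precision respectively, and $\beta$ is a non-negative parameter that controls their relative weight: values below 1 favour precision and values above 1 favour recall. In practice, the most common $F_\beta$ measure is $F_1$, i.e.,
\[ F_1 = \frac{2 R P}{R + P} \tag{2.14} \]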
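A short sketch of Eq. (2.13) follows. The function name `f_beta` and the zero-denominator convention are assumptions; the example values of precision and recall are made up.

```python
def f_beta(precision, recall, beta=1.0):
    """F measure of Eq. (2.13); beta=1 gives F1 as in Eq. (2.14)."""
    b2 = beta * beta
    denominator = recall + b2 * precision
    # Returning 0.0 when both inputs are zero is a convention of this sketch.
    if denominator == 0:
        return 0.0
    return (b2 + 1) * recall * precision / denominator


# Hypothetical system with low precision and perfect recall.
f1 = f_beta(0.5, 1.0)              # harmonic mean of 0.5 and 1.0
f_half = f_beta(0.5, 1.0, beta=0.5)  # smaller beta weights precision more
print(f1, f_half)
```

In this example the smaller value of beta pulls the score towards the (lower) precision, so `f_half` is below `f1`.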
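Similarly, the microaveraged and macroaveraged $F_1$ are defined as follows:
\[ \mathrm{Micro}_{F_1} = \frac{2 \times \mathrm{Micro}_{\mathrm{precision}} \times \mathrm{Micro}_{\mathrm{recall}}}{\mathrm{Micro}_{\mathrm{precision}} + \mathrm{Micro}_{\mathrm{recall}}}, \tag{2.15} \]
\[ \mathrm{Macro}_{F_1} = \frac{\sum_{i=1}^{|C|} F_{1i}}{|C|}, \tag{2.16} \]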
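Eqs. (2.15) and (2.16) can be sketched in the same way. The helper names and the per-class counts below are hypothetical, and every class is assumed to have non-zero denominators.

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def micro_macro_f1(counts):
    """counts holds one (tp, fp, fn) tuple per class; returns (micro_f1, macro_f1)."""
    tp_sum = sum(tp for tp, _, _ in counts)
    fp_sum = sum(fp for _, fp, _ in counts)
    fn_sum = sum(fn for _, _, fn in counts)
    # Micro F1 combines the pooled precision and recall, Eq. (2.15).
    micro = f1_score(tp_sum / (tp_sum + fp_sum), tp_sum / (tp_sum + fn_sum))
    # Macro F1 is the mean of the per-class F1 values, Eq. (2.16).
    per_class = [f1_score(tp / (tp + fp), tp / (tp + fn)) for tp, fp, fn in counts]
    macro = sum(per_class) / len(per_class)
    return micro, macro


# Hypothetical counts: one large, well-classified class and one small, poor one.
micro_f1, macro_f1 = micro_macro_f1([(90, 10, 10), (1, 9, 9)])
print(micro_f1, macro_f1)
```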
Another common evaluation metric is accuracy, which is defined as follows:
\[ \mathrm{Accuracy} = \frac{|TP| + |TN|}{|TP| + |FP| + |TN| + |FN|} \tag{2.17} \]
A further measure of document retrieval system performance is called "expected search length" [39]. Given a list of ranked results, search length measures the number of irrelevant results until the first relevant result is encountered. Such an ordering is called "simple ordering" by William Cooper. However, a retrieved set of documents can also be divided into multiple levels, each containing a subset of documents. For example, if a query contains 5 words, one subset could be the documents having all 5 words, another subset could be the documents containing only 4 words, and so on. If we assume the documents within each level are randomly ordered, expected search length is defined as the average number of documents that must be examined to retrieve a given number of relevant documents.
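The definition in terms of levels can be illustrated with a brute-force sketch that enumerates every within-level ordering and averages the number of documents examined. This is not Cooper's closed-form expression; the function names and the toy levels are assumptions, and the enumeration grows factorially, so it is only usable on very small examples.

```python
from itertools import permutations, product


def docs_examined(ranking, n_relevant):
    """Number of documents read until n_relevant relevant ones are found."""
    found = 0
    for position, is_relevant in enumerate(ranking, start=1):
        found += is_relevant
        if found == n_relevant:
            return position
    raise ValueError("ranking holds fewer relevant documents than requested")


def expected_docs_examined(levels, n_relevant):
    """Average of docs_examined over all orderings within each level.

    levels is a list of lists of 0/1 relevance flags; levels are read in
    order, and every ordering inside a level is assumed equally likely.
    """
    totals = []
    for ordering in product(*(permutations(level) for level in levels)):
        ranking = [doc for level in ordering for doc in level]
        totals.append(docs_examined(ranking, n_relevant))
    return sum(totals) / len(totals)


# Hypothetical result set: level 1 has two documents, level 2 has three.
levels = [[0, 1], [1, 0, 0]]
one = expected_docs_examined(levels, 1)
two = expected_docs_examined(levels, 2)
print(one, two)
```

For one relevant document the reader examines either one or two documents of the first level with equal probability, giving 1.5 on average; for two relevant documents the whole first level is read, plus two documents of the second level on average, giving 4.0.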
For a single search engine, the number of retrieved documents can be extremely large, so we need to choose a cut-off point at which to compute the precision.² For example, precision at top 10 measures the accuracy within the top 10 retrieved documents. However, it does not take document position into account. A model that ranks relevant documents in higher positions is better than one that ranks them in lower positions, yet precision at top 10 cannot distinguish between such models. To compare two ranked lists more accurately, average precision is proposed as follows:
\[ \mathrm{averprec} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Precision}(r_i), \tag{2.18} \]
where $r_i$ is the set of ranked retrieved results from the top rank down to document $d_i$, and $\mathrm{Precision}(r_i)$ is the precision in the top $i$ results.
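The following sketch reads Eq. (2.18) literally, averaging the precision at every cut-off from 1 to $k$. Note that many evaluation toolkits instead average only over the ranks at which a relevant document appears, so this is an illustration of the formula as printed here, not of any particular tool. The function names and relevance lists are hypothetical.

```python
def precision_at(rels, i):
    """Precision in the top i results; rels[j] is 1 if rank j+1 is relevant."""
    return sum(rels[:i]) / i


def average_precision(rels, k):
    """Mean of Precision(r_i) for i = 1..k, following Eq. (2.18) as written."""
    return sum(precision_at(rels, i) for i in range(1, k + 1)) / k


# Two hypothetical rankings with the same relevant documents in different positions.
early = [1, 1, 0, 0]   # relevant documents ranked first
late = [0, 0, 1, 1]    # relevant documents ranked last

p_early, p_late = precision_at(early, 4), precision_at(late, 4)
ap_early, ap_late = average_precision(early, 4), average_precision(late, 4)
print(p_early, p_late, ap_early, ap_late)
```

Both rankings have the same precision at cut-off 4, but the average precision is higher for the ranking that places the relevant documents first.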
In recent years, mean average precision (MAP), which is the mean of average precision over all queries in the test set, has been used in many research papers. Other measures include the normalised discounted cumulative gain (NDCG) [69], which is based on two assumptions:
• Highly relevant documents are more useful than marginally relevant documents.
• The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined.
² It is usually impossible in practice to measure recall because there is lack of knowledge of
The discounted cumulative gain (DCG) is the total gain accumulated at a particular rank $p$. It is defined as:
\[ \mathrm{DCG}_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i} \tag{2.19} \]
where $rel_i$ is the graded relevance level of the document retrieved at rank $i$ [24]. For the perfect ranking of a given query, the DCG value is maximized. This maximized DCG value is called the ideal DCG (IDCG) value. The normalised discounted cumulative gain (NDCG) for a given query can be defined as:
\[ \mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p} \tag{2.20} \]
where $\mathrm{IDCG}_p$ is the ideal DCG value for the query.
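Eqs. (2.19) and (2.20) can be sketched as below. One simplifying assumption is made loudly: the ideal ranking is obtained by sorting the relevance grades of the same result list, whereas a full evaluation would build it from all judged documents for the query. The function names and the graded relevance values are hypothetical.

```python
import math


def dcg(rels, p):
    """DCG at rank p, Eq. (2.19): rank 1 is undiscounted, rank i >= 2 is divided by log2(i)."""
    rels = rels[:p]
    if not rels:
        return 0.0
    return rels[0] + sum(
        rel / math.log2(i) for i, rel in enumerate(rels[1:], start=2)
    )


def ndcg(rels, p):
    """NDCG at rank p, Eq. (2.20).

    Assumption: the ideal ranking is the same grades sorted in decreasing order.
    """
    ideal = dcg(sorted(rels, reverse=True), p)
    return dcg(rels, p) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance (0 = non-relevant, 3 = highly relevant) by rank.
grades = [3, 2, 3, 0, 1, 2]
score = ndcg(grades, 6)
perfect = ndcg(sorted(grades, reverse=True), 6)
print(score, perfect)
```

A ranking that already lists the documents in decreasing order of relevance obtains an NDCG of 1, and any other ordering of the same grades scores at most 1.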