3.6 FUNCTIONS OF THE COMMAND UNITS
3.6.3 Compliance Unit
The standard approach to IR system evaluation relies on the notion of relevant and irrelevant documents. Relevance is assessed relative to an information need, not a query [2]. The evaluation of an IR system therefore measures how well the system meets the information needs of its users. It is possible to define approximate methods which correlate with the preferences of a population of users [10]. There are various methods for evaluating the retrieval quality of an IR system.
Precision and recall are the basic measures used in evaluating search strategies. Precision is the ratio of the number of relevant records retrieved to the total number of records retrieved (relevant and irrelevant), that is, the number of true positives divided by the sum of true positives and false positives. Recall is the ratio of the number of relevant records retrieved to the total number of relevant records in the database, that is, the number of true positives divided by the sum of true positives and false negatives.
Table 2.3 The contingency table
                 Relevant               Irrelevant
Retrieved        True Positive (tp)     False Positive (fp)
Not retrieved    False Negative (fn)    True Negative (tn)
Precision and recall are calculated by equations (2.8) and (2.9), respectively:
$$\mathrm{Precision} = \frac{tp}{tp + fp} \qquad (2.8)$$
$$\mathrm{Recall} = \frac{tp}{tp + fn} \qquad (2.9)$$
Recall is difficult to calculate in a large collection, because it requires knowing every relevant document in the collection. Precision and recall are also not always useful, since they assume that all the documents in the search results have been seen. However, the user is not usually presented with all the documents in the search results; only the top-ranked documents matter. Precision@k (P@k) [150] [151] is the precision for the top-k ranked results. For example, precision@10 (P@10) and precision@20 (P@20) are the precision for the top 10 and the top 20 query suggestions or documents, respectively. They are calculated by equations (2.10) and (2.11):
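$$P@10 = \frac{\text{number of relevant results in the top 10}}{10} \qquad (2.10)$$
$$P@20 = \frac{\text{number of relevant results in the top 20}}{20} \qquad (2.11)$$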
Precision@k has the advantage of not requiring any estimate of the size of the set of relevant documents. Its disadvantage is that it is the least stable of the commonly used evaluation measures and does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.
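As an illustration of precision at k, the following minimal sketch computes P@k from a ranked list of binary relevance judgements. The function name `precision_at_k`, the boolean input format, and the example ranking are assumptions made for this example, not part of the evaluation setup described here.

```python
def precision_at_k(ranked_relevance, k):
    """Precision over the top-k results.

    ranked_relevance: list of booleans in rank order (True = relevant).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_relevance[:k]
    # Assumption: divide by k even when fewer than k results were returned,
    # so that missing results count as irrelevant.
    return sum(1 for rel in top_k if rel) / k


# Hypothetical judgements for the top 10 results of one query.
ranking = [True, False, True, True, False, False, True, False, False, False]
p_at_5 = precision_at_k(ranking, 5)
p_at_10 = precision_at_k(ranking, 10)
```

Note that the function never needs the total number of relevant documents, which is the advantage mentioned above.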
The F-measure, or F1 score, is the harmonic mean of precision and recall, combining them into a single number. It can be interpreted as a weighted average of the precision and recall. The F1 score reaches its best value at 1 and its worst at 0. It is a popular metric for evaluating text classification algorithms. The F-measure is defined by equation (2.12):
$$F(j) = \frac{2}{\dfrac{1}{r(j)} + \dfrac{1}{P(j)}} \qquad (2.12)$$
where r(j) is the recall at the j-th position in the ranking and P(j) is the precision at the j-th position in the ranking [10].
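In addition, accuracy is often used for evaluating classification problems. It is the fraction of a classifier's decisions that are correct. In terms of the contingency table, accuracy is defined by equation (2.13):
$$\mathrm{Accuracy} = \frac{tp + tn}{tp + fp + fn + tn} \qquad (2.13)$$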
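The set-based measures can be computed directly from the four contingency counts. The sketch below is one possible way to do this; the function name `set_based_measures`, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions for illustration only.

```python
def set_based_measures(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from contingency-table counts."""
    # Assumption: a measure with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall, written as 2PR / (P + R).
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts: 30 relevant retrieved, 10 irrelevant retrieved,
# 20 relevant missed, 40 irrelevant correctly left out.
measures = set_based_measures(tp=30, fp=10, fn=20, tn=40)
```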
Precision, recall, F-measure, and accuracy are set-based measures. They are the evaluations for unranked documents. However, ranked retrieval results are very important in IR applications. Mean reciprocal rank, mean average precision, and discounted cumulated gain are used for evaluating ranked documents [2].
Mean reciprocal rank (MRR) [151] [152] is a statistical measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by the probability of correctness. For a sample of queries, the reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. MRR is suitable for evaluating the ranking of web documents and query suggestions. For query j, the reciprocal rank of a relevant document or good query suggestion i, RR_ji, is the multiplicative inverse of r_ji, the rank of this document or query suggestion in the list of potential documents or query suggestions produced by a document or query suggestion method. It equals 0 if no such document or query suggestion is in the list. RR_ji is defined by equation (2.14):
$$RR_{ji} = \frac{1}{r_{ji}} \qquad (2.14)$$
MRR is the average of the reciprocal ranks of all the relevant documents or good suggestions for all queries, which is defined by equation (2.15):
$$MRR = \frac{1}{q} \sum_{j=1}^{q} \frac{1}{Q_j} \sum_{i=1}^{Q_j} RR_{ji} \qquad (2.15)$$
where Q_j is the number of relevant documents or good suggestions for query j, and q is the number of queries. In this thesis, the relevant documents or good query suggestions for a query are determined partly by users' decisions and partly by the Google query suggestions.
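A minimal sketch of this MRR variant follows, averaging the reciprocal ranks of all relevant items within each query and then over the queries. The input format (one list of 1-based ranks per query, with `None` for a relevant item that is absent from the list), the function name, and the treatment of a query with no relevant items as contributing 0 are assumptions of this example.

```python
def mean_reciprocal_rank(ranks_per_query):
    """MRR averaged over all relevant items of each query, then over queries.

    ranks_per_query: one list per query holding the 1-based rank of each
    relevant item, or None when the item does not appear in the list.
    """
    if not ranks_per_query:
        return 0.0
    per_query = []
    for ranks in ranks_per_query:
        if not ranks:
            # Assumption: a query without relevant items contributes 0.
            per_query.append(0.0)
            continue
        # The reciprocal rank is 0 for a relevant item missing from the list.
        rr = [1.0 / r if r is not None else 0.0 for r in ranks]
        per_query.append(sum(rr) / len(rr))
    return sum(per_query) / len(per_query)


# Hypothetical data: query 1 has relevant items at ranks 1 and 4;
# query 2 has one at rank 2 and one that was never suggested.
mrr = mean_reciprocal_rank([[1, 4], [2, None]])
```

The conventional MRR uses only the first correct answer per query; passing a single rank per query to this function reproduces that behaviour.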
Mean average precision (MAP) [2] [151] supposes that users are concerned about finding many relevant documents/suggestions, and highly relevant documents/suggestions should appear first in a suggested list.
Let r_ji be the rank of the i-th relevant document/suggestion in the list of potential documents/suggestions produced by a document/suggestion ranking method for query j. The precision at the i-th relevant document/suggestion is defined by equation (2.16):
$$P_{ji} = \frac{\text{number of relevant items in the top } r_{ji}}{r_{ji}} = \frac{i}{r_{ji}} \qquad (2.16)$$
For an irrelevant suggestion, the precision is set to 0. MAP is defined as the average precision of all the documents/query suggestions for the queries, as shown in equation (2.17):
$$MAP = \frac{1}{q} \sum_{j=1}^{q} \frac{1}{Q_j} \sum_{i=1}^{Q_j} P_{ji} \qquad (2.17)$$
where Q_j is the number of relevant documents/suggestions for query j and q is the number of queries.
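The MAP computation can be sketched as below. The input format (1-based ranks of the relevant items for each query, `None` for a relevant item that was not retrieved), the function names, and the example data are assumptions for illustration; relevant items that are not retrieved are assumed to contribute a precision of 0.

```python
def average_precision(ranks):
    """Average precision for one query.

    ranks: 1-based ranks of the relevant items, None if not retrieved.
    """
    if not ranks:
        return 0.0
    found = sorted(r for r in ranks if r is not None)
    # The i-th relevant item, found at rank r, has precision i / r.
    total = sum(i / r for i, r in enumerate(found, start=1))
    # Dividing by all relevant items gives unretrieved ones a precision of 0.
    return total / len(ranks)


def mean_average_precision(ranks_per_query):
    """Mean of the per-query average precision values."""
    if not ranks_per_query:
        return 0.0
    return sum(average_precision(r) for r in ranks_per_query) / len(ranks_per_query)


# Hypothetical data: query 1 has relevant items at ranks 1, 3 and 6;
# query 2 has one at rank 2 and one that was never retrieved.
ap_first = average_precision([1, 3, 6])
map_score = mean_average_precision([[1, 3, 6], [2, None]])
```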
MAP allows only binary relevance assessment: relevant or irrelevant. It does not distinguish highly relevant documents/suggestions from mildly relevant documents/suggestions. On the other hand, discounted cumulated gain (DCG) [2] [53] is a metric that combines graded relevance assessments effectively. This grade is the rating or weighting factor of the rank of the ith document/suggestion.
Cumulative gain (CG) is designed for situations with non-binary notions of relevance. The cumulative gain of the Q_j documents/suggestions for query j is defined by equation (2.18):
$$CG_j = \sum_{i=1}^{Q_j} w_i \qquad (2.18)$$
where w_i is the rating or weighting factor of the i-th document/suggestion.
Discounted cumulative gain (DCG) is defined by applying a discount factor of 1/log_2 i to the i-th document/suggestion, as shown in equation (2.19):
$$DCG_j = w_1 + \frac{w_2}{\log_2 2} + \frac{w_3}{\log_2 3} + \dots \qquad (2.19)$$
Normalised discounted cumulative gain (nDCG) of query j is defined by equation (2.20):
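$$nDCG_j = \frac{DCG_j}{IDCG_j} \qquad (2.20)$$
where IDCG_j is the maximum possible DCG for query j. The average DCG and the average nDCG over q queries are defined by equations (2.21) and (2.22), respectively:
$$\overline{DCG} = \frac{1}{q} \sum_{j=1}^{q} DCG_j \qquad (2.21)$$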
$$\overline{nDCG} = \frac{1}{q} \sum_{j=1}^{q} nDCG_j \qquad (2.22)$$
In the most standard IR task, the system aims to provide the information or documents the user wants as correctly and quickly as possible. A user's information needs are therefore the most important issue, and users play the most important role in deciding whether a document is relevant or not. System and user utility reflect how satisfied each user is with the results the system gives for each information need. They may include both objective quantitative measures, such as the time to complete a task, and subjective ones, such as a satisfaction score. The system utility is a satisfaction score that users give to the system. The user utility is a way of quantifying aggregate user happiness, based on the relevance, speed, and user interface of a system. For example, the owners of a website are happy if customers click through to their site. User happiness is an elusive measure, and this is partly why the standard methodology uses the relevance of search results as a proxy. In user studies, the participants are observed, and ethnographic interview techniques are used to obtain qualitative information on satisfaction. Questionnaires provide data about users' opinions, and the results are reported to researchers. For the evaluation of ranked retrieval results, the users or participants are asked to choose the relevant results and to rank them with respect to the query. User studies are very useful, but they are time consuming and expensive [2].
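Returning to the graded-relevance measures, the DCG and nDCG definitions can be sketched as follows. The function names and the example ratings are hypothetical, and the ideal DCG is assumed here to be the DCG of the same rated items sorted by decreasing rating, which is one common way to obtain the maximum possible DCG.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of ratings given in rank order."""
    total = 0.0
    for i, w in enumerate(gains, start=1):
        # The first item is not discounted, since log2(1) = 0.
        total += w if i == 1 else w / math.log2(i)
    return total


def ndcg(gains):
    """DCG normalised by the DCG of the ideal ordering."""
    # Assumption: the ideal ranking is the same items sorted by rating.
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def mean_ndcg(gains_per_query):
    """Average nDCG over a set of queries."""
    if not gains_per_query:
        return 0.0
    return sum(ndcg(g) for g in gains_per_query) / len(gains_per_query)


# Hypothetical graded ratings (0 to 3) for the ranked results of two queries.
queries = [[3, 2, 3, 0, 1], [3, 2, 1]]
average_ndcg = mean_ndcg(queries)
```

A ranking that already lists its items in decreasing order of rating, such as the second query, obtains an nDCG of 1.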