6. DISCUSSION
6.2 Effect of diets on fatty acid metabolism
The final step after implementing an IR system is evaluating it. Several measurements exist for evaluating IR systems, and the choice among them depends on the objectives for which the system was built. One objective is system performance, which is measured in terms of computational runtime and problem size. The second, widely used objective is accuracy. Accuracy metrics depend on the degree of relevance of the retrieved documents to the user queries (the user's information needs). This type of evaluation is referred to as retrieval accuracy evaluation or, in other words, system effectiveness (Baeza-Yates and Ribeiro-Neto, 2011).
The most common metrics for measuring IR system effectiveness are Recall and Precision. Figure 2.3 illustrates the use of recall and precision as measurements of IR effectiveness. There are two classical Recall-Precision approaches to measuring IR system effectiveness: Singular Value Recall-Precision and Interpolated Recall-Precision (Baeza-Yates and Ribeiro-Neto, 1999; Kwok, 1997; Chang and Hsu, 1999). Singular Value Recall-Precision computes both the recall and the corresponding precision for a given user query. Interpolated Recall-Precision retrieves documents in response to the user query until a specific recall value assigned to the IR system is reached, and then measures the corresponding precision ratio. In web-scale IR, there are other metrics, such as the precision of the retrieved list at the first k documents retrieved (P@k) (Li, 2014). In the experimental study of the thesis, Mean Average Precision (MAP), Average Precision (AP), Precision at the top 10 documents retrieved (P@10), Normalised Discounted Cumulative Gain (NDCG@10), Reciprocal Rank (RR@10), Error Rate (Err@10) and Root Mean Square Error (RMSE) were used (Li, 2014; Baeza-Yates and Ribeiro-Neto, 2011; Chai and Draxler, 2014). We now turn to the explanation of these evaluation measures.
Figure 2.3: Recall and Precision Ratios in IR Systems.
Let $d_1, d_2, \ldots, d_k$ denote the documents sorted in decreasing order of their similarity measure function value, where $k$ represents the number of retrieved documents. The function $r(d_i)$ gives the relevance value of a document $d_i$. It returns 1 if $d_i$ is relevant, and 0 otherwise. The Precision of the top-k documents retrieved per query $q$ (P@k) is defined as follows:
\[ P@k = \frac{\sum_{i=1}^{k} r(d_i)}{k} \tag{2.1.1} \]
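As an illustration of equation 2.1.1, the following is a minimal Python sketch, assuming binary relevance labels that are already listed in ranked order. The function name `precision_at_k` and the sample list are hypothetical and not taken from the thesis.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """P@k: sum of the binary labels r(d_i) over the top-k positions, divided by k.

    relevance[0] is r(d_1), the label of the highest-ranked document.
    If fewer than k documents were retrieved, the missing positions
    contribute nothing to the sum (an assumption of this sketch).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(relevance[:k]) / k


# Hypothetical ranked list: 1 = relevant, 0 = not relevant.
ranked_labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(precision_at_k(ranked_labels, 5))
print(precision_at_k(ranked_labels, 10))
```

With this list, three of the first five documents are relevant, so P@5 is 3/5, and four of the first ten are relevant, so P@10 is 4/10.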
On the other hand, the Interpolated Average Precision at a specific recall point $r = \bar{r}$ can be calculated as follows:
\[ AvgP = \max_{r=\bar{r}} P(\bar{r}) \tag{2.1.2} \]
where $P(\bar{r})$ is the precision at recall point $r = \bar{r}$ over all queries $Q$. The $AvgP$ value is calculated for a single recall point. In this thesis, we calculated $AvgP$ for nine recall points used as thresholds for the top k documents retrieved. The interpolated mean of the average precision values over M recall points (MAP) is given by the following equation:
\[ MAP = \frac{\sum_{L=1}^{M} AvgP}{M} \tag{2.1.3} \]
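The interpolation and averaging steps in equations 2.1.2 and 2.1.3 can be sketched for a single query as follows. This sketch assumes the conventional interpolation rule, in which the precision at a recall level is the highest precision observed at that recall level or any higher one. The text does not confirm that this is the exact rule used in the thesis. The function names and the nine-point grid 0.1, ..., 0.9 are also hypothetical.

```python
from typing import List, Sequence, Tuple


def precision_recall_points(relevance: Sequence[int],
                            total_relevant: int) -> List[Tuple[float, float]]:
    """Return (recall, precision) measured after each rank position."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    points, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        hits += 1 if rel else 0
        points.append((hits / total_relevant, hits / rank))
    return points


def interpolated_precision(points: Sequence[Tuple[float, float]],
                           recall_level: float) -> float:
    """Highest precision at any recall >= recall_level (assumed rule)."""
    return max((p for r, p in points if r >= recall_level), default=0.0)


def mean_interpolated_precision(relevance: Sequence[int],
                                total_relevant: int,
                                recall_levels: Sequence[float]) -> float:
    """Average of the interpolated precision over the M recall levels."""
    points = precision_recall_points(relevance, total_relevant)
    values = [interpolated_precision(points, lvl) for lvl in recall_levels]
    return sum(values) / len(values)


# Hypothetical query with two relevant documents, found at ranks 1 and 3.
levels = [i / 10 for i in range(1, 10)]  # assumed nine-point grid
print(mean_interpolated_precision([1, 0, 1, 0, 0], 2, levels))
```

Averaging the per-query values over all queries would then give a collection-level figure. How the thesis aggregates over the query set Q is not spelled out here, so that step is left out of the sketch.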
To account for the graded relevance levels in the datasets used for evaluating LTR techniques, $r(d_j)$ returns a graded relevance value (rather than a binary relevance value as in the MAP and P@k equations) in equations 2.1.4, 2.1.5 and 2.1.6 for the other fitness evaluation metrics. The Normalized Discounted Cumulative Gain of the top-k documents retrieved (NDCG@k) in equation 2.1.4 can be calculated by:
\[ NDCG@k = \frac{1}{IDCG@k} \cdot \sum_{i=1}^{k} \frac{2^{r(d_i)} - 1}{\log_2(i+1)} \tag{2.1.4} \]
where $IDCG@k$ is the ideal (maximum) discounted cumulative gain of the top-k documents retrieved. The Discounted Cumulative Gain of the top-k documents retrieved (DCG@k) can be calculated by the following equation:
\[ DCG@k = \sum_{i=1}^{k} \frac{2^{r(d_i)} - 1}{\log_2(i+1)} \tag{2.1.5} \]
If all top-k documents retrieved are relevant, the $DCG@k$ will be equal to $IDCG@k$.
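Equations 2.1.4 and 2.1.5 can be sketched in Python as follows. The sketch assumes that IDCG@k is obtained by re-sorting the judged labels of the same list in decreasing order. That is one common convention, and the text does not confirm it as the thesis's own procedure. The function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevance: Sequence[int], k: int) -> float:
    """DCG@k with gain 2^r - 1 and discount log2(i + 1), for positions i = 1..k."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(relevance[:k], start=1))


def ndcg_at_k(relevance: Sequence[int], k: int) -> float:
    """NDCG@k = DCG@k / IDCG@k; defined as 0.0 when no document has positive gain."""
    # Assumption: the ideal ranking is the same labels sorted from best to worst.
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded labels (0 = not relevant, 2 = highly relevant) in ranked order.
print(ndcg_at_k([2, 1, 0], 3))  # already in ideal order
print(ndcg_at_k([0, 1, 2], 3))  # worst ordering of the same labels
```

A list that is already sorted by decreasing label scores exactly 1, and any other ordering of the same labels scores lower.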
The Reciprocal Rank metric at the top-K retrieved query-document pairs (RR@K) is as follows:
\[ RR@K = \sum_{i=1}^{k} \frac{1}{i} \prod_{j=1}^{i} \bigl(1 - r(d_j)\bigr) \ast r(d_j) \tag{2.1.6} \]
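The typeset form of equation 2.1.6 leaves the scope of the product open to interpretation. The sketch below assumes a cascade reading: the product of $(1 - r(d_j))$ runs over the documents ranked above position $i$ and is then multiplied by $r(d_i)$, with relevance values scaled to the interval [0, 1]. Both points are assumptions of this sketch, not confirmed definitions from the thesis, and the function name is hypothetical.

```python
from typing import Sequence


def rr_at_k(relevance: Sequence[float], k: int) -> float:
    """Cascade-style reciprocal rank over the top-k positions (assumed reading).

    Each relevance value is expected to lie in [0, 1].
    """
    score = 0.0
    not_satisfied = 1.0  # product of (1 - r(d_j)) over documents above position i
    for i, r in enumerate(relevance[:k], start=1):
        score += not_satisfied * r / i
        not_satisfied *= (1.0 - r)
    return score


# With binary labels this reading gives 1 / (rank of the first relevant document).
print(rr_at_k([0, 0, 1, 1], 10))
```

Under this reading, a relevant document near the top of the list dominates the score, because every later term is discounted by the documents ranked above it.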
The Error Rate (Err) is usually used to measure the error of the learning model when it is applied to a benchmark other than the training set. It is the difference between the training evaluation value and the predictive evaluation value, while the Mean Absolute Error and Root Mean Square Error are calculated by equations 2.1.7 and 2.1.8.
\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |ERR_i| \tag{2.1.7} \]
\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (ERR_i)^2} \tag{2.1.8} \]
where $n$ is the number of benchmark instances (documents) used for evaluating the IR system effectiveness. Each evaluation metric serves a purpose in measuring the quality of the proposed ranking model and of the search results retrieved by this model. P@K measures how many relevant documents appear among the top-K documents. However, this metric does not consider the graded relevance level of each retrieved document; it only considers whether each retrieved query-document pair is relevant or not. The MAP evaluation metric considers the average precision over the whole search results rather than over the top-K query-document pairs retrieved. On the other hand, the NDCG@K and RR@K metrics take the graded relevance level of each query-document pair into consideration for the top-K query-document pairs retrieved. The difference between NDCG@K and RR@K is that RR@K gives more weight than NDCG@K to the position of each retrieved query-document pair in the search list. Finally, MAE and RMSE calculate the difference between the relevance labels produced by the ranking model from the query-document pair features and the ground truth relevance labels. MAE and RMSE treat the ranking problem as both a ranking and a regression problem. In this thesis, we used MAP, NDCG@10, P@10, RR@10 and RMSE as fitness evaluation metrics for extensive evaluation and optimisation of the models produced by the proposed techniques.
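Equations 2.1.7 and 2.1.8 can be sketched in Python as follows. The sketch assumes that each $ERR_i$ is the difference between the relevance label predicted by the ranking model and the ground truth label for instance $i$, as described in the paragraph above. The function names and sample labels are hypothetical.

```python
import math
from typing import List, Sequence


def label_errors(predicted: Sequence[float], truth: Sequence[float]) -> List[float]:
    """ERR_i = predicted label minus ground truth label (assumed definition)."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    return [p - t for p, t in zip(predicted, truth)]


def mae(errors: Sequence[float]) -> float:
    """Mean Absolute Error over n instances."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors: Sequence[float]) -> float:
    """Root Mean Square Error over n instances."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical predicted and ground truth relevance labels for four documents.
errors = label_errors([2, 0, 1, 1], [2, 1, 0, 1])
print(mae(errors), rmse(errors))
```

Because RMSE squares each error before averaging, it penalises a few large label errors more heavily than MAE does.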