The previous section provided some initial insights into the effect of unjudged documents. In this section, we analyse the effect of unjudged documents on precision @ 20 across all 85 TREC MedTrack queries. The plot in Figure 7.1 shows, for each query (x-axis), the number of unjudged documents (left y-axis) in the top 20 results, for both Bag-of-concepts (lvl0) and the GIN (lvl1). The plot also shows the overlap between lvl0 and lvl1, i.e., the number of unjudged documents that appear in both the lvl0 and lvl1 top 20 results. Finally, the plot shows the change in precision @ 20 (red line, right y-axis) between lvl0 and lvl1 (i.e., lvl1 minus lvl0). The queries on the x-axis are ordered according to the number of unjudged documents retrieved by lvl1.
[Figure 7.1 plot. x-axis: QueryId; left y-axis: number of unjudged docs in top 20 results; right y-axis: ∆ Prec@20; series: unjudged docs (lvl0), unjudged docs (lvl1), overlap, ∆ Prec@20.]
Figure 7.1: The number of unjudged documents in top 20 results (left y-axis) for each query (x-axis), and the corresponding change in precision @ 20 (right y-axis). Queries ordered according to the number of unjudged documents retrieved by lvl1.
The figure provides a number of insights. Clearly, there were far more unjudged documents for lvl1 than lvl0. Therefore, the evaluation was more likely to have underestimated the performance for lvl1. This was highlighted previously and was the initial motivation for obtaining more relevance assessments. In addition, the overlap between the unjudged documents returned by lvl0 and lvl1 was relatively small. This shows that the rankings were quite different. The GIN relies on different information and returns a different set of documents.
Finally, the right-hand side of the plot shows a number of queries where lvl1 was returning a large number of unjudged documents but without a marked degradation in precision @ 20. These queries exhibit the same characteristics as the example query 119 presented in the previous section: many more unjudged documents without a marked loss in precision. We conjecture that these were the queries where the GIN was returning new relevant documents never judged by the TREC assessors. The question, therefore, is: what portion of the unjudged documents returned may have actually been relevant but were never seen by TREC assessors? It is this question that motivates the need for additional relevance assessments.
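The per-query quantities plotted in Figure 7.1 can be sketched as follows. This is a minimal illustration rather than the code used for the evaluation: the function names, the representation of a run as a ranked list of document ids, and the toy data (with a cut-off of 4 instead of 20) are all assumptions made for the example.

```python
from typing import Dict, List, Set


def unjudged_in_top_k(ranking: List[str], judged: Set[str], k: int = 20) -> Set[str]:
    """Documents in the top k of a ranking that have no relevance judgement."""
    return {doc for doc in ranking[:k] if doc not in judged}


def precision_at_k(ranking: List[str], relevant: Set[str], k: int = 20) -> float:
    """Precision @ k, treating unjudged documents as not relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def figure_quantities(run_lvl0: List[str], run_lvl1: List[str],
                      judged: Set[str], relevant: Set[str],
                      k: int = 20) -> Dict[str, float]:
    """Unjudged counts, their overlap, and the change in precision (lvl1 minus lvl0)."""
    u0 = unjudged_in_top_k(run_lvl0, judged, k)
    u1 = unjudged_in_top_k(run_lvl1, judged, k)
    delta = precision_at_k(run_lvl1, relevant, k) - precision_at_k(run_lvl0, relevant, k)
    return {
        "unjudged_lvl0": len(u0),
        "unjudged_lvl1": len(u1),
        "overlap": len(u0 & u1),
        "delta_prec": delta,
    }


# Hypothetical toy query: d1..d4 were judged, d1 and d2 are relevant.
judged = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2"}
lvl0 = ["d1", "d3", "d5", "d4"]   # one unjudged document (d5)
lvl1 = ["d1", "d2", "d5", "d6"]   # two unjudged documents (d5, d6)

result = figure_quantities(lvl0, lvl1, judged, relevant, k=4)
print(result)
```

In this toy query, lvl1 retrieves more unjudged documents than lvl0 yet still gains precision, which is the pattern discussed for the right-hand side of the figure.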
7.2.1 Simulated Precision
If the GIN was returning many relevant but unjudged documents, then judging these documents would lead to improvements in the measure of retrieval effectiveness. To better understand the potential gains, we provide an analysis in the form of a “simulated” precision measure: an estimate of the precision that would be obtained if all the unjudged documents were assessed. This is done both to understand the potential gains and to contrast how accurate a simulated measure might be compared to the actual measure once complete judgements were obtained through a new assessment exercise. The simulated precision is derived as follows:
• For each query $q_i$, a set of unjudged documents $U_i$ was returned by our system.
• Some portion of $U_i$ may be relevant. The probability of being relevant is $P(r \mid U_i)$.
• We could assume a uniform probability of relevance, for example, by considering the ratio of the number of judged relevant documents to the total number of judged documents in the TREC qrels (i.e., uniform across all TREC queries). Instead, a better estimate could use other indicators of relevance that are more informative of the potential performance for a given query. One indicator would be the proportion of judged relevant to total judged documents in the top 20 results for a given query, i.e., $P(r \mid U_i) = \frac{|\text{judged relevant}|}{|\text{judged}|}$.
The intuition here is that if a query contained only relevant and unjudged documents, then the unjudged documents were more likely to be relevant than for a query that contained only not relevant and unjudged documents.
• Using the above method of estimating $P(r \mid U_i)$, we can assign a certain portion of the unjudged documents as relevant for each query. Precision @ 20 is then recalculated using the additional relevant documents, providing a simulated precision measure.
[Figure 7.2 plot. x-axis: QueryId; left y-axis: number of unjudged docs in top 20 results; right y-axis: Prec@20; series: unjudged docs (lvl1), original P@20, simulated P@20.]
Figure 7.2: Simulated precision for each query, if a portion of unjudged documents are judged relevant.
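The derivation above can be sketched in a few lines. This is a hedged illustration, not the implementation used in the evaluation: the function name, the qrels representation (a mapping from document id to 1 or 0, with unjudged documents absent), and the `fallback` value used when no document in the top 20 was judged are assumptions. In particular, the sketch assumes that the number of unjudged documents counted as relevant is the expected value $P(r \mid U_i) \cdot |U_i|$, without rounding.

```python
import math
from typing import Dict, List, Tuple


def simulated_precision_at_k(ranking: List[str], qrels: Dict[str, int],
                             k: int = 20, fallback: float = 0.0) -> Tuple[float, float]:
    """Return (original P@k, simulated P@k) for one query.

    qrels maps a document id to 1 (relevant) or 0 (not relevant);
    documents missing from qrels are unjudged.
    """
    top = ranking[:k]
    judged = [doc for doc in top if doc in qrels]
    n_relevant = sum(1 for doc in judged if qrels[doc] > 0)
    n_unjudged = len(top) - len(judged)

    # Per-query estimate: judged relevant / judged, within the top k.
    # Assumption: fall back to a fixed value when nothing was judged.
    p_rel = n_relevant / len(judged) if judged else fallback

    original = n_relevant / k
    # Assumption: add the expected number of relevant unjudged documents.
    simulated = (n_relevant + p_rel * n_unjudged) / k
    return original, simulated


# Hypothetical query: 8 judged relevant, 2 judged not relevant, 10 unjudged.
ranking = [f"d{i}" for i in range(20)]
qrels = {f"d{i}": 1 for i in range(8)}
qrels.update({"d8": 0, "d9": 0})

original, simulated = simulated_precision_at_k(ranking, qrels)
print(original, simulated)
```

For this hypothetical query the estimate is $P(r \mid U_i) = 8/10$, so 8 of the 10 unjudged documents are counted as relevant and precision @ 20 moves from 8/20 to 16/20. Under the stated assumption, and for a full list of 20 results, the simulated value equals the fraction of judged documents that are relevant.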
The results of the simulated precision are provided in Figure 7.2. For each query, we show the number of unjudged documents returned by the GIN in the top 20 results. The dashed line is the original precision @ 20 for lvl1 using TREC qrels.⁴ The solid line is the simulated precision @ 20. The plot is ordered by increasing original precision @ 20. We observe that the worst performing queries tend to have a higher number of unjudged documents; this is unsurprising, as these are treated as not relevant. However, there are a number of queries that contain almost exclusively relevant and unjudged documents, with few or no irrelevant documents. These are the queries with the largest gains in simulated precision @ 20 (e.g., the peaks at queries 131, 102 and 110). Overall, we see increases in simulated precision @ 20 across a large portion of queries.
Although artificially created, these results aim to provide an indication of the improvement we may find from new relevance assessments. These simulated results are revisited after obtaining new assessments to determine how accurate they were. Further research could investigate other (more reliable) indicators of $P(r \mid U)$.
⁴ We use the term ‘original’ to denote the evaluation results obtained using the TREC MedTrack qrels. This is used later to contrast against the evaluation results obtained with additional relevance assessments.
There has been previous research into evaluating systems with limited relevance assessments. This includes the development of inferred measures [Yilmaz et al., 2008], which are proposed as a means of obtaining more accurate estimates of retrieval effectiveness when judging a relatively small number of documents (this being the case for TREC MedTrack). These measures are used as part of an approach aimed at evaluating many more queries but with fewer assessed documents per query (as opposed to the more common practice of assessing a small number of queries, each judged to near-completeness) [Carterette et al., 2008]. The reason such methods are not used as part of our evaluation is that the problem is not just that a limited number of documents from each system can be judged. Instead, the problem is that no semantic search systems contributed any documents to the pool. Irrespective of how the pool was formed, if some semantic search system never contributed documents, then potentially relevant documents retrieved by such a system would never be assessed (unless those documents were also returned by one of the keyword-based systems contributing to the pool). The problem is not the limited number of relevance assessments but the type of documents that were available for assessment in the first place.
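The pooling argument can be illustrated with a toy sketch. It assumes a simple depth-based pool (the union of the top-ranked documents of each contributing run); the text does not state how the TREC MedTrack pool was actually formed, and the run names, document ids and pool depth below are hypothetical.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Union of the top `depth` documents of every contributing run."""
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Hypothetical keyword-based runs that contributed to the pool.
contributing_runs = {
    "keyword_a": ["d1", "d2", "d3"],
    "keyword_b": ["d2", "d4", "d5"],
}
pool = build_pool(contributing_runs, depth=3)

# Hypothetical semantic run that did not contribute to the pool.
semantic_run = ["d2", "d7", "d8"]
never_assessed = [doc for doc in semantic_run if doc not in pool]
print(never_assessed)
```

Only d2 is assessed for the semantic run, and only because a keyword-based run also retrieved it. Judging the pool more deeply or sampling it differently does not change this, since d7 and d8 were never in the pool.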