While the final judgment on the quality of a system should be based on the application task, Natural Language Processing systems are usually evaluated and compared using standardised technical measures such as precision, recall, and F-measure. These measures are based on the sets that result from overlapping the obtained selection of items with the expected selection of items: the expected items that were correctly selected (true positives), the wrongly selected items (false positives), the expected items the method failed to select (false negatives), and the items that were neither selected nor expected to be selected (true negatives).
                  expected
obtained    1                 0
1           true positive     false positive
0           false negative    true negative

Table 2.4. Contingency matrix (2x2) of two binary variables, the expected and the obtained observation of the selection of an item.
Definition 2.2 (Precision). The measure precision (also known as positive predictive value) is defined as the proportion of selected items that a method selected correctly.

\[ \text{precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}} \]
A high precision indicates that most retrieved items are correct. A low precision means that a system retrieved many incorrect items. A precision of 1 means that a system retrieves only correct items.
Definition 2.3 (Recall). The measure recall (or sensitivity) is defined as the proportion of expected items that a method selected.

\[ \text{recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}} \]
A high recall indicates that most of what could have been found was found. A low recall means that a system failed to find what should have been found. A recall of 1 can theoretically be achieved by just returning everything, at the cost of a low precision. Therefore, for most evaluation tasks the performance of a system is a trade-off between precision and recall. For an easier comparison of systems, Van Rijsbergen (1979) introduced the F-measure as the harmonic mean of precision and recall.
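The trade-off between the two measures can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall`, the item labels, and the convention of returning 0.0 for an empty denominator are assumptions made here, not part of the original text.

```python
def precision_recall(obtained, expected):
    """Return (precision, recall) for an obtained and an expected selection."""
    obtained, expected = set(obtained), set(expected)
    true_positives = len(obtained & expected)    # expected and selected
    false_positives = len(obtained - expected)   # selected but not expected
    false_negatives = len(expected - obtained)   # expected but missed
    selected = true_positives + false_positives
    relevant = true_positives + false_negatives
    # Assumed convention: an empty denominator yields 0.0.
    precision = true_positives / selected if selected else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


expected = {"a", "b", "c", "d"}
# A cautious system returns few items, all of them correct.
careful = precision_recall({"a", "b"}, expected)
# A greedy system returns everything it has, including wrong items.
greedy = precision_recall({"a", "b", "c", "d", "e", "f", "g", "h"}, expected)
print(careful)  # (1.0, 0.5)
print(greedy)   # (0.5, 1.0)
```

The cautious system reaches perfect precision but misses half of the expected items, while the greedy one reaches perfect recall at the price of halving its precision.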
Definition 2.4 (F-measure). The F-measure (F) (or F-score) is defined as the weighted harmonic mean of precision and recall. A value of α = 0.5 weights recall and precision equally.

\[ \text{F-measure} = \frac{1}{\alpha \, \frac{1}{\text{precision}} + (1 - \alpha) \, \frac{1}{\text{recall}}} \]
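As a minimal sketch of this definition, the following Python function evaluates the weighted harmonic mean directly. The function name `f_measure` and the choice to return 0.0 when precision or recall is zero are assumptions for illustration; the definition itself leaves that case open.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall (alpha weights precision)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Assumed convention: the score is 0.0 if either component is 0.
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


balanced = f_measure(0.5, 1.0)                      # equal weighting
precision_heavy = f_measure(0.5, 1.0, alpha=0.8)    # precision counts more
print(round(balanced, 4))         # 0.6667
print(round(precision_heavy, 4))  # 0.5556
```

With α = 0.5 the result equals the familiar form 2 · precision · recall / (precision + recall). Raising α pulls the score towards the precision value.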
Learning accuracy. Sometimes Natural Language Processing systems are also evaluated in terms of learning accuracy. The measure was introduced by Hahn and Schnattinger (1998). It “measures not only the overall correctness of the final classification but also incorporates the distance between the position f predicted by the algorithm and the correct one s”, see Witschel (2005).
Accuracy. Generally, accuracy is defined as the percentage of all items that were classified correctly (true positives + true negatives), and the corresponding error is defined as the percentage of items that were classified wrongly (false positives + false negatives).
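A short Python sketch makes the relation between accuracy and error explicit. The function name `accuracy_and_error` and the example counts are hypothetical and serve only as an illustration; both values are returned as fractions rather than percentages.

```python
def accuracy_and_error(tp, fp, fn, tn):
    """Return (accuracy, error) as fractions of all evaluated items."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("no items to evaluate")
    accuracy = (tp + tn) / total   # correctly handled items
    error = (fp + fn) / total      # wrongly handled items
    return accuracy, error


# Hypothetical counts from a 2x2 contingency matrix.
acc, err = accuracy_and_error(tp=40, fp=10, fn=5, tn=45)
print(acc, err)  # 0.85 0.15
```

Since every item falls into exactly one of the four cells, accuracy and error always add up to 1.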
Average precision. When measuring the performance of rankings of elements (e.g. document or term rankings), the recall at a certain rank also needs to be reflected in the measure. The measurement of precision only takes the correct or relevant elements in the set of retrieved elements into account, regardless of their order. Average precision incorporates recall at a rank and is therefore a precision measure that depends on the retrieval order. Overall, average precision values are smaller than or equal to precision values, depending on the number of non-relevant elements retrieved before all relevant elements have been retrieved. Average precision can be defined as follows:
Definition 2.5 (Average precision). The average value of the precision p at rank r, p(r), is defined as

\[ \text{avgP} = \frac{\sum_{r=1}^{N} p(r) \cdot \text{rel}(r)}{\text{number of relevant elements retrieved}}, \]

where rel(r) is a binary relation returning 1 if the element at rank r is relevant and 0 otherwise.
To illustrate the underlying principle see the example below of how to calculate average precision for binary and probabilistic values.
Example 2.6. The following table shows two examples that retrieve the same elements overall (all three relevant elements among six retrieved elements, i.e. a recall of 1.0) but in different rankings.
           Example A                 Example B
rank(i)    rel(i)   p(i)   r(i)      rel(i)   p(i)   r(i)
1          1        1/1    1/3       1        1/1    1/3
2          1        2/2    1/3       1        2/2    1/3
3          0        2/3    0         1        3/3    1/3
4          0        2/4    0         0        3/4    0
5          1        3/5    1/3       0        3/5    0
6          0        3/6    0         0        3/6    0
Table 2.5. Example data for the precision at a cutoff rank to illustrate the calculation of average precision.
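The average precision avgP(A) and avgP(B) can be calculated as follows.

\[ \text{avgP}(A) = \left( \frac{1}{1} + \frac{2}{2} + \frac{3}{5} \right) \cdot \frac{1}{3} \approx 0.87 \]

\[ \text{avgP}(B) = \left( \frac{1}{1} + \frac{2}{2} + \frac{3}{3} \right) \cdot \frac{1}{3} = 1.0 \]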
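The calculation for both rankings can be reproduced with a small Python sketch of Definition 2.5 for binary relevance values. The function name `average_precision` and the choice to return 0.0 when no relevant element was retrieved are assumptions made for this illustration.

```python
def average_precision(relevance):
    """Average precision of a ranking given as a list of 0/1 relevance flags."""
    hits = 0              # relevant elements seen so far
    precision_sum = 0.0   # sum of p(r) * rel(r)
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # p(r) at a relevant rank
    # Assumed convention: 0.0 if no relevant element was retrieved.
    return precision_sum / hits if hits else 0.0


example_a = [1, 1, 0, 0, 1, 0]   # rel(i) column of Example A
example_b = [1, 1, 1, 0, 0, 0]   # rel(i) column of Example B
avg_a = average_precision(example_a)
avg_b = average_precision(example_b)
print(round(avg_a, 2))  # 0.87
print(avg_b)            # 1.0
```

Both rankings contain the same three relevant elements, but Example A is penalised because two non-relevant elements are ranked above its third relevant one.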
No matter which measure is chosen for evaluation, its meaningfulness has to be judged on a case-by-case basis. Several difficulties have been discussed in the literature. For text-mining dependent systems, an evaluation can be based either on an expert's judgement or on a gold standard. A gold standard can be a data set produced by a method that is widely accepted as being the best available, or it can be manually created by experts to compare different systems. Gold standard evaluations are common in biomedical information retrieval, but few benchmarks are available. Known reference corpora (solutions) for text retrieval are created at the Text REtrieval Conference (TREC) workshops, which have had tracks on the retrieval of genomic data, on general question answering, recently on large-scale search in chemistry-related documents, and on exploring information-seeking behaviors common in general web search. For the task of named entity recognition (NER), Hakenberg (2007) named five facts that make the evaluation difficult. Four of them apply generally to evaluation tasks in biomedical text mining.
1. Availability of corpora (data sets): Few corpora are available that are sufficiently large for meaningful comparisons. Very often tools are only evaluated on 10 to 100 PubMed abstracts.
2. Annotation is subjective: For NER in particular, it cannot be assumed that the annotator is aware of all gene and protein names. In other areas, not all annotators may have the wider understanding needed to decide on the validity of an annotation. Therefore, annotation guidelines are important, and the same document or data should be annotated by several persons to obtain the inter-annotator agreement.
3. Matching accuracy: One can be variably strict in the decision on true positives. For question answering one could, for instance, require only the exact answer, allow similar but correct answers, or reward correct facts contained in answers. Answers that are not quite correct can be penalised twice, as they are counted as a false negative for the missed fact and as a false positive for the not quite correct retrieval, and vice versa.
4. Unbiased evaluation: The evaluation should always be performed in a task-specific manner to recognise and avoid the tuning of methods for just one of the tasks.
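The inter-annotator agreement mentioned in point 2 can be quantified in several ways. The text does not name a specific measure, so the sketch below uses Cohen's kappa for two annotators as one common choice; this choice, the function name `cohen_kappa`, and the example labels are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement of two annotators corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if chance == 1.0:
        return 1.0   # both annotators used one identical label throughout
    return (observed - chance) / (1.0 - chance)


# Hypothetical annotations of eight tokens by two annotators.
annotator_1 = ["gene", "gene", "other", "other", "gene", "other", "gene", "other"]
annotator_2 = ["gene", "other", "other", "other", "gene", "other", "gene", "gene"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)  # 0.5
```

The two annotators agree on six of eight tokens (0.75), but half of that agreement would be expected by chance, which leaves a kappa of 0.5.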