In the previous section, we analyzed how DoSeR performs on general knowledge from Wikipedia and DBpedia. In this section, we evaluate our system on a specialized domain, namely the biomedical domain. As in Chapter 5, we use the CalbC corpus for training and evaluation. An in-depth description of the respective CalbC subcorpora CalbCSmall and CalbCBig can be found in Section 5.4. To provide a better comparison, we contrast the DoSeR results with those achieved by the federated Learning To Rank (LTR) approach proposed in Chapter 5. Since the LTR approach does not collectively link all surface forms within a document, we report the DoSeR results for both collective and non-collective EL. In the collective configuration, our algorithm cannot by itself retrieve a ranked list of (correct) entity assignments for each surface form. Consequently, to return multiple correct EL results, we modified our approach to return the list of remaining candidate entities in the Final Linking and Abstaining step (cf. Section 8.2.3), sorted by their PageRank score. In the non-collective configuration, the DoSeR algorithm relies on the Sense Prior probability and the textual context matching score (computed with Doc2Vec), which allows us to return a relevance-sorted entity list. In both configurations, we return a list containing at most 10 entities (as with the LTR approach in Section 5.5.3).
In our general-domain evaluation, we used the Wikipedia article pages as the Doc2Vec training corpus. In CalbC, unlike in Wikipedia, the documents do not describe entities. For that reason, we created our entity-context embeddings from the surrounding context of annotated entities (cf. Section 7.2). During the training phase, we used a context window of 100 words before and after the surface form (as suggested in Chapter 7).
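The context extraction described above can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the function names `context_window` and `training_pairs`, the token-offset annotation format, and the choice to exclude the surface form itself from the window are all assumptions made for the example.

```python
from typing import List, Tuple

def context_window(tokens: List[str], start: int, end: int,
                   size: int = 100) -> List[str]:
    """Return up to `size` tokens before and after tokens[start:end].

    Assumption: the surface form itself is left out of the context.
    """
    left = tokens[max(0, start - size):start]   # clipped at document start
    right = tokens[end:end + size]              # clipped at document end
    return left + right

def training_pairs(tokens: List[str],
                   annotations: List[Tuple[int, int, str]],
                   size: int = 100) -> List[Tuple[str, List[str]]]:
    """Build (entity_id, context_tokens) pairs from annotated surface forms.

    `annotations` holds hypothetical (start, end, entity_id) token offsets.
    """
    return [(entity_id, context_window(tokens, start, end, size))
            for start, end, entity_id in annotations]
```

Each resulting pair could then be passed to a document-embedding trainer, with the entity identifier serving as the document tag.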
Since the CalbC provides multiple correct entity annotations per surface form, we report the mean reciprocal rank (MRR), recall and mean average precision (MAP) in this evaluation. All measures were averaged over the runs of a 5-fold cross-validation. For every cross-validation run, we used the union of the 4 training partitions to train our embeddings (i.e., entity embeddings and entity-context embeddings).
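For reference, the three measures can be computed per surface form and then averaged, as in the following sketch. It assumes the textbook definitions, a cutoff of 10 returned entities, and average precision normalized by the number of correct entities; the exact variants used in this evaluation are not restated here, and all function names are illustrative.

```python
from typing import Dict, List, Set

def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first correct entity, or 0 if none is returned."""
    for rank, entity in enumerate(ranked, start=1):
        if entity in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the correct entities found among the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def average_precision(ranked: List[str], relevant: Set[str],
                      k: int = 10) -> float:
    """Mean of precision@rank over the ranks holding a correct entity.

    Assumption: normalized by the total number of correct entities.
    """
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, entity in enumerate(ranked[:k], start=1):
        if entity in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def evaluate(rankings: List[List[str]],
             gold: List[Set[str]]) -> Dict[str, float]:
    """Average the three measures over all surface forms."""
    n = len(rankings)
    pairs = list(zip(rankings, gold))
    return {
        "MRR": sum(reciprocal_rank(r, g) for r, g in pairs) / n,
        "Recall": sum(recall_at_k(r, g) for r, g in pairs) / n,
        "MAP": sum(average_precision(r, g) for r, g in pairs) / n,
    }
```

In a cross-validation setting, `evaluate` would be called once per fold and the resulting values averaged over the folds.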
Figure 8.4 shows the MRR, recall and MAP values of DoSeR (collective and non-collective) and the federated LTR approach on the CalbCSmall data set. Overall, the non-collective approach of DoSeR performs worse than our LTR approach. Our LTR feature set is evidently superior (≈ 4 to 6 percentage points on our measures) to the DoSeR feature set, which comprises only the Sense Prior and the surrounding context matching with Doc2Vec. By contrast, our collective approach achieves the best results overall and outperforms the LTR approach. An MRR of 0.937 indicates a high level of reliability in ranking a correct entity on top. In terms of recall and MAP, DoSeR-collective tops the LTR approach by ≈ 3 percentage points. An evaluation on the CalbCBig data set yields nearly the same values for all approaches.
[Bar chart, value axis from 0.5 to 1. Values as MRR / Recall / MAP: DoSeR (Collective) 0.938 / 0.747 / 0.732; DoSeR (Non-Collective) 0.889 / 0.704 / 0.674; LTR (Federated) 0.927 / 0.718 / 0.709]
Figure 8.4: MRR, recall and MAP values of DoSeR (collective), DoSeR (non-collective) and the federated LTR approach on CalbCSmall
We also conducted an experiment with our default DoSeR settings as used in the general-domain experiments. Here, we analyzed whether the retrieved entity (only one entity is retrieved by default) is among the ground truth entities. Using the 0-1 loss, i.e., we lose a point for every wrong entity, we obtain an accuracy of ≈ 0.871. Applying the same measure to our LTR approach yields an accuracy of ≈ 0.848. In summary, DoSeR outperforms our federated LTR approach and performs well in the biomedical domain. Although the LOD cloud lacks relevant entity data for EL [Zwi13b], DoSeR is able to leverage the evidence provided by annotated entities in the document-centric KB to deliver strong EL results.
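The 0-1 accuracy used in this experiment can be illustrated with a short sketch. It assumes that each surface form comes with one predicted entity and a set of ground truth entities; the function name `zero_one_accuracy` is hypothetical.

```python
from typing import List, Set

def zero_one_accuracy(predictions: List[str],
                      gold_sets: List[Set[str]]) -> float:
    """Share of surface forms whose single predicted entity is correct.

    A prediction counts as correct if it appears in the ground truth
    set of its surface form (0-1 loss: no partial credit).
    """
    if not predictions:
        return 0.0
    correct = sum(1 for predicted, gold in zip(predictions, gold_sets)
                  if predicted in gold)
    return correct / len(predictions)
```

For example, three correct predictions out of four surface forms give an accuracy of 0.75.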
8.4.5 Abstaining
Abstaining is an important task in EL algorithms when linking surface forms whose referent entity is not in the entity target set 𝛺. It is also used if there is uncertainty about the correct entity due to insufficient context information.
In this experiment, our algorithm returns the pseudo-entity NIL in the following situations:
• If no candidate entities can be found during the candidate generation step (cf. Section 8.2.4).
• If the algorithm is uncertain about the correct entity after the last PageRank iteration (cf. Algorithm 8.2.3).
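The two abstaining cases listed above can be sketched as a simple decision rule. The uncertainty criterion shown here (the best candidate's share of the total score falling below a threshold) and the default threshold of 0.5 are purely hypothetical stand-ins; the actual criterion is the one defined in the Final Linking and Abstaining step of our algorithm.

```python
from typing import Dict

NIL = "NIL"  # pseudo-entity returned when abstaining

def link_or_abstain(scores: Dict[str, float], threshold: float = 0.5) -> str:
    """Return the best candidate entity or NIL.

    `scores` maps candidate entities to their final relevance scores
    (e.g., after the last PageRank iteration).
    """
    # Case 1: candidate generation found no candidates.
    if not scores:
        return NIL
    total = sum(scores.values())
    best, best_score = max(scores.items(), key=lambda item: item[1])
    # Case 2 (hypothetical criterion): the top candidate does not
    # dominate the remaining candidates clearly enough.
    if total <= 0 or best_score / total < threshold:
        return NIL
    return best
```

Raising `threshold` makes this rule abstain more aggressively, which corresponds to tuning the abstaining parameter.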
For this experiment, we downloaded the original IITB data set, which contains 7652 NIL annotations in addition to the default annotations (18 897 annotations overall), and report the EL accuracy. We also reran the GERBIL experiments with abstaining to investigate to what extent the results decrease on data sets that do not provide NIL ground truth annotations.
Conducting the experiment on the manually downloaded IITB data set resulted in an EL accuracy of 0.757 (micro-averaged). Of the 6120 NIL annotations that our algorithm returns overall, 3823 (≈ 62.5%) stem from surface forms for which no candidates are found, and 2297 (≈ 37.5%) from surface forms on which it abstains. When we tune our abstaining parameter to abstain more aggressively, our overall accuracy decreases slightly. Unfortunately, the authors of the topic-model-based state-of-the-art approach [Han12] on this data set did not provide abstaining results for comparison. However, Table 8.6 reports the micro F1 values of our algorithm with abstaining on all data sets in the GERBIL evaluation.
Table 8.6: F1 values of our approach with abstaining on data sets without NIL annotations

Data Set               F1      Change in F1 (percentage points)
ACE2004                0.892   -1.65
AIDA-TestB             0.782   -0.26
AQUAINT                0.820   -2.61
DBpedia Spotlight      0.773   -4.57
MSNBC                  0.906   -0.55
N3-Reuters128          0.809   -4.82
IITB                   0.722   -2.56
Microposts-2014 Test   0.607   -7.07
N3 RSS-500             0.738   -1.73
Because GERBIL does not query surface forms with NIL annotations in the ground truth, our results decrease (slightly). Nevertheless, the number of abstained surface forms is very limited and, thus, our approach still outperforms Wikifier on 6 out of 9 data sets. On the Microposts-2014 Test data set, the F1 decrease is the highest at ≈ 7 percentage points. Here, our algorithm is sometimes uncertain about the correct entity and abstains, which is due to the small number of surface forms per document. In other words, our algorithm lacks sufficient evidence about the correct entity and, hence, abstains because the abstaining threshold is exceeded.
In summary, our algorithm is able to successfully abstain from entity annotations if evidence about the correct entities is missing. Our abstaining mechanism performs well even if data sets do not provide NIL annotations (as simulated by GERBIL).