Research in 2009 – The Rebholz-Schuhmann Group
Semantic standardisation of the scientific literature
Dietrich Rebholz-Schuhmann
Master in Medicine, 1988, University of Düsseldorf. PhD in immunology, 1989, University of Düsseldorf. Master in Computer Science, 1993, Passau. Senior scientist at GSF, Munich and LION bioscience AG, Heidelberg. At EMBL-EBI since 2003.
INTRODUCTION
Text mining comprises the fast retrieval of relevant documents from the whole body of the literature (e.g. the Medline database) and the subsequent extraction of facts from their text. Text-mining solutions are now mature enough to be integrated automatically into research workflows and into services for the general public, for example the delivery of annotated full-text documents as part of UK PubMed Central (UKPMC).
Research in the Rebholz-Schuhmann group is focused on fact extraction from the literature. It is our goal to automatically connect literature content to other biomedical data resources (e.g. bioinformatics databases) and to evaluate the results. Ongoing research targets the recognition of biomedical terms (genes, proteins, Gene Ontology labels) and the identification of relationships between them.
The work in the research group is split into different parts: 1) research work in named entity recognition and its quality control (e.g. UKPMC project); 2) knowledge discovery tasks, e.g. for the identification of gene–disease associations; and 3) further development of the IT infrastructure for information extraction. All parts are tightly coupled.
RESEARCH IN NAMED ENTITY RECOGNITION
Standardisation of the scientific literature: UKPMC and CALBC
Vivian Lee, Jung-Jae Kim, Piotr Pezik, Anika Oellrich, Menaka Naraysamy
The research work of the Rebholz-Schuhmann group is concerned with the integration of the scientific literature with the bioinformatics data resources. One important part of this research work is the identification of named entities, e.g. genes, proteins, diseases, species, from the scientific literature, and subsequently linking the entities to an entry in a reference database, for example UniProtKB for proteins. Both steps are challenging and require the use of natural language processing techniques as well as statistical methods. Several solutions are underway to normalise the representation of concepts in the scientific literature: 1) provision of a standardised lexical resource (BioLexicon); 2) definition of a schema that enables the annotation of entities in the scientific text; 3) availability of an IT infrastructure that annotates the documents with named entities and links the entities to the reference data resources; and 4) means to measure the performance and improve the quality of the annotations.
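The two steps described above, finding an entity mention and linking it to a reference entry, can be illustrated with a minimal dictionary lookup. This is only a sketch under stated assumptions: the lexicon, the identifiers (EXAMPLE:…) and the function names are hypothetical and do not reproduce the group's actual pipeline or real database accessions.

```python
import re

# Hypothetical mini-lexicon: surface form -> (entity type, reference identifier).
# The identifiers are illustrative placeholders, not real database entries.
LEXICON = {
    "interleukin 2": ("protein", "EXAMPLE:0001"),
    "il-2": ("protein", "EXAMPLE:0001"),
    "insulin": ("protein", "EXAMPLE:0002"),
    "diabetes mellitus": ("disease", "EXAMPLE:0003"),
}


def build_pattern(lexicon):
    # Put longer terms first so that a long name wins over its substrings.
    terms = sorted(lexicon, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # Lookarounds keep matches from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(" + alternation + r")(?!\w)", re.IGNORECASE)


def annotate(text, lexicon=LEXICON):
    """Return (start, end, surface form, entity type, reference id) tuples."""
    pattern = build_pattern(lexicon)
    results = []
    for match in pattern.finditer(text):
        entity_type, reference = lexicon[match.group(1).lower()]
        results.append(
            (match.start(), match.end(), match.group(1), entity_type, reference)
        )
    return results
```

Two synonyms ("interleukin 2" and "IL-2") map to the same placeholder identifier here, which is the normalisation aspect of the task: different surface forms are resolved to one reference entry.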
In order to provide full coverage of domain knowledge in molecular biology, the Rebholz-Schuhmann group has undertaken research to generate a complete terminological resource (BioLexicon) for gene and protein names (GPNs), chemical entities and ontological terms (e.g. Gene Ontology) as part of the European research project ‘BOOTStrep’ (www.bootstrep.org). A number of bioinformatics resources have been incorporated into this BioLexicon, for example, the BioThesaurus (Liu et al., 2006), to cope with nonsense names and identify ambiguous terms. The quality of the BioLexicon has been assessed by its ability to improve the performance of named entity recognition for genes and proteins. Furthermore, the BioLexicon has been enriched with information from other resources, such as the scientific literature, and includes novel terms and confidence values for their relevance to the contained concepts.
In recent years, we have proposed a schema for the enrichment of the scientific literature with concept mentions (Rebholz-Schuhmann, Kirsch & Nenadic, 2006). This solution has now been implemented in the literature analysis services of the Rebholz-Schuhmann group (Whatizit/IeXML) and is used for the comparison and evaluation of annotations delivered from different annotation services. The BioLexicon serves as a standard reference database for biomedical terms and is similar to the UMLS lexical resource for the medical domain, which supports research on the annotation of scientific literature. All the different resources have been integrated into a text-mining solution that indexes the full body of scientific literature as part of the UKPMC project. The annotations are delivered through CiteXplore (www.ebi.ac.uk/citexplore/) to the British Library for public use via the UKPMC interface.
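As an illustration of what enriching text with concept mentions can look like, the following sketch wraps annotated spans in inline XML elements. The element name `e`, the `id` attribute layout and the identifier EXAMPLE:0002 are assumptions made for this example; the published schema should be consulted for the actual element and attribute definitions.

```python
from xml.sax.saxutils import escape, quoteattr


def to_inline_xml(text, annotations):
    """Wrap each (start, end, concept_id) span of `text` in an <e> element.

    The element name and attribute are hypothetical; spans must not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, concept_id in sorted(annotations):
        if start < cursor:
            raise ValueError("overlapping annotations are not handled in this sketch")
        # Escape plain text so that the output stays well-formed XML.
        pieces.append(escape(text[cursor:start]))
        pieces.append(
            "<e id=%s>%s</e>" % (quoteattr(concept_id), escape(text[start:end]))
        )
        cursor = end
    pieces.append(escape(text[cursor:]))
    return "".join(pieces)
```

Because every service would emit the same inline format, the annotations of two services over the same sentence can be compared span by span, which is the use case named in the paragraph above.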
The Rebholz-Schuhmann group is preparing a competition for the annotation and standardisation of the scientific literature called the ‘Collaborative Annotation of a Large Biomedical Corpus’ (CALBC) (see next page).
Identification of gene/protein named entities, species and diseases in scientific literature
Jee-Hyub Kim, Ian Lewin, Romain Tertiaux, Abhishek Dexit, Anika Oellrich
The identification of named entities for genes and proteins is ongoing work and is embedded into the research work for the UKPMC project (figure 1). The research team is collaborating with the National Centre for Text Mining (NaCTeM, Professor Sophia Ananiadou) in this project. New solutions have been developed over the past year, which combine dictionary-based gene mention identification with a machine learning solution.
The connection of gene/protein entities to database entries requires the identification of species-specific terms from the context of genes and proteins. To improve the normalisation, the Rebholz-Schuhmann group has advanced species identification from the scientific literature using a dictionary-based method. In this approach, statistical information on the distribution of species names in the literature has been used to reduce the false positive rate. Furthermore, the lexical resource has been adapted to the demands of literature analysis by reducing the false identification of species. The new solution for species recognition also provides advantages for general information retrieval of full-text documents since it identifies not only the species but also the genus and any other name from the upper parts of the taxonomic hierarchy. The final solution is available through the Whatizit infrastructure and is integrated into the UKPMC prototype.
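One way to picture the approach described above is a dictionary lookup that skips names known to be ambiguous in the literature and returns the higher taxonomic ranks together with each hit. The sketch below is an assumption-laden illustration: the precision estimates are invented numbers, the threshold is arbitrary, and the filtering rule is a stand-in for whatever statistical method the group actually used.

```python
import re

# Hypothetical dictionary: species name -> lineage (species, genus, family, order).
LINEAGE = {
    "homo sapiens": ["Homo sapiens", "Homo", "Hominidae", "Primates"],
    "mus musculus": ["Mus musculus", "Mus", "Muridae", "Rodentia"],
    "cat": ["Felis catus", "Felis", "Felidae", "Carnivora"],
}

# Invented corpus statistics: estimated share of occurrences of each name
# that really denote the species ("cat" is often an abbreviation instead).
PRECISION_ESTIMATE = {"homo sapiens": 0.99, "mus musculus": 0.99, "cat": 0.20}


def find_species(text, min_precision=0.5):
    """Return {species: higher taxa} for names that pass the ambiguity filter."""
    hits = {}
    lowered = text.lower()
    for name, lineage in LINEAGE.items():
        if PRECISION_ESTIMATE.get(name, 0.0) < min_precision:
            continue  # too ambiguous according to the assumed statistics
        if re.search(r"(?<!\w)" + re.escape(name) + r"(?!\w)", lowered):
            hits[lineage[0]] = lineage[1:]
    return hits
```

Returning the genus and higher ranks with each hit means that a retrieval query for "Rodentia" could also find documents that only mention Mus musculus.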
Further research work is concerned with the identification of disease terms and phenotypic information.
Identification of chemical named entities in patent texts
Delphine Bas, Piotr Pezik, Adam Bernard
In collaboration with the ChEBI team and the European Patent Office (EPO), the group is identifying Named Chemical Entities (NCEs) in biochemical patent documents. Members from the EPO and the ChEBI team have provided a manually annotated gold standard corpus that serves as training and test data. The ultimate goal is the automatic extraction of NCEs in patent data, which can then be considered for addition to the ChEBI resource.
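A gold standard corpus is typically used to score automatic annotations by precision, recall and F1. The sketch below assumes exact span matching on (start, end) offsets, which is one common convention and not necessarily the one applied in this project; the function name is hypothetical.

```python
def evaluate(predicted, gold):
    """Score predicted spans against gold spans using exact matching.

    Both arguments are iterables of hashable spans, e.g. (start, end) tuples.
    Returns (precision, recall, f1).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with three gold spans and three predicted spans of which two agree exactly, precision, recall and F1 all equal 2/3.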
Figure 1. Overview of the use of bioinformatics data resources for the standardisation and semantic enrichment of full-text documents as part of the UKPMC project.