Many of the software components developed to implement the ontology generation methods are shared projects. In the following, the participating developers are listed for each shared project.
Ontology Learning
• Idavoll – algorithms for extracting and ranking terms and definitions
• IdavollPlatform – web application built on Google GWT to access and demonstrate the term generation methods
General text mining data structures
Shared work with Loic A. Royer and Andreas Doms. Text-mining data structures and general taggers (tokenizers, stemmers, etc.).
• ElivagarCore – data structures and annotation framework
• Elivagar – general text-mining and word sense disambiguation
Ontology learning web services
Web services that provide access to the ontology learning methods for the ontology editors OBO-Edit and Protégé and for GoPubMed.
• GoPubMedOntologyGenerationServiceLogModule
• GoPubMedDefinitionGenerationService
• GoPubMedOntologyLookupService
• GoPubMedTermGenerationService
Resource web services
• GoogleNgramService – web service providing cached access to an index over the large WebCT n-gram source
• PubMedNGramWebService – web service to provide access to n-grams extracted from 18 million PubMed abstracts
• PubMedTokenStatisticsWebService – web service to access token frequencies and sentence-wise co-occurrences extracted from 18 million PubMed abstracts
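The two statistics named in the last item, token frequencies and sentence-wise co-occurrences, can be illustrated with a minimal Java sketch. It is not the code of the web service: the class `TokenStatisticsSketch`, its methods, and the naive whitespace tokenizer are assumptions made for illustration only, and a pair of tokens is assumed to co-occur once per sentence in which both appear.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch; not the implementation behind PubMedTokenStatisticsWebService. */
class TokenStatisticsSketch {
    private final Map<String, Integer> tokenFrequency = new HashMap<>();
    private final Map<String, Integer> pairCount = new HashMap<>();

    /** Adds one sentence; tokenization is a naive lower-cased whitespace split (assumption). */
    void addSentence(String sentence) {
        String normalized = sentence.toLowerCase(Locale.ROOT).trim();
        if (normalized.isEmpty()) {
            return;
        }
        TreeSet<String> distinct = new TreeSet<>();
        for (String token : normalized.split("\\s+")) {
            tokenFrequency.merge(token, 1, Integer::sum); // every occurrence counts
            distinct.add(token);
        }
        // Each unordered pair of distinct tokens is counted once per sentence.
        String[] sorted = distinct.toArray(new String[0]);
        for (int i = 0; i < sorted.length; i++) {
            for (int j = i + 1; j < sorted.length; j++) {
                pairCount.merge(pairKey(sorted[i], sorted[j]), 1, Integer::sum);
            }
        }
    }

    int frequency(String token) {
        return tokenFrequency.getOrDefault(token.toLowerCase(Locale.ROOT), 0);
    }

    int cooccurrences(String a, String b) {
        String key = pairKey(a.toLowerCase(Locale.ROOT), b.toLowerCase(Locale.ROOT));
        return pairCount.getOrDefault(key, 0);
    }

    /** Order-independent key; tokens never contain a tab because of the whitespace split. */
    private static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a;
    }

    public static void main(String[] args) {
        TokenStatisticsSketch stats = new TokenStatisticsSketch();
        stats.addSentence("apoptosis regulates cell death");
        stats.addSentence("cell death is regulated");
        System.out.println(stats.frequency("cell"));
        System.out.println(stats.cooccurrences("death", "cell"));
    }
}
```

Counting at the sentence level rather than over whole abstracts keeps the co-occurrence signal local. A service over 18 million abstracts would presumably precompute and index such counts instead of holding them in in-memory maps.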
Programmatic access to GoPubMed
Software to access GoPubMed documents and annotations
• GoMeshPubMed – access all documents and annotations like in GoPubMed
• PubMedSearch – provide search in PubMed
• PubMedSearchViaYggdrasil – provide search like in GoPubMed caches
• YggOntologies – access the ontologies used in GoPubMed
Lucene Indexing
The fulltext indexing of PubMed is shared work with Heiko Dietze.
• LuceneGoogleIndexing – indexing all n-grams contained in the Google WebCT corpus
• LucenePubMedIndexing – indexing PubMed abstracts
• ElivagarDatasourcesLucene – framework adapter to access Lucene indices
• PubMedFullTextIndex – indexing and accessing a Lucene PubMed full-text index
Other software component
• MSNLiveSearchClient – Microsoft Live Search client
Integration of Ontology Generation in Ontology Editors
References
Wächter, T. and Schroeder, M. (2010). Semi-automated ontology generation within OBO-Edit. In ISMB (Supplement to Bioinformatics), Impact factor 2009: 4.3 (accepted for publication)
Winnenburg, R., Wächter, T., Plake, C., Doms, A., and Schroeder, M. (2008). Facts from text: Can text mining help to scale-up high-quality manual curation of gene products with ontologies? Briefings in Bioinformatics, 9(6):466–478 (Impact factor 2009: 4.6)
Conferences / Workshops
Wächter, T. and Schroeder, M. (2009). An Ontology Generation Plugin for OBO-Edit. 3rd International Biocuration Conference, April 16-19, Berlin, Germany
Wächter, T. (2009). The Ontology Generation Tool for OBO-Edit. Presentation at the GO Consortium & SAB Meeting, September 23-25, Cambridge, United Kingdom
The DOG4DAG ontology generation methods developed in this thesis have been seamlessly integrated in OBO-Edit and Protégé, two widely used ontology editors in the life sciences. The system offers either to submit a query to PubMed or the Web, or to upload text or PDF documents. While PubMed is the default source for terminology, the Web is often useful since full-text articles and other online resources can be implicitly included in the search. When adding a term to the ontology, possible parents are suggested on the basis of generated definitions. Existing terms from other ontologies are automatically cross-referenced.
Recent examples (Section 7.6) show how the OBO-Edit Ontology Generation Tool can support the annotation of genes and gene products and the associated extension of the Gene Ontology.
In addition, a novel collaborative taxonomy editor has been specified as a user-friendly, web-based alternative to existing ontology editors. It allows domain experts to contribute to the Go3R ontology without having to install or learn new, complex software systems. The editor directly modifies the ontology of the semantic search engine, so changes take immediate effect on subsequent searches.
Fig. 7.1. Overview of the integration of ontology learning methods in ontology editors. The term generation uses several corpus statistics for frequencies and co-occurrences of terms and retrieves documents from PubMed and the Yahoo search engine. The definition generation uses the Windows Live Search engine in addition to Yahoo. The ontology look-up is performed using the Ontology Look-up Service provided by the European Bioinformatics Institute, Cambridge, UK.
7.1 Introduction
As science advances, ontological knowledge needs to evolve, and ontologies need to be maintained. This evolution process includes adding new concepts, deleting obsolete concepts, and restructuring already defined concepts, as well as adding synonyms, definitions, and relations. Creating and maintaining such ontologies is a labour-intensive, difficult, manual process.
In previous chapters it has been evaluated to what extent semi-automatic ontology generation methods can support this process. To contribute to the automation of ontology generation, the algorithms and methods developed in this thesis have been integrated into Protégé and OBO-Edit, two widely used ontology editors in the life sciences.
Figure 7.1 provides a structural overview of the software presented in this chapter. All three editors share the same service infrastructure for term generation, definition extraction, taxonomy generation, and ontology look-up. OBO-Edit and Protégé share the DOG4DAG GUI widget, which encapsulates all ontology generation functionality and the communication with the web services. For each editor, specialised adapters had to be implemented for the different ontology models and plug-in mechanisms.
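The layering described here (shared services, a shared widget, and one adapter per editor) can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the DOG4DAG code: every type and method name (`GenerationService`, `EditorAdapter`, `GenerationWidget`, `RecordingAdapter`, `CannedService`) is hypothetical, and the canned service merely stands in for the remote web services.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical and only illustrate the layering.

/** Stand-in for the shared web services (term and definition generation). */
interface GenerationService {
    List<String> generateTerms(String query);
    String generateDefinition(String term);
}

/** Editor-specific side: maps a generated term onto one editor's ontology model. */
interface EditorAdapter {
    void addTerm(String label, String definition, String parentLabel);
}

/** Shared widget logic; it knows the services and an adapter, but no editor internals. */
class GenerationWidget {
    private final GenerationService service;
    private final EditorAdapter adapter;

    GenerationWidget(GenerationService service, EditorAdapter adapter) {
        this.service = service;
        this.adapter = adapter;
    }

    List<String> suggestTerms(String query) {
        return service.generateTerms(query);
    }

    /** Fetches a definition for the accepted term and hands both to the editor. */
    void acceptTerm(String term, String parentLabel) {
        adapter.addTerm(term, service.generateDefinition(term), parentLabel);
    }
}

/** Toy adapter that records terms as simplified text lines instead of editing an ontology. */
class RecordingAdapter implements EditorAdapter {
    final List<String> added = new ArrayList<>();

    @Override
    public void addTerm(String label, String definition, String parentLabel) {
        added.add(label + " is_a " + parentLabel + " ; def: " + definition);
    }
}

/** Canned service so the sketch runs without network access. */
class CannedService implements GenerationService {
    @Override
    public List<String> generateTerms(String query) {
        List<String> terms = new ArrayList<>();
        terms.add(query + " pathway");
        terms.add(query + " regulation");
        return terms;
    }

    @Override
    public String generateDefinition(String term) {
        return "A placeholder definition of " + term + ".";
    }
}

class AdapterSketchDemo {
    public static void main(String[] args) {
        RecordingAdapter adapter = new RecordingAdapter();
        GenerationWidget widget = new GenerationWidget(new CannedService(), adapter);
        String first = widget.suggestTerms("apoptosis").get(0);
        widget.acceptTerm(first, "biological process");
        System.out.println(adapter.added.get(0));
    }
}
```

In such a design, supporting a further editor would only require another `EditorAdapter` implementation, while the widget and the service interface stay unchanged.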
OBO-Edit. Ontologies and taxonomies have proven highly beneficial for biocuration. The Open Biomedical Ontology Foundry (www.obofoundry.org) alone lists over 90 ontologies, mainly built with OBO-Edit. To address the needs of biocurators, ontology generation methods have been integrated in OBO-Edit, the ontology editor developed and maintained by the Gene Ontology Consortium.
Protégé. To give equal support to developers of ontologies in OWL, the term and definition generation has been integrated in Protégé, a widely used ontology editor.
Go3R Editor. Ontology development, as performed in this thesis, is also largely motivated by the application of ontology-based literature search. The major bottleneck here is the availability of suitable ontologies. To transfer the technology to other knowledge domains, new ontologies need to be created. A review of existing editors revealed that none of the existing tools meets the requirements for the collaborative creation of taxonomies. To overcome this limitation, a novel ontology editor has been specified to support collaborative ontology development and to integrate ontology generation methods.
Outline
Following a brief overview of existing ontology editors (Section 7.4), this chapter provides a detailed description of the OBO-Edit Ontology Generation Tool. Real tasks performed by researchers editing the Gene Ontology are used to explain how this new tool supports the annotation of genes and proteins as well as the resulting extension of the Gene Ontology (Section 7.6).
Secondly, a novel user-friendly editor for taxonomies, such as the one used by the semantic search engine Go3R, is introduced.
• Integration of ontology learning methods in the widely recognized editors OBO-Edit (Section 7.2) and Protégé (Section 7.3)
• Design and development of an ontology editor for Go3R and the integration of ontology learning methods (Section 7.4)
• Introduction to the web-based term generation platform (Section 7.5)
• Summary and Discussion (Section 7.7)
• Future Work (Section 7.8)
• Contributions (Section 7.9)