MFOrepresents the information required to build and validate anSBMLmodel in a way accessible to machines. The three separate sources of this information meet the needs of distinct user groups and, while optimal for the existing range of human-driven applications, this disparate arrangement is not suitable for computational approaches to model construction and annotation. This is primarily because much of the knowledge about what constitutes valid and meaningfulSBMLmodels is tied up in a non-machine-readable format. Once available to machines, the model is exposed to reasoners which can be used to check for internal consistency and correct structure.
MFOis not intended to be a replacement for any of the APIs or software programs available to the
SBMLcommunity. Rather,MFO is a complementary approach that facilitates the development of the rule-based mediation integration strategy (Chapter5) for model improvement and manipulation by drawing upon internal consistency and inference capabilities. WithMFO, inconsistencies both in the specification documents and in the models themselves can be identified. It addresses the very specific need of a sub-community within SBMLthat wishes to be able to express their models in
OWLfor the purpose of reasoning, validation and querying.
The scope of MFO does not extend to mapping other domains or integrating other data sources. These integrative tasks have a logically distinct purpose. MFO can be used on its own or as one of the source ontologies integrated via rule-based mediation. Chapter5goes beyond the standalone usage described here to explain how rule-based mediation utilisesMFOwith other source ontologies to integrate and query systems biology data. In rule-based mediation, MFOhas a dual role: firstly, it acts as the format representation for SBML models such as those stored within the BioModels database, and secondly it exports query results as new annotation onSBMLdocuments.
Chapter 5
Rule-based mediation for the semantic
integration of systems biology data
Introduction
The creation of accurate quantitative SBMLmodels is a time-intensive manual process. Modellers need to know and understand both the systems they are modelling and the intricacies of theSBML
format. However, the amount of relevant data for even a relatively small and well-scoped model can be overwhelming. Systems biology model annotation in particular can be time-consuming; the data required to perform the annotation is represented in many different formats and present in many different locations. The modeller may not even be aware of all suitable resources.
There is a need for computational approaches that automate the integration of multiple sources to enable the model annotation process. Ideally, the retrieval and integration of biological knowledge for model annotation should be performed quickly, precisely, and with a minimum of manual effort. If information is added manually, it is very difficult for the modeller to annotate exhaustively. Current systems biology model annotation tools rely mainly on syntactic data integration methods such as query translation. A review of data integration methodologies in the life sciences is available in Section1.6, and the Saint syntactic integration system for model annotation is described in detail in Chapter3.
While large amounts of data can be gathered and queried with syntactic integration, the integration process is limited to resolving formatting differences without attempting to address semantic hetero- geneity, which occurs when a concept in one data source is matched to a concept in another data source which is not exactly equivalent. Semantic data integration methods can be used to resolve differences in the meaning of the data. This chapter describes rule-based mediation, a novel method- ology for semantic data integration. While rule-based mediation can be used for the integration of any domain of interest, in this instance it has been applied to systems biology model annotation. Off- the-shelf Semantic Web technologies have been used to provide novel annotation and interactions for systems biology models. Existing tools and technology provide a framework around which the system is built, reducing development time and increasing usability.
There are three main parts to rule-based mediation: the source ontologies for describing the disparate data formats, the mapping rules for linking the source ontologies to a biologically-relevant core, and the core ontology for describing and manipulating semantically-homogeneous data (see Figure5.1). In rule-based mediation, the data is materialised in the core ontology by using a set of rules to translate the data from the source ontologies to the core ontology, avoiding the need for complex query translation algorithms.
Figure 5.1: An overview of rule-based mediation. One or more data sources sharing the same format (shown here as small circles) can be represented using a source ontology (shown as pentagons). Rules, or mappings, from the source ontology to the core ontology provide a route through which information from the data sources is mapped. The information within the core ontology can then be queried and reasoned over. Relevant knowledge for a particular systems biology model or models can then be exported from the core ontology via a chosen source ontology and ultimately saved in a underlying data source format such as SBML. In the upper left corner, the names of the core and source ontologies created or used for the research described in this thesis are listed.
• the data from syntactically and semantically different data sources are materialised within a core ontology for easy access and stable querying;
• the core ontology is a semantically-meaningful model of the biological domain of interest; and • implementation can be performed using off-the-shelf tools and mapping languages.
Integrating resources in this way accommodates multiple formats with different semantics and pro- vides richly-modelled biological knowledge suitable for annotation of systems biology models. This work establishes the feasibility of rule-based mediation as part of an automated systems biology model annotation system.
Section 5.1 describes the type of integration performed in rule-based mediation as compared with existing data integration methodologies. The first step in rule-based mediation is the syntactic con- version of heterogeneous data sources into ontologies, as detailed in Section 5.2. These source ontologies are then aligned to a small core ontology by applying a rule base (Section 5.3). This core ontology is a model of the biological domain of interest which can be queried and reasoned
over (Section 5.4). As demonstrated in Section 5.5 with two use cases, selected results from the core domain ontology can then be applied to systems biology models as a variety of types of bio- logical annotation. Section 5.6describes possible future directions and applications for rule-based mediation, and Section5.7provides a final discussion and summary of the methodology.
Availability
The rule-based mediation website ishttp://cisban-silico.cs.ncl.ac.uk/RBM/, and the on- tologies and software toolset is available from Subversion1. The ontologies stored within the rule- based mediation project are licensed by Allyson Lister and Newcastle University under a Creative Commons Attribution 2.0 UK: England & Wales License2. The rule-based mediation software toolset is covered under the LGPL3. For more information, see LICENSE.txt in the Subversion project. TheSWRLrules described in this chapter can be run programmatically or via the Protégé 3.4 editor4. Programmatic access requires the rule-based mediation software toolset, which in turn requires the OWL API. The SWRLTab plugin within Protégé 3.4 requires the installation of the Jess Rule Engine5 for Java. All mapping rules are available for viewing within swrl-mappings.owl in the Subver- sion distribution. Instructions for viewing the telomere ontology, the source ontologies, theSWRL
mappings, Semantic Query-Enhanced Web Rule Language (SQWRL) queries, and theSBMLmodel before and after annotation in these use cases are available in the Subversion repository6.