Ministerio de Gobierno - Ministerio de Educación

2.4. DataMiningGrid, ACGT and p-medicine

Parts of the results of this thesis were achieved in the context of the European research projects DataMiningGrid [35], ACGT [34] and p-medicine [97]. These projects included the development of grid- and workflow-based environments supporting data mining. In the following, the projects will be introduced briefly. The approaches which will be presented in the Chapters 3, 4 and 5 are implemented and evaluated in the context of the analysis environments of these projects.

2.4.1. The DataMiningGrid Project

The DataMiningGrid project [35], which was supported by the European Commission under FP6 grant No. IST-2004-004475, was a large-scale effort which aimed at developing a system that brings WSRF-compliant grid computing technology to users and developers of data mining applications. The output of the DataMiningGrid project is the DataMin- ingGrid system.

A challenge of DataMiningGrid was to develop an environment suitable for executing data analysis and knowledge discovery tasks in a wide range of different application sectors, including the automotive, biological and medical, environmental and ICT sectors [123]. Based on a detailed analysis of the diverse requirements of these applications, generic technology for grid-based data mining was developed in the DataMiningGrid project.

Existing grid technology already provided a diverse set of generic tools, but due to the generality the available functionality partially lacked to specifically support advanced data mining use-cases. In the DataMiningGrid project, enhancements to open source grid middleware were developed in order to provide the specialised data mining functionality required by the use-cases. This included functionality for tasks such as component discovery, data manipulation, resource brokering, and component execution. The result is a grid-based system including all the generic functionality of its middleware and additional features that support the development and execution of complex data mining scenarios.

The contribution of Chapter 3 is implemented in the DataMiningGrid system. The technical architecture of the DataMiningGrid system will be described in more details in Section 3.4.1.

2.4.2. The ACGT Project

The Advancing Clinico-Genomic Clinical Trials on Cancer (ACGT) project [34], which was partially funded by the European Commission under the project identifier FP6-2005- IST-026996, aimed at providing an open environment for supporting clinical trials and related research through the use of grid-enabled tools and infrastructure.

With the accelerating development of high-throughput technologies in the domain of biomedical research and of their use in the context of clinical trials, hospitals and clinical research centres are facing new needs in terms of data storage and analysis [149]. E.g., in the context of microarray analysis of a tumour biopsy the data for a single patientincludes 10’000s to 100’000s of gene-expression values summarizing up-to millions of microarray features [150]. New technologies based on imaging, genome sequencing and proteomics

are pushing even further the needs for data processing in the clinical research. The ACGT project aimed at addressing the analysis of such complex data sets by developing appropri- ate data exploitation approaches, integrating the know-how acquired in many independent fields into a powerful environment that physicians can easily and safely use, for the bene- fit of the patients [149]. Several initiatives with a similar goal started worldwide, among which NCI’s caBIG (Cancer Biomedical Informatics Grid) [25] in the USA and CancerGrid in the UK [134].

The ACGT project aimed at providing an IT infrastructure supporting the manage- ment of clinical trials (e.g. patient follow-up), as well as the data mining involved in the translational research that often occurs in parallel. It is in relation to the latter aspect of the project that the need for a high-performance environment supporting the R language (see Section 2.1.3) and the vast collection of already existing biostatistics algorithms was recognized. The contribution of Chapter 4 is implemented in the ACGT system. The technical architecture of the ACGT system will be described in more details in Section 4.7. Furthermore, parts of the case study in Chapter 5 are based on the ACGT project

In ACGT, the problem of describing the content and semantics of heterogeneous data was addressed in the following way (summarized from [140]): The semantic mediation layer included in ACGT is designed to allow end users and client applications performing integrated queries against a set of heterogeneous biomedical databases. The ACGT Mas- ter Ontology was specifically designed within the ACGT environment to serve as unified schema of this mediation layer [23]. This ontology acts as database schema for performing integrated queries to a set of underlying databases. The main component of the layer - the semantic mediator - accepts queries in SPARQL [5], thus sticking to the most widespread standard for querying of resources in the web. Employing a self-designed mapping architecture, the mediator is able to detect which data sources are needed to solve a query and generate the queries for such sources. At this level, the mediator delegates on the database access wrappers, which hide the syntactic peculiarities of each database by of- fering SPARQL access to all of them. The mediator is then able to collect all sub-results and combine them in a single result-set that is returned back to the user.

The ACGT Query Editor is a web-based graphical tool designed to allow simple and intuitive access to the semantic mediator. SPARQL is the most widespread query language for RDF-based databases, but requires certain degree of technical knowledge to be able to construct queries with it. The aim of the ACGT semantic mediation layer was to offer integrated querying services to clinicians and biostatisticians who would presumably lack advanced technical background. The goal of the Query Editor was to allow such users to take advantage of the querying capabilities. The tool guides the user in the construction of SPARQL queries. The user simply needs to click on the elements which he wants to retrieve, and then select which restrictions (if any) should be included.

By the use of an ontology to describe the underlying data, differences in terminology, data representation and structure could thus be (to a certain degree) easily tackled. In addition, the use of an ontology offers additional interesting features. The mediator takes advantage of the hierarchical organization of the information in the ontology to offer functionalities such as testing if a query has a translation to a specific dataset. The Query Editor also manages the information of the ontology so that the user can build more complex queries with ease. For example, the Query Editor is able to generate

In document Boletín Oficial. Gobierno de la Ciudad Autónoma de Buenos Aires (página 145-151)