This chapter concludes the thesis. Section 6.1 summarizes the main contributions of this work, and Section 6.2 discusses their limitations.
6.1. Summary
This thesis provided several contributions for supporting users from the medical and bioinformatics domains in integrating data mining into scientific data analysis processes. The following summarizes the main contributions, which answer the research questions formulated in Section 1.2:
Q1: What is a suitable integration mechanism that allows users with a lot of domain knowledge but without knowledge on grid systems to integrate data mining components that have been developed in single computer environments into distributed grid-based analysis environments used for bioinformatics scenarios?
In Chapter 3 we presented an approach to support users in integrating already available data mining components into distributed grid environments. The approach describes data mining components that have been developed for single computer environments with the help of a predefined XML schema (the Application Description Schema), so that, in addition to the already available executable file of the component, only the metadata description is needed for the integration into a distributed grid environment. This simplifies the integration procedure for users. The application descriptions are used for interacting with core services of a grid system to register and search for available data mining components on the grid, to match analysis jobs with suitable computational resources, and to dynamically create user interfaces. The presented approach allows users to integrate data mining components without deeper knowledge of the underlying grid technology and without intervention on the component side. We have shown that it is possible to cover all information necessary for the execution of data mining components in OGSA-based grid environments with a single XML schema and that it is possible to create a technical system for the execution of data mining components based on data exchange via the XML schema. Our approach allows for a website-based procedure for grid-enabling data mining components.
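The role of such a metadata description can be illustrated with a minimal sketch. It assumes a small, made-up XML document: the element and attribute names below (`application`, `executable`, `requirements`, `option`) and the resource list are hypothetical and are not taken from the actual Application Description Schema. The sketch only shows the general idea that one description can drive both the matching of a job to resources and the listing of options for a generated user interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical application description. All element and attribute names are
# illustrative assumptions, NOT the actual Application Description Schema.
DESCRIPTION = """
<application name="example-classifier" version="1.0">
  <executable file="classifier.jar" runtime="java"/>
  <requirements minMemoryMB="2048" os="linux"/>
  <option name="inputFile" type="file" required="true"/>
  <option name="folds" type="int" default="10"/>
</application>
"""


def parse_description(xml_text):
    """Extract the metadata needed to run the component from the XML text."""
    root = ET.fromstring(xml_text)
    req = root.find("requirements")
    return {
        "name": root.get("name"),
        "executable": root.find("executable").get("file"),
        "min_memory_mb": int(req.get("minMemoryMB")),
        "os": req.get("os"),
        # The option list could be used to generate input fields in a UI.
        "options": [dict(o.attrib) for o in root.findall("option")],
    }


def matching_resources(app, resources):
    """Return the hosts that satisfy the stated requirements."""
    return [
        r["host"]
        for r in resources
        if r["os"] == app["os"] and r["memory_mb"] >= app["min_memory_mb"]
    ]


app = parse_description(DESCRIPTION)

# Hypothetical resource catalogue standing in for a grid information service.
resources = [
    {"host": "node-a", "os": "linux", "memory_mb": 4096},
    {"host": "node-b", "os": "linux", "memory_mb": 1024},
    {"host": "node-c", "os": "windows", "memory_mb": 8192},
]

print(matching_resources(app, resources))
```

In this sketch the component itself is never modified; only the description is read, which mirrors the point that no intervention on the component side is needed.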
We validated our approach in several case studies in the context of the DataMiningGrid project. The first case study showed that the architecture of our approach can be implemented in the context of the DataMiningGrid project. The second case study demonstrated that it is possible to create a user interface for grid-enabling data mining components based on the Application Description Schema that is easy to use for users who do not have knowledge on grid technology. In the third case study we showed that our approach is applicable for the integration of standard data mining components into grid environments by grid-enabling algorithms of the Weka data mining toolkit. The fourth case study demonstrated that our approach also supports the data mining scenarios Data Partitioning, Classifier Comparison and Parameter Variation, which are based on grid-enabled components.
Q2: Is it possible to create a technical system that allows scientific users to interactively develop data mining scripts consisting of one or more data mining components for bioinformatics scenarios in distributed grid-based analysis environments?
In addition to reusing existing data mining components from single computer environments, scenarios in bioinformatics demand that new data mining components be developed, and developed further, within distributed analysis environments. In Chapter 4, this thesis presented an approach for interactive development of data mining scripts in the context of bioinformatics scenarios in grid environments. The approach is implemented in the GridR toolkit, consisting of the GridR service and the GridR client. The GridR service is a single grid service with complex inputs and outputs that accepts scripts as a parameter instead of requiring each script to be registered as a separate component in the grid. It thereby allows for developing and executing data mining scripts based on the scripting language R in grid environments without the need for developing, describing and deploying atomic components, thus making the development process more efficient. In addition, the GridR client allows for interactively developing data mining scripts in grid environments, so that users can use a well-known tool as the user interface for the grid environment.
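The difference between registering each script as a component and passing the script as a parameter can be sketched as follows. This is a toy illustration under stated assumptions, not the GridR implementation: Python stands in for R, the function name `run_script_service` is hypothetical, and the in-process `exec` call merely stands in for the submission of a batch job to the grid.

```python
def run_script_service(script_source, inputs):
    """Single generic entry point: the script itself is an input parameter.

    Assumption: a real service would ship the script and its inputs to a
    grid resource as a batch job. Here the script is simply executed in an
    isolated namespace, which is NOT safe for untrusted code.
    """
    namespace = {"inputs": dict(inputs)}
    exec(script_source, namespace)
    # By convention in this sketch, a script stores its output in `result`.
    return namespace.get("result")


# A new analysis script can be run without any registration step.
script = """
values = inputs["values"]
result = sum(values) / len(values)
"""

print(run_script_service(script, {"values": [2.0, 4.0, 6.0]}))
```

Because the service interface stays fixed while the script varies, changing the analysis only means sending different script text, which is the property that shortens the development cycle.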
The presented approach was evaluated in two case studies that demonstrated the applicability of the approach in different application scenarios in the context of the ACGT project. In detail, we demonstrated a data analysis scenario from bioinformatics implemented with GridR and a scenario from an industrial application that is parallelized using GridR.
Q3: Can we define a description for data mining processes that allows for the reuse of existing data mining processes based on data mining components and scripts?
As today’s analysis processes in bioinformatics scenarios include complex process chains, and as collaboration increases, the reuse of processes and components becomes more important. In Chapter 5 we presented an approach for supporting the reuse of data mining based data analysis processes based on describing the steps of these processes at different levels of abstraction. The basic idea is to abstract tasks from a concrete executable workflow to create a reusable process pattern. This process pattern can then be reused by specializing it to an executable workflow for another problem. Our approach is based on CRISP and includes the definition of data mining process patterns, a hierarchy of tasks to guide the specialization of abstract process patterns to concrete processes, and a meta-process for applying process patterns to new problems. The data mining process patterns allow for the description of process tasks and requirements based on the task hierarchy. Such process patterns provide a flexible representation for different levels of generality of tasks in the analysis process and allow for describing processes between the CRISP process as the most abstract process and executable workflows as the most concrete processes. The meta-process guides the user when applying a process pattern and includes the steps for the specialization of generic tasks to executable tasks.
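The idea of specializing a process pattern step by step until it becomes an executable workflow can be sketched with a small data structure. This is a simplified illustration under stated assumptions: the `Task` class, the two-state distinction between abstract and executable tasks, and all task and file names are hypothetical, and the sketch does not reproduce the task hierarchy defined in Chapter 5.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Task:
    name: str
    # None means the task is still abstract and must be specialized.
    implementation: Optional[str] = None

    @property
    def executable(self):
        return self.implementation is not None


def specialize(pattern, bindings):
    """Return a copy of the pattern with abstract tasks bound to implementations."""
    return [
        replace(t, implementation=bindings[t.name])
        if not t.executable and t.name in bindings
        else t
        for t in pattern
    ]


def is_workflow(pattern):
    """A pattern is an executable workflow once every task is executable."""
    return all(t.executable for t in pattern)


# Hypothetical pattern mixing abstract and already executable tasks.
pattern = [
    Task("load data"),
    Task("normalize", implementation="normalize.R"),
    Task("train classifier"),
]

partial = specialize(pattern, {"load data": "read_trial_csv.R"})
workflow = specialize(partial, {"train classifier": "svm_train.R"})

print(is_workflow(pattern), is_workflow(partial), is_workflow(workflow))
```

The intermediate `partial` value corresponds to a pattern that is more specific than the original but not yet executable, which is the kind of in-between description that the process patterns are meant to capture.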
Our approach was evaluated in three case studies in the context of the projects ACGT, p-medicine and iWebCare. In the first case study we presented how to create and apply data mining process patterns in the context of a clinical trial scenario. It was shown that it is possible to create a process pattern by abstracting executable tasks of a script and to apply this pattern by specializing it into a workflow including a manual task. In the second case study we transformed a data mining process pattern of a multi-center-multi-platform scenario into a process pattern for describing the abstract process of meta analysis in bioinformatics. This showed that it is possible to describe abstract processes consisting of conceptual tasks. In the third case study we described how data mining process patterns can be integrated into business processes in a fraud detection scenario from the health care domain. We demonstrated how a data mining process pattern can be created based on information from a data mining paper and how to integrate this pattern into business processes. Furthermore, we showed how our approach can be evaluated according to best practices in business process redesign.
6.2. Discussion
In this thesis we tackled the research question of how to support users in the bioinformatics and medical domain in data mining based data analysis in the context of heterogeneous settings, including heterogeneous user groups, heterogeneous data sources and heterogeneous computing environments. In summary, we developed tools and methods that facilitate the use and reuse of data mining components, scripts and processes in scenarios from this domain to answer the question stated above. Through our contributions, the development of new scenarios becomes more efficient, as these can be founded on existing components, scripts and processes more easily.
It has been shown that the presented tools and methods help bioinformaticians in their work. However, although we delivered important building blocks to address the problem of supporting users as described above, there remain open issues for future work.
The challenge of heterogeneous groups of users has been addressed by providing a method for reusing data mining components that is flexible enough to support a variety of tools used by the different users. By focussing on OGSA-based grid environments, which allow building secure distributed systems, the challenges of users in different locations, multi-computer environments and distributed data sources have been addressed. The solutions presented in Chapters 3 and 4 are focussed on OGSA-based grid computing environments. The Application Description Schema including the associated services as well as the architecture of the GridR service are based on the batch job processing functionality that such environments currently provide.
However, the tools used and the distributed environments will be further developed, and new architectures and infrastructures will emerge. Thus, if these environments evolve in future, our approaches might no longer directly match. Today, bioinformaticians are working with a huge set of different tools, ranging from data mining toolkits such as Weka or RapidMiner and scripting environments such as R up to workflow environments such as Taverna or Galaxy and their workflow sharing mechanisms, and they will probably work with many different analysis tools and process environments in future. Thus, the integration of such tools with up-to-date distributed environments, including both the reuse of the functionality of the tools and the reuse of their user interfaces, and the reuse of the solutions developed with these tools will remain a problem.
The direct integration of scalable distributed environments into data mining toolkits, as described for R and grid computing environments in the context of the GridR client in Chapter 4, can also be useful for other environments. For example, in [144] we investigated the integration of Weka and Hadoop, a system for cluster and cloud computing environments.
The challenge of complex process chains has been addressed by providing a method for supporting the reuse of processes based on different levels of abstraction. The data mining process pattern approach presented in Chapter 5 provides a step in the direction of describing processes for reuse from the perspective of the bioinformatician and the steps that they need to perform to reuse a process. However, there is a need for a tool that guides the user through the process of reuse and that can be attached to different process environments. In addition, there is no support for deciding whether a process pattern is a good process pattern for a given data mining problem. The question remains of how the quality of a process pattern can be determined and how to select a process pattern from a process pattern database. Furthermore, it has to be analysed whether the hierarchy of tasks is adequate or whether it would make sense to combine the current user-focussed view on the tasks that the user needs to perform with additional structures for domain-oriented tasks that could be provided by the use of a task ontology. Finally, a detailed user study on our approach for reusing processes would be beneficial to provide more evidence for its usefulness.
In summary, we provided a solution that builds on existing technology and comprises methods that allow for the reuse of existing data mining components in grid environments, for the interactive development of data mining scripts consisting of one or more components in a grid environment, and for the reuse of processes including data mining components and scripts.