become the solutions deployed and provide a simple communication tool between end users, business analysts, developers and the management. Workflow patterns provide a proven and simple technique to shorten the learning curve and improve the productivity and quality of the designed processes, as they are simple to understand, learn and apply immediately.

Given that these powerful BPM environments and the CRISP model exist, one could assume that it is straightforward to reuse data mining processes efficiently. In practice, however, many redundancies and inefficiencies still exist [146].

In [83] the authors outline how a system can be built that supports users in the design of data mining workflows out of distributed services for data understanding, data integration, data preparation, data mining, evaluation and deployment. The support they aim at includes checking the correctness of workflows, workflow completion as well as storage, retrieval, adaptation and repair of previous workflows. The authors present a data mining ontology (DMO), in which all services including their inputs, outputs, preconditions and postconditions are described. They propose to build a support system based on the DMO, which would also allow for meta-learning for algorithm selection [69]. Based on the ontology from [69], workflow templates have been defined [84] that can mix executable tasks and tasks that need to be refined into sub-workflows. The workflow templates contain only tasks that are described by concepts at the upper level of the ontology. Such ontologies are a good way of describing the tasks and components that are part of workflows, but they do not cover the manual steps and actions that need to be performed to abstract and specialize tasks. The workflow templates are useful for describing automated workflows and allow, in combination with the ontology, for precondition checks for individual tasks, but do not cover manual tasks that need to be performed for executing the process.
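The idea of checking a workflow against the preconditions and postconditions of its tasks can be illustrated with a minimal sketch. It is not the DMO or the system from [83]: the `Task` class, the `check_workflow` function and all task and fact names are hypothetical, and the workflow is assumed to be a simple linear sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A workflow step described by the facts it needs and the facts it produces."""
    name: str
    preconditions: frozenset = frozenset()
    postconditions: frozenset = frozenset()


def check_workflow(tasks, initial_facts):
    """Walk a linear workflow and report tasks whose preconditions are unmet.

    Returns a list of (task name, sorted missing preconditions) pairs;
    an empty list means every precondition was satisfied.
    """
    facts = set(initial_facts)
    problems = []
    for task in tasks:
        missing = task.preconditions - facts
        if missing:
            problems.append((task.name, sorted(missing)))
        # Assume the postconditions hold afterwards so later checks can continue.
        facts |= task.postconditions
    return problems


# Hypothetical tasks; names and facts are illustrative only.
clean = Task("clean_data", frozenset({"raw_data"}), frozenset({"clean_data"}))
build = Task("build_model", frozenset({"clean_data"}), frozenset({"model"}))
evaluate = Task("evaluate_model", frozenset({"model"}), frozenset({"evaluation"}))

valid = check_workflow([clean, build, evaluate], {"raw_data"})
broken = check_workflow([build, evaluate], {"raw_data"})
```

Here `valid` is empty, while `broken` reports that `build_model` lacks `clean_data`. Manual steps, which the ontology-based approach does not cover, have no representation in such a check either.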

Enabling the reuse of existing solutions for similar scenarios has the potential to make the development of analysis processes much more efficient. In principle, processes are reusable in different scenarios just by changing certain components. However, it is not obvious how exactly to do this, as this knowledge is typically not formalized. Thus, data mining can only be reused efficiently and successfully in this context if the user is supported during the task of constructing and reusing processes. Data mining based analysis processes in bioinformatics need to be described in a meaningful way. To do so, it is necessary to know which characteristics and parts of the process have to be described and which do not.

5.4. Analysis of the CRISP Model for Reuse

In the following, we describe the CRISP phases and tasks in detail and present how they differ when reusing existing solutions compared to executing CRISP from scratch. Depending on the aims of the tasks of the individual CRISP phases, the tasks are either considered to be part of a data mining process pattern if they are reusable, considered to be part of the meta-process if they are related to following the procedure of reusing a data mining process, or considered to be obsolete for reuse. Figure 5.2 visualizes the mapping of the CRISP tasks.

Figure 5.2.: Mapping CRISP tasks to data mining process patterns and the meta-process.

In the following, we give details on the generic CRISP tasks (based on [30]) and their mapping to the process patterns and the meta-process:

• Phase Business Understanding

This phase focuses on understanding the project objectives and requirements from a business perspective, converting this knowledge into a data mining problem definition and a preliminary plan to achieve the objectives.

– Determine Business Objectives: The task Determine Business Objectives is a general task that sets the goal of the overall process. Business Objective means here answering the research question of the bioinformatics scenario. We arrange this task at the start of the meta-process, as it provides the information needed for the choice of a process pattern.

– Assess Situation: The task Assess Situation involves the set-up of an inventory of resources, a collection of requirements, assumptions, etc. In our scenario, this task does not apply as the important information is already available through the existing process.

– Determine Data Mining Goals: We transform the task Determine Data Mining Goals into a task that checks whether the data mining goal matches and arrange it at the beginning of the data mining process pattern, as the data mining goal is already specified in a data mining pattern.

– Produce Project Plan: The task Produce Project Plan is outside of the scope, as the project plan follows the procedure for reuse.

• Phase Data Understanding

This phase is based on an initial data collection and includes activities in order to get familiar with the data, to identify data quality problems and to discover first insights into the data. The description of data and data requirements will be discussed in detail later in Section 5.6.

– Collect Initial Data, Describe Data and Explore Data: The tasks Collect Initial Data, Describe Data and Explore Data are considered obsolete, as we assume the data to be available through the modelled analysis process.

– Verify Data Quality: The task Verify Data Quality is mapped to a task at the pattern level.

• Phase Data Preparation

This phase includes the activities to construct the final dataset from the initial raw data, which can then be fed into the modelling tools. Data preparation tasks include, e.g., table, record and attribute selection as well as transformation and cleaning of data. They are likely to be performed multiple times and not in any prescribed order.

– Select Data, Clean Data, Construct Data, Integrate Data, Format Data: These tasks are all preprocessing tasks at the pattern level.

• Phase Modeling

This phase deals with selecting and applying various modelling techniques including the calibration of their parameters to optimal values. Typically, there exist several techniques for the same data mining problem with different specific requirements on the form of the data. Therefore, stepping back to the data preparation phase is often necessary.

– Select Modelling Technique, Generate Test Design, Build Model, Assess Model: These tasks are part of the patterns.

• Phase Evaluation

This phase involves evaluating, from the business perspective, the outcome of the modeling phase, i.e. the built models that appear to have high quality from a data analysis perspective. It leads to a decision on the use of the data mining results.

– Evaluate Results: The task Evaluate Results involves a matching with the business objectives. Thus, we arrange this task at the level of the meta-process.

– Review Process: The task Review Process is implicitly contained in loops of the meta-process (changing the specification of tasks of a data mining pattern or choosing another pattern).

– Determine Next Steps: The task Determine Next Steps does not apply as the next steps are defined by the meta-process.

• Phase Deployment

This phase deals with organizing and presenting the knowledge gained in a way that the customer can use it, e.g. by applying models within an organization’s decision-making process. In addition to the data mining expert, the customer or end user responsible for the higher processes is involved in this phase.

– Plan Deployment: The planning of the deployment by the task Plan Deployment does not apply, as in our context the deployment is always an executable process. Thus, we transform it into a task for deploying the process which is arranged at the level of the meta-process.

– Plan Monitoring and Maintenance: The task Plan Monitoring and Maintenance does not apply either, as monitoring and maintenance are handled by the process environments.

– Produce Final Report, Review Project: The tasks Produce Final Report and Review Project are outside of the scope, as we are not interested in this kind of deployment.

Figure 5.3 visualizes the mapping of the generic CRISP tasks.

Figure 5.3.: Mapping of the CRISP tasks. White tasks are part of patterns, blue tasks are mapped to the meta-process, red tasks are obsolete.
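The mapping described in this section can also be written down as a small lookup table, which makes it easy to list or count the tasks per target. The following is only a sketch: the names `Target`, `CRISP_MAPPING`, `tasks_for` and `count_by_target` are hypothetical. As an assumption, tasks described above as "outside of the scope" or as "does not apply" are grouped under obsolete, and Review Process, which is only implicitly contained in the loops of the meta-process, is grouped under the meta-process.

```python
from collections import Counter
from enum import Enum


class Target(Enum):
    PATTERN = "process pattern"
    META = "meta-process"
    OBSOLETE = "obsolete"


# Phase -> task -> target, following the task descriptions in this section.
CRISP_MAPPING = {
    "Business Understanding": {
        "Determine Business Objectives": Target.META,
        "Assess Situation": Target.OBSOLETE,
        "Determine Data Mining Goals": Target.PATTERN,
        "Produce Project Plan": Target.OBSOLETE,  # assumption: out of scope
    },
    "Data Understanding": {
        "Collect Initial Data": Target.OBSOLETE,
        "Describe Data": Target.OBSOLETE,
        "Explore Data": Target.OBSOLETE,
        "Verify Data Quality": Target.PATTERN,
    },
    "Data Preparation": {
        "Select Data": Target.PATTERN,
        "Clean Data": Target.PATTERN,
        "Construct Data": Target.PATTERN,
        "Integrate Data": Target.PATTERN,
        "Format Data": Target.PATTERN,
    },
    "Modeling": {
        "Select Modelling Technique": Target.PATTERN,
        "Generate Test Design": Target.PATTERN,
        "Build Model": Target.PATTERN,
        "Assess Model": Target.PATTERN,
    },
    "Evaluation": {
        "Evaluate Results": Target.META,
        "Review Process": Target.META,  # assumption: implicit in the loops
        "Determine Next Steps": Target.OBSOLETE,
    },
    "Deployment": {
        "Plan Deployment": Target.META,
        "Plan Monitoring and Maintenance": Target.OBSOLETE,
        "Produce Final Report": Target.OBSOLETE,  # assumption: out of scope
        "Review Project": Target.OBSOLETE,  # assumption: out of scope
    },
}


def tasks_for(target):
    """Return all task names mapped to the given target, in phase order."""
    return [
        task
        for tasks in CRISP_MAPPING.values()
        for task, mapped in tasks.items()
        if mapped is target
    ]


def count_by_target():
    """Count how many generic CRISP tasks fall under each target."""
    return Counter(
        mapped for tasks in CRISP_MAPPING.values() for mapped in tasks.values()
    )


counts = count_by_target()
meta_tasks = tasks_for(Target.META)
```

Under these assumptions, `meta_tasks` contains Determine Business Objectives, Evaluate Results, Review Process and Plan Deployment, and the remaining tasks are split between the patterns and the obsolete group.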