
We have discussed the topic of knowledge discovery in inductive databases. We provided an overview of the technology as well as definitions of the main concepts and paradigms. We also introduced an architecture for the implementation of an inductive database that is suited for computing and querying both locally and remotely, and discussed its various components. More importantly, we argued that services are well suited to this type of knowledge discovery. We have also described how a query would be handled by the architecture, and which optimizations patterns could bring to the knowledge discovery process.

In the presented use case we illustrated that constraint-based data mining in inductive databases can have a significant impact on performance. By using statistical methods to transform patterns into constraints, the number of new patterns generated in the experiments could be reduced by 90% or more. Measured in time, this saving is small for the relatively small UCI datasets, but for bio-informatics and the life sciences, where data sets are usually gigabytes in size or larger, the impact could be significant.

We need to test our architecture extensively with various data in order to ensure that it is optimized for extensibility and efficiency, two traits that can conflict in an architecture. Furthermore, much research still needs to be done on pattern representation and inductive query languages in areas other than frequent pattern mining.

Another area that needs to be researched thoroughly is distributed data mining within inductive databases, as well as the usage of web services within the framework. Using web services allows for easy and large-scale parallelization, but executing a query on a remote site incurs some overhead, which might prove to be a burden if the query is small. Therefore, research must be done on how to represent metrics for queries and data mining primitives, which could help the query optimizer choose between local and remote execution of queries.
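As an illustration only, the trade-off between local and remote execution can be sketched as a simple cost comparison. The metric names (`local_seconds`, `remote_seconds`, `payload_bytes`), the fixed call overhead, and the bandwidth figure are all hypothetical assumptions made for this sketch; they are not part of the architecture described here, and a real optimizer would need measured or learned estimates.

```python
from dataclasses import dataclass


@dataclass
class QueryEstimate:
    """Hypothetical per-query metrics (assumed for illustration only)."""
    local_seconds: float   # estimated run time on the local machine
    remote_seconds: float  # estimated compute time at the remote site
    payload_bytes: int     # data shipped to and from the remote site


def choose_site(est: QueryEstimate,
                call_overhead_seconds: float = 0.5,
                bandwidth_bytes_per_second: float = 10e6) -> str:
    """Return "remote" only if the total remote cost beats local execution."""
    transfer_seconds = est.payload_bytes / bandwidth_bytes_per_second
    remote_total = call_overhead_seconds + transfer_seconds + est.remote_seconds
    return "remote" if remote_total < est.local_seconds else "local"


# A small query: the fixed overhead dominates, so it stays local.
small = QueryEstimate(local_seconds=0.2, remote_seconds=0.05, payload_bytes=1_000)
# A large query: remote compute savings outweigh overhead and transfer.
large = QueryEstimate(local_seconds=120.0, remote_seconds=10.0, payload_bytes=50_000_000)

print(choose_site(small))  # local
print(choose_site(large))  # remote
```

Under these assumed numbers the small query is kept local because the call overhead alone exceeds its local run time, which mirrors the concern raised above about small queries.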

Chapter 4

Service-Oriented Knowledge Discovery

Due to advances in software engineering and architecture, as well as the increased popularity of scientific workflows, new ways of performing knowledge discovery experiments can be devised. In this chapter we investigate how the service-orientation paradigm and scientific workflows can improve knowledge discovery. We compare the non-service-oriented, constructed process model with the service-oriented, orchestrated process model, and point out the benefits of service-oriented technology in workflows. After that, we propose a model for the design of a service-oriented knowledge discovery process, and provide guidelines for individual knowledge discovery service design based on the types of functionality it requires. We also provide a use case design to show the application and benefits of the proposed model in practice.

4.1 Introduction

Although knowledge discovery (KD) in data has proven to be valuable in many scientific fields over the last few decades, one of its main drawbacks is that setting up a KD experiment is not a simple task. KD processes are usually very resource-intensive, requiring large amounts of memory to hold huge amounts of data. Furthermore, they usually need one or more processing units to transform or mine this data. KD processes often consist of several algorithms connected together, whereby data flows from one algorithm to another.

Commonly, a KD process is created as follows: a KD researcher either implements or obtains the required algorithms, connects their inputs and outputs, executes the process, and eventually gets a result as output. We perceive this situation as far from optimal, as it comes with quite a few problems and vulnerabilities. Assuming that not every scientist is a superb programmer with years of programming experience and education, implementations might suffer from errors and suboptimal performance. A similar argument holds for the connection of the algorithms, which is mostly done in an ad hoc way and usually not according to a standardized format or protocol.

Instead, we can consider the following scenario. Suppose a researcher wants to create a certain KD experiment involving several algorithms. The researcher has only some of these algorithms available on her own computer, knows that a few are available at a remote location, and knows that the rest either need to be created by her or might be found by looking for them on the internet. The ideal situation for the researcher would be to just use a search engine to look up the missing algorithms, use a tool to connect the algorithms together, and then execute the experiment.

The scenario presented above is not at all unrealistic. Due to advances in workflow management research [LLF+09, DGST09], experiments can now be designed in such a way that individual parts of the experiment can be easily connected to each other, often by using a simple graphical interface. Furthermore, the service-oriented (SO) paradigm allows for relatively simple and secure remote computing, and easy lookup of publicly available services.

In this chapter we investigate how the SO paradigm and related technologies can improve KD in scientific workflows. The SO paradigm allows users to design applications (in this context we will see a KD process as an application) in terms of individual components that can be connected to each other through standardized communication. These components can be either locally or remotely available, and can be found through public lookup facilities. We argue that combining SO with scientific workflows makes KD processes easier to build, faster, and easier to understand.
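To make the idea of components connected through a standardized interface more concrete, the following is a minimal in-process sketch. It is an assumption-laden illustration rather than the design proposed in this chapter: `ServiceRegistry`, `run_process`, and the two toy services are hypothetical names, the "lookup facility" is a plain dictionary, and every service is simply a local function with the same signature, whereas real services could equally be remote.

```python
from typing import Callable, Dict, List, Optional

# Every service shares one interface: a list of values in, a list of values out.
Record = Optional[float]
Service = Callable[[List[Record]], List[Record]]


class ServiceRegistry:
    """Hypothetical stand-in for a public service lookup facility."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def publish(self, name: str, service: Service) -> None:
        self._services[name] = service

    def lookup(self, name: str) -> Service:
        return self._services[name]


def drop_missing(values: List[Record]) -> List[Record]:
    """Toy preprocessing service: remove missing values."""
    return [v for v in values if v is not None]


def normalize(values: List[Record]) -> List[Record]:
    """Toy transformation service: rescale values to the range [0, 1]."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def run_process(registry: ServiceRegistry, steps: List[str],
                data: List[Record]) -> List[Record]:
    """Look up each named service and pass the data from one to the next."""
    for name in steps:
        data = registry.lookup(name)(data)
    return data


registry = ServiceRegistry()
registry.publish("drop_missing", drop_missing)
registry.publish("normalize", normalize)

result = run_process(registry, ["drop_missing", "normalize"], [2.0, None, 4.0, 6.0])
print(result)  # [0.0, 0.5, 1.0]
```

Because the process is described only by service names, a step could be swapped for a different implementation, local or remote, without changing `run_process`, as long as the shared interface is respected.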

Until recently, the focus and application domain of SO technology has mostly been the commercial sector and large-scale business applications. In this part we explore the benefits and drawbacks of SO applications in KD by conducting design and implementation case studies, focusing on the design of a KD service and of a KD process.

This chapter is organized as follows. In Section 2, we briefly discuss some recent work related to our research. In Section 3, we examine the two scenarios in more detail, discussing their differences, weaknesses and strengths. In Section 4, we present ideas on the design of service-oriented knowledge discovery (SOKD) services and processes; these serve as design patterns for the implementation of our use cases, which are discussed in Section 5. In Section 6, we present the experimental results of the use cases, and in Section 7 we draw a few conclusions and look at future work.
