
RDF stream processing in general, and one of its existing implementations (C-SPARQL) in particular, currently suffers from performance and scalability issues. These two aspects can leave our EXCLAIM framework unable to monitor and analyse large data volumes within CAPs, given that the framework targets real-world, enterprise-level CAPs, which can generate thousands or even millions of RDF triples per second. In our view, these amounts qualify as Big Data not only because of their volume, but also because of the velocity, variety and veracity of the datasets.

In these circumstances, a possible solution is to parallelise processing tasks across several instances of the framework by fragmenting incoming data streams into sub-streams, so that each instance deals only with a separate subset of incoming values. To demonstrate this and prove our concepts, we required an existing infrastructure that would allow us to implement such a parallel deployment with the least possible refactoring and reconfiguration effort. Given this, we were motivated to utilise an existing technological solution from the Big Data processing domain that would exhibit the following characteristics:

• Minimal effort to integrate with our framework

• Support for processing streamed data

• Support for data stream fragmentation and task parallelisation

• Enough capacity to address the Big Data challenges of CAPs

As a result, IBM Streams was chosen, among other alternatives, as the target platform for implementing a parallelised deployment of the EXCLAIM framework. A detailed description of this platform for processing streaming Big Data can be found in Section 3.1.1.

In the rest of this chapter, we demonstrate and explain the parallelised deployment of the framework on top of IBM Streams, using the same use case scenario that was presented in the previous chapter. We first explain the fragmentation logic that we applied to partition RDF data on the stream, and then continue with a number of experiments. To a great extent, these experiments are similar to the ones explained in the previous chapter, and primarily serve to demonstrate an increase in performance when running the parallel deployment of the framework as opposed to the initial, pipelined deployment.

We also want to bring to the reader’s attention that the primary goal of demonstrating the IBM Streams deployment is to show that the emerging Big Data issue can be successfully addressed and that, to this end, the EXCLAIM framework can be extended and integrated with existing solutions. Even though the experimental results show an increase in performance, our goal here is not to design and implement a novel, efficient algorithm for data stream fragmentation and parallel processing. Rather, our intention is to demonstrate that it is possible in principle.

7.3.1 Stream parallelisation

So far, we have focused on one particular service connected to the user’s application, namely the PostgreSQL data storage service. To demonstrate the potential of applying Big Data parallelisation techniques, we now take into account all five add-on services to which Destinator is connected.

To demonstrate the benefits, we apply a simple fragmentation logic: the main stream, which carries RDF triples from different add-on services, is separated into several sub-streams, so that each sub-stream contains only RDF triples originating from one particular service. We then launch several identical instances of the EXCLAIM framework and attach them to the resulting sub-streams. The aim is a distributed architecture with several replicated frameworks, each of which deals only with data coming from a single add-on service.

With IBM Streams, which allows developing custom Java operators, we had to implement three main operators, which are depicted in the screenshot below (see Figure 7.3).

Figure 7.3: Parallel architecture with the main RDF stream split into five separate sub-streams, each of which is processed independently by several instances of the EXCLAIM framework.

• SourceOp: this operator acts as an entry point for RDF triples to be processed by the EXCLAIM framework. It is a data source operator, which picks up monitored RDF triples from the RabbitMQ monitoring queue, and then forwards them to the splitter operator.

• SplitterOp: this operator is responsible for the actual fragmentation of incoming data. Depending on the add-on service generating monitored values, the splitter directs data to a corresponding output port. Since in the considered case study Destinator is coupled with 5 add-on services, the splitter has 5 respective output ports.

• Exclaim_{1-5}: this operator is essentially a Java wrapper for the EXCLAIM framework. Since it is called and executed from within the Streams platform, it has neither a GUI nor a management console; therefore, all the configurable parameters are predefined. In the considered use case, there exist 5 identical instances of the Exclaim operator, each of which is attached to one of the output ports of the splitter. Thus, we achieve an architecture in which each instance of the EXCLAIM framework (i.e., Exclaim_{1-5}) deals only with RDF triples belonging to a particular add-on service. Simply put, we reduce the number of incoming triples on each sub-stream, and thus achieve better performance.
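The routing performed by the splitter can be illustrated with a minimal, plain-Java sketch. It deliberately does not use the IBM Streams operator API, and it is not the implementation described above: the class names `Triple` and `SplitterSketch`, the example IRIs, and the convention that the service name is the second-to-last path segment of the subject IRI are all assumptions made for illustration. Output ports are modelled as in-memory lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal RDF triple; plain strings stand in for IRIs and literals. */
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

/** Hypothetical splitter: one output "port" (a list here) per add-on service. */
final class SplitterSketch {
    private final Map<String, List<Triple>> ports = new LinkedHashMap<>();
    private final List<Triple> unrouted = new ArrayList<>();

    SplitterSketch(List<String> services) {
        for (String service : services) {
            ports.put(service, new ArrayList<>());
        }
    }

    /** Assumed convention: the subject IRI ends in "/<service>/<resource>". */
    static String serviceOf(Triple t) {
        String[] parts = t.subject.split("/");
        return parts.length >= 2 ? parts[parts.length - 2] : "";
    }

    /** Sends the triple to its service's port, or to a catch-all list. */
    void route(Triple t) {
        List<Triple> port = ports.get(serviceOf(t));
        if (port != null) {
            port.add(t);
        } else {
            unrouted.add(t);
        }
    }

    List<Triple> port(String service) {
        return ports.get(service);
    }

    List<Triple> unrouted() {
        return unrouted;
    }
}
```

In this sketch, a splitter constructed with the keys `"postgresql"` and `"pgbackups"` would place a triple whose subject is `http://example.org/postgresql/conn1` on the first port, while triples from an unrecognised source are kept aside rather than silently dropped. In a real deployment, each port would feed one instance of the Exclaim operator.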

To a certain extent, the splitter operator performs routing and filtering functions, and thus can be seen as an implementation of a routing node, as found in physical sensor networks. Such an intermediate node in the context of the EXCLAIM framework is responsible for:

• transferring monitored values from physical (e.g., server, data centre), virtual (e.g., application container, virtual machine), or logical (e.g., application system, database) components of the monitored platform to a corresponding processing location;

• performing initial processing of the incoming values: by filtering and aggregating monitored values, it is possible to offload some of the computational tasks from the central monitoring component (which may otherwise become a bottleneck of the whole system), and thus make the whole framework more scalable.
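The offloading idea in the second point can be illustrated with a small aggregation sketch. It is only one possible form of pre-processing and is not taken from the EXCLAIM implementation: the class name `TumblingAverage` and the choice of a fixed-size, non-overlapping window are assumptions. The node forwards one averaged value per window instead of every raw reading, so the central component receives fewer values.

```java
import java.util.OptionalDouble;

/** Hypothetical pre-aggregator: emits one mean per window of `size` readings. */
final class TumblingAverage {
    private final int size;
    private double sum;
    private int count;

    TumblingAverage(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        this.size = size;
    }

    /** Buffers the reading; returns the mean only when the window is full. */
    OptionalDouble accept(double value) {
        sum += value;
        count++;
        if (count < size) {
            return OptionalDouble.empty();
        }
        double mean = sum / count;
        sum = 0.0;   // start a fresh window
        count = 0;
        return OptionalDouble.of(mean);
    }
}
```

With a window of 3, for instance, the readings 1.0, 2.0 and 3.0 produce a single forwarded value (their mean) rather than three separate ones. Whether averaging is acceptable depends on the diagnosis rules downstream, since aggregation discards individual readings.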

Unlike static data fragmentation, where the set of values is finite, partitioning streamed data, which is unbounded and arrives at an unpredictable rate, carries a risk of splitting semantically connected RDF triples into separate streams, which in turn may result in incorrect deductions. For example, in our case, by splitting the main RDF stream with respect to the source of the monitored data, we deliberately broke existing links between the PostgreSQL service and its backup service PGBackups. Since the respective triples are handled in isolation from each other, it is now impossible to detect situations in which some of the established client connections are actually connections established during a temporary, short-lasting backup process. Therefore, careful design of the fragmentation logic is crucial to ensure that no valuable data is misplaced or lost.
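One way to picture such a design decision is a partition key that keeps semantically linked services on the same sub-stream. The sketch below is an illustrative assumption rather than the approach taken in this work, where the stream is split strictly by source; the class name `CoPartitioner` and the service keys are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical partitioner that routes linked services to one sub-stream. */
final class CoPartitioner {
    private final Map<String, String> linkedTo = new HashMap<>();

    /** Declares that triples from `dependent` must travel with `primary`. */
    void link(String dependent, String primary) {
        linkedTo.put(dependent, primary);
    }

    /** Sub-stream key: the primary service if linked, else the service itself. */
    String keyFor(String service) {
        return linkedTo.getOrDefault(service, service);
    }
}
```

After `link("pgbackups", "postgresql")`, both services map to the key `"postgresql"`, so a single framework instance would see connection triples and backup triples together and could relate them. The trade-off is a less even distribution of load, since the merged sub-stream carries the traffic of two services.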