

Even though technological advances in handling static data, or ‘data at rest’, seem to be coping with the quantity of data (e.g., the MapReduce computing model (Dean and Ghemawat, 2008) or so-called NoSQL databases (Pokorny, 2013)), the problem may often appear artificial: it is not the vast amounts of static data which have to be stored and processed in novel ways; it is the lack of efficient, reliable and optimised solutions for processing data in motion which forces enterprises to first store their data permanently on hard drives and then process them in a static manner (Wähner, 2015). Indeed, if we take a closer look at the data volumes generated by the world today, we will see that big data sources feed data unceasingly in real time and that the majority of the data is streaming by its very nature: social network updates, stock exchange fluctuations, sensor readings, purchase transaction records, mobile phone and GPS signals, live video and audio content, etc. An increasing number of distributed applications are required to process continuously streamed data from geographically distributed sources at unpredictable rates to obtain timely responses to complex queries.

Raw data generated on a daily basis comes from everywhere: sensor networks, posts to social media sites, digital pictures and videos, purchase transaction records and stock exchange fluctuations, to name a few. Even though existing technologies seem to succeed in storing these overwhelming amounts of data, on-the-fly processing of newly generated data is an inherently pressing task. If tackled by traditional DBMSs, the task of processing continuously streamed data from geographically distributed sources at unpredictable rates to obtain timely responses to complex queries will be hindered by two main factors (Margara and Cugola, 2011). Firstly, in relational databases data is supposed to be (persistently) stored and indexed before it can be processed. Secondly, data is typically processed only when explicitly queried by users (i.e., asynchronously with respect to its arrival).

Information Flow Processing (IFP), a key research area concerned with processing streamed data, investigates potential solutions to these limitations of the traditional static approaches. IFP focuses on data flow processing and timely reaction (Margara and Cugola, 2011). The former assumes that data is not persisted, but rather flows continuously and is processed in memory; the latter means that IFP systems aim to operate in real time, so that time constraints are crucial for them. These two key features have led to the emergence of a family of computer systems specifically designed to process incoming data streams based on a set of pre-deployed processing rules. A data stream consists of an unbounded sequence of values that are continuously appended, each annotated with a timestamp, usually indicating when the value was generated (Calbimonte et al., 2012). Timestamps allow stream processing solutions to arrange incoming tuples in chronological order. Usually (but not necessarily), recent tuples are more relevant and useful because they represent a more up-to-date situation and are therefore more helpful in achieving near real-time operation. Examples of data streams include environmental sensor readings, stock market tickers and social media updates.
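The notion of a stream as a sequence of timestamped tuples can be sketched in a few lines of Python. The `Reading` type, its field names and the two sample sources are illustrative assumptions and do not come from any of the systems cited here. The sketch only shows how timestamps let a consumer interleave tuples from several sources chronologically, assuming each source is itself already time-ordered.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    """A hypothetical stream tuple: generation timestamp, source id, payload."""
    timestamp: float
    source: str
    value: float


def merge_by_timestamp(*sources: Iterable[Reading]) -> Iterator[Reading]:
    """Lazily interleave several time-ordered sources into one chronological stream."""
    # heapq.merge requires every input to be sorted by the key already.
    return heapq.merge(*sources, key=lambda r: r.timestamp)


# Two made-up sensors, each emitting readings in timestamp order.
sensor_a = [Reading(1.0, "a", 20.5), Reading(4.0, "a", 20.9)]
sensor_b = [Reading(2.0, "b", 18.1), Reading(3.0, "b", 18.4)]
merged = list(merge_by_timestamp(sensor_a, sensor_b))
```

Because the merge is lazy, the same function would also work on generators that never terminate, which is closer to the unbounded case described above.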

Querying over data streams

To cope with the unbounded nature of streams and enable data processing, so-called continuous query languages (Calbimonte et al., 2012) have been developed to extend the conventional Structured Query Language (SQL) semantics with the notion of windows. A window is a temporal operator, which uses tuple timestamps to transform unbounded sequences of values into bounded ones, allowing the traditional relational operators to be then applied to the resulting collection of tuples. This approach restricts querying to a specific window of concern, which consists of a subset of most recent tuples, while older values are (usually) ignored (Barbieri, Braga, Ceri, Della Valle and Grossniklaus, 2010). Windows can be specified in terms of:

• number of elements (tuples), when a window consists of a number of latest elements regardless of the arrival time, and

• time, when a window consists of all elements which have arrived during the specified time frame (in this case, the window can be potentially empty).

This division into tuple- and time-based windows is also referred to as physical and logical extraction, respectively (Barbieri et al., 2009).
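The two ways of specifying a window can be illustrated with a minimal Python sketch. The class names, the `(timestamp, value)` tuple layout and the half-open interval used for the time-based window are assumptions made for this example only; tuples are also assumed to arrive in timestamp order.

```python
from collections import deque
from typing import Deque, List, Tuple

Item = Tuple[float, str]  # (timestamp, value); an assumed tuple layout


class CountWindow:
    """Keeps the latest `size` tuples, regardless of their arrival time."""

    def __init__(self, size: int) -> None:
        self._items: Deque[Item] = deque(maxlen=size)

    def push(self, item: Item) -> List[Item]:
        self._items.append(item)  # the deque drops the oldest tuple when full
        return list(self._items)


class TimeWindow:
    """Keeps tuples with a timestamp in the interval (now - span, now]."""

    def __init__(self, span: float) -> None:
        self._span = span
        self._items: Deque[Item] = deque()

    def push(self, item: Item) -> None:
        self._items.append(item)

    def contents(self, now: float) -> List[Item]:
        # Evict tuples that fell out of the time frame; the result may be empty.
        while self._items and self._items[0][0] <= now - self._span:
            self._items.popleft()
        return list(self._items)
```

A count-based window always holds up to `size` tuples once enough have arrived, whereas the time-based one returns an empty list when nothing arrived during the last `span` time units.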

Depending on how the window operator ‘moves’ along the data stream, we can distinguish between overlapping and non-overlapping windows. With the former approach (also known as sliding), the transition between windows is smooth, such that two neighbouring windows may overlap and the same tuples may appear in both of them. In the latter case, also known as tumbling, the transition between windows is ‘discrete’, so that a tuple can appear in at most one window.
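The difference between sliding and tumbling windows comes down to how far the window advances relative to its size. The following Python sketch uses count-based windows over a finite list that stands in for a stream prefix; the function name `windows` and its `size` and `step` parameters are hypothetical and chosen for illustration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def windows(items: Sequence[T], size: int, step: int) -> Iterator[List[T]]:
    """Yield count-based windows of `size` tuples, advancing by `step` tuples.

    step < size  -> sliding (overlapping) windows
    step == size -> tumbling (non-overlapping) windows
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    for start in range(0, len(items) - size + 1, step):
        yield list(items[start:start + size])


stream = [1, 2, 3, 4, 5, 6]  # a finite prefix standing in for an unbounded stream
sliding = list(windows(stream, size=3, step=1))
tumbling = list(windows(stream, size=3, step=3))
```

With `step=1`, tuple `3` appears in three consecutive windows; with `step=3`, every tuple belongs to exactly one window.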

Figure 3.1: Continuous query languages address the problem of querying an unbounded data stream by focussing on a well-defined window of interest (excerpted from Dautov et al., 2014b).

The concepts of unbounded data streams and windows are visualised in Figure 3.1. The multi-coloured circles represent tuples continuously arriving over time and constituting a data stream, whereas the thick rectangular frame illustrates the window operator applied to this unbounded sequence of tuples. As time passes and new values are appended to the data stream, old values ‘fade away’: they are pushed out of the specified window, i.e., they are no longer relevant and may be discarded (unless there is a need to store historical data for later analysis).

Data stream management systems (DSMSs), an evolution of traditional static DBMSs, were specifically designed and developed to process streaming data coming from different sources and to produce new data streams as output (Margara and Cugola, 2011). Examples of DSMSs include SQLstream (http://www.sqlstream.com/), STREAM (http://infolab.stanford.edu/stream/), Aurora (http://cs.brown.edu/research/aurora/) and TelegraphCQ, among others. These systems target transient, continuously updated data and run standing (i.e., continuous) queries, which fetch updated results as new data arrives.

Complex Event Processing (CEP) goes beyond simple data querying and aims to detect complex event patterns, themselves consisting of simpler atomic events, within a data stream (Margara and Cugola, 2011). From CEP’s point of view, constantly arriving tuples can be seen as notifications of events happening in the external world, e.g., a fire alarm signal, a social status update or a stock exchange update. Accordingly, the focus of this perspective is on detecting occurrences of particular patterns of (lower-level) events that represent higher-level events. A standing query fetches results (i.e., a notification of a complex event is sent to the interested parties) if and only if a corresponding pattern of lower-level events is detected. For example, a common task addressed by CEP systems is detecting situation patterns in which one atomic event happens after another. To achieve this functionality, CEP systems also rely on tuple timestamps; they extend continuous query languages with sequential operators, which allow specifying the chronological order of tuples or, simply put, whether one tuple arrives before or after another in time.
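A sequential operator of this kind can be approximated by a small Python generator. This is a sketch under stated assumptions and does not reproduce the syntax or semantics of any particular CEP engine: the function name `followed_by`, the `(timestamp, event type)` layout, the `within` bound and the sample event names are all hypothetical, and events are assumed to arrive in timestamp order.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (timestamp, event type); an assumed layout


def followed_by(events: Iterable[Event], first: str, second: str,
                within: float) -> Iterator[Tuple[Event, Event]]:
    """Yield (a, b) pairs where a `second` event occurs strictly after a
    `first` event and at most `within` time units later."""
    pending: List[Event] = []  # `first` events that can still be matched
    for event in events:  # assumed to arrive in timestamp order
        ts, kind = event
        # Forget `first` events that are too old to match anything now.
        pending = [p for p in pending if ts - p[0] <= within]
        if kind == second:
            for p in pending:
                if p[0] < ts:
                    yield (p, event)  # complex event detected
        if kind == first:
            pending.append(event)


# Made-up atomic events: a "smoke" reading followed by a "heat" reading.
events = [(1.0, "smoke"), (2.0, "status"), (3.0, "heat"), (20.0, "heat")]
matches = list(followed_by(events, "smoke", "heat", within=5.0))
```

Here only the `heat` event at time 3.0 completes the pattern; the one at time 20.0 arrives after the `smoke` event has left the time bound, so no notification is produced for it.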