• No se han encontrado resultados

our spatial-partitions are no larger than the MR block size in the cluster configuration (i.e. 64MB by default), so that every partition can be processed by a MR task, the MR block size can be adjusted according to the cluster configurations. We also build our quad-index using a sampling-based approach which can be combined with Spark.

Spatial Data on Spark: A number of existing works provide unified systems for spatial queries using Spark, aiming to achieve better performance and scalability for spatial data. Simba [168] uses Spark’s RDD to support spatial indexing and spatial operations natively. Simba adds spatial keywords to SparkSQL grammar, so that users can express spatial operations in a SQL-like fashion. Similarly, GeoSpark [175], SpatialSpark [174], and SparkGIS [16] have been proposed to process spatial data on top of Spark. However, although achieving high performance and scalability for spatial queries, none of these systems fully support trajectory data processing nor map-matching. OceanST [178] does provide support for trajectory data on Spark, however, only for selection queries using uniform static data partitioning. Our proposed framework, on the other hand, provides support for map-matching using balanced space partitioning.

2.4

Distributed Parallel Computation

There is vast body of literature on large-scale data processing, wherein distributed-based solutions outperforms centrally-based in aspects such as scalability, fault tolerance and I/O operations [1] [24] [74] [114] [141] [170]. Distributed computation architectures are designed to share the data across many CPU nodes in a cluster to be processed separately by parallel processes running on each node, making the data processing faster, more efficient and scalable.

The two most popular paradigms for distributed computing on shared-nothing2architectures are Parallel Database Management Systems (Parallel DBMS) and MapReduce (MR).

2.4.1

Parallel DBMS

Parallel DBMSs extend common database systems by allowing the parallelization of many data-driven tasks, while still supporting standard relational table schemas, SQL operations, and user-defined functions (UDFs). In a parallel DBMS the data schema and query operations are shared across all the data nodes. The systems architecture are generally designed to hide from users lower level system details, such as data indexing options, storage schema, and join operations strategies [1] [114].

2.4.2

MapReduce Framework

MapReduce (MR) [39] is a open-source programming model proposed by Google for massive data processing in a cluster-based environment, usually composed of inexpensive machines, assisting the solution of many real-world problems in a distributed manner. A MR program is composed of a Map()

a and Reduce() methods. The Map() method performs filtering, sorting, or other preprocessing operations, and map each data input to a hkey, valuei pair. While the Reduce() method merge data with same key, applying some operation or summary on the values. The MR library was built so that programmers do not need to worry about hardware and network failures, resources allocation, process communication and synchronization, data transfer between nodes, and other problems inherent of distributed programming. MR is also more extensible and presents better capacity to deal with unstructured and semi-structured data than parallel DBMS [1] [114] [170].

Map

Map Map

Map

Data Input Map Phase

(HDFS) Output (K, V) Pairs

Reduce

Reduce

Shuffle: Sort and Group Values by Key (K, [V1, V2, ...]) K1 K2 Output Data (HDFS) Map Map Map Map Reduce Reduce Split 1 Split 2 Split 3 ... Split M Output 1 Output 2 Map Reduce Phase

Figure 2.8: MapReduce framework execution overview.

In the MR framework, Figure 2.8, a dataset is partition into M splits across nodes in a cluster. The MR file system (GFS [57]) partition each input split into blocks (64 MB by default), and send each data block to be processed in parallel by map tasks. The mapper read each input in the data block and outputs a hkey, valuei pair for the input. When the map task is complete, the results are sorted and grouped by key, this is the shuffle process, so that all the occurrences of the same key are processed by the same reducer. Each reduce task receives a hkey, list(value)i, with a list of values for a certain key, and outputs a hkey, resulti pair after processing the list of values. The number of map and reduce tasks, as well as the data block size, can be controlled by the user; however, some cost-based parameters optimizer exist [70].

Here is a simple example of the word-count problem that can be easily expressed as MR com- putations from [39]. “Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write the MR functions similar to the Algorithm 1 pseudo-code.”

The map function emits each word plus an associated count of occurrences (just ‘1’ in this simple example). The reduce function sums together all counts emitted for a particular word.

There may be hundreds or thousands of machine nodes in a MR cluster. There are Worker nodes, which are responsible to process map and reduce tasks on each data block; and the Master node, which is responsible to assign map and reduce tasks to the Workers, as well as allocate resources and manage fault-tolerance. The file system manages fault-tolerance by replicating the data to different nodes, two copies by default, so that if a Worker fails the master resets the failed tasks on another machine that

2.4. DISTRIBUTED PARALLEL COMPUTATION 31

Algorithm 1 MapReduce Word Count

Input: key: document name, value: document contents

1: function MAP(String key, String value)

2: for each word w in value do 3: EmitIntermediate(w, 1)

4: end for

5: end function

Input: key: a word, values: a list of counts

1: function REDUCE(String key, Iterator values) 2: result ← 0

3: for each v in values do 4: result ← result + v

5: end for

6: Emit(result)

7: end function

contains the data replication.

MR-based works developed strategies to optimize I/O and reduce the network and I/O cost of sorting and grouping the data output from mappers to reducers (i.e. shuffle); and improve data locality and grouping strategies, to keep data with same key fiscally close to one another. Large spatial database applications this can be achieved by means of locality-aware partitioning, which can reduce the number of data objects access, thus reducing network and I/O costs [184]. Surveys about applications and data management using MapReduce can be find at [42] [88].

Hadoop: Apache Hadoop [68] is an open-source framework of tools built on top of MR that enables the processing of massive amounts of data in a distributed parallel fashion; it makes the process of developing solutions for large and distributed data easier for programmers, providing a set of tools to abstract the complexity of writing programs over MR. Hadoop provides its own Distributed File System (HDFS); similar to the GFS [57], the HDFS splits the whole dataset into smaller blocks distributed throughout the cluster (blocks of 64 MB by default), so that each map and reduce functions can be executed over smaller subsets of the whole dataset, providing the necessary scalability for large-scale data processing.

Apart from that, Hadoop allows the processing of data streams using the framework Storm [143] and can be integrated with Parallel DBMS [1] [170], providing both SQL (e.g. Hive [149]) and NoSQL (e.g. Pig [107]) programming tools.

2.4.3

MapReduce vs. Parallel DBMS

Although both paradigms share many common features, such as high scalability, compressing op- timization, and indexing optimization, they are complementary technologies [141] [170]. Parallel DBMS, for instance, shows better performance, more indexing efficiency, and less code to implement queries if compared to MR for large-scale data; whereas MR is more extensible and schema free,

hence presents better capacity to deal with unstructured and semi-structured data; MR is also more fault-tolerant and faster to load data than Parallel DBMS [114]; furthermore, MR scales better for large numbers os nodes [1]. Frameworks like Hadoop [68] and Spark [139] can assist data scientists to achieve a trade off between both technologies [1] [170].

Documento similar