
Despite many efforts to process large-scale trajectory and time-series data in parallel, e.g. [27] [41] [53] [73] [93] [121], trajectories of moving objects are difficult to fit into the MR model due to their multi-dimensional and sequential nature. Furthermore, even for single-threaded applications, trajectory management still poses a great challenge (recall Section 2.2). Nevertheless, some efforts have been made to process trajectories using MR.

Figure 2.14: Trajectory, partitioning and query in TRUSTER. Source [172].

2.6. SPATIAL DATA PROCESSING IN MAPREDUCE

cells for space partitioning, and a 1D tree to index time within each cell. During the partitioning phase in MR, trajectories are split into segments and every segment is assigned to the partition it overlaps; a segment that spans more than one grid cell is split according to the cells it overlaps. During query processing, the segments in the partitions overlapping the query range are selected. However, TRUSTER uses uniform grid-cell partitioning and hence does not handle load balancing. Figure 2.14 shows an example of partitioning and query in TRUSTER.

In [98] the authors presented PRADASE, an improvement of TRUSTER. PRADASE indexes both space and time using a quad-tree-based structure for better load balancing. PRADASE uses GFS-based [57] data storage for replication and dynamic partitioning of the temporal dimension. Trajectories are indexed using two spatial indexes, PMI and OOI, to optimize trajectory queries. PMI provides a quad-tree-based space partition with a multiple-assignment strategy for boundary objects, whereas OOI associates moving objects with their respective trajectories. However, both TRUSTER and PRADASE are disk-based, and consider neither trajectory data preprocessing nor the query workload.
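The grid partitioning step described for TRUSTER can be illustrated with a minimal sketch. It is not the TRUSTER implementation: the function names, the `(x, y, t)` point layout, and the linear interpolation of time along a segment are assumptions made for illustration. Each segment is cut at every grid line it crosses, and each piece is assigned to the cell that contains its midpoint.

```python
import math
from collections import defaultdict


def split_segment_by_grid(p, q, cell_size):
    """Split segment p->q at every grid line it crosses.

    p and q are (x, y, t) tuples. Returns a list of (cell, piece) pairs,
    where cell is an (ix, iy) grid index and piece is a pair of points.
    Time is interpolated linearly along the segment (an assumption).
    """
    (x0, y0, t0), (x1, y1, t1) = p, q
    cuts = {0.0, 1.0}
    for a, b in ((x0, x1), (y0, y1)):
        if a == b:
            continue  # segment is parallel to this axis, no crossings
        lo, hi = min(a, b), max(a, b)
        k = math.floor(lo / cell_size) + 1  # first grid line beyond lo
        while k * cell_size < hi:
            cuts.add((k * cell_size - a) / (b - a))
            k += 1
    params = sorted(cuts)

    def at(u):
        return (x0 + u * (x1 - x0), y0 + u * (y1 - y0), t0 + u * (t1 - t0))

    pieces = []
    for u, v in zip(params, params[1:]):
        mx, my, _ = at((u + v) / 2)  # midpoint decides the owning cell
        cell = (math.floor(mx / cell_size), math.floor(my / cell_size))
        pieces.append((cell, (at(u), at(v))))
    return pieces


def partition_trajectory(points, cell_size):
    """Group the split pieces of a trajectory by grid cell."""
    partitions = defaultdict(list)
    for p, q in zip(points, points[1:]):
        for cell, piece in split_segment_by_grid(p, q, cell_size):
            partitions[cell].append(piece)
    return dict(partitions)


def cells_for_range(xmin, ymin, xmax, ymax, cell_size):
    """Cells whose partitions must be scanned for a rectangular query."""
    xs = range(math.floor(xmin / cell_size), math.floor(xmax / cell_size) + 1)
    ys = range(math.floor(ymin / cell_size), math.floor(ymax / cell_size) + 1)
    return {(ix, iy) for ix in xs for iy in ys}


if __name__ == "__main__":
    trajectory = [(0.5, 0.5, 0.0), (2.5, 0.5, 20.0)]
    for cell, pieces in sorted(partition_trajectory(trajectory, 1.0).items()):
        print(cell, pieces)
```

Because the cells have a fixed size, a dense region produces much larger partitions than a sparse one, which is the load-balancing weakness noted above.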

CloST [146] is a Hadoop-based storage system for spatial-temporal range queries. CloST proposes a new data model and file format to store trajectory data in HDFS. CloST uses a three-level hierarchical partitioning in MR: in the first level, trajectories are grouped into coarse buckets according to the moving object's OID; in the second level, each bucket is partitioned into spatial regions using a quad-tree; in the third level, each region is divided into fine-grained 1-D blocks of time. Figure 2.15 illustrates the hierarchical partitioning in CloST. Input records from the same moving object are grouped together and stored in a table format in HDFS using delta and run-length compression. The goal of CloST is to support efficient single-object queries (i.e. selection by object OID) and all-object queries (i.e. spatial-temporal selection). Although CloST also provides dynamic partitioning according to the data utilization ratio, it is a disk-based approach and considers neither trajectory data preparation nor preprocessing.
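A minimal sketch of a three-level partition key in the spirit of the hierarchy described for CloST follows. It is a simplification under stated assumptions, not the CloST file format: the bucket is chosen by a modulo on the OID, the spatial level uses a fixed-depth quadrant path instead of a data-adaptive quad-tree, and the names `quad_path`, `partition_key`, `num_buckets`, `depth`, and `block_seconds` are hypothetical.

```python
def quad_path(x, y, bounds, depth):
    """Return the quadrant path of (x, y) inside bounds as a digit string.

    bounds is (xmin, ymin, xmax, ymax). At each level the digit is
    0..3: bit 0 is set for the right half, bit 1 for the upper half.
    A fixed depth is an assumption; a real quad-tree splits adaptively.
    """
    xmin, ymin, xmax, ymax = bounds
    digits = []
    for _ in range(depth):
        xmid = (xmin + xmax) / 2
        ymid = (ymin + ymax) / 2
        quadrant = 0
        if x >= xmid:
            quadrant |= 1
            xmin = xmid
        else:
            xmax = xmid
        if y >= ymid:
            quadrant |= 2
            ymin = ymid
        else:
            ymax = ymid
        digits.append(str(quadrant))
    return "".join(digits)


def partition_key(oid, x, y, t, num_buckets, bounds, depth, block_seconds):
    """Three-level key: (OID bucket, spatial region, time block)."""
    bucket = oid % num_buckets                 # level 1: coarse OID bucket
    region = quad_path(x, y, bounds, depth)    # level 2: spatial region
    block = int(t // block_seconds)            # level 3: 1-D time block
    return (bucket, region, block)


if __name__ == "__main__":
    key = partition_key(42, 5.0, 6.0, 7300, num_buckets=8,
                        bounds=(0.0, 0.0, 8.0, 8.0), depth=2,
                        block_seconds=3600)
    print(key)
```

With such a key, a query on one object only touches one OID bucket, while a spatial-temporal query over all objects scans the matching region and time block in every bucket.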

OceanST [178] is a Spark-based system designed for spatial-temporal Mobile Broadband (MBB) data. OceanST adopts the same hierarchical partitioning as CloST, and provides an additional set of inverted indexes on attributes associated with MBB data (e.g. textual information). OceanST uses Spark to speed up exact and sampling-based aggregate queries over distributed data (e.g. count, distinct, max, min, sum, avg). OceanST also includes an API with some basic spatial-temporal analytics, such as frequent path identification, transportation mode prediction, and activity prediction. However, OceanST only provides static off-line data partitioning; hence it does not consider the query workload and is not memory-aware. Moreover, OceanST targets MBB data: it does not support indexing for similarity search over GPS trajectories, nor does it consider data preprocessing.

Another work by Li et al. [87] uses MR to calibrate bus trajectories and identify bus route directions using a k-NN query; however, they simply run k-NN on the whole dataset using MR, without any index structure or data partitioning, which negatively affects performance. Jinno et al. [76] proposed a grid representation of trajectories and a quad-tree-based search with MapReduce for frequent movement pattern mining; the grid resolution can be modified to identify different types of patterns. However, they use a lossy representation of trajectories, and their approach can only identify a few types of movement patterns.

Figure 2.15: Hierarchical partitioning in CloST. Source [146].
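To make the trade-off of a grid representation concrete, the following sketch maps a trajectory to the sequence of grid cells it visits. It is only an illustration of the general idea, not the method of [76]; the function name `to_cell_sequence` and the `(x, y)` point layout are assumptions.

```python
import math


def to_cell_sequence(points, cell_size):
    """Map (x, y) points to grid cells, collapsing consecutive repeats.

    The result is lossy: exact positions inside a cell are discarded,
    and a coarser cell_size discards more detail.
    """
    cells = []
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells


if __name__ == "__main__":
    points = [(0.2, 0.3), (0.8, 0.4), (1.4, 0.6), (2.7, 1.2)]
    print(to_cell_sequence(points, 1.0))  # fine resolution
    print(to_cell_sequence(points, 2.0))  # coarse resolution
```

Changing `cell_size` changes which movements look identical, which is why the resolution determines the kinds of patterns that can be found.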

Chapter 3

Trajectory Data Preparation and Preprocessing

In this chapter we present our contributions to trajectory data preparation and preprocessing. In Section 3.1 we introduce a novel script model for trajectory data representation, and design a system for trajectory data integration and compression. In Section 3.2 we introduce a framework for trajectory data preprocessing using map-matching on top of Spark, in order to achieve data quality with performance and scalability.

3.1 Trajectory Data Integration and Representation

Raw trajectories should go through a series of preprocessing steps before they become suitable for indexing and querying. This section introduces a novel parallel system for trajectory data integration and representation, with support for lossless delta compression of trajectories and synthetic trajectory data generation. The system also provides templates for trajectory data representation (e.g. spatial-temporal attributes, textual attributes), giving a single data model for the integration of different input datasets. Moreover, the system is responsible for collecting statistics on the input dataset (i.e. metadata).
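As an illustration of why delta compression can be lossless, the sketch below encodes each attribute of a trajectory as its first value followed by the differences between consecutive values. It is a generic delta encoding, not necessarily the scheme used by the system described here; it assumes integer attributes (for example, coordinates stored as fixed-point integers), and the function names are hypothetical.

```python
from itertools import accumulate


def delta_encode(values):
    """Store the first value, then each value as a difference from the previous one."""
    return [cur - prev for prev, cur in zip([0] + values[:-1], values)]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


def encode_trajectory(points):
    """Delta-encode each attribute column of a list of (x, y, t) integer tuples."""
    columns = [list(column) for column in zip(*points)]
    return [delta_encode(column) for column in columns]


def decode_trajectory(encoded):
    """Rebuild the original list of (x, y, t) tuples."""
    columns = [delta_decode(column) for column in encoded]
    return list(zip(*columns))


if __name__ == "__main__":
    points = [(1000, 2000, 0), (1003, 2001, 5), (1007, 2001, 10)]
    encoded = encode_trajectory(points)
    print(encoded)
    print(decode_trajectory(encoded) == points)
```

Consecutive samples of a moving object are usually close in space and time, so the differences are small numbers that need fewer bits than the absolute values, and integer arithmetic guarantees an exact round trip.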
