
In this chapter, we first reviewed existing work on large-scale data management as background for this thesis. We then reviewed related work on multi-way join, real-time aggregation, and real-time search query processing in existing large-scale systems. In summary, these works have the following limitations:

1. For large-scale multi-way join query processing, only rule-based query optimizers have been implemented. A cost-based query optimizer can significantly improve the performance of multi-way joins. In addition, the existing plan iteration algorithms for cost-based optimization were designed only for centralized DBMSs. A more suitable algorithm can further improve the effectiveness of the cost-based optimizer and reduce query execution time.

2. In existing large-scale query processing, OLTP and OLAP queries are usually processed in separate systems. The data stored in the OLTP module are periodically exported to the OLAP module, so the freshness of the OLAP results becomes an issue. Although real-time query processing has been studied in traditional DBMSs, it remains a difficult problem in a distributed environment. Distributed streaming systems such as S4 try to return timely results to users, but they assume that new tuples are only appended to the data. Our work considers the scenario where existing data might be updated.

3. Real-time search has become an important requirement of microblogging systems, but the existing ranking schemes offered by microblogging vendors such as Twitter cannot return meaningful results: the search results are sorted only by the upload time of each microblog. It would be helpful if the ranking scheme considered the relationships between microblogging users and the reply/retweet relationships between microblogs. Furthermore, it is also important to index the microblogs and rank the search results efficiently in a distributed environment.
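To make the append-only assumption in limitation 2 concrete, the following is a minimal sketch of a per-group SUM that stays correct when an existing tuple is updated rather than appended. It is an illustration only, not the method used by S4 or by ART; the class `UpdatableSum`, its methods, and the sample keys are hypothetical names introduced here.

```python
from collections import defaultdict


class UpdatableSum:
    """Per-group SUM that stays correct under inserts, updates and deletes.

    Hypothetical illustration: names and structure are assumptions,
    not taken from any system discussed in the text.
    """

    def __init__(self):
        self.rows = {}                   # key -> (group, value)
        self.totals = defaultdict(int)   # group -> running sum

    def upsert(self, key, group, value):
        # An update first retracts the old contribution of this key,
        # then adds the new one. A pure insert has nothing to retract.
        self.delete(key)
        self.rows[key] = (group, value)
        self.totals[group] += value

    def delete(self, key):
        old = self.rows.pop(key, None)
        if old is not None:
            old_group, old_value = old
            self.totals[old_group] -= old_value

    def total(self, group):
        return self.totals.get(group, 0)


agg = UpdatableSum()
agg.upsert("t1", "alice", 3)   # tweet t1 by alice has 3 retweets
agg.upsert("t2", "alice", 5)
agg.upsert("t1", "alice", 4)   # t1 is updated in place, not appended
print(agg.total("alice"))      # 9
```

An append-only aggregator that treated the third call as a new tuple would report 12 for this group, because the stale value 3 would never be retracted.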

In the rest of this thesis, we first present an overview of our proposed microblogging data management system, ART. We then show how ART addresses each of the above three challenges.

CHAPTER 3 System Overview

With the development of social networking, the amount of data on the web is growing exponentially. Taking microblogging as an example, there are more than 115 million active Twitter users every month, and millions of tweets are published every day. The valuable information contained in these data, and the challenges of managing them, have attracted many researchers to design “big data” systems for microblogging. In this thesis, we propose ART, a fully functional, scalable, and efficient microblogging data management system. It can process the major queries required by a microblogging system and is optimized for three types of queries (multi-way join, aggregation, and real-time search). In this chapter, we discuss the design philosophy and architecture of ART in detail.

3.1 Design Philosophy of ART

ART is designed to support the following features:

1. Functionality. First, ART must be a fully functional system that can process all the fundamental queries required by a microblogging system. A microblogging system has two main groups of queries: (1) user queries, such as OLTP queries (update, insert, delete) and real-time search queries; and (2) data analysis queries (both offline analytics and real-time analytics) issued by the system administrators. We design ART to support both groups so that it can be used directly as a back-end microblogging data management system without further extension.

2. Modularity. As ART is required to support various queries, it is not feasible to implement all of this functionality within one module. Based on the query types, we divide ART into three modules. The first is the OLTP module, which processes the OLTP queries issued by users. To enable real-time analytics, the latest updates caused by user actions must be reflected in the results of real-time analysis queries; thus, real-time analysis queries are also handled by this module. The second is the offline analytics module, which processes analysis queries on data that are periodically loaded from the OLTP module. The third is the real-time search module, which maintains a real-time inverted index to serve real-time search queries. To keep the processing logic inside each module independent of the implementations of the other modules, a higher-level module can load data from a lower-level system only through the data loading API.

3. Scalability. The most important requirement of ART is to scale up as the data volume increases. To ensure high scalability and to minimize the implementation effort, we build the modules of ART by extending existing “big data” systems (such as Hadoop, HBase and Hive). These stable systems are already widely used to provide scalable services, and ART inherits their scalability.

4. Efficiency. In addition to the above features, we optimize each module of ART so that it is more efficient than existing systems. Specifically, the offline analytics module is optimized to improve the performance of multi-way join queries, and the real-time analytics module is optimized to process real-time aggregation queries efficiently. For real-time search queries, ART offers a better ranking scheme than existing methods within an acceptable response time.
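The modularity principle above, in which a higher-level module sees lower-level data only through a data loading API, can be sketched as follows. This is a minimal illustration under stated assumptions: the interface `DataLoader`, its method `load_since`, the classes `OltpModule` and `OfflineAnalyticsModule`, and the version-number scheme are all hypothetical and are not ART's actual API.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Protocol


class DataLoader(Protocol):
    """Hypothetical data loading API exposed by a lower-level module."""

    def load_since(self, version: int) -> Iterable[tuple[int, dict]]: ...


class OltpModule:
    """Holds the latest rows; each write gets an increasing version number."""

    def __init__(self) -> None:
        self._log: list[tuple[int, dict]] = []

    def insert(self, row: dict) -> None:
        self._log.append((len(self._log) + 1, row))

    def load_since(self, version: int) -> list[tuple[int, dict]]:
        # Return only the rows written after the given version.
        return [(v, r) for v, r in self._log if v > version]


class OfflineAnalyticsModule:
    """Sees OLTP data only through the DataLoader interface."""

    def __init__(self, source: DataLoader) -> None:
        self._source = source
        self._loaded_version = 0
        self._rows: list[dict] = []

    def refresh(self) -> None:
        # Periodic load: pull everything written since the last refresh.
        for version, row in self._source.load_since(self._loaded_version):
            self._rows.append(row)
            self._loaded_version = version

    def tweets_per_user(self) -> Counter:
        return Counter(row["user"] for row in self._rows)


oltp = OltpModule()
offline = OfflineAnalyticsModule(oltp)
oltp.insert({"user": "alice", "text": "hello"})
offline.refresh()
oltp.insert({"user": "alice", "text": "again"})
stale = offline.tweets_per_user()["alice"]   # 1: second tweet not loaded yet
offline.refresh()
fresh = offline.tweets_per_user()["alice"]   # 2: visible after the next load
```

Because the offline module depends only on `load_since`, the storage behind the OLTP module could be swapped without touching the analytics code. The `stale` value also shows why periodically loaded data lags behind the OLTP state until the next refresh.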

[Figure: architecture diagram. Module labels: Offline Analytics (AQUA); OLTP and Real-Time Analytics: R-Store; Real-Time Search: TI. Other labels in the diagram: Hadoop, HDFS, MapReduce, HBase, Meta Store, Streaming System, Data Cube, Log, Twitter Tables, Tweets, Index Processors, Query Processors, Distributed Inverted Indexes. Query inputs: SQL Query, Real-Time Query, OLTP Query, Search Query.]

Figure 3.1: Architecture of ART
