
5. FRAMES OF REFERENCE

5.3. CONCEPTUAL FRAMEWORK

One of the more distinctive commercial SQL-on-Hadoop tools out there is Jethro. Jethro leverages an age-old technique for making queries run faster: indexes. We are all familiar with indexes in the world of databases, and with how they are used to speed up SQL queries in the relational world. So why did no one think about using indexes to speed up queries in the world of big data? Well, Hive has had support for indexes for a while, although, to my knowledge, not many real-world use cases and implementations have used it extensively.

The world of big data, especially HDFS, has a problem with indexes. Because HDFS is a WORM (write once read many [times]) kind of file system, keeping indexes updated as data is added and updated becomes a problem on HDFS. Indexes have played only a minor role in speeding up queries in high-performance analytic databases and in big data implementations, for several reasons:

• Index data typically uses smaller block sizes, which are quite incompatible with the whole idea of bigger block sizes in the big data world.

• Index creation slows down data loads.

• With the addition of newer data, indexes have to be updated, which is, again, incompatible with the architecture of big data file systems such as HDFS and S3.

Jethro brings indexes into the world of big data SQL by way of an innovative indexing mechanism that offsets the aforementioned problems. The mechanism is architected to work with HDFS/S3: its append-only index structure converts index updates into cheap sequential writes, which solves the index update problem.

Jethro’s solution fully indexes all columns in the data set and stores the indexes on HDFS. Only the indexes are used to answer a query, instead of a full scan, as is the case with other MPP-based SQL-on-Hadoop solutions. The more a query drills down into a data set to get a finer level of detail, the better the performance gets, because indexes are leveraged to the best possible extent, unlike full-scan systems, which will do a full scan even for drilled-down queries. Index-based access to the data dramatically lowers the I/O load and the CPU and memory usage, compared to MPP-based full-scan architectures.
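The difference between index-based access and a full scan can be sketched in a few lines of Python. This is a toy in-memory model, not Jethro's actual index format: the names `ColumnIndex`, `build_indexes`, `query_with_indexes`, and `query_with_full_scan` are hypothetical, and only equality predicates are handled. It shows why a query with more predicates touches less data when every column is indexed.

```python
from collections import defaultdict


class ColumnIndex:
    """Toy index for one column: value -> set of row ids (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, row_id, value):
        self.postings[value].add(row_id)

    def lookup(self, value):
        # .get avoids creating empty entries for values never seen.
        return self.postings.get(value, set())


def build_indexes(rows):
    """Index every column of every row; rows is a list of dicts."""
    indexes = defaultdict(ColumnIndex)
    for row_id, row in enumerate(rows):
        for column, value in row.items():
            indexes[column].add(row_id, value)
    return indexes


def query_with_indexes(indexes, predicates):
    """Return sorted row ids matching all `column == value` predicates.

    Only index entries are read; the rows themselves are never touched.
    """
    candidates = [indexes[col].lookup(val) for col, val in predicates.items()]
    if not candidates:
        return []
    # Start from the smallest set so each intersection stays cheap.
    candidates.sort(key=len)
    result = set(candidates[0])
    for row_ids in candidates[1:]:
        result &= row_ids
    return sorted(result)


def query_with_full_scan(rows, predicates):
    """Baseline: inspect every row regardless of how selective the query is."""
    return [
        row_id
        for row_id, row in enumerate(rows)
        if all(row.get(col) == val for col, val in predicates.items())
    ]


rows = [
    {"country": "US", "device": "mobile", "year": 2016},
    {"country": "US", "device": "desktop", "year": 2016},
    {"country": "DE", "device": "mobile", "year": 2016},
    {"country": "US", "device": "mobile", "year": 2015},
]
indexes = build_indexes(rows)
print(query_with_indexes(indexes, {"country": "US", "device": "mobile"}))
```

Each additional predicate shrinks the intersected set of row ids, whereas the full-scan baseline always visits every row.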

When new data is added to the data set, Jethro’s architecture does not modify the existing indexes. Instead of updating the index in place, it appends the new index entries after the current ones, which allows duplicate (repeated) index entries. An asynchronous background process then merges the newer indexes with the older ones and removes the duplicates.

While duplicate index entries exist in the system, the query executor reads multiple index fragments but resolves the query results to the latest index values in case of duplicates.
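The append, read, and merge behavior described above can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not Jethro's implementation: the class `AppendOnlyIndex` and its methods are hypothetical, an index entry is reduced to a key mapped to a single value, a monotonically increasing fragment version stands in for "latest", and the merge is called explicitly rather than running asynchronously in the background.

```python
class AppendOnlyIndex:
    """Toy append-only index: new entries go into new fragments (illustrative)."""

    def __init__(self):
        # Each fragment is a dict: key -> (version, value). Fragments are
        # never modified after they are appended.
        self.fragments = []
        self._version = 0

    def append_batch(self, entries):
        """Write a batch of entries as a new fragment at the end.

        Keys already present in older fragments are allowed, so duplicates
        can exist until the next merge.
        """
        self._version += 1
        fragment = {key: (self._version, value) for key, value in entries.items()}
        self.fragments.append(fragment)

    def lookup(self, key):
        """Read every fragment and resolve duplicates to the latest entry."""
        latest = None
        for fragment in self.fragments:
            hit = fragment.get(key)
            if hit is not None and (latest is None or hit[0] > latest[0]):
                latest = hit
        return None if latest is None else latest[1]

    def merge(self):
        """Compact all fragments into one, keeping only the latest entries.

        In the architecture described above this step would run in the
        background; here it is an explicit call to keep the sketch simple.
        """
        merged = {}
        for fragment in self.fragments:
            for key, hit in fragment.items():
                if key not in merged or hit[0] > merged[key][0]:
                    merged[key] = hit
        self.fragments = [merged] if merged else []


index = AppendOnlyIndex()
index.append_batch({"a": 1, "b": 2})   # first load
index.append_batch({"a": 10})          # later load, duplicate key "a"
print(len(index.fragments), index.lookup("a"))
index.merge()
print(len(index.fragments), index.lookup("a"))
```

Lookups return the same answer before and after the merge; the merge only reduces the number of fragments a query has to read.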

CHAPTER 4 ■ INTERACTIVE SQL—ARCHITECTURE

Apart from the regular speedup of accessing only the data needed for the query, Jethro has other performance-enhancing features, such as a compressed columnar storage format for the index data, efficient skip-scan I/O, and automatic local caching of frequently accessed column and index blocks. Jethro’s query optimizer uses the index metadata to optimize the queries, rather than the usual statistics gathering and collection processes used in other systems.

Figure 4-18 shows the data flow of index creation during data ingestion and storage of the indexes on HDFS. These indexes are accessed by the Jethro cluster servers during query processing. The indexes are created synchronously as the data is ingested into the big data system.

Figure 4-17. Jethro deployment architecture (Reproduced with permission from Jethro)

Figure 4-18. Jethro data ingestion and query architecture (Reproduced with permission from Jethro)

Jethro’s execution engine is highly parallelized. The execution plan is made of many fine-grained operators, and the engine parallelizes the work within and across operators. The execution engine leverages query pipelining, whereby the rows are pipelined between operators, resulting in higher throughput and lower latency.
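Query pipelining can be illustrated with Python generators, where each operator pulls rows from the one below it and passes results on as soon as they are ready. This is a single-threaded sketch of the pipelining idea only; it does not model the parallelism within and across operators, and the operator names (`scan`, `filter_rows`, `project`, `limit`) and the `stats` counter are hypothetical.

```python
from itertools import islice


def scan(rows, stats):
    """Source operator: emits rows one at a time and counts how many it read."""
    for row in rows:
        stats["scanned"] += 1
        yield row


def filter_rows(rows, predicate):
    """Passes on only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, columns):
    """Keeps only the requested columns of each row."""
    for row in rows:
        yield {column: row[column] for column in columns}


def limit(rows, n):
    """Stops pulling from upstream once n rows have been produced."""
    yield from islice(rows, n)


stats = {"scanned": 0}
table = [{"id": i, "value": i * 10} for i in range(1000)]

# Operators are chained; no operator materializes its full input.
plan = limit(
    project(
        filter_rows(scan(table, stats), lambda row: row["id"] % 2 == 0),
        ["id"],
    ),
    3,
)
print(list(plan), stats)
```

Because rows flow through the operators one at a time, the scan stops as soon as the limit is satisfied instead of reading the whole table first, which is where the lower latency of a pipelined plan comes from.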

There are two downsides to Jethro’s architecture:

1. The need for a dedicated cluster of Jethro servers, separate from the Hadoop cluster

2. A proprietary data format, which is much faster than the ORC/Parquet formats

The first downside is not a downside in the true sense, as it is always advisable to have separate clusters that support different workloads. Hadoop clusters meant for doing ETL and batch workloads should be separated from clusters that support real-time or interactive workloads to satisfy the SLAs. This separation of query clusters from other workload clusters results in better performance and the independent scalability of each cluster.

Others

There are many more products out there, but we cannot cover all of them. One product that has been gaining usage lately is Presto.

Presto was developed at Facebook and is written entirely in Java. It is very similar to Impala and Drill in terms of architecture and general concepts. However, compared to Impala or Drill, which have been on the market for a longer time, Presto is very new, although Netflix uses Presto extensively for ad hoc interactive analytics. Presto is an in-memory distributed SQL query engine that supports ANSI SQL and rich analytical functions. Like Impala and Drill, Presto is just a query engine, not a storage engine. It can connect to a wide variety of data sources: relational, NoSQL, and distributed file systems.

Presto should not be used when you require batch processing or need to implement iterative machine-learning algorithms. Presto is also not recommended for use in data warehouses in which the dimensional modeling is done with star schemas.
