
There are several design decisions that reduce Impala’s query latency compared to other SQL-in-Hadoop solutions.

Efficient use of memory

As a completely rewritten query engine, Impala is not bound by the limitations of the MapReduce engine. Data is read from the disk when the tables are initially scanned, and then remains in memory as it goes through multiple phases of processing. Even when data is being shuffled between different nodes, it is sent through the network without being written to disk first. This means that as queries become more complex and require more stages of processing, Impala’s performance benefits become more pronounced. Contrast this with Hive, which is forced to perform relatively slow disk reads and writes between each stage.

This does not mean that Impala can only process queries whose intermediate results all fit in the cluster’s aggregate memory. The initial versions of Impala did have such a limitation for queries that relied heavily on memory. Examples of such queries are joins (where the smaller table, after filtering, had to fit in the aggregate memory of the cluster), order by (where each individual node did some part of the ordering in memory), and group by and distinct (where each of the distinct keys was stored in memory for aggregation). However, with Impala 2.0 and later, Impala spills to disk when the intermediate data sets exceed the memory limits of any node. Consequently, with newer versions of Impala, queries are no longer limited to those whose intermediate data sets fit within certain memory constraints. Impala will still favor fitting data in memory and running computations that way, but when necessary it will spill data to disk and reread it later, at the cost of additional I/O.

In general, for faster performance, Impala requires significantly more memory per node than MapReduce-based processing. A minimum of 128 GB to 256 GB of RAM is usually recommended. There is still one downside to Impala’s preference for memory over disk: Impala queries can’t recover from the loss of a node in the way that MapReduce and Hive can. If you lose a node while a query is running, your query will fail. Therefore, Impala is recommended for queries that run quickly enough that restarting the entire query after a failure is not a major event. Restarting a query that took a few seconds or even five minutes is usually OK. However, if a query takes over an hour to execute, then Hive might be a better tool.
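As a rough illustration of the memory behavior described above, the following impala-shell sketch caps per-node query memory with the MEM_LIMIT query option and then runs a memory-heavy aggregation. The table foo and its column fooBarId are borrowed from the example later in this section, and the 2g value is an arbitrary assumption. Whether a given query spills to disk or fails under such a cap depends on the Impala version and the operators involved.

```sql
-- Cap the memory each node may use for queries in this session.
-- The 2g value is an arbitrary illustration, not a recommendation.
SET MEM_LIMIT=2g;

-- A memory-heavy aggregation: every distinct key is tracked while grouping.
SELECT f.fooBarId, COUNT(*) AS cnt
FROM foo f
GROUP BY f.fooBarId
ORDER BY cnt DESC;

-- Setting MEM_LIMIT back to 0 removes the per-query cap for this session.
SET MEM_LIMIT=0;
```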

Long running daemons

Unlike the MapReduce engine in Hive, Impala daemons are long-running processes. There is no startup cost incurred and no moving of JARs over the network or loading class files when a query is executed, because Impala is always running. The question sometimes comes up of whether to run the Impala daemons on the same nodes that run MapReduce tasks or on a separate set of nodes. We highly recommend running Impala on all the DataNodes in the cluster, side by side with MapReduce and other processing engines. This allows Impala to read data from the local node rather than over the network (data locality), which is essential for reducing latency. Much of the resource contention between Impala and other processing engines can be managed dynamically via YARN or statically with Linux cgroups.

Efficient execution engine

Impala is implemented in C++. This design decision makes Impala code highly efficient, and also allows a single Impala process to use large amounts of memory without the latency added by Java’s garbage collection. Moreover, in general, it allows Impala to take better advantage of vectorization and certain CPU instructions for text parsing, CRC32 computation, and more because it doesn’t have to access these hardware features through the JVM.

Use of LLVM

One of the main performance improvement techniques used in Impala is the use of Low Level Virtual Machine (LLVM) to compile the query and all the functions used in this query into optimized machine code. This gives Impala query execution a performance boost in multiple ways. First, machine code improves the efficiency of the code execution in the CPU by getting rid of the polymorphism that you’d have to deal with when implementing something similar in, say, Java. Second, the machine code generated uses optimizations available in modern CPUs (such as Sandy Bridge) to improve its I/O efficiency. Third, because the entire query and its functions are compiled into a single context of execution, Impala doesn’t have the same overhead of context switching because all function calls are inlined and there are no branches in the instruction pipeline, which makes execution even faster.

It is possible to turn off the LLVM code generation in Impala by setting the disable_codegen flag. This is used mostly for troubleshooting.
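A minimal sketch of how this might look in impala-shell, assuming the flag is set as the DISABLE_CODEGEN query option for the current session. The query reuses the foo and bar tables from the example in the next section; comparing timings between the two runs is an illustrative use, not a prescribed procedure.

```sql
-- Turn off LLVM code generation for this session (mostly for troubleshooting).
SET DISABLE_CODEGEN=true;

SELECT * FROM
  foo f JOIN bar b ON (f.fooBarId = b.barId)
WHERE
  f.fooVal < 500 AND
  f.fooVal + b.barVal < 1000;

-- Re-enable code generation and run the same query again to compare.
SET DISABLE_CODEGEN=false;

SELECT * FROM
  foo f JOIN bar b ON (f.fooBarId = b.barId)
WHERE
  f.fooVal < 500 AND
  f.fooVal + b.barVal < 1000;
```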

Impala Example

Although the inner workings of Impala can seem quite complex, using Impala is actually fairly easy. You can start impala-shell, the Impala command-line interface, and begin submitting queries like so:

CONNECT <impalaDaemon host name or loadbalancer>;

-- Make sure Impala has the latest metadata about which tables exist, etc.
-- from the Hive metastore
INVALIDATE METADATA;

SELECT * FROM
  foo f JOIN bar b ON (f.fooBarId = b.barId)
WHERE
  f.fooVal < 500 AND
  f.fooVal + b.barVal < 1000;

This code connects to Impala, updates the metadata from Hive, and runs our query. You can immediately see why most developers prefer the SQL version of this code to the MapReduce version we saw earlier in the chapter.

To see the execution plan of a query in Impala, you simply add the word EXPLAIN before your query. The syntax is identical to that of Hive, but the resulting query plan is completely different. Because Impala is implemented as an MPP data warehouse, the execution plans use similar operators and look similar to those of Oracle and Netezza. These are very different from the MapReduce-based plans that are produced by Hive. Here is the explain plan of the query just shown:

+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=32.00MB VCores=2 |
| WARNING: The following tables are missing relevant table |
| and/or column statistics.                                |
| default.bar, default.foo                                 |
|                                                          |
| 04:EXCHANGE [PARTITION=UNPARTITIONED]                    |
| |                                                        |
| 02:HASH JOIN [INNER JOIN, BROADCAST]                     |
| |  hash predicates: f.fooBarId = b.barId                 |
| |  other predicates: f.fooVal + b.barVal < 1000          |

We can see that Impala first scans table foo and filters it with the predicate in the query. The plan shows the filtering predicate and the table size after the filtering as estimated by the Impala optimizer. The results of the filter operation are joined with table bar via a broadcast join, and the plan also shows the join columns and the additional filter.
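The plan above warns that tables foo and bar are missing table and column statistics, which the optimizer relies on for its size estimates. As a sketch, assuming the same two tables in the default database, the statistics could be gathered and the plan re-examined like this:

```sql
-- Gather table and column statistics so the planner can estimate sizes.
COMPUTE STATS foo;
COMPUTE STATS bar;

-- Optional: inspect what was collected.
SHOW TABLE STATS foo;
SHOW COLUMN STATS foo;

-- Re-check the plan. EXPLAIN only plans the query; it does not run it.
EXPLAIN
SELECT * FROM
  foo f JOIN bar b ON (f.fooBarId = b.barId)
WHERE
  f.fooVal < 500 AND
  f.fooVal + b.barVal < 1000;
```

With statistics in place, the warning should no longer appear, and the estimated sizes in the plan are based on collected data rather than defaults.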

gives access to query profiles. Query profiles look similar to execution plans, but they are created after the query is executed, so in addition to the estimated size, a query profile also contains additional runtime information such as the rates at which tables were scanned, the actual data sizes, the amount of memory used, the execution times, and so on. This information is very useful for improving query performance.
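One way to look at this runtime information, assuming you are working in impala-shell, is to use its SUMMARY and PROFILE commands right after running a query. The sketch below reuses the example query from earlier in this section:

```sql
-- Run the query first; a profile exists only after execution.
SELECT * FROM
  foo f JOIN bar b ON (f.fooBarId = b.barId)
WHERE
  f.fooVal < 500 AND
  f.fooVal + b.barVal < 1000;

-- impala-shell command: condensed per-operator timings and row counts.
SUMMARY;

-- impala-shell command: the full profile of the most recent query.
PROFILE;
```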
