
[Figure 4.4, plots omitted. Panel (a): average read and write I/O cost per block after executing the whole workload (Evolving Pattern), shown for the Power Distribution and the Uniform Distribution; x-axis: capacity (2 to 20), y-axis: average I/O cost (25 % to 200 %); marked regions: "Worse than FullScan" and "Infeasible Region". Panel (b): cost breakdown into the average read I/O and write I/O cost per block after executing the whole workload using the Uniform distribution, with sub-panels "Only Reading" and "Only Writing"; x-axis: capacity (2 to 20), y-axis: average I/O cost (0 % to 100 %). Plotted strategies: BC, HAIL, LEB-∞, LEB-2, LRU-1, LRU-2, Random, SoftIndex, FullScan, OPT.]

Figure 4.4: Simulated I/O cost for all presented AIR strategies with varying capacities.

Figure 4.4(a) shows the total cost for both the Uniform Distribution and the Power Distribution. The y-axis shows the average I/O cost per data block in percent. If every block is always read fully from disk, the average I/O cost is 100 %. If every block is read fully and written out again, this corresponds to the maximum of 200 %. The x-axis shows how many full copies of the dataset can be stored. Since the dataset has 20 attributes, we can create an index for every attribute if we can store 20 copies of the data. The interesting area starts where the available space is greater than or equal to three, as Hadoop already creates three replicas by default.
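The cost metric described above can be sketched in a few lines. This is a minimal illustration, not the simulator used for Figure 4.4: the function name and the representation of each block access as a fraction of a full block read or written are assumptions made for this example.

```python
def average_io_cost_percent(read_fractions, write_fractions):
    """Average I/O cost per block, in percent of one full block read.

    read_fractions[i]  : fraction of block i that was read  (1.0 = full read)
    write_fractions[i] : fraction of block i that was written (1.0 = full write)
    Both inputs are hypothetical stand-ins for what a simulator would record.
    """
    if len(read_fractions) != len(write_fractions):
        raise ValueError("need one read and one write fraction per block")
    if not read_fractions:
        raise ValueError("need at least one block")
    total = sum(r + w for r, w in zip(read_fractions, write_fractions))
    return 100.0 * total / len(read_fractions)


# Every block read fully, nothing written: 100 %.
full_scan = average_io_cost_percent([1.0, 1.0], [0.0, 0.0])
# Every block read fully and written out again: the 200 % maximum.
scan_and_index = average_io_cost_percent([1.0, 1.0], [1.0, 1.0])
```

Under this definition, any strategy whose average exceeds 100 % does more I/O than a plain full scan, which is the region labeled "Worse than FullScan" in the figure.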

The first thing to note is that both LRU algorithms perform rather poorly. LRU-2 performs much better than LRU-1, but it fails to beat the other AIR algorithms. The BC algorithm performs rather well if the available space is scarce, but it does not reach the same level as the other algorithms when we can store almost all indexes. This is caused by the way BC tries to prevent oscillating index creation. Whenever an existing index is used by an incoming query, BC lowers the accumulated benefit of all indexes that are not yet created. As a result, several indexes are never created, even though the available space would allow for more indexes.

We see that LEB-2 is slightly better than the LEB-∞ algorithm. But why does LEB-2 improve only slightly over LEB-∞ instead of clearly outperforming it, even though the access pattern changes drastically after every hundredth query? To explain this result, we look at the breakdown of the I/O cost depicted in Figure 4.4(b). Here we see that the average read cost per query is rather similar for all algorithms, with LEB-2 and LRU-2 being a little better for the Evolving Pattern. We also notice that the LRU algorithms are very competitive in terms of read performance. Since the LRU algorithms are very good at adapting to new query distributions, they manage to provide useful indexes most of the time. This adaptivity comes at a cost, as we see when looking at the average write cost per query. This makes BR algorithms unsuitable for the AIR problem. LRU-1 always has to create an index for every “index miss”, as this attribute has the smallest LRU-1-age. LRU-2 performs a little better with respect to write performance, as it does not always create a new index for every “index miss”.
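The write behavior of LRU-1 described above can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the simulator behind Figure 4.4: each query touches a single attribute, an index miss costs one full block read plus one full block write for the new index, and an index hit costs a hypothetical fraction `hit_read_cost` of a full read.

```python
from collections import OrderedDict


def simulate_lru1(queries, capacity, hit_read_cost=0.25):
    """Return (average read cost, average write cost) per query for LRU-1.

    queries       : sequence of queried attribute ids
    capacity      : number of indexes that can be kept at once
    hit_read_cost : assumed cost of an index scan relative to a full scan
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if not queries:
        raise ValueError("need at least one query")
    indexes = OrderedDict()  # ordered from least to most recently used
    reads = 0.0
    writes = 0.0
    for attr in queries:
        if attr in indexes:
            indexes.move_to_end(attr)  # index hit: refresh recency
            reads += hit_read_cost
        else:
            reads += 1.0   # index miss: full scan
            writes += 1.0  # LRU-1 creates the missing index right away
            if len(indexes) >= capacity:
                indexes.popitem(last=False)  # evict least recently used index
            indexes[attr] = None
    n = len(queries)
    return reads / n, writes / n


# A cyclic pattern over three attributes with room for two indexes:
# every query misses, so every query also pays a full index write.
thrashing = simulate_lru1([0, 1, 2, 0, 1, 2], capacity=2)
# With room for three indexes only the first three queries miss.
fitting = simulate_lru1([0, 1, 2, 0, 1, 2], capacity=3)
```

In the first call the strategy keeps replacing indexes it is about to need again, which is the kind of write overhead that makes pure recency-based replacement expensive in this setting.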

[Figure 4.5, plots omitted. Two panels for the Evolving Pattern: Power Distribution and Uniform Distribution; x-axis: capacity (2 to 20), y-axis: average I/O cost (0 % to 25 %). Plotted strategies: BC, HAIL, LEB-∞, LEB-2, LRU-1, LRU-2, Random, SoftIndex, FullScan.]

Figure 4.5: Additional I/O cost compared to OPT in absolute numbers (% of blocks read and written additionally)

4.6. Evaluation 119

[...] LEB-∞ algorithm. LEB-2 adapts faster to the evolving workload than LEB-∞, but since adapting means changing the set of available sort orders more often, it has to write more data. In contrast, LEB-∞ adapts only slowly to a new query distribution and therefore changes the set of available indexes only slowly. The additional write cost of adapting faster diminishes the gain from the better indexes that are available with the LEB-2 algorithm.

Figure 4.5 provides a zoom-in on the I/O overhead compared to the OPT algorithm. It depicts the absolute overhead compared to OPT, i.e., the average I/O an algorithm has to pay additionally for every query compared to the OPT strategy. We see that the LEB-2 algorithm is our favorite for the evolving workloads under both selectivity distributions.
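To make the notion of absolute overhead concrete, the following sketch computes it from two per-query cost series. The function name and the input format (one cost value in percent per query) are assumptions for illustration only.

```python
def absolute_overhead_vs_opt(strategy_costs, opt_costs):
    """Average additional I/O per query compared to OPT, in percentage points.

    strategy_costs : per-query I/O cost of the strategy under test, in percent
    opt_costs      : per-query I/O cost of OPT for the same queries, in percent
    """
    if len(strategy_costs) != len(opt_costs):
        raise ValueError("both series must cover the same queries")
    if not strategy_costs:
        raise ValueError("need at least one query")
    diffs = [s - o for s, o in zip(strategy_costs, opt_costs)]
    return sum(diffs) / len(diffs)


# A strategy that pays 120 % and 80 % where OPT pays 100 % and 60 %
# has an absolute overhead of 20 percentage points per query.
overhead = absolute_overhead_vs_opt([120.0, 80.0], [100.0, 60.0])
```

The result is a difference rather than a ratio, so a strategy with a small overhead value is close to OPT regardless of how cheap OPT itself is.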

[Figure 4.6, plots omitted. Two panels for the Evolving Pattern: Power Distribution and Uniform Distribution; x-axis: query (0 to 300), y-axis: average I/O cost (0 % to 200 %). Plotted strategies: BC, HAIL, LEB-∞, LEB-2, LRU-1, LRU-2, Random, SoftIndex, OPT, FullScan.]

Figure 4.6: Running-average I/O cost per query for the different patterns and distributions

So far we have only looked at the overall performance of the different algorithms. Now we want to look at how the performance develops while the workload is executed. Figure 4.6 depicts the running-average I/O cost for all implemented algorithms, query patterns, and selectivity distributions with a capacity constraint of eight indexes. We see that the conservative algorithms benefit in the beginning, as they do not incur the initial indexing costs. Please note that our HAIL implementation can limit the initial spike by setting a so-called offer rate, which allows only a certain percentage of the blocks to be indexed.
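The running average plotted in Figure 4.6 can be computed as in the following sketch. The per-query cost values are made up for illustration; they only mimic a strategy that pays a full read plus a full index write on its first query.

```python
from itertools import accumulate


def running_average(costs):
    """Running average of a per-query cost series.

    Element i of the result is the mean of costs[0..i].
    """
    return [total / count
            for count, total in enumerate(accumulate(costs), start=1)]


# Hypothetical series: 200 % on the first query (scan plus index creation),
# then cheaper queries once the index exists.
eager = running_average([200.0, 100.0, 0.0])
```

The first value carries the full initial indexing cost, and later values amortize it over more queries, which is why eager strategies start high in the plot and conservative ones start low.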

In the next figure we visualize the evolving indexes for four of the presented AIR strategies, namely OPT, BC, LRU-2, and LEB-2. Figure 4.7 depicts the accumulated benefit, or overhead respectively, for the different attributes. The black boxes depict the existing indexes and the small crosses depict the accessed attributes. The green color depicts the accumulated benefit over a full scan if we hit the index, while the red color depicts the accumulated overhead over an index scan if we miss the index. We see that the OPT and LRU-2 strategies often change the set of available indexes. In contrast, BC and LEB-2 keep their indexes for a relatively long period of time. We also notice that the green in the OPT strategy is not necessarily darker than the green in the other strategies, but there are almost no traces of red visible. Keep in mind that OPT is only shown for reference, as it is the only algorithm that knows the future. BC accumulates high benefits on a few attributes; however, it keeps most of its indexes over long periods of time, and only one of the indexes is frequently adapted to the workload. In contrast, LEB-2 adapts all of its indexes more frequently and achieves a better overall performance.

[Figure 4.7, plots omitted. Four stacked panels: OPT, BC, LRU-2, LEB-2; x-axis: query (0 to 1,000), y-axis: attribute (0 to 20).]

Figure 4.7: Visualization of AIR strategies over an evolving workload, uniform distribution, capacity of 8
