
Other emblems in the ceramics of the king and the nobility

In document De la literatura caballeresca al Quijote (pages 84-95)


In this section we propose a training methodology that reduces the time spent running benchmark queries during training, compared with extensive training on the state-of-the-art analytical benchmarks adopted in prior work (e.g., TPC-H used in [50, 6], and TPC-DS used in [29]), without sacrificing accuracy.

5.4.1 Query Template Pruning

Starting from a large set of training queries, we propose an iterative query pruning procedure, shown in Algorithm 1. In the first iteration, the training set $T$ is instantiated with a random sample of $n_0$ queries from the full set of benchmark queries $Q$. Then, in each subsequent iteration, the training set is progressively enlarged as long as the accuracy of the prediction models improves beyond a threshold value $th$. To quantify the improvement in model accuracy at each iteration we use k-fold cross validation, a widely used method for estimating the prediction error [35]. K-fold cross validation divides the training set into $k$ folds of approximately equal size, then uses $k-1$ folds to train the models and the remaining fold to test them. The process is repeated $k$ times, once for each possible train/test combination of folds. For all the queries of the testing fold $F_i$, the aggregated prediction error is computed as the sum of the squared errors of the predicted values w.r.t. the actual values (line 13). The cross validation estimate of the prediction error is then computed on line 15. Finally, the rate of accuracy improvement is computed as the ratio of the cross validation estimate of the previous iteration to the cross validation estimate of the current iteration (line 17).

For the sampling procedure we use progressive sampling with a geometric rate (i.e., $2^i \cdot n_0$, line 20), inspired by the work of Provost et al. [64], which shows that geometric sampling is remarkably efficient at finding the convergence plateau in a small number of steps (compared with linear sampling). Although the algorithm may execute up to twice as many queries as the minimum (because the model improvement has to be compared with the previous step), all subsequent re-training phases benefit from the smaller subset of queries that has to be re-executed.
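To make the difference between geometric and linear sampling concrete, the following minimal sketch lists the training-set sizes each schedule would try. The function names and the example values (10 initial queries, 1000 available queries) are illustrative assumptions and do not come from the experiments.

```python
def geometric_schedule(n0, total):
    """Training-set sizes n0, 2*n0, 4*n0, ... while they do not exceed total."""
    sizes = []
    n = n0
    while n <= total:
        sizes.append(n)
        n *= 2
    return sizes


def linear_schedule(n0, total):
    """Training-set sizes n0, 2*n0, 3*n0, ... shown only for comparison."""
    return list(range(n0, total + 1, n0))


if __name__ == "__main__":
    # Hypothetical example: 10 initial queries out of 1000 available ones.
    print(geometric_schedule(10, 1000))
    print(len(linear_schedule(10, 1000)))
```

With these example values the geometric schedule covers the query set in 7 steps, whereas the linear schedule needs 100.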

Algorithm 1 Iterative query pruning procedure. Notation: initial number of training queries $n_0$, query set $Q$, number of folds $k$, convergence threshold $th$, number of training queries $n$, training query set $T$, iteration $it$, rate of accuracy improvement $R$, cross validation error $CV$.

    1: Input: $n_0$, $Q$, $k$, $th$
    2: $n = n_0$, $T = \emptyset$, $R = 0$, $it = 1$
    3: while $n \le |Q|$ and ($it \le 2$ or $R > th$) do
    4:     if $it == 1$ then
    5:         $T$ = {random set of $n$ queries from $Q$}
    6:     else
    7:         $T = T \cup$ {random set of $n/2$ queries from $\{Q - T\}$}
    8:     end if
    9:     separate $T$ into $k$ folds $F_i$, $1 \le i \le k$
    10:    for $i = 1$; $i \le k$; $i{+}{+}$ do
    11:        train models with all $F_l$ s.t. $l \ne i$, $1 \le l \le k$
    12:        test models on queries $Q_j$ from fold $F_i$
    13:        $CV_i = \sum_{j=1}^{|F_i|} (\mathit{Predicted}_j - \mathit{Actual}_j)^2$
    14:    end for
    15:    $CV_{it} = \frac{1}{n} \sum_{i=1}^{k} CV_i$
    16:    if $it \ge 2$ then
    17:        $R = CV_{it-1} / CV_{it}$
    18:    end if
    19:    $it = it + 1$
    20:    $n = n \times 2$
    21: end while
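The procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: queries are assumed to be `(feature, actual_value)` pairs, `train` and `predict` are hypothetical callables supplied by the caller, `fit_slope` and `predict_slope` are toy stand-ins for the prediction models, and stopping when the cross validation error reaches zero is an added guard that Algorithm 1 does not specify.

```python
import random


def cross_validation_error(training_set, k, train, predict):
    """k-fold CV estimate: squared errors summed over all folds, divided by |T|."""
    folds = [training_set[i::k] for i in range(k)]
    total = 0.0
    for i, test_fold in enumerate(folds):
        # Train on every fold except the i-th one, then test on the i-th fold.
        train_part = [q for l, fold in enumerate(folds) if l != i for q in fold]
        model = train(train_part)
        total += sum((predict(model, x) - y) ** 2 for x, y in test_fold)
    return total / len(training_set)


def prune_queries(queries, n0, k, th, train, predict, seed=0):
    """Grow a random training set geometrically until the CV error stops improving."""
    rng = random.Random(seed)
    remaining = list(queries)
    rng.shuffle(remaining)
    training_set, n, it, rate, prev_cv = [], n0, 1, 0.0, None
    while n <= len(queries) and (it <= 2 or rate > th):
        # First iteration samples n queries; later ones add n/2 unseen queries.
        take = n if it == 1 else n // 2
        training_set += remaining[:take]
        remaining = remaining[take:]
        cv = cross_validation_error(training_set, k, train, predict)
        if it >= 2:
            # Assumption: a zero error cannot improve further, so treat it as converged.
            rate = prev_cv / cv if cv > 0 else 0.0
        prev_cv = cv
        it += 1
        n *= 2
    return training_set


def fit_slope(samples):
    """Hypothetical stand-in model: least-squares slope through the origin."""
    sxx = sum(x * x for x, _ in samples)
    return sum(x * y for x, y in samples) / sxx if sxx else 0.0


def predict_slope(slope, x):
    """Prediction of the stand-in model for feature x."""
    return slope * x
```

A call such as `prune_queries(queries, 10, 5, 1.05, fit_slope, predict_slope)` returns the pruned training set. The threshold 1.05 is an arbitrary example value, meaning that the loop continues only while the cross validation error shrinks by more than roughly 5% per iteration.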

5.4.2 Synthetic Query Generation

Assuming that some minimal information about the testing workload is available (i.e., schema information and the query operators in the workload), we suggest a methodology for generating synthetic datasets and queries that can be used as input to the pruning algorithm. As we show with experiments, synthetically generated benchmark queries can be more concise (have less training data overlap) than state-of-the-art query template instances and further reduce the training time. In the following we summarize the ideas we use for generating synthetic benchmark queries and present the key differences compared with state-of-the-art training that uses TPC-H/-DS query templates. A synthetic workload instance that follows the methodology described below is presented in Section 5.6.1.

• Synthetic queries and data: We generate both synthetic datasets and synthetic queries to make the training methodology generally applicable to a larger set of testing workloads, instead of using an existing benchmark that is tied to a fixed schema.

• Reducing overlap: To remove the unnecessary profile data overlap that inherently occurs for long-running workflows, queries are generated such that they include only short pipelines of query operators. For instance, a TPC-H query may include multiple instances of the same operator implementation in a single query. While we also profile any given operator multiple times, we explicitly map different executions of the operator to different input data characteristics (e.g., input sizes, row sizes, data distributions).

• Fine granularity, systematic training: Queries are generated such that all processing phases that occur in the workload execution are covered for a range of row-level data properties established during the workload characterization phase (e.g., the join operator processes input tuples with sizes in the range of 100 to 200 bytes). Building training datasets for each operator in isolation reduces the number of queries required for training because the number of operators is limited, unlike the number of potential queries, which is unbounded.

• Task heterogeneity / Data skew: In the MapReduce execution model multiple tasks are executed in parallel, so heterogeneity in the data processing requirements of the tasks makes it possible to capture different “execution patterns” in a single query. Hence, we propose to execute synthetic queries on skewed datasets such that different tasks process a different number of rows.

• Reducing materialization: Training queries are generated such that the number of MapReduce jobs in a query is minimized. In this way we aim to reduce the cost of materializing intermediate results.
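The ideas listed above can be illustrated with a short sketch that enumerates single-operator training queries over a grid of input characteristics and splits rows unevenly across parallel tasks. Everything here is an assumption made for illustration: the function names, the dictionary layout of a query specification, the operator names in the usage note, and the Zipf-like weighting used to model skew are not taken from the workload described in Section 5.6.1.

```python
from itertools import product


def synthetic_query_specs(operators, row_sizes, input_rows, skews):
    """One short, single-operator query spec per combination of input characteristics."""
    return [
        {"operator": op, "row_bytes": rb, "input_rows": n, "skew": s}
        for op, rb, n, s in product(operators, row_sizes, input_rows, skews)
    ]


def skewed_task_rows(total_rows, tasks, skew):
    """Split total_rows over tasks with weights 1/rank**skew (skew=0 is uniform)."""
    weights = [1.0 / (rank ** skew) for rank in range(1, tasks + 1)]
    scale = total_rows / sum(weights)
    rows = [int(w * scale) for w in weights]
    # Give the rounding remainder to the first (largest) task so the sum is exact.
    rows[0] += total_rows - sum(rows)
    return rows
```

For example, `synthetic_query_specs(["join", "groupby"], [100, 200], [1000], [0.0, 1.0])` yields eight query specifications, and `skewed_task_rows(1000, 4, 1.0)` assigns a different number of rows to each of four tasks, so that a single query exercises several execution patterns.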
