
}

Note that we tag the result struct with an ArraySoaTag to keep track of the transformation. This class is defined as follows:

case class ArraySoaTag(base: StructTag, len: Exp[Int]) extends StructTag

We also override the methods that are used to access array elements and return the length of an array to do the right thing for transformed arrays:

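override def infix_apply[T](a: Rep[Array[T]], i: Rep[Int]) = a match {
  // transformed array: index into each field array and rebuild the struct
  case Def(Struct(ArraySoaTag(tag,len),elems)) =>
    struct[T](tag, elems.map(p => (p._1, infix_apply(p._2, i))))
  // ordinary array: defer to the default implementation
  case _ => super.infix_apply(a,i)
}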

override def infix_length[T](a: Rep[Array[T]]): Rep[Int] = a match {
  case Def(Struct(ArraySoaTag(tag, len), elems)) => len
  case _ => super.infix_length(a)
}

Examples of this struct-of-arrays transformation are shown in Section 13.5 and Chapter 14.
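To make the effect of the transformation concrete, the following unstaged sketch models a struct of arrays by hand in plain Scala. It is an illustration only: `Complex`, `ComplexSoa` and `toSoa` are hypothetical names and not part of the LMS API. The point is that element access on the transformed representation rebuilds a struct from the i-th entry of each field array, while the length is stored once.

```scala
// Hypothetical, unstaged model of an array-of-structs to struct-of-arrays
// conversion. None of these names belong to LMS.
object SoaSketch {
  final case class Complex(re: Double, im: Double)

  // One array per field, plus the common length (the role played by `len`
  // in ArraySoaTag).
  final case class ComplexSoa(re: Array[Double], im: Array[Double], len: Int) {
    // Element access rebuilds a struct from the i-th entry of each field array.
    def apply(i: Int): Complex = Complex(re(i), im(i))
    def length: Int = len
  }

  // Split an array of structs into one array per field.
  def toSoa(xs: Array[Complex]): ComplexSoa =
    ComplexSoa(xs.map(_.re), xs.map(_.im), xs.length)
}
```

A loop that only reads the `re` field of such a value touches a single contiguous array, which is the usual motivation for this layout.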

10.3 Loop Fusion and Deforestation

The use of independent and freely composable traversal operations such as v.map(..).sum is preferable to explicitly coded loops. However, naive implementations of these operations would be expensive and entail lots of intermediate data structures. We provide a novel loop fusion algorithm for data parallel loops and traversals (see Chapter 14 for examples of use). The core loop abstraction is

loop(s) x=G { i => E[ x ← f(i) ] }

where s is the size of the loop and i the loop variable ranging over [0,s). A loop can compute multiple results x, each of which is associated with a generator G, one of Collect, which creates a flat array-like data structure, Reduce(⊕), which reduces values with the associative operation ⊕, or Bucket(G), which creates a nested data structure, grouping generated values by key and applying G to those with matching key. Loop bodies consist of yield statements x ← f(i) that define values passed to generators (of this loop or an outer loop), embedded in some outer context E[.] that might consist of other loops or conditionals. For Bucket generators, yield takes (key,value) pairs.
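The generator model can be approximated by a small unstaged interpreter in plain Scala. This is a sketch under several assumptions, not the staged implementation: the names `loopCollect`, `loopReduce` and `loopBucketCollect` are hypothetical, each loop has a single generator, the yield statement is modeled as a callback passed to the loop body, and `loopReduce` takes an explicit neutral element `zero`, which the text does not mention.

```scala
// Unstaged sketch of single-generator loops; all names are hypothetical.
object LoopSketch {
  // Collect: gather every yielded value into a flat sequence.
  def loopCollect[A](s: Int)(body: (Int, A => Unit) => Unit): Vector[A] = {
    val out = Vector.newBuilder[A]
    var i = 0
    while (i < s) { body(i, a => { out += a; () }); i += 1 }
    out.result()
  }

  // Reduce(op): combine yielded values with an associative operation.
  // The neutral element `zero` is an assumption of this sketch.
  def loopReduce[A](s: Int, zero: A)(op: (A, A) => A)(body: (Int, A => Unit) => Unit): A = {
    var acc = zero
    var i = 0
    while (i < s) { body(i, a => acc = op(acc, a)); i += 1 }
    acc
  }

  // Bucket(Collect): group yielded (key, value) pairs by key.
  def loopBucketCollect[K, A](s: Int)(body: (Int, ((K, A)) => Unit) => Unit): Map[K, Vector[A]] = {
    val buckets = scala.collection.mutable.LinkedHashMap.empty[K, Vector[A]]
    var i = 0
    while (i < s) {
      body(i, { case (k, a) => buckets(k) = buckets.getOrElse(k, Vector.empty[A]) :+ a })
      i += 1
    }
    buckets.toMap
  }
}
```

With `val v = Vector(1, 2, 3, 4)`, a map corresponds to `loopCollect[Int](v.size) { (i, yld) => yld(f(v(i))) }`, and a filter simply yields conditionally, so a loop body may yield zero, one or many values per iteration.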

The fusion rules are summarized in Figure 10.1. This model is expressive enough to model many common collection operations:

x=v.map(f)       loop(v.size) x=Collect { i => x ← f(v(i)) }
x=v.sum          loop(v.size) x=Reduce(+) { i => x ← v(i) }
x=v.filter(p)    loop(v.size) x=Collect { i => if (p(v(i))) x ← v(i) }
x=v.flatMap(f)   loop(v.size) x=Collect { i => val w = f(v(i))
                              loop(w.size) { j => x ← w(j) }}

Figure 10.1: Fusion rules.

Generator kinds:  G ::= Collect | Reduce(⊕) | Bucket(G)
Yield statement:  xs ← x
Contexts:         E[.] ::= loops and conditionals

Horizontal case (for all types of generators):

    loop(s) x1=G1 { i1 => E1[ x1 ← f1(i1) ] }
    loop(s) x2=G2 { i2 => E2[ x2 ← f2(i2) ] }
    ------------------------------------------------
    loop(s) x1=G1, x2=G2 { i =>
      E1[ x1 ← f1(i) ]; E2[ x2 ← f2(i) ] }

Vertical case (consume collect):

    loop(s) x1=Collect { i1 => E1[ x1 ← f1(i1) ] }
    loop(x1.size) x2=G { i2 => E2[ x2 ← f2(x1(i2)) ] }
    ------------------------------------------------
    loop(s) x1=Collect, x2=G { i =>
      E1[ x1 ← f1(i); E2[ x2 ← f2(f1(i)) ]] }

Vertical case (consume bucket collect):

    loop(s) x1=Bucket(Collect) { i1 =>
      E1[ x1 ← (k1(i1), f1(i1)) ] }
    loop(x1.size) x2=Collect { i2 =>
      loop(x1(i2).size) y=G { j =>
        E2[ y ← f2(x1(i2)(j)) ] }; x2 ← y }
    ------------------------------------------------
    loop(s) x1=Bucket(Collect), x2=Bucket(G) { i =>
      E1[ x1 ← (k1(i), f1(i));
        E2[ x2 ← (k1(i), f2(f1(i))) ]] }


x=v.distinct     loop(v.size) x=Bucket(Reduce(rhs)) { i =>
                              x ← (v(i), v(i)) }

Other operations are accommodated by generalizing slightly. Instead of implementing a groupBy operation that returns a sequence of (Key, Seq[Value]) pairs, we can return the keys and values in separate data structures. The equivalent of (ks,vs)=v.groupBy(k).unzip is:

loop(v.size) ks=Bucket(Reduce(rhs)), vs=Bucket(Collect) { i =>
  ks ← (k(v(i)), k(v(i))); vs ← (k(v(i)), v(i)) }
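A single traversal that fills both bucket generators can be sketched in plain Scala as follows. The sketch is unstaged and rests on assumptions not stated in the text: `groupByUnzip` is a hypothetical name, keys are computed as k(v(i)), Reduce(rhs) is read as "keep the right-hand operand", and buckets are kept in order of first occurrence of their key.

```scala
import scala.collection.mutable

// Unstaged sketch of (ks, vs) = v.groupBy(k).unzip as one loop with two
// bucket generators. The name and the bucket ordering are assumptions.
object GroupBySketch {
  def groupByUnzip[A, K](v: Vector[A])(k: A => K): (Vector[K], Vector[Vector[A]]) = {
    val ks = mutable.LinkedHashMap.empty[K, K]          // Bucket(Reduce(rhs))
    val vs = mutable.LinkedHashMap.empty[K, Vector[A]]  // Bucket(Collect)
    var i = 0
    while (i < v.size) {
      val key = k(v(i))
      ks(key) = key                                        // rhs: the newer value wins
      vs(key) = vs.getOrElse(key, Vector.empty[A]) :+ v(i) // append to the key's bucket
      i += 1
    }
    // Both maps were filled in the same order, so positions line up.
    (ks.values.toVector, vs.values.toVector)
  }
}
```

For example, `GroupBySketch.groupByUnzip(Vector(1, 2, 3, 4, 5))(_ % 2)` yields the keys `Vector(1, 0)` together with the groups `Vector(Vector(1, 3, 5), Vector(2, 4))`.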

In Figure 10.1, multiple instances of f1(i) are subject to CSE and not evaluated twice. Substituting x1(i2) with f1(i) will remove a reference to x1. If x1 is not used anywhere else, it will also be subject to DCE. Within fused loop bodies, unifying index variable i and substituting references will trigger the uniform forward transformation pass. Thus, fusion not only removes intermediate data structures but also provides additional optimization opportunities inside fused loop bodies (including fusion of nested loops).
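The effect of the vertical rule followed by DCE can be written out by hand for v.map(f).sum. The sketch below is plain, unstaged Scala with hypothetical names `unfused` and `fused`; it shows the code shape before and after fusion rather than the fusion algorithm itself.

```scala
// Hand-written before/after shapes for v.map(f).sum; names are hypothetical.
object FusionSketch {
  // Unfused: the Collect loop materializes the intermediate array x1.
  def unfused(v: Array[Int], f: Int => Int): Int = {
    val x1 = new Array[Int](v.length)            // loop(v.size) x1=Collect
    var i1 = 0
    while (i1 < v.length) { x1(i1) = f(v(i1)); i1 += 1 }
    var x2 = 0                                   // loop(x1.size) x2=Reduce(+)
    var i2 = 0
    while (i2 < x1.length) { x2 += x1(i2); i2 += 1 }
    x2
  }

  // Fused: x1(i2) has been replaced by f(v(i)); x1 is no longer referenced
  // and has been removed as dead code.
  def fused(v: Array[Int], f: Int => Int): Int = {
    var x2 = 0
    var i = 0
    while (i < v.length) { x2 += f(v(i)); i += 1 }
    x2
  }
}
```

Both functions compute the same result for a pure f, but the fused version performs one traversal and allocates no intermediate array.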

Fixed size array construction Array(a,b,c) can be expressed as

loop(3) x=Collect { case 0 => x ← a
                    case 1 => x ← b
                    case 2 => x ← c }

and concatenation xs ++ ys as Array(xs,ys).flatMap(i=>i):

loop(2) x=Collect { case 0 => loop(xs.size) { i => x ← xs(i) }
                    case 1 => loop(ys.size) { i => x ← ys(i) }}

Fusing these patterns with a consumer will duplicate the consumer code into each match case. Implementations should have some kind of cutoff value to prevent code explosion. Code generation does not need to emit actual loops for fixed array constructions but can just produce the right sequencing of yield operations.

Part III

Staging and Embedded Compilers at Work

Chapter 11

Intro: Abstraction Without Regret

LMS is a dynamic multi-stage programming approach: We have the full Scala language at our disposal to compose fragments of object code. In fact, DSL programs are program generators that produce an object program IR when run. DSL or library authors and application programmers can exploit this multi-level nature to perform computations explicitly at staging time, so that the generated program does not pay a runtime cost. Multi-stage programming shares some similarities with partial evaluation [65], but instead of an automatic binding-time analysis, the programmer makes binding times explicit in the program. We have seen how LMS uses Rep types for this purpose:

val s: Int = ... // a static value: computed at staging time

val d: Rep[Int] = ... // a dynamic value: computed when generated program is run

Unlike with automatic partial evaluation, the programmer obtains a guarantee about which expressions will be evaluated at staging time.
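The guarantee can be illustrated with a deliberately simplified model in which a staged value is just a fragment of generated code. This is a sketch, not the LMS implementation: the string-based `Rep`, the helper `times` and the example function `power` are assumptions chosen for illustration.

```scala
// Toy model of staging: Rep[T] carries generated code as a string.
// This Rep is a hypothetical stand-in, not the LMS type.
object StagingSketch {
  final case class Rep[T](code: String)

  // A staged multiplication emits code instead of computing a number.
  def times(a: Rep[Int], b: Rep[Int]): Rep[Int] =
    Rep[Int](s"(${a.code} * ${b.code})")

  // n is static: the recursion runs at staging time and leaves no trace.
  // x is dynamic: only the multiplications on x remain in the output.
  def power(x: Rep[Int], n: Int): Rep[Int] =
    if (n == 0) Rep[Int]("1") else times(x, power(x, n - 1))
}
```

Calling `StagingSketch.power(StagingSketch.Rep[Int]("x"), 3).code` produces the string `(x * (x * (x * 1)))`: the conditional and the recursion over n were evaluated by the generator, simply because n has type Int rather than Rep[Int].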

While moving computations from run time to staging time is an interesting possibility, many computations actually depend on dynamic input and cannot be done before the input is available. Nonetheless, explicit staging can be used to combine dynamic computations more efficiently. Modern programming languages provide indispensable constructs for abstracting and combining program functionality. Without higher-order features such as first-class functions or object and module systems, software development at scale would not be possible. However, these abstraction mechanisms have a cost and make it much harder for the compiler to generate efficient code.

Using explicit staging, we can use abstraction in the generator stage to remove abstraction in the generated program. This holds both for control (e.g. functions, continuations) and data abstractions (e.g. objects, boxing). Some of the material in this chapter is taken from [113].

11.1 Common Compiler Optimizations

We have seen in Part II how many classic compiler optimizations can be applied to the IR generated from embedded programs in a straightforward way. Among those generic optimizations are common subexpression elimination, dead code elimination, constant folding and code motion. Due to the structure of the IR, these optimizations all operate in an essentially global way, at the level of domain operations. An important difference from regular general-purpose compilers is that IR nodes carry information about the effects they incur (see Section 9.4). This permits quite precise dependency tracking, which gives the code generator a lot of freedom to group and rearrange operations. Consequently, optimizations like common subexpression elimination and dead code elimination will easily remove complex DSL operations that contain internal control flow and may span many lines of source code.

Common subexpression elimination (CSE) / global value numbering (GVN) for pure nodes is handled by toAtom: whenever the Def in question has been encountered before, its existing symbol is returned instead of a new one (see Section 9.2.1). Since the operation is pure, we do not need to check via data flow analysis whether its result is available on the current path. Instead we just insert a dependency and let the later code motion pass (see Section 9.3.1) schedule the operation in a correct order. Thus, we achieve a similar effect as partial redundancy elimination (PRE [74]) but in a simpler way.
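The lookup performed by toAtom amounts to hash-consing of definitions. The following self-contained sketch shows the idea on a two-operation IR. It is a simplified model rather than the actual LMS code: the node types, the `Graph` class and the use of a hash map keyed by structural equality are assumptions, and effects are ignored entirely.

```scala
import scala.collection.mutable

// Simplified model of CSE by hash-consing; not the LMS implementation.
object CseSketch {
  sealed trait Exp
  final case class Const(value: Int) extends Exp
  final case class Sym(id: Int) extends Exp

  sealed trait Def
  final case class Plus(a: Exp, b: Exp) extends Def
  final case class Times(a: Exp, b: Exp) extends Def

  final class Graph {
    // Maps each pure definition seen so far to the symbol bound to it.
    private val table = mutable.LinkedHashMap.empty[Def, Sym]

    // Return the existing symbol for a structurally equal Def,
    // or bind a fresh symbol if the Def has not been seen before.
    def toAtom(d: Def): Sym = table.getOrElseUpdate(d, Sym(table.size))

    def size: Int = table.size
  }
}
```

Creating `Plus(Const(1), Const(2))` twice through the same `Graph` returns the same symbol both times, so later nodes that use the two results refer to a single definition and no availability analysis is needed.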

Based on frequency information for block expressions, code motion will hoist computation out of loops and push computation into conditional branches. Dead code elimination is trivially included. Both optimizations are coarse grained and work on the level of domain operations. For example, whole data parallel loops will happily be hoisted out of other loops.

Consider the following user-written code:

v1 map { x =>
  val s = sum(v2.length) { i => v2(i) }
  x/s
}

This snippet scales elements in a vector v1 relative to the sum of v2’s elements. Without any extra work, the generic code motion transform places the calculation of s (which is itself a loop) outside the loop over v1 because it does not depend on the loop variable x.

val s = sum(v2.length) { i => v2(i) }
v1 map { x =>
  x/s
}
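The dependency criterion behind this hoisting decision can be sketched in a few lines of plain Scala. The sketch is a strong simplification under stated assumptions: `Stm` and `hoist` are hypothetical names, statements are assumed to be listed in dependency order, and frequency information and effects are ignored.

```scala
// Simplified hoisting decision based on dependencies only.
// Names are hypothetical; frequencies and effects are not modeled.
object CodeMotionSketch {
  // A statement binds `sym` and depends on the symbols in `deps`.
  final case class Stm(sym: String, deps: Set[String])

  // Split a loop body into statements that can move outside the loop
  // (no transitive dependency on the loop variable) and statements that
  // must stay inside. Assumes `body` is in dependency order.
  def hoist(loopVar: String, body: List[Stm]): (List[Stm], List[Stm]) = {
    val bound = body.foldLeft(Set(loopVar)) { (b, stm) =>
      if (stm.deps.exists(b)) b + stm.sym else b
    }
    body.partition(stm => !bound(stm.sym))
  }
}
```

For the example above, the body consists of s (depending only on v2) and the result x/s (depending on x and s). With loop variable x, `hoist` places s in the first list and leaves the division in the second.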

11.2 Delite: An End-to-End System for Embedded Parallel DSLs
