Then, assuming R is the final program result, the set of remaining symbols in the graph G is the least fixpoint of:
G = R ∪ syms(G map findDefinition)
Dead code elimination will discard all other nodes.
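The fixpoint above can be computed with a simple worklist. The following is a minimal sketch under simplifying assumptions, not the actual implementation: symbols are plain integers, and `Def`, `liveSyms` and the `findDefinition` parameter are hypothetical stand-ins for the corresponding IR facilities.

```scala
object DceSketch {
  // Hypothetical miniature IR: a symbol is an Int id,
  // and a definition only records the symbols it reads.
  type Sym = Int
  final case class Def(deps: List[Sym])

  // Least fixpoint of G = R ∪ syms(G map findDefinition), computed with a worklist.
  // Symbols without a definition (findDefinition returns None) have no dependencies.
  def liveSyms(result: Set[Sym], findDefinition: Sym => Option[Def]): Set[Sym] = {
    @annotation.tailrec
    def loop(live: Set[Sym], work: List[Sym]): Set[Sym] = work match {
      case Nil => live
      case s :: rest =>
        // Only dependencies not yet known to be live need to be explored.
        val fresh = findDefinition(s).toList.flatMap(_.deps).filterNot(live).distinct
        loop(live ++ fresh, fresh ::: rest)
    }
    loop(result, result.toList)
  }
}
```

Every symbol outside the returned set is unreachable from the result and would be discarded.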
9.3 From Graphs Back to Trees
To turn program graphs back into trees for code generation we have to decide which graph nodes should go where in the resulting program. This is the task of code motion.
9.3.1 Code Motion
Other optimizations can apply transformations optimistically and need not worry about maintaining a correct schedule: code motion will fix it up. The algorithm will try to push statements inside conditional branches and hoist statements out of loops. Code motion depends on dependency and frequency information but not directly on data-flow information. Thus it can treat functions or other user-defined compound statements in the same way as loops. This makes our algorithm different from code motion algorithms based on data-flow analysis such as Lazy Code Motion (LCM, [77]) or Partial Redundancy Elimination (PRE, [74]). The graph IR reflects “must before” (ordering) and “must inside” (containment) relations, as well as anti-dependence and frequency. These relations are implemented by the following methods, which can be overridden for new definition classes:
def syms(e: Any): List[Sym[Any]]                // value dependence (must before)
def softSyms(e: Any): List[Sym[Any]]            // anti dependence (must not after)
def boundSyms(e: Any): List[Sym[Any]]           // nesting dependence (must not outside)
def symsFreq(e: Any): List[(Sym[Any], Double)]  // frequency information (classify sym as 'hot', 'normal', 'cold')
To give an example, boundSyms applied to a loop node RangeForeach(range,idx,body) with index variable idx would return List(idx) to denote that idx is fixed “inside” the loop expression.
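As an illustration of how such overrides might look for the loop node, here is a self-contained sketch. It is simplified on purpose: the methods take a hypothetical `Def` type rather than `Any`, the loop body is represented by a single symbol, `Plus` is an invented second node type, and the frequency weights are arbitrary placeholder values.

```scala
object DependencySketch {
  final case class Sym(id: Int)

  sealed trait Def
  final case class Plus(a: Sym, b: Sym) extends Def
  // Loop node; `body` is assumed to be the symbol holding the body's result.
  final case class RangeForeach(range: Sym, idx: Sym, body: Sym) extends Def

  // Arbitrary weights for this sketch only.
  val freqNormal = 1.0
  val freqHot = 100.0

  // Value dependence: what must be computed before the node.
  def syms(e: Def): List[Sym] = e match {
    case Plus(a, b)                   => List(a, b)
    case RangeForeach(range, _, body) => List(range, body)
  }

  // Nesting dependence: symbols that only exist inside the node.
  def boundSyms(e: Def): List[Sym] = e match {
    case RangeForeach(_, idx, _) => List(idx)
    case _                       => Nil
  }

  // Frequency information: the loop body is referenced through a hot link.
  def symsFreq(e: Def): List[(Sym, Double)] = e match {
    case RangeForeach(range, _, body) => List(range -> freqNormal, body -> freqHot)
    case other                        => syms(other).map(s => s -> freqNormal)
  }
}
```

For `RangeForeach(Sym(1), Sym(2), Sym(3))`, `boundSyms` yields `List(Sym(2))`, while `syms` reports the range and the body but not the index.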
Given a subgraph and a list of result nodes, the goal is to identify the graph nodes that should form the “current” level, as opposed to those that should remain in some “inner” scope, to be scheduled later. We will reason about the paths on which statements can be reached from the result. The first idea is to retain all nodes on the current level that are reachable on a path that does not cross any conditionals, i.e. that has no “cold” refs. Nodes only used from conditionals will be pushed down. However, this approach does not yet reflect the precedence of loops. If a loop is top-level, then conditionals inside the loop (even if deeply nested) should not prevent hoisting of statements. So we refine the characterization to retain all nodes that are reachable on a path that does not cross top-level conditionals.
This leads to a simple iterative algorithm (see Figure 9.3): Starting with the known top level statements, nodes reachable via normal links are added and for each hot ref, we follow nodes that are forced inside until we reach one that can become top-level again.
[Figure 9.2 diagram. Legend: R = current block result; B = bound in e1; e0 = global scope; e1 = reachable from R (“current scope”); g1 = dependent on B (“forced inside”); f1 = (1 step from g1) − g1 (“fringe”).]
Figure 9.2: Graph IR with regular and nesting edges (boundSyms, dotted line) as used for code motion.
Code Motion Algorithm: Compute the set L of top level statements for the current block, from a set of available statements E, a set of forced-inside statements G ⊆ E and a block result R.
1. Start with L containing the known top level statements, initially the (available) block result R ∩ E.
2. Add to L all nodes reachable from L via normal links (neither hot nor cold) through E − G (not forced inside).
3. For each hot ref from L to a statement in E − L, follow any links through G, i.e. the nodes that are forced inside, if there are any. The first non-forced-inside nodes (the “hot fringe”) become top level as well (add to L).
4. Continue with 2 until a fixpoint is reached.
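The four steps can be sketched directly as an iteration to a fixpoint. This is one possible reading of the steps above under simplifying assumptions, not the actual implementation: nodes are integers, each outgoing reference carries a hypothetical `Freq` tag, and `topLevel`, `hotFringe` and `deps` are names invented for the sketch.

```scala
object CodeMotionSketch {
  type Node = Int
  sealed trait Freq
  case object Hot extends Freq
  case object Normal extends Freq
  case object Cold extends Freq

  // deps(n): outgoing references of n, each tagged with a frequency class.
  // e = available statements, g = forced-inside statements, r = block result.
  def topLevel(deps: Node => List[(Node, Freq)],
               e: Set[Node], g: Set[Node], r: Set[Node]): Set[Node] = {

    // Step 3 helper: from the target of a hot ref, follow links of any kind
    // through forced-inside nodes and collect the first nodes outside g.
    def hotFringe(start: Node): Set[Node] = {
      @annotation.tailrec
      def go(work: List[Node], seen: Set[Node], fringe: Set[Node]): Set[Node] =
        work match {
          case Nil                                => fringe
          case n :: rest if seen(n) || !e(n)      => go(rest, seen, fringe)
          case n :: rest if !g(n)                 => go(rest, seen + n, fringe + n)
          case n :: rest                          => go(deps(n).map(_._1) ::: rest, seen + n, fringe)
        }
      go(List(start), Set.empty, Set.empty)
    }

    @annotation.tailrec
    def iterate(l: Set[Node]): Set[Node] = {
      // Step 2, one link at a time; repeating until the fixpoint gives reachability.
      val viaNormal = l.flatMap(n => deps(n).collect { case (m, Normal) if e(m) && !g(m) => m })
      // Step 3: hot refs from L to statements in E − L.
      val viaHot = l.flatMap(n => deps(n).collect { case (m, Hot) if e(m) && !l(m) => m })
                    .flatMap(hotFringe)
      val next = l ++ viaNormal ++ viaHot
      if (next == l) l else iterate(next) // Step 4
    }

    iterate(r intersect e) // Step 1
  }
}
```

Cold refs are never followed from L, which is what leaves statements used only inside conditionals to an inner scope.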
To implement this algorithm, we need to determine the set G of nodes that are forced inside and may not be part of the top level. We start with the block result R and a graph E that has all unnecessary nodes removed (DCE already performed):
E = R ∪ syms(E map findDefinition)
We then need a way to find all uses of a given symbol s, up to but not including the node where the symbol is bound:
U(s) = {s} ∪ { g ∈ E | syms(findDefinition(g)) ∩ U(s) ≠ ∅ && s ∉ boundSyms(findDefinition(g)) }
We collect all bound symbols and their dependencies. These cannot live on the current level, they are forced inside:
B = boundSyms (E map findDefinition)
G = union (B map U) // must inside
Computing U(s) for many symbols s individually is costly but implementations can exploit considerable sharing to optimize the computation of G.
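A direct, unoptimized transcription of these definitions might look as follows. It is a sketch only: it recomputes U(s) separately for every bound symbol and exploits none of the sharing mentioned above. Symbols are integers, and `Def`, `uses` and `forcedInside` are hypothetical names.

```scala
object ForcedInsideSketch {
  type Sym = Int
  // Hypothetical definition record: value dependencies and bound symbols.
  final case class Def(syms: List[Sym], boundSyms: List[Sym] = Nil)

  // U(s): s plus all nodes of e that transitively use s,
  // stopping at (and excluding) the node that binds s.
  def uses(s: Sym, e: Set[Sym], findDefinition: Sym => Def): Set[Sym] = {
    @annotation.tailrec
    def grow(u: Set[Sym]): Set[Sym] = {
      val next = u ++ e.filter { g =>
        val d = findDefinition(g)
        d.syms.exists(u) && !d.boundSyms.contains(s)
      }
      if (next == u) u else grow(next)
    }
    grow(Set(s))
  }

  // G = union (B map U), with B the bound symbols of all definitions in e.
  def forcedInside(e: Set[Sym], findDefinition: Sym => Def): Set[Sym] = {
    val bound = e.flatMap(g => findDefinition(g).boundSyms)
    bound.flatMap(s => uses(s, e, findDefinition))
  }
}
```

The loop node itself is excluded from U(idx) because it binds idx, so the loop can stay on the current level while its body is forced inside.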
The iteration in Figure 9.3 uses G to follow forced-inside nodes after a hot ref until a node is found that can be moved to the top level.
Let us consider a few examples to build some intuition about the code motion behavior. In the code below, the starred conditional is on the fringe (first statement that can be outside) and on a hot path (through the loop). Thus it will be hoisted. Statement foo will be moved inside:
// before                    // after
loop { i =>                  z = *if (x) foo
  if (i > 0)                 loop { i =>
    *if (x)                    if (i > 0)
      foo                        z
}                            }
The situation changes if the inner conditional is forced inside by a value dependency. Now statement foo is on the hot fringe and becomes top level.
// before                    // after
loop { i =>                  z = *foo
  if (x)                     loop { i =>
    if (i > 0)                 if (x)
      *foo                       if (i > 0)
}                                  z
                             }
For loops inside conditionals, the containing statements will be moved inside (relative to the current level).
// before                    // after
if (x)                       if (x)
  loop { i =>                  z = foo
    foo                        loop { i =>
  }                              z
                               }
Pathological Cases
The described algorithm works well and is reasonably efficient in practice. Being a heuristic, it cannot be optimal in all cases. Future versions could employ more elaborate cost models instead of the simple hot/cold distinction. One case worth mentioning is when a statement is used only in conditionals but in different conditionals:
// before                    // after
z = foo                      if (x)
if (x)                         foo
  z                          if (y)
if (y)                         foo
  z
In this situation foo will be duplicated. Often this duplication is beneficial because foo can be optimized together with other statements inside the branches. In general, of course, there is a danger of slowing down the program if both conditions are likely to be true at the same time. In that case it would be a good idea anyway to restructure the program to factor out the common criteria into a separate test.
Scheduling
Once we have determined which statements should occur on which level, we have to come up with an ordering for the statements. Before starting the code motion algorithm, we sort the input graph in topological order and we will use the same order for the final result. For the purpose of sorting, we include anti-dependencies in the topological sort although they are disregarded during dead code elimination. A bit of care must be taken though: If we introduce loops or recursive functions the graph can be cyclic, in which case no topological order exists. However, cycles are caused only by inner nodes pointing back to outer nodes and for sorting purposes we can remove these back-edges to obtain an acyclic graph.
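One way to obtain such an order is a depth-first traversal that emits dependencies first and skips any edge leading back to a node still being visited. The sketch below uses that standard technique as a stand-in. Treating "edge to a node on the DFS stack" as the back-edge criterion is an assumption of the sketch, as are the names `schedule` and `deps`, where `deps` is meant to combine value dependencies and anti-dependencies.

```scala
object ScheduleSketch {
  type Node = Int

  // Returns the nodes reachable from `roots`, dependencies before dependents.
  // Edges pointing to a node that is currently on the DFS stack are ignored,
  // which breaks cycles introduced by loops or recursive functions.
  // Note: plain recursion, so very deep graphs could exhaust the call stack.
  def schedule(roots: List[Node], deps: Node => List[Node]): List[Node] = {
    val done = scala.collection.mutable.LinkedHashSet.empty[Node]
    val onStack = scala.collection.mutable.Set.empty[Node]

    def visit(n: Node): Unit =
      if (!done(n) && !onStack(n)) {
        onStack += n
        deps(n).foreach(visit)
        onStack -= n
        done += n // emitted only after all of its dependencies
      }

    roots.foreach(visit)
    done.toList
  }
}
```

Because the order is computed once on the whole graph, each level produced by code motion can simply be emitted as a subsequence of it.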
9.3.2 Tree-Like Traversals and Transformers
To generate code or to perform transformation by iterated staging (see Section 8.2.5) we need to turn our graph back into a tree. The interface to code motion allows us to build a generic tree-like traversal over our graph structure:
trait Traversal {
  val IR: Expressions; import IR._
  // perform code motion, maintaining current scope
  def focusExactScope[A](r: Exp[Any])(body: List[Stm[Any]] => A): A
  // client interface
  def traverseBlock[T](b: Block[T]): Unit =
    focusExactScope(b.res) { levelScope => levelScope.foreach(traverseStm) }
  def traverseStm[T](s: Stm[T]): Unit = blocks(s).foreach(traverseBlock)