3.5 DIMENSIONAMIENTO DE LAS CIMENTACIONES
3.5.3 E.L.U de agotamiento de la cimentación
Transition system →Θ defines the semantics of instructions at the thread level. In order to
be able to evaluate expressions, the program environment and an expression register look
up function are given to the rules of →Θ. Typically, a thread either generates a control flow,
memory, or communication action when it executes an instruction. There is only one case in which no action is generated.
Configurations : hθ | instr, φ, %i ∈ Thread × Instr × ProgEnv × ExprRegLookup,
θ ∈ Thread
Transitions : hθ | instr, φ, %i −−−→
tactε
Θθ0
The (dop) rule handles all operations that operate on data words. This includes addition, multiplication, bit shifting, bitwise and, calculation of the reciprocal square root, and many more. The rule is based on the abstract function dop that computes the resulting data word in accordance with the semantics of the corresponding operation specified in [9, 8.7.1, 8.7.2, and 8.7.4]. The expression value returned by the dop function is stored in a memory action that is passed up the hierarchy together with other necessary information to process the action
at the device level. The expressions e1. . . en are evaluated using the expression evaluation
function E~−. The address of the target register is retrieved by invoking the function V~−. (dop) hθ . κ, tid, ro, lo | instr . dop r e1. . . en, φ, %i
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→
V~rφκ←dop(E~e1φ%κ,...,E~enφ%κ) tid ro lo regs(instr,φ) Θ θ
A bra instruction has no meaning at the thread level; all branching related decisions are made exclusively at the warp level. However, the warp’s branching algorithm needs to know the target address of the branch that might differ for each thread taking the branch. Together with the warp lane, the target program address is stored in a control flow action that is subsequently processed by the warp. The expression e is either a label, in which case E~− returns a valid program address because we only consider valid PTX programs, or a register. If the value stored in the register is not a valid program address, program behavior is undefined.
(bra) hθ . κ, wl | bra e, φ, %i −−−−−−−−−−−→
brawl E~eφ%κ Θθ
A function call is similar to a branch instruction in that the actual jump to the first instruction of the function is performed at the warp level. The expression e is either a valid function name, in which case E~− returns a valid program address, or a register storing the first address of a function with a matching signature. Otherwise, program behavior is unpredictable. Additionally, the local variable environment of the called function is pushed onto the thread’s
call stack. Return addresses are also kept track of at the warp level instead of individually for each thread.
(call) hθ . (κ . ~ν), wl | call e, φ, %i −−−−−−−−−−−−→
callwl E~eφ%κ
Θ hθ / φ
localEnv(E~eφ%κ) :: ~νi
The following control flow operations also have no meaning at the thread level and are only used to inform the warp which threads actually executed the instructions. Only labels are valid for ssy, preBrk, and preRet instructions because all threads of the same warp must agree on the address specified by these instructions. The semantics of the control flow instructions are explained in greater detail in section 5.4.1.
(ssy) hθ . κ | ssy l, φ, %i −−−−−−−−−→ ssy E~lφ%κ Θ θ (preBrk) hθ . κ | preBrk l, φ, %i −−−−−−−−−−−→ preBrk E~lφ%κ Θ θ (preRet) hθ . κ | preRet l, φ, %i −−−−−−−−−−−→ preRet E~lφ%κ Θ θ (sync) hθ | sync, φ, %i −−−→ sync Θ θ (break) hθ . wl | break, φ, %i −−−−−−→ breakwl Θ θ (exit) hθ . wl | exit, φ, %i −−−−−→ exitwl Θ θ
Returning from a function and jumping back to the call site is handled at the warp level as mentioned above. However, at the thread level the topmost local variable environment must be popped off the call stack.
(return) hθ . (κ . ν :: ~ν), wl | ret, φ, %i −−−−−−−→
returnwl
Θ hθ / ~νi
The vote instruction sends a vote action to the thread’s warp. The result of the vote is computed by the warp and stored in destination register r. The rule evaluates the voting expression and stores it as a Boolean value inside the action. The memory actions to store the result of the vote in the destination registers specified by the threads are generated at the warp level, hence the warp needs to know the indices and register offsets of the threads participating in the vote.
(vote) hθ . κ, tid, ro, lo, wl | instr . vote vm r e, φ, %i −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→
vote V~rφκ B~eφ%κ wl tid ro lo regs(instr,φ) Θ θ
A memory barrier or memory fence causes the thread to wait for all prior memory opera- tions to be performed at a certain level. The thread is allowed to resume execution once all prior writes are visible to all threads of the specified level and all prior reads can no longer
be affected by writes of another thread at the specified level. For instance, when a thread
continues execution after a memory fence of the thread block level, we know that all other threads of the same block will subsequently read the values previously written by the thread.
Furthermore, all pending read operations by the thread can no longer be affected by a write
of some other thread of the block. It is unspecified when exactly a memory operation is no longer affect by another one, hence memory barriers are not included in our formalization of
the memory environment. To support memory barriers in the program semantics, we store
the level of the memory barrier a thread waits for. As long as the stored value is notε, the
thread cannot be scheduled because a prior memory fence has not yet completed. At the device level, we define an abstract function that checks whether a thread’s memory fence is
completed. If so, the function resets the thread’s memory barrier level toε which allows the
warp scheduler to schedule the thread’s warp again. The (membar) rule is the only rule that does not generate a thread action.
(membar) hθ | membar mbl, φ, %i →Θhθ / mbli
A thread block barrier is a synchronization point for all threads of the same thread block. Once a thread reaches a barrier, it is not allowed to continue program execution until a specific amount of threads of the same block has reached the barrier instruction as well. Barriers are identified by an index and the amount of threads participating in the barrier can optionally be specified. A thread that processes a bar.sync or bar.red instruction automatically establishes a memory barrier of the .cta level. The thread is not allowed to resume execution as long as either of the two barriers is not yet completed. Both the memory barrier and the synchronization barrier are dealt with at higher levels of the hierarchy and have no meaning at the thread level. If the evaluation of expression e does not yield a valid barrier index, the program is stuck.
(barsync) hθ . κ | bar.sync e e0ε, φ, %i −−−−−−−−−−−−−−−−−→ bar E~eφ%κ Eε~e0εφ%κ
Θhθ / .ctai
The bar.arrive instruction only marks the arrival of the thread at the barrier but does not cause any waiting due to thread block synchronization or an implicit memory barrier. At
the thread level, the only difference to the bar.sync instruction is that the number of threads
participating in the barrier is not optional and therefore must be specified. (bararrive) hθ . κ | bar.arrive e e0, φ, %i −−−−−−−−−−−−−−−−→
bar E~eφ%κ E~e0
φ%κ Θ θ
bar.redis similar to bar.sync in that the thread has to wait for the barrier to complete
before it is allowed to execute further instructions. An implicit thread block level memory fence is initiated as well. Additionally, the thread block performs a reduction operation once the barrier completes. The result of the reduction is stored in register r; the thread block generates corresponding memory actions once it has computed the result of the reduction.
The expression e2 acts as the input of the reduction operation. The rule evaluates e2 and
stores it as a Boolean value inside the communication action. (barred) hθ . κ, tid, ro, lo | instr . bar.red br r e e1,ε e2, φ, %i
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→
bar E~eφ%κ Eε~e1,εφ%κ V~rφκ B~e2φ%κ tid ro lo regs(instr,φ)
Θ hθ / .ctai The rules for atomic memory operations and non-atomic load and store operations generate a memory action containing the data specified by the instruction. At the device level, the action is added to the memory environment where a corresponding memory program is created to service the request.
(atom) hθ . κ, tid, ro, lo | instr . atom ss aop size r e1e2 e3,ε, φ, %i
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→
V~rφ%κ←E~e1φ%κ ss aop size E~e2φ%κ Eε~e3,εφ%κ tid ro lo regs(instr,φ) Θ θ
(st) hθ . κ, tid, ro, lo | instr . st volatileεss cop
εsize e r, φ, %i
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→
E~eφ%κ ss size E~rφ%κ volatileεcopεtid ro lo regs(instr,φ) Θ θ
(ld) hθ . κ, tid, ro, lo | instr . ld volatileεss copεsize r e, φ, %i
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→
V~rφκ ←E~eφ%κ ss size volatileεcopεtid ro lo regs(instr,φ) Θ θ
The mov instruction evaluates an expression and stores the resulting value in the specified destination register. The instruction can be used to load constant values into a register, to copy the value of a register to another one, to load the program address denoted by a label or function into a register, or to load the address pointed to by a non-register variable into a register.
(mov) hθ . κ, tid, ro, lo | instr . mov r e, φ, %i −−−−−−−−−−−−−−−−−−−−−−−−−−−−→
V~rφκ ←E~eφ%κ tid ro lo regs(instr,φ) Θθ
setpuses a comparison operator to compare two operands and stores the resulting value
in the destination register. We use the abstract function comp to compute the resulting value in accordance with the default semantics of the comparison operators supported by PTX. (setp) hθ . κ, tid, ro, lo | instr . setp comp r e1e2, φ, %i
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→
V~rφκ ←comp(E~e1φ%κ,E~e2φ%κ) tid ro lo regs(instr,φ) Θ θ