RTM is a multi-level hierarchical task queuing system running on Rigel. Note that the implementation detailed here allows for a task queuing system to be built without the use of global cache coherence but rather through the use of memory operations that bypass possibly incoherent local caches. Coherence management is intertwined with the RTM implementation. Multiple policies for managing coherence are described and evaluated in Chapter 9.
There is a single circular global task queue comprising groups of pointers to task descriptors. The global task queue is located in the application’s address space that is resident in main memory or, more likely, cached at the L3, avoiding coherence concerns. The insertions and deletions at the global task queue are synchronized using head and tail pointers that are updated atomically using global atomic operations. Task descriptors inserted into the global task queue are not linked into the task queue until they are flushed to the L3 by the enqueing core. On enqueue operations, the enqueing core fills in the task descriptor locally, flushes it to the L3, and then inserts it into the global task queue. Reducing the exposed latency of enqueue is critical for achieving high speed fanout of work. Here the only exposed latency is the read-modify-write operation to link in the new descriptors. Moreover, we amortize the cost of enqueue and dequeue by inserting groups of tasks at once when possible, nominally eight in our implementation.
Local task queues are used as a local buffer for task descriptors. The local task queue is implemented as a NULL-terminated linked list. Access to the linked list is synchronized using a ticket lock [49]. A ticket lock is used to reduce contention on the L2 bus for situations where the local task queue is empty and many cores
are waiting on another core within the cluster to enqueue more tasks from the global task queue. Insertions into the local task queue, which occur when a core attempts a dequeue operation and finds the local task queue head pointer to be NULL, are synchronized by a spin lock that ensures only one core is attempting to fill the local task queue at any time; multiple global-to-local enqueuers complicate the design of barriers and could lead to increased load imbalance and contention at the L3.
The implementation of RTM relies heavily on global operations and atomic primitives provided by the architecture. Global operations are used to access global barrier state and poll head and tail pointers for the global task queue. Local atomic operations synchronize the insertion and removal of tasks at the local task queues. While global and atomic operations are seldom used directly by application code, high performance global operations are imperative for achieving low-latency enqueue, dequeue, and barrier operations. As such, the latency of global operations has a strong influence on the overall scalability of the design. As the amount of work per interval remains constant and the number of cores scales up, there is less work per core. Therefore, a higher fraction of execution is spent in the runtime executing barriers and performing task queue management relative to task execution.
Dequeue operations are split into a fast path and a slow path. For the common case where there are tasks in the local task queue and no barrier is pending, we use load-link and store-conditional operations to obtain the local task queue lock and unlink a task descriptor from the local task queue. We find this operation, including transferring task descriptor contents into registers, can be accomplished in fewer than 50 cycles, but may require many more on average if there is a high degree of contention.
path and attempts to grab the per-cluster global task queue lock. The purpose of this lock is to limit the number of cores per cluster attempting to dequeue from the global task queue to one. If the lock is held, it implies another core is accessing the global task queue and will either fill in tasks later or will notify the cores that a barrier has been reached. While the lock is held, cores waiting at the local task queue must check for more tasks by checking the value of the head pointer and the cores must check if a barrier has been reached. The cores thus alternate between checking the local barrier sense and checking the local task queue for more work. Both operations require no locks and can be performed locally at the L1 cache. When either the local task queue has tasks appended to it or a barrier is reached, the cluster-level coherence implementation invalidates the local copy of those values.
The performance of enqueue operations is critical to the scalability of RTM. If the fan-out for tasks is not fast enough, cores will starve waiting for task de- scriptors. If enqueue operations cannot happen fast enough, dequeued tasks will finish before new tasks are available, thus leaving cores unutilized. Similarly, if the latency of dequeue operations grows with the number of cores, scalability will be limited. We address the problem of fast enqueue by performing enqueue in parallel when possible. We can achieve this due to the SPMD execution model that ensure all threads execute the same code given certain constraints placed on synchronization and control flow. One core from each cluster is designated the enqueing core. That core determines for which section of the tasks enqueued by the user it should generate task descriptors and locally begins building up groups of task descriptors. To address the problem of fast initial dequeue, the enqueing core inserts tasks locally until a predefined number of task descriptors exists at the local task queue before it begins enqueing tasks at the global task queue. The design achieves very low overhead for enqueue operations across all our workloads.
A particularly difficult aspect of the RTM implementation is the barrier op- eration. The general structure of the barrier is as a reduction operation where cores enter into a cluster-level barrier and then one core from the cluster enters a tile-level barrier. One core from each tile-level barrier then participates in a global barrier. When the global barrier is reached, all cores are notified. We make use of the broadcast update operation supported by Rigel to efficiently perform the wake up. The broadcast operation on Rigel is a multicast message initiated by one core and sent from the L3 to each of the tiles where the message is replicated at each level of the interconnect down to the cores. If such a message did not ex- ist, cores would be required to poll using global memory operations which would result in a high degree of contention at the L3. The problem would be worse when most tasks are polling and few tasks remain thus exacerbating the effect of load imbalance.
The challenge in implementing RTM barriers correctly stems from RTM sup- porting arbitrary enqueue between two barriers. Some BSP-like models force all enqueue operations to occur before any tasks begin; no tasks may enqueue other tasks. Fork-join models that allow for task enqueues, or forks, to occur at arbi- trary points in the program have a parent-child relationship not present in RTM that allows for synchronization at post-dominators, i.e., points where all children join with their parents recursively. The implementation difficulty that this de- sign aspect creates is that barriers must be two-phased. When there is no work pending locally, a task will check at the next level for more work. If no work is found globally, the core assumes a barrier is reached and waits. Only when all other cores have entered the barrier can the barrier proceed, as with a conven- tional barrier. However, there is the possibility that one of the cores not in the barrier could insert more work. Adding more work while other cores are waiting at the barrier requires that cores already in the barrier have the ability to back
out and begin dequeuing work again. The two-phase barrier is implemented by having each core waiting on the barrier check the status of the task queue and the number of other cores in the barrier, and only allowing the barrier to proceed once all cores are in the barrier and no new work is in the task queue.