BOLETÍN OFICIAL DEL ESTADO
MÓDULO FORMATIVO 2
In this section, we describe a software based arbitration called Memguard [146]. We then propose an analysis to bound the memory time of each task. The analysis works by converting the problem into two modes of operation, and then applying Optimal-Assign algorithm on each mode.
Memguard is a technique designed to partition the memory bandwidth from software level. It uses per-core regulators to monitor and enforce the memory bandwidth allocation. In particular, each regulator monitors the number of memory requests performed by each core in a given period P . This can be achieved through HPC. These counters are configured to trigger an event when any core exceeds its allocated bandwidth (budget). Upon receiving such event, the corresponding core is suspended. The core recharges its budget and resumes its activity at the next period. The main idea is that within a period of time P , each core can issue a number of memory requests equivalent to the assigned bandwidth Qi. P is a
system parameter decided at design time and its value is set based on the overhead imposed to synchronize all cores.
Memguard was proposed to regulate the memory of individual cores. In this section, we extend the idea of Memguard to federated scheduling with cluster of cores. In a platform such as P4080 [58], a complex chain of conditions can be established to trigger one event. In our case with a cluster of cores executing one parallel task, we can configure the HPC to trigger an event when the aggregated number of memory requests from a cluster of cores exceeds the assigned bandwidth. Memguard leaves the fine grained arbitration at the level of one memory request to the hardware. We consider RR as the arbitration policy between clusters and each cluster uses one request buffer similar to nRR (Section5.4.2).
Like Memguard with individual cores, we assume that the clusters are synchronized with one period P andPn
i=1Qi ≤ P . In addition, both P and Qi represent time and their ratio Qi/P is unit-less. We can think of Qi/P as the fraction of time that a task is allowed to access memory. To ensure full memory budget at each task release, we assume the task’s period to be integer multiple of P .
Memguard for partitioned multicore processor has been analyzed in [142]. This work considers sequential tasks and assumes the budget for each core is given. The main focus was to compute the response time of real-time tasks on regulated cores. In contrast, our analysis considers parallel tasks with federated scheduling. In particular, we seek to determine Qi and compute the memory access time for each task under Memguard regulation.
We recall that the memory part of each task can be bounded as em
i × n when we rely on nRR hardware arbiter only. However, the memory time can be modified due to the effects of Memguard regulation. Since Pn
i=1Qi ≤ P , other clusters can consume at most P − Qi memory time within each regulation period. Thus, a task Ti can be delayed for at most P − Qi time in each regulation period. We distinguish two modes to compute the memory delay of any task. Unlike the analysis in [142] which computes a delay term and adding it to the memory time of the task without interference, our analysis proceeds by computing an inflation factor for memory. This is done for simplicity, but the two concepts are effectively equivalent: given a memory time em
i and a delay ∆, the inflated memory time is simply em
i + ∆. As an example, in the case of pure RR contention delay, [142] would compute a delay term ∆ = emi × (n − 1), while in our case we consider an inflated memory time em
Regulation mode
The task in this mode consumes its budget Qi and waits until the next period, i.e., it suffers a delay of P − Qi in each period. For this mode, the inflated memory time can be computed as follows. em i Qi × P + P (5.9)
The formula states that the number of regulation periods is bem
i /Qic. For each regulation period, the inflated memory time is P (Qimemory time plus P − Qidelay). Any remaining (emi mod Qi) memory must complete within one last period; hence, its inflated time can be bounded by P . Since we assume any arrival pattern of memory requests, we consider the worst-case pattern that leads to maximum delay. That is, all memory requests are clustered at the beginning of regulation periods without computation between them. If, however, there is computation between them, say for x time, the task will be stalled for P − Qi− x in each regulation period.
Contention mode
Unlike the previous mode, the task in this mode consumes less than its budget, and the contention of its memory causes the maximum delay of P − Qi. Due to RR arbitration, every memory access can be delayed by (n − 1) requests of other clusters. Hence, the task must access memory for Ωi time to suffer the maximum delay of P − Qi within each period. This can be stated as follows.
Ωi× (n − 1) = P − Qi Ωi =
P − Qi n − 1
We observe two possible cases under contention mode as illustrated in Figure 5.3 for three regulation periods. The intuition behind the following analysis is to think that each task has memory and computation, and within each regulation period, the task can spend Qi time by either computation or memory. In addition, the task can be delayed by P − Qi when it generates Ωi amount of memory within each regulation period as stated above. Since, as per our task model, the positions or even the order of computation and memory are unknown at compile time, we try to bound the memory time by constructing the worst case pattern between memory and computation that leads to maximum memory delay.
Q-Ω P 2P 0 Ω Q-Ω Ω (P-Q) 3P Q-Ω Ω time
(a) Enough computation to fill Q − Ω in each regulation period.
Q-Ω P 2P 0 Ω Q-Ω Ω (P-Q) Q 3P memory computation time
(b) No enough computation o fill Q − Ω in each period. Figure 5.3: Two sub-cases for contention mode.
In case (a), there is enough computation to fill Q − Ω gap in every regulation period as shown in Figure 5.3(a). This causes all memory to suffer contention. Hence, the inflated memory time is simply emi × n. However, in case (b), there is no enough computation to fill Q − Ω gap in each regulation period, for example, period 3 in Figure5.3(b). Thus, the task has to execute memory because the task budget is not consumed yet. In this case, the inflated memory time can be upper bounded by
e∗ i Qi
× P + P, (5.10)
where e∗i = emi + (exi−Li)/mi+ Li. We consider e∗i, the combined memory and computation time, since either of them can execute within Qitime. Similar to before, be∗i/Qic represents the number of regulation periods. For each period, the task executes for Qi (memory and computation) and waits for P − Qi. The remaining execution time is again upper bounded by P .
We note that the memory time under contention mode cannot exceed Cm
i × n because we assume per cluster buffer and RR arbitration policy at the hardware level. Thus, the memory time will be the minimum between case (a) and case (b). In contrast, the memory time under regulation mode may exceed Cm
We just described how to compute the memory time under each mode. We now specify a condition upon which we determine the mode that should be used to bound the memory time.
Lemma 22. When Qi < Ωi, the memory time under regulation mode is longer than the memory time under contention mode. On the other hand, when Qi ≥ Ωi, the memory time under case (a) or case (b) of contention mode is longer than or equal the memory time under regulation mode.
Proof. Clearly, when Qi < Ωi, the task cannot cause the maximum delay of P − Qi by contention. Since the task is delayed by P −Qi(the maximum possible delay) in each period under regulation mode, the memory time under this mode is longer than the memory time under contention mode. On the other hand, when Qi ≥ Ωi, the contention mode leads to longer or equal memory time even though, under both modes, the task is assumed to receive P − Qi delay in each period. We recall that under regulation mode, the task is assumed to consume its budget Qi in each period. In contrast, case (a) of contention mode assumes that the task consumes Ωi ≤ Qi budget in each period while case (b) assumes the task consumes Ωi in some periods and Qi in some other periods. In both cases, the number of regulation periods cannot be less than the number of periods under regulation mode in which the task is assumed to consume Qi memory in each period.
We can safely remove the floor function from (5.9) and (5.10), and the formulas still serve as an upper bound to memory time. In addition, we propose to let Qi/P = qi. Thus, (5.9) can be re-written as follows.
emi qi
+ P (5.11)
Similarly, the memory time in contention mode is bounded by (5.10) and cannot exceed emi × n. Thus, the memory time can be re-written as
min(emi × n,e ∗ i qi
+ P ). (5.12)
Since the formulas are approximated and Qi is replaced by P × qi, the conditions in Lemma22, which is based on Qi, can be re-stated in terms of qi as follows.
Lemma 23. memory time = ( (5.11) if qi < (n − P/emi ) −1, (5.12) otherwise.
Proof. We prove this lemma by showing that the memory time under each mode is no less than the memory time under the other mode for the corresponding value of qi. For regulation mode, qi < (n − P/emi )
−1. Thus, em i qi + P > emi × (n − P/em i ) + P > emi × n − P + P > emi × n.
It is enough to only check with emi ×n because the memory time in contention mode cannot exceed em
i × n. Similarly, the memory time under contention mode can be either e∗i qi + P ≥ e m i qi + P, since e∗i ≥ em i , or emi × n in which emi qi + P ≤ emi × (n − P/emi ) ≤ emi × n.
By adding the computation time to the inflated memory time, we obtain the following makespan bounds under each mode. For regulation mode, we have:
eregi = e m i qi + P +e x i − Li mi + Li, (5.13)
and for contention mode, we have:
ecnti = min(emi × n, e ∗ i qi + P ) +e x i − Li mi + Li. (5.14)
Similar to (5.4), we can derive qireg(mi) from (5.13) and qcnti (mi) from (5.14) by enforcing the schedulability test eregi ≤ Di and ecnti ≤ Di, respectively.
In Figure 5.4, we show a graphical representation of these two functions for a given task Ti. The bandwidth fraction qi ranges from 0 to 1 and the middle bold dotted line represents the condition as in Lemma23in which the lower area is for regulation mode and the upper area is for contention mode. Each point in these two curves represents a pair of values (qi, mi) in which the task is schedulable. The mni vertical line represents emi × n
(n-P/eim)-1 1 min qireg qicnt mi 0 1/n
Figure 5.4: Two modes of operation for Memguard.
and computed as in (5.8). In contention mode, the memory time cannot exceed this line as indicated by min function in (5.14).
We propose to apply Optimal-Assign to obtain qi bandwidth fractions by assuming each task operates in one particular mode. This approach results in 2n possible combina- tions as each task can be in two different modes. As a final step, we let Qi = P × qi, and then program Memguard’s regulators to enforce the bandwidth assignment.
An important point to note is that by having tasks operating in different modes, Mem- guard is able to assign unequal fractions of memory bandwidth in which some tasks have qi > (n − P/emi )
−1 and some others have q
i < (n − P/emi )
−1. In contrast, nRR always assigns qi = 1/n for all tasks.