In the next set of experiments, we evaluated many heuristics, representing different combinations of policies, on task sets representing a variety of system utilizations, task utilizations, and WSS distributions. Experiments were initially conducted on the eight-core architecture; the heuristic found to be most effective at improving system performance across a wide variety of task sets was then evaluated on the 32-core architecture. We considered heuristics that employed the following thresholds and policies.
• Promotion-duration policy: (1).
• Cache utilization threshold: 0%, 50%, or 75%.
• Cache-aware policy: All policies (1)-(5) considered.
• Lost-cause threshold: 110%.
• Lost-cause policy: All policies (1)-(3) considered.
• Phantom tasks: Used and not used.
• Avoid scheduling partially-eligible MTTs: Yes.
Note that, in these experiments, we sometimes chose to consider only one threshold or policy choice. This was done when we did not expect the choice to have a significant impact on the performance of the heuristics, especially for the purpose of making relative performance comparisons between various policies. This greatly reduced the number of policy combinations that needed to be evaluated; considering additional policy variations would have made the required number of experiments prohibitive. Later, in Section 6.1.4, we consider changes to the heuristics that performed well in these experiments, made for implementation efficiency so that scheduling overheads are low. The changes considered there were evaluated by conducting experiments with additional threshold or policy choices beyond those considered here, particularly in categories where only one choice was considered.
Task-set generation methodology. When generating random task sets, we varied the following parameters. In this chapter, the utilization of an MTT indicates the utilization of every task within that MTT—the execution cost and period of an MTT are defined similarly. Note that this means that, in our experiments, all tasks within an MTT have the same utilization, execution cost, and period.
• System utilization: 50% or 100% utilized.
• MTT periods: Between 10 and 100 ms (some values removed to avoid arithmetic overflow), except for the last-generated MTT, which may have a larger period.
• MTT utilizations: Uniform over [0.01, 0.1], [0.1, 0.4], [0.5, 0.9], or [0.01, 0.9].
• MTT execution costs: Derived from periods and utilizations, and at least 1 ms.
• MTT task counts: Uniform over [1, 8].
• MTT WSSs: Uniform over [64 bytes, 2 MB]; or equal to the task count multiplied by a size uniform over [64 bytes, 512 KB], and capped at 2 MB.
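The parameters listed above can be turned into a small generator. The following is only a minimal sketch under stated assumptions, not the generator actually used in these experiments: the names `MTT`, `generate_mtt`, and `generate_task_set` are hypothetical, system utilization is assumed to mean a fraction of the total core count, the 1 ms cost floor is assumed to raise the effective utilization when it applies, and the last MTT is assumed to be fitted by stretching its period (the text says only that its period "may" be larger). The removal of some period values to avoid overflow is not modeled.

```python
import random
from dataclasses import dataclass

L2_CAP_BYTES = 2 * 1024 * 1024  # WSS cap of 2 MB, from the parameter list


@dataclass
class MTT:
    period_ms: float    # shared by every task in the MTT
    utilization: float  # per-task utilization
    cost_ms: float      # per-task execution cost
    task_count: int
    wss_bytes: int


def generate_mtt(util_range, wss_mode, rng):
    """Draw one MTT; wss_mode is "Uni" (uniform) or "TC" (task-count correlated)."""
    period = rng.randint(10, 100)
    cost = max(1.0, rng.uniform(*util_range) * period)  # cost is at least 1 ms
    count = rng.randint(1, 8)
    if wss_mode == "Uni":
        wss = rng.randint(64, L2_CAP_BYTES)
    else:
        wss = min(count * rng.randint(64, 512 * 1024), L2_CAP_BYTES)
    # Assumption: utilization is recomputed after the 1 ms floor is applied.
    return MTT(float(period), cost / period, cost, count, wss)


def generate_task_set(cores, system_util, util_range, wss_mode, seed=0):
    """Add MTTs until total utilization reaches cores * system_util."""
    rng = random.Random(seed)
    target = cores * system_util
    total = 0.0
    mtts = []
    while target - total > 1e-9:
        m = generate_mtt(util_range, wss_mode, rng)
        remaining = target - total
        if m.task_count * m.utilization > remaining:
            # Assumption: the last MTT keeps its cost and gets a larger period
            # so that it exactly fills the remaining utilization.
            u = remaining / m.task_count
            m = MTT(m.cost_ms / u, u, m.cost_ms, m.task_count, m.wss_bytes)
        mtts.append(m)
        total += m.task_count * m.utilization
    return mtts
```

For example, `generate_task_set(8, 0.5, (0.1, 0.4), "TC")` would produce one task set for the eight-core platform at 50% system utilization with task-count-correlated WSSs.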
Each heuristic was used to schedule 20 task sets for each combination of these parameters. In total, this resulted in nearly 30,000 experimental runs using SESC. Due to the large amount of time and processing power that is required for each experimental run, running additional experiments is problematic (which is why, as stated earlier, considering additional policy combinations would have required a prohibitive number of experiments). Even with the assistance of a large research cluster, we were able to complete only a few thousand experimental runs per day in the best case (i.e., when there is little contention for the cluster, which is shared across campus). Like many architecture simulators, SESC is quite slow, especially when timing accuracy is required.
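As a rough sanity check on the scale of the experiments, the sketch below counts runs under the assumption that every listed policy choice was fully crossed with every task-set parameter combination. The text reports only the approximate total, so this breakdown is an assumption, and all variable names are illustrative.

```python
from itertools import product

# Policy choices that were varied (the categories with a single choice are omitted).
cache_thresholds = [0, 50, 75]       # percent
cache_policies = [1, 2, 3, 4, 5]
lost_cause_policies = [1, 2, 3]
phantom_tasks = [True, False]

# Task-set generation parameters that were varied.
system_utils = [50, 100]             # percent
mtt_util_ranges = [(0.01, 0.1), (0.1, 0.4), (0.5, 0.9), (0.01, 0.9)]
wss_distributions = ["Uni", "TC"]

TASK_SETS_PER_COMBINATION = 20

heuristics = list(product(cache_thresholds, cache_policies,
                          lost_cause_policies, phantom_tasks))
param_combos = list(product(system_utils, mtt_util_ranges, wss_distributions))

heuristic_runs = len(heuristics) * len(param_combos) * TASK_SETS_PER_COMBINATION
gedf_runs = len(param_combos) * TASK_SETS_PER_COMBINATION

print(len(heuristics), len(param_combos))  # 90 16
print(heuristic_runs, gedf_runs)           # 28800 320
```

Under this assumed cross product, the heuristic runs alone come to 28,800, which is in line with the "nearly 30,000" figure reported above; the true breakdown may differ.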
Task set justification. We believe that our task periods represent a reasonable range of those observed in real applications, and our task utilization ranges are similar to those used in other work [8, 16, 22]. System utilizations were chosen so that scheduling flexibility was either substantial (at 50%) or very limited (at 100%). For half of the experiments, MTT WSSs were correlated with task count. This seems realistic, since a larger number of tasks would be more capable of referencing and processing a larger memory region. WSSs were often large, but never exceeded the size of the L2 cache; otherwise, thrashing would be inevitable. Large WSSs are realistic in practice; for example, the authors of [25] claim that the WSS for an HDTV-quality MPEG decoding task could be as high as 4.1 MB, and statistics presented in [70] show that substantial memory usage is required for video-on-demand applications. Finally, note that while these experiments certainly should not be considered definitive, similar task sets have been used effectively in other published work [3, 4, 16, 22].

Task Set Parameters    Heuristic              L2 Miss Rate             IPC
%S   MTT U        WD   T   CP   LP   PT      GEDF     H    %Im   GEDF     H     %Im
50   [0.01, 0.1]  TC   0   (1)  (1)  used    3.62  1.60  55.88   0.97  1.23   26.46
50   [0.01, 0.1]  Uni  0   (1)  (3)  used    7.14  3.16  55.76   0.80  1.17   44.90
50   [0.1, 0.4]   TC   0   (1)  (1)  used    1.22  0.36  70.62   1.21  1.20   -1.07
50   [0.1, 0.4]   Uni  0   (3)  (1)  used    6.70  0.67  90.00   0.93  1.17   25.19
50   [0.5, 0.9]   TC   0   (1)  (1)  used    1.07  0.28  73.67   1.03  1.01   -2.28
50   [0.5, 0.9]   Uni  0   (3)  (1)  used   15.38  0.98  93.61   0.77  0.92   18.99
50   [0.01, 0.9]  TC   0   (3)  (1)  used    3.61  0.63  82.68   1.01  1.12   10.77
50   [0.01, 0.9]  Uni  0   (1)  (1)  used    7.92  0.78  90.12   0.97  0.95   -2.15
100  [0.01, 0.1]  TC   0   (3)  (1)  N/A     5.30  1.67  68.55   0.85  1.16   36.96
100  [0.01, 0.1]  Uni  0   (3)  (2)  N/A     7.22  2.57  64.38   0.76  1.11   45.26
100  [0.1, 0.4]   TC   0   (3)  (2)  N/A     3.75  1.35  64.00   0.96  1.18   22.44
100  [0.1, 0.4]   Uni  0   (3)  (3)  N/A     7.02  3.46  50.71   0.89  1.14   28.20
100  [0.5, 0.9]   TC   0   (1)  (3)  N/A     3.81  2.83  25.66   1.05  1.13    7.20
100  [0.5, 0.9]   Uni  50  (1)  (1)  N/A     5.03  3.58  28.93   0.99  1.06    6.28
100  [0.01, 0.9]  TC   0   (1)  (1)  N/A     2.49  0.88  64.56   1.09  1.23   13.29
100  [0.01, 0.9]  Uni  50  (1)  (1)  N/A     4.30  3.70  14.04   0.99  1.05    6.26

Table 6.2: The heuristics that performed best for random task sets. When specifying task set parameters, the columns labeled “%S”, “MTT U”, and “WD” correspond to percent system utilization, MTT utilization distribution, and WSS distribution, respectively. For WSS distributions, “Uni” means uniformly distributed and “TC” means correlated by task count. For the policies used by the heuristics, the columns “T”, “CP”, “LP”, and “PT” stand for cache utilization threshold, cache policy, lost-cause policy, and phantom tasks, respectively. Finally, when presenting L2 miss rates and instructions per cycle (IPC), the column labeled “H” presents performance numbers for the heuristic indicated, and the column labeled “%Im” presents the relative percentage improvement in miss rate or IPC over GEDF.
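The “%Im” columns of Table 6.2 can be approximately reproduced from the “GEDF” and “H” columns. The sketch below assumes the usual definition of relative improvement (a lower miss rate is better, a higher IPC is better); the function names are illustrative, and the formula is inferred from the caption rather than stated in the text. Recomputed values differ slightly from the tabulated ones, presumably because the table was computed from unrounded measurements.

```python
def miss_rate_improvement(gedf, heuristic):
    """Percent reduction in L2 miss rate relative to GEDF (positive is better)."""
    return 100.0 * (gedf - heuristic) / gedf


def ipc_improvement(gedf, heuristic):
    """Percent increase in IPC relative to GEDF (positive is better)."""
    return 100.0 * (heuristic - gedf) / gedf


# First row of Table 6.2 (50% system utilization, MTT U in [0.01, 0.1], TC).
print(round(miss_rate_improvement(3.62, 1.60), 2))  # table reports 55.88
print(round(ipc_improvement(0.97, 1.23), 2))        # table reports 26.46
```

A negative “%Im” entry therefore means that GEDF did better than the heuristic on that metric, as in the rows where the heuristic's IPC is slightly below that of GEDF.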
Results. Table 6.2 presents average cache miss rates and average per-core instructions per cycle (IPC)² for both GEDF and the heuristic that exhibited the best performance in terms of these two metrics, as indicated. We can make several observations from this data. First, in almost all cases, the heuristic that performed best for a particular combination of task-set generation parameters outperformed GEDF, often by a substantial margin (see the bold entries in Table 6.2). Second, heuristics that use cache-aware policies (1) or (3) performed best; however, as we will see in Section 6.1.4, when policy (3) performed better than policy (1), it was often by a negligible margin. Third, the use of phantom tasks was clearly beneficial, as every heuristic in the table employs their use when applicable; in fact, we believe that performance improvements tended to be larger at 50% system utilization solely because phantom tasks could be effectively employed. Fourth, the heuristics that performed best almost unanimously employed a cache utilization threshold of 0% and lost-cause policy (1), though lost-cause policies (2) and (3) sometimes performed best at 100% system utilization. This is probably because, at 100% system utilization, phantom tasks cannot be employed, and lost-cause policies (2) and (3) present another way of reducing the impact of MTTs that have the greatest potential to cause thrashing. Overall, we conclude that the heuristic that performed best for the widest variety of task sets employed a cache utilization threshold of 0%, cache-aware policy (1), lost-cause policy (1), and phantom tasks.

² In comparing this data to that presented in Section 6.1.1, note that IPC is often correlated with the number of memory references performed.

Algorithm            Average  Maximum
GEDF                   0.216      474
Heuristics             1.843      572
Best heuristic only    3.711      493

Table 6.3: Tardiness for GEDF and our heuristics (in quanta).
Deadline tardiness. We next tabulated average and maximum observed deadline tardiness. These results are shown in Table 6.3. In this case, we ran each task set for 2,000 quanta rather than 20 quanta. Tardiness is higher with our heuristics than with GEDF, but average tardiness is reasonable, and with our best heuristic, maximum tardiness is comparable to that of GEDF. The somewhat high maximum tardiness values are an artifact of our task generation methodology, which produces some tasks with very large execution costs. The average-case results suggest that tardiness will not significantly restrict the extent to which our heuristics can be employed. Further, if tardiness is undesirable, then a combination of early-releasing and buffering can be employed to “hide” tardiness from an end user, as described in Chapter 4.

Task Set Parameters      L2 Miss Rate             IPC
%S   MTT U        WD     GEDF     H    %Im   GEDF     H     %Im
50   [0.01, 0.1]  TC     3.10  3.30  -6.65   0.90  1.29   43.72
50   [0.01, 0.1]  Uni    4.72  5.02  -6.41   0.78  1.24   58.87
50   [0.1, 0.4]   TC     0.66  0.59  11.76   1.25  1.31    4.67
50   [0.1, 0.4]   Uni    1.40  0.94  32.54   1.15  1.30   12.63
50   [0.5, 0.9]   TC     0.22  0.23  -2.68   1.51  1.52    0.77
50   [0.5, 0.9]   Uni    0.51  0.37  27.75   1.50  1.52    1.84
50   [0.01, 0.9]  TC     0.35  0.33   7.77   1.41  1.44    1.72
50   [0.01, 0.9]  Uni    0.75  0.40  47.28   1.34  1.42    6.40
100  [0.01, 0.1]  TC     6.01  3.22  46.37   0.92  1.96  113.47
100  [0.01, 0.1]  Uni    6.62  6.04   8.84   0.91  1.87  106.73
100  [0.1, 0.4]   TC     1.19  0.60  49.85   1.13  1.41   25.10
100  [0.1, 0.4]   Uni    2.40  0.98  59.09   0.96  1.34   39.18
100  [0.5, 0.9]   TC     0.64  0.49  23.15   1.20  1.29    7.70
100  [0.5, 0.9]   Uni    1.69  0.79  53.29   1.08  1.26   15.88
100  [0.01, 0.9]  TC     0.75  0.60  20.03   1.18  1.31   10.31
100  [0.01, 0.9]  Uni    1.14  0.78  31.78   1.13  1.27   12.96

Table 6.4: Evaluation of one of our heuristics for the 32-core architecture. This heuristic employs a cache utilization threshold of 0%, cache-aware policy (1), lost-cause policy (1), and phantom tasks. The meaning of each column in the table is identical to its meaning in Table 6.2.
32-core architecture evaluation. We next ran similar experiments for the 32-core architecture, where task sets were generated identically to those for the eight-core architecture (i.e., same parameters, but many more tasks per task set, since the platform is considerably larger). Task sets were scheduled using GEDF and the heuristic that performed best over the widest variety of task sets in the eight-core experiments (cache utilization threshold of 0%, cache-aware policy (1), lost-cause policy (1), and phantom tasks). The results in Table 6.4 are similar to the eight-core results in Table 6.2, with the heuristic outperforming GEDF.
Interestingly, on the 32-core architecture, there were several instances where cache miss rates increased slightly when our heuristic was used; however, per-core IPC was always higher under our heuristic (as done earlier, entries representing large improvements are in bold in Table 6.4). In one of the cases where the cache miss rate increased (the fifth entry in Table 6.4), the IPC increase is somewhat small; therefore, we assume that performance differences between GEDF and our heuristic were not substantial. In the two other cases where the cache miss rate increased, substantial IPC improvements were observed; these also happen to be cases where task utilizations are low, and MTT task counts are more likely to be high as a result. In these cases, the results may have less to do with cache miss rates (especially if thrashing was avoided under both GEDF and our heuristic), and more to do with memory bandwidth, and perhaps even contention for accessing lines of the shared cache itself. If MTTs are being co-scheduled more often when our heuristic is used, then there is more data being shared, and a smaller total set of data being referenced, at any point in time. This reduced pressure on the entire memory subsystem may, in turn, result in IPC improvements even if cache miss rates are roughly the same. Further, reducing pressure on the memory subsystem would be much more likely to have a noticeable impact when the number of cores in the system quadruples. In summary, these results give us reason to believe that the tested heuristic will continue to perform well as the core counts of multicore architectures increase.