
The concept of load criticality has been around for approximately 15 years [18, 72]. While identifying a select group of loads as more important is well accepted, no single criterion for critical load identification has emerged. In fact, a survey of papers over this 15-year period reveals that each load criticality proposal has used a unique metric for identifying these loads. Given this, it stands to reason that our selection of commit stalls may not be the best indicator of criticality.

In order to get a better sense of these other metrics, we split them into two groups. The first group contains coarse-grained criteria for load criticality, where criticality is not assigned based on the individual properties of each load. Instead, an identical classification is assigned to a series of loads, with this classification typically updated periodically. Many such works do not explicitly refer to load criticality, but in effect are identifying which loads should be prioritized. The second group consists of fine-grained load criticality criteria, where each individual load receives its own prediction. Our commit-stall-based mechanisms fall into this second group, as we make criticality predictions based only on the static instruction itself.

Coarse-grained criteria typically make their predictions based on the properties of phases, threads, or the processor itself. Fisk and Bahar used the instruction issue rate of the processor to determine criticality: loads issued to the memory system during a low issue rate period were classified as critical [18]. A number of priority-based memory scheduling algorithms use coarse-grained criticality. The TCM scheduler prioritizes loads that originate from latency-sensitive threads over those from bandwidth-sensitive threads [38]. The Minimalist Open-page scheduler identifies loads from threads with lower memory-level parallelism (MLP) as more critical [36], somewhat similar to the TCM approach. The memory request prioritization buffer (MRPB) was proposed to perform request reordering and cache bypassing for GPU memory requests; the requests are reorganized in a warp-aware manner, so that requests from certain warps are prioritized over those from other warps [32].

Since our work belongs to the category of fine-grained criteria, we are more interested in evaluating these metrics. Unlike many of the coarse-grained criteria, which express criticality across different threads, the fine-grained criteria typically express the intra-thread differences in criticality. Often, we find that the choice of criticality metric can be driven by the desired optimization, though the choice of criterion is still primarily ad hoc.

Early work by Srinivasan and Lebeck used dependence chain analysis to determine if a load was critical (e.g., a load that has a dependent mispredicted branch instruction, or a load with a dependent load that will miss in the L1 cache) [72, 71]. They used the predicted criticality both to turn parts of the cache into a critical-load-only victim cache and to guide prefetching.

Jaleel et al. proposed the Re-Reference Interval Prediction (RRIP) policy for cache replacement [30]. Typically, cache line replacement candidates are chosen based on temporal locality: if a line was reused, it moves to the top of an ordered list, where the bottom list member is the one evicted if an upcoming cache request requires space for a new line. RRIP instead classifies the likelihood that a memory location will be reused, and lines that are more likely to be reused are considered more critical to preserve, and are therefore moved to the top of the replacement list.
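To make the contrast with recency-ordered replacement concrete, the following is a minimal sketch of one cache set under an RRIP-style policy. It assumes the per-line counter formulation (a small re-reference prediction value per line) rather than the ordered-list view used above, and it assumes a static variant with 2-bit counters and promotion to zero on a hit. The class name `SRRIPSet` and all parameters are hypothetical choices for illustration, not details taken from [30].

```python
class SRRIPSet:
    """One cache set under a static RRIP-style policy (illustrative sketch)."""

    def __init__(self, ways=4, rrpv_bits=2):
        # Assumed parameters: 4 ways and 2-bit prediction values.
        self.max_rrpv = (1 << rrpv_bits) - 1  # "distant" re-reference
        self.tags = [None] * ways             # None marks an empty way
        self.rrpv = [self.max_rrpv] * ways    # empty ways are evicted first

    def access(self, tag):
        """Return True on a hit; on a miss, fill the line and return False."""
        if tag in self.tags:
            # A reused line is predicted to be re-referenced soon.
            self.rrpv[self.tags.index(tag)] = 0
            return True
        # Age every line until at least one reaches the "distant" value.
        while self.max_rrpv not in self.rrpv:
            self.rrpv = [value + 1 for value in self.rrpv]
        victim = self.rrpv.index(self.max_rrpv)
        self.tags[victim] = tag
        # New lines start one step short of "distant", so a line that is
        # never reused is evicted before lines that have shown reuse.
        self.rrpv[victim] = self.max_rrpv - 1
        return False
```

In this formulation a line with a low counter plays the role of a line near the top of the replacement list: it is the last to be chosen as a victim.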

Several cache-oriented optimizations were proposed by Subramaniam et al. using their concept of load criticality [73]. For each load, they maintain a history of how many direct consumer instructions were dependent on a particular dynamic instance of a load instruction. The number of consumers is saved into a PC-indexed table, and is used in conjunction with a confidence counter to predict future criticality. They target several efficiency-related optimizations for the x86 architecture, such as “faking” the performance of a second cache read port by deferring non-critical loads, only performing forwarding from the store queue for critical loads, and having a cache insertion policy that is dependent on load criticality.
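A PC-indexed table of consumer counts guarded by a confidence counter could be organized roughly as follows. This is a minimal sketch under stated assumptions: the table size, the consumer threshold, the confidence limits, and the training rule (raise confidence when the count repeats, reset it otherwise) are all hypothetical and are not taken from [73].

```python
class ConsumerCountPredictor:
    """PC-indexed consumer-count table with confidence (illustrative sketch)."""

    def __init__(self, entries=1024, consumer_threshold=2,
                 conf_max=3, conf_threshold=2):
        # All four parameters are assumptions chosen for illustration.
        self.entries = entries
        self.consumer_threshold = consumer_threshold
        self.conf_max = conf_max
        self.conf_threshold = conf_threshold
        # Each entry holds [last observed consumer count, confidence].
        self.table = [[0, 0] for _ in range(entries)]

    def _index(self, pc):
        # Direct-mapped and untagged: different PCs may alias.
        return pc % self.entries

    def predict(self, pc):
        """Predict critical only when the stored count is high and stable."""
        consumers, confidence = self.table[self._index(pc)]
        return (confidence >= self.conf_threshold
                and consumers >= self.consumer_threshold)

    def train(self, pc, consumers):
        """Record the consumer count seen for one dynamic load instance."""
        entry = self.table[self._index(pc)]
        if consumers == entry[0]:
            entry[1] = min(entry[1] + 1, self.conf_max)
        else:
            entry[0] = consumers
            entry[1] = 0
```

With these assumed settings, a load must show the same consumer count on three consecutive dynamic instances before it is predicted critical, which is one simple way for the confidence counter to filter out unstable behavior.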

Prieto et al. considered several load criticality metrics to manage off-chip bandwidth via memory scheduling [60]. The majority of their studies use the distance of a load from the ROB head as the load criticality metric, where loads closer to the head are considered more critical. This metric was also used as load criticality by İpek et al. when they were comparing features for a reinforcement-learning-based memory scheduler [28]. Prieto et al. also consider, but dismiss, criteria such as instruction cache misses and the outstanding branch count.
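The ROB-distance metric is straightforward to state in code. The sketch below assumes the ROB is a circular buffer, so the distance is computed modulo its size, and it represents each outstanding request as a hypothetical `(name, rob_index)` pair. Neither the function names nor this representation come from [60] or [28].

```python
def rob_distance(rob_index, rob_head, rob_size):
    """Number of ROB entries between the head and the given entry.

    Assumes a circular ROB, so the subtraction wraps around rob_size.
    """
    return (rob_index - rob_head) % rob_size


def order_by_criticality(requests, rob_head, rob_size):
    """Sort (name, rob_index) pairs so loads nearest the ROB head come first."""
    return sorted(
        requests,
        key=lambda request: rob_distance(request[1], rob_head, rob_size),
    )
```

For example, with an 8-entry ROB whose head is at index 6, a load at index 7 is at distance 1 and a load at index 1 is at distance 3, so the former would be treated as more critical.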

While the criteria may differ significantly, one common thread through the fine-grained criticality work is an assumption that criticality exhibits some form of loop-recurrent behavior. As a result, all of these criticality predictors (including our CBP predictor) assume that the criticality behavior of a static instruction will be similar across its dynamic instances, and therefore use the PC to index their prediction mechanisms.
