There is a subtle interaction between the value, coherence, initiation, and uniprocessor dependence conditions when all of them are to be satisfied by a set of operations.¹⁹ Understanding this interaction provides intuition for the form of the value condition and the exclusion (in most models) of $W \xrightarrow{po} R$ from the uniprocessor dependence condition.
Consider the code segment in Figure 5.10. Both P1 and P2 write to location A and then read this location. Assume location A is initialized to 0, and each processor caches a clean copy. Let W1 and R1 denote the write and read on P1, and W2 and R2 the operations on P2. Consider W1 and W2 being issued at approximately the same time. Figure 5.11 depicts this scenario in two different implementations, both with write-back caches and an invalidation-based coherence scheme. A one word line size is assumed for simplicity. In both implementations, the write request leads to a miss on a clean line. The implementation in Figure 5.11(a) delays the write in the write buffer until the exclusive request is satisfied, while the implementation in Figure 5.11(b) allows the write to immediately retire into the cache with the corresponding cache line remaining in the pending state for the duration of the miss. These correspond to the two implementations discussed in Section 5.3.2 for
¹⁹ This interaction was described first in a revision to the release consistency paper [GGH93b], and also in a later technical report [GAG+
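[Figure 5.11, two panels. Each panel shows P1 and P2, each with a cache, a write buffer, and an outgoing buffer connected to the interconnection network, and each with an ExclRq A outstanding. (a) The writes A=1 (P1) and A=2 (P2) wait in the write buffers while both caches hold A: 0. (b) The writes have retired into the caches, which hold A: 1 (P) on P1 and A: 2 (P) on P2, where (P) marks the pending state.]
Figure 5.11: Simultaneous write operations.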
supporting the coherence and value conditions. For the traditional notion of data dependences to be satisfied, the read on each processor must return the value of its own write on A or a write that completes after its own write. Thus, in our example code segment, it is safe for the reads to return the value of their own processor’s write while the exclusive request is outstanding. The two key questions that arise are: (a) at what point in time can we consider a write complete with respect to its issuing processor? and (b) can a subsequent read (e.g., R1 or R2) return the value of its own processor’s write and complete before the write completes with respect to that processor?
First consider the question about completion of a write with respect to the issuing processor. Without the coherence requirement, a write can be considered complete with respect to its own processor immediately after it is issued. However, pinpointing this event is more subtle when the coherence requirement is imposed on a pair of writes. The coherence requirement requires writes to the same location to complete in the same order with respect to all processors: either $W1(i) \xrightarrow{xo} W2(i)$ or $W2(i) \xrightarrow{xo} W1(i)$, for all i. If the write is assumed to complete with respect to its own processor as soon as it is issued (i.e., before it is serialized with respect to other writes to the same location), then the completion events with respect to the issuing processors, W1(1) and W2(2), would be considered to occur before both W1(2) and W2(1), which are the completion events with respect to the other processor. This yields $W1(1) \xrightarrow{xo} W2(1)$ but $W2(2) \xrightarrow{xo} W1(2)$, which clearly violates the coherence requirement. The above example motivates why the completion event for a write with respect to its own processor is related to the reception of the exclusive or read-exclusive reply, which signals that the write has been serialized with respect to other writes to the same location.
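The coherence argument above can be checked mechanically. The following is a minimal sketch, not part of the specification: the function name `is_coherent` and the encoding of a completion event as a `(write, processor)` pair are assumptions made only for illustration.

```python
def is_coherent(order):
    """Return True if W1 and W2 complete in the same relative order
    with respect to every processor.

    `order` is a list of (write, proc) completion events, earliest first.
    """
    pos = {event: k for k, event in enumerate(order)}
    procs = {proc for _, proc in order}
    # For each processor, record whether W1 completes before W2 there.
    w1_first = {pos[("W1", p)] < pos[("W2", p)] for p in procs}
    return len(w1_first) == 1  # the same answer for every processor

# Each write is treated as complete w.r.t. its own processor at issue time.
early = [("W1", 1), ("W2", 2), ("W1", 2), ("W2", 1)]
# Each write completes w.r.t. its issuer only after it has been serialized
# (here W1 is serialized before W2).
serialized = [("W1", 1), ("W1", 2), ("W2", 1), ("W2", 2)]

print(is_coherent(early), is_coherent(serialized))  # prints: False True
```

In the `early` order, P1 sees W1 before W2 while P2 sees W2 before W1, which is exactly the violation described above.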
Next consider the question about a read returning the value of its own processor’s write before the write completes with respect to that processor. This optimization is uninteresting for models that do not impose the coherence requirement on a given write since, as we discussed above, the write can be considered complete with respect to its own processor immediately after it is issued. Therefore, the optimization only applies to models that impose the coherence requirement on the given write. The optimization can be supported in either implementation depicted in Figure 5.11; the implementation in Figure 5.11(a) can forward the value from the write buffer, while the implementation in Figure 5.11(b) can forward the value from the cache. The read-forwarding optimization is not safe for every memory model, however. Referring back to the example in Figure 5.10, consider a model with a strict uniprocessor dependence condition, which would require $W1(1) \xrightarrow{xo} R1(1)$ and $W2(2) \xrightarrow{xo} R2(2)$. If the model also imposes the coherence requirement, either $W1(2) \xrightarrow{xo} W2(2) \xrightarrow{xo} R2(2)$ or $W2(1) \xrightarrow{xo} W1(1) \xrightarrow{xo} R1(1)$ must hold in any execution of the code. To ensure the above, the implementation must prevent a read from returning the value of its own write until the write is serialized. The need to delay the read arises from the subtle interaction of the initiation, value, coherence, and uniprocessor dependence conditions: a read must return the value of its own processor’s write or a later write in the execution order (initiation and value condition); the read cannot complete (i.e., return a value) until the conflicting write that precedes it in program order completes with respect to this processor (strict form of the uniprocessor dependence condition); and the write is considered complete with respect to its issuing processor after it has been serialized with respect to other writes to the same location (indirectly imposed by the coherence requirement).
Except for the IBM-370 model, all models described in this thesis safely allow the read forwarding optimization by relaxing the uniprocessor dependence condition through the elimination of the $W(i) \xrightarrow{xo} R(i)$ requirement given $W \xrightarrow{po} R$. Therefore, referring back to the example in Figure 5.10, execution orders such as $R1(1) \xrightarrow{xo} R2(2) \xrightarrow{xo} W1(1) \xrightarrow{xo} W1(2) \xrightarrow{xo} W2(1) \xrightarrow{xo} W2(2)$ are allowed. Note that the coherence requirement is still satisfied. Furthermore, even though the reads complete before the writes in the above execution order, the initiation and value conditions still maintain the traditional notion of data dependence by requiring R1 and R2 to return the value of their own processor’s write (i.e., outcome of (u,v)=(1,2)). The subtle interaction among the conditions is broken by the following two things: (a) the relaxed form of the uniprocessor dependence condition allows a read to complete before a conflicting write that precedes it in program order, and (b) the value condition disassociates the visibility and completion events for writes, allowing a read to return the value of its own processor’s write before the write completes. Switching between the strict and relaxed versions of the uniprocessor dependence condition affects the semantics of several of the models: the IBM-370 model depends on the strict version, while the TSO, PSO, PC, RCsc, RCpc, and RMO models depend on the relaxed version. For example, the program segments shown in Figure 2.14 in Chapter 2 distinguish the IBM-370 and the TSO (or PC) models based on whether a read is allowed to return the value of its processor’s write early.
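The difference between the strict and relaxed forms can be illustrated by enumerating every total order of the six sub-operations in the Figure 5.10 example. This is only a sketch under simplifying assumptions: each order is a plain permutation of event names, and the helper names (`before`, `coherent`, `strict_unidep`, `read_waits_for_serialization`) are hypothetical.

```python
from itertools import permutations

# Sub-operations of the Figure 5.10 example: Wk(i) is write Wk completing
# with respect to Pi; Rk(k) is the read on Pk.
EVENTS = ["W1(1)", "W1(2)", "W2(1)", "W2(2)", "R1(1)", "R2(2)"]

def before(order, x, y):
    return order.index(x) < order.index(y)

def coherent(order):
    # W1 and W2 must complete in the same order w.r.t. P1 and P2.
    return before(order, "W1(1)", "W2(1)") == before(order, "W1(2)", "W2(2)")

def strict_unidep(order):
    # Strict uniprocessor dependence: W(i) xo R(i) whenever W po R.
    return before(order, "W1(1)", "R1(1)") and before(order, "W2(2)", "R2(2)")

def read_waits_for_serialization(order):
    # One of the two three-event sequences stated in the text must hold.
    return ((before(order, "W1(2)", "W2(2)") and before(order, "W2(2)", "R2(2)"))
            or (before(order, "W2(1)", "W1(1)") and before(order, "W1(1)", "R1(1)")))

strict_orders = [o for o in permutations(EVENTS)
                 if coherent(o) and strict_unidep(o)]
# Under the strict condition plus coherence, every legal order contains
# one of the two sequences.
assert all(read_waits_for_serialization(o) for o in strict_orders)

# The order quoted in the text is coherent but needs the relaxed condition.
relaxed_example = ("R1(1)", "R2(2)", "W1(1)", "W1(2)", "W2(1)", "W2(2)")
assert coherent(relaxed_example) and not strict_unidep(relaxed_example)
```

The enumeration says nothing about returned values; the value condition is what still forces R1 and R2 to return their own processor's write in the relaxed order.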
5.3.5 Supporting the Multiprocessor Dependence Chains
The multiprocessor dependence chains represent the most distinctive aspect of a memory model specification. Intuitively, these chains capture the orders imposed on pairs of conflicting operations based on the relative
define $\xrightarrow{spo}$: $X \xrightarrow{spo} Y$ if X and Y are to different locations and $X \xrightarrow{po} Y$
define $\xrightarrow{sco}$: $X \xrightarrow{sco} Y$ if X and Y are the first and last operations in one of
    $X \xrightarrow{co'} Y$
    $R \xrightarrow{co'} W \xrightarrow{co'} R$
Conditions on $\xrightarrow{xo}$:
    ...
    (b) given memory operations X and Y, if X and Y conflict and X,Y are the first and last operations in one of
        ...
        multiprocessor dependence chain: one of
            $W \xrightarrow{co'} R \xrightarrow{po} R$
            $RW \xrightarrow{spo} \{A \xrightarrow{sco} B \xrightarrow{spo}\}+ RW$
            $W \xrightarrow{sco} R \xrightarrow{spo} \{A \xrightarrow{sco} B \xrightarrow{spo}\}+ R$
        then $X(i) \xrightarrow{xo} Y(i)$ for all i.
Figure 5.12: Multiprocessor dependence chains in the SC specification.
program and conflict orders of other operations. This section describes the relevant mechanisms for supporting multiprocessor dependence chains, including mechanisms for providing the ordering information to the hardware (e.g., through fences or operation labels), for keeping track of outstanding operations, and for enforcing the appropriate order among operations.²⁰
Much of the discussion and examples in the following sections pertain to the SC and PL1 specifications which represent the strict and relaxed sides of the spectrum. For easier reference, Figures 5.12 and 5.13 show the isolated multiprocessor dependence chains for these two specifications.
Overview of Implementing Multiprocessor Dependence Chains
Multiprocessor dependence chains consist of specific program and conflict orders that impose an execution order between the conflicting pair of operations at the beginning and end of the chain. In our specification notation, most of the relevant program order and conflict order arcs that constitute these chains are represented by the $\xrightarrow{spo}$ and $\xrightarrow{sco}$ relations, respectively. For implementation purposes, it is useful to separate the multiprocessor dependence chains into three different categories:
1. Chains of the form $W \xrightarrow{co} R \xrightarrow{po} R$ or $W \xrightarrow{co} R \xrightarrow{po} W$ consisting of three operations to the same location.
2. Chains that begin with a $\xrightarrow{po}$ and do not contain conflict orders of the form $R \xrightarrow{co} W \xrightarrow{co} R$.
3. Chains that begin with a $\xrightarrow{co}$ or contain conflict orders of the form $R \xrightarrow{co} W \xrightarrow{co} R$.
The above categories do not directly match the way the chains are separated in the model specifications. For example, the second chain representation in the specification for SC can fall in either the second or the third category above depending on the presence of conflict orders of the form $R \xrightarrow{co} W \xrightarrow{co} R$.
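The three categories can be expressed as a small classifier. This is a sketch under an assumed encoding that does not appear in the specification: a chain is a list of operation kinds (`"R"` or `"W"`) plus a list of the arcs between consecutive operations (`"po"` or `"co"`), and the function names are hypothetical.

```python
def contains_r_co_w_co_r(ops, arcs):
    """True if the chain contains a conflict order of the form R co W co R."""
    for k in range(len(arcs) - 1):
        if (arcs[k] == "co" and arcs[k + 1] == "co"
                and ops[k:k + 3] == ["R", "W", "R"]):
            return True
    return False

def classify(ops, arcs, same_location=False):
    """Return the implementation category (1, 2, or 3) of a chain."""
    assert len(ops) == len(arcs) + 1
    # Category 1: W co R po R or W co R po W, all to the same location.
    if same_location and arcs == ["co", "po"] and ops[:2] == ["W", "R"]:
        return 1
    # Category 2: begins with po and has no R co W co R.
    if arcs[0] == "po" and not contains_r_co_w_co_r(ops, arcs):
        return 2
    # Category 3: begins with co, or contains R co W co R.
    return 3

# A chain of the second SC form, RW spo {A sco B spo}+ RW, can land in
# either category 2 or category 3 depending on its conflict orders.
print(classify(["W", "R", "W", "R"], ["po", "co", "po"]))             # prints: 2
print(classify(["W", "R", "W", "R", "R"], ["po", "co", "co", "po"]))  # prints: 3
```

The two printed cases mirror the remark above: the same chain representation in the specification maps to different implementation categories once an $R \xrightarrow{co} W \xrightarrow{co} R$ sequence is present.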
²⁰ Even though the reach relation appears as part of the multiprocessor dependence chains (through $\xrightarrow{spo}$) in system-centric models
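define $\xrightarrow{spo}$, $\xrightarrow{spo'}$:
    $X \xrightarrow{spo'} Y$ if X and Y are the first and last operations in one of
        $Rc \xrightarrow{po} Rc$
        $Rc \xrightarrow{po} Wc$
        $Wc \xrightarrow{po} Rc$, to different locations
        $Wc \xrightarrow{po} Wc$
    $X \xrightarrow{spo} Y$ if X and Y are the first and last operations in one of
        $Rc \xrightarrow{po} RW$
        $RW \xrightarrow{po} Wc$
define $\xrightarrow{sco}$: $X \xrightarrow{sco} Y$ if X and Y are the first and last operations in one of
    $Wc \xrightarrow{co'} Rc$
    $Rc \xrightarrow{co'} Wc$
    $Wc \xrightarrow{co'} Wc$
    $Rc \xrightarrow{co'} Wc \xrightarrow{co'} Rc$
Conditions on $\xrightarrow{xo}$:
    ...
    (b) given memory operations X and Y, if X and Y conflict and X,Y are the first and last operations in one of
        ...
        multiprocessor dependence chain: one of
            $Wc \xrightarrow{co'} Rc \xrightarrow{spo} RW$
            $RW \xrightarrow{spo} \{Wc \xrightarrow{sco} Rc \xrightarrow{spo'}\}* \{Wc \xrightarrow{sco} Rc \xrightarrow{spo}\} RW$
            $Wc \xrightarrow{sco} Rc \xrightarrow{spo'} \{Wc \xrightarrow{sco} Rc \xrightarrow{spo'}\}* \{Wc \xrightarrow{sco} Rc \xrightarrow{spo}\} RW$
            $RWc \xrightarrow{spo'} \{A \xrightarrow{sco} B \xrightarrow{spo'}\}+ RWc$
            $Wc \xrightarrow{sco} Rc \xrightarrow{spo'} \{A \xrightarrow{sco} B \xrightarrow{spo'}\}+ Rc$
        ...
        then $X(i) \xrightarrow{xo} Y(i)$ for all i.
Figure 5.13: Multiprocessor dependence chains in the PL1 specification.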
(a) Category 1: outcome (u,v)=(1,0) disallowed
    P1              P2
    a1: A = 1;      a2: u = A;
                    b2: v = A;

(b) Category 2: outcome (u,v)=(0,0) disallowed
    P1              P2
    a1: A = 1;      a2: B = 1;
    b1: u = B;      b2: v = A;

(c) Category 2: outcome (u,v,w)=(1,1,0) disallowed
    P1              P2              P3
    a1: A = 1;      a2: u = B;      a3: v = C;
    b1: B = 1;      b2: C = 1;      b3: w = A;

(d) Category 3: outcome (u,v,w)=(1,1,0) disallowed
    P1              P2              P3
    a1: A = 1;      a2: u = A;      a3: v = B;
                    b2: B = 1;      b3: w = A;

(e) Category 3: outcome (u,v,w,x)=(1,0,1,0) disallowed
    P1              P2              P3              P4
    a1: A = 1;      a2: u = A;      a3: B = 1;      a4: w = B;
                    b2: v = B;                      b4: x = A;

Figure 5.14: Examples for the three categories of multiprocessor dependence chains.
Figure 5.14 provides examples of the three multiprocessor dependence chain categories. The disallowed outcomes assume the SC specification, with all locations initially 0. The first two categories of chains can be satisfied by enforcing specific program orders. The third category of chains also requires enforcing multiple-copy atomicity on some writes.
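The disallowed outcomes in Figure 5.14 can be confirmed for SC by brute force. The sketch below assumes the simplest possible model of SC, a single shared memory with every operation executing atomically in some interleaving that respects program order. The function `sc_outcomes`, the tuple encoding of operations, and the program listings (which follow the reading of Figure 5.14(a), (b), and (d) used in this section) are illustrative assumptions.

```python
def sc_outcomes(programs, regs):
    """Enumerate all interleavings of `programs` over one shared memory.

    Each program is a list of ("w", location, value) or
    ("r", location, register) operations. Locations start at 0.
    Returns the set of final values of `regs`, as tuples.
    """
    outcomes = set()

    def step(pcs, mem, vals):
        if all(pc == len(prog) for pc, prog in zip(pcs, programs)):
            outcomes.add(tuple(vals[r] for r in regs))
            return
        for i, prog in enumerate(programs):
            if pcs[i] < len(prog):
                kind, loc, arg = prog[pcs[i]]
                next_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
                if kind == "w":
                    step(next_pcs, {**mem, loc: arg}, vals)
                else:
                    step(next_pcs, mem, {**vals, arg: mem.get(loc, 0)})

    step((0,) * len(programs), {}, {})
    return outcomes

fig_a = [[("w", "A", 1)],
         [("r", "A", "u"), ("r", "A", "v")]]
fig_b = [[("w", "A", 1), ("r", "B", "u")],
         [("w", "B", 1), ("r", "A", "v")]]
fig_d = [[("w", "A", 1)],
         [("r", "A", "u"), ("w", "B", 1)],
         [("r", "B", "v"), ("r", "A", "w")]]

assert (1, 0) not in sc_outcomes(fig_a, ("u", "v"))
assert (0, 0) not in sc_outcomes(fig_b, ("u", "v"))
assert (1, 1, 0) not in sc_outcomes(fig_d, ("u", "v", "w"))
```

Because this model has a single copy of memory, it cannot show why the third category is harder to implement; that only appears once each processor holds its own copy of a location.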
The first category is of the form $W \xrightarrow{co} R \xrightarrow{po} Y$, where all three operations are to the same location. Assume R and Y are issued by Pi. $W \xrightarrow{co} R$ already implies $W(i) \xrightarrow{xo} R(i)$, i.e., the write completes with respect to Pi before the read completes. If Y is a write, the required order, $W(j) \xrightarrow{xo} Y(j)$ for all j, is automatically enforced by the combination of the uniprocessor dependence condition and the coherence requirement (i.e., if the implementation enforces this requirement between W and Y). If Y is a read, the required order, $W(i) \xrightarrow{xo} Y(i)$, can be trivially satisfied by forcing R and Y to complete in program order with respect to Pi, i.e., $R(i) \xrightarrow{xo} Y(i)$. Referring to the example in Figure 5.14(a), this corresponds to maintaining program order between the two reads on P2. In a relaxed model such as PL1, program order between reads to the same location needs to be enforced only if the first read is a competing read.
Figures 5.14(b) and (c) provide examples of the second category of chains. A simple way to satisfy such chains is to conservatively enforce the relevant program orders. Consider the SC specification, for example. For every $A \xrightarrow{spo} B$ in the chain, the implementation can ensure $A(i) \xrightarrow{xo} B(j)$ for all i,j by delaying any sub-operations of B until all sub-operations of A have completed. Given X and Y are the first and last operations in the chain, maintaining the strict execution order at every point in the chain enforces $X(i) \xrightarrow{xo} Y(j)$ for all i,j. This conservatively satisfies the specification requirement of $X(i) \xrightarrow{xo} Y(i)$ for all i. In fact, such an implementation satisfies conservative specification styles such as those presented for SC in Figure 4.6 of Chapter 4.
Satisfying the third category of chains requires extra ordering restrictions beyond maintaining program order since these chains expose the multiple-copy behavior in an implementation. Consider the example in Figure 5.14(d). We will use the statement labels to uniquely identify operations; e.g., $W_{a1}$ refers to the write on P1. Given an execution with the conflict orders shown in Figure 5.14(d), the SC specification requires $W_{a1}(3) \xrightarrow{xo} R_{b3}(3)$ to be enforced. The conflict orders already imply $W_{a1}(2) \xrightarrow{xo} R_{a2}(2)$ and $W_{b2}(3) \xrightarrow{xo} R_{a3}(3)$. Maintaining the program order among operations on P2 and P3 would enforce $R_{a2}(2) \xrightarrow{xo} W_{b2}(j)$ for all j and $R_{a3}(3) \xrightarrow{xo} R_{b3}(3)$. However, maintaining the program order is insufficient for disallowing the outcome (u,v,w)=(1,1,0) since it allows $R_{b3}(3) \xrightarrow{xo} W_{a1}(3)$.
The third category of chains can be enforced by ensuring that certain writes appear atomic with respect to multiple copies.²¹ A simple mechanism to achieve this is to require a write to complete with respect to all other processors before allowing a conflicting read from another processor to return the value of this write. Referring to Figure 5.14(d), this restriction ensures $W_{a1}(i) \xrightarrow{xo} R_{a2}(2)$ for i=(2,3). Assuming program order is maintained on P2 and P3, the atomicity restriction enforces $W_{a1}(i) \xrightarrow{xo} R_{b3}(3)$ for i=(2,3), which conservatively satisfies the execution order constraints of the SC specification. In general, the atomicity constraint must be enforced for a write that starts a chain with a conflict order or a write that is part of a conflict order of the form $R \xrightarrow{co} W \xrightarrow{co} R$ (e.g., the write on P3 in Figure 5.14(e)). In practice, satisfying the SC specification requires enforcing this requirement for all writes, while the PL1 specification limits this requirement to competing writes only.
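The effect of the atomicity restriction on the Figure 5.14(d) example can be seen by enumerating orders of the relevant sub-operations. The sketch below is a deliberately reduced model and not the specification itself: it keeps only the write sub-operations that a read can observe (the copies of A at P2 and P3, and the copy of B at P3), treats each read as a single event on its own processor's copy, and uses hypothetical names (`EVENTS`, `READS`, `PROGRAM_ORDER`, `outcomes`).

```python
from itertools import permutations

# Wa1(i): A = 1 reaching the copy at Pi.  Wb2(3): B = 1 reaching P3's copy.
# Ra2: u = A on P2.  Ra3: v = B on P3.  Rb3: w = A on P3.
EVENTS = ["Wa1(2)", "Wa1(3)", "Wb2(3)", "Ra2", "Ra3", "Rb3"]

# read -> (sub-operation updating the copy it reads,
#          all modeled sub-operations of that write)
READS = {
    "Ra2": ("Wa1(2)", ["Wa1(2)", "Wa1(3)"]),
    "Ra3": ("Wb2(3)", ["Wb2(3)"]),
    "Rb3": ("Wa1(3)", ["Wa1(2)", "Wa1(3)"]),
}

# Program order on P2 (read, then write) and on P3 (read, then read).
PROGRAM_ORDER = [("Ra2", "Wb2(3)"), ("Ra3", "Rb3")]

def outcomes(atomic_writes):
    """Return the reachable (u, v, w) outcomes.

    With atomic_writes=True, a read may return a write's value only if
    every modeled sub-operation of that write precedes the read.
    """
    seen = set()
    for order in permutations(EVENTS):
        pos = {event: k for k, event in enumerate(order)}
        if any(pos[x] > pos[y] for x, y in PROGRAM_ORDER):
            continue
        values, legal = [], True
        for read in ("Ra2", "Ra3", "Rb3"):
            local, subops = READS[read]
            value = 1 if pos[local] < pos[read] else 0
            if atomic_writes and value == 1 and any(pos[s] > pos[read] for s in subops):
                legal = False
            values.append(value)
        if legal:
            seen.add(tuple(values))
    return seen

assert (1, 1, 0) in outcomes(atomic_writes=False)     # program order alone
assert (1, 1, 0) not in outcomes(atomic_writes=True)  # with atomicity
```

In an actual implementation the read would be stalled instead of the execution being rejected; filtering the enumeration has the same effect on the set of reachable outcomes.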
An alternative mechanism for enforcing atomicity is inspired by the conditions Dubois and Scheurich proposed for satisfying SC (see Sections 2.3 and 4.1.3), whereby the program order from a read to a following operation is enforced by delaying the latter operation until both the read has completed and the write whose value is read has completed with respect to all other processors. Therefore, given $W \xrightarrow{co} R \xrightarrow{po} Y$ in a chain, the above requirement enforces $W(i) \xrightarrow{xo} Y(j)$ for all i,j except for i equal to W’s issuing processor. Compared to the previous mechanism, the delay for the write W to complete simply occurs after, instead of before, the read R. Referring back to the example in Figure 5.14(d), the above enforces $W_{a1}(i) \xrightarrow{xo} W_{b2}(j)$ for all i,j except for i=1. Along with enforcing the program order on P3, this implies $W_{a1}(i) \xrightarrow{xo} R_{b3}(3)$ for i=(2,3), which conservatively satisfies the specification. Maintaining the program order conservatively past a read is required for the following reads that appear in third category chains: (i) the first read (R) in the chain if the chain begins with $W \xrightarrow{co} R$, or (ii) the second read (R2) in a conflict order of the form $R1 \xrightarrow{co} W \xrightarrow{co} R2$. In practice, the SC specification requires every $R \xrightarrow{po} Y$ to be enforced in the conservative way, while the PL1 specification requires this only for $Rc \xrightarrow{po} Yc$.
The following sections describe the various mechanisms for enforcing multiprocessor dependence chains in more detail. The techniques described maintain the execution orders at all intermediate points in the chain, and therefore do not exploit the aggressive form in which such chains are expressed in the specifications. Section 5.4.1 describes alternative techniques that satisfy the multiprocessor dependence chains more aggressively.
Providing Ordering Information to the Hardware
To aggressively support multiprocessor dependence chains, the information about significant program orders and write operations that must obey multiple-copy atomicity needs to be communicated to the underlying system. This information is implicit for models such as SC, TSO, and PC. In SC, for example, all program
²¹ Section 4.4 in Chapter 4 described indirect ways of supporting multiple-copy atomicity by transforming specific reads into dummy
Table 5.2: How various models inherently convey ordering information.

                                      Mechanism for Providing Information on
Model                                 Program Order     Multiple-Copy Atomicity
SC, TSO, PC                           -                 -
IBM-370, PSO, Alpha, RMO, PowerPC     fence             -
WO, RCpc                              label             -
RCsc, PL1, PL2, PL3                   label             label
TSO+, PSO+                            fence             -
PC+, PowerPC+                         fence             label
RCpc+                                 label, fence      label
orders are significant and every write must obey multiple-copy atomicity. For most relaxed models, however, the information provided through operation labels or fences must be explicitly communicated to the hardware. Table 5.2 shows how such information is inherently provided by various models (models such as TSO+ are extensions that were defined in Section 4.4).
There are several ways to communicate the information inherently provided by a model to the underlying hardware:
The information may be conveyed in the form of an operation label encoded either in the type or the
address of a memory instruction.
The information may be conveyed through additional instructions, for example explicit fence instruc-