• No se han encontrado resultados

Proceso convencional de tratamiento de gas natural

A general replication scheme used with consistent hashing is successor- list replication [208], whereby every key-value data item is replicated at a number of servers that succeed the responsible node on the consistent hashing ring. An example is shown in Figure 6.1. Here, the replication degree is three and a quorum is a majority, i.e., any set of two nodes from the replication group. A naïve attempt of achieving linearizable consistency is to use a shared memory register approach, e.g. ABD [23], within every replication group. This will not work as false failure suspicions, along with consistent hashing, may lead to non-overlapping quorums. The diagram in Figure 6.2 depicts such a case, where node 15 falsely suspects node 10. According to node 10, the replication group for keys in the range (5, 10] is {10, 15, 20}, while from the perspective of node 15, the replication group for the same keys is {15, 20, 25}. Now, two different operations, for instance on key 8, may access non-intersecting quorums leading to a violation of linearizability. For example, a put operation may complete after updating the value associated with key 8 at replicas 10 and 20. A subsequent get

6.5. PROBLEM STATEMENT 103

10

5 15 20 25

replication group for key range (5, 10]

Figure 6.1.A correct replication group for keys in range(5, 10]using consistent hashing with successor-list replication. Replication degree is three and a majority quorum is any set of two in the replication group, i.e., {10, 15}, {10, 20}, or {15, 20}.

10

5 15 20 25

Figure 6.2. Node 15 inaccurately suspects node 10 to have crashed. As a conse- quence, node 15 assumes the replication group for key range(5, 10]is {15, 20, 25}. Other nodes may still assume the replicas set is {10, 15, 20}, potentially leading to non-overlapping quorums, e.g., {10, 20} and {15, 25}, or {10, 15} and {20, 25}.

operation may reach replicas 15 and 25 and return a stale value despite contacting a majority quorum of replicas.

In a key-value store with replication groups spontaneously reconfigured by consistent hashing, applying put and get operations on majority quorums is not sufficient for achieving linearizability. Furthermore, in such a self- managing system, where changes in one replication group are related to changes in other groups, and where put and get operations may occur during reconfiguration, guaranteeing atomic consistency is non-trivial. In fact, any quorum-based algorithm will suffer from the problem of non- intersecting quorums when used in a dynamic replication group dictated by consistent hashing. We propose consistent quorums as a solution. In contrast to reusing an existing dynamic replication protocol within each replication group, as a black box, consistent quorums allow us to decouple group reconfigurations from data operations in a clean way, avoiding the complexities and unnecessary overheads of those protocols.

Chapter 7

Consistent Quorums

In a typical quorum-based protocol an operation coordinator sends request messages to a set of participants and waits for responses. Upon receiving a request message, each participant acts on the request and responds to the coordinator with an acknowledgement. The coordinator completes the operation as soon as it receives a quorum [88] of acknowledgements. Typically, essential safety properties of the protocol are satisfied by ensuring that the quorums for different operations, e.g., put and get, intersect in at least one participant.

Quorum intersection is easily achieved in a static system with a fixed set of nodes. In a dynamic system however, different nodes may have inconsistent views of the group membership. It is possible thus, that the number of nodes which consider themselves responsible for a key range, i.e., the number of nodes in a replication group, is larger than the replication degree. As a result, successive put and get operations may complete by contacting non-overlapping quorums, as we’ve shown in the previous chapter, which could lead to a violation of linearizability.

The idea is then to maintain a membership view of the replication group at each node which considers itself to be a replica for a particular key range according to the principle of consistent hashing. Each node in a replication

group has a view vi = hs, Gii, where s represents the set of keys or the key

range replicated by the nodes in the group, Gi is the set of nodes in the

replication group, and i is the version number of the view. A node has an installed view for each key range that it replicates. We say that a node n is in view vi, not when n∈Gi, but when n has view vi installed.

Definition 1. For a given replication group G, we say that a quorum Q is a consistent quorum of G, if every node n in Q has the same view of G installed, at the time when Q is assembled, i.e., when n sends its acknowledgement in Q.

When a node replies to a request for a key k, it stamps its reply with its currently installed view for the corresponding key range, s, where k∈ s. The main idea is that a quorum-based operation will succeed only if it collects a quorum of nodes with the same view, i.e., a consistent quorum.

As node membership changes over time, a mechanism is needed to reconfigure the membership views consistently at all replication group members. For that we devised a group reconfiguration protocol based on Paxos consensus [130], extended with an extra view installation phase and augmented with consistent quorums. We present our group reconfiguration protocol in the following section.

7.1

Paxos-based Group Reconfiguration using

Consistent Quorums

Replication groups must be dynamically reconfigured [3, 132] to account for new node arrivals and to restore the replication degree after group member failures. The system starts in a consistent configuration, whereby each node has consistent views installed for every key range that it replicates. Thereafter, within each replication group, the system reconfigures by using the members of the current view as acceptors in a consensus instance which decides the next view. Each consensus instance is identified by the view in which it operates. Therefore, views are decided in sequence and installed at all members in the order in which they were decided. This decision sequence determines view version numbers. Algorithms 1–3 illustrate our Paxos-based group reconfiguration protocol using consistent quorums. Earlier we defined a consistent quorum as an extension of a regular quorum.

7.1. GROUP RECONFIGURATION USING CONSISTENT QUORUMS 107

Without loss of generality, in the remainder of this chapter we focus on majority-based (consistent) quorums.

A reconfiguration is proposed and overseen by a coordinator node which could be a new joining node, or an existing node that suspects one of the group members to have failed. Reconfiguration(vi ⇒ vi+1)takes

the group from the current view vi to the next view vi+1. Group size stays

constant and each reconfiguration changes the membership of a replication group by a single node. One new node joins the group to replace a node which leaves the group. The version number of a view is incremented by every reconfiguration. The reconfiguration protocol amounts to the coordinator getting the group members of the current view, vi, to agree on

the next view, vi+1, and then installing the decided next view at every node

in the current and the next views, i.e., Gi∪Gi+1. We say that a node is in

view vi, once it has installed view vi and before it installs the next view,

vi+1. Nodes install views sequentially, in the order of the view versions,

which reflects the order in which the views were decided.

The key issue catered for by the reconfiguration protocol is to always maintain the quorum-intersection property for consistent quorums, even during reconfigurations. To make sure that for any replication group, G, no two consistent quorums may exist simultaneously, e.g., for the current and the next views of a reconfiguration, the decided next view, vi+1, is first

installed on a majority of the group members of the current view, Gi, and

thereafter it is installed on the new group member, Gi+1\Gi.

Reconfiguration Proposals

Proposed new views are devised based on changes in the consistent hashing ring topology. Under high churn [191], different nodes may concurrently propose conflicting next views, e.g., when a node joins the system shortly after another node fails, and both events lead to the reconfiguration of the same replication group. Using consensus ensures that the next view is agreed upon by the members of the current view, and the group re- configuration proceeds safely. When a reconfiguration proposer p notices that the decided next view vi+1= hs, Gdiis different from the one it had

proposed, say vi+1 = hs, Gpi, p assesses whether a reconfiguration is still

Algorithm 1 Reconfiguration coordinator

Init: phase1Acks[vi] ←∅, phase2Acks[vi]←∅, phase3Acks[vi]←∅

prop[vi]←0, pRec[vi] ← ⊥ . ∀consensus instance vi

1: onhPropose: (vi ⇒vi+1)ido

2: pRec[vi] ← (vi ⇒vi+1) .proposed reconfiguration

3: send hP1A: vi, prop[vi]ito all members of group Gi .vi= hs, Gii

4: onhP1B: vi, Ack, pn, rec, vido

5: phase1Acks[vi]←phase1Acks[vi] ∪ {(pn, rec, v)}

6: vQ ←consistentQuorum(extractViewMultiset(phase1Acks[vi]))

7: if vQ6= ⊥then

8: r←highestProposedReconfiguration(phase1Acks[vi], vQ)

9: ifr 6= ⊥then

10: pRec[vi]←r

11: sendhP2A: vi, prop[vi], pRec[vi]ito all members of GQ

12: onhP1B: vi, Nacki ∨ hP2B: vi, Nackido

13: prop[vi]++ .retry with higher proposal number, unique by process id

14: send hP1A: vi, prop[vi]ito all members of Gi

15: onhP2B: vi, Ack, vido

16: phase2Acks[vi]←phase2Acks[vi] ∪ {v}

17: vQ ←consistentQuorum(phase2Acks[vi])

18: if vQ6= ⊥then

19: sendhP3A: vi, rec[vi]ito all members of GQ

20: onhP3B: vi, vido

21: phase3Acks[vi]←phase3Acks[vi] ∪ {v}

22: if consistentQuorum(phase3Acks[vi])6= ⊥then

23: sendhP3A: vi, pRec[vi]ito new group member (Gi+1\Gi)

which p suspects to have failed. In such a scenario, p generates a new reconfiguration to reflect the new view, and then proposes it in the new protocol instance determined by vi+1.

In the algorithm specifications we omit the details pertaining to ignoring orphan messages or breaking ties between proposal numbers based on the proposer id. The consistentQuorum function tests whether a consistent

7.1. GROUP RECONFIGURATION USING CONSISTENT QUORUMS 109

Algorithm 2 Current group member

Init: wts[vi]←0, rts[vi]←0, aRec[vi]← ⊥ . ∀consensus instance vi

1: onhP1A: vi, pido .acceptor role

2: if p≥rts[vi] ∧ p≥wts[vi] then

3: rts[vi]← p .promise to reject proposal numbers lower than p

4: sendhP1B: vi, Ack, wts[vi], aRec[vi], view(vi.s)ito coordinator

5: else sendhP1B: vi, Nackito coordinator

6: onhP2A: vi, p,(vi ⇒vi+1)ido .acceptor role

7: if p>rts[vi] ∧ p>wts[vi] then

8: wts[vi] ←p .promise to reject proposal numbers lower than p

9: aRec[vi] ← (vi ⇒vi+1) .accepted reconfiguration

10: sendhP2B: vi, Ack, view(vi.s)ito coordinator

11: else sendhP2B: vi, Nackito coordinator

12: onhP3A: vi,(vi ⇒vi+1)ido .learner role

13: installView(vi, vi+1)

14: send hP3B: vi, view(vi.s)ito coordinator

15: send hData: (vi ⇒vi+1), data(vi.s)ito new member (Gi+1\Gi)

Algorithm 3 New group member

1: onhP3A: vi,(vi ⇒vi+1)ido

2: installView(vi, vi+1) .makes vi+1busy if the data was not received yet

3: onhData: (vi ⇒vi+1), dataido .from old members of group Gi

4: dataSet[vi+1] ←dataSet [vi+1]∪ {(data, vi)}

5: send hDataAck: vi+1ito old member of group Gi

6: if consistentQuorum(extractViewMultiset(dataSet[vi+1]))6= ⊥then

7: storeHighestItems(dataSet[vi+1]) .makes vi+1ready

quorum exists among a set of views and if so, it returns the view of that consistent quorum. Otherwise it returns⊥. The extractViewMultiset function maps a multiset of (proposal number, reconfiguration, view) triples to the corresponding multiset of views. The highestProposedReconfiguration takes a multiset of such triples and returns the reconfiguration with the highest proposal number, among the triples whose view matches view vQ,

view corresponding to a given key range or just a single key, and the data function retrieves the timestamped data items corresponding to a given key range. The storeHighestItems function takes a multiset of sets of timestamped key-value data items, and for each distinct key, it stores locally the corresponding data item with the highest timestamp. Finally, the installView function takes two consecutive views vi and vi+1. If the local

node belongs to group Gi, it must have vi installed before proceeding with

the view installation. If the local node is the new node in group Gi+1, it

can proceed immediately. We discuss these situations in more detail below, when we describe the install queue and the data chain mechanisms.

Phase 1 and phase 2 of the protocol are just the two phases of the Paxos consensus algorithm [130], augmented with consistent quorums. We could view Paxos as a black-box consensus abstraction whereby the group members of the currently installed view, Gi, are the acceptors deciding the

next view to be installed, vi+1. Nonetheless, we show the details of phases

1 and phase 2 as an illustration of using consistent quorums. View Installation and Data Transfer

Phase 3 of the protocol is the view installation phase. Once the next view vi+1is decided, the coordinator asks the members of the current view vi to

install view vi+1. Once vi+1is installed at a majority of nodes in Gi, only

a minority of nodes are still in view vi, and so it is safe to install vi+1 at

the new member, without allowing two simultaneous majorities, i.e., one for vi and one for vi+1. When a member of group Gi installs vi+1, it also

sends the corresponding data to the new member of vi+1. Conceptually,

once the new member receives the data from a majority of nodes in the old view, it stores the data items with the highest timestamp from a majority. In practice however, we optimize the data transfer such that only keys and timestamps are pushed from all nodes in Gi to the new node, which then

pulls the latest data items in parallel from different replicas.

Ensuring that the new group member gets the latest data items among a majority of nodes in the old view is necessary for satisfying linearizability of the put and get operations that occur during reconfiguration. To see why, consider a case where a put operation occurs concurrently with reconfiguration(vi ⇒ vi+1). Assume that this put operation updates the

7.1. GROUP RECONFIGURATION USING CONSISTENT QUORUMS 111

value of key k with timestamp t, to a newer value with timestamp t+1, and further assume that a majority of replicas in Gi have been updated while a

minority of replicas are yet to receive the update. If the new group member, n, didn’t get the latest value of k from a majority of replicas in Gi, and

instead n transferred the data from a single replica, it would be possible that n got the old value from a replica in the not-yet-updated minority. In this situation, a majority of nodes in Gi+1 have the old value of k. As we

discuss in Section 7.2, the concurrent put operation will complete using the old view vi. A subsequent get accessing a consistent quorum with view

vi+1may later return the old value of k, thus violating linearizability.

Install Queue

Two interesting situations may arise in asynchronous networks. Recall that once a reconfiguration(vi ⇒vi+1)has been decided,P3Amessages instruct

the nodes in Gi+1to install the new view vi+1. First, it is possible that mul-

tiple consecutive reconfigurations progress with a majority of nodes while the nodes in a minority temporarily do not receive anyP3Amessages. Later, the minority nodes may receiveP3Amessages in a different order from the order in which their respective reconfigurations were decided. Assume for example that node n is such a node in this “left behind” minority. When node n is instructed to apply a reconfiguration(vi ⇒vi+1)whereby n is a

member of group Gi, but n has not yet installed view vi, node n stores the

reconfiguration in an install queue marking it for installation in the future, immediately after installing view vi. Accordingly, node n will issue view

installation acknowledgments, i.e.,P3Bmessages, and it will initiate the corresponding data transfers to the new node in Gi+1, only after node n

installs view vi+1. This install queue mechanism ensures that even if nodes

receive view installation requests out of order, views are still going to be installed in the order in which they were decided.

Data Chain

Another interesting situation that may be caused by asynchronous execution is one where after having installed a view vi, a new group member n,

before having received all the data for view vi. In such cases, node n stores

the newer views in a data chain, reminding itself that upon completely receiving the data for view vi, it should transfer it to the new group member

in group Gi+1, Gi+2, etc. Even if data is transferred slowly and it arrives

much later than the view installation, node n still honors its responsibility to push the data forward to the new nodes in subsequently newer views. This data chain mechanism ensures that upon a view change(vi ⇒vi+1),

all the nodes in Gi push the data to the new node in Gi+1. Considering

various failure scenarios, this increases the new node’s chances of collecting the data from a majority of nodes in Gi, and therefore it preserves the

linearizability and liveness of put and get operations. Termination

We say that a reconfiguration(vi ⇒vi+1)has terminated, when a majority

of nodes in group Gi+1 have installed the new view vi+1. Once vi+1 has

been installed at a majority of nodes in Gi+1, a consistent quorum for the

view vi+1 becomes possible, enabling the group to be reconfigured yet