• No se han encontrado resultados

Conclusiones Una valoración crítica desde la perspectiva del derecho a la

In this experiment, we compare the effect of skewed workloads on the perfor- mance of P-SMR and sP-SMR (Figure 6.7). The workload is composed of 50% updates and 50% reads against the key-value store. We evaluate the scalability of each approach with both uniform and Zipfian distributions for key selection. In the latter case, clients select keys following a Zipfian distribution with ex- ponent value of one. In skewed distributions, communication is expected to be uneven across multicast groups. In sP-SMR, the number of threads reflects the number of worker threads excluding the scheduler. Besides absolute values for the maximum throughput, we also show the normalized throughput of an individual thread.

With a uniform selection of keys, commands are evenly distributed across groups and P-SMR’s throughput increases up to the capacity of each available core. With a Zipfian distribution, however, P-SMR’s throughput is bounded by the most-loaded multicast group (point with 8 threads). sP-SMR is not bounded by a single multicast group as is P-SMR, but rather by the load the scheduler can handle until it becomes CPU-bound. Increasing the number of worker threads af- ter two threads has a negative impact on sP-SMR’s performance since the sched- uler spends more time synchronizing with worker threads. Also notice that with 1 and 2 threads the throughput of sP-SMR with a uniform workload is lower than its throughput with a Zipfian distribution. In the Zipfian distribution, some keys are accessed more often than the others and there are higher chances that these keys are cached at the processor. According to the normalized per-thread throughput, P-SMR scales better with the number of cores than sP-SMR under both uniform and Zipfian distributions.

6.5.8

Conclusions from the experiments

0 500 1000 1500 2000 2500 0 1 2 4 6 8 Throughput (Kcps) Number of threads P-SMR: uniform Zipfian 0 0.2 0.4 0.6 0.8 1 0 1 2 4 6 8 per-thread normalized throughput Number of threads sP-SMR: uniform Zipfian

Figure 6.7. The effect of the number of threads on the performance of a skewed workload; maximum throughput in Kilo commands executed per second (Kcps) (left); normalized per-thread throughput (right).

• P-SMR is optimized for workloads that mostly include independent com- mands and as the experiments illustrated, P-SMR outperforms other se- quential and parallel approaches with this type of workload.

• When the workload is dominated by dependent commands, a sequential implementation of state-machine replication outperforms the parallel im- plementations. This is a result of synchronization among the threads in parallelized approaches.

• Although P-SMR lacks the flexibility of sP-SMR’s model in distributing the load among the worker threads, we have seen that with skewed workloads P-SMR has still a higher throughput. This is because sP-SMR’s performance is limited by the overhead of the tasks the scheduler does before its advan- tages in load balancing reveal themselves.

6.6

Conclusion

In this chapter, we have questioned the sequentiality of state-machine replica- tion and the possibility of its integration with parallel services. State-machine replication is widely used in making systems fault tolerant and parallel systems are no exception. We have designed and implemented P-SMR, a new paral- lelized model for state-machine replication. In summary, we have found that

for independent commands, P-SMR outperforms sequential state-machine repli- cation by a factor of more than 3 and other approaches by a factor of more than 2. For dependent commands, although P-SMR has better performance than other parallel techniques, it is defeated by SMR, the sequential implementation of state-machine replication that is not subject to the overhead of synchroniza- tion. Moreover, P-SMR scales better with the number of cores whereas other techniques show poor scalability.

Chapter 7

Experimenting with Paxos in the Cloud

Paxos is often present at the core of state-machine replication and has been the most central component of this study. Implementations of Paxos are currently used in many prototypes and production systems in both academia and industry. Given its wide usage, Paxos’s performance and behavior in the presence and absence of failures is critical to the overall performance of a system built on top of it. In this chapter, we present the results of an extensive performance evaluation conducted using four open-source implementations of Paxos deployed in Amazon’s EC2. Although all protocols surveyed in this chapter implement Paxos, they are optimized in a number of different ways, resulting in very different behaviors. In addition to reporting our findings in a variety of configurations with and without failures, we propose and assess additional optimizations to existing implementations.

7.1

Problem Statement

In this chapter we present the results of an extensive performance evaluation we conducted with open-source implementations of Paxos[4] deployed in Amazon’s EC2, a public cloud-computing environment.1 Our study is motivated by the fact

that many online services deployed in the cloud require both high availability and sustained performance. High availability is achieved by means of repli- cation, using techniques such as state-machine replication [68] and primary- backup replication [88]. At the core of these techniques lies an agreement pro- tocol (e.g., [4; 89]). A large variety of such agreement protocols exist in the literature that solve the problem under many different system assumptions[? ].

1http://aws.amazon.com/ec2/

Among these protocols, Paxos has received much attention in recent years both from the industry (e.g.,[87; 16]) and the academia (e.g., [10; 90]). Several re- cent efforts (including the first three chapters of this thesis) have reported on the performance of Paxos implementations, mostly under “normal conditions” (e.g., [10; 32]), that is, deployments with homogenous nodes, balanced communica- tion links, and the absence of failures. While differences in the implementations impact overall performance, these reports typically show steady behavior in the normal case. Yet, anecdotal evidence tells that under less favorable conditions (e.g., after the failure of a node), Paxos may lose its sustained performance. In- tuitively, this is explained by the fact that by relying on a quorum of acceptors for progress, Paxos may proceed at the pace of the quorum of faster acceptors, leaving slower acceptors lagging behind with an ever-increasing backlog of re- quests. Paxos implementations generally read and process messages in arrival order, hence even if the messages in question relate to protocol actions that have been completed long time before, they will be read and processed just as if they are associated with pending decisions. All of this will take time, hence should a fast acceptor fail and a slow one be needed to form a quorum, the system may experience a performance hiccup, corresponding to the time it takes for the slower acceptor to catch up. Bursty behavior is undesirable because it can cas- cade into the application, within which end-user requests may be piling up and replicas falling behind.

We set out to understand to what extent existing implementations genuinely suffer from this phenomenon and if so, under what conditions. To this end we evaluated four open-source implementations of Paxos: S-Paxos, OpenReplica, U- Ring Paxos, and Libpaxos under different message sizes in four configurations: (a) a homogeneous set of nodes in the same availability zone (i.e., data cen- ter); (b,c) two heterogeneous configurations with nodes in the same availabil- ity zone; and (d) homogeneous nodes distributed across different availability zones. In each case, we considered executions with and without participant failures. These configurations represent the deployment of many current online services in the cloud. Placing replicas on a set of nodes with similar hardware characteristics (configuration (a)) is probably the most common configuration used in experimental evaluations. Heterogeneous settings (configurations (b) and (c)) may arise involuntarily (e.g., if applications run in a virtual machine whose physical node turns out to be shared among other applications) or vol- untarily: a designer might choose to deploy Paxos in this manner, perhaps to reduce the perceived risk of correlated failures, or to reduce cost, for example by paying for 3 powerful nodes and 1 or 2 weaker backup nodes (e.g., Cheap Paxos [14] is a variation of Paxos that exploits this alternative). In addition to

these two configurations, the participants of a service can be geographically dis- tributed (configuration (d)) to improve locality and availability. Locality reduces user-perceived latency and is achieved by moving the data closer to the users. Availability improves as the service can be configured to tolerate the crash of a few nodes within a data center or the crash of multiple data centers.

By evaluating the four open-source Paxos libraries under these configura- tions, we show that standard Paxos implementations sometimes have unex- pected behavior and long delays, although the phenomenon varies and depends very much on the details of the implementations: some protocols are more prone to problematic behavior; others are more robust but at the price of reduced per- formance.

7.1.1

Outline

The remainder of this chapter is organized as follows. In Section 7.2 we de- scribe the libraries used in our performance evaluation with emphasis on their flow control mechanisms. In Section 7.3 we detail our experimental setup and present the results. In Section 7.4 we discuss the main lessons we have learnt while interacting with these libraries and we conclude this chapter in Sec- tion 7.5.

Documento similar