I/O workloads exhibit different kinds of fluctuations in workload locality. A short-term fluctuation is a brief burst in the workload (lasting less than a second) and is usually absorbed either by the software disk schedulers located on the DA or by the hardware disk schedulers used by the disk controllers. A long-term fluctuation is an overall change in the workload locality pattern caused by a lasting change in the workload; such fluctuations typically last for several hours before returning to the expected locality pattern, or may even be a permanent deviation from it. Therefore, in the event of long-term fluctuations, either the tenant reconfigures the QoS specifications or the system automatically adds or removes hardware resources, so that the QoS specifications are enforced accurately. However, there are also mid-term fluctuations, which exhibit variations in workload locality on the order of a few minutes. These pose serious challenges for enforcing QoS guarantees because it is unclear whether to reconfigure the system settings or to wait until the fluctuations disappear.
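The three time scales described above can be illustrated with a minimal Python sketch. The cut-off values `SHORT_TERM_MAX_S` and `LONG_TERM_MIN_S`, the function name, and the `RESPONSE` table are assumptions introduced only for illustration; the text gives rough orders of magnitude (under a second, a few minutes, several hours), not exact boundaries.

```python
from enum import Enum


class Fluctuation(Enum):
    SHORT_TERM = "short-term"
    MID_TERM = "mid-term"
    LONG_TERM = "long-term"


# Hypothetical cut-offs: the text only says "less than a second",
# "a few minutes" and "several hours"; the exact values are assumptions.
SHORT_TERM_MAX_S = 1.0
LONG_TERM_MIN_S = 3600.0


def classify_fluctuation(duration_s: float) -> Fluctuation:
    """Classify a locality fluctuation by how long it has persisted."""
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    if duration_s < SHORT_TERM_MAX_S:
        return Fluctuation.SHORT_TERM
    if duration_s >= LONG_TERM_MIN_S:
        return Fluctuation.LONG_TERM
    return Fluctuation.MID_TERM


# Responses paraphrased from the discussion above.
RESPONSE = {
    Fluctuation.SHORT_TERM: "absorb in the disk schedulers",
    Fluctuation.MID_TERM: "undecided: reconfigure or wait",
    Fluctuation.LONG_TERM: "reconfigure QoS or add/remove hardware",
}
```

The sketch makes the difficulty visible: only the mid-term class has no clear response, since at the moment of decision the system cannot tell whether the fluctuation will fade or persist.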
In VMware’s Storage DRS [141], when a VM is seen to overload its storage node, the VM is isolated and migrated to the least loaded storage node within its cluster, where a cluster is a hard partition of a set of storage nodes. In BASIL [144], when a storage node is overloaded, VDs are migrated to different storage nodes using a pair-wise assignment algorithm. By pairing VDs of minimum and maximum load, the pairs are expected to balance the overall load in the system. It is common practice to migrate VDs from overloaded DAs to new or least loaded DAs [155–157], but data migration can take prohibitively long. Therefore, BASIL uses Storage VMotion to show that data migration from one store to another can be automated and need not block the VMs. It uses workload characterization to understand the load pattern and to manage the data transfer. HP’s AutoRAID [157] maintains a hierarchical storage system with a RAID1 setup in the upper layer, to ensure higher throughput and lower latency, and a RAID5 setup in the lower layer, to provide additional redundancy. Data migration between the RAID1 and RAID5 layers is done automatically in the background, transparent to the user applications.
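The idea of pairing VDs of minimum and maximum load can be sketched as follows. This is only an illustration of the pairing step as summarized above, not BASIL's actual algorithm; the function name `pair_min_max` and the load values are hypothetical.

```python
def pair_min_max(vd_loads: dict[str, float]) -> list[tuple[str, ...]]:
    """Pair the lightest remaining VD with the heaviest remaining VD.

    With an odd number of VDs, the median-load VD is left on its own.
    """
    ordered = sorted(vd_loads, key=vd_loads.get)  # lightest first
    pairs: list[tuple[str, ...]] = []
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        pairs.append((ordered[lo], ordered[hi]))
        lo += 1
        hi -= 1
    if lo == hi:
        pairs.append((ordered[lo],))
    return pairs


def pair_load(pair: tuple[str, ...], vd_loads: dict[str, float]) -> float:
    """Combined load a storage node would see if it hosted this pair."""
    return sum(vd_loads[vd] for vd in pair)


# Hypothetical per-VD loads (arbitrary units).
loads = {"vd1": 10.0, "vd2": 80.0, "vd3": 30.0, "vd4": 60.0}
pairs = pair_min_max(loads)
```

In this example each pair carries a combined load of 90, so placing one pair per storage node would even out the load, under the assumption that the measured loads remain stable after migration.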
A disadvantage of the data migration technique is the need to migrate the entire VD to a different storage node. The cost of migration can severely bottleneck I/O accesses to the affected storage nodes, and migration may not benefit workloads with frequent short-term fluctuations. Hence, load balancing through data migration is more a reaction to an imbalanced load than a proactive measure like ours, where a best effort is made to balance the load uniformly across all the DAs.
In Azure [20], the storage system consists of multiple clusters of storage nodes, and load balancing is done only within a cluster. Each read request is associated with a strict deadline timestamp; if the deadline cannot be met, the storage cluster returns the request to the source node that submitted it. The source node increases the deadline timestamp and resubmits the request until it is successfully processed by the storage cluster. Since Azure uses erasure coding to store data redundantly for high availability, a data object is striped and spread across several storage nodes. To handle a read I/O request, the system can either read the data fragments corresponding to the data object from one set of storage nodes or reconstruct the data object, using analytical methods, from a different set of storage nodes. The decision whether to reconstruct is made dynamically, depending on the real-time load on the storage nodes.
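The two mechanisms described above, resubmission with a relaxed deadline and the load-dependent choice between a direct read and reconstruction, can be sketched in Python. Everything specific here is an assumption made for illustration: the multiplicative `backoff` factor, the attempt limit, the rule of comparing the busiest node in each set, and all names. The text does not specify how Azure increases the deadline or how it weighs node load.

```python
def read_with_deadline(service_time_ms: float, deadline_ms: float,
                       backoff: float = 2.0,
                       max_attempts: int = 5) -> tuple[int, float]:
    """Resubmit a read with a larger deadline until the cluster can meet it.

    service_time_ms stands in for the time the cluster would need; a real
    cluster would decide this itself. Returns (attempts, final deadline).
    """
    for attempt in range(1, max_attempts + 1):
        if service_time_ms <= deadline_ms:
            return attempt, deadline_ms
        deadline_ms *= backoff  # assumed policy: relax the deadline and retry
    raise TimeoutError("request not served within max_attempts")


def choose_read_plan(direct_nodes: list[str], reconstruct_nodes: list[str],
                     load: dict[str, float]) -> str:
    """Prefer the set of nodes whose busiest member is less loaded."""
    direct_peak = max(load[n] for n in direct_nodes)
    reconstruct_peak = max(load[n] for n in reconstruct_nodes)
    return "reconstruct" if reconstruct_peak < direct_peak else "direct"


# Hypothetical utilization per storage node (0.0 to 1.0).
node_load = {"a": 0.9, "b": 0.2, "c": 0.3, "d": 0.4}
plan = choose_read_plan(["a", "b"], ["c", "d"], node_load)
```

Here node `a` is heavily loaded, so the sketch chooses reconstruction from nodes `c` and `d`, trading extra computation for lower contention.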
Zoolander [158] replicates data to provide high availability and to ensure predictable performance guarantees. When there is contention for data access, reflected in high I/O latency for the workload, the entire storage node is replicated to distribute the load and bring down the I/O latency. Although this enforces the performance guarantee, it comes at the cost of increased hardware utilization.
Pisces [142] proposes a unique approach to load balancing the storage nodes using reciprocal swaps. The crux of the algorithm is a give-and-take policy: if a tenant t takes some share of storage node N from a tenant u on N, t must give an equivalent share back to u on another storage node M. This might not be feasible in a large-scale storage system, where t and u might not share more than one storage node at all, and it places an additional constraint on the admission control algorithm. Even when it is feasible, taking some share from one tenant and giving it to another means that the reciprocal swap procedure has to continue until the entire storage system reaches equilibrium, which might necessitate reversing some actions. To avoid such a scenario, Cheetah uses a multiple-iteration procedure to minimize the possibility of overloading a storage node.
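The give-and-take policy can be made concrete with a small Python sketch. It illustrates only the reciprocal swap as described above, not the Pisces implementation; the data layout (`shares[tenant][node]`), the function name, and the share values are hypothetical.

```python
import copy

Shares = dict[str, dict[str, float]]  # shares[tenant][node] -> share of node


def reciprocal_swap(shares: Shares, t: str, u: str,
                    n: str, m: str, amount: float) -> Shares:
    """Tenant t takes `amount` from u on node n and returns it to u on node m.

    Raises ValueError when the swap is infeasible, for instance when t and u
    do not both hold shares on the two nodes (the limitation noted above).
    """
    for tenant in (t, u):
        for node in (n, m):
            if node not in shares.get(tenant, {}):
                raise ValueError(f"{tenant} holds no share on node {node}")
    if shares[u][n] < amount or shares[t][m] < amount:
        raise ValueError("not enough share available to swap")

    result = copy.deepcopy(shares)  # leave the caller's table untouched
    result[t][n] += amount
    result[u][n] -= amount
    result[t][m] -= amount
    result[u][m] += amount
    return result


# Hypothetical starting shares: both tenants hold half of each node.
before = {"t": {"N": 0.5, "M": 0.5}, "u": {"N": 0.5, "M": 0.5}}
after = reciprocal_swap(before, "t", "u", "N", "M", 0.25)
```

The swap leaves each tenant's total share and each node's total allocation unchanged, which is why it requires the two tenants to be co-located on at least two storage nodes.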