Applications that perform intensive data writes often bottleneck on slow I/O bandwidth. A typical solution is delayed writing, as in ARIES [105], which logs each update and then writes it to its final location asynchronously. The bottleneck then shifts to the logging operation: if the log records are small, the underlying storage must sustain high throughput at low latency even for small random log updates. Much research has gone into improving the logging interface, such as the append-only logging technique [106]. In this section of the survey we discuss a few important works that highlight the core issues in building an efficient logging system.
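The delayed-writing pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not ARIES itself: the class name `DelayedWriter`, the `key=value` record format, and the in-memory dictionary standing in for the on-disk data pages are all hypothetical.

```python
import os
import tempfile


class DelayedWriter:
    """Hypothetical sketch: log synchronously, apply updates in place later."""

    def __init__(self, log_path):
        self.log = open(log_path, "ab")
        self.dirty = {}  # updates that are logged but not yet written in place
        self.store = {}  # assumption: stands in for the in-place data pages

    def put(self, key, value):
        # The small synchronous append is the only I/O on the critical path.
        self.log.write(f"{key}={value}\n".encode())
        self.log.flush()
        os.fsync(self.log.fileno())
        self.dirty[key] = value

    def write_back(self):
        # Runs later, off the critical path (for example from a background task).
        self.store.update(self.dirty)
        self.dirty.clear()

    def close(self):
        self.log.close()


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
writer = DelayedWriter(log_path)
writer.put("a", "1")
writer.put("b", "2")
```

Every call to `put` pays for one small `fsync`, which is why the latency of small log appends becomes the limiting factor once the in-place writes are deferred.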
The idea of writing data to disk at the position where the disk head happens to be can be traced back to Trail [107]. Although Trail targets the problem of minimizing seek delay and rotational latency, it is not trivial to implement today. It requires accurate knowledge of disk geometry details such as rotational latency, seek latency, the number of sectors in each track, zone coding, bad-sector mapping, and other finer details. This is much harder now because of advanced disk compaction techniques and, more importantly, because disk manufacturers no longer publish the details of the disk layout, owing both to complicated disk management techniques and to market competition. Multiple prior research efforts [108–111], similar to Trail, target specific workloads using accurate disk geometry predictions. Lumb et al. [112] propose setting the NCQ length to 2 and then using the disk's seek and rotational latency to do useful background work. Beluga also uses a limited command queueing technique, but additionally builds a sophisticated pipeline that exploits the disk subsystem to the fullest extent. Another striking difference is the added burden these Trail-like approaches carry: they must maintain a map of used and free blocks on disk in order to place incoming data accurately on an unoccupied block while avoiding track-switch delay. Beluga avoids this by sweeping through the disk sequentially without leaving any holes behind. As a result, Beluga does not need to maintain any mapping of used and freed blocks.
Gallagher et al. [113] propose skipping N blocks depending on the latencies observed at each portion of the hard disk, which makes the technique extremely hardware dependent. Moreover, adopting this technique on modern disks with advanced NCQ capabilities is very time consuming. More importantly, their model avoids disk rotational latency by idling while the disk head skips the requested number of blocks. Our approach eliminates this idle time and achieves the best theoretically possible latency because the disk head never moves without doing useful work.
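To make the skip-distance idea concrete, the following back-of-the-envelope sketch estimates how many sectors rotate past the head during an observed delay. It is not the model from [113]; the function name `sectors_to_skip`, the uniform sectors-per-track assumption, and the example values (7200 RPM, 1024 sectors per track, 0.5 ms delay) are all assumptions chosen for illustration.

```python
import math


def sectors_to_skip(rpm, sectors_per_track, delay_s):
    """Sectors that pass under the head during delay_s (hypothetical model)."""
    rotations_per_second = rpm / 60.0
    sectors_per_second = rotations_per_second * sectors_per_track
    # Round up so the target sector has not already passed the head.
    return math.ceil(sectors_per_second * delay_s)


# Assumed example values, not measurements from any real drive.
skip = sectors_to_skip(rpm=7200, sectors_per_track=1024, delay_s=0.0005)
```

Because the number of sectors per track varies across zones on a real disk, an estimate like this would have to be recalibrated for each portion of the disk, which illustrates why such techniques are hardware dependent.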
The account of modern disk drive complexity given by Gim et al. [114] and the in-depth explanation of the Linux kernel storage subsystem in [115] gave us a good understanding of the complex sector layout schemes and of the difficulties associated with accurately estimating modern hard disk geometry.
Logging Disk Array [116] uses RAID technology to handle the small-write problem and an NVRAM buffer to make the cache persistent. The buffer is flushed to disk (RAID-5) periodically, once sufficient data has accumulated. Since RAID uses the stripe as its basic unit of data transfer to disk, the NVRAM buffer is structured to hold data in multiples of the stripe size. This aggregates smaller writes and writes them to disk in one shot in stripe-sized units, so that no additional overhead is incurred in the transfer. Although the latency of writing to the NVRAM buffer is very low (on the order of microseconds), flushing the NVRAM buffer to disk is not a trivial task. Even when the flush size is chosen in units of the stripe size, various other factors determine whether the disk is utilized to the best extent. This is where Beluga breaks down the performance metrics and shows how tuning certain parameters can help achieve the best results. Another important point is that NVRAM is a costly hardware resource, which can be avoided if inexpensive SATA disks can be carefully tuned to yield the same or even better results. In many situations, writing to NVRAM can yield very slow response times [117].
The log-structured file system (LFS) [118] is another major solution for handling small writes. The entire file system is organized as a sequential log, which converts writes from user applications into appends to the underlying log structure in the file system. However, logging operations require persistent writes to disk, and hence synchronous writes, which yield very low performance on a naively implemented LFS. Advanced LFS techniques such as [119–124] use NVRAM or flash to let LFS handle synchronous writes efficiently, but both NVRAM and flash are costly hardware alternatives.
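The stripe-sized aggregation attributed to Logging Disk Array [116] above can be illustrated with a small sketch. It is not the implementation from [116]: the class name `StripeBuffer`, the `flush` callback, and the 8-byte stripe size are hypothetical, and an ordinary `bytearray` stands in for the NVRAM buffer.

```python
class StripeBuffer:
    """Hypothetical sketch: aggregate small writes, flush only whole stripes."""

    def __init__(self, stripe_size, flush):
        self.stripe_size = stripe_size
        self.flush = flush          # callback that receives exactly one stripe
        self.pending = bytearray()  # assumption: stands in for the NVRAM buffer

    def write(self, data):
        self.pending += data
        # Emit as many full stripes as are available; keep the remainder buffered.
        while len(self.pending) >= self.stripe_size:
            stripe = bytes(self.pending[:self.stripe_size])
            del self.pending[:self.stripe_size]
            self.flush(stripe)


flushed = []
buf = StripeBuffer(stripe_size=8, flush=flushed.append)
buf.write(b"abc")      # too small, stays buffered
buf.write(b"defghij")  # completes one stripe, two bytes remain buffered
```

The sketch flushes as soon as a full stripe is available. A periodic flush policy, as described in the text, would instead drain several stripes at once on a timer.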
Although flash-based disks provide very high throughput and very low latency, erase cycles are very slow, and hence a flash disk's performance drops as its utilization rises. Also, the basic block size of flash ranges from kilobytes to megabytes, much larger than the sector size of typical magnetic hard disks. The erase operation in flash devices requires a larger block size for optimal results. However, a larger block size increases the latency of smaller requests, which must be aggregated to fill a block. The flash logging technique [125] uses an array of USB flash devices to provide a fast logging infrastructure. The work proposes commodity USB devices as an alternative to expensive SSD-based logging systems. The author dismisses modern magnetic disks as ill suited to small sequential writes on the basis of a naive logging implementation on SAS disks. Beluga's evaluations show that commodity hard drives can deliver performance comparable to that of expensive flash-based devices.
Phase-change memory (PCM) [126] is a faster alternative to flash-based disks, but because of its lower density and higher cost it is unlikely to be widely adopted in the near future. Mohan et al. [127] propose using PCM as the first choice for logging, since PCM is up to four orders of magnitude faster than flash-based disks [126,128], thereby guaranteeing very high throughput and very low latency.
Dynamo [129] and Cassandra [130] perform in-memory logging, and the logging system is spread across multiple machines so that if one machine crashes, data can be recovered from the others. Various techniques are used to isolate catastrophic failures and ensure high reliability. With continuing advances in network speed, data can be transferred between systems in a very short time, which keeps latency low. However, this comes at the cost of expensive RAM and high-end networking hardware. Azure [20] and HDFS [3] use journaling and append-only logging to maintain data persistence.
Dual actuator [131] proposes reducing synchronous write seek time using an accurate disk head prediction technique. It uses an additional hardware actuator and set of disk heads to service read I/O requests: one disk head actuator is dedicated to servicing write I/Os, while the other is dedicated to servicing read I/Os. However, the author assumes that disk head prediction techniques can be easily adopted, which is no longer the case with modern disk drives. Additionally, the technique requires extra hardware and hence is not applicable to existing storage devices. Sungem, in contrast, achieves near-zero seek times without any hardware alteration and without the need to predict the disk head position.