
The POSIX standard requires a file locking mechanism that allows a program to lock the whole file or a part of the file for one task before starting an I/O operation on this file. This mechanism can grant exclusive locks, typically needed for write operations, and shared locks, which multiple tasks can hold at the same time for concurrent read operations. Depending on the file system, one or more daemons are responsible for managing the locks. For example, GPFS implements distributed token management for file and byte-range locking. A GPFS client requesting a lock for a region in a file has to request a token from the GPFS client that owns a lock for a file region containing the requested region. The responsibility for the requested region is then delegated to the requesting GPFS client. Depending on the implemented strategy for shared file access, the number of request and delegate operations will grow with the number of tasks accessing the file. The number will increase linearly when applying the chunk-based container scheme, as a task only has to request a lock once for its chunk. The number of lock operations will grow much faster when applying the interleaving strategy, where a task owns a large number of small segments within the file. However, both strategies can lead to a bottleneck at large scale.
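The difference between exclusive and shared byte-range locks can be illustrated with a minimal sketch that uses the POSIX locking interface exposed by Python's `fcntl` module. The helper names `write_locked` and `read_locked` are hypothetical and chosen only for this illustration; the sketch shows the general POSIX mechanism on a Unix-like system, not how GPFS or SIONlib issue locks internally.

```python
import fcntl
import os
import tempfile


def write_locked(path, offset, data):
    """Write `data` at `offset` while holding an exclusive lock on that byte range."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Exclusive (write) lock on [offset, offset + len(data)); blocks until granted.
        fcntl.lockf(fd, fcntl.LOCK_EX, len(data), offset, os.SEEK_SET)
        try:
            os.pwrite(fd, data, offset)
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN, len(data), offset, os.SEEK_SET)
    finally:
        os.close(fd)


def read_locked(path, offset, length):
    """Read `length` bytes at `offset` while holding a shared lock on that byte range."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Shared (read) lock: several readers may hold it on the same range at once.
        fcntl.lockf(fd, fcntl.LOCK_SH, length, offset, os.SEEK_SET)
        try:
            return os.pread(fd, length, offset)
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN, length, offset, os.SEEK_SET)
    finally:
        os.close(fd)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "shared.dat")
    write_locked(demo_path, 4096, b"task1")
    print(read_locked(demo_path, 4096, 5))
```

Only the byte range that is actually accessed is locked, so tasks working on disjoint regions of the same file do not block each other at the POSIX level.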

A strategy to avoid this bottleneck is the parallelization of the lock management, which can be done at user level by partitioning the shared file into multiple physical files. File locking for each physical file is independent of the file locking of the other files and has to scale only to the limited number of tasks accessing this file. Therefore, this strategy is considered for the design of SIONlib and will be described in Section 3.4.
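As a rough illustration of this partitioning, the following sketch assigns contiguous groups of task ranks to physical files. The function names and the contiguous assignment rule are assumptions made for this example; the mapping actually used by SIONlib is described in Section 3.4 and may differ.

```python
def physical_file_for_task(task_rank, num_tasks, num_files):
    """Return the index of the physical file used by `task_rank`.

    Assumption for this sketch: ranks are assigned to files in contiguous,
    nearly equal-sized groups.
    """
    if not 0 <= task_rank < num_tasks:
        raise ValueError("task_rank out of range")
    if not 1 <= num_files <= num_tasks:
        raise ValueError("num_files must be between 1 and num_tasks")
    return task_rank * num_files // num_tasks


def tasks_per_file(num_tasks, num_files):
    """Count how many tasks share each physical file."""
    counts = [0] * num_files
    for rank in range(num_tasks):
        counts[physical_file_for_task(rank, num_tasks, num_files)] += 1
    return counts
```

With this rule, lock management for each physical file only has to handle about `num_tasks / num_files` tasks, regardless of the total job size.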

Furthermore, GPFS provides a special feature to push information about I/O access patterns directly to the lock daemons of the file system (gpfs_fcntl). This feature can be used for optimizing file locking for the chunk-based file container format as it ensures that information about position and size of the chunks is available at open time. As a result, no further request and delegate operations are needed for file locking [46].

Alignment

Most of the components of a file system on the server and compute nodes have an internal memory cache to buffer data that has to be written to disk or that was read from disk. In case of write operations, data can reside in this memory buffer as long as the corresponding client has a write lock on the data region in the file. This feature can be used to aggregate the data of small write operations in the memory cache and send it to the file system later as one big block. GPFS, for example, allocates a page cache as a memory buffer on each client, which mirrors file-system blocks in local memory. Because such blocks can only be handled in one piece, a GPFS client has to invalidate a full page in the memory cache when it has to release a write lock on a region that overlaps with the data region of this page. To synchronize page handling with the lock handling, GPFS restricts byte-range locking to the granularity of file-system blocks.
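The effect of restricting byte-range locks to block granularity can be sketched with a small helper that widens a requested byte range to the enclosing file-system blocks. The function name is hypothetical and the block size in the usage note is only an example value, not a GPFS default.

```python
def block_granular_range(offset, length, block_size):
    """Widen the byte range [offset, offset + length) to whole file-system blocks.

    Returns (start, end) with `end` exclusive. `length` must be positive.
    """
    if length <= 0 or block_size <= 0 or offset < 0:
        raise ValueError("offset must be >= 0, length and block_size must be > 0")
    first_block = offset // block_size
    last_block = (offset + length - 1) // block_size
    return first_block * block_size, (last_block + 1) * block_size
```

For instance, with a block size of 4096 bytes, a 200-byte write starting at offset 4000 touches two blocks, so the lock would have to cover the range from 0 to 8192.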

The approach of dedicating a chunk of the file container to one task guarantees that tasks do not concurrently access identical parts of a file. However, adjacent chunks may nonetheless occupy parts of the same file-system block. With write locks being assigned at the granularity of file-system blocks, this may cause lock contention when writing to these chunks. The situation is similar to the false sharing of cache lines in a multiprocessor. As depicted in Figure 3.2a, the GPFS client of the second task has to wait until the GPFS client of the first task has flushed the file-system block to disk and has released the corresponding write lock. Subsequently, the second GPFS client gets the write lock and can read the file-system block into its page cache to add data to it. Consequently, write accesses to the same file-system block are serialized through this mechanism.
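Whether a dense layout leads to such shared blocks can be checked with a short sketch. It assumes that chunk i starts directly where chunk i-1 ends; the function name and the example sizes in the usage note are hypothetical.

```python
def shared_blocks(chunk_sizes, block_size):
    """Find file-system blocks touched by more than one task in a dense layout.

    Chunks are packed back to back (no alignment). Returns a dict that maps
    each contended block index to the list of task indices touching it.
    """
    owners = {}
    offset = 0
    for task, size in enumerate(chunk_sizes):
        if size > 0:
            first = offset // block_size
            last = (offset + size - 1) // block_size
            for block in range(first, last + 1):
                owners.setdefault(block, []).append(task)
        offset += size
    return {block: tasks for block, tasks in owners.items() if len(tasks) > 1}
```

Two chunks of 6000 bytes with a block size of 4096 bytes both touch block 1, so writes to that block would be serialized; two chunks of 8192 bytes share no block.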

To avoid this limitation, the chunks have to be aligned with file-system block boundaries as shown in Figure 3.2b. This guarantees that only one task accesses a given file-system block. The file-system block can stay in the page cache as long as the caching algorithm does not evict it.
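The alignment itself amounts to rounding each chunk size up to the next multiple of the file-system block size before computing the start offset of the following chunk. A minimal sketch, with a hypothetical function name and without any container metadata that a real file format would place in front of the chunks:

```python
def aligned_chunk_offsets(chunk_sizes, block_size):
    """Return start offsets so that every chunk begins on a block boundary."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    offsets = []
    offset = 0
    for size in chunk_sizes:
        offsets.append(offset)
        # Round the chunk size up to a whole number of blocks (ceiling division).
        blocks_needed = -(-size // block_size)
        offset += blocks_needed * block_size
    return offsets
```

The price of the alignment is unused space at the end of each chunk, at most one block minus one byte per task.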

[Figure 3.2: panel (a) No alignment; panel (b) Alignment to file-system blocks. Both panels show the data chunks of task 1 and task 2 and their locks relative to the file-system block boundaries.]

Figure 3.2: No alignment of data to file-system blocks in a dense shared file leads to a serialization of data access from multiple tasks, whereas the alignment of chunks supports parallel access.

3.2 Scalability of Shared-File I/O

Furthermore, with aligned chunks, other tasks will not request write locks for this block and a GPFS client does not have to release an existing write lock. This avoids lock communication between the clients and read-modify-write cycles on a client during write time. File-system access to a shared file container with aligned chunks is therefore efficient. To verify this, we will discuss the results of a measurement on JUQUEEN comparing the I/O bandwidth of writing data to and reading data from unaligned and aligned chunks in Section 4.2.3.

The Lustre file system handles byte-range locking differently. It does not restrict lock granularity to file-system blocks. On the other hand, caching of file data is done in the file-system components with different granularities. The data blocks in the client memory cache have the same size as the memory pages of the operating system, whereas on the server side the file data is partitioned into file-system blocks of the underlying local file systems on the OSTs. Furthermore, file data is distributed over the OSTs in portions of user-defined size (stripe size). Especially with the latter partitioning, Lustre is able to delegate part of the file metadata and lock management to individual OSTs. From this perspective, an alignment of chunks to these blocks should result in good I/O performance.
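How a byte range maps to stripes can be sketched as follows. The sketch assumes a plain round-robin striping layout in which consecutive stripes are placed on consecutive targets; the function names are hypothetical, and the returned indices are positions within the file's stripe layout, not actual OST identifiers.

```python
def stripe_target(offset, stripe_size, stripe_count):
    """Index (within the file's layout) of the target holding byte `offset`.

    Assumes round-robin striping over `stripe_count` targets.
    """
    return (offset // stripe_size) % stripe_count


def targets_for_range(offset, length, stripe_size, stripe_count):
    """Sorted list of layout indices touched by [offset, offset + length)."""
    if length <= 0:
        raise ValueError("length must be positive")
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return sorted({stripe % stripe_count for stripe in range(first, last + 1)})
```

Under these assumptions, a chunk of one stripe size that starts on a stripe boundary involves a single target, whereas the same chunk shifted by half a stripe involves two.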
