3.5 File Organization
This section introduces the organization of the SIONlib file container step by step, starting with a simple, preliminary layout that is refined as new features are discussed. The file container has a substructure based on chunks, which are assigned to the application tasks. Additional chunks are used to store metadata, which describes the structure of the file container. Figure 3.8 depicts this format of the file container, which starts with a metadata block. The data chunks of the application tasks are placed according to their rank order. The size of the chunks has to be known before the file container is created. Therefore, the application tasks have to specify the maximum requested size as a parameter of the file open statement. SIONlib takes this information and reserves the corresponding space in the file. In addition, SIONlib extends each chunk so that its end position is aligned with the boundary of a file-system block. As discussed in Section 3.2.1, this prevents possible congestion when accessing chunks from different tasks.
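The alignment step described above can be illustrated with a minimal sketch. It is not SIONlib code: the function name `align_up`, the 4 KiB file-system block size, and the requested sizes are assumptions chosen for illustration only.

```python
def align_up(size: int, fs_block_size: int) -> int:
    """Round size up to the next multiple of fs_block_size (both in bytes)."""
    if size < 0 or fs_block_size <= 0:
        raise ValueError("size must be >= 0 and fs_block_size must be > 0")
    # Ceiling division written with integer arithmetic only.
    return -(-size // fs_block_size) * fs_block_size


# Hypothetical chunk sizes requested by four tasks, with an assumed
# file-system block size of 4096 bytes.
requested = [1000, 4096, 5000, 12288]
reserved = [align_up(s, 4096) for s in requested]
print(reserved)
```

If the first chunk starts on a block boundary and every reserved size is a multiple of the block size, every chunk also ends on a block boundary, so no two tasks share a file-system block.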
The metadata block that precedes the data chunks is needed to ensure that the content of the file container can be read again by an application. It stores scalar attributes and arrays with one element per task. For example, a scalar attribute stores the number of tasks that have written data to the file. Array data is needed to describe the size and the fill rate of the chunks. The number of elements of these arrays is known in advance, because the number of tasks writing to the file is required to be fixed once the file has been opened (cf. the next section). Therefore, the size of the metadata block is also known in advance, and the block can be placed at the beginning of the file. The space for this block is likewise extended to be aligned with the boundaries of file-system blocks.
The metadata block exists only once in the file, and for efficiency reasons multiple tasks should not access it concurrently. To ensure this, SIONlib aggregates the metadata on one task in its collective file open operation, using the communication layer of the application. In detail, all tasks send their requested chunk size to the master task, which is responsible for writing the metadata block, calculating the individual start address of each chunk, and returning the start positions of the chunks to the tasks. With this information, each task can advance the file pointer to the beginning of its reserved chunk. Because the file creation and the calculation of chunk positions are done during the open operation, no further collective operation is needed until the file is closed. In between, application tasks can act individually on the file container for writing and reading data. During the close operation, the master task collects the number of bytes written by each task and stores it in the metadata block. The close operation is again collective to avoid the inefficiency of having all tasks write to the metadata block concurrently.
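The calculation performed by the master task during the collective open can be sketched as follows. This is an illustration under stated assumptions, not the SIONlib implementation: the gather and scatter steps are replaced by plain Python lists, and the names `plan_layout` and `align_up`, the unaligned metadata size of 200 bytes, and the 4096-byte file-system block size are all hypothetical.

```python
from itertools import accumulate


def align_up(size: int, fs_block_size: int) -> int:
    """Round size up to the next multiple of fs_block_size."""
    return -(-size // fs_block_size) * fs_block_size


def plan_layout(requested_sizes, meta_size, fs_block_size):
    """Master-side layout: aligned metadata block, aligned chunks in rank order.

    Returns (meta_block_size, chunk_sizes, chunk_starts), where
    chunk_starts[rank] is the file offset the master sends back to that task.
    """
    meta_block = align_up(meta_size, fs_block_size)
    chunk_sizes = [align_up(s, fs_block_size) for s in requested_sizes]
    # Exclusive prefix sum: each chunk starts where the previous one ends.
    offsets = list(accumulate(chunk_sizes, initial=0))[:-1]
    chunk_starts = [meta_block + off for off in offsets]
    return meta_block, chunk_sizes, chunk_starts


# Hypothetical sizes "gathered" from three tasks at open time.
requested = [1000, 5000, 4096]
meta_block, chunk_sizes, chunk_starts = plan_layout(
    requested, meta_size=200, fs_block_size=4096
)
print(meta_block, chunk_sizes, chunk_starts)
```

In a real application, the list `requested` would be collected from all tasks and `chunk_starts` distributed back to them through the communication layer, so that only the master touches the metadata block.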
Figure 3.8: Simple and preliminary structure of the SIONlib file container. The chunks are assigned to the application tasks and their size is extended to be aligned to file-system blocks. The metadata block stores information about chunk sizes and fill rates.

The file format presented so far requires that chunk sizes are defined when the file is opened. However, the need to know the total amount of data each task writes is too restrictive for applications that cannot compute the data size in advance. With an extension of the file format, this restriction can be relaxed to the requirement to know the maximum amount of data written in one piece by each task. The file format extension leads to the layout depicted in Figure 3.9. The file container is now organized in blocks, with each block containing one chunk per task. If a task wants to write more bytes than are left in the current chunk, it can request a new chunk of the same size. As chunks have a predefined size, the size of a block is also known beforehand, and each task can compute the positions of subsequent chunks on its own without communicating with other tasks. Note that this may create substantial gaps in the file container if only a subset of the tasks asks for additional chunks. However, since file systems typically do not allocate space for empty file-system blocks, these blocks exist only on the logical level and not on disk. To avoid their physical materialization, for example, when the file container is copied, the file can be defragmented in a post-processing step with tools provided by SIONlib.

Figure 3.9: Extended structure of the SIONlib file container. The chunks are now organized in blocks, which can be repeated multiple times in the file. A second meta block at the end of the file is needed to store metadata with variable size.
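The local position calculation can be sketched in a few lines. The sketch assumes that each task knows the start offset of its chunk in the first block and the total block size (the sum of all aligned chunk sizes) after the open operation; the function name `chunk_start` and the numeric values are hypothetical.

```python
def chunk_start(first_chunk_start: int, block_size: int, block_index: int) -> int:
    """File offset of a task's chunk in the given block (block_index starts at 0).

    Every block has the same internal layout, so the chunk of a task sits
    at the same relative position in each block.
    """
    if block_index < 0:
        raise ValueError("block_index must be >= 0")
    return first_chunk_start + block_index * block_size


# Assumed layout: aligned chunk sizes of three tasks and the start
# offsets of their chunks in the first block.
chunk_sizes = [4096, 8192, 4096]
block_size = sum(chunk_sizes)
first_starts = [4096, 8192, 16384]

# Task 1 requests a second and a third chunk without any communication.
second = chunk_start(first_starts[1], block_size, 1)
third = chunk_start(first_starts[1], block_size, 2)
print(second, third)
```

If only task 1 requests these additional chunks, the regions that the other tasks would occupy in the later blocks remain unwritten, which corresponds to the logical gaps mentioned above.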
Because SIONlib needs to store metadata indicating the space used in each chunk without knowing the total number of blocks in advance, the first, fixed-size metadata block cannot be used for this purpose. Instead, SIONlib allocates a second metadata block at the end of the file in the collective file close operation. In this block, SIONlib stores the number of chunks per task and the space occupied by data in each of the chunks. As a consequence, appending data to a SIONlib container beyond the initially allocated space after it has been closed would require updating and rewriting the second metadata block. Although this feature is not required for SIONlib, adding it would not pose a fundamental design problem.
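The bookkeeping that feeds the second metadata block can be sketched as follows. The class `ChunkUsage` and the dictionary layout of the collected metadata are assumptions made for illustration; the sketch only tracks sizes and does not perform any file I/O.

```python
class ChunkUsage:
    """Per-task record of how many bytes are used in each of its chunks."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.bytes_used = [0]  # one entry per chunk, starting with block 0

    def write(self, nbytes: int) -> int:
        """Account for one piece of data; return the block index it lands in."""
        if nbytes > self.chunk_size:
            raise ValueError("a single piece must fit into one chunk")
        if self.bytes_used[-1] + nbytes > self.chunk_size:
            # Not enough room left: move on to a new chunk in the next block.
            self.bytes_used.append(0)
        self.bytes_used[-1] += nbytes
        return len(self.bytes_used) - 1


# Two hypothetical tasks with the same chunk size of 4096 bytes.
tasks = [ChunkUsage(4096), ChunkUsage(4096)]
blocks = [tasks[0].write(n) for n in (3000, 2000, 4096)]
tasks[1].write(100)

# At close time, the master would collect this information from all tasks.
second_meta = {
    "chunks_per_task": [len(t.bytes_used) for t in tasks],
    "bytes_used": [t.bytes_used for t in tasks],
}
print(blocks, second_meta)
```

The size of `bytes_used` depends on how many chunks each task requested, which is why this information cannot be placed in the fixed-size metadata block at the beginning of the file.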
As discussed in Section 3.4, SIONlib shared file containers should be divisible into multiple physical files to improve scalability and to exploit the hardware or software parallelism that is available between the application and the disks. Therefore, the SIONlib file format is extended further to offer the option of distributing the chunks across a user-defined number of physical files (cf. Figure 3.10). Each task is still mapped onto a single physical file, but two tasks may now be mapped onto different physical files. The first physical file stores additional data that is needed to manage multi-files (e.g., the number of multi-files) in its first metadata block. In addition, a mapping table is added to the second metadata block of the first file. For each task, this table contains the number of the physical file and the local rank number within that file. Each file is complete in terms of the SIONlib file format: a multi-file contains the two metadata blocks and stores a subset of the chunks. This allows SIONlib to dump metadata or to read file data independently for each of the multi-files. To support this, the first metadata block of each file additionally maintains a list of global rank numbers, indicating the application tasks that have written data to chunks of this file.
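The two lookup structures described above can be sketched as follows. The text does not specify how tasks are distributed across the physical files, so the round-robin assignment used here is purely an assumption, as is the function name `build_mapping`.

```python
def build_mapping(num_tasks: int, num_files: int):
    """Return (mapping, ranks_per_file) for an assumed round-robin distribution.

    mapping[global_rank] is (file_number, local_rank), as in the mapping
    table of the first file. ranks_per_file[file_number] lists the global
    ranks stored in that file, as kept in each file's first metadata block.
    """
    if num_tasks < 1 or not 1 <= num_files <= num_tasks:
        raise ValueError("need 1 <= num_files <= num_tasks")
    mapping = [(rank % num_files, rank // num_files) for rank in range(num_tasks)]
    ranks_per_file = [[] for _ in range(num_files)]
    for rank, (file_no, _local_rank) in enumerate(mapping):
        ranks_per_file[file_no].append(rank)
    return mapping, ranks_per_file


mapping, ranks_per_file = build_mapping(num_tasks=5, num_files=2)
print(mapping)
print(ranks_per_file)
```

The mapping table answers the question "where is the data of global rank r?", whereas the per-file rank list allows a single physical file to be interpreted without consulting the first file.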
The use of multi-files with SIONlib is optional. Applications are able to use multi-file containers or single-file containers without modifying their code. This means that, in both cases, the SIONlib file is identified by the name given in the file open operation. Therefore, the first physical file is stored under the originally specified name, whereas a consecutive number is appended to the filenames of the other physical files.
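The naming scheme can be sketched as follows. The text only states that a consecutive number is appended; the separator and the six-digit zero-padded suffix used here are assumptions, as are the function name `physical_file_names` and the example filename.

```python
def physical_file_names(name: str, num_files: int) -> list[str]:
    """Names of the physical files belonging to one container.

    The first file keeps the user-specified name; the suffix format of
    the remaining files is an assumed convention for this sketch.
    """
    if num_files < 1:
        raise ValueError("num_files must be >= 1")
    return [name] + [f"{name}.{i:06d}" for i in range(1, num_files)]


print(physical_file_names("checkpoint.sion", 3))
```

Because the first file always carries the original name, an application can open a single-file or a multi-file container with the same call and the same filename.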