
Most modern NICs provide on-chip support for RDMA operations and address translation, which enables scatter/gather DMA. Scatter/gather operations, also referred to as vectored I/O, have two major benefits: atomicity and efficiency. Atomicity means that one process can write into or read from a set of physical buffers, which can be scattered throughout the physical memory, without the risk that another process might perform I/O on the same data. Efficiency improves because one vectored I/O operation can replace multiple ordinary reads or writes and therefore reduces the overhead. Scatter/gather DMA controllers provide the hardware support for scatter/gather I/O. To initiate such an operation, a controller needs an input modifier, also known as a scatter/gather list, to offload the transfer to the NIC.

As described in section 6.2.2.3, the payload of an LNET message can consist of up to 256 pages. It is desirable to transfer the payload in one operation instead of multiple RDMA reads or writes. Other LNDs, such as the O2IB LND, implement vectored I/O by mapping the attached memory regions onto so-called scatterlists. Within the kernel, a buffer to be used in a scatter/gather DMA operation is represented by an array of one or more scatterlist structures. As presented in section 3.2.5, the Extoll design features the address translation unit, which can be utilized to map page lists and contiguous kernel buffers into the Extoll address space. This functionality is used to provide scatter/gather DMA over Extoll.

The remainder of this section is organized as follows. First, the term physical buffer list is introduced, followed by an overview of how Infiniband utilizes so-called scatter/gather elements to support vectored I/O. The last part of this section focuses on the design of scatter/gather DMA support for Extoll and its limitations.

6.4.1 Memory Management

In the context of RDMA-enabled NICs, memory regions refer to contiguous memory areas that have been pinned in main memory and registered with a NIC. Such non-shared memory regions are also called physical buffer lists and consist of page or block lists, as depicted in Figure 6.11. NICs can access these physical buffers by using their physical addresses. Another important characteristic is that they cannot be swapped out of main memory, which enables scatter/gather transfers.

Figure 6.11: Physical buffer lists. (Page list: Page 0, Page 1, ..., Page N-1, Page N, with page size = 2^x; block list: Block 0, Block 1, ..., Block N-1, Block N.)

For page lists, the page size has to be an integral power of two and all pages have to have the same size. The data can start at an offset into the first page, referred to as first byte offset, and can end on a non-page boundary, which means that the last page can be partially filled. Pages do not have to be contiguous in memory. To perform scatter/gather I/O with page lists, the following input modifiers are needed: the page size, the first byte offset, the length, and the address list of the pages.
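The relation between these four modifiers can be sketched in a few lines of C. This is only an illustration under stated assumptions: the struct `page_list_desc` and the helper names are hypothetical and belong to no NIC API; they simply restate the rules above (power-of-two page size, data starting at a first byte offset, a possibly partially filled last page).

```c
#include <stddef.h>

/* Hypothetical container for the page-list input modifiers.
 * The address list itself is omitted; only its length is kept. */
struct page_list_desc {
    size_t page_size;          /* must be an integral power of two */
    size_t first_byte_offset;  /* where the data starts in the first page */
    size_t length;             /* payload length in bytes */
    size_t num_pages;          /* number of entries in the address list */
};

/* Returns 1 if n is a nonzero power of two, 0 otherwise. */
int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Number of pages touched by `length` bytes that start `offset` bytes
 * into the first page. Returns 0 for invalid input. */
size_t pages_spanned(size_t page_size, size_t offset, size_t length)
{
    if (!is_power_of_two(page_size) || offset >= page_size || length == 0)
        return 0;
    return (offset + length + page_size - 1) / page_size;
}

/* Offset within the last page at which the data ends (exclusive).
 * Assumes valid input; a result equal to page_size means the data
 * ends exactly on a page boundary. */
size_t last_page_end(size_t page_size, size_t offset, size_t length)
{
    size_t end = (offset + length) & (page_size - 1);
    return end == 0 ? page_size : end;
}

/* Checks that the address list length matches the other modifiers. */
int page_list_valid(const struct page_list_desc *d)
{
    return d->num_pages != 0 &&
           d->num_pages == pages_spanned(d->page_size,
                                         d->first_byte_offset, d->length);
}
```

Note that a nonzero first byte offset can add one page to the list: with 4 KB pages, 4096 bytes starting at offset 1 touch two pages instead of one.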

For block lists, the pages on which the blocks reside need to have the same size. The block size itself is arbitrary and depends on the sizes supported by the NIC. As for the data boundaries, the same rules as for page lists apply: the data can start at an offset into the first block and can end at an offset into the last block. The dashed lines in Figure 6.11 outline the block versus data boundaries. The following modifiers are needed for scatter/gather transfers: the block size, the first byte offset, the length, and the address list of the blocks.

Depending on the underlying NIC technology, two different types of address translations can be distinguished: onloading and offloading. In case of onloading, the address translation is moved to the CPU and handled by the driver software. In case of offloading, the hardware typically has a dedicated controller that is able to gather data out of the memory onto the wire. Offloading bypasses the operating system and reduces the load on the CPUs.

6.4.2 Infiniband Verbs and Scatter/Gather Elements

As previously described in section 3.4.1 about Infiniband, work requests are placed onto a queue pair and can be categorized into send and receive work requests. When the request processing is completed, a work completion (WC) entry can optionally be placed onto a completion queue (CQ), which is associated with the work queue.

Figure 6.12: Relation of work requests, scatter/gather elements, main memory, memory regions and protection domain [176]. (Within a protection domain (PD), each ibv_send_wr work request carries the fields wr_id, sg_list, num_sge, and next; each ibv_sge entry carries an address, a length, and an Lkey referring to a registered ibv_mr memory region.)

Scatter/gather elements (SGEs) define the memory address to write to or read from and are associated with a work request. An SGE is a pointer to a memory region, which has been pinned through a protection domain (PD) and can be accessed by an HCA for read and write operations. Typically, a memory region is a contiguous set of memory buffers that have been registered with an HCA. The registration of a memory region causes the OS to provide the HCA with the virtual-to-physical mapping of that particular region, and also pins the memory, which means that the OS cannot swap the memory out onto secondary storage. Among other things, a successful memory registration returns two objects called Lkey and Rkey, which need to be used when accessing memory regions. The key pair provides a means of authentication. The Lkey (local key) can be used to access local memory regions, while the Rkey (remote key) needs to be sent to remote peers so that they can directly access a local memory region through RDMA operations. As already mentioned, a memory region belongs to a protection domain, which provides an effective binding between QPs and memory regions. PDs can be seen as an aggregating entity. Figure 6.12 presents a detailed overview of the relation between work requests, SGEs, main memory, and protection domains.
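The structure of a chain of work requests and their SGEs can be sketched as follows. The two structs are simplified, hypothetical stand-ins that only mirror the fields named in Figure 6.12 (wr_id, sg_list, num_sge, next, addr, length, Lkey); they are not the libibverbs definitions of `ibv_send_wr` and `ibv_sge`, which contain further members. The helpers merely add up the bytes that each work request, and the whole chain, would gather.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a scatter/gather element. */
struct sge {
    uint64_t addr;    /* start address of the buffer */
    uint32_t length;  /* number of bytes at addr */
    uint32_t lkey;    /* local key of the memory region */
};

/* Simplified stand-in for a send work request. */
struct send_wr {
    uint64_t        wr_id;    /* user-defined request identifier */
    struct send_wr *next;     /* next request in the chain, or NULL */
    struct sge     *sg_list;  /* array of scatter/gather elements */
    int             num_sge;  /* number of entries in sg_list */
};

/* Bytes gathered by a single work request. */
uint64_t wr_payload_length(const struct send_wr *wr)
{
    uint64_t total = 0;
    for (int i = 0; i < wr->num_sge; i++)
        total += wr->sg_list[i].length;
    return total;
}

/* Bytes gathered by a whole chain of work requests. */
uint64_t chain_payload_length(const struct send_wr *head)
{
    uint64_t total = 0;
    for (const struct send_wr *wr = head; wr != NULL; wr = wr->next)
        total += wr_payload_length(wr);
    return total;
}
```

Each SGE carries its own Lkey, so a single work request can gather from buffers that live in different memory regions, as the figure indicates.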

HCAs have an on-chip scatter/gather DMA controller that enables the gathering of data (page lists and block lists) out of the memory onto the wire in a single DMA transaction. This means that scatter/gather I/O can be completely offloaded to the HCA. This feature is utilized by the O2IB LND for bulk data transmissions.

6.4.3 Scatter/Gather DMA Operation Support for Extoll

Recall from section 3.2.5 that the ATU acts as an MMU for the Extoll NIC, especially for the RMA unit, and is therefore suitable to provide scatter/gather support for RDMA operations through the RMA unit. By default, the ATU kernel module allocates 128 GATs with an NLP size of 4 KB. Each GAT can map up to 2^18 NLPs, which translates to 2^18 * 4 KB = 1 GB of mappable main memory per GAT. The ATU provides address translation offloading for two types of memory regions: page lists and contiguously allocated kernel virtual buffers.
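The capacity arithmetic can be checked with a small C sketch. The constant and function names are hypothetical and not taken from the ATU kernel module; only the numbers (128 GATs, 2^18 NLPs per GAT, 4 KB per NLP) come from the text. The total of 128 GB simply follows from multiplying the per-GAT capacity by the default number of GATs.

```c
#include <stdint.h>

/* Default values quoted in the text; the names are illustrative only. */
#define NLP_SHIFT         12u   /* 4 KB per NLP = 2^12 bytes */
#define NLPS_PER_GAT_LOG2 18u   /* 2^18 NLPs per GAT */
#define NUM_GATS          128u

/* Mappable bytes per GAT: 2^18 * 2^12 = 2^30 bytes (1 GB). */
uint64_t gat_capacity_bytes(void)
{
    return (uint64_t)1 << (NLPS_PER_GAT_LOG2 + NLP_SHIFT);
}

/* Mappable bytes across all default GATs: 128 * 1 GB. */
uint64_t atu_capacity_bytes(void)
{
    return NUM_GATS * gat_capacity_bytes();
}

/* Number of 4 KB NLPs needed to cover `length` bytes that start
 * on an NLP boundary (rounds up). */
uint64_t nlps_for_length(uint64_t length)
{
    return (length + (((uint64_t)1 << NLP_SHIFT) - 1)) >> NLP_SHIFT;
}
```

For instance, a 1 MB buffer that starts on an NLP boundary occupies 256 of the 2^18 entries of a single GAT.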

The payload of an LNET message is limited by the LNET MTU, which is 1 MB per transmission, and can comprise up to 256 pages with a page size of 4 KB. The payload is described by a memory descriptor, which points to an associated buffer that is allocated utilizing GFP (get free pages) flags. The GFP flags ensure that the returned buffer consists of pages, but do not pin the buffer in main memory. The buffer consists either of an array of pages (an array of lnet_kiov_t entries) or of a contiguously allocated kernel virtual buffer of type struct kvec, which can be translated into a scatterlist consisting of pages. This means that the LNET design aligns well with the capabilities of the ATU design, which indicates that for Lustre the address translation can be completely offloaded to the Extoll NIC.
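The payload limits can be illustrated with a user-space sketch. The struct `kiov_entry` is a hypothetical stand-in loosely modelled on the idea of a page fragment (a page, a length, and an offset into the page); it is not the Lustre definition of `lnet_kiov_t`, and the constants and the function name are assumptions made for this example.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* 4 KB pages, as stated in the text */
#define SKETCH_MAX_PAGES 256u   /* 256 pages * 4 KB = 1 MB LNET MTU */

/* Hypothetical page fragment: `len` bytes starting `offset` bytes
 * into the page referenced by `page`. */
struct kiov_entry {
    void        *page;    /* stands in for a struct page pointer */
    unsigned int len;
    unsigned int offset;
};

/* Returns the total payload length in bytes, or -1 if the fragment
 * list is empty, has more than 256 entries, or contains a fragment
 * that does not fit inside its page. Since every fragment is at most
 * one page long, a valid list can never exceed 1 MB. */
long kiov_payload_length(const struct kiov_entry *kiov, unsigned int n)
{
    unsigned long total = 0;

    if (n == 0 || n > SKETCH_MAX_PAGES)
        return -1;
    for (unsigned int i = 0; i < n; i++) {
        if (kiov[i].offset >= SKETCH_PAGE_SIZE ||
            kiov[i].len > SKETCH_PAGE_SIZE - kiov[i].offset)
            return -1;
        total += kiov[i].len;
    }
    return (long)total;
}
```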

Within the scope of this work, the Extoll kernel API has been extended to provide ATU memory registration services for kernel modules such as LNDs. There are two functions available for memory registration, one for scatterlists and one for page lists. Both return a software NLA, which can be used to build RMA software descriptors. In addition, a de-registration function has been implemented, which expects the software NLA, the corresponding VPID, and the number of mapped pages as parameters. For page lists, for example, the ATU expects the following input modifiers in order to perform a correct address translation: the VPID of the kernel process, the pointer to the first entry of type struct page in the page list, the number of list entries, and the number of bytes (also called stride) that need to be added to reach the next entry of type struct page in the list.
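The role of the stride parameter can be illustrated with a short sketch. It does not reproduce the Extoll kernel API, whose function names and signatures are not given here; `struct page` is a dummy user-space stand-in for the kernel type, and `struct frag` and `strided_page` are hypothetical. The point is only that a stride lets the page pointers be read in place from an array of larger records, without first copying them into a dense array.

```c
#include <stddef.h>

/* Dummy stand-in for the kernel's struct page. */
struct page { int placeholder; };

/* Hypothetical record that embeds a page pointer among other fields. */
struct frag {
    struct page *page;
    unsigned int len;
    unsigned int offset;
};

/* Returns the i-th page pointer of a strided list. `first` points to
 * the first page pointer, and `stride` is the number of bytes between
 * two consecutive page pointers. */
struct page *strided_page(struct page *const *first, size_t stride, size_t i)
{
    const char *p = (const char *)first + i * stride;
    return *(struct page *const *)p;
}
```

For an array of `struct frag`, the call would pass `&frags[0].page` and a stride of `sizeof(struct frag)`; for a dense array of page pointers, the stride is `sizeof(struct page *)`.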

When using the Extoll kernel API to register memory regions, the requesting process needs to make sure that the provided buffer is pinned in main memory so that it cannot be reclaimed by the OS and swapped out onto secondary storage. This is a necessary requirement when working with physical addresses. When memory is swapped out before the RMA operation has completed, the address translation results in a general protection error, which ultimately leads to a compute node failure.

In general, besides the support for page lists and contiguous buffer space, it is desirable to provide scatter/gather DMA operations for block lists. However, the current ATU design does not support address translation offloading for such physical buffer lists. Supporting them requires an additional piece of kernel code that handles such buffer types. One idea is to copy memory blocks, e.g., small fragments, into contiguously allocated buffer space and then map this buffer to an NLA.
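The copy step of this idea can be sketched in user-space C. This is an assumption-laden illustration, not the kernel code the text calls for: `struct mem_block` and `gather_blocks` are hypothetical names, the blocks are ordinary virtual-memory buffers, and the subsequent mapping of the contiguous buffer to an NLA is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one block of a block list. */
struct mem_block {
    const void *addr;  /* start of the block's data */
    size_t      len;   /* number of bytes in the block */
};

/* Copies all blocks back to back into the contiguous buffer `dst`.
 * Returns the number of bytes copied, or 0 if `dst` is too small
 * (in which case nothing is copied). */
size_t gather_blocks(void *dst, size_t dst_len,
                     const struct mem_block *blocks, size_t n)
{
    size_t total = 0;
    size_t off = 0;

    /* First pass: make sure everything fits before touching dst. */
    for (size_t i = 0; i < n; i++) {
        if (blocks[i].len > dst_len - total)
            return 0;
        total += blocks[i].len;
    }
    /* Second pass: pack the blocks contiguously. */
    for (size_t i = 0; i < n; i++) {
        memcpy((char *)dst + off, blocks[i].addr, blocks[i].len);
        off += blocks[i].len;
    }
    return total;
}
```

The trade-off is an extra memory copy on the CPU, which is why the text suggests this approach for small fragments.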
