Chapter III Proposal for improving the current situation
3.2 S.L.P. Method
3.2.5. Determination of the space required for the areas of each department of
The OpenFabrics Enterprise Distribution (OFED) [108] is an open-source software stack offering different network adapter drivers for Infiniband and Ethernet devices, middle/upper layer kernel core modules and related libraries and utilities for RDMA and kernel bypass applications. Figure 5.5 provides an overview of the supported protocols and interconnects.
IP over Infiniband. IP over InfiniBand (IPoIB) [130, 131, 132] is a protocol that
specifies how to encapsulate and transmit IPv4/IPv6 and Address Resolution Protocol (ARP) packets over Infiniband. IPoIB therefore enables IP-based legacy applications to run unmodified on an Infiniband fabric. IPoIB is implemented using either the
unreliable datagram (UD) mode [130] or the reliable connected (RC) mode [132]. In
Linux, the ib_ipoib kernel driver implements this protocol by creating a network interface [84, chapter 17] for each Infiniband port on the system. This way, an HCA acts like an Ethernet NIC. Every such IPoIB network interface has a 20-byte link-layer address, which may cause problems for software that expects the “standard” Ethernet MAC address of 6 bytes (48 bits). Also, the IPoIB protocol does not fully utilize the HCA’s capabilities: it does not implement any kernel bypass, reliability, RDMA, or
5 RDMA-Accelerated TCP/IP Communication
Figure 5.5: Overview of interconnects and protocols in the OpenFabrics stack. (Diagram layers: application/middleware, Sockets API and Verbs API, user space, kernel space, hardware. Columns: 1/10 GigE, 10/40/100 GigE-TOE, IPoIB, SDP, RSockets, iWARP, RoCEv1/v2, IB Native.)
splitting and assembly of messages into packets. The network traffic traverses the normal IP stack, which means that a system call is required for every message and the host CPU must break the data up into packets.
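To make the 20-byte address concrete, the following minimal sketch splits an IPoIB link-layer address into its parts. It assumes the layout of a one-byte flags field, a 24-bit queue pair number (QPN), and a 16-byte port GID; the function name `parse_ipoib_hwaddr` and the sample address are hypothetical and serve only as an illustration.

```python
import ipaddress


def parse_ipoib_hwaddr(text: str):
    """Split a colon-separated IPoIB link-layer address into (flags, qpn, gid).

    Assumed layout: 1 byte flags, 3 bytes QPN, 16 bytes GID (20 bytes total).
    """
    raw = bytes.fromhex(text.replace(":", ""))
    if len(raw) != 20:
        raise ValueError(f"expected 20 bytes, got {len(raw)}")
    flags = raw[0]
    qpn = int.from_bytes(raw[1:4], "big")
    # A GID is 128 bits wide, so IPv6 notation is a convenient way to print it.
    gid = ipaddress.IPv6Address(raw[4:20])
    return flags, qpn, gid


# Hypothetical sample address, not taken from a real system.
SAMPLE = "80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:0f:1a:2b"
flags, qpn, gid = parse_ipoib_hwaddr(SAMPLE)
print(hex(flags), qpn, gid)
```

A tool that assumes a 6-byte Ethernet MAC address would reject or truncate such a value, which is the kind of problem the paragraph above refers to.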
In recent years, there have been efforts to cope with the limitations of IPoIB. The first attempt was the introduction of user space Ethernet verbs [133], which bypass the TCP/IP stack for Ethernet frames. A similar approach is user space IPoIB packet processing over Verbs [134]. An acceleration of the IPoIB kernel module itself has also been proposed [135], including interrupt moderation and RDMA capabilities. The latest approach introduces Ethernet over Infiniband [136] as a replacement for the IPoIB kernel module. It decouples the Ethernet link layer from the underlying Infiniband network, which is required for virtualization.
Sockets Direct Protocol. The Sockets Direct Protocol (SDP) [137], included as an annex of the Infiniband specification, was a first attempt to implement a transport-agnostic protocol to support TCP-like stream sockets over an RDMA-enabled network fabric. The initial implementation of SDP used a buffer copy method similar to BSD sockets, referred to as BCopy mode, and provided support for zero-copy data transfers for asynchronous I/O operations. Later, the zero-copy mode, referred to as ZCopy, was extended to support the synchronous socket calls send() and recv(). The first attempt to implement a ZCopy mode [138] pinned and registered
the application buffers in the SDP implementation and supported two different modes: Read ZCopy and Write ZCopy. The ZCopy mode utilized Infiniband’s Fast
Memory Region mechanism to transfer data between two HCAs. However, it did
not allow simultaneous send requests: a send() call would block until the data was received. This blocking behavior was necessary to prevent the application from modifying the user memory involved in the data transfer while the transfer was still in progress.
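The buffer-reuse hazard that forces a zero-copy send() to block can be illustrated with a toy model. The sketch below involves no RDMA and is not the SDP implementation: the class `ToyTransport` and its methods are hypothetical, and `flush()` merely stands in for the moment at which the adapter reads the queued memory.

```python
class ToyTransport:
    """Toy stand-in for a socket layer; `wire` collects what the peer would see."""

    def __init__(self):
        self.pending = []  # buffers queued but not yet transmitted
        self.wire = []     # payloads as they arrive at the peer

    def send_bcopy(self, buf: bytearray) -> None:
        # Buffered copy: snapshot the data so the caller may reuse buf at once.
        self.pending.append(bytes(buf))

    def send_zcopy_nonblocking(self, buf: bytearray) -> None:
        # Zero copy without blocking: keep only a reference to the caller's memory.
        self.pending.append(memoryview(buf))

    def flush(self) -> None:
        # Transmission time: the queued memory is read only now.
        for item in self.pending:
            self.wire.append(bytes(item))
        self.pending.clear()


def demo(zero_copy: bool) -> bytes:
    transport = ToyTransport()
    buf = bytearray(b"hello")
    if zero_copy:
        transport.send_zcopy_nonblocking(buf)
    else:
        transport.send_bcopy(buf)
    buf[0] = ord("J")  # the application reuses its buffer right after send
    transport.flush()
    return transport.wire[0]


print(demo(zero_copy=False))  # the copy protects the payload
print(demo(zero_copy=True))   # the payload reflects the later modification
```

In the zero-copy case the peer receives the modified bytes, which is why a ZCopy send() either has to block until the transfer completes or needs an additional protection mechanism.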
The Asynchronous Zero-Copy SDP (AZ-SDP) [139] allows multiple simultaneous send requests and introduces the mprotect() call as a safeguard mechanism, which triggers a segmentation fault whenever the application modifies the memory region of an ongoing transfer. This protection mechanism results in an additional kernel trap for every data transfer, which forces the user application to block or to copy the memory area. The main objective of SDP is to run unmodified sockets applications. This costly mechanism is therefore needed, since applications can reuse memory as soon as the sockets library returns control to them. SDP has since been deprecated and superseded by several user space libraries that provide the same functionality.
RSockets. The RSockets [140] library implements a user space protocol for byte streaming transfers over RDMA, which provides parity with standard TCP-based sockets. It comes with its own blocking API whose calls, such as rsend() and rrecv(), mirror the standard socket calls, and it typically performs buffer copies on both sides. Existing socket applications can utilize RSockets by using the pre-loadable conversion library, which intercepts the standard socket calls and maps them to RSockets. Zero-copy functionality is available as a set of extra functions on top of RSockets, i.e., riomap() and riowrite().
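The pre-loading approach can be sketched as follows: an unmodified application is started with LD_PRELOAD pointing at the conversion library. The helper name `rsockets_preload_env` is hypothetical, and the library path used below is an assumption that varies between distributions; only the LD_PRELOAD mechanism itself is taken from the text.

```python
import os


def rsockets_preload_env(preload_lib: str, base_env=None) -> dict:
    """Return a copy of the environment with preload_lib prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep libraries that are already preloaded; entries are colon-separated.
    env["LD_PRELOAD"] = f"{preload_lib}:{existing}" if existing else preload_lib
    return env


# Assumed install location of the pre-loadable library; adjust for your system.
ASSUMED_LIB = "/usr/lib64/rsocket/librspreload.so"
env = rsockets_preload_env(ASSUMED_LIB, base_env={"PATH": "/usr/bin"})
print(env["LD_PRELOAD"])

# Launching an unmodified sockets application would then look like this
# ("my_socket_app" is a placeholder):
#   import subprocess
#   subprocess.run(["my_socket_app"], env=rsockets_preload_env(ASSUMED_LIB))
```

Because the interception happens at load time, the application itself needs no source changes, at the cost of the buffer copies mentioned above.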
Internet Wide Area RDMA Protocol. The Internet Wide Area RDMA Protocol (iWARP) enables RDMA over TCP/IP infrastructures, including zero-copy and protocol offload, if the underlying NIC provides RDMA functionality. iWARP is layered on the congestion-aware protocols TCP and the Stream Control Transmission
Protocol (SCTP) [141] and is defined by a set of RFCs, specifically the RDMA Protocol (RDMAP) [142], Direct Data Placement (DDP) protocol [143], Marker PDU Aligned (MPA) Framing [144], and DDP over SCTP [145]. DDP is the main
component in the protocol, which permits the actual zero-copy transmission. iWARP only supports reliable connected transport services and is not able to perform RDMA multicasts. Applications implementing the Verbs API can utilize iWARP.
RDMA over Converged Ethernet. In contrast to iWARP, RDMA over Converged
Ethernet (RoCE) [146] is an InfiniBand Trade Association standard designed to
provide Infiniband communication on Ethernet networks. RoCE preserves the InfiniBand Verbs semantics together with its transport and network protocols and replaces the InfiniBand link and physical layers with those of Ethernet. RoCE packets are regular Ethernet frames with an EtherType allocated by IEEE, which indicates that the next header is a RoCE global route header. They do not carry an IP header and therefore cannot be routed across the boundaries of Ethernet L2 subnets. RoCE version 2 (RoCEv2) is a straightforward extension of the RoCE protocol that involves a simple modification of the RoCE packet format. Instead of the global route header, RoCEv2 packets carry an IP header, which allows traversal of IP L3 routers, and a UDP header that serves as a stateless encapsulation layer for the RDMA transport protocol packets over IP.
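The difference between the two packet formats can be sketched as a small frame classifier. The sketch assumes untagged Ethernet frames (no VLAN tag) and IPv4 only, the RoCE EtherType 0x8915, and UDP destination port 4791 for RoCEv2; the function and helper names are hypothetical, and the frames are synthetic.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # EtherType assigned to RoCE (version 1)
IPV4_ETHERTYPE = 0x0800
UDP_PROTO = 17
ROCE_V2_UDP_PORT = 4791     # UDP destination port registered for RoCEv2
ETH_HLEN = 14


def classify_frame(frame: bytes) -> str:
    """Return 'RoCEv1', 'RoCEv2', or 'other' for an untagged Ethernet frame."""
    if len(frame) < ETH_HLEN:
        return "other"
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCEv1"  # global route header follows, no IP header
    if ethertype == IPV4_ETHERTYPE and len(frame) >= ETH_HLEN + 20:
        ihl = (frame[ETH_HLEN] & 0x0F) * 4  # IPv4 header length in bytes
        proto = frame[ETH_HLEN + 9]
        if proto == UDP_PROTO and len(frame) >= ETH_HLEN + ihl + 8:
            (dst_port,) = struct.unpack_from("!H", frame, ETH_HLEN + ihl + 2)
            if dst_port == ROCE_V2_UDP_PORT:
                return "RoCEv2"  # routable: IP + UDP encapsulation
    return "other"


def ethernet_header(ethertype: int) -> bytes:
    # Zeroed destination and source MAC addresses, then the EtherType.
    return bytes(6) + bytes(6) + struct.pack("!H", ethertype)


def ipv4_udp_headers(dst_port: int) -> bytes:
    # Minimal 20-byte IPv4 header (checksum left at 0) followed by a UDP header.
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, UDP_PROTO, 0,
                     bytes(4), bytes(4))
    udp = struct.pack("!HHHH", 49152, dst_port, 8, 0)
    return ip + udp


print(classify_frame(ethernet_header(ROCE_V1_ETHERTYPE) + bytes(40)))
print(classify_frame(ethernet_header(IPV4_ETHERTYPE) + ipv4_udp_headers(4791)))
```

Only the second frame carries an IP header, so only that one could cross an L3 router.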