VOQsw: A Methodology to Reduce HOL Blocking in InfiniBand Networks

M.E. Gómez, J. Flich, A. Robles, P. López, and J. Duato

Department of Computer Engineering, Universidad Politécnica de Valencia, P.O.B. 22012, 46071 - Valencia, SPAIN

E-mail: {megomez,jflich,arobles,plopez,jduato}@gap.upv.es

Abstract

InfiniBand is a new switch-based standard interconnect for communication between processor nodes and I/O devices as well as for interprocessor communication.

InfiniBand architecture allows switches to support up to 15 virtual lanes per port for data traffic. To route packets through a given virtual lane (VL), packets are labeled with a certain service level (SL) at injection time, and SLtoVL mapping tables are used at each switch to determine the VL to be used. Many previous works in the literature have shown that separate virtual lanes are able to reduce the influence of the well-known head-of-line (HOL) blocking effect on network performance. However, using virtual lanes to form separate virtual networks is not enough to eliminate the HOL blocking problem. Alternative solutions such as Virtual Output Queuing (VOQ) are able to eliminate it at the expense of modifying the switch buffer organization. In this paper, we propose an effective strategy to implement the VOQ scheme in IBA switches by using virtual lanes. This strategy does not require modifying the switch architecture; the SLtoVL tables simply have to be filled properly. Evaluation results show that our proposed VOQ scheme outperforms the virtual network approach using the same number of resources. Moreover, the methodology proposed to implement the VOQ scheme in IBA requires only a small number of resources to significantly improve network throughput.

Keywords: InfiniBand network, irregular topologies, virtual lanes, head-of-line blocking.

1 Introduction

InfiniBand [4] has been recently proposed as a standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The leading computer manufacturers support the InfiniBand initiative. InfiniBand creates a more efficient way to connect storage and communication networks and server clusters together, while delivering an I/O infrastructure that will produce the efficiency, reliability and scalability that data centers demand. Moreover, InfiniBand can be used as a platform to build networks of workstations (NOWs) or clusters of PCs [8], which have become a cost-effective alternative to parallel computers. Currently, clusters are based on different available network technologies (Fast or Gigabit Ethernet [11], Myrinet [1], ServerNet II [3], Autonet [10], etc.). However, these technologies may not provide the protection, isolation, deterministic behavior, and quality of service required in some environments.

This work was supported by the Spanish MCYT under Grant TIC2000-1151-C07, by Generalitat Valenciana under Grant GV00-131-14, and by Junta de Comunidades de Castilla-La Mancha under Grant PBC-02-008.

The InfiniBand Architecture (IBA) is designed around a switch-based interconnect technology with high-speed point-to-point links. Nodes are directly attached to switches through Channel Adapters (CAs). An IBA network is composed of several subnets interconnected by routers, each subnet consisting of one or more switches, processing nodes and I/O devices. Routing in IBA subnets is distributed, based on forwarding tables stored in each switch, which only consider the packet destination node for routing [5]. Also, IBA routing is deterministic, as the forwarding tables only store one output link per destination ID. Moreover, IBA switches use virtual cut-through [6].

IBA switches support a maximum of 16 virtual lanes (VL). VL15 is exclusively reserved for subnet management, whereas the remaining VLs are used for normal traffic.

Virtual lanes provide a means to implement multiple logical flows over a single physical link [2]. Each virtual lane has a separate buffer associated with it. However, IBA virtual lanes cannot be directly selected by the routing algorithm. In particular, to route packets through a given virtual lane, packets are marked with a certain Service Level (SL) at injection time, and SLtoVL mapping tables are used at each switch to determine the virtual lane to be used. According to the IBA specification, SLs and VLs are primarily intended to provide quality of service (QoS). They can also be used to provide deadlock avoidance and traffic prioritization.

In [16], we analyzed how the use of virtual lanes influenced the performance of IBA subnets. In particular, the evaluation showed that performance can be doubled by using only two VLs. In [16], each packet was introduced into a certain VL randomly selected at the source host. Packets used the same VL until they reached the destination end-node. To distinguish the VL that should be used, a different SL was used for each VL. That is, VLs were used as separate virtual networks.

Figure 1. Implementing VOQ with VLs. (Diagram labels: buffers VL0 to VL3; output links LINK 0 to LINK 3.)

2 Motivation

Contention appears at a switch when two or more packets want to use the same output port simultaneously.

The arbitration at the required output port of the switch must select one of the packets. The rest of the packets are delayed. Frequent contention causes congestion, as blocked packets may block other packets, obtaining a cumulative effect that propagates over the whole network.

A severe situation happens when using an input-queued (IQ) switch organization. In this case, only the packet at the head of each queue can be routed and forwarded. Hence, if the packet at the head of the queue is blocked waiting for a busy port, the remaining packets of the queue may be blocked even if the output port they require is free. This situation is known as head-of-line (HOL) blocking.

A possible solution for the HOL blocking problem is the design of switches with output-queued (OQ) organizations, or combined input and output queued schemes (CIOQ) [7].

In these switches, an incoming packet is directly stored in the queue associated with the requested output port, thus eliminating the HOL blocking effect. However, in the worst case, OQ switches must operate at a speed equal to the sum of the rates of all input ports. Therefore, the OQ structure is not scalable. On the contrary, IQ switches operate at the data rate of the input ports.

The HOL blocking effect can also be eliminated in IQ switches by Virtual Output Queuing¹ (VOQ) schemes [20]. In this scheme, input buffers are organized into a set of queues where packets awaiting access to the switching fabric are stored according to their destination output ports.

¹ Also called Destination Queuing.

Figure 2. The SLtoVL mapping table. (a) The SLtoVL table of the switch, indexed by input port (IP) and SL, provides the VL placed in the header of the outgoing packet. (b) An incoming packet is stored in the VL indicated in its header.

On the other hand, many previous works in the literature [14, 9, 16] have shown that splitting physical links into a few virtual lanes can contribute to improving network performance while reducing the influence of the head-of-line blocking effect on IQ switches. Virtual lanes are less effective than VOQ, but they are more scalable. Usually, when adding virtual lanes to improve network performance (beyond the minimum number of virtual lanes needed to guarantee deadlock freedom), packets being routed through certain physical links can use any of their free virtual lanes.

However, the VOQ scheme can be easily implemented by using virtual lanes [7, 15]. The idea is based on the use of a number of virtual lanes equal to the number of output ports, and using a different virtual lane buffer for the packets destined to a particular output port (see Figure 1).
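To make the contrast concrete, the following Python sketch compares which packets can leave an input port when it is organized as a single FIFO and when it uses one VL buffer per output port. It is only an illustration under simplifying assumptions: packets are represented solely by the output port they request, and the function names are hypothetical, not part of IBA.

```python
from collections import deque

def forwardable_fifo(queue, busy_ports):
    """Single input FIFO: only the packet at the head may be forwarded."""
    if queue and queue[0] not in busy_ports:
        return [queue[0]]
    return []

def forwardable_voq(packets, busy_ports, num_ports):
    """VOQ: one VL buffer per output port, so each buffer has its own head."""
    vls = [deque() for _ in range(num_ports)]
    for out_port in packets:
        vls[out_port].append(out_port)  # VL index equals the requested output port
    return [vl[0] for vl in vls if vl and vl[0] not in busy_ports]

# Assumed scenario: the oldest packet requests LINK 0, which is busy.
arrivals = [0, 2, 3, 2]
busy = {0}

fifo_result = forwardable_fifo(deque(arrivals), busy)
voq_result = forwardable_voq(arrivals, busy, num_ports=4)
print(fifo_result)  # [] : everything waits behind the blocked head (HOL blocking)
print(voq_result)   # [2, 3] : packets for free links are not blocked
```

With the single FIFO nothing can advance, whereas with one VL per output port the packets requesting LINK 2 and LINK 3 remain eligible.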

IBA specs allow the use of up to 15 VLs per input port. Unfortunately, these VLs cannot be used directly to implement a VOQ scheme, as IBA switches do not allow the VL to be directly assigned as a function of the destination port. Moreover, the packet is already stored at a particular VL before the output port is computed. The VL used by a packet at a particular switch is assigned by using the SLtoVL mapping table associated with the output port used at the previous switch, and it depends on the packet SL and the input port used by the packet at that previous switch. Thus, both the output port and the virtual lane are selected in a deterministic way in IBA.

In this paper, we propose a strategy to implement the VOQ mechanism in IBA, which does not require any specific hardware support in IBA switches. Basically, the strategy is based on computing the SLtoVL tables in such a way that packets destined to different output ports are stored in different VLs. We also deal with the case of a limited number of available VLs and SLs (a partial VOQ scheme). This strategy may be an effective alternative to other initiatives based on hardware support for VOQ [21].

Figure 3. Implementing the VOQ scheme in IBA. (Diagram: packets 1 and 2, with SLx/DSTx and SLy/DSTy, arrive at SW0 through LINK 5 in VL1; the SLtoVL table associated with the output port of SW0 maps input port 5 with SLx to VL1 and input port 5 with SLy to VL2, the VLs used at SW1.)

The paper is organized as follows. Section 3 describes the proposed strategy. We will refer to it as "VOQ Software" or VOQsw. Section 4 presents the performance evaluation results. Finally, some conclusions are drawn.

3 VOQsw in IBA

Forwarding tables in IBA only consider the destination node for routing, storing one output port per destination ID. On the other hand, VLs cannot be directly selected by the routing algorithm. To select the VL to be used by a particular packet, the SLtoVL mapping table associated with the output port is used (see Figure 2.a). Each packet is assigned a given SL when it is injected into the network, and this SL cannot be changed by the switches. For this purpose, an SL mapping table is located at each sender node. At each switch, the VL to be used by a packet at the next switch is computed by considering the input port of the packet, the output port, and its SL. When a packet arrives at a switch through a given physical link, it is placed at the VL buffer resource indicated in the packet header (see Figure 2.b). Then, the current switch computes the output port and the next VL to be used. Hence, the VL that is used to store a given packet in the current switch was computed in the previous switch.

The way to implement a software VOQ in IBA is by means of the provided service levels and a proper computation of the distributed SLtoVL mapping tables. In order to properly assign VLs to incoming packets, it is necessary to select an appropriate SL for each packet. Let us consider the example shown in Figure 3. Two different packets arrive at SW0 through physical link number 5 and both are stored at the buffer resource corresponding to VL1 (as this VL is indicated in both packet headers). The output port is obtained by indexing the forwarding table with the destination identifiers (DSTx and DSTy). Assume that the routing option at SW0 is the physical link 1 for both packets. Both packets will then arrive at SW1 through the same link. In SW1, packet 1 will be forwarded through link 1, whereas packet 2 will use link 2. As we want to apply the VOQ scheme, packets 1 and 2 should be stored in separate buffers at SW1 (in particular, VL1 and VL2, respectively). The VL in which the packet is stored is placed in each packet header and is computed in the previous switch (SW0) by accessing the SLtoVL table associated with output port 1. In this case, we want to assign VL1 and VL2 to packets 1 and 2, respectively. As both packets use the same output port and arrive at SW0 through the same input port, the only way to assign different VLs to both packets is by using different SLs (SLx and SLy). Notice that a different SL is needed because the SLtoVL table is indexed only by the input port and the SL. Notice also that if the two packets entered SW0 through different input ports, they could use the same SL, as the SLtoVL table can differentiate both cases by using the input port index.
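The example of Figure 3 can be written down as a small lookup. The Python sketch below models the SLtoVL table associated with output port 1 of SW0 as a dictionary indexed by (input port, SL). The numeric SL values, the extra entry for input port 3, and the helper name `next_vl` are assumptions made only for illustration.

```python
# Assumed numeric encodings for the two service levels of the example.
SLX, SLY = 0, 1

# Hypothetical SLtoVL table of output port 1 at SW0: (input port, SL) -> VL at SW1.
sl_to_vl_sw0_out1 = {
    (5, SLX): 1,  # packet 1 leaves SW1 through link 1, so it is stored in VL1
    (5, SLY): 2,  # packet 2 leaves SW1 through link 2, so it is stored in VL2
    (3, SLX): 2,  # assumed extra flow: another input port may reuse SLx for VL2
}

def next_vl(table, input_port, sl):
    """Return the VL written into the packet header for the next switch."""
    return table[(input_port, sl)]

vl_packet1 = next_vl(sl_to_vl_sw0_out1, input_port=5, sl=SLX)
vl_packet2 = next_vl(sl_to_vl_sw0_out1, input_port=5, sl=SLY)
print(vl_packet1, vl_packet2)  # 1 2
```

Because the key contains only the input port and the SL, two packets sharing both would necessarily obtain the same VL, which is why SLx and SLy must differ in the example.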

This idea needs to be generalized to all the possible packets that may use the same input and output ports of a given switch (i.e. SW0), whereas in the next switch (SW1) they use different output ports. Furthermore, to support VOQ in the entire network (a full VOQ for short), the idea must be generalized to all the switches. That is, all the possible combinations of input and output ports of a switch and the set of output ports at the next switch have to be considered.

The use of SLs to provide VOQ may have a severe limitation. The SL used by a packet is located at its header and cannot be changed. Therefore, a unique SL must be used to correctly select all the VLs to be used along the path traversed by the packet. Even worse, we must avoid assigning this SL to other packets using paths that could cause SLtoVL conflicts with the previous packet.

Figure 4. Stages of the VOQsw algorithm. (Flowchart: stage 1 computes the set of 4-tuples from the topology and the forwarding table; stage 2 applies weights to the 4-tuples; stage 3 sorts the 4-tuples by weight; stage 4 applies VOQsw to the x most-weighted 4-tuples, with x initially 1 and incremented while the SLs are assigned successfully and x has not reached x_max; stage 5 computes the SL2VL and HCA tables, using the previous SL mapping (x-1) when the last attempt failed, so that either a full or a partial VOQ is obtained.)

Therefore, it may be expected that a large number of SLs will be required to obtain a full VOQ. Moreover, SLs in IBA are limited to 15, which may not be enough to distinguish all the possible situations. Indeed, some of these SLs may be used for other purposes. Therefore, we will also have to deal with a very limited number of SLs in the network. Given that SLs have proven to be useful for many tasks² beyond the ones they were originally intended for, the authors of this paper think that future releases of the IBA specs should consider increasing the number of available SLs. This could be accomplished without a great impact. Indeed, only a few more bits would be required in the Local Route Header and in the SLtoVL table. The fact that the number of SLs can be larger than the number of VLs is already considered in the specification.

Finally, it must be considered that the sender node must place the proper SL in the packet header according to the path it must follow, which, in turn, depends on the destination node of the packet.

3.1 The Algorithm

Our methodology uses the available resources (the ones not used for QoS) and tries to obtain (by using only those resources) the highest degree of VOQ achievable in the entire network. Figure 4 shows the different stages of the algorithm that implement the methodology.

At the first stage, from the topology data, the algorithm computes all the 4-tuples that can be considered to apply VOQ. A 4-tuple is defined by a combination of a switch, an input port and an output port at this switch, and an output port at the next switch (reached through the considered output port). For a 16-switch network with 4 bidirectional links connecting switches, the number of possible 4-tuples is 1024 (16 switches × 4 links × 4 links × 4 links).
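The count can be checked with a short enumeration. The Python sketch below builds a 4x4 torus purely as a convenient stand-in for a 16-switch network in which every switch has 4 inter-switch links (the networks evaluated in this paper are irregular), and then lists every (switch, input port, output port, next output port) combination. The topology encoding and the function names are assumptions for illustration.

```python
def torus_links(width, height):
    """Hypothetical topology: links[sw][port] is the switch reached through that port."""
    links = {}
    for y in range(height):
        for x in range(width):
            sw = y * width + x
            links[sw] = {
                0: y * width + (x + 1) % width,
                1: y * width + (x - 1) % width,
                2: ((y + 1) % height) * width + x,
                3: ((y - 1) % height) * width + x,
            }
    return links

def enumerate_4tuples(links):
    """List every (switch, input port, output port, output port at next switch)."""
    tuples = []
    for sw, ports in links.items():
        for ip in ports:                        # inter-switch input ports only
            for op, next_sw in ports.items():   # output port and the switch it reaches
                for next_op in links[next_sw]:  # output ports at the next switch
                    tuples.append((sw, ip, op, next_op))
    return tuples

all_tuples = enumerate_4tuples(torus_links(4, 4))
print(len(all_tuples))  # 1024 = 16 * 4 * 4 * 4
```

Like the count in the text, this enumeration includes every port combination, whether or not the routing algorithm ever uses it; unused combinations are discarded later by the weighting stage.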

² Such as routing through minimal paths [22] and traffic balance [23].

As stated before, the number of available SLs may be small and therefore, the applicability of VOQ could be reduced. Hence, our algorithm will try to make an efficient use of the available SLs, thus applying VOQ (eliminating the HOL blocking) to the most used network areas. For this, at the second stage, the 4-tuples are weighted. In particular, from the information contained in the forwarding tables, the 4-tuples are weighted by counting the number of paths that make use of them. Notice that the proposed algorithm is independent of the routing algorithm applied in the network. Also, notice that those 4-tuples not used in the network will be weighted with zero and, therefore, will not be taken into account. At the third stage, the 4-tuples are sorted by their weight.
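A minimal Python sketch of the second and third stages follows. It assumes that the source-destination paths have already been extracted from the forwarding tables and are given as lists of 4-tuples; the input format, the tie-breaking rule, and the function names are assumptions, not details taken from the paper.

```python
from collections import Counter

def weight_4tuples(paths):
    """Weight each 4-tuple by the number of paths that use it."""
    weights = Counter()
    for path in paths:
        weights.update(set(path))  # a path counts once per 4-tuple it crosses
    return weights

def sort_by_weight(weights):
    """Most-weighted 4-tuples first; unused (zero-weight) ones are dropped."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [t for t, w in ranked if w > 0]

# Assumed toy input: three paths, each a list of (sw, ip, op, next_op) hops.
paths = [
    [(0, 5, 1, 1), (1, 5, 1, 0)],
    [(0, 5, 1, 1)],
    [(0, 5, 1, 2)],
]
weights = weight_4tuples(paths)
ranked = sort_by_weight(weights)
print(ranked[0], weights[ranked[0]])  # (0, 5, 1, 1) 2
```

4-tuples that never appear in any path simply never enter the counter, which corresponds to the zero weight mentioned above.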

Then the algorithm must compute the SLs to be used by every packet in order to provide VOQ. However, as mentioned before, the effectiveness of the algorithm depends on the number of available SLs. Thus, in order to make the most of the available SLs, the algorithm will try to apply VOQ starting from a small set of 4-tuples, progressively increasing the set of considered 4-tuples until the number of available SLs is exceeded. In particular, the algorithm will focus first on one 4-tuple (the most weighted one). If the required number of SLs does not exceed the number of available SLs, then it will compute the SLs again but considering two 4-tuples (the two most weighted). This process is repeated until the entire set of 4-tuples has been considered (a full VOQ is achieved) or until the required number of SLs is larger than the number of available SLs (a partial VOQ is achieved). In the latter case, the solution provided by the algorithm is the one achieved by the previous computation.

Therefore, at the fourth stage, the algorithm computes the SLs to use based on the number of available SLs and the number (x) of most-weighted 4-tuples to consider. Initially, the number of 4-tuples to consider is x = 1. If the computation of SLs is successful (the required number of SLs does not exceed the number of available SLs), the algorithm first checks whether all the 4-tuples have been considered (x = x_max). In this case, a full VOQ is achieved and the algorithm skips to the fifth stage. Otherwise, the algorithm computes the SLs again with x = x + 1. On the contrary, if the computation is unsuccessful (there were not enough SLs to apply VOQ to the considered number of 4-tuples), then the algorithm skips to the fifth stage considering the previous computation of SLs (the one for x - 1 tuples).

PROCEDURE ComputeSLs (SLmax, x, set of tuples, forwarding table)
  FOR each source-destination path (p)
    ValidSLs = ARRAY [0..SLmax-1] OF BOOLEAN, all initially TRUE
    FOR each hop in p
      t = get_tuple(ip, sw, op, nextop)
      IF considered_tuple(t) THEN
        FOR each neighbor tuple of t (nt)
          FOR each sl
            IF Used_SL(nt, sl) THEN
              ValidSLs[sl] = FALSE
            ENDIF
          ENDFOR
        ENDFOR
      ENDIF
    ENDFOR
    sl = lowest_valid_sl(ValidSLs)
    IF sl == nil THEN
      RETURN FALSE
    ELSE
      Add sl to considered tuples of p
    ENDIF
  ENDFOR
  RETURN TRUE
END PROCEDURE

Figure 5. Algorithm for the assignment of SLs (fourth stage of the methodology).
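The incremental search over x can be sketched in Python as follows. The SL assignment itself is passed in as a callable, `compute_sls`, which is a hypothetical interface: it receives the x most-weighted 4-tuples and returns an assignment, or `None` when the available SLs are not enough.

```python
def apply_voqsw(ranked_tuples, compute_sls):
    """Grow the set of considered 4-tuples until the SL assignment fails.

    Returns (x, assignment) for the largest x that succeeded; x equal to
    len(ranked_tuples) means a full VOQ, a smaller x means a partial VOQ.
    """
    best_x, best_assignment = 0, None
    for x in range(1, len(ranked_tuples) + 1):
        assignment = compute_sls(ranked_tuples[:x])
        if assignment is None:
            break  # not enough SLs: keep the result obtained for x - 1
        best_x, best_assignment = x, assignment
    return best_x, best_assignment

# Assumed stand-in for the real assignment: it succeeds for at most 3 tuples.
def toy_compute_sls(considered):
    return list(considered) if len(considered) <= 3 else None

partial = apply_voqsw(["t1", "t2", "t3", "t4", "t5"], toy_compute_sls)
print(partial)  # (3, ['t1', 't2', 't3'])
```

The loop recomputes the assignment from scratch for every x, mirroring the restart described in the text; an incremental variant would be an optimization beyond what the paper describes.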

Finally, at the fifth stage, the necessary tables for IBA are computed based on the assignments of SLs made at the previous stage. In particular, the SLtoVL mapping tables of every switch and the SL mapping tables for each sender node are computed. Note also that at this stage, the number of available VLs is taken into account. If the number of available VLs is at least the number of ports of a switch, a VL is assigned to each output port. On the contrary, if there are fewer VLs than output ports, then the output ports are grouped and assigned to VLs by groups.
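The VL selection of the fifth stage can be illustrated with the Python sketch below. The paper does not specify how output ports are grouped when VLs are scarce, so the round-robin grouping by port number used here is purely an assumption, as is the function name.

```python
def vl_for_output_port(out_port, num_ports, num_vls):
    """Choose the VL that stores packets destined to out_port at the next switch."""
    if num_vls >= num_ports:
        return out_port        # enough VLs: one VL per output port
    return out_port % num_vls  # assumed grouping: ports share VLs round-robin

full = [vl_for_output_port(p, num_ports=8, num_vls=8) for p in range(8)]
grouped = [vl_for_output_port(p, num_ports=8, num_vls=4) for p in range(8)]
print(full)     # [0, 1, 2, 3, 4, 5, 6, 7]
print(grouped)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

With 4 VLs and 8 ports, each VL is shared by two output ports, so HOL blocking can still occur between the ports of the same group.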

For the SLtoVL mapping tables, the SLs assigned to a particular 4-tuple must be stored in the corresponding SLtoVL table. Also, all the 4-tuples that were not considered (due to the lack of enough SLs) must be taken into account in the final SLtoVL tables. Notice that non-considered 4-tuples can be used by any packet at any switch and using any SL. Therefore, SLtoVL tables must be properly programmed for these non-considered 4-tuples. An easy way is to uniformly distribute all the SLs among the entire set of available VLs for each non-considered 4-tuple. The SL mapping table is referred to in Figure 4 as the HCA table.

Obviously, the fourth stage is the most complex one of the algorithm. Figure 5 shows the algorithm applied at this stage. For the sake of clarity, we define that two 4-tuples are neighbors if they share the same switch, input port and output port, but they use a different output port at the next switch. Obviously, two neighbor 4-tuples will use the same SLtoVL table. Therefore, when assigning SLs, conflicts may only arise between neighboring 4-tuples. Basically, the algorithm is based on traversing the paths between all the possible source-destination pairs. The path currently being processed is referred to as the active path. For every visited 4-tuple of the active path, it is checked whether it is a considered 4-tuple. If it is, the SLs already used by its neighboring 4-tuples must not be used by the active path. Therefore, those SLs are marked as busy. Once the traversal of the active path is finished, the SL to use is chosen from those not used (free SLs) by any neighboring 4-tuple along the path. This implies that the path must be traversed again and the SL must be marked for all the considered 4-tuples used by the path (in order to prevent future active paths from using the selected SL). On the other hand, if no SL is available, then the algorithm finishes unsuccessfully.

The algorithm finishes successfully if it is able to assign a particular SL to every path in such a way that it does not introduce any conflict in any SLtoVL table.
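The procedure of Figure 5 can be transcribed into Python roughly as follows. This is a sketch under stated assumptions: the paths are supplied directly as lists of (switch, input port, output port, next output port) tuples instead of being derived from the forwarding table, and the function returns the per-path SL assignment (or `None` on failure) rather than TRUE/FALSE.

```python
from collections import defaultdict

def compute_sls(sl_max, considered, paths):
    """Assign one SL per path so that neighboring considered 4-tuples never share an SL."""
    considered = set(considered)
    # used[(sw, ip, op)][next_op] holds the SLs of paths crossing that 4-tuple.
    used = defaultdict(lambda: defaultdict(set))
    path_sl = {}
    for path_id, hops in paths.items():
        valid = [True] * sl_max
        for sw, ip, op, next_op in hops:
            if (sw, ip, op, next_op) not in considered:
                continue
            for other_next_op, sls in used[(sw, ip, op)].items():
                if other_next_op != next_op:  # neighboring 4-tuple
                    for sl in sls:
                        valid[sl] = False     # busy: would conflict in the SLtoVL table
        sl = next((s for s in range(sl_max) if valid[s]), None)
        if sl is None:
            return None                       # not enough SLs
        path_sl[path_id] = sl
        for sw, ip, op, next_op in hops:      # second traversal: mark the chosen SL
            if (sw, ip, op, next_op) in considered:
                used[(sw, ip, op)][next_op].add(sl)
    return path_sl

# Assumed toy input modeled on Figure 3: both paths enter SW0 (switch 0) through
# port 5 and leave through port 1, but use different output ports at the next switch.
t1, t2 = (0, 5, 1, 1), (0, 5, 1, 2)
paths = {"p1": [t1], "p2": [t2]}
print(compute_sls(2, [t1, t2], paths))  # {'p1': 0, 'p2': 1}
print(compute_sls(1, [t1, t2], paths))  # None
```

When neither 4-tuple is considered, both paths receive SL 0, which corresponds to leaving that part of the network without VOQ.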

4 Performance Evaluation

In this section, we study the influence on performance of the proposed VOQsw methodology in InfiniBand networks. We have developed a simulator that allows us to model the network at the register transfer level following the IBA specifications [5]. First, we will describe the main IBA subnet model properties defined in the specs together with the main simulator parameters, and the modeling considerations we have used in all the evaluations. Then, we will evaluate the performance obtained by the VOQsw methodology.

Figure 6. Average packet latency vs. accepted traffic. Uniform distribution of packet destinations. Network size is (a) 8, (b) 16, (c) 32, and (d) 64 switches. Packet length is 32 bytes. (Axes: average message latency in ns vs. traffic in bytes/ns/switch. Curves: 8 VNs and VOQ with 8 SLs and 8 VLs.)

4.1 The IBA Subnet Model

The IBA specification defines a switch-based network with point-to-point links, allowing any topology defined by the user. Packets are routed at each switch by accessing the forwarding table. In this paper we will use the well-known up*/down* routing algorithm [10].

In the simulator, each switch has a non-multiplexed crossbar. This crossbar supplies separate ports for each virtual lane. We will use a simple crossbar arbiter based on FIFO request queues per output crossbar port, which will select the first request that has enough space in the corresponding output crossbar port. The crossbar bandwidth will be set according to the link rate. The size of the input and output buffers will be fixed to 1 KB.

The routing time at each switch will be set to 100 ns. This time includes the time to access the forwarding and SLtoVL tables, the crossbar arbiter time, and the time to set up the crossbar connections. The link speed is fixed to 2.5 Gbps (1X links). We will model 20 m copper cables with a propagation delay of 5 ns/m. Therefore, the fly time will be set to 100 ns. We have used two different packet lengths in all the evaluations: short packets with a payload of 32 bytes, and long packets with a payload of 256 bytes. However, for the sake of brevity, we only show the results for 32-byte packets³.

In order to evaluate different synthetic workload patterns, we have used different message destination distributions to generate network traffic, such as uniform and hot-spot distributions. In the uniform distribution, the destination of a message is randomly chosen with the same probability for all the hosts. In the hot-spot distribution with one hot-spot, a percentage of traffic is sent to one host. The host is randomly chosen. In order to use a representative hot-spot distribution, we have used different percentages of traffic sent to the hot-spot host. Finally, in the hot-spot distribution with several hot-spot hosts, a percentage of traffic is sent to several selected hosts. The hosts are randomly chosen.
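A destination generator for these distributions could look like the Python sketch below. It is not the simulator code: the function name and parameters are hypothetical, and we assume that the traffic not directed to the hot-spots is uniform over all hosts, hot-spot hosts included, which the text does not state.

```python
import random

def pick_destination(rng, hosts, hot_spots=(), hot_fraction=0.0):
    """Pick a destination host under a uniform or hot-spot distribution."""
    if hot_spots and rng.random() < hot_fraction:
        return rng.choice(list(hot_spots))  # hot-spot share of the traffic
    return rng.choice(hosts)                # remaining traffic: uniform over hosts

rng = random.Random(2024)                   # fixed seed only for repeatability
hosts = list(range(32))                     # e.g. 8 switches with 4 hosts each
hot_spot_hosts = rng.sample(hosts, 4)       # randomly chosen hot-spot hosts

destinations = [
    pick_destination(rng, hosts, hot_spot_hosts, hot_fraction=0.2)
    for _ in range(10_000)
]
hot_share = sum(d in hot_spot_hosts for d in destinations) / len(destinations)
print(f"share of messages reaching hot-spot hosts: {hot_share:.2f}")
```

Under the stated assumption, the measured share is somewhat above 20%, since uniform traffic also reaches the hot-spot hosts occasionally.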

In all the presented results, we will plot the average packet latency⁴, measured in nanoseconds, versus the average accepted traffic⁵, measured in bytes/ns/switch. We will analyze irregular networks of 8, 16, 32, and 64 switches. These network topologies will be randomly generated taking into account some restrictions. First, we will assume that every switch in the network has 8 ports, using 4 ports to connect to other switches and leaving 4 ports to connect to hosts. Second, two neighboring switches will be connected by exactly one link.

³ Results for 256-byte packets do not significantly vary, in relative terms, with respect to those achieved by 32-byte packets.

⁴ Latency is the time elapsed between the generation of a packet at the source host and its delivery at the destination end-node.

⁵ Accepted traffic is the amount of information delivered by the network per time unit.

Figure 7. Average packet latency vs. accepted traffic. Destination distribution is hot-spot. Network size is 8 switches. Packet length is 32 bytes. Different percentages of traffic (10%, 20%, 40%, and 70%) are sent to one hot-spot. (Curves: 8 VNs and VOQ with 8 SLs and 8 VLs for each percentage.)

4.2 Simulation Results

In this section, we will analyze in detail the possible benefits of using the VOQsw methodology to increase network performance. In order to make a fair evaluation, we will compare results with the ones obtained when the same number of VLs/SLs are used as separate virtual networks (VNs) [23]. In a VN, a packet is injected into the network through a particular VL, which is used along the whole path. Therefore, with n SLs and n VLs, we have n VNs.

The effectiveness of the VOQsw methodology will depend on the number of available SLs and VLs. For this reason, we will evaluate it in three different scenarios. First, we will evaluate the impact of the VOQsw methodology on network performance when 8 VLs⁶ and 8 SLs are used. In this situation, half of the resources offered by IBA are available. Second, we will assume that we have a sufficient number of SLs and VLs to obtain a full VOQ. In this situation, eight VLs and an unbounded number of SLs will be used. This evaluation will be performed to foresee the highest level of performance that can be achieved by applying the VOQsw methodology. Finally, we will limit the number of resources available to VOQsw with the aim of evaluating its performance in a situation where most of the resources are used for other purposes. We will carry out this last study in two steps. First, we will reduce the number of SLs, and then, we will reduce the number of VLs.

⁶ We use 8 VLs because we are assuming 8-port switches.

Figure 8. Average packet latency vs. accepted traffic. 4 hot-spots with 20% of the traffic. Network size is 8 switches. Packet length is 32 bytes. Results are shown for 8 VNs and VOQsw with 8 SLs and 8 VLs. (Curves: hot-spot traffic, uniform traffic, and both, for 8 VNs and for VOQ.)

Size     SLs    VLs    Perc.    Factor
8 sw      12      8    100%     1.82
8 sw       8      8     58%     1.87
8 sw       4      8     12%     1.75
16 sw     40      8    100%     1.86
16 sw      8      8     15%     1.61
16 sw      4      8      8%     1.58
32 sw    108      8    100%     1.74
32 sw      8      8     16%     1.47
32 sw      4      8     10%     1.46
64 sw    257      8    100%     3.35
64 sw      8      8     16%     3.29
64 sw      4      8      9%     3.14

Table 1. Percentage of obtained VOQ and factor of throughput increase with respect to virtual networks for different network sizes and different numbers of SLs and VLs.

4.2.1 VOQsw with 8 SLs and 8 VLs

As stated in Section 3, some SLs are needed by the VOQsw methodology to solve the different conflicts that may appear in the SLtoVL mapping tables. However, 8 SLs alone are not enough to solve all the possible conflicts in each of the analyzed networks. Hence, in this case only a partial VOQ will be achieved by the VOQsw methodology.

In Figure 6 we plot the performance obtained by the VOQsw methodology compared with 8 VNs. VOQsw achieves higher throughput for all the network sizes. Factors of increment in throughput range from 1.69 for the 8-switch network to 3.3 for the 64-switch network. Notice that when using VOQsw, the performance curve has a somewhat atypical shape. This behavior is due to a partial saturation of the network caused by the routing algorithm. In particular, the up*/down* routing algorithm is used, which is known for its relatively low performance.

Figure 9. Average packet latency vs. accepted traffic for VOQsw using 8 SLs and the number of SLs needed to obtain a full VOQ. Destination distribution is uniform. Packet length is 32 bytes. Network size is (a) 8 and (b) 32 switches.

Figure 10. Average packet latency vs. accepted traffic for VOQsw using 4 SLs and 8 VLs. Network size is 8 switches. Packet length is 32 bytes. (a) Destination distribution is uniform. (b) 4 hot-spots with 40% of the traffic sent to the hot-spots.

At the saturation point of the 8 VNs algorithm (0.22 bytes/ns/switch for the 8-switch network) the area near the root switch saturates. This saturation rapidly spreads over the entire network, causing a global saturation. This propagation is due to the HOL blocking effect. Packets blocked in a VL due to the local saturation prevent others from advancing through non-congested areas. However, when using VOQsw, as the HOL blocking effect is minimized, this effect is not severe, because the local congestion does not affect the other areas. Packets being forwarded to non-congested areas use different VLs. Hence, as traffic increases beyond the saturation point of the 8 VNs algorithm, more messages are sent to non-congested areas. These messages exhibit lower latencies, which reduces the average latency values shown in the results.

Figure 7 shows performance results when there is one hot-spot, with different percentages of hot-spot traffic. It is well known that hot-spots rapidly saturate the network, thus obtaining very low performance. In this case, when using the VN approach, the packets sent to the hot-spot host block many packets whose destinations are different from the hot-spot host. However, VOQsw is able to overcome the hot-spot in all cases. The higher the hot-spot traffic, the higher the throughput improvement achieved by the VOQsw methodology. Actually, under hot-spot traffic, VOQsw is able to increase only the throughput of the traffic that is not destined to the hot-spot host (uniform traffic), due to the reduction of the HOL blocking effect.

Finally, we have also analyzed the influence of having several hot-spot hosts in the network. In this case, VOQ_SW continues to significantly outperform 8 VNs, as can be seen in Figure 8 for 4 hot-spots with 20% of hot-spot traffic. As the number of hot-spot hosts increases, the traffic is spread more uniformly over the network, bringing its behavior closer to that of uniform traffic. As a consequence, the hot-spot traffic is also able to take advantage of using VOQ_SW.


[Figure 11, panels (a) and (b): curves for 8 VNs, VOQ 8SLs 8VLs, and VOQ 4SLs 4VLs]

Figure 11. Average packet latency vs. accepted traffic for VOQ_SW using 8SLs/8VLs and 4SLs/4VLs. Network size is 8 switches. Packet length is 32 bytes. (a) Destination distribution is uniform. (b) 4 hot-spots with 40% of the traffic sent to the hot-spots.

4.2.2 Full VOQ

In this section, we evaluate the performance obtained by the VOQ_SW methodology when using a full VOQ scheme. Eight VLs and an unbounded number of SLs will be used. This evaluation will be performed to foresee the highest levels of performance that the VOQ_SW methodology can achieve.

Table 1 shows the number of SLs required to obtain a full VOQ scheme for different network sizes. For the 8-switch network, 12 SLs are needed to solve all the possible conflicts in the SLtoVL tables, whereas with 8 SLs, as in the previous section, only 58% of the possible conflicts were solved. Figure 9.a shows the performance results obtained by VOQ_SW when using a partial and a full VOQ scheme with a uniform distribution of packet destinations. As can be seen, the obtained throughput is very similar. This is because a high percentage of conflicts are already solved with 8 SLs and, therefore, HOL blocking is greatly reduced.

However, as network size increases, the difference in performance between VOQ_SW using 8 SLs and a full VOQ scheme increases too. This is because of the larger number of SLs required to obtain a full VOQ scheme, as can be seen in Table 1.

For example, Figure 9.b shows the performance evaluation for a 32-switch network. For this network size, the difference in performance is greater because, when using 8 SLs, only 16% of a full VOQ scheme is reached (see Table 1). However, good performance results are still obtained with 8 SLs.

As a result, throughput does not increase linearly with the percentage of VOQ obtained by the VOQ_SW methodology. Furthermore, a low percentage of VOQ is enough to achieve more than 75% of the throughput achieved by a full VOQ.

4.2.3 Bounding the Available Number of SLs and VLs

In this section, we analyze the behavior of the proposed VOQ_SW methodology when the number of available resources is bounded below the 8 SLs and 8 VLs used in Section 4.2.1. First, we reduce the number of SLs used from 8 down to 4, while maintaining the number of VLs.

Reducing the number of SLs also reduces the number of conflicts solved in the SLtoVL tables. In particular, Table 1 shows the percentage of VOQ obtained for different numbers of resources used. For the 8-switch network, with 4 SLs available, only 12% of VOQ is obtained, instead of the 58% obtained by using 8 SLs.

Figure 10 shows the throughput achieved for uniform and hot-spot distributions. As we can see, the throughput is reduced by a factor of 1.08 when using 4 SLs with respect to using 8 SLs. However, with this configuration, VOQ_SW continues to significantly outperform 8 VNs. These results are also confirmed for larger network sizes.

We now analyze the behavior of VOQ_SW when also reducing the number of VLs. Instead of having a dedicated virtual lane for each output port, each virtual lane will be shared by two output ports. We present results for VOQ_SW when using 4 SLs and 4 VLs. Obviously, the results obtained are worse than those obtained for VOQ_SW using 8 VLs and 8 SLs, as shown in Figure 11. In particular, throughput decreases by a factor of 1.18 for the 8-switch network for the uniform distribution, but performance is still better than that obtained with 8 VNs.

However, as the network size increases, the differences between the 8 SLs with 8 VLs and the 4 SLs with 4 VLs configurations become increasingly small.
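The sharing of one virtual lane by two output ports can be pictured with a small mapping function. This is only a sketch under assumptions: the text above does not state which output ports share a VL, so grouping consecutive port indices is an illustrative choice, and the function name and the 8-port switch in the usage lines are hypothetical.

```python
def vl_for_output_port(output_port, num_ports, num_vls):
    """Map an output port index to a VL index, grouping consecutive ports.

    Assumption: ports are numbered 0..num_ports-1 and consecutive ports
    share a VL when there are fewer VLs than ports.
    """
    if not 0 <= output_port < num_ports:
        raise ValueError("output port out of range")
    ports_per_vl = -(-num_ports // num_vls)  # ceiling division
    return output_port // ports_per_vl


# One dedicated VL per output port (8 ports, 8 VLs).
dedicated = [vl_for_output_port(p, 8, 8) for p in range(8)]

# Each VL shared by two output ports (8 ports, 4 VLs).
shared = [vl_for_output_port(p, 8, 4) for p in range(8)]
```

With 8 VLs every port gets its own lane; with 4 VLs, packets bound for two different output ports can again block one another inside the shared lane, which is consistent with the lower throughput reported for the 4 SLs and 4 VLs configuration.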

5 Conclusions

In this paper, we have proposed a strategy to implement Virtual Output Queuing in InfiniBand. The proposed methodology is fully compatible with the InfiniBand Architecture specification and does not require any special hardware in the switches. It relies only on the service levels and virtual lanes provided in InfiniBand.

In particular, the proposed methodology computes the SLtoVL tables of each IBA switch in such a way that packets are stored in different VLs according to the output port they must use. Taking into account that IBA SLs and VLs are a limited resource and can be used for other purposes, the methodology can be applied with different numbers of SLs/VLs available for VOQ. In this case, the proposed strategy tries to use the available resources to apply VOQ to the “hottest switches” (i.e., those switches that are traversed by more paths). The evaluation results show that network performance can be strongly improved (more than tripled for a 64-switch network). Most importantly, this improvement is achieved with only 4 SLs and 4 VLs.
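The notion of “hottest switches” can be sketched as a simple ranking by path count. This is a minimal illustration, assuming each routing path is available as a list of switch identifiers; the function name and the sample paths are hypothetical, and the sketch covers only the ranking step, not the computation of the SLtoVL tables.

```python
from collections import Counter


def rank_hottest_switches(paths):
    """Return switch ids ordered by the number of paths traversing them.

    Ties are broken by switch id so that the ordering is deterministic.
    """
    counts = Counter()
    for path in paths:
        counts.update(set(path))  # count each switch once per path
    return sorted(counts, key=lambda switch: (-counts[switch], switch))


# Hypothetical paths, each given as the sequence of switches it crosses.
paths = [[0, 1, 2], [3, 1, 4], [0, 1], [2, 4]]
ranking = rank_hottest_switches(paths)
```

A limited budget of SLs and VLs would then be spent on the switches at the front of this ranking first.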

References

[1] N. J. Boden et al., Myrinet - A gigabit per second local area network, IEEE Micro, vol. 15, Feb. 1995.

[2] W. J. Dally, Virtual-channel flow control, IEEE Trans. on Parallel and Distributed Systems, vol. 3, no. 2, pp. 194-205, March 1992.

[3] D. García and W. Watson, Servernet II, in Proc. of the 1997 Parallel Computer, Routing, and Communication Workshop, Jun 1997.

[4] InfiniBand™ Trade Association, http://www.infinibandta.com.

[5] InfiniBand™ Trade Association, InfiniBand™ architecture. Specification Volume 1. Release 1.0.a. Available at http://www.infinibandta.com.

[6] P. Kermani and L. Kleinrock, Virtual cut-through: A new computer communication switching technique, Computer Networks, vol. 3, pp. 267-286, 1979.

[7] C. Minkenberg and T. Engbersen, A Combined Input and Output Queued Packet-Switched System Based on PRIZMA Switch-on-a-Chip Technology, in IEEE Communication Magazine, Dec. 2000.

[8] G. Pfister, In search of clusters, Prentice Hall, 1995.

[9] F. Silla and J. Duato, Tuning the Number of Virtual Channels in Networks of Workstations, in Proc. of the 10th Int. Conf. on Parallel and Distributed Computing Systems, Oct. 1997.

[10] M. D. Schroeder et al., Autonet: A high-speed, self-configuring local area network using point-to-point links, SRC research report 59, DEC, Apr. 1990.

[11] R. Seifert, Gigabit Ethernet, Addison-Wesley, April 1998.

[12] Fibre Channel Industry Association, http://www.fibrechannel.com.

[13] R. Bopana, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic and W. Su, A Comparison of Adaptive Wormhole Routing Algorithms, Proc. 20th Annual Int. Symp. Comp. Architecture, May 1993.

[14] W. Dally, Virtual Channel Flow Control, IEEE Trans. Parallel Distributed Syst., vol. 3, no. 2, pp. 194-205, Mar. 1992.

[15] Y. Tamir, G.L. Frazier, High Performance Multi-queue Buffers for VLSI Communication Switches, 15th Int. Symp. on Computer Architecture, June 1988.

[16] J.C. Sancho, J. Flich, A. Robles, P. López, J. Duato. Analyzing the Influence of Virtual Lanes on the Performance of InfiniBand Networks, Workshop on Communication Architecture for Clusters, 2002.

[17] SCSI Trade Association, http://www.scsita.com.

[18] H. Obara, T. Yasushi. An Efficient Contention Resolution Algorithm for Input Queueing ATM Cross-connect Switch, Int. J. Digital and Analog Cabled Syst., vol. 2, pp. 261-267, Dec. 1989.

[19] N.W. McKeown. Scheduling Algorithms for Input-queued Switches, Ph.D. thesis, UC Berkeley, 1995.

[20] N.W. McKeown, A. Anantharam, J. Walrand. Achieving 100% in an Input-queued Switch, IEEE Trans. on Communications, vol. 47, no. 8, pp. 1260-1272, Aug. 1999.

[21] C. Eddington. InfiniBridge: An InfiniBand Channel Adapter with Integrated Switch, IEEE Micro, vol. 22, no. 2, pp. 48-56, Mar-Apr 2002.

[22] J.C. Sancho, A. Robles, J. Flich, P. López, and J. Duato. Effective Methodology for Deadlock-free Minimal Routing in InfiniBand Networks, Proc. of Int. Conf. on Parallel Processing, Aug. 2003.

[23] J. Flich, P. López, J.C. Sancho, A. Robles, and J. Duato. Improving InfiniBand Routing through Multiple Virtual Networks, Proc. of Int. Symp. on High Performance Computing, May 2002.