
Engineering Reliable, Low-Latency Networks

3. Network Organization

3.1 Low-Latency Networks

3.1.1 Fully Connected Network

From the standpoint of latency, the optimal network is a fully connected network in which every processor has a direct connection to every other processor (see Figure 3.1). Here, there is no switching latency (i.e., $T_s = 0$). The problem with this network, of course, is that the processor node size grows linearly with the size of the system. This is not practical for several reasons. We cannot build very large networks with bounded pin-out components, and a different component size is needed for each different network size. Using techniques from [Tho80] and [LR86], we find the interwiring resources will grow as $\Theta(N^3)$. Wiring constraints alone require that the best packaging volume grows as $\Theta(N^3)$, making, in the best case, the wiring distances, $d$, grow as $\Theta(N)$. Such an organization is not very practical.
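To make the node-size growth concrete, the following minimal sketch counts the per-node connections and the total number of point-to-point links in a fully connected network. The function name `fully_connected_cost` is hypothetical, and the sketch counts links only; it does not model the wire lengths behind the $\Theta(N^3)$ interwiring estimate cited above.

```python
def fully_connected_cost(n_processors: int) -> dict:
    """Count connections in a fully connected network of n_processors nodes.

    Illustrative only: every node links directly to every other node,
    so node degree grows linearly and link count grows quadratically.
    """
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    degree = n_processors - 1                         # links per node
    links = n_processors * (n_processors - 1) // 2    # each link shared by two nodes
    return {"degree": degree, "links": links}


if __name__ == "__main__":
    for n in (4, 16, 64, 1024):
        cost = fully_connected_cost(n)
        print(f"N={n:5d}  degree={cost['degree']:5d}  links={cost['links']:7d}")
```

Under a bounded pin-out assumption, the `degree` column is the quantity that rules out large systems: a different component would be needed for every value of N.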

3.1.2 Full Crossbar

Next, we consider a full crossbar arrangement (see Figure 3.2). If we could build a large enough crossbar, we would traverse only one switching node between any source-destination pair. Unfortunately, our pin limitations (Section 2.7.1) will not allow us to build a single crossbar of arbitrary size. In practice, we would have to distribute the function across many different components as shown in Figure 3.3. This would incur $O(n)$ switching latency and require $O(n^2)$ such switches.

Figure 3.1: Fully Connected Networks

Figure 3.2: Full 16×16 Crossbar

Figure 3.3: Distributed 16×16 Crossbar

3.1.3 Hypercube

We might consider building a hypercube network to exploit locality and distributed routing control. The switching latency is $\log_2(N)$, as we need to traverse at most one switching link in each dimension of the hypercube. Unfortunately, to maintain this characteristic, the switching node degree grows as $\Theta(\log(N))$. Node size soon runs into our pin limitations (Section 2.7.1), and a different size node is needed for each size of machine constructed. Additionally, when implemented in three-dimensional space, the interconnection requirements cause the machine volume to grow as $\Theta(N^{3/2})$. This result is also derivable from the techniques presented in [Tho80] and [LR86] by considering the number of wires which must cross through the middle of the machine in any decomposition. If we divide an $N$-processor machine in half, the number of wires crossing the bisecting plane will be $\Theta(N)$. If we distribute these wires in the two-dimensional plane dividing the two halves, then the plane is $\Theta(\sqrt{N})$ wire widths wide in each dimension. Considering that we get the same effect if we divide the machine via an orthogonal plane which also bisects the machine, we see that the machine is $\Theta(\sqrt{N})$ long in each dimension and hence the volume is $\Theta(N^{3/2})$. From this we can see that the transit distance, $d$, will generally grow as $\Theta(\sqrt{N})$.

Shown above is a 16 processor hypercube. (Drawing by Frederic Chong)

Figure 3.4: Hypercube

Figure 3.5: Mesh – $k$-ary-$n$-cube with $k = 2$
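The claim that a hypercube route crosses at most one link per dimension can be illustrated with a short sketch. It assumes nodes are labeled with binary addresses and that differing address bits are corrected from the lowest dimension upward (a dimension-order scheme). The text does not prescribe a routing order, so that ordering and the name `hypercube_route` are assumptions made for illustration.

```python
from typing import List


def hypercube_route(src: int, dst: int, dimensions: int) -> List[int]:
    """Return the nodes visited when routing from src to dst.

    Assumed scheme: fix differing address bits one dimension at a time,
    lowest dimension first. Each fixed bit is one link traversal, so the
    hop count equals the Hamming distance and never exceeds `dimensions`.
    """
    size = 1 << dimensions
    if not (0 <= src < size and 0 <= dst < size):
        raise ValueError("node address out of range")
    path = [src]
    current = src
    for dim in range(dimensions):
        bit = 1 << dim
        if (current ^ dst) & bit:      # addresses differ in this dimension
            current ^= bit             # cross the link in this dimension
            path.append(current)
    return path


if __name__ == "__main__":
    route = hypercube_route(0b0000, 0b1011, dimensions=4)
    print([format(node, "04b") for node in route], "hops:", len(route) - 1)
```

For a 16-processor hypercube (4 dimensions), the worst case is a route between complementary addresses, which takes 4 hops, matching the $\log_2(N)$ switching latency.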

Compromising the basic hypercube structure for the sake of practicality yields a number of derivative networks. The next two sections cover two major classes: $k$-ary-$n$-cubes and multistage networks.

3.1.4 $k$-ary-$n$-cube

For $k$-ary-$n$-cubes, we fix the dimension ($k$) to avoid the switching node size growth problem associated with the pure hypercube. We still get the locality and distributed routing. The switching latency grows as $O(\sqrt[k]{N})$ since there are at most $\sqrt[k]{N}$ routers which must be traversed in each dimension. Many popular $k$-ary-$n$-cube networks in use today set $k = 2$ or $k = 3$ to build mesh (see Figure 3.5) or cube (see Figure 3.6) structures [Dal87]. For these networks, the distances between components can be made uniformly short such that the switching latency dominates the transit latency. When constrained to three-dimensional space, larger values of $k$ will tend to have transit latencies which scale as $\Omega(\sqrt[3]{N})$. Toroidal $k$-ary-$n$-cubes can be used to cut the worst-case switching latency in each dimension in half and avoid hot-spot problems in simple $k$-ary-$n$-cubes (see Figure 3.7) [DS86].

Shown above is a 27 processor cube network. (Drawing by Frederic Chong)

Figure 3.6: Cube – $k$-ary-$n$-cube with $k = 3$

Figure 3.7: Torus – $k$-ary-$n$-cube with $k = 2$ and Wrap-Around Torus Connections
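The effect of wrap-around connections on switching latency can be checked with a small hop-count sketch. To avoid confusion over symbols, it uses the hypothetical names `dims` for the fixed number of dimensions and `side` for the number of routers along each dimension (the $\sqrt[k]{N}$ of the text), and it assumes routes travel the shortest distance independently in each dimension.

```python
from typing import Sequence


def ring_hops(a: int, b: int, side: int, wraparound: bool) -> int:
    """Hops between positions a and b along one dimension of length `side`."""
    direct = abs(a - b)
    # With torus links, the route may go the short way around the ring.
    return min(direct, side - direct) if wraparound else direct


def cube_hops(src: Sequence[int], dst: Sequence[int], side: int,
              wraparound: bool = False) -> int:
    """Total hops between two coordinates, summing each dimension's hops."""
    return sum(ring_hops(a, b, side, wraparound) for a, b in zip(src, dst))


def worst_case_hops(dims: int, side: int, wraparound: bool) -> int:
    """Largest hop count over all source-destination pairs."""
    per_dim = side // 2 if wraparound else side - 1
    return dims * per_dim


if __name__ == "__main__":
    for side in (4, 8, 16):
        mesh = worst_case_hops(dims=2, side=side, wraparound=False)
        torus = worst_case_hops(dims=2, side=side, wraparound=True)
        print(f"side={side:2d}  mesh worst case={mesh:2d}  torus worst case={torus:2d}")
```

As `side` grows, the torus worst case approaches half the mesh worst case in each dimension, which is the halving described above.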

Figure 3.8: 16×16 Omega Network Constructed from 2×2 Crossbars

3.1.5 Flat Multistage Networks

A multistage network distributes each hypercube routing element spatially so that fixed-degree switches can be used for routing. Like the hypercube, routing can occur in a distributed manner requiring only $\log_r(N)$ stages between any pair of nodes in the network. Here $r$ is a constant known as the radix, which denotes the number of distinct directions to which each routing switch can route. Unlike the hypercube and $k$-ary-$n$-cube, the multistage network does not provide any locality. The number of switches required by a multistage network grows as $O(N \log(N))$. The best-case packaging volume grows as $\Theta(N^{3/2})$ and the transit latency grows as $\Theta(\sqrt{N})$, like the hypercube [LR86].

Quite a variety of networks can be classified as multistage networks including: Butterfly networks, Banyan networks, Bidelta networks [KS86], Benes networks, and Multibutterfly networks.

Figures 3.8 through 3.11 show some popular multistage networks. Each stage in these networks routes by successively subdividing the set of possible destinations into a number of equivalence classes equal to the radix of the routing components. For example, consider a radix-2 network.

When connections enter the network, any input can reach any destination. The first stage of routing components divides this class into two different equivalence classes based on desired destination.

Each succeeding network stage further subdivides a previous stage’s equivalence classes into two more equivalence classes. When there is a single destination in each equivalence class, the network has uniquely determined the desired destination and can connect to the destination endpoints. This successive subdivision can be easily seen in the network shown in Figure 3.9.
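The successive subdivision described above can be sketched for a radix-2 network. The example assumes an Omega-style wiring, in which a perfect shuffle (a one-bit left rotation of the line address) precedes each stage of 2×2 switches, and each switch selects its output using one bit of the destination address, most significant bit first. The function name `omega_route` and the trace layout are hypothetical, and other multistage networks wire their stages differently.

```python
from typing import List, Tuple


def omega_route(src: int, dst: int, stages: int) -> List[Tuple[int, int, int, int]]:
    """Trace a connection through a radix-2 Omega-style network.

    Returns one tuple per stage:
    (stage index, destination bit used, line position after the stage,
     number of destinations still in the connection's equivalence class).
    """
    size = 1 << stages
    if not (0 <= src < size and 0 <= dst < size):
        raise ValueError("endpoint address out of range")
    mask = size - 1
    position = src
    trace = []
    for stage in range(stages):
        # Perfect shuffle: rotate the line address left by one bit.
        position = ((position << 1) | (position >> (stages - 1))) & mask
        # The 2x2 switch picks its upper (0) or lower (1) output from
        # the next destination bit, most significant bit first.
        dst_bit = (dst >> (stages - 1 - stage)) & 1
        position = (position & ~1) | dst_bit
        remaining = size >> (stage + 1)   # each stage halves the class
        trace.append((stage, dst_bit, position, remaining))
    return trace


if __name__ == "__main__":
    for step in omega_route(src=5, dst=11, stages=4):
        print("stage %d: bit=%d position=%2d remaining destinations=%d" % step)
```

After the final stage the equivalence class holds a single destination and the line position equals the destination address, regardless of which input the connection entered on.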

3.1.6 Tree Based Networks

Properly constructed, a tree-based multistage network avoids the major liabilities associated with the standard multistage networks. Specifically, we consider fat-tree networks as described in [Lei85] and [GL85] and shown in Figure 1.2. The switching delay remains $O(\log(N))$, as with hypercubes and multistage networks. Routing may occur in a distributed fashion. Unlike the multistage networks described above, the tree-based networks do allow locality exploitation. When the bandwidth between successive stages of the tree is chosen appropriately, the tree structures can be arranged efficiently in three-dimensional space; switching and wiring resources grow as $\Theta(N)$ and transit latency will grow as $\Theta(\sqrt[3]{N})$. While a tree-based network may have less cross-machine bandwidth than a hypercube with the same number of nodes, the tree-based machine requires a factor of $O(\log(N))$ less interconnect hardware. As a result, if one were to compare machines of the same size, taking into account three-dimensional space restrictions, the tree machine provides at least as much bandwidth while supporting a factor of $O(\log(N))$ more nodes. Leiserson shows that properly sized fat trees can efficiently perform any communication performed by any other similarly sized network [Lei85].

Figure 3.9: 16×16 Bidelta Network

Figure 3.10: Benes Network

Figure 3.11: 16×16 Multibutterfly Network
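The locality that tree-based networks allow can be illustrated with a hop-count sketch. It assumes a complete binary tree whose leaves are processors numbered left to right, with a route that climbs to the lowest common ancestor and then descends. The binary radix and the function names are assumptions for illustration; the sketch does not model the increased link bandwidth toward the root that distinguishes a fat tree.

```python
def levels_climbed(src: int, dst: int) -> int:
    """Tree levels a route must climb to reach the common ancestor.

    With leaves numbered left to right, two leaves share a subtree of
    height h exactly when their addresses agree above the low h bits.
    """
    return (src ^ dst).bit_length()


def tree_switches_traversed(src: int, dst: int) -> int:
    """Switch nodes on the up-then-down path between two leaves."""
    levels = levels_climbed(src, dst)
    # Climb through `levels` switches, then descend through levels - 1
    # more (the common ancestor is counted once).
    return 0 if levels == 0 else 2 * levels - 1


if __name__ == "__main__":
    for src, dst in ((4, 5), (4, 6), (0, 7), (0, 15)):
        print(f"{src:2d} -> {dst:2d}: {tree_switches_traversed(src, dst)} switches")
```

Neighboring leaves communicate through a single switch, while the worst case for 16 leaves crosses 7 switches, which stays within the $O(\log(N))$ switching delay.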

3.1.7 Express Cubes

Express cubes [Dal91] are a hybrid between a tree structure and a $k$-ary-$n$-cube (see Figure 3.12). By placing interchange switches periodically in a $k$-ary-$n$-cube, the switching delay can be reduced from $\Theta(\sqrt[k]{N})$ to $\Theta(\log(N))$. Done properly, the transit latency remains $\Theta(\sqrt[3]{N})$. If we allow several different kinds of switching elements in the network, the size of each switching element can be limited to a fixed size.

3.1.8 Summary

Table 3.1 summarizes the major characteristics of the networks reviewed here. Asymptotically, at least, we see that fat trees and express cubes have the slowest-growing transit and switching latencies while maintaining the slowest resource growth. For a limited range of network sizes, flat multistage networks and $k$-ary-$n$-cubes may offer reasonable, or even superior, performance at reasonable hardware costs.
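To get a feel for how the growth rates quoted in Sections 3.1.1 through 3.1.7 compare, the following sketch evaluates each stated switching and transit term at a given $N$. All constant factors are set to 1, which is an assumption made purely for illustration, so only the relative growth is meaningful and the absolute values are not latencies. The function name, the default `dims=3`, and the default `radix=2` are hypothetical choices, and the sketch does not reproduce Table 3.1.

```python
import math


def growth_estimates(n: int, dims: int = 3, radix: int = 2) -> dict:
    """Return (switching term, transit term) per network, constants set to 1."""
    log_n = math.log2(n)
    cube_root = n ** (1.0 / 3.0)
    return {
        "fully connected": (0.0, float(n)),                   # T_s = 0, d ~ N
        "hypercube": (log_n, math.sqrt(n)),                   # log2(N), sqrt(N)
        # Transit term for the cube is the stated lower bound only.
        "k-ary-n-cube": (n ** (1.0 / dims), cube_root),
        "flat multistage": (log_n / math.log2(radix), math.sqrt(n)),
        "fat tree": (log_n, cube_root),
        "express cube": (log_n, cube_root),
    }


if __name__ == "__main__":
    for n in (64, 4096, 262144):
        print(f"N = {n}")
        for name, (switching, transit) in growth_estimates(n).items():
            print(f"  {name:16s} switching ~ {switching:8.1f}   transit ~ {transit:10.1f}")
```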

In document Robust, High-Speed Network Design for Large-Scale Multiprocessing (pages 37-43)