Engineering Reliable, Low-Latency Networks
3. Network Organization
3.1 Low-Latency Networks
3.1.1 Fully Connected Network
From the standpoint of latency, the optimal network is a fully-connected network in which every processor has a direct connection to every other processor (See Figure 3.1). Here, there is no switching latency (i.e. $T_s = 0$). The problem with this network, of course, is that the processor node size grows linearly with the size of the system. This is not practical for several reasons. We cannot build very large networks with bounded pin-out components, and a different component size is needed for each different network size. Using techniques from [Tho80] and [LR86], we find the interwiring resources will grow as $\Theta(N^{3})$. Wiring constraints alone require that the best packaging volume grows as $\Theta(N^{3})$, making, in the best case, the wiring distances, $d$, grow as $\Theta(N)$. Such an organization is not very practical.
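As a rough check on these growth rates, the bisection argument that Section 3.1.3 applies to the hypercube can be repeated for the fully connected case. This is only a sketch, not the derivation given in [Tho80] or [LR86]. The symbols $B$ (wires crossing a plane that splits the machine in half) and $\ell$ (side length of that plane, in wire widths) are introduced here purely for illustration, and $N$ is assumed to be even.

```latex
\begin{align*}
B(N) &= \frac{N}{2} \cdot \frac{N}{2} = \Theta(N^{2})
  && \text{every processor in one half connects to every processor in the other} \\
\ell &= \Theta\bigl(\sqrt{B(N)}\bigr) = \Theta(N)
  && \text{the crossing wires are spread over a two-dimensional plane} \\
V &= \Theta(\ell^{3}) = \Theta(N^{3})
  && \text{the same bound holds for an orthogonal bisecting plane} \\
d &= \Theta(\ell) = \Theta(N)
  && \text{wires must span the machine}
\end{align*}
```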
3.1.2 Full Crossbar
Next, we consider a full crossbar arrangement (See Figure 3.2). If we could build a large enough crossbar, we would only traverse one switching node between any source-destination pair. Unfortunately, our pin limitations (Section 2.7.1) will not allow us to build a single crossbar of arbitrary size. In practice, we would have to distribute the function across many different components as shown in Figure 3.3. This would incur $O(n)$ switching latency and require $O(n^{2})$ such switches.
Figure 3.1: Fully Connected Networks
Figure 3.2: Full 16×16 Crossbar
Figure 3.3: Distributed 16×16 Crossbar
3.1.3 Hypercube
We might consider building a hypercube network to exploit locality and distributed routing control. The switching latency is $\log_2(N)$ since we need to traverse at most one switching link in each dimension of the hypercube. Unfortunately, to maintain this characteristic, the switching node degree grows as $\Theta(\log(N))$. Node size soon runs into our pin limitations (Section 2.7.1), and a different size node is needed for each size of machine constructed. Additionally, when implemented in three-dimensional space, the interconnection requirements cause the machine volume to grow as $\Theta(N^{3/2})$. This result is also derivable from the techniques presented in [Tho80] and [LR86] by considering the number of wires which must cross through the middle of the machine in any decomposition. If we divide an $N$-processor machine in half, the number of wires crossing the bisecting plane will be $\Theta(N)$. If we distribute these wires in the two-dimensional plane dividing the two halves, then the plane is $\Theta(\sqrt{N})$ wire widths wide in each dimension. Considering that we get the same effect if we divide the machine via an orthogonal plane which also bisects the machine, we see that the machine is $\Theta(\sqrt{N})$ long in each dimension and hence the volume is $\Theta(N^{3/2})$. From this we can see that the transit distance, $d$, will generally grow as $\Theta(\sqrt{N})$.
Shown above is a 16 processor hypercube. (Drawing by Frederic Chong)
Figure 3.4: Hypercube
Figure 3.5: Mesh – $k$-ary-$n$-cube with $k = 2$
Making some compromises for practicality on the basic hypercube structure yields a number of derivative networks. The next two sections cover two major classes: multistage networks and $k$-ary-$n$-cubes.
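The $\log_2(N)$ switching latency bound for the hypercube in Section 3.1.3 can be illustrated with a small routing sketch. The code below assumes nodes are labeled with binary addresses and that a message corrects one differing address bit per dimension, in ascending dimension order. That ordering and the function name `hypercube_route` are illustrative assumptions, not something the text specifies.

```python
def hypercube_route(src: int, dst: int, dimensions: int) -> list[int]:
    """Return the nodes visited from src to dst in a binary hypercube.

    Assumption: one link is crossed per dimension in which the source and
    destination addresses differ, lowest dimension first.
    """
    path = [src]
    node = src
    for dim in range(dimensions):
        bit = 1 << dim
        if (node ^ dst) & bit:   # addresses differ in this dimension
            node ^= bit          # cross the link for this dimension
            path.append(node)
    return path


if __name__ == "__main__":
    n_processors = 16
    dimensions = n_processors.bit_length() - 1  # log2(16) = 4
    route = hypercube_route(0b0000, 0b1011, dimensions)
    print("route:", route, "links traversed:", len(route) - 1)
    # The worst case never exceeds one link per dimension.
    worst = max(
        len(hypercube_route(s, d, dimensions)) - 1
        for s in range(n_processors)
        for d in range(n_processors)
    )
    print("worst case links:", worst)
```

The number of links traversed equals the number of address bits in which source and destination differ, so it is bounded by the number of dimensions.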
3.1.4 k-ary-n-cube
For $k$-ary-$n$-cubes, we fix the dimension ($k$) to avoid the switching node size growth problem associated with the pure hypercube. We still get the locality and distributed routing. The switching latency grows as $O(\sqrt[k]{N})$ since there are at most $\sqrt[k]{N}$ routers which must be traversed in each dimension. Many popular $k$-ary-$n$-cube networks in use today set $k = 2$ or $k = 3$ to build mesh (See Figure 3.5) or cube (See Figure 3.6) structures [Dal87]. For these networks, the distances between components can be made uniformly short such that the switching latency dominates the transit latency. When constrained to three-dimensional space, larger values of $k$ will tend to have transit latencies which scale as $\Omega(\sqrt[3]{N})$. Toroidal $k$-ary-$n$-cubes can be used to cut the worst case switching latency in each dimension in half and avoid hot-spot problems in simple $k$-ary-$n$-cubes (See Figure 3.7) [DS86].
Shown above is a 27 processor cube network. (Drawing by Frederic Chong)
Figure 3.6: Cube – $k$-ary-$n$-cube with $k = 3$
Figure 3.7: Torus – $k$-ary-$n$-cube with $k = 2$ and Wrap-Around Torus Connections
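The effect of wrap-around connections on worst-case switching latency in Section 3.1.4 can be checked with a short sketch. It assumes dimension-by-dimension routing on a grid with `side` routers per dimension, which corresponds to $\sqrt[k]{N}$ in the text's notation, and counts hops between router coordinates. The function names and the brute-force search are illustrative choices only.

```python
from itertools import product


def hops(src, dst, side, torus):
    """Hops between two coordinate tuples, moving one dimension at a time."""
    total = 0
    for s, d in zip(src, dst):
        delta = abs(s - d)
        if torus:
            # A wrap-around link lets the message go the short way round.
            delta = min(delta, side - delta)
        total += delta
    return total


def worst_case_hops(side, dims, torus):
    """Largest hop count from a corner node to any other node."""
    origin = (0,) * dims
    return max(
        hops(origin, dst, side, torus)
        for dst in product(range(side), repeat=dims)
    )


if __name__ == "__main__":
    for side, dims in [(4, 2), (8, 2), (3, 3)]:
        mesh = worst_case_hops(side, dims, torus=False)
        torus = worst_case_hops(side, dims, torus=True)
        print(f"side={side} dims={dims}: mesh={mesh} torus={torus}")
```

For an 8-by-8 arrangement, this gives 14 hops without wrap-around and 8 with it, which is consistent with roughly halving the worst case in each dimension.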
Figure 3.8: 16×16 Omega Network Constructed from 2×2 Crossbars
3.1.5 Flat Multistage Networks
A multistage network distributes each hypercube routing element spatially so that fixed-degree switches can be used for routing. Like the hypercube, routing can occur in a distributed manner requiring only $\log_r(N)$ stages between any pair of nodes in the network. Here $r$ is a constant known as the radix which denotes the number of distinct directions to which each routing switch can route. Unlike the hypercube and $k$-ary-$n$-cube, the multistage network does not provide any locality. The number of switches required by a multistage network grows as $O(N \log(N))$. The best-case packaging volume grows as $\Theta(N^{3/2})$ and the transit latency grows as $\Theta(\sqrt{N})$ like the hypercube [LR86].
Quite a variety of networks can be classified as multistage networks including: Butterfly networks, Banyan networks, Bidelta networks [KS86], Benes networks, and Multibutterfly networks.
Figures 3.8 through 3.11 show some popular multistage networks. Each stage in these networks routes by successively subdividing the set of possible destinations into a number of equivalence classes equal to the radix of the routing components. For example, consider a radix-2 network.
When connections enter the network, any input can reach any destination. The first stage of routing components divides this class into two different equivalence classes based on desired destination.
Each succeeding network stage further subdivides a previous stage’s equivalence classes into two more equivalence classes. When there is a single destination in each equivalence class, the network has uniquely determined the desired destination and can connect to the destination endpoints. This successive subdivision can be easily seen in the network shown in Figure 3.9.
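The successive subdivision described above can be sketched in a few lines. The code assumes the number of endpoints is an exact power of the radix and that each equivalence class is a contiguous range of destination numbers. The actual grouping depends on how a particular network is wired, so the function `subdivide` is a hypothetical illustration rather than a model of any specific figure.

```python
def subdivide(num_endpoints: int, radix: int, dest: int) -> list[range]:
    """Return the equivalence class containing dest after each routing stage.

    Assumption: num_endpoints is a power of radix, and classes are
    contiguous ranges of destination numbers.
    """
    classes = []
    lo, size = 0, num_endpoints
    while size > 1:
        size //= radix                 # each stage splits a class radix ways
        digit = (dest - lo) // size    # which sub-class holds the destination
        lo += digit * size
        classes.append(range(lo, lo + size))
    return classes


if __name__ == "__main__":
    for stage, cls in enumerate(subdivide(16, 2, dest=11), start=1):
        print(f"stage {stage}: destinations {cls.start}..{cls.stop - 1}")
```

With 16 endpoints and radix 2, the class shrinks from 16 destinations to 8, 4, 2, and finally 1, taking $\log_2(16) = 4$ stages. The same call with radix 4 needs only 2 stages.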
3.1.6 Tree Based Networks
Properly constructed, a tree-based, multistage network avoids the major liabilities associated with the standard multistage networks. Specifically, we consider fat-tree networks as described in [Lei85] and [GL85] and shown in Figure 1.2. The switching delay remains $O(\log(N))$ as with hypercubes and multistage networks. Routing may occur in a distributed fashion. Unlike the multistage networks described above, the tree-based networks do allow locality exploitation. When the bandwidth between successive stages of the tree is chosen appropriately, the tree structures can be arranged efficiently in three-dimensional space; switching and wiring resources grow as $\Theta(N)$ and transit latency will grow as $\Theta(\sqrt[3]{N})$. While a tree-based network may have less cross-machine bandwidth than a hypercube with the same number of nodes, the tree-based machine requires a factor of $O(\log(N))$ less interconnect hardware. As a result, if one were to compare machines of the same size, taking into account three-dimensional space restrictions, the tree machine provides at least as much bandwidth while supporting a factor of $O(\log(N))$ more nodes. Leiserson shows that properly sized fat trees can efficiently perform any communication performed by any other similarly sized network [Lei85].
Figure 3.9: 16×16 Bidelta Network
Figure 3.10: Benes Network
Figure 3.11: 16×16 Multibutterfly Network
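The locality property of tree-based networks in Section 3.1.6 can be illustrated with a simplified model. The sketch below assumes a plain binary tree with processors at the leaves, numbered left to right, where a message climbs only as far as the lowest common ancestor of source and destination and then descends. It does not model link bandwidth, and the fat trees of [Lei85] may differ in radix and structure, so the function names and the switch-count formula apply to this assumed model only.

```python
def lca_level(src: int, dst: int) -> int:
    """Level of the lowest common ancestor (leaves are level 0).

    Assumption: leaves of a binary tree are numbered left to right, so the
    highest differing address bit determines how far up the message climbs.
    """
    return (src ^ dst).bit_length()


def switches_traversed(src: int, dst: int) -> int:
    """Switches visited: `level` on the way up, `level - 1` on the way down."""
    level = lca_level(src, dst)
    return 0 if level == 0 else 2 * level - 1


if __name__ == "__main__":
    for src, dst in [(0, 1), (8, 11), (0, 15)]:
        print(f"{src} -> {dst}: LCA level {lca_level(src, dst)}, "
              f"{switches_traversed(src, dst)} switches")
```

Nearby leaves communicate through a single switch, while the worst case across a 16-leaf tree climbs $\log_2(16) = 4$ levels, which matches the $O(\log(N))$ switching delay with shorter paths for local traffic.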
3.1.7 Express Cubes
Express cubes [Dal91] are a hybrid between a tree structure and a $k$-ary-$n$-cube (See Figure 3.12). By placing interchange switches periodically in a $k$-ary-$n$-cube, the switching delay can be reduced from $\Theta(\sqrt[k]{N})$ to $\Theta(\log(N))$. Done properly, the transit latency remains $\Theta(\sqrt[3]{N})$. If we allow several different kinds of switching elements in the network, the size of each switching element can be limited to a fixed size.
3.1.8 Summary
Table 3.1 summarizes the major characteristics of the networks reviewed here. Asymptotically, at least, we see that fat trees and express cubes have the slowest growing transit and switching latencies while maintaining the slowest resource growth. For a limited range of network sizes, flat multistage networks and