Performance Evaluation of COWs under Real Parallel Applications*

J. C. Sancho, J. C. Martínez, A. Robles, P. López, J. Flich and J. Duato

Departamento de Informática de Sistemas y Computadores

Universidad Politécnica de Valencia, P.O.B. 22012, 46071 - Valencia, SPAIN

E-mail: {jcsancho,jc,arobles,plopez,jflich,jduato}@gap.upv.es

Abstract

Clusters of workstations (COWs) are often arranged as a switch-based network with irregular topology. Usually, the evaluation of interconnection networks for COWs has been carried out by simulation using synthetic traffic and traces from real parallel applications. Although both types of traffic are used as a first approximation of the behavior of the system, a more accurate picture can be obtained by using real parallel applications. In this paper, a new simulation framework has been developed in order to evaluate interconnection networks under real parallel applications by using an execution-driven simulator. Moreover, the new simulator can be used to evaluate the impact on whole-system performance of several design parameters in addition to the interconnection network. Evaluation results show that the execution time of real parallel applications can be reduced by using an effective routing algorithm. Moreover, in some cases, the achieved improvements are higher than those achieved by improving other design issues, such as the processor instruction issue rate, the cache size or the network bandwidth.

Keywords: COWs, irregular topologies, routing algorithms, real parallel applications, execution-driven simulation.

1. Introduction

Clusters of workstations (COWs) have become a cost-effective alternative to parallel computers, not only for high-performance scientific computing, but mostly as a platform for high-end servers. The interconnection networks of COWs are based on switch interconnects to which the workstations are connected. These switch-based interconnects provide the wiring flexibility, scalability, and incremental expansion required in this environment. Some commercial interconnects for COWs are Myrinet [2], Servernet II [7], Gigabit Ethernet [18], and InfiniBand [10].

*This work was supported by the Spanish CICYT under Grant TIC2000-1151-C07 and by Generalitat Valenciana under Grant GV00-131-14.

In current COWs, the topology is defined by the customer. Switch designs must be flexible enough to support any topology with a degree bounded by the number of switch ports. Often, the connections between switches do not follow any regular pattern. The resulting topologies are referred to as irregular topologies. A generic routing algorithm that is able to efficiently forward packets between any pair of hosts of the network should be used. Up*/down* is one of the best-known routing algorithms.

In up*/down* routing [17], a breadth-first search (BFS) spanning tree is computed. This algorithm is quite simple, and has the property that all the switches in the network will eventually agree on a unique spanning tree. A direction (“up” or “down”) is assigned to each network link, based on its position in the spanning tree, and messages are routed through sequences of “up” or “down” channels. The “down” → “up” transition is forbidden in order to avoid deadlocks. As a consequence, up*/down* routing does not always supply minimal paths between non-adjacent switches.

Several solutions have been proposed in order to improve the up*/down* routing scheme, such as the Minimal Adaptive routing [21], the In-transit Buffer [6] and the DFS methodology [16]. The Minimal Adaptive routing [21] increases adaptivity and provides minimal paths in most cases. This algorithm requires the use of two virtual channels in order to avoid deadlocks. One virtual channel is used to route packets through minimal paths and the other one provides an escape path using the up*/down* routing. On the other hand, the In-transit Buffer [6] and the DFS methodology [16] increase the number of minimal paths without requiring the use of virtual channels.

In the In-transit Buffer mechanism [6], all the minimal paths are allowed by absorbing the messages in those intermediate nodes of the path where there is a forbidden transition (“down” → “up”) according to the up*/down* routing algorithm. On the other hand, the DFS methodology [16] proposes a new way to compute the up*/down* routing tables that makes a different assignment of direction (“up” or “down”) to links in order to increase the number of minimal paths followed by the messages. This methodology is based on obtaining a depth-first search (DFS) spanning tree instead of the BFS spanning tree used in the original methodology.

Usually, the evaluation of routing algorithms has been carried out under synthetic traffic loads, such as the bit-reversal and matrix transpose traffic distributions [5, 6, 16, 21], or using traces from real parallel applications [22, 21].

Although both types of traffic are used as a first approximation of the behavior of the system, a more accurate behavior can be obtained by using real parallel applications. Several works [23, 14, 5] have addressed this issue in the context of MPPs, using interconnection networks with regular topologies. In [23], the influence of virtual channels and adaptive routing on the performance of parallel applications is analyzed, whereas [14] focuses on the influence of the router arbitration units.

The execution-driven simulator RSIM [13] makes it possible to evaluate the behavior of multiprocessor systems under real parallel applications. However, RSIM cannot model the behavior of COWs. On the other hand, the NETSIM simulator [6, 22, 16] has been widely used to model the interconnection networks of COWs at the register transfer level, although only under synthetic traffic patterns.

In this paper, a new simulation framework is presented in order to model the behavior of distributed shared-memory COWs with a hardware cache-coherent protocol under real parallel applications. The new simulator replaces the simple interconnection network provided by RSIM with the NETSIM simulator. By using the new simulator, the behavior of any routing algorithm can be evaluated under different real parallel applications.

Moreover, the new simulator is able to evaluate the influence on performance not only of the routing algorithm, but also of different system design issues, such as the processor instruction issue rate, the cache size, or the network bandwidth. In this paper, we show that simply modifying the routing tables (i.e. using DFS instead of BFS to compute them) is enough to reduce the execution time of the parallel applications running on a COW. Furthermore, in certain cases, the achieved improvement in the execution time is greater than the one achieved by hardware improvements.

The rest of the paper is organized as follows. Section 2 briefly describes the execution-driven simulator RSIM. Then, Section 3 describes the interconnection network simulator NETSIM. The new simulation framework created by connecting both simulators is presented in Section 4. Section 5 describes the DFS and BFS methodologies to compute the up*/down* routing tables. In Section 6, performance evaluation results under real parallel applications are shown. Finally, in Section 7 some conclusions are drawn.

Figure 1. RSIM block diagram. (Diagram labels: processor, L1 cache, write buffer (WB), L2 cache, bus, memory, directory, network interface, network.)

2. RSIM Simulator

The RSIM simulator (Rice Simulator for ILP Multiprocessors) [13] is an execution-driven simulator developed at Rice University, primarily designed to analyze shared-memory multiprocessor architectures. Compared to other current publicly available shared-memory simulators [8, 11], the key advantage of RSIM is that it supports a processor model that aggressively exploits instruction level parallelism (ILP), is representative of state-of-the-art and near-future processors, and is highly configurable.

Figure 1 shows the memory and system organization of RSIM. RSIM simulates a hardware cache-coherent distributed shared memory system (CC NUMA) and a relaxed memory ordering memory consistency protocol [9]. Each processing node consists of a processor, a two-level cache hierarchy (with coalescing write buffer), a portion of the system distributed physical memory and its associated directory, and a network interface. A pipelined split-transaction bus connects the secondary cache, the memory and directory modules, and the network interface. The network interface connects the node to a multiprocessor interconnection network for remote communication.

The network interface module is depicted in Figure 2. It consists of two input and two output ports to send/receive messages to/from the network switch, which are associated with two types of messages, Request and Reply. Request messages are sent by the coherence protocol to invalidate or replace blocks in the cache, and Reply messages are generated as an answer to the Request messages. Request and Reply messages are sent on separate virtual channels to avoid deadlocks.

3. NETSIM Simulator

The NETSIM simulator is an event-driven simulator that models the interconnection network of COWs at the register transfer level. This simulator has been used in previous works of our research group [6, 16, 22]. It models wormhole switching [4] and deterministic source routing, as in Myrinet networks [2]. Each host is connected to the network through a network interface card (NIC). The NIC contains the routing table, which is determined by the routing scheme in order to provide the path to reach the destination node in the network. Virtual channels are not supported.

Each network switch has a crossbar that allows multiple messages to be transmitted simultaneously from the input ports to the output ports without interference. The crossbar arbiter that selects the outgoing channel processes one message header at a time. The arbiter is assigned to waiting messages in a demand-slotted round-robin fashion. If the required output channel is busy, the message must wait in the input buffer until its next turn.

A hardware “stop and go” flow control protocol [2] is used to prevent packet loss. In this protocol, the receiving switch transmits a stop control flit when its input buffer fills over 56 bytes, and a go control flit when it empties below 40 bytes. The slack buffer size is fixed at 80 bytes.
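The hysteresis in this flow control can be illustrated with a minimal sketch. The class name `StopGoBuffer` and its methods are hypothetical and are not part of NETSIM; only the three byte thresholds come from the text, and the sketch assumes "fills over" and "empties below" mean strictly greater than and strictly less than.

```python
class StopGoBuffer:
    """Toy model of a slack buffer with stop-and-go flow control.

    Thresholds are taken from the text; everything else is an
    illustrative assumption, not the NETSIM implementation.
    """

    CAPACITY = 80   # slack buffer size in bytes
    STOP_MARK = 56  # send "stop" when occupancy rises above this
    GO_MARK = 40    # send "go" when occupancy falls below this

    def __init__(self):
        self.occupancy = 0
        self.stopped = False

    def _control_flit(self):
        # Emit a control flit only when the state actually changes.
        if not self.stopped and self.occupancy > self.STOP_MARK:
            self.stopped = True
            return "stop"
        if self.stopped and self.occupancy < self.GO_MARK:
            self.stopped = False
            return "go"
        return None

    def push(self, nbytes=1):
        """Store incoming bytes; return a control flit name or None."""
        if self.occupancy + nbytes > self.CAPACITY:
            raise OverflowError("slack buffer overflow")
        self.occupancy += nbytes
        return self._control_flit()

    def pop(self, nbytes=1):
        """Drain bytes toward the crossbar; return a control flit name or None."""
        if nbytes > self.occupancy:
            raise ValueError("not enough bytes buffered")
        self.occupancy -= nbytes
        return self._control_flit()
```

The gap between the two marks keeps the receiver from toggling between stop and go on every flit, and the space above the stop mark absorbs flits already in flight on the cable.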

Values for temporal parameters from Myrinet switches are used. In particular, the latency through the switch for the first flit is 150 ns, and after transmitting the first flit, the switch transfers data at the link rate of 6.25 ns per flit. The clock cycle is 6.25 ns. Flits are one byte wide and the physical channel is also one flit wide. For the Myrinet links, we assume short LAN cables to interconnect switches and hosts. These cables are 10 meters long and have a delay of 4.92 ns/m. Transmission of data across channels is pipelined [20].
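To give a feel for these parameters, the following sketch estimates a contention-free message latency. The formula is a simplified assumption made here for illustration (header delay through every switch and cable, plus one flit time per byte); it is not the timing model implemented in NETSIM, and the function names are hypothetical.

```python
SWITCH_HEADER_NS = 150.0     # first-flit latency through a switch
FLIT_TIME_NS = 6.25          # time per one-byte flit at the link rate
CABLE_LENGTH_M = 10.0        # short LAN cable
CABLE_DELAY_NS_PER_M = 4.92  # propagation delay per meter


def link_delay_ns():
    """Propagation delay of one cable."""
    return CABLE_LENGTH_M * CABLE_DELAY_NS_PER_M


def zero_load_latency_ns(num_switches, message_bytes):
    """Rough latency of a message crossing `num_switches` switches.

    Assumes no contention and that a path through s switches uses
    s + 1 cables (host to switch, switch to switch, switch to host).
    """
    num_links = num_switches + 1
    header = num_switches * SWITCH_HEADER_NS + num_links * link_delay_ns()
    body = message_bytes * FLIT_TIME_NS
    return header + body
```

Under these assumptions each cable adds about 49.2 ns, and a 64-byte message crossing two switches takes roughly 848 ns, of which 400 ns is serialization of the message itself.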

4. Connecting RSIM and NETSIM Simulators

The new simulator replaces the interconnection network provided by RSIM with the NETSIM simulator. Figure 2 depicts the interconnection between both simulators.

The RSIM simulator executes the program code of the parallel applications compiled and linked using SPARC programming tools. A cache controller handles memory accesses issued from the local processor as well as coherence commands. As a consequence, the cache controller both generates messages that are injected into the network and processes messages received from the network. These messages are injected into or ejected from the network through an internal bus.

Figure 2. Interconnection between RSIM and NETSIM. (Diagram labels: RSIM network interface with Request and Reply input and output ports; multiplexor; demultiplexor; NETSIM NIC and switch.)

The SMNET module of RSIM handles communication between the internal bus and the network interface. This module receives messages destined for the network from the internal bus, injects them into the appropriate network ports and initiates the communication, and handles incoming messages from the network by removing them from the network port and delivering them to the bus. These actions are managed by scheduling an appropriate event. The events ReqSendSemaWait and ReplySendSemaWait are used for the Request and Reply output ports, respectively. These events ensure that there is sufficient space in the network interface buffers before creating the packets and initiating the communication. Also, the events ReqRcvSemaWait and ReplyRcvSemaWait are used for receiving messages from the network. These events wait on semaphores associated with the network output ports to receive messages. As soon as a message is received, it is forwarded to the appropriate bus port, depending on its type, Request or Reply. The bus will actually deliver the message to the cache controller.

In order to link this module to NETSIM, the SMNET module composes the messages according to the Myrinet network by adding the appropriate message headers. Then, it places the message in the corresponding output queue and informs NETSIM that it has a pending message to be injected into the network.

On the other hand, messages received from the network are placed by NETSIM in the corresponding input queue of the NIC according to the type of the message (Request or Reply).

As NETSIM does not support virtual channels, the Request and Reply output ports are multiplexed through the same input channel of the network switch. Waiting messages in the output ports are served in a demand-slotted round-robin fashion in order to balance these output ports uniformly.
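A demand-slotted round-robin policy of this kind can be sketched as follows. The class `DemandSlottedMux` and its method names are hypothetical; the sketch only assumes that an empty port does not consume a slot, which is the usual meaning of "demand-slotted".

```python
from collections import deque


class DemandSlottedMux:
    """Round-robin selection over output ports, skipping empty ones.

    Illustrative sketch only; names and structure are assumptions,
    not the code of the simulator described in the paper.
    """

    def __init__(self, ports=("Request", "Reply")):
        self.order = list(ports)
        self.queues = {name: deque() for name in self.order}
        self.next_idx = 0  # port that gets the first look next time

    def enqueue(self, port, message):
        self.queues[port].append(message)

    def select(self):
        """Return (port, message) for the next message, or None if idle."""
        n = len(self.order)
        for offset in range(n):
            idx = (self.next_idx + offset) % n
            port = self.order[idx]
            if self.queues[port]:
                # The port after the one just served goes first next time.
                self.next_idx = (idx + 1) % n
                return port, self.queues[port].popleft()
        return None
```

With two Request messages and one Reply message waiting, the selection order alternates between the ports for as long as both have traffic, and then drains whichever port still holds messages.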

In order to guarantee deadlock freedom between Request and Reply messages, infinite queues are considered at the input ports of the network interface [3]. These input queues can be implemented in a real network by using the memory available at network interfaces. If this memory overflows, the virtual memory of the processor nodes can be used, at the expense of increased latency.

Time is globally managed by RSIM. In order to model the differences in speed between the local processors and the interconnection network, RSIM executes NETSIM once every n cycles and processes the events at that time. This number (n) represents the relative speed between the network and processor clock frequencies.
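The time coupling described above amounts to a loop in which the processor side advances every cycle and the network side advances every n-th cycle. The sketch below is a hypothetical illustration: `processor_step` and `network_step` are placeholder callbacks, not RSIM or NETSIM functions, and running the network on cycles that are multiples of n is an assumption.

```python
def run_cosimulation(total_cycles, n, processor_step, network_step):
    """Advance the processor every cycle and the network every n cycles.

    Returns how many times the network simulator was invoked.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    network_calls = 0
    for cycle in range(total_cycles):
        processor_step(cycle)
        if cycle % n == 0:
            # One network clock tick per n processor clock ticks.
            network_step(cycle)
            network_calls += 1
    return network_calls
```

For example, with n = 4 and 12 processor cycles, the network callback runs on cycles 0, 4 and 8.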

The following functions are implemented in RSIM to communicate with NETSIM:

Init simulator. It initializes data structures and statistics.

Send message. This function is used to inform NETSIM that it has a pending message to be injected into the network.

Compute cycle. NETSIM can process all the events scheduled for that time.

End simulation. It is called when RSIM simulation is finished to inform NETSIM that it has to collect all the network statistics.

On the other hand, NETSIM uses the following functions to synchronize with RSIM:

Check port. This function determines whether the input port of RSIM can accept new messages. Otherwise, messages have to wait at the input queues of the network interface.

Free port. This function frees the corresponding RSIM output port, ready to accept new messages.

Receive message. This function informs RSIM that it has a message to process in its input ports.

Figure 3. Generated BFS spanning tree and assignment of directions to links for a 9-switch network. (Diagram labels: switches a to i, root node, “up” direction.)

5. Up*/down* Routing

Up*/down* routing is the most popular routing scheme currently used in commercial networks, such as Myrinet [2]. In order to compute the up*/down* routing tables, different methodologies can be applied. These methodologies are based on an assignment of direction (“up” or “down”) to the operational links in the network by building a spanning tree. These methodologies differ in the type of spanning tree to be built. The original methodology is based on BFS spanning trees (BFS methodology), as proposed in Autonet [17], whereas an alternative methodology is based on DFS spanning trees (DFS methodology), as recently proposed in [16].

The up*/down* routing algorithm handles deadlocks by restricting routing in such a way that cyclic channel dependencies are avoided. In order to avoid deadlocks while still allowing all links to be used, up*/down* routing uses the following rule: a legal route must traverse zero or more links in the “up” direction followed by zero or more links in the “down” direction. Thus, a message cannot traverse a link along the “up” direction after having traversed one in the “down” direction.
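The routing rule can be expressed as a short check over the sequence of link directions along a route. This is an illustrative sketch; the function name and the encoding of a route as a list of "up"/"down" strings are assumptions made here.

```python
def is_legal_updown(directions):
    """Return True if a route obeys the up*/down* rule.

    `directions` lists, in order, the direction ("up" or "down") in
    which each link of the route is traversed.
    """
    seen_down = False
    for direction in directions:
        if direction == "down":
            seen_down = True
        elif direction == "up":
            if seen_down:
                # Forbidden "down" -> "up" transition.
                return False
        else:
            raise ValueError("unknown direction: %r" % (direction,))
    return True
```

An empty route, an all-"up" route and an all-"down" route are all legal, since the rule allows zero links in either phase.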

5.1. BFS Methodology

This methodology assigns the directions (“up” or “down”) to links by building a BFS spanning tree.

First, to compute a BFS spanning tree, a switch must be chosen as the root. Starting from the root, the rest of the switches in the network are arranged on a single spanning tree [17]. Then, an assignment of direction (“up” or “down”) to links is performed. The “up” end of each link is defined as: 1) the end whose switch is closer to the root in the spanning tree; 2) the end whose switch has the lower identifier, if both ends are in switches at the same tree level. The result of this assignment is that each cycle in the network has at least one link in the “up” direction and one link in the “down” direction. Figure 3 shows the BFS spanning tree and the link direction assignment for a 9-switch network.
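The two rules for the “up” end can be sketched directly from BFS tree levels. The adjacency-dictionary representation and the function names are assumptions of this sketch, and the network is assumed to be connected.

```python
from collections import deque


def bfs_levels(adj, root):
    """Tree level of every switch in a BFS spanning tree rooted at `root`.

    `adj` maps each switch identifier to an iterable of its neighbors.
    """
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def up_end(u, v, level):
    """Return the switch at the "up" end of the link between u and v."""
    if level[u] != level[v]:
        # Rule 1: the end closer to the root.
        return u if level[u] < level[v] else v
    # Rule 2: same level, so the lower identifier wins.
    return min(u, v)
```

Traversing a link toward the switch returned by `up_end` is a move in the “up” direction; traversing it away from that switch is a move in the “down” direction.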

5.2. DFS Methodology

As in the BFS methodology, an initial switch must be chosen as the root before starting the computation of the DFS spanning tree. The root is selected by using heuristic rules [15]. For instance, the switch with the highest average topological distance to the rest of the switches will be selected as the root node. The rest of the switches are added to the DFS spanning tree following a recursive procedure. Unlike in the BFS spanning tree, switches are added by using heuristic rules. We apply the heuristic rule recently proposed in [15]. Starting from the root switch, the switch with the highest number of links connecting to switches that already belong to the tree is selected as the next switch in the tree. In case of a tie, the switch with the highest average topological distance to the rest of the switches will be selected first.
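The root-selection heuristic mentioned above can be sketched as follows. Only the criterion (highest average topological distance) comes from the text; the function names, the adjacency-dictionary input, and breaking ties by the lowest identifier are assumptions of this sketch, and a connected network with at least two switches is assumed.

```python
from collections import deque


def average_distance(adj, src):
    """Average hop distance from `src` to every other switch."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for switch, d in dist.items() if switch != src]
    return sum(others) / len(others)


def select_root(adj):
    """Pick the switch with the highest average distance to the rest.

    Ties go to the lowest identifier (an assumption, since the text
    does not say how root ties are resolved).
    """
    return max(sorted(adj), key=lambda s: average_distance(adj, s))
```

On a chain of five switches, for instance, the two end switches have the highest average distance, and this sketch returns the one with the lower identifier.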

Next, in order to assign directions to links, switches in the network must be labeled with positive integer numbers. When assigning directions to links, the “up” end of each link is defined as the end whose switch has a higher label.

Figure 4 shows the new link direction assignment for the same network graph depicted in Figure 3.

It has been shown that the DFS methodology [16] provides more minimal paths than the BFS one, resulting in a significant increase in network performance [15].
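One way to compare two direction assignments on a given topology is to count how many source-destination pairs still have a legal route of minimal length. The sketch below is an illustrative assumption, not the evaluation procedure of [15] or [16]: `up_end` is a caller-supplied function returning the “up” end of a link, so either the BFS rules or the DFS labels can be plugged in.

```python
from collections import deque


def shortest_hops(adj, src):
    """Plain BFS hop counts from src, ignoring routing restrictions."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def legal_hops(adj, up_end, src):
    """Shortest hop counts from src over routes obeying up*/down*.

    Search state is (switch, gone_down): once a "down" link has been
    used, no further "up" link may be taken.
    """
    start = (src, False)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u, gone_down = queue.popleft()
        for v in adj[u]:
            going_up = up_end(u, v) == v  # moving toward the "up" end
            if going_up and gone_down:
                continue  # forbidden "down" -> "up" transition
            state = (v, gone_down or not going_up)
            if state not in dist:
                dist[state] = dist[(u, gone_down)] + 1
                queue.append(state)
    best = {}
    for (switch, _), d in dist.items():
        best[switch] = min(d, best.get(switch, d))
    return best


def minimal_path_fraction(adj, up_end):
    """Fraction of ordered switch pairs whose best legal route is minimal."""
    pairs = 0
    minimal = 0
    for src in adj:
        free = shortest_hops(adj, src)
        legal = legal_hops(adj, up_end, src)
        for dst in adj:
            if dst == src:
                continue
            pairs += 1
            if legal.get(dst) == free[dst]:
                minimal += 1
    return minimal / pairs
```

On a five-switch ring with BFS-style directions rooted at switch 0, for example, the two switches farthest from the root cannot reach the opposite side of the ring through the link that joins them without a “down” to “up” transition, so some pairs lose their minimal route.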

Figure 4. Generated DFS spanning tree and assignment of directions to links for the same network of Figure 3. (Diagram labels: switches a to i with labels 0 to 8, root node, “up” direction.)

6. Performance Evaluation

This section is mainly intended to illustrate the application of the proposed simulation framework, analyzing the influence of using different procedures to compute the routing tables on the execution time of some parallel applications.

To this end, we evaluate by simulation the behavior of the DFS methodology to compute the up*/down* routing tables (which will be referred to as UD_DFS) for COWs. For comparison purposes, we have also evaluated the original BFS methodology (which will be referred to as UD_BFS). The evaluation is performed under the traffic generated by real parallel applications by using the new simulation framework, as described in Section 4.

The proposed simulation framework can also be used to analyze the impact on overall system performance of different hardware improvements when a realistic interconnection network model is used. As an example, we show the improvement in system performance resulting from increasing the processor instruction issue rate, the cache size, or the network bandwidth.

6.1. Network and Traffic Model

Several irregular network topologies have been randomly generated. Network sizes of 16 and 32 switches have been evaluated. The evaluation results shown in this paper correspond to the topologies that exhibit an average behavior under synthetic traffic [16].

We consider that there are 64 processors in the system for all the evaluated network sizes. Hence, every switch is connected to four and two processors for 16-switch and 32-switch networks, respectively, leaving four ports to connect to other switches. The larger the number of switches, the longer the network distances. This allows us to analyze how the distances traveled by the messages influence the relative behavior of the evaluated routing strategies.

Moreover, we consider the default RSIM parameters, such as an L1 cache size of 16 KBytes and a processor instruction issue rate of 4. Also, we use an L2 cache size of 64 KBytes and a processor-to-network clock frequency ratio of 4. As we use temporal parameters taken from the Myrinet network, which runs at 160 MHz, the simulated processor clock frequency is 640 MHz. These processor parameters are close to those of the Alpha 21164 processor [1] (4-way instruction issue, 8 KBytes L1 cache, 96 KBytes unified L2 cache, and 667 MHz clock frequency).
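The clock figures follow from the 6.25 ns network cycle given in Section 3 and the frequency ratio of 4. A small sketch of the arithmetic, with hypothetical names:

```python
NETWORK_CYCLE_NS = 6.25  # Myrinet clock cycle used by NETSIM
CLOCK_RATIO = 4          # processor cycles per network cycle


def clock_mhz(cycle_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / cycle_ns


network_mhz = clock_mhz(NETWORK_CYCLE_NS)
processor_mhz = CLOCK_RATIO * network_mhz
```

This gives 160 MHz for the network and 640 MHz for the simulated processor, matching the values quoted above.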

The real parallel applications used in the simulations are selected from the SPLASH [12] and SPLASH-2 [19] suites. We consider only the parallel applications FFT and MP3D. These two applications are selected since they show low data locality (so they are very network demanding) and have a reasonable execution time. These applications have also been considered in other similar works [14].

Figure 5. Evaluation of up*/down* routing algorithms based on the DFS methodology (UD_DFS) and BFS methodology (UD_BFS). Panels: (a) FFT, 16 switches; (b) MP3D, 16 switches; (c) FFT, 32 switches; (d) MP3D, 32 switches. Each panel plots traffic (bytes/ns/processor) against time (e+06 ns).

The FFT application transforms a set of $n$ complex data points, which are organized as $\sqrt{n}/p \times \sqrt{n}/p$ matrices partitioned so that each of the $p$ processors is assigned a contiguous set of rows, which are allocated to its local memory. Each processor transposes a contiguous submatrix from every other processor, and transposes one submatrix locally. We considered the default problem size of 64K complex data points in order to reduce the simulation time.

The MP3D application is a particle-based wind tunnel simulation used for aeronautical tests. This application divides the space into cells in order to calculate the interaction between particles. We selected the default problem size, 50,000 particles over a geometry of 2353 cells.

6.2. Evaluating DFS vs. BFS methodologies

Figure 5 shows the simulation results corresponding to the up*/down* routing algorithms based on the DFS and BFS methodologies, for parallel applications FFT and MP3D and network sizes of 16 and 32 switches. As can be seen, the DFS methodology significantly reduces the execution time of both parallel applications. In particular, UD_DFS reduces the execution time of the FFT and MP3D applications by 6 % and 15 %, respectively, for a 16-switch network (see Figures 5.a and 5.b, respectively). As network size increases, a greater reduction in the execution time is achieved. Specifically, for a 32-switch network, the execution time of the FFT application is decreased by 11 %, whereas the MP3D application execution time is decreased by 18 % (see Figures 5.c and 5.d, respectively).

The reduction in the execution time achieved by the DFS methodology is due to the greater capacity of the network to absorb the bursty traffic generated by parallel applications. This effect is highlighted by the higher traffic values supported by DFS with respect to BFS. For instance, in a 32-switch network and for the MP3D application (Figure 5.d), UD_DFS achieves a traffic value of 0.1565 bytes/ns/processor at 25e+06 ns of simulation time. This is 70 % higher than the one obtained for UD_BFS (0.0736 bytes/ns/processor). As a consequence, the time required to move a certain amount of data through the network is considerably reduced, which progressively decreases the execution time of the parallel applications.

Figure 6. Evaluation of up*/down* routing algorithms based on the DFS and BFS methodologies with processor instruction issue rate of 4 (UD_BFS_S1 and UD_DFS_S1, respectively) and 8 (UD_BFS_S2). 16-switch network. Panels: (a) FFT; (b) MP3D. Each panel plots traffic (bytes/ns/processor) against time (e+06 ns).

For the MP3D application, the DFS methodology achieves higher improvements than the ones achieved for the FFT application because of the higher network demands of MP3D. As a consequence, this application takes more advantage of using the more effective up*/down* routing tables provided by the DFS methodology.

Despite the fact that a different simulation framework is used, the conclusions are similar in general to those obtained in [23] (i.e., improving the routing scheme has a relatively low influence on the execution time of parallel applications). However, unlike in [23], in this work we have analyzed the impact of improving routing tables on the performance of some parallel applications in the context of COWs with irregular network topologies without virtual channels.

In this context, the benefits of improving the routing scheme are often greater than in regular networks. This is because in irregular networks improving the routing scheme additionally contributes to reducing the distances traveled by the messages.

6.3. Modifying routing tables vs. improving hardware

Figure 6 shows the simulation results for the up*/down* routing algorithms based on the DFS and BFS methodologies when using processor instruction issue rates of 4 (system configuration S1) and 8 (system configuration S2) for parallel applications FFT and MP3D in a 16-switch network. Increasing the issue rate helps to exploit ILP.

As expected, increasing the processor instruction issue rate reduces the execution time of parallel applications, since the processor can execute more instructions per cycle. In particular, UD_BFS_S2 reduces the execution time by 2 % with respect to UD_BFS_S1 for both parallel applications. However, UD_DFS_S1 achieves a higher reduction in the execution time, by 6 % and 15 % for the FFT and MP3D applications, respectively, despite the fact that processors with a lower instruction issue rate are used. Therefore, in this case, applying the DFS methodology to compute the up*/down* routing tables yields greater performance than using more powerful processors.

Figure 7. Evaluation of up*/down* routing algorithms based on the DFS and BFS methodologies with relative speed of processor with respect to network of 4 (UD_BFS_S1 and UD_DFS_S1, respectively) and 3 (UD_BFS_S2). 32-switch network. Panels: (a) FFT; (b) MP3D. Each panel plots traffic (bytes/ns/processor) against time (e+06 ns).

Figure 8. Evaluation of up*/down* routing algorithms based on the DFS and BFS methodologies with secondary cache size of 32 KBytes (UD_BFS_S1 and UD_DFS_S1, respectively) and 64 KBytes (UD_BFS_S2). 16-switch network. Panels: (a) FFT; (b) MP3D. Each panel plots traffic (bytes/ns/processor) against time (e+06 ns).

On the other hand, Figure 7 shows the simulation results for the up*/down* routing algorithms based on the DFS and BFS methodologies when increasing the network clock frequency by 33 % in a 32-switch network for parallel applications FFT and MP3D. This study tries to illustrate the fact that the network bandwidth is becoming closer to the processor bandwidth. This is the case of 4X InfiniBand switches [10]. This analysis is performed considering that the relative speed of the processor with respect to the network is 3 (system configuration S2) instead of 4 (system configuration S1), which was assumed in the previous analysis.
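The 33 % figure follows from the change in relative speed, assuming the processor clock is held fixed while the ratio drops from 4 to 3. A short sketch of that arithmetic, with a hypothetical function name:

```python
def network_speedup(old_ratio, new_ratio):
    """Relative increase in network clock frequency.

    Assumes the processor clock is unchanged, so the network frequency
    is inversely proportional to the processor-to-network ratio.
    """
    return old_ratio / new_ratio - 1.0


speedup = network_speedup(4, 3)  # about 0.333, i.e. roughly 33 %
```

With the 640 MHz processor clock used earlier, a ratio of 3 would correspond to a network clock of about 213 MHz instead of 160 MHz.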

As can be seen in Figure 7.a for the FFT application, when using a higher speed network, UD_BFS_S2 achieves a reduction in the execution time of 11 % with respect to UD_BFS_S1. However, this improvement is similar to the one achieved by UD_DFS_S1, despite using a slower network. On the other hand, for the MP3D application (Figure 7.b), UD_BFS_S2 achieves a greater reduction in the execution time (about 23 %) than that achieved by UD_DFS_S1 (18 %). In this case, using more effective routing tables is not enough to offset the disadvantage of using a lower speed network. This is because of the greater network demand exhibited by the MP3D application.

Finally, Figure 8 shows the simulation results for up*/down* routing algorithms based on the DFS and BFS methodologies when using L2 cache sizes of 32 KBytes (system configuration S1) and 64 KBytes (system configuration S2) in a 16-switch network for the parallel applications FFT and MP3D. As is well known, increasing the L2 cache size reduces the miss rate. Therefore, a smaller number of messages requesting remote data blocks will be generated, thus decreasing the traffic in the network. In particular, the performance improvement of UD_BFS_S2 with respect to UD_BFS_S1 is 1.5 % and 14.3 % for the FFT and MP3D applications, respectively. However, a higher reduction in the execution time of the analyzed parallel applications can be achieved, without having to increase the cache size, by simply replacing the up*/down* routing tables with those provided by the DFS methodology. In particular, the execution time is decreased by 5.2 % and 23.5 % for the FFT and MP3D applications, respectively.

Figure 9. Evaluation of up*/down* routing algorithm based on the DFS methodology with (a) processor instruction issue rate of 4 (S1) and 8 (S2), FFT, 16 switches; (b) relative speed of processor with respect to network of 4 (S1) and 3 (S2), FFT, 32 switches; and (c) secondary cache size of 32 KBytes (S1) and 64 KBytes (S2), MP3D, 16 switches. Each panel plots traffic (bytes/ns/processor) against time (e+06 ns) for UD_DFS_S1 and UD_DFS_S2.

These results show that some parallel applications may benefit from using effective routing tables more than they benefit from the lower network traffic demand achieved by the use of larger L2 cache sizes.

Despite the fact that the use of more effective routing tables allows execution time of parallel applications to be reduced, parallel applications continue to noticeably benefit from the hardware improvements, as can be seen in Figure 9. This figure shows the performance evaluation results when both improvements in routing and hardware are applied simultaneously.

When using the DFS methodology, an increase in the processor instruction issue rate from 4 to 8 instructions per cycle causes an additional reduction in the execution time of 3.4 % for the FFT application in a 16-switch network, as can be seen in Figure 9.a. Moreover, when using a network that is 33 % faster, an additional improvement of 9.6 % is achieved in a 32-switch network for the same application (Figure 9.b). Finally, by increasing the cache size from 32 KBytes to 64 KBytes, the execution time of the MP3D application can be additionally reduced by 5 % (Figure 9.c).
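The 33 % figure is consistent with the change in the relative speed of the processor with respect to the network from 4 (S1) to 3 (S2) given in the caption of Figure 9. The sketch below works through that arithmetic under the assumption that the processor speed stays fixed, so that the network speed is inversely proportional to the ratio; the function name is hypothetical.

```python
from fractions import Fraction


def network_speedup_percent(ratio_before: int, ratio_after: int) -> float:
    """Percent increase in network speed when the processor-to-network
    speed ratio drops from ratio_before to ratio_after.

    Assumes a fixed processor speed, so network_speed = processor_speed / ratio.
    """
    before = Fraction(ratio_before)
    after = Fraction(ratio_after)
    # Speed-up factor of the network is before / after (4/3 for S1 -> S2).
    return float((before / after - 1) * 100)


speedup = network_speedup_percent(4, 3)
print(f"network is about {speedup:.1f} % faster")
```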

7. Conclusions

A simulation framework has been developed to model COWs by connecting the well-known RSIM simulator and the NETSIM interconnection network simulator. This new simulator allows us to evaluate more accurately the behavior of COWs under shared-memory real parallel applications, analyzing the influence on performance of several system design issues.

In particular, we have applied this tool to analyze the performance of COWs when using different routing algorithms (the original BFS up*/down* and the improved DFS up*/down*). The effect on performance of improving other design issues, such as processor issue rate, cache size and network bandwidth, has also been evaluated. The obtained results show that the use of an improved routing algorithm in the network may have more impact on overall performance than the above-mentioned improvements in hardware. Moreover, the use of an improved routing algorithm only implies modifying the way routing tables are computed, whereas the improvements in hardware have some impact on the system cost.
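To illustrate why the way routing tables are computed matters, the following is a minimal sketch of the original BFS-based up*/down* rule: the "up" end of each link is the switch closer to the root of a BFS spanning tree (ties broken by lower switch ID), and a legal route uses zero or more "up" links followed by zero or more "down" links. The topology, the function names and the choice of root are hypothetical, and the DFS-based variant is not shown.

```python
from collections import deque


def bfs_levels(adj, root):
    """Hop distance of every switch from the root of the BFS spanning tree."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def is_up(u, v, level):
    """True if traversing the link u -> v goes in the 'up' direction."""
    # The 'up' end is the switch closer to the root; ties go to the lower ID.
    return (level[v], v) < (level[u], u)


def is_legal(path, level):
    """Check the up*/down* rule: no 'up' link after a 'down' link."""
    gone_down = False
    for u, v in zip(path, path[1:]):
        if is_up(u, v, level):
            if gone_down:
                return False
        else:
            gone_down = True
    return True


# Hypothetical 4-switch ring: 0-1, 0-2, 1-3, 2-3 (not a topology from the paper).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
level = bfs_levels(adj, root=0)

print(is_legal([1, 0, 2], level))  # up then down: allowed
print(is_legal([1, 3, 2], level))  # down then up: forbidden
```

In this small example both routes from switch 1 to switch 2 are minimal, yet only the one through the root is allowed, so traffic concentrates near the root. A different assignment of link directions changes which routes are forbidden without changing the hardware.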

While these results cannot be generalized, the tool presented in this paper allows the designer to analyze the performance of the system by jointly considering the effect of several design issues.

References

[1] Alpha 21164 Microprocessor Product Brief. http://www.intl.samsungsemi.com

[2] N. J. Boden et al., Myrinet - A gigabit per second local area network, IEEE Micro, vol. 15, Feb. 1995.

[3] D. Culler and J. P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, 1999.

[4] W. J. Dally and C. L. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Trans. on Computers, vol. C-36, no. 5, pp. 547-553, May 1987.

[5] J. Duato and P. López, Performance evaluation of adaptive routing algorithms for k-ary n-cubes, in Proc. of the 1997 Parallel Computer Routing and Communication Workshop, June 1997.

[6] J. Flich, M. P. Malumbres, P. López and J. Duato, Performance evaluation of a new routing strategy for irregular networks, in Proc. 2000 Int. Conf. on Supercomputing, May 2000.

[7] D. García and W. Watson, Servernet II, in Proc. of the 1997 Parallel Computer Routing and Communication Workshop, June 1997.

[8] S. Goldschmidt, Simulation of multiprocessors: Accuracy and performance, Ph.D. Thesis, Stanford University, June 1993.

[9] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd Edition, Morgan Kaufmann Publishers, 2002.

[10] InfiniBand™ Trade Association, InfiniBand™ architecture. Specification Volume 1. Release 1.0.a. Available at http://www.infinibandta.com.

[11] D. Magdic, Limes: A multiprocessor simulation environment, in IEEE TCCA Newsletter, pp. 68-71, March 1997.

[12] J. Pal Singh et al., SPLASH: Stanford Parallel Applications for Shared-Memory Multiprocessors and Uniprocessors, in Computer Architecture News, vol. 20, no. 1, pp. 5-44, May 1992.

[13] V. S. Pai et al., RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors, in IEEE TCCA Newsletter, Oct. 1997.

[14] V. Puente, J. A. Gregorio, C. Izu, R. Beivide, and F. Vallejo, Low-level Router Design and its Impact on Supercomputer System Performance, in Proc. of the Int. Conf. on Supercomputing, June 1999.

[15] J. C. Sancho and A. Robles, Improving the Up*/Down* Routing Scheme for Networks of Workstations, in Proc. of Euro-Par 2000, Aug. 2000.

[16] J. C. Sancho, A. Robles, and J. Duato, New Methodology to Compute Deadlock-Free Routing Tables for Irregular Networks, in Proc. of 4th Workshop on Communication, Architecture and Applications for Network-based Parallel Computing, Jan. 2000.

[17] M. D. Schroeder et al., Autonet: A high-speed, self-configuring local area network using point-to-point links, SRC research report 59, DEC, Apr. 1990.

[18] R. Seifert, Gigabit Ethernet, Addison-Wesley, Apr. 1998.

[19] S. C. Woo et al., The SPLASH-2 Programs: Characterization and Methodological Considerations, in Proc. of the 22nd Int. Symp. on Computer Architecture, pp. 24-36, Jun. 1995.

[20] S. L. Scott and J. R. Goodman, The impact of pipelined channels on k-ary n-cube networks, IEEE Trans. on Parallel and Distributed Systems, vol. 5, no. 1, pp. 2-16, Jan. 1994.

[21] F. Silla and J. Duato, High-Performance Routing in Networks of Workstations with Irregular Topology, IEEE Trans. on Parallel and Distributed Systems, vol. 11, no. 7, July 2000.

[22] F. Silla, M. P. Malumbres, J. Duato, D. Dai, and D. K. Panda, Impact of Adaptivity on the Behavior of Networks of Workstations under Bursty Traffic, in Proc. of the 1998 Int. Conf. on Parallel Processing, Aug. 1998.

[23] A. S. Vaidya, A. Sivasubramaniam, and C. R. Das, Impact of Virtual Channels and Adaptive Routing on Application Performance, IEEE Trans. on Parallel and Distributed Systems, vol. 12, no. 2, Feb. 2001.
