
gram

The distribute statements shown in Figure 6.17 implement the divided range prime search program. To aid discussion they have been labelled D1, D2, D3 and D4.

The first two rules (D1, D2) in Figure 6.17 calculate the initial set of primes (the primes required to calculate the rest of the primes). The other two rules (D3, D4) divide the search range into slices, and each task finds the primes within a given slice of the search range.

Distribute statement D1 combines with the mult rule, restricting it to generating mult(M,P,I) tuples where M ≤ ⌈√Max⌉. As M is generated by multiplying P by I, the distribute statement restricts I so that P·I ≤ ⌈√Max⌉.
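The D1 bound can be checked numerically. The following is a minimal Python sketch, assuming that `//` in D1 is integer division; the helper names `ceil_sqrt` and `d1_max_index` are hypothetical and are not part of JStar.

```python
import math

def ceil_sqrt(n):
    """Exact integer ceil(sqrt(n)) for n >= 1."""
    return math.isqrt(n - 1) + 1

def d1_max_index(p, max_value):
    """Largest I permitted by D1 for multiplier P: I =< ceil(sqrt(Max)) // P."""
    return ceil_sqrt(max_value) // p

# With Max = 10,000 the initial set is bounded by 100.
limit = ceil_sqrt(10_000)
largest_i = d1_max_index(7, 10_000)

# The largest permitted multiple stays within the bound,
# and the next multiple would exceed it.
assert 7 * largest_i <= limit < 7 * (largest_i + 1)
```

Because the bound is placed on I rather than on M, a task never has to generate a multiple only to discard it.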

Likewise, distribute statement D3 restricts the mult rule, so that it only calculates mult(M,P,I) tuples where M is within the slice allocated to the current task.

Distribute statement D2 restricts num(N) so that N is in the initial set (i.e. N ≤ ⌈√Max⌉). This ensures that the primes in the initial set are generated on every node.

Distribute statement D4 ensures that the num(N) tuples are within the slice allocated to the task. This causes the program to generate the prime(N) tuple within that slice.
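The slice arithmetic used by D3 and D4 can be sketched in Python as follows. This is an illustration only: it assumes tasks are numbered from 0 and that `//` is integer division, and the function names `slice_bounds` and `mult_index_range` are hypothetical.

```python
def slice_bounds(task, num_tasks, max_value):
    """Half-open slice [start, end) for a task, following the arithmetic in D4."""
    slice_size = max_value // num_tasks + 1
    return task * slice_size, (task + 1) * slice_size

def mult_index_range(p, task, num_tasks, max_value):
    """Inclusive range of I permitted by D3 for multiplier P."""
    start, end = slice_bounds(task, num_tasks, max_value)
    return start // p, end // p + 1

# Four tasks searching up to Max = 100 get slices of size 26.
slices = [slice_bounds(t, 4, 100) for t in range(4)]
assert slices == [(0, 26), (26, 52), (52, 78), (78, 104)]

# For P = 7 on task 1, I ranges over 3..8, so M = 21..56.
assert mult_index_range(7, 1, 4, 100) == (3, 8)
```

Under these assumptions the index range in D3 slightly over-approximates the slice (M = 21 and M = 56 fall outside [26, 52) in the example), which is harmless: every multiple inside the slice is still generated.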

It might seem at first glance that distribute statements are not necessary for both the num and mult rules, as the logic which divides the range into slices is duplicated in both distribute statements. But they are necessary because of the negation in the prime rule, not(mult(N,_,_)). If the expression not(mult(N,_,_)) is true on a task, N is not necessarily a prime: the tuple mult(N,_,_) may have been calculated on a different task. To have a correct program, distribute statements D2 and D4 are required so that incorrect prime(N) tuples are not generated.
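The effect of the negation can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not the JStar semantics: a single task holds only the multiples that fall in its own slice, and the function name `primes_on_task` is hypothetical. For brevity it multiplies all integers rather than only primes, which yields the same set of composites.

```python
def primes_on_task(nums, mult_lo, mult_hi, max_value):
    """Numbers in nums for which this task holds no multiple in [mult_lo, mult_hi)."""
    local_mult = set()
    for p in range(2, max_value):
        for i in range(2, max_value // p + 1):
            m = p * i
            if mult_lo <= m < mult_hi:
                local_mult.add(m)
    # Mirrors the negation: N is reported when no mult(N,_,_) is known locally.
    return [n for n in nums if n not in local_mult]

# The task owns the slice [10, 20) of a search up to 30.
unrestricted = primes_on_task(range(2, 30), 10, 20, 30)
restricted = primes_on_task(range(10, 20), 10, 20, 30)

assert 4 in unrestricted and 21 in unrestricted   # composites wrongly reported
assert restricted == [11, 13, 17, 19]             # correct once num is restricted
```

When num is not restricted to the slice, composites such as 4 and 21 pass the negation test simply because their multiples were computed elsewhere.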

6.7.3

Load Balancing

Load balancing in this implementation is far more straightforward than in the filter pipeline. The range of numbers to be checked for primes can be divided evenly among the nodes, assuming that all compute nodes have the same performance.

If the machines had varying performance then the search space could be divided into a larger number of subranges, and faster machines could be allocated a larger number of subranges.
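One way to allocate subranges in proportion to machine speed is sketched below in Python. This scheme was not implemented in the experiments described here; the function name `allocate_subranges` and the relative speed figures are hypothetical, and the sketch assumes every subrange costs roughly the same amount of work.

```python
def allocate_subranges(num_subranges, speeds):
    """Assign subrange indices to machines roughly in proportion to relative speed."""
    allocation = [[] for _ in speeds]
    for sub in range(num_subranges):
        # Pick the machine that would finish soonest after taking one more subrange.
        target = min(
            range(len(speeds)),
            key=lambda m: (len(allocation[m]) + 1) / speeds[m],
        )
        allocation[target].append(sub)
    return allocation

# Two machines of speed 1 and one of speed 2 share eight subranges.
allocation = allocate_subranges(8, [1, 1, 2])
assert [len(a) for a in allocation] == [2, 2, 4]
```

Using more subranges than machines gives a finer granularity, so the allocation can follow the speed ratios more closely.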

6.7.4

Experiments

A set of experiments was run to measure the speedup achieved by this version of the program. The time taken to count the primes less than 10 billion was measured when using 2, 4, 8, 16 and 32 nodes.

% Statements D1 and D2 generate the set of "initial primes" (all
% primes less than sqrt(Max)) on each task.
% ---

% D1: Generate all multiples up to sqrt(Max) on each task
distribute mult(M,P,I) to T using
    max(Max), I =< ceil(sqrt(Max)) // P, curr_task(T).

% D2: Search for primes up to sqrt(Max) on each task
distribute num(N) to T using
    max(Max), N =< ceil(sqrt(Max)), curr_task(T).

% Rules D3 and D4 allocate each task a "slice" of the total range to
% be searched. Each task finds primes within the slice allocated to it.
% ---

% D3: Find multiples with M within the slice allocated to the task
distribute mult(M,P,I) to T using
    curr_task(T), num_tasks(NumT), max(Max),
    SliceSize is Max // NumT + 1,
    SliceStart is T*SliceSize, SliceEnd is (T+1) * SliceSize,
    SliceStartI is SliceStart // P, SliceEndI is (SliceEnd // P) + 1,
    I >= SliceStartI, I <= SliceEndI.

% D4: Find primes within the slice allocated to the task
distribute num(N) to T using
    curr_task(T), num_tasks(NumT), max(Max),
    SliceSize is Max // NumT + 1,
    SliceStart is T*SliceSize, SliceEnd is (T+1) * SliceSize,
    SliceStart =< N, N < SliceEnd.

// Find the initial set of primes
add 2 to initial primes
initial_primes_count = 1
i = 3
while i * i < max:
    if i is not a multiple of initial primes:
        add i to initial primes
        initial_primes_count++
    i++

// Calculate the range of numbers checked by this task
i = bottom of range checked by this task
top = top of range checked by this task

// Count primes in number range
prime_count = 0
while i < top:
    if i is not a multiple of initial primes:
        prime_count++
    i++

// Total prime counts from all tasks
if task number = 1:
    receive prime counts from all nodes
    total_prime_counts = total prime counts from all nodes
    primes = total_prime_counts + initial_primes_count
else:
    send prime_count to task 1
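The pseudocode can be exercised with a single-process Python simulation, in which a loop over task numbers stands in for the separate nodes and a sum stands in for the final message exchange. This is a sketch only. It assumes that the slices cover just the numbers above the initial set (which the final addition of initial_primes_count implies) and that max_value is at least 3; all function names are hypothetical.

```python
import math

def initial_primes(max_value):
    """Primes p with p * p < max_value, found by trial division."""
    primes = [2]
    i = 3
    while i * i < max_value:
        if all(i % p != 0 for p in primes):
            primes.append(i)
        i += 1
    return primes

def count_primes_in_range(bottom, top, primes):
    """Count n in [bottom, top) that no initial prime divides."""
    return sum(1 for n in range(bottom, top) if all(n % p != 0 for p in primes))

def count_primes(max_value, num_tasks):
    """Count the primes below max_value, simulating num_tasks tasks in sequence."""
    primes = initial_primes(max_value)          # duplicated on every task
    start = math.isqrt(max_value - 1) + 1       # first number above the initial set
    size = (max_value - start) // num_tasks + 1
    subtotals = []
    for task in range(num_tasks):
        bottom = start + task * size
        top = min(bottom + size, max_value)
        subtotals.append(count_primes_in_range(bottom, top, primes))
    # Stand-in for the collection step: add the subtotals to the initial count.
    return len(primes) + sum(subtotals)

assert count_primes(100, 4) == 25
```

The result does not depend on the number of tasks, since the slices partition the range above the initial set.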

[Plot not reproduced. Axis labels as extracted: Speedup (vertical), Nodes (horizontal). Series: Search Space Split; Search Space Split, Skip Multiples of 2.]

Figure 6.19: Runtime using the Search Range Distribution implementation.

The speedup is calculated using the single node runtime value given in Section 6.5.

6.7.5

Conclusions

The search range division implementation is able to achieve nearly linear speedup. The calculation of the initial set of primes (those less than the square root of the maximum) is duplicated across all nodes, but the amount of time required for this part of the program is very small. For example, when counting the primes up to 10 billion, the initial set of primes are the primes less than 100,000. This is only 0.001% of the total amount of work to be done. The only communication required is at the end of the program when the subtotals are collected and tallied.
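The 0.001% figure follows directly from the size of the two ranges, as the short Python check below shows. Note that it compares the count of numbers in each range, which is only a proxy for running time; the variable names are illustrative.

```python
import math

max_value = 10_000_000_000                 # 10 billion
initial_limit = math.isqrt(max_value)      # square root of the maximum
percent_of_range = 100 * initial_limit / max_value

assert initial_limit == 100_000
assert abs(percent_of_range - 0.001) < 1e-12
```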

6.8

Conclusions

From the results of the experiments presented in this chapter, it is clear that the partition range architecture is superior for the primes counting program.

• Communication costs limit the scalability of the pipeline primes program

The partition range architecture has very low communication costs compared to the pipeline filter architecture. The high communication to computation ratio of the pipeline architecture means that the scalability of the pipeline is limited. The mark and check algorithm with the pipeline architecture does not achieve any speedup over the single node version.

[Plot not reproduced. Axis labels as extracted: Speedup (vertical), Nodes (horizontal). Series: Search Space Split; Search Space Split, Skip Multiples of 2; Linear Speedup.]

Figure 6.20: Speedup using the Search Range Distribution implementation.

The priority queue algorithm with the pipeline architecture achieves close to linear speedup over the single node priority queue program when using 8 or fewer nodes. However, it does not achieve good levels of speedup beyond 8 nodes.

The partition range architecture, on the other hand, scales very well. This program is essentially embarrassingly parallel. When using the mark and check algorithm with the partition range architecture, linear speedup is obtained with 32 nodes, and excellent speedups should be possible even with very large numbers of nodes.

• The partition range architecture is easier to load balance than the pipeline architecture.

The partition range architecture is much easier to load balance as the workload can easily be divided among the filter nodes. For the pipeline architecture, finding the optimal load balancing scheme is not as easy. Earlier nodes in the pipeline have to receive and send large amounts of data. For nodes at the end of the pipeline most of the original data has been eliminated. This can create a bottleneck at the beginning of the pipeline which leaves the rest of the nodes underutilised.

in mind.

In Section 6.2 two different JStar primes programs were considered. The first had been written for an implementation of Starlog for a standalone, single core computer. However, it would not be easy to transform this program into the Divide Search Space implementation. The original program had to be modified to make the Divide Search Space implementation possible.

Chapter 7

Case Study: Conway’s Game of Life

This chapter explores some of the choices that can be made when implementing the JStar Game of Life program on a distributed computer, and the effect of those choices on the implementation’s performance.

7.1

Introduction

This section introduces the Game of Life and discusses different algorithms for computing it.