H OTRAS INFORMACIONES DE INTERÉS
NOTA ACLARATORIA AL APARTADO C.1.10.-
Our second case study illustrates the use of RPCs to implement asynchronous access to a distributed data structure. Programs 5.16 and 5.17 sketch a CC++ implementation of the parallel Fock matrix construction algorithm of Section 2.8. Recall that in this algorithm, P computation tasks must be able to read and write distributed arrays. The programs presented here achieve this capability by distributing the arrays over a set of processor objects. Computation threads, created one per processor object, operate on the distributed array by invoking RPCs implementing operations such as accumulate and read.
The distributed array itself is implemented by the class Fock and the processor object class FockNode, presented in Program 5.16. These are derived from the classes POArray and POArrayNode of Program 5.10, much as in the climate model of Program 5.11, and provide
definitions for the virtual functions create_pobj and init_pobj. The derived classes defined in
Program 5.16 are used to create the array of processor objects within which computation will occur. The data structures that implement the distributed array are allocated within these same processor objects, with each of the posize processor objects being assigned blocksize array
elements. Notice how the initialization function for FockNode allocates and initializes the elements
For brevity, Program 5.16 implements only an accumulate operation. This function is defined in
Program 5.17. Notice how it issues an RPC to a remote processor object (number
index/blocksize) requesting an operation on a specified sequence of values. The function
invoked by the RPC (accum_local) is an atomic function; this ensures that two concurrent
Having defined the classes Fock and FockNode, the implementation of the rest of the ccode is
fairly straightforward. We first create and initialize P processor objects of type FockNode, as
follows.
Fock darray(1024); // 1024 is block size darray.init(P, nodes); // P and nodes as usual
Then, we invoke on each processor object a task (fock_build) responsible for performing one
component of the Fock matrix computation. Each such task makes repeated calls to accumulate to
place the results of its computation into the distributed array.
5.13 Summary
In this chapter, we have learned about a programming language, CC++, that provides a small set of extensions to C++ for specifying concurrency, locality, communication, and mapping. CC++ does not support the task and channel abstractions of Part I directly, but its constructs can be used to build libraries that provide these abstractions. In keeping with the design methodology of Chapter 2, CC++ allows mapping decisions to be changed independently of other aspects of a design. The performance modeling techniques of Chapter 3 and the modular design techniques of Chapter 4 also apply directly. Table 5.1 summarizes the language constructs that have been introduced. The CC++ programs presented in this chapter tend to be more verbose than the equivalent Fortran M, High Performance Fortran, or Message Passing Interface programs to be presented in the chapters that follow. To a greater extent than these other systems, CC++ provides basic mechanisms that can be used to implement a variety of different parallel program structures. For example, we must implement a channel library in order to use channels in a CC++ program, and a processor object array library to create arrays of processor objects. Once these libraries have been written, however, they can be reused in many situations. This reuse is facilitated by the object- oriented features of C++.
Table 5.1: CC++ quick reference: the constructs described in this chapter, the section in which
they are described, and programs that illustrate their use.
Exercises
1. Extend Program 5.4 to allow for nonbinary trees: that is, trees with an arbitrary number of subtrees rooted at each node.
2. Design and construct a CC++ implementation of the manager/worker structure used in the parameter study problem described in Section 1.4.4.
3. Design and construct a decentralized version of the manager/worker structure developed in Exercise 2. Design and carry out experiments to determine when each version is more efficient.
4. Design and implement a program that can be used to quantify CC++ processor object and thread creation costs, both within the same processor and on remote processors. Conduct experiments to measure these costs, and obtain estimates for and .
5. Implement and instrument the channel library presented in Section 5.11, and use this code to measure CC++ communication costs on various parallel computers.
6. Modify the program developed in Exercise 5 to use spawn to implement the RPC used for
a send operation, and conduct experiments to compare the performance of the two versions.
7. Complete Program 5.13, using the channel library of Section 5.11 to perform communication.
8. Complete Program 5.14, using the channel library of Section 5.11 to perform communication.
9. Design and carry out experiments to compare the performance of the programs developed in Exercises 7 and 8.
10. Use the POArray class of Program 5.10 to implement a version of Program 5.4 in which
search tasks are implemented as threads mapped to a fixed number of processor objects. 11. Extend the program developed in Exercise 7 to provide a 2-D decomposition of principal
data structures.
12. Extend the channel library presented in Section 5.11 to allow polling for pending messages.
13. Extend the channel library presented in Section 5.11 to provide a merger that allows multiple senders on a channel.
14. Implement a hypercube communication template (see Chapter 11). Use this template to implement simple reduction, vector reduction, and broadcast algorithms.
15. Construct a CC++ implementation of the tuple space module described in Section 4.5. Use this module to implement the database search problem described in that section.
Chapter Notes
The C programming language was designed by Kernighan and Ritchie [170]. C++, which extends C in many respects, was designed by Stroustrup [92,270]. A book by Barton and Nackman [29] provides an introduction to C++ for scientists and engineers. Objective C [66] is another object- oriented extension to C.
C* [281], Data-parallel C [136], and pC++ [38] are data-parallel C-based languages (see Chapter 7). COOL [50] and Mentat [125] are examples of parallel object-oriented languages. Concurrent C [117] and Concert C [19] are parallel C-based languages; the latter supports both remote procedure call and send/receive communication mechanisms. C-Linda [48] augments C with primitives for creating processes and for reading and writing a shared tuple space (Section 4.5).
CC++ was designed by Chandy and Kesselman [53]. The monograph A Tutorial for CC++ [261] provides a tutorial and reference manual for the Caltech CC++ compiler. The sync or single- assignment variable has been used in a variety of parallel languages, notably Strand [107] and PCN [55,105].
6 Fortran M
In this chapter, we describe Fortran M (FM), a small set of extensions to Fortran for parallel programming. In FM, tasks and channels are represented explicitly by means of language constructs. Hence, algorithms designed using the techniques discussed in Part I can be translated into programs in a straightforward manner.
Because Fortran M is a simple language, we are able in this chapter to provide both a complete language description and a tutorial introduction to important programming techniques. (Some familiarity with Fortran is assumed.) In the process, we show how the language is used to implement various algorithms developed in Part I.
After studying this chapter, you should be able to write simple FM programs. You should know how to create tasks and channels, how to implement structured, unstructured, and asynchronous communication patterns, and how to control the mapping of tasks to processors. You should also know both how to guarantee deterministic execution and when it is useful to introduce nondeterministic constructs. Finally, you should understand how FM supports the development of modular programs, and know how to specify both sequential and parallel composition.
6.1 FM Introduction
Fortran M provides language constructs for creating tasks and channels and for sending and receiving messages. It ensures that programs are deterministic if specialized constructs are not used, and it provides the encapsulation properties needed for modular programming. Its mapping constructs affect performance but not correctness, thereby allowing mapping decisions to be modified without changing other aspects of a design. These features make it particularly easy to translate algorithms designed using the techniques of Part I into executable FM programs.
FM is a small set of extensions to Fortran. Thus, any valid Fortran program is also a valid FM program. (There is one exception to this rule: the keyword COMMON must be renamed to PROCESS COMMON. However, FM compilers usually provide a flag that causes this renaming to be performed
automatically.) The extensions are modeled whenever possible on existing Fortran concepts. Hence, tasks are defined in the same way as subroutines, communication operations have a syntax similar to Fortran I/O statements, and mapping is specified with respect to processor arrays. The FM extensions are summarized in the following; detailed descriptions are provided in subsequent sections. In this chapter, FM extensions (and defined parameters) are typeset in UPPER CASE, and other program components in lower case.
1. A task is implemented as a process. A process definition has the same syntax as a subroutine, except that the keyword PROCESS is substituted for the keyword subroutine.
Process common data are global to any subroutines called by that process but are not shared with other processes.
2. Single-producer, single-consumer channels and multiple-producer, single-consumer mergers are created with the executable statements CHANNEL and MERGER, respectively.
These statements take new datatypes, called inports and outports, as arguments and define them to be references to the newly created communication structure.
3. Processes are created in process blocks and process do-loops, and can be passed inports and outports as arguments.
4. Statements are provided to SEND messages on outports, to RECEIVE messages on inports,
and to close an outport (ENDCHANNEL). Messages can include port variables, thereby
allowing a process to transfer to another process the right to send or receive messages on a channel or merger.
5. Mapping constructs can be used to specify that a program executes in a virtual processor array of specified size and shape, to locate a process within this processor array, or to specify that a process is to execute in a subset of this processor array.
6. For convenience, processes can be passed ordinary variables as arguments, as well as ports; these variables are copied on call and return, so as to ensure deterministic execution. Copying can be suppressed to improve performance.
The FM language design emphasizes support for modularity. Program components can be combined using sequential, parallel, or concurrent composition, as described in Chapter 4. In parallel and concurrent composition, the process serves as the basic building block. A process can encapsulate data, computation, concurrency, and communication; the ports and other variables passed as arguments define its interface to the rest of the program. The techniques used to implement sequential composition will be discussed in Section 6.9.
FM extensions can be defined for both the simpler and more established Fortran 77 and the more advanced Fortran 90. For the most part, we use Fortran 77 constructs in this chapter, except when Fortran 90 constructs are significantly more concise.
Example . Bridge Construction:
Program 6.1 illustrates many FM features. This is an implementation of the bridge construction algorithm developed in Example 1.1. The program creates two processes, foundry and bridge,