
A DSM developer needs to take cognisance of an often-quoted reason, aside from poor performance, why DSM failed to gain acceptance among the parallel programming community: programmability. Every DSM implemented its own non-standard Application Programming Interface (API). Such an API might be simple and intuitive, like that presented by Treadmarks, but it is still yet another non-standard interface. Educating a group of application developers in a new API could prove costly for something that could fall out of fashion. It should be noted that another recent effort exists, UPC [27], which defines a shared memory programming model that provides for architectures with different levels of memory consistency, but as it has yet to gain acceptance it is not covered here. Another, Global Arrays (GA) [123], offers functionality for accessing shared arrays and assumes a NUMA-like platform, but has been constructed on top of a message-passing library. It has similar semantics to the put/get operations in MPI-2 outlined below. The philosophy that will govern the API design of the SMG DSM will ultimately be to support a standard, either directly or indirectly. In this section we examine some of the standards that provide some direction. Each programming model imposes a certain level of burden on the programmer. This is highlighted in the examination of the OpenMP programming API, the MPI-2 one-sided communication routines, and also an implementation of OpenMP for distributed memory machines. An API such as pthreads is not considered due to its lack of a defined memory model.

5.3.1 OpenMP

The OpenMP standard [4] has as its main focus the parallelisation of structured loops in application code. It employs a fork-join model of parallel execution, which is particularly suited to applications involving large iterative operations on array-based structures. Parallel code areas are annotated using compiler directives. In the C version of the specification (C++ and FORTRAN versions also exist) an OpenMP directive has the following form:

#pragma omp directive [clause,...] newline

An important motivation for using OpenMP is its ability to simultaneously support both serial and parallel variants of a program through the use of these #pragma omp compiler directives, which define where parallelisation is required. Parallelisation can therefore be turned on or off at compile time, and can also be introduced throughout the application in an incremental fashion. Directives can be parallel, work-sharing, or synchronisation in nature and are discussed in the following sections. Also explored are the APIs and memory consistency required by OpenMP. A more substantive definition of the API is given in Appendix D. Profiling libraries may be implemented using the profiling interface definition, POMP. Other work has combined the data obtained using this interface with static analysis to obtain speedups [124], and has used it in a dynamic context for run-time optimisation [125, 126].

THE USER API: CASE STUDIES 69

Work-Sharing Directives

The parallel directive essentially directs the initial user thread to create a team of threads. The parallel directive can also be paired with a for directive, which simultaneously creates the team of threads and apportions each a portion of the associated loop workload (see Listing 5.2 below: each thread in the team, of size N, will be allocated approximately X/N iterations of the for workload). Allocation strategies may be selected by specifying a scheduling clause (schedule; see Appendix D). The sections directive complements the for directive, allowing parallelism to be functionally decomposed, and thereby enabling a number of discrete section constructs to be divided among the threads and executed concurrently.

In the C programming language the for loop is parallelised by dividing the work among the team of threads of execution according to some function (specified by the clause). In the default case each thread is assigned a static chunk (determined by the number of iterations to be performed and the number of threads in the team). Barriers are used implicitly at the start and end of the snippet, where any thread will wait at the end of the structured code block until all threads have arrived, except, for example, where a nowait clause has been declared. In order for concurrency to be allowed inside this parallel section, the shared memory regions must be concurrently writable by multiple writers, i.e. a multiple writer protocol must be supported.

#pragma omp parallel for            /* Begin parallel section */
for (k = 0; k < X; k++) {
    subtotal += a[k];
}                                   /* End parallel section */

#pragma omp parallel sections       /* Begin parallel section */
{
    #pragma omp section             /* Methods foo & bar        */
    { foo(); }                      /* executed concurrently by */
    #pragma omp section             /* different threads        */
    { bar(); }
}                                   /* End parallel section */

Listing 5.2: OpenMP Work Sharing Directives

API calls & Environment variables

The OpenMP specification also includes API calls that enable the programmer to query and set the values of OpenMP environment variables, and to use lock synchronisation routines and timer routines. The most notable of the environmental routines enable the dynamic setting of the default number of threads in an OpenMP team (this value can also be specified by a clause to the parallel directive).


Mutual Exclusion Directives

Mutual exclusion is supported through the use of synchronisation directives (highlighted in Listing 5.3), and additionally through primitives with associated API routines (see next section). The barrier and flush operations, referred to above, have explicit directives. The other mutual exclusion directives that ensure structured access to shared data are as follows.

/* Only the master thread will execute code */
#pragma omp master
{ ... }

/* Only one thread will execute the code */
#pragma omp single
{ ... }

/* All threads will execute the code, but only one at a time */
#pragma omp critical
{ ... }

/* Atomic update to a shared variable */
#pragma omp atomic

Listing 5.3: Format of OMP Mutual Exclusion directives

These directives must be nested within a parallel (or variant) directive. The first two are functionally equivalent in that they allow only one thread to execute a block of code. In the master case the master thread will always execute the block, while in general with single the first thread to reach the block will execute it. The critical directive allows the code block to be executed by all the threads in the team, but ensures that only one will do so at a time. The atomic directive is a restricted version of critical, allowing a single statement, such as a variable increment (e.g. i++), to be executed atomically.

OpenMP Memory consistency model

In the early OpenMP standards no reference was made to the memory consistency model needed for an OpenMP application to be correct. Version 2.5 of the OpenMP standard somewhat rectified this by specifying that the memory consistency model required is a relaxed consistency model, similar to weak ordering as described in [127]. Various data scoping attribute clauses can also be supplied (see Appendix D, page 230). All shared memory references must be performed with respect to OpenMP flush directives. While an explicit flush directive exists, flushes are also implicit with regard to the work-sharing directives. The following requirements need to be met with respect to flush operations [128]:

• If the intersection of the flush-sets of two flushes performed by two different threads is non-empty then the two flushes must be completed as if in some sequential order, seen by all threads.

• If the intersection of the flush-sets of two flushes performed by one thread is non-empty, then the two flushes must appear to be completed in that thread’s program order.

• If the intersection of the flush-sets of two flushes is empty, the threads can observe these flushes in any order.

In relation to this thesis, the specification of a relaxed consistency model was very encouraging, as it strengthened the case for SMG to be designed as a potential target for an OpenMP compiler.

5.3.2 Cluster OpenMP

Cluster OpenMP is a recent optional module of the Intel compiler family (C, C++ & FORTRAN) [129] that allows applications written using the OpenMP interface to execute transparently across distributed memory machines. Although previous efforts strove to provide such functionality, this is the first to be provided commercially, and it is fully supported in terms of commercial-grade documentation and support tools (compiler, thread checker, thread profiler). This distributed OpenMP is built upon a modified version of the Treadmarks DSM (see Section 4.8.4), rectifying many of the inherent deficits listed there: it supports a larger number of user processes, multiple threads per process, larger quantities of sharable data, and increased support for modern interconnects.

In addition to the extra functionality, some deviations from the OpenMP standard have been made. The most noticeable departure is from the memory model, whereby, by default, not all variables are shared among all threads of execution. Shared variables are explicitly declared using a new sharable directive, which is an Intel extension to the OpenMP standard. Some compiler support exists (the -clomp-sharable-propagation compiler option) for identifying variables that need to be declared using this directive [130]. The minimum granularity at which memory consistency is guaranteed is four bytes.

As the mmap system call is used internally by Cluster OpenMP, use of this function by user code should be treated with caution; alternative functions are provided for use. OpenMP lock variables (omp_lock_t) must also be explicitly allocated, and dynamic memory must be allocated and freed using API calls instead of the standard library functions (malloc/free). Nested parallelism, i.e. a parallel directive within the scope of another, is not supported.

Like Treadmarks, remote process start-up is done using the basic remote shell (rsh), or the secure variant ssh. Communication is still via sockets, but the DSM has improved support for multithreading, in that the number of DSM system threads (termed bottom-half threads) is proportional to the number of user application threads.


5.3.3 MPI-2 Shared Memory

The remote memory access functions included in version 2 of the MPI interface enable one-sided communication, while at the same time not requiring a uniform shared address space. This remote memory access, although somewhat removing the communicating pair requirement, is still explicit in nature. The MPI-2 functions MPI_Get, MPI_Put, and MPI_Accumulate allow, respectively, for the initiation of one-sided remote memory reads, writes, and combined update operations. However, before these can be used, constructions known as 'memory windows' need to be established using MPI_Win_create. A window specifies a contiguous address range (memory address and size) where those memory operations are mapped into the local address space.

The above one-sided operations are non-blocking, so the points at which these remote memory actions are initiated and ultimately become visible are denoted using synchronisation operations. In the active mode (i.e. a collective operation, so all processes are involved) this is the 'fence' mechanism, MPI_Win_fence, which is analogous to a barrier. If multiple processes perform a put to the same window location, the result is undefined. The passive routines (i.e. where only one caller is involved), MPI_Win_lock and MPI_Win_unlock, are available to provide multiple-reader/single-writer functionality, but have yet to be implemented in open-source MPI-2 implementations.

The data being transferred is well defined using the MPI type routines, so when completely implemented, this functionality will allow for data access in heterogeneous systems. The main drawbacks of the MPI-2 RMA operations are the requirements for the user to explicitly specify the read and write routines and to synchronise access. Other work has concluded that MPI-2 does not provide an adequate compilation target for global address space languages [131] or parallelising compilers [85]. Additionally, there is a lack of support for fault tolerance, since the error handlers only allow for the cleanup of the process and not adaptation to the loss of a process. Where some implementations include such support, they do not totally adhere to the MPI standards.

MPI_Win_create    // Create a memory access window
MPI_Win_free      // Free a window
MPI_Put           // Write operation on remote memory
MPI_Get           // Read operation on remote memory
MPI_Accumulate    // Perform operation while performing put
MPI_Win_fence     // Collective, barrier-like operation
MPI_Win_lock      // Lock (read/write) the memory access window
MPI_Win_unlock    // Unlock (read/write) the window
MPI_Win_start     // Start an interval to the memory window
MPI_Win_complete  // Complete a memory access interval
MPI_Win_post      // Start an interval locally
MPI_Win_wait      // Signal completion of interval locally


The code example given in Listing 5.5 demonstrates the one-sided communication abilities provided by the MPI-2 standard.

MPI_Alloc_mem(sizeof(int) * size, MPI_INFO_NULL, &a);
MPI_Alloc_mem(sizeof(int) * size, MPI_INFO_NULL, &b);
/* The window length is given in bytes; the displacement unit is one int */
MPI_Win_create(a, sizeof(int) * size, sizeof(int), MPI_INFO_NULL,
               MPI_COMM_WORLD, &win);

for (i = 0; i < size; i++) {
    a[i] = rank * 100 + i;
}
printf("Process %d has the following:", rank);
for (i = 0; i < size; i++) {
    printf(" %d", a[i]);
}
printf("\n");

/* Open the access interval. MPI_MODE_NOPUT is not asserted here,
   because the else branch below issues puts to the window. */
MPI_Win_fence(MPI_MODE_NOPRECEDE, win);

if (op == GET) {
    for (i = 0; i < size; i++) {
        /* Read element `rank` of process i's window into b[i] */
        MPI_Get(&b[i], 1, MPI_INT, i, rank, 1, MPI_INT, win);
    }
} else {
    for (i = 0; i < size; i++) {
        /* Write a[i] to element `rank` of process i's window */
        MPI_Put(&a[i], 1, MPI_INT, i, rank, 1, MPI_INT, win);
    }
}

/* Close the interval: all gets and puts are complete on return */
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);

printf("Process %d obtained the following:", rank);
for (i = 0; i < size; i++) {
    printf(" %d", b[i]);
}
printf("\n");

MPI_Win_free(&win);
MPI_Free_mem(a);
MPI_Free_mem(b);

Chapter 6: Shared Memory for Grids (SMG)

The explorations of this thesis were conducted by implementing a DSM called SMG (Shared Memory for Grids). One of the primary goals of this thesis is the exploration of facilities that allow existing parallel applications to execute efficiently on a grid with few, if any, modifications. To achieve this, proper attention needs to be paid to the requirements of existing parallel programming standards. As OpenMP is the current de facto standard for parallel application development on shared memory architectures, it is appropriate to ultimately design the DSM to be a target of an OpenMP source-to-source compiler. Some of the OpenMP parallel constructs, covered briefly in Section 5.3.1, will be further examined in order to ascertain such requirements. This will form the starting point for the implementation of the DSM topics that were discussed in Chapter 4.

This chapter presents an overview of the internals of the base SMG system, such as how the DSM interacts with user application threads, how communication between processes is effectively managed, and the start-up and shutdown stages of an SMG DSM application. This permits both application-level performance optimisation as well as algorithm implementation and problem tracing, and is crucial to facilitating higher performance on a Grid. Chapters 7 and 8 will deal with the relevant aspects of integrating memory management and synchronisation. SMG allows applications to be monitored either by the application itself, by another process, or by the user. Chapter 9 deals with the implementation of the libraries to access information and monitoring systems.

The steps involved in the compilation and execution of a simple Helloworld SMG application are also covered. The chapter concludes by describing some of the system implementation issues that were encountered.


6.1 DSM Requirements

The system must present the programmer with an easy-to-use and intuitive Application Programming Interface (API) so that the additional burden in the construction of a DSM application is minimal. This requires that the semantics be as close to those of normal shared memory programming as possible. The SMG project aims to borrow from the successes and learn from the mistakes of previous DSM implementations. When designing a DSM system, the main decision is the extent to which the user will be responsible for keeping the shared memory consistent. In general, less burden on the programmer results in more work to be performed at the DSM management layer. Ultimately this may result in performance deterioration that will limit the overall scalability of the system. This is further exacerbated when there is little or no hardware support available to the DSM. There are successful implementations of hardware support for distributed shared memory using off-the-shelf components, employing interconnects such as SCI [20] and Myrinet [21], but one of the aims of this research was to avoid the need for specialised hardware support.

Apart from being a popular research topic, DSM has otherwise been a failure. A number of reasons have been suggested [92], but one significant factor is the poor take-up outside research labs due to the lack of any open standards in the area. Any new DSM implementation is usually accompanied by a new programmer's API (some are given in Appendix C), and requires a developer to learn a new set of programming semantics, with the additional drawback that the application is not portable. Parallel application developers are therefore loath to adopt any new non-standardised API. For a DSM to gain acceptance, compliance with an open standard is necessary to encourage use. It is with this in mind that the DSM must be designed to be potentially a target of a parallelising compiler, such as an OpenMP compiler, thereby allowing the use of existing parallel code with support for future application development. In Section 5.3.1 some OpenMP directives were examined. One can identify some of the design requirements of the DSM if it is to form the target of a parallelising compiler.

To implement these directives, a compiler targeting a DSM system only requires a consistency model that ensures shared memory areas are consistent after a synchronisation operation has occurred. In the parallelised for example, shared memory sections are required to be consistent at the entrance and exit of the section. This allows any of the more relaxed consistency models to be used, as there is a close affinity with synchronisation primitives and shared memory can be explicitly declared as such using the OpenMP shared clause.

Where a developer targets the DSM, its primary functions are then used to act as proxy between SMG processes and provide the developer with a transparent method of access-
