ING CESAR ALBERTO SAUCEDO ALONSO RUBRICA.



Just as different algorithms can implement the same functionality, different implementations of an algorithm can target different computing architectures. Programmers can implement a CE to run on a CPU core or on a GPU thread.

Integrating OpenCL as a possible way to implement CEs raises two issues: the actual implementation of the CE as an OpenCL kernel and the abstraction of the OpenCL platform management. Some state-of-the-art projects, like Aparapi [3] and Sumatra [80], try to hide both issues from the programmer by automatically generating the kernel code from sequential Java code; mainly, they try to parallelize the outermost loop of the code and invoke the kernel with as many work-items as the loop has iterations. The solution proposed in this dissertation does not go as far as these projects, and programmers still provide the code of the kernels. Since OpenCL kernels are written in a C99-based language that the Java compiler cannot process, programmers attach the kernel code as resources of the Java application.
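Attaching a kernel as a resource means that its source text has to be read back at run time. The following is a minimal sketch of how that could look in plain Java; the class `KernelSourceLoader` and its methods are hypothetical names used for illustration only and are not part of COMPSs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a COMPSs class): reads OpenCL source shipped as a resource.
final class KernelSourceLoader {

    private KernelSourceLoader() {
    }

    // Reads the whole stream as UTF-8 text; the caller remains responsible for closing it.
    static String read(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks the resource up on the classpath (for instance "matmul.cl") and returns its text.
    static String fromClasspath(String resourceName) {
        ClassLoader loader = KernelSourceLoader.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IllegalArgumentException("Kernel resource not found: " + resourceName);
            }
            return read(in);
        } catch (IOException e) {
            // Only close() can raise this here; read() already wraps its own failures.
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned string is what an OpenCL binding would later hand to the platform for compilation; that step is left out because it depends on the binding in use.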

6.3. PROGRAMMING MODEL EXTENSION

method declaration in the CEI with the @OpenCL directive. In this case, instead of pointing out the class implementing the method, programmers indicate the name of the resource containing the OpenCL code of the kernel. So that the runtime can automatically determine the number of work-items running a kernel, the developer has to specify, as an attribute of the @OpenCL annotation, the global_work_size to use when submitting the command that executes the kernel. However, the actual value of this attribute may depend on the input values or their size. For that purpose, COMPSs allows simple algebraic expressions that use the values and lengths of the parameters as variables.

To refer to a parameter, the developer uses the reserved word par followed by the index of the parameter. For instance, the developer points to the first parameter of the call using par0; for the third one, par2. If the parameter is a number, the expression can use its value; if the parameter is an array, the expression can use the value of one of its positions or its length. For multi-dimensional arrays, developers can refer to the length of any dimension using the reserved names x, y and z, which identify the first, second and third dimensions of the array, respectively. For instance, the term par0.x refers to the length of the first dimension of the first parameter of the call, and par2.y refers to the length of the second dimension of the third parameter.
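To make the expression mechanism concrete, the sketch below resolves a comma-separated specification such as par0.x,par1.y against the actual arguments of a call. It is an illustration under assumptions, not the COMPSs implementation: the class name `WorkSizeExpressions` is hypothetical, the supported operators (+, -, *, / and parentheses) are a guess at what "simple algebraic expressions" covers, and references to individual array positions are left out because the text does not give their syntax.

```java
import java.lang.reflect.Array;

// Hypothetical resolver (not a COMPSs class) for terms like par0, par0.x, par2.y.
final class WorkSizeExpressions {

    private final String text;
    private final Object[] params;
    private int pos = 0;

    private WorkSizeExpressions(String text, Object[] params) {
        this.text = text.replace(" ", "");
        this.params = params;
    }

    // Evaluates one expression per dimension, e.g. "par0.x,par1.y" yields two sizes.
    static long[] resolve(String spec, Object... params) {
        String[] parts = spec.split(",");
        long[] sizes = new long[parts.length];
        for (int i = 0; i < parts.length; i++) {
            WorkSizeExpressions parser = new WorkSizeExpressions(parts[i], params);
            sizes[i] = parser.sum();
            if (parser.pos != parser.text.length()) {
                throw parser.error("unexpected input");
            }
        }
        return sizes;
    }

    // sum := product (('+' | '-') product)*
    private long sum() {
        long value = product();
        while (at('+') || at('-')) {
            char op = text.charAt(pos++);
            long rhs = product();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }

    // product := atom (('*' | '/') atom)*
    private long product() {
        long value = atom();
        while (at('*') || at('/')) {
            char op = text.charAt(pos++);
            long rhs = atom();
            value = (op == '*') ? value * rhs : value / rhs;
        }
        return value;
    }

    // atom := '(' sum ')' | 'par' index ['.' ('x' | 'y' | 'z')] | integer
    private long atom() {
        if (at('(')) {
            pos++;
            long value = sum();
            if (!at(')')) {
                throw error("')' expected");
            }
            pos++;
            return value;
        }
        if (text.startsWith("par", pos)) {
            pos += 3;
            Object param = params[(int) number()];
            if (at('.')) {
                pos++;
                int dim = (pos < text.length()) ? "xyz".indexOf(text.charAt(pos)) : -1;
                if (dim < 0) {
                    throw error("x, y or z expected");
                }
                pos++;
                return dimensionLength(param, dim);
            }
            if (!(param instanceof Number)) {
                throw error("numeric parameter expected");
            }
            return ((Number) param).longValue();
        }
        return number();
    }

    private long number() {
        int start = pos;
        while (pos < text.length() && Character.isDigit(text.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw error("number expected");
        }
        return Long.parseLong(text.substring(start, pos));
    }

    private boolean at(char c) {
        return pos < text.length() && text.charAt(pos) == c;
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at position " + pos + " of \"" + text + "\"");
    }

    // Length of dimension dim (0 = x, 1 = y, 2 = z), following the first element of each level.
    private static long dimensionLength(Object array, int dim) {
        Object current = array;
        for (int d = 0; d < dim; d++) {
            current = Array.get(current, 0);
        }
        return Array.getLength(current);
    }
}
```

For two matrices declared as `new int[3][5]` and `new int[5][7]`, resolving par0.x,par1.y with this sketch gives the sizes 3 and 7. The sketch reads the length of a nested dimension from the first row only, so it assumes rectangular arrays.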

Besides the global_work_size, developers can also define values for global_work_offset and local_work_size. Both attributes are optional; if the programmer does not specify a value for them, COMPSs forwards the decision to OpenCL. For global_work_offset, it applies no offset and sets the value to (0, 0, ..., 0); for local_work_size, it delegates the decomposition into work-groups to the library by passing a NULL value.
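The defaulting rule for the two optional attributes can be sketched as a small value class. The class `NDRange` and its field names are hypothetical and chosen for illustration; only the defaults themselves (an all-zero offset and a null local size) come from the description above.

```java
import java.util.Objects;

// Hypothetical holder (not a COMPSs class) for the three launch attributes.
final class NDRange {

    final long[] globalWorkSize;
    final long[] globalWorkOffset;
    final long[] localWorkSize; // null: let the OpenCL library choose the work-group size

    private NDRange(long[] global, long[] offset, long[] local) {
        this.globalWorkSize = global;
        this.globalWorkOffset = offset;
        this.localWorkSize = local;
    }

    // offset and local may be null when the annotation leaves them unspecified.
    static NDRange of(long[] global, long[] offset, long[] local) {
        Objects.requireNonNull(global, "global_work_size is mandatory");
        // A fresh long[] is zero-filled, which gives the (0, 0, ..., 0) default offset.
        long[] effectiveOffset = (offset != null) ? offset.clone() : new long[global.length];
        if (effectiveOffset.length != global.length) {
            throw new IllegalArgumentException("global_work_offset must match the work dimensions");
        }
        if (local != null && local.length != global.length) {
            throw new IllegalArgumentException("local_work_size must match the work dimensions");
        }
        return new NDRange(global.clone(), effectiveOffset, (local != null) ? local.clone() : null);
    }
}
```

Keeping the local size as null, rather than substituting a computed value, mirrors the idea of delegating the work-group decomposition to the library.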

Another important characteristic of OpenCL is that kernels do not return values. To avoid constraining the usage of OpenCL to CEs returning nothing, COMPSs assumes that the return value, if any, is the last parameter of the kernel; therefore, kernels implementing a CE with a return value have one more parameter than their Java method version. As opposed to regular methods, where the return value is created within the method code, the memory space for the return value of an OpenCL implementation needs to be allocated prior to the invocation of the kernel. The runtime has to manage the allocation of result values automatically when it decides to run an OpenCL kernel. Again, the amount of memory to allocate depends on each CE and, likely, on the input values; therefore, programmers need to specify the number of elements within each dimension of the return value with an algebraic expression as the resultSize attribute of the annotation. The actual number of bytes is inferred from the return type of the declaration.
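The byte-count inference can be illustrated as follows. This is a sketch under assumptions: `ResultBuffers` is a hypothetical class, the element sizes are those of Java primitives, and allocating a direct `ByteBuffer` merely stands in for whatever allocation the runtime actually performs.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper (not a COMPSs class): sizes the result space from type and dimensions.
final class ResultBuffers {

    private ResultBuffers() {
    }

    // Size in bytes of the innermost element type, e.g. 4 for int[][].
    static int elementBytes(Class<?> returnType) {
        Class<?> type = returnType;
        while (type.isArray()) {
            type = type.getComponentType();
        }
        if (type == byte.class) return Byte.BYTES;
        if (type == short.class) return Short.BYTES;
        if (type == char.class) return Character.BYTES;
        if (type == int.class) return Integer.BYTES;
        if (type == float.class) return Float.BYTES;
        if (type == long.class) return Long.BYTES;
        if (type == double.class) return Double.BYTES;
        throw new IllegalArgumentException("Unsupported element type: " + type);
    }

    // Product of the resolved resultSize dimensions times the element size.
    static long byteCount(Class<?> returnType, long... dims) {
        long elements = 1;
        for (long dim : dims) {
            elements = Math.multiplyExact(elements, dim);
        }
        return Math.multiplyExact(elements, (long) elementBytes(returnType));
    }

    // Allocates the space before the kernel runs; the kernel would receive it as its last parameter.
    static ByteBuffer allocate(Class<?> returnType, long... dims) {
        int bytes = Math.toIntExact(byteCount(returnType, dims));
        return ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
    }
}
```

For a return type of `int[][]` with resolved dimensions 3 and 7, this sketch computes 3 × 7 × 4 = 84 bytes.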

Figure 6.1 depicts an example of a COMPSs application performing a matrix multiplication. The actual computation is encapsulated within a CE, multiply, implemented both as a regular method and as an OpenCL kernel. As mentioned above, kernels have no return value; therefore, the OpenCL implementation of the CE has a third parameter corresponding to the return value of the Java implementation.

    package es.bsc.compss.matmul;

    public class Matmul {

        public static void main(String[] args) {
            int[][] A;
            int[][] B;
            int[][] C;
            ...
            C = multiply(A, B);
            ...
        }

        public static int[][] multiply(int[][] A, int[][] B) {
            // Matrix multiplication code
            // C = AB
            ...
            return C;
        }
    }

(a) Application Java code

    __kernel void multiply(
            __global const int *a,
            __global const int *b,
            __global int *c) {
        // Matrix multiplication code
        // C = AB
        ...
    }

(b) OpenCL code in matmul.cl

    public interface CEI {

        @OpenCL(kernel="matmul.cl", globalWorkSize="par0.x,par1.y", resultSize="par0.x,par1.y")
        @Method(declaringClass="es.bsc.compss.matmul.Matmul")
        int[][] multiply(
            @Parameter(direction = IN) int[][] a,
            @Parameter(direction = IN) int[][] b
        );
    }

Figure 6.1: Example of a matrix multiplication with two implementations: one in OpenCL and one as a regular method. The code of the kernel is in the matmul.cl resource, and it has to be executed by as many work-items as the number of rows in matrix a times the number of columns in matrix b. The result of the method is a two-dimensional matrix with as many rows as matrix a and as many columns as matrix b.
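The figure elides the kernel body and does not show how the kernel learns the matrix dimensions. As a rough illustration of the decomposition the caption describes, with one work-item per cell of the result, the following plain-Java emulation may help. It rests on assumptions that the source does not state: matrices are flattened in row-major order, the dimensions are passed explicitly, and the names `MatmulNdRangeEmulation`, `workItem`, `inner` and `cols` are hypothetical.

```java
// Hypothetical emulation in plain Java; it is not the elided kernel from matmul.cl.
final class MatmulNdRangeEmulation {

    private MatmulNdRangeEmulation() {
    }

    // What a single work-item with global ids (row, col) would compute: one cell of C.
    static void workItem(int row, int col, int[] a, int[] b, int[] c, int inner, int cols) {
        int acc = 0;
        for (int k = 0; k < inner; k++) {
            // Row-major flattening (assumption): a is rows x inner, b is inner x cols.
            acc += a[row * inner + k] * b[k * cols + col];
        }
        c[row * cols + col] = acc;
    }

    // Sequentially visits the rows x cols index space that a 2-D launch would cover.
    static int[] multiply(int[] a, int[] b, int rows, int inner, int cols) {
        int[] c = new int[rows * cols]; // result space allocated before any work-item runs
        for (int row = 0; row < rows; row++) {
            for (int col = 0; col < cols; col++) {
                workItem(row, col, a, b, c, inner, cols);
            }
        }
        return c;
    }
}
```

Each call to `workItem` writes a distinct cell of the result, which is why the iterations are independent and can be mapped onto separate work-items.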
