Based on the requirements of the file format identified in Section 3.1.3, we consider each of these points individually. For maximum compatibility across systems and inter-operability, we assume the input file is simply a plain text file, with each new line defining the next input parameter to be processed. Each line of this input file is read, and parsed by a Backus-Naur Form (BNF) based syntax parser, and loaded into the model. Please note for brevity we omit the BNF definitions for standard text forms such as ‘character’, ‘integer’ and ‘line terminator’.
5x5
prev: 2
char 1x5 A,B,C,D,E
char 1x5 C,D,E,F,G
mem: 0
maintain: 1
Fig. 3.14 A complete input file for a longest common subsequence problem, with two input strings 5 characters in length, stored in constant memory on the GPU, maintaining the entire scoring grid
Allowing the user to define the size of the scoring grid is a case of having it defined in the input file, as two integers which represent the dimensions n and m. When the file is read, the model is then aware of how much memory to allocate on the host to maintain the entire scoring grid. From these dimensions the algorithm can also calculate the size of the longest diagonal, and therefore the width of the grid that is to be maintained on the GPU. The input file should contain the two integers on a single line, separated by the character ‘x’, allowing the line to be captured by the respective BNF tokens <dimension> ::= <n> "x" <m> <EOL>, <n> ::= <integer> and <m> ::= <integer>.
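The handling of the dimension line can be sketched as follows. This is a minimal Python illustration rather than the parser used by the model: the helper names parse_dimension and longest_diagonal are hypothetical, and we assume the wavefront runs along anti-diagonals, so that the longest diagonal of an n by m grid holds min(n, m) cells.

```python
import re

# Grammar from the text: <dimension> ::= <n> "x" <m> <EOL>
DIMENSION_RE = re.compile(r"^(\d+)x(\d+)$")


def parse_dimension(line: str) -> tuple[int, int]:
    """Parse a line such as '5x5' into the grid dimensions (n, m)."""
    match = DIMENSION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a dimension line: {line!r}")
    return int(match.group(1)), int(match.group(2))


def longest_diagonal(n: int, m: int) -> int:
    """Number of cells on the longest anti-diagonal of an n x m grid (assumption)."""
    return min(n, m)


n, m = parse_dimension("5x5")
host_cells = n * m                   # cells needed for the full grid on the host
gpu_width = longest_diagonal(n, m)   # width of the smaller grid kept on the GPU
```

Rejecting malformed lines with an exception is a design choice of this sketch; the text does not say how the model reports syntax errors.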
The second value that is defined in the input file is the number of previous diagonals to store in memory on the GPU. This is defined by the user based on the dependencies that are present in the dynamic programming definition. In terms of the memory that is stored on the GPU during execution, this value represents the height of the grid the GPU is maintaining in global memory. As introduced in Sec. 3.1.2, once rows are rotated past the top of this grid, they are asynchronously transferred back to the larger grid being maintained in the host memory, should the user require it. This is represented in the input file by an integer prefaced by the characters ‘prev:’, allowing it to be captured by the BNF token <previous> ::= "prev:" <integer>.
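To make the rotation concrete, the following Python sketch parses the prev line and mimics the behaviour of the small grid with a queue. It is an illustration only: parse_previous and rotate are hypothetical names, the host transfer is replaced by a plain list append, and we assume the height of the GPU grid is exactly the prev value.

```python
from collections import deque


def parse_previous(line: str) -> int:
    """Parse a line such as 'prev: 2' (<previous> ::= "prev:" <integer>)."""
    key, sep, value = line.strip().partition(":")
    if key != "prev" or not sep:
        raise ValueError(f"not a prev line: {line!r}")
    return int(value)


def rotate(gpu_rows: deque, new_row: list, host_grid: list, height: int) -> None:
    """Add a freshly computed diagonal; spill the oldest one to the host grid."""
    if len(gpu_rows) == height:
        # Stands in for the asynchronous device-to-host transfer.
        host_grid.append(gpu_rows.popleft())
    gpu_rows.append(new_row)


height = parse_previous("prev: 2")
gpu_rows, host_grid = deque(), []
for diagonal in ([1], [2, 3], [4, 5, 6], [7, 8]):
    rotate(gpu_rows, diagonal, host_grid, height)
```

After the loop the GPU-side queue holds only the two most recent diagonals, while the older ones have been moved to the host list in the order they were computed.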
Now that the dimensions of the scoring grid to be maintained in both system and GPU memory are defined, the user needs a way of passing in the input data which the algorithm is to process. Each piece of complete input data must be on a single line of the input file. Each such line first defines the type of this specific piece of data: it must start with a small string token identifying the type, such as ‘char’, which represents input data in the format of characters, ‘int’, which denotes that the input data is a sequence of integers, and so on. Following this, the user needs to define the dimension of the input data. Finally, the user provides a string of comma separated values which form the actual test data, with escape characters denoting where each row of the test data matrix ends, should it be 2D. The user can add as many lines of input data as they wish to the file, which will be captured by the tokens:
• <input> ::= <type> <dimension> <input-data>
• <input-data> ::= <data> <EOL> | <data> <data>
• <data> ::= <data-row> <EOL> | <data-row> "&" <data-row>
• <data-row> ::= <comma-sep-values>
• <comma-sep-values> ::= (omitted for brevity)
For example, to represent a string for the longest common subsequence in the input file, a user would write char 1x5 A,B,C,D,E. After defining the dimensions of the scoring grid, and the number of diagonals to maintain, the user can define as many lines of input test data as they wish, all of which will be accessible from the model.
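A single input data line could be parsed along the following lines. This Python sketch is an assumption-laden illustration: parse_input is a hypothetical name, only the two types named in the text (‘char’ and ‘int’) are handled, and we read the "&" in the <data> production as the separator between rows of 2D data.

```python
def parse_input(line: str) -> tuple[str, tuple[int, int], list[list]]:
    """Parse '<type> <dimension> <data>', e.g. 'char 1x5 A,B,C,D,E'."""
    type_token, dimension, data = line.strip().split(maxsplit=2)
    rows_expected, cols_expected = (int(v) for v in dimension.split("x"))
    # Only the two type tokens named in the text are supported here.
    convert = {"char": str, "int": int}[type_token]
    # '&' is assumed to end one row of a 2D input and start the next.
    rows = [[convert(v.strip()) for v in row.split(",")] for row in data.split("&")]
    if len(rows) != rows_expected or any(len(r) != cols_expected for r in rows):
        raise ValueError(f"data does not match dimension {dimension}: {line!r}")
    return type_token, (rows_expected, cols_expected), rows


lcs_string = parse_input("char 1x5 A,B,C,D,E")
int_matrix = parse_input("int 2x2 1,2&3,4")
```

Checking the data against the declared dimension is an addition of this sketch; it catches a truncated line early instead of leaving a partially filled input array.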
Finally, the user places on the last lines of the input file a switch denoting whether test data should be stored within the texture memory or constant memory. We assume constant to be the default option; therefore setting this switch to 0 uses constant, and 1 uses texture. The user should also define whether they want to maintain the scoring grid on the host as well, or just the current iterations on the GPU. This defaults to only storing what is currently on the GPU; therefore a value of 0 does not maintain the whole scoring grid, and a value of 1 does. These will be captured by the BNF tokens <mem> ::= "mem: 0" | "mem: 1" and <persist> ::= "maintain: 0" | "maintain: 1", respectively.
A complete example of this, using the longest common subsequence as the problem, would appear as shown in Fig. 3.14.
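Reading the complete file of Fig. 3.14 can be sketched in a few lines of Python. This is not the BNF-based parser described above but a simplified stand-in: parse_file and the dictionary layout are hypothetical, only 1D input lines are handled, and treating an omitted mem or maintain line as its default value of 0 is our assumption.

```python
FIG_3_14 = """\
5x5
prev: 2
char 1x5 A,B,C,D,E
char 1x5 C,D,E,F,G
mem: 0
maintain: 1
"""


def parse_file(text: str) -> dict:
    """Collect the fields of an input file into a dictionary (illustrative layout)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n, m = (int(v) for v in lines[0].split("x"))
    config = {"n": n, "m": m, "inputs": []}
    for line in lines[1:]:
        if line.startswith(("prev:", "mem:", "maintain:")):
            key, value = line.split(":")
            config[key] = int(value)
        else:
            type_token, dimension, data = line.split(maxsplit=2)
            config["inputs"].append((type_token, dimension, data.split(",")))
    config.setdefault("mem", 0)        # 0 = constant memory (default), 1 = texture
    config.setdefault("maintain", 0)   # 0 = GPU only (default), 1 = full grid on host
    return config


config = parse_file(FIG_3_14)
```

For the file in Fig. 3.14 the resulting dictionary records a 5 by 5 grid, two previous diagonals, two character inputs, constant memory, and a fully maintained scoring grid.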
Representing the Dynamic Programming Case
Allowing the user to define the case definition of the dynamic programming case is a more challenging consideration, as it must be present at compile time for the model to be able to generate the CUDA kernel. Because it must be present at compile time, the definition of the dynamic programming case cannot be in the input file as with the other parameters, and must be written into a small function. To make this as easy as possible, we develop a small wrapper for the user, allowing them to implement different problems without needing to know the intricacies of CUDA programming.
A function is defined which is called each and every time a cell of the scoring grid is required to be filled, and the user simply needs to fill in the definition of the dynamic programming case here before compiling the program. This function is provided with a struct containing pointers to all the input data the user has defined, allowing the user to access all of the data they have specified in the input file. Also this function is provided with the current iteration number, as well as the i and j index of the cell that is being filled.
This function is also provided with the length of the current iteration, which is used during memory accesses, as well as a pointer to the data struct containing the number of previous iterations the user has opted to maintain, allowing them to load any data dependencies they need. It should be noted at this point that memory access to previous iterations cannot simply take place through the desired (i, j) values. For example, if the wavefront was in the middle of the scoring grid, and needed to load a cell from the previous diagonal iteration, the actual (i, j) value of the desired cell in the context of the entire scoring grid could be a very large number. Therefore this cannot be used to directly load data from the smaller grid on the GPU which is maintaining previous iterations. We provide a function the user must use when making memory accesses to previous iterations, allowing the user to access previous diagonals through global (i, j) references. This is covered in the following section.
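As a rough illustration of how such a cell function and accessor might fit together, the following Python sketch simulates the wavefront sequentially on the host for the longest common subsequence. It is not the CUDA wrapper itself: the names Inputs, PrevDiagonals, load, fill_cell and run are hypothetical, and the modulo-based mapping from a global (i, j) to a slot in the small grid is only one possible scheme, assumed here for the example.

```python
from dataclasses import dataclass


@dataclass
class Inputs:
    """Hypothetical stand-in for the struct of pointers to the input data."""
    a: list
    b: list


class PrevDiagonals:
    """Small grid of the last `height` anti-diagonals, addressed by global (i, j)."""

    def __init__(self, height: int, width: int, m: int):
        self.height, self.m = height, m
        self.rows = [[0] * width for _ in range(height)]

    def load(self, i: int, j: int) -> int:
        if i < 0 or j < 0:                       # outside the grid: LCS boundary value
            return 0
        d = i + j                                # diagonal that holds cell (i, j)
        offset = i - max(0, d - (self.m - 1))    # position of the cell on that diagonal
        return self.rows[d % self.height][offset]

    def store_diagonal(self, d: int, cells: list) -> None:
        self.rows[d % self.height][: len(cells)] = cells


def fill_cell(inputs: Inputs, iteration: int, i: int, j: int, prev: PrevDiagonals) -> int:
    """User-written dynamic programming case; here the LCS recurrence."""
    if inputs.a[i] == inputs.b[j]:
        return prev.load(i - 1, j - 1) + 1
    return max(prev.load(i - 1, j), prev.load(i, j - 1))


def run(inputs: Inputs, height: int = 2) -> int:
    """Sweep the grid one anti-diagonal at a time and return the LCS length."""
    n, m = len(inputs.a), len(inputs.b)
    prev = PrevDiagonals(height, min(n, m), m)
    value = 0
    for d in range(n + m - 1):                   # one iteration per anti-diagonal
        first_i, last_i = max(0, d - (m - 1)), min(n - 1, d)
        cells = [fill_cell(inputs, d, i, d - i, prev) for i in range(first_i, last_i + 1)]
        prev.store_diagonal(d, cells)            # overwrites the oldest stored diagonal
        value = cells[-1]
    return value                                 # the last diagonal holds cell (n-1, m-1)


lcs_length = run(Inputs(list("ABCDE"), list("CDEFG")))
```

Note that fill_cell never indexes the small grid directly; it only passes global (i, j) references to load, which is the property the accessor function is meant to provide. With the two strings of Fig. 3.14 the common subsequence is C, D, E.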