For this single-threaded optimiser, compiling 200 random programs for the host processor takes approximately 3.8 seconds. This varies, however, with the structure of the candidates. For instance, if the tree of a program is sufficiently deep, many statements need to be compiled, which increases this time considerably.

The time taken to execute 200 random programs is approximately 1.7 ms. The predator-prey model discussed with regard to parallelism in the next section is far more computationally expensive and requires careful consideration, as it is also stochastic to a certain degree. In the case of the Santa Fe Ant Trail problem, the result is deterministic and only one agent is active, which makes it well suited to being computed on the host.
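The two timings above can be put on a per-program footing with a little arithmetic. The following is only a sketch: the totals are the figures quoted in the text, while the function and variable names are chosen here for illustration.

```python
def per_program(total_seconds, n_programs):
    """Average time per program for a batch that was timed as a whole."""
    return total_seconds / n_programs

COMPILE_TOTAL_S = 3.8      # compiling 200 random programs (figure from the text)
EXECUTE_TOTAL_S = 1.7e-3   # executing the same 200 programs (figure from the text)
N_PROGRAMS = 200

compile_each = per_program(COMPILE_TOTAL_S, N_PROGRAMS)   # about 19 ms
execute_each = per_program(EXECUTE_TOTAL_S, N_PROGRAMS)   # about 8.5 microseconds
ratio = COMPILE_TOTAL_S / EXECUTE_TOTAL_S                 # compile cost relative to execution

print(f"compile: {compile_each * 1e3:.1f} ms per program")
print(f"execute: {execute_each * 1e6:.1f} us per program")
print(f"compilation costs roughly {ratio:.0f} times as much as execution")
```

For a problem this cheap to evaluate, compilation rather than execution dominates the cost of a generation, which is consistent with keeping the Santa Fe Ant Trail evaluation on the host.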

6.5 Parallel MOL

(I, p.112)

The performance of the optimising MOL language was observed to be prohibitively low whenever the fitness evaluation phase is complex. Computing times vary, but are generally of the order of hours for a single experiment. Lessons learned from meta-optimisation (see Section 3.5) reaffirm that performance would be extremely problematic, mostly because of the averaging and re-averaging necessary to obtain a good fitness estimate (should the problem be nondeterministic), and because candidates are actually entire simulations. Van Berkel’s effort was distributed across a set of processor nodes, but the performance results given indicate that the execution of a single program took upwards of 350 ms for a program of lowest complexity [283]; together with averaging, the author reported a total runtime of around three hours for one experiment.
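To see how averaging inflates the cost, consider a rough serial cost model. This is a sketch under stated assumptions: only the 350 ms per-run figure comes from the text. The generation count, population size and number of repeats are hypothetical values picked purely to show how such a budget reaches hours, and they are not taken from [283].

```python
def mean_fitness(samples):
    """Fitness estimate for a stochastic model: the mean over repeated runs."""
    return sum(samples) / len(samples)

def serial_experiment_seconds(generations, candidates, repeats, seconds_per_run):
    """Wall-clock estimate when every candidate is simulated `repeats` times
    per generation and nothing runs concurrently."""
    return generations * candidates * repeats * seconds_per_run

# Hypothetical experiment size; only seconds_per_run is a figure from the text.
estimate = serial_experiment_seconds(
    generations=100, candidates=30, repeats=10, seconds_per_run=0.35
)
print(f"{estimate / 3600:.1f} hours")
```

Every term multiplies the others, so reducing the per-run time or running candidates concurrently attacks the dominant cost directly.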

Stronger interest in large-scale agent-based models is also surfacing, where a large number of agents are simulated [220]. Being able to represent large populations is sometimes a necessity. For example, in ecology, a technique was even developed for approximating the influence of multiple agents in a “super-agent” [252]. For agent populations of the order of 10^6 or more, it becomes impractical to use machine learning for developing models. At this point in time, even with aggressive code optimisations, evolving large systems is still out of reach, but efforts making use of GPU hardware bring this goal closer. Even if large systems are not used during optimisation, they can be scaled up afterwards, though the dynamics of the system may change dramatically.

Fortunately, NVIDIA have released a backend for LLVM which generates Parallel Thread Execution (PTX) instructions for NVIDIA GPU hardware [165]. The implication is that any LLVM frontend can now (with appropriate modifications) generate LLVM IR code suitable for compiling to PTX instructions. Terra [46], upon which MOL is built, is also capable of this. As explained earlier, agent-based model simulations implemented on GPU hardware have in the past involved custom code, or code transformations [244] from agent specifications such as the X-machine [37]. Previous implementations of Genetic Programming algorithms on GPU hardware also evaluated candidates by using an interpreter [156]. Very sophisticated methods, such as the evolution of CUDA PTX programs themselves in 2011 [41] using the CUDA driver API, were perhaps a sign of what was to come with the NVIDIA LLVM backend.

In order to enable the lattice-based MOL language to be compiled for execution on GPU hardware, it is necessary to:

1. Adjust how the data is stored in the host C++ program

2. Support a population of models executing concurrently

4. Generate CUDA code instead of host code

5. Reimplement a suitable source of random deviates

6. Consider different parallelisation strategies and how to implement these automatically

7. Consider the possibility of concurrency race conditions

Focus is given to the efficient computation of the objective function, in other words, the simulation of the candidate models so that fitness scores can be gathered quickly. In Chapter 4, genetic operators were themselves implemented in parallel to cope with large numbers of candidates. While this provided some additional good scaling characteristics, the computing of the objective function proved to be far more computationally expensive. Successfully mitigating this cost will surely dwarf the potential benefit that can be achieved by parallelising the genetic operators.

Previously, the lattice along with a temporary write-only lattice was allocated on the host. Should CUDA be enabled in a MOL model, the data is instead allocated on the GPU hardware by using the CUDA API [202]. These device pointers are provided to the Terra compiled function instead of pointers to host memory. The compiled MOL code is therefore able to operate on the lattice, as allocated by the host on the GPU hardware. The code parser and type checker are identical, but a separate CUDA code generator is used in order to accommodate the restrictions imposed by the CUDA GPU architecture. Compiled code is then mostly PTX instructions, wrapped with the necessary host code to launch CUDA kernels with the correct thread grid and block dimensions. Once a timestep is computed, the data is copied back from the GPU to the host and then passed to the visualiser.
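The role of the temporary write-only lattice can be illustrated on the CPU. The sketch below is not MOL's generated code: the update rule, the function names and the choice to swap the two buffers after each timestep (rather than copy the result back) are assumptions made for illustration.

```python
def shift_right(read, x, y, width, height):
    """Hypothetical update rule: each cell takes the value of its left
    neighbour, wrapping around at the lattice edge."""
    return read[y * width + (x - 1) % width]

def timestep(read, write, width, height, rule):
    """Compute one timestep. Cells are only read from `read` and only
    written to `write`, so the update order cannot affect the result."""
    for y in range(height):
        for x in range(width):
            write[y * width + x] = rule(read, x, y, width, height)
    # Swap roles: the freshly written lattice becomes the readable one.
    return write, read

width, height = 4, 1
lattice = [1, 0, 0, 0]
scratch = [0] * (width * height)
lattice, scratch = timestep(lattice, scratch, width, height, shift_right)
print(lattice)  # [0, 1, 0, 0]
```

On GPU hardware the two lists would correspond to two device allocations whose pointers are passed to the compiled function, with the copy back to the host happening only after the timestep completes.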

To accommodate a population of different candidate models, as opposed to simply simulating a single given MOL-implemented model, it is necessary to extend the visualisation module and to allocate enough GPU memory for n separate candidate models. In essence, the separate portions of the allocated memory represent independent models, which are handed to their corresponding optimiser-modified MOL programs.
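One way to picture this layout is a single flat allocation divided into equal, non-overlapping slices, one per candidate. The following sketch assumes that every candidate uses the same lattice dimensions and that slices are laid out contiguously; neither detail is stated in the text, and the names are hypothetical.

```python
def model_offsets(n_models, width, height):
    """Return (start, end) index pairs, one per candidate model, into a
    single flat buffer holding all lattices back to back."""
    cells = width * height
    return [(m * cells, (m + 1) * cells) for m in range(n_models)]

n_models, width, height = 3, 4, 2
pool = [0] * (n_models * width * height)   # stands in for one GPU allocation
offsets = model_offsets(n_models, width, height)

# The program for model 1 receives only its own slice and writes into it.
start, end = offsets[1]
for i in range(start, end):
    pool[i] = 7

print(offsets)  # [(0, 8), (8, 16), (16, 24)]
```

Because the slices never overlap, writes made by one candidate cannot disturb the lattice of another, which is what allows the candidates to be simulated independently.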

Compiling Terra code for CUDA is straightforward, provided that the boundaries of the device in terms of memory and thread resources are respected. The usual Terra code generated is essentially compiled into a single CUDA kernel, which is launched with a grid and block configuration, and its arguments, by a separate host Terra function. Given that an appropriate grid and block must be provided, this presents an opportunity to discuss different parallelisation techniques.

Three parallelisation strategies are implemented, from which the user may freely choose. The first is a simple “one-thread, one-model” (1T1M) strategy, where a single CUDA thread is assigned a candidate model. This CUDA thread is then responsible for executing the entire model simulation once per timestep. This is unsuitable most of the time, especially when a candidate model operates on a larger lattice or is computationally demanding. The second strategy is named “one-block, one-model” (1B1M), in which an entire CUDA block is dedicated to computing a single model simulation once per timestep. While this may seem the obvious choice in nearly all circumstances, the limit on block size (1024 threads maximum at the time of writing) means that lattice sizes are limited too. A great many candidate model simulations can be executed concurrently at reasonable speeds using this strategy, but the limitation in lattice size is a considerable issue. The third strategy is termed “many-blocks, one-model” (*B1M), where multiple blocks are assigned to a single candidate model. This allows much larger model sizes, but race conditions become more difficult to eliminate, which requires further strategies.
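The difference between the three strategies shows up in the grid and block dimensions that the kernel launch would receive. The sketch below is a CPU-side calculation only. It assumes one thread per lattice cell for 1B1M and *B1M, and a default block size of 256 threads; both are assumptions made for illustration, as is the function name. Only the 1024-thread limit comes from the text.

```python
import math

MAX_THREADS_PER_BLOCK = 1024  # block size limit quoted in the text

def launch_config(strategy, n_models, cells_per_model, block_size=256):
    """Return (number_of_blocks, threads_per_block) for a population of
    candidate models, under the one-thread-per-cell assumption."""
    if strategy == "1T1M":
        # One thread simulates a whole model; threads are packed into blocks.
        return math.ceil(n_models / block_size), block_size
    if strategy == "1B1M":
        # One block per model, so the lattice must fit inside a single block.
        if cells_per_model > MAX_THREADS_PER_BLOCK:
            raise ValueError("lattice too large for one block")
        return n_models, cells_per_model
    if strategy == "*B1M":
        # Several blocks cooperate on each model, lifting the size limit.
        blocks_per_model = math.ceil(cells_per_model / block_size)
        return n_models * blocks_per_model, block_size
    raise ValueError(f"unknown strategy: {strategy}")

print(launch_config("1T1M", 200, 64 * 64))   # (1, 256)
print(launch_config("1B1M", 200, 32 * 32))   # (200, 1024)
print(launch_config("*B1M", 200, 64 * 64))   # (3200, 256)
```

Under these assumptions a 32 by 32 lattice is the largest that 1B1M accepts, whereas *B1M handles a 64 by 64 lattice by spreading each model over 16 blocks, at the price of coordinating threads that no longer share a block.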

The user must choose a strategy to overcome potential race conditions in a model simulation. Two strategies are provided from which the user may choose for the 1B1M and *B1M parallelisation strategies. The first is a
