
Two main CUDA libraries were used when developing the code for this thesis. Thrust [92], a library based on the C++ STL for implementing high-performance parallel applications, is used for sorting operations. CURAND [90] is used for random number generation in parallel. Both libraries are briefly covered in Sections 3.5.1 and 3.5.2.

3.5.1 Thrust

The Thrust template library [92] contains numerous high-performance parallel algorithms, allowing programmers to optimize these parts of their code quickly and efficiently. Thrust is based on the C++ STL and is installed with the CUDA Toolkit. All that is required to use Thrust functionality is to include the relevant header files in the CUDA code.

Notable uses of Thrust in parallel particle implementations include the zip_iterator, which allows several particle attribute arrays to be presented as a single sequence of tuples and iterated over at the same time. For instance, during a numerical integration step, each particle’s position, velocity and force can be zipped and traversed using just one iterator instead of three.

For parallel implementations of particle systems, organizing particle spatial data becomes very important. This will be covered in greater detail when describing the spatial grid approach in Chapter 5, but for now consider the basic problem that particles must be sorted with respect to their position in a three-dimensional world space. One way of doing this is to divide the world space into a three-dimensional grid and, from the bottom up, link particles to particular cells in the grid. Sorting by cell needs to be efficient, as a large simulation contains many particles and an inefficient sorting algorithm will slow down performance.
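The link between a particle and its grid cell can be illustrated with a minimal host-side C++ sketch. The `Grid` struct, the `clampCell` and `cellIndex` helpers, and the x-fastest linear indexing are all assumptions made for illustration; the scheme actually used in this thesis is the one described in Chapter 5.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical description of a uniform grid covering the world space.
struct Grid {
    float originX, originY, originZ; // minimum corner of the world space
    float cellSize;                  // edge length of one cubic cell
    int   dimX, dimY, dimZ;          // number of cells along each axis
};

// Clamp a cell coordinate so that particles on or outside the boundary
// still map to a valid cell.
inline int clampCell(int c, int dim) {
    return c < 0 ? 0 : (c >= dim ? dim - 1 : c);
}

// Map a world-space position to a linear cell index (x varies fastest).
inline std::uint32_t cellIndex(const Grid& g, float x, float y, float z) {
    assert(g.cellSize > 0.0f);
    const int cx = clampCell(static_cast<int>(std::floor((x - g.originX) / g.cellSize)), g.dimX);
    const int cy = clampCell(static_cast<int>(std::floor((y - g.originY) / g.cellSize)), g.dimY);
    const int cz = clampCell(static_cast<int>(std::floor((z - g.originZ) / g.cellSize)), g.dimZ);
    return static_cast<std::uint32_t>((cz * g.dimY + cy) * g.dimX + cx);
}
```

The resulting index can serve as the integer key by which particles are sorted, so that particles sharing a cell end up adjacent in memory.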

Sorting is performed with Thrust’s sort_by_key, a radix sort algorithm. Radix sort is one of the best known sorting algorithms and is very efficient for sorting small keys, assuming each key is represented as an integer number in some radix notation [114]. Radix sorts fall under two main variants, least significant digit (LSD) and most significant digit (MSD). LSD is the better choice for sorting integer keys, while MSD is better suited for sorting characters [97]. An LSD sort groups keys by their least significant digit, usually sorting on that digit with a bucket or counting sort [21], and then repeats the process for each more significant digit. Radix sorts are often used as alternatives to comparison-based sorting algorithms such as merge sort, which are generally more efficient for more complex keys.
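To make the LSD procedure concrete, the following is a minimal sequential C++ sketch of a key-value radix sort that uses a counting sort on each 8-bit digit. It is not Thrust’s implementation; the function name `radixSortByKey`, the 32-bit key type and the radix of 256 are assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort `keys` in ascending order and apply the same permutation to `values`,
// using an LSD radix sort with 8-bit digits (radix 256).
void radixSortByKey(std::vector<std::uint32_t>& keys,
                    std::vector<std::uint32_t>& values) {
    assert(keys.size() == values.size());
    const std::size_t n = keys.size();
    std::vector<std::uint32_t> keysTmp(n), valuesTmp(n);

    // One pass per digit, starting from the least significant one.
    for (int shift = 0; shift < 32; shift += 8) {
        // Counting sort, step 1: histogram of the current digit.
        std::array<std::size_t, 256> count{};
        for (std::size_t i = 0; i < n; ++i)
            ++count[(keys[i] >> shift) & 0xFFu];

        // Step 2: an exclusive prefix sum turns counts into start offsets.
        std::size_t offset = 0;
        for (std::size_t d = 0; d < 256; ++d) {
            const std::size_t c = count[d];
            count[d] = offset;
            offset += c;
        }

        // Step 3: stable scatter of keys and their values.
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t dst = count[(keys[i] >> shift) & 0xFFu]++;
            keysTmp[dst] = keys[i];
            valuesTmp[dst] = values[i];
        }
        keys.swap(keysTmp);
        values.swap(valuesTmp);
    }
}
```

In a particle system the keys would typically be grid cell indices and the values particle indices. Each pass must be stable, since the ordering established by the less significant digits has to survive the later passes.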

Thrust also provides prefix-sum (scan) operations, which are very useful for parallel sum calculations. For example, an exclusive prefix-sum scan can be used to calculate the total number of cells in a grid that are occupied by particles. Exclusive scan is used in the generation of surfaces in Chapter 7.
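The occupied-cell count can be sketched sequentially in C++ as follows. This is an illustration of the idea rather than the Thrust call itself; `exclusiveScan` and `countOccupiedCells` are hypothetical helper names, and the input is assumed to be an array holding the number of particles in each cell.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<std::uint32_t> exclusiveScan(const std::vector<std::uint32_t>& in) {
    std::vector<std::uint32_t> out(in.size());
    std::uint32_t running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Count the cells that contain at least one particle.
std::uint32_t countOccupiedCells(const std::vector<std::uint32_t>& particlesPerCell) {
    if (particlesPerCell.empty()) return 0;

    // Flag each occupied cell with 1, each empty cell with 0.
    std::vector<std::uint32_t> flags(particlesPerCell.size());
    for (std::size_t i = 0; i < flags.size(); ++i)
        flags[i] = particlesPerCell[i] > 0 ? 1u : 0u;

    // After the scan, slots[i] is the compact index of cell i among the
    // occupied cells; the last slot plus the last flag gives the total.
    const std::vector<std::uint32_t> slots = exclusiveScan(flags);
    return slots.back() + flags.back();
}
```

Besides the total, the scanned array gives every occupied cell a unique slot in a compacted output array, which is why the scan is preferred over a plain sum in parallel code.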

    curandState localState = state[id];

    // Circular base
    float theta = (2 * PI) * curand_uniform(&localState);
    float r = curand_uniform(&localState);
    pos[id].x = 0.2 * (r * cos(theta));
    pos[id].y = 0.0f;
    pos[id].z = 0.2 * (r * sin(theta));

    // Make red more prominent towards the outside of the flame
    col[id].x = 1.0f;
    col[id].y = (0.5 + 0.3 * curand_uniform(&localState)) * (1-r);
    col[id].z = (0.0 + 0.5 * curand_uniform(&localState)) * (1-r);

Listing 3.2: Code fragment of a fire system kernel using CURAND randomization.

3.5.2 CURAND Random Number Generation

The highly randomized structure of some simulations demands efficient generation of random numbers, but this is not so simple when generating in parallel. If each particle has to randomize attribute values, a thread may not be able to guarantee that the random seed it is using is unique in the particle system. Many particles would then generate the same random numbers and behave in exactly the same way. Since the number of particles in an average system is quite large, it is important to make sure every thread has a unique random seed of good quality. The CURAND library takes care of this.
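The hazard of shared seeds can be reproduced on the host with a small C++ sketch. Here `std::mt19937` merely stands in for a per-particle generator; it is not CURAND, and the helper `firstDraws` is a hypothetical name introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Return the first n values produced by a generator seeded with `seed`.
std::vector<std::uint32_t> firstDraws(std::uint32_t seed, std::size_t n) {
    std::mt19937 gen(seed);
    std::vector<std::uint32_t> out(n);
    for (auto& v : out)
        v = static_cast<std::uint32_t>(gen());
    return out;
}
```

Two particles seeded identically, for example `firstDraws(42, 4)` called twice, receive exactly the same values and would therefore move and colour identically, whereas particles given distinct seeds receive different sequences.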

CURAND [90] is a random number generation library designed specifically for use with CUDA. It allows generation of high-quality pseudo-random and quasi-random number sequences efficiently in parallel. CURAND provides functionality for random number generation on both the host and device. If generated directly on the device, the random numbers are stored in global memory. In order to use the library, the header files curand.h and curand_kernel.h need to be included in the program. CURAND libraries are constantly being updated. The current version, v6.5, comes with the CUDA 6 Toolkit. The simulation itself only needs to make use of the basic random number generation functionality, although the libraries are capable of much more complex generation. The following is a brief explanation of how to use the library at a basic level in relation to the particle system.

An array of the special type curandState is initialized for use on the device only (it needs no host equivalent). Each particle in the simulation should have its own random number generator, so memory is allocated on the device for the number of particles in the simulation times the size of the curandState type. Before any random numbers can be generated, the random number generator must first be initialized to set up the CURAND states, using the function curand_init. This function sets up an initial state from the given seed, sequence number (id) and offset (0). The pseudo-random number sequences generated have a period of at least 2^190 [90].

It is best practice to use a unique seed every time the random number generator is initialized, and multiple kernel launches should use the same seed but assign different sequence numbers in a monotonically increasing way [90]. This is sufficiently random as long as the random numbers do not have to be unpredictable; for example, this would not be sufficiently random for security-related problems such as password generation. However, for the purposes of random number generation for particle systems, this method will efficiently produce pseudo-random numbers of sufficient quality in parallel.

Number of Cores                    2304
Clock Frequency (MHz)              863
Texture Fill Rate (billion/sec)    160.5
Memory Speed (Gbps)                6.0
Memory Configuration               3072 MB GDDR5
Memory Interface Width             384-bit
Memory Bandwidth (GB/sec)          288.4

Table 3.2: NVIDIA GeForce GTX 780 Specifications

Once the CURAND states have been initialized, the main simulation kernel can be executed. A random number can be generated from a given state by calling variations of the curand() function. In this case, curand_uniform is used to generate a random number in a uniform distribution between 0.0 and 1.0. Other distributions such as the normal and Poisson distributions are also available, as well as double-precision variants of these. An example of using CURAND random number generation is shown in Listing 3.2, a portion of a global kernel call for updating particles which uses curand_uniform to randomize a particle’s initial starting position and colour variables.
