The Robust Color Morphological Gradient (RCMG) computes the vectorial gradient of an n-dimensional image based on the distances between pixel vectors, resulting in a one-band gradient image (see Section 2.2.2).
From a processing point of view, two different algorithms have been implemented to compute the RCMG. The first one is based on spatial-domain partitioning within a block, as illustrated in Figure 5.1(a), and the second one is based on spectral-domain partitioning within a block, Figure 5.1(b). Each shaded rectangle in this figure represents a block of X×Y threads. In the spatial partitioning, the pixel vectors are kept as a whole, whereas in the spectral partitioning the pixel vectors are subdivided into slices made up of contiguous spectral bands. In both cases, data are stored in global memory so that consecutive threads access consecutive global memory locations (coalesced accesses).
In these algorithms, the processing of one pixel requires data from the neighboring pixels. So, in order to compute the RCMG, we must extend the block with a border, resulting in a block with an apron of size one, as explained in Section 4.1.1.
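The effect of the apron on the size of the region a block must load can be sketched as follows. This is a minimal illustration in Python, not the CUDA code of this work; the function name `tile_with_apron` and the assumption that each thread of an X×Y block maps to one pixel of one band are hypothetical.

```python
def tile_with_apron(block_x, block_y, apron=1):
    """Return (width, height, n_elements) of a tile extended by an apron.

    Assumption (hypothetical): one pixel of one band per thread, so a block
    of block_x x block_y threads covers block_x x block_y pixels.
    """
    width = block_x + 2 * apron    # one extra column on each side
    height = block_y + 2 * apron   # one extra row on each side
    return width, height, width * height


w, h, n = tile_with_apron(32, 4)
print(w, h, n)  # 34 6 204
```

Under these assumptions, a block of 32×4 threads needs 204 elements per band instead of 128, so the relative cost of the apron is larger for thin blocks.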
The pseudocode introduced in Figure 5.2 shows the RCMG kernel that was described in Section 3.2.1. The gradient calculation is divided into three steps. First, threads within a block load data from global to shared memory, including the extended border of size one (line 2). Second, the threads of the same block cooperate to calculate the distances of the set χ (lines
Figure 5.1: Block configuration for spatial (a) and spectral (b) partitioning.
previous step. This kernel makes use of the shared memory at each step, so data are reused within the block.
Spatial partitioning algorithm for RCMG
In this implementation, threads in the X dimension load different components of the same pixel vector simultaneously into shared memory. In each block, thread t1 loads the first feature of one pixel vector of the image, thread t2 loads the second feature of the same pixel vector, and so on up to thread tk, which loads the k-th feature, with k the number of hyperspectral bands of the image.
With the data of a pixel vector in shared memory, each thread computes a partial result $(x_k^i - x_k^j)$, with $i, j \in \chi$, as described in (2.8). First, threads in the X dimension cooperate in
a parallel reduction [73] within the block for computing the CMG. Half of the threads work in the reduction, and the number of active threads is halved at each iteration as the reduction proceeds. Second, one thread per pixel vector finds the pair of pixels that generated the maximum distance and computes the RCMG with the remaining distances. It should be noted that the distances are stored in shared memory and are therefore available to all threads within the block. Finally, in the third step, the RCMG is written to global memory, resulting in a one-band gradient image that is kept in the global memory of the GPU.
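The reduction pattern described above, in which the number of active threads is halved at each iteration, can be sketched sequentially. This is a minimal CPU simulation in Python under stated assumptions, not the kernel of this work: the names `block_reduce_sum` and `distance_by_reduction` are hypothetical, the buffer is padded to a power of two, and the Euclidean distance is assumed, so each per-band term is squared before the sum.

```python
import math


def block_reduce_sum(values):
    """Simulate a shared-memory tree reduction (sum) within one block."""
    n = 1
    while n < len(values):
        n *= 2                                  # pad to a power of two
    shared = list(values) + [0.0] * (n - len(values))
    stride = n // 2
    while stride > 0:
        for t in range(stride):                 # only `stride` threads are active
            shared[t] += shared[t + stride]
        # a real kernel would synchronize the block at this point
        stride //= 2                            # halve the active threads
    return shared[0]


def distance_by_reduction(xi, xj):
    """Euclidean distance between two pixel vectors, one band per thread."""
    partial = [(a - b) ** 2 for a, b in zip(xi, xj)]  # per-thread partial result
    return math.sqrt(block_reduce_sum(partial))


print(distance_by_reduction([3.0, 0.0, 0.0], [0.0, 4.0, 0.0]))  # 5.0
```

The inner `for` loop stands in for threads that would run concurrently on the GPU; the number of iterations of the outer loop grows with the logarithm of the number of bands.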
5.2. CA–WSHED–GPU 137
Input: hyperspectral dataset X
Output: one-band image
 1: for each band k of X do
 2:     load band k in shared memory                                  ▷ step 1 (SM)
 3:     for each pixel x in band k do                                 ▷ step 2 (SM)
 4:         compute and accumulate the corresponding term x_k^i − x_k^j in
            the Euclidean distance D_{i,j} | i,j ∈ χ, with χ the set of neighbors of pixel x
 5:     end for
 6:     synchronize threads within the block
 7: end for
 8: compute CMG(X) = max_{i,j ∈ χ} D_{i,j}
 9: find the pair of pixels R_s = (i,j) | D_{i,j} = CMG(X)
10: compute RCMG(X) = max_{i,j ∈ χ−R_s} {D_{i,j}}                     ▷ step 3 (SM)
11: store RCMG(X) to global memory
SM stands for computation in shared memory.
Figure 5.2: Pseudocode for the RCMG CUDA kernel executed in shared memory (spectral-domain partitioning)
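The steps of Figure 5.2 can be mirrored by a sequential reference for a single pixel. The following Python sketch is an illustration under stated assumptions and not the CUDA kernel: the function name `rcmg_pixel` is hypothetical, the neighborhood χ is passed explicitly as a list of pixel vectors, and χ−R_s is interpreted as removing both pixels of the farthest pair before taking the maximum again.

```python
import math
from itertools import combinations


def rcmg_pixel(neighborhood):
    """Return (CMG, RCMG) for one pixel given its neighborhood chi.

    neighborhood: list of pixel vectors (lists of band values), i.e. the
    pixel itself plus its neighbors.
    """
    n_bands = len(neighborhood[0])
    pairs = list(combinations(range(len(neighborhood)), 2))
    acc = {p: 0.0 for p in pairs}

    # Lines 1-7: scan the bands one at a time and accumulate each term.
    for k in range(n_bands):
        for (i, j) in pairs:
            diff = neighborhood[i][k] - neighborhood[j][k]
            acc[(i, j)] += diff * diff

    dist = {p: math.sqrt(v) for p, v in acc.items()}

    # Lines 8-9: CMG and the pair R_s that produced it.
    r_s = max(pairs, key=lambda p: dist[p])
    cmg = dist[r_s]

    # Line 10: maximum over the pairs that use neither pixel of R_s.
    remaining = [d for (i, j), d in dist.items()
                 if i not in r_s and j not in r_s]
    rcmg = max(remaining, default=0.0)
    return cmg, rcmg


# Five pixel vectors (4-connectivity), two bands, one outlier.
chi = [[0, 0], [1, 0], [0, 1], [1, 1], [10, 10]]
cmg, rcmg = rcmg_pixel(chi)
```

In this example the outlier dominates the CMG (the square root of 200), whereas the RCMG drops to the square root of 2 once the farthest pair has been removed, which is the robustness property the gradient is designed for.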
Spectral partitioning algorithm for RCMG
In the spectral partitioning implementation, each thread processes all the spectral components of a pixel vector in a loop through all the hyperspectral bands. The pseudocode shown in Figure 5.2 corresponds to the spectral partitioning algorithm. At each iteration k, all the threads load the data corresponding to the k-th band into shared memory and compute the partial results $(x_k^i - x_k^j)$ for each pair of neighbors i, j. At the end of the loop, all the distances for each pixel are available in shared memory. To compute the CMG, each thread finds the maximum of the distances of its set χ (step two in Figure 5.2) and the corresponding pair of pixels that generated that maximum (line 9). Having identified the pixel vectors that are furthest apart, each thread computes the RCMG with the remaining distances (step three in Figure 5.2) and writes the result back to global memory, which is the last step of the algorithm. This implementation requires less shared memory than the previous one owing to the sequential scanning in the spectral domain.
RCMG                        128×4   32×4   32×8   32×16
Spatial Partitioning L1     (∅)     na     na     na
Spectral Partitioning L1    na      1      (∅)    (∅)
Spatial Partitioning Sh     1       na     na     na
Spectral Partitioning Sh    na      4      2      1
Table 5.1: Number of active blocks per SMX for the spatial-domain and spectral-domain partitioning, based on the block size and the shared memory requirements. L1 indicates that 48 KB are used for the L1 cache and 16 KB for the shared memory. Sh denotes the opposite configuration. Results for the Pavia University dataset. The best occupancy is indicated in bold.
Performance analysis for RCMG on GPU
In order to include the best implementation of the RCMG in the CA–WSHED–GPU scheme, we first evaluated its performance on an Intel quad-core i7-860 microprocessor and a GTX 680 GPU based on the Kepler architecture (compute capability 3.0). The GPU implementations of the RCMG are compared to a parallel multi-threaded CPU implementation using OpenMP. The OpenMP implementation uses 4 threads and is based on the spectral-domain partitioning approach. The work is scheduled statically among the threads through a loop construct. The datasets used in this analysis are Pavia University, which has 103 spectral bands, and Salinas Valley, which has 220 spectral bands. Details of these datasets are given in Section 2.9.3. Different block configurations were tested, and the spectral partitioning RCMG implementation was finally configured with blocks of 32×4 threads. For the spatial partitioning RCMG implementation, 128×4 threads per block and 256×2 threads per block were considered. Each block in the spatial partitioning approach processes a region of 4×4 pixel vectors in the first case and a region of 2×2 pixel vectors in the second one. Thus, each thread in a block processes 4 or 2 pixel vectors in this implementation.
Table 5.1 shows the maximum number of active blocks for each implementation using double-precision arithmetic. The entry “na” indicates that the block configuration was not available for that implementation, and (∅) that insufficient resources are available for that configuration. For example, the spatial partitioning RCMG configured with 128×4 threads per block requires 42240 bytes of shared memory per block in order to compute the distances of each pixel vector in shared memory. Thus, the L1 configuration, with 16 KB of shared memory, cannot execute even one block in the SMX. The operations are performed
Spatial-domain partitioning     OpenMP (4 threads)   GTX 680 single      GTX 680 double
University of Pavia             0.1702s              0.0537s (2.8×)      0.1317s (1.3×)
Salinas Valley                  0.1959s              0.0638s (3.1×)      ∅

Spectral-domain partitioning    OpenMP (4 threads)   GTX 680 single      GTX 680 double
University of Pavia             0.1702s              0.0085s (17.8×)     0.0231s (7.3×)
Salinas Valley                  0.1959s              0.0092s (21.3×)     0.0272s (6.4×)
Table 5.2: Performance results for the spatial and spectral partitioning algorithms for computing the RCMG. Execution times in seconds. Speedups are relative to the OpenMP implementation using four threads. Best results in bold.
in double precision, which requires twice as much memory as single-precision arithmetic.
The computation of the distances between all pairs of pixel vectors requires a lot of shared memory. By using 4-connectivity, we have 10 pairs in the set χ (see Section 3.2.1), and 36 pairs in the case of 8-connectivity. The highest number of concurrent blocks is always achieved with the Sh configuration, i.e., using 48 KB of shared memory.
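The pair counts quoted above follow from choosing two pixels out of the set χ, which contains the pixel itself plus its 4 or 8 neighbors. A short Python check, with the hypothetical helper name `pairs_in_chi`:

```python
from math import comb


def pairs_in_chi(connectivity):
    """Number of unordered pixel pairs in chi (the pixel plus its neighbors)."""
    return comb(connectivity + 1, 2)


print(pairs_in_chi(4), pairs_in_chi(8))  # 10 36
```

Since one distance is kept per pair and per pixel, moving from 4-connectivity to 8-connectivity multiplies the storage needed for the distances by 3.6.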
Table 5.2 shows the execution times and speedups using 4-connectivity. The best results are obtained by the spectral partitioning RCMG, with speedups of 17.8× and 21.3× in single precision. The shared memory requirements for the spectral partitioning RCMG are 5.7 KB (single) and 11.4 KB (double) per block, while the spatial partitioning RCMG requires up to 20.6 KB (single) and 41.2 KB (double). Thus, more blocks per SMX are executed concurrently in the spectral approach, which leads to a better speedup. The same applies to the case of double-precision arithmetic.
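The link between the shared memory needed per block and the number of concurrent blocks can be illustrated with a simple bound. This is a rough sketch in Python with a hypothetical function `blocks_per_smx`; it considers shared memory only and ignores the other occupancy limits (registers, threads and blocks per SMX), and the spectral figure is the rounded 11.4 KB quoted in the text.

```python
def blocks_per_smx(shared_per_block_bytes, shared_available_bytes):
    """Upper bound on resident blocks per SMX due to shared memory alone."""
    return shared_available_bytes // shared_per_block_bytes


KB = 1024
spatial_double = 42240              # bytes per block, 128x4 threads (from the text)
spectral_double = int(11.4 * KB)    # approximate bytes per block, 32x4 threads

for name, need in [("spatial", spatial_double), ("spectral", spectral_double)]:
    # 16 KB of shared memory (L1 configuration) vs. 48 KB (Sh configuration)
    print(name, blocks_per_smx(need, 16 * KB), blocks_per_smx(need, 48 * KB))
# spatial 0 1
# spectral 1 4
```

Under these assumptions the bound agrees with the double-precision entries of Table 5.1 for the 128×4 and 32×4 configurations, which suggests that shared memory is the limiting resource in those cases.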
When the calculations are performed in double-precision arithmetic, which is the case for the spectral-spatial classification scheme, we observe that the spatial partitioning approach, where the pixel vectors are kept as a whole, does not work for the Salinas hyperspectral image. The reason is that the shared memory requirement for a block size of 256×2 threads rises to 76.3 KB. Thus, there is insufficient shared memory in the SMXs to execute even one block of threads. However, as the computation can be done individually in each band, the spectral-domain partitioning can be used for computing the RCMG on the GPU. The spectral partitioning RCMG obtains speedups of 6.5× for the University of Pavia image and 7.2× for the Salinas Valley image,
comparing the execution time in double-precision arithmetic with the OpenMP implementation.