The running times of selected CUDA kernel calls on the two GPUs in use are presented in table 4.1. The time reduction is given as the difference between the old and new running times as a percentage of the old running time, so that all kernels share a common figure of merit regardless of their original running time. This is not a very good figure of merit, but it can be used for comparisons.
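The time reduction described above can be written out as a short calculation. The following is a minimal sketch in Python; the function name `time_reduction_percent` is chosen here for illustration and does not come from the program itself. The two example rows are taken from table 4.1.

```python
def time_reduction_percent(old_ms: float, new_ms: float) -> float:
    """Reduction in running time as a percentage of the old running time."""
    if old_ms <= 0:
        raise ValueError("old running time must be positive")
    return 100.0 * (old_ms - new_ms) / old_ms


# Two rows of table 4.1: (operation, Quadro FX 3700M in ms, GeForce GTX 670 in ms)
rows = [
    ("Invert mu_a,d, 28 bands", 0.68, 0.10),
    ("SCA, 20 bands, 7 endmembers", 2.42, 1.83),
]

for name, old_ms, new_ms in rows:
    print(f"{name}: {time_reduction_percent(old_ms, new_ms):.0f} %")
```

Because the measure is relative, a kernel that already runs in a few hundredths of a millisecond can show no reduction at all even though its absolute running time is negligible, which is one reason it is only a rough figure of merit.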
As expected, the running time is lower on the newer graphics card, but not by a large amount. The total running time was 25 ms on the older GPU and 7 ms on the higher-end graphics card. The age gap between the two graphics cards is about 2 years, and although they are developed for different applications, an upgrade of this size should give a far larger difference in running time.
Table 4.1: Comparison of running times for some CUDA kernels across graphics cards.

Operation                        Quadro FX 3700M (ms)   GeForce GTX 670 (ms)   Time reduction (%)
Invert µa,d, 28 bands            0.68                   0.10                   85
SCA, 28 bands, 4 endmembers      1.10                   0.51                   54
Invert µa,e, 28 bands            2.51                   0.64                   75
Monochrom. unmixing, 28 bands    0.02                   0.02                   0
Invert µa,d, 160 bands           3.74                   0.50                   87
SCA, 31 bands, 5 endmembers      1.52                   1.09                   28
SCA, 20 bands, 7 endmembers      2.42                   1.83                   24
SCA, 26 bands, 5 endmembers      1.49                   1.08                   28

The new graphics card has a larger number of registers. ReInvMuad and ReInvMuae benefit from this, with a performance boost of around 80% as seen in table 4.1. The limiting factors for these two kernels were the reuse of computations, the temporary storage of variables and the register overhead, which is lessened on the new GPU. There are not many ways to optimize this further, as all the computations are needed. The only option is to rearrange the calculations and reuse as many of them as possible, but the major optimizations have already been made.
Performance was not found to increase with the number of threads per block. This is because the number of registers in use is still high compared to the number of registers available.

SCA, however, does not have the same performance boost. The amount of shared memory has not changed much for the newer GPU [2]. All of the arrays are stored in shared memory for faster memory access, but evidently this strategy does not scale. Storing the matrix product S^T S in shared memory is reasonable, as it is used often and across threads. Storing the fractions in shared memory is dubious. It is meant to reduce overhead across iterations, but the global memory latency it is supposed to hide is exchanged for a shared memory latency that cannot be hidden. Too much shared memory is used per block for the GPU to be able to swap one block for another in order to hide the memory latency. It is possible, on the other hand, that the GPU would be able to hide global memory latency through a higher thread occupancy once the shared memory strain is reduced. This is investigated in table 4.2.
Table 4.2: Comparison of running times for different variants of SCA, either allocating all arrays in shared memory or using the global memory. 20 bands and 7 endmembers.

Variant                       Running time per line (ms)   Theor. occupancy (%)   Occupancy (%)   Shared memory/block (kB)
w/ shared memory              1.83                         23.4                   11.1            12.9
w/ global memory              4.81                         93.8                   11.2            0.40
300 threads/block             4.82                         93.8                   15.6            0.40
800 threads/block             5.11                         78.1                   38.7            0.40
800 threads/block, 4 lines    1.53                         78.1                   44.6            0.20
Using less shared memory increases the occupancy, but the running time also increases, because the global memory is far slower. Increasing the number of threads per block does not readily decrease the running time. This is due to lower multiprocessor utilization, as the number of blocks no longer matches the number of multiprocessors. If multiple lines are inverted at once, so that the number of blocks matches the number of multiprocessors, the running time actually decreases compared to the version using shared memory.
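The theoretical occupancies in table 4.2 can be approximated with a simple resource calculation. The sketch below is not taken from the program. It assumes the per-multiprocessor limits usually quoted for a compute capability 3.0 device (64 resident warps, 16 resident blocks, 48 kB shared memory), and it ignores register limits and allocation granularity. The block size of 160 threads used for the first two variants is a hypothetical value chosen for illustration, since the source does not state it.

```python
WARP_SIZE = 32


def theoretical_occupancy(threads_per_block, shared_kb_per_block,
                          max_warps_per_sm=64, max_blocks_per_sm=16,
                          shared_kb_per_sm=48.0):
    """Percentage of the resident-warp limit that one multiprocessor can reach.

    The device limits are assumed values for compute capability 3.0.
    Register limits and allocation granularity are ignored.
    """
    # Number of warps per block, rounded up.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    # Resident blocks allowed by the warp limit.
    blocks_by_warps = max_warps_per_sm // warps_per_block
    # Resident blocks allowed by the shared memory limit.
    if shared_kb_per_block > 0:
        blocks_by_shared = int(shared_kb_per_sm // shared_kb_per_block)
    else:
        blocks_by_shared = max_blocks_per_sm
    blocks = min(blocks_by_warps, blocks_by_shared, max_blocks_per_sm)
    return 100.0 * blocks * warps_per_block / max_warps_per_sm


# (variant, threads per block, shared memory per block in kB)
variants = [
    ("w/ shared memory (160 threads assumed)", 160, 12.9),
    ("w/ global memory (160 threads assumed)", 160, 0.40),
    ("300 threads/block", 300, 0.40),
    ("800 threads/block", 800, 0.40),
]

for name, threads, shared_kb in variants:
    print(f"{name}: {theoretical_occupancy(threads, shared_kb):.1f} %")
```

Under these assumptions, the shared memory variant is limited to three resident blocks per multiprocessor by its 12.9 kB per block, whereas the global memory variants are limited only by the warp limit. The resulting values are consistent with the theoretical occupancies listed in table 4.2.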
The above results were obtained for compute capability 1.1. The newer GPU also supports compute capabilities up to 3.0. The running times for the different compute capabilities are shown in table 4.3. The most apparent result is that the running times are higher for compute capabilities 2.0 and above than for 1.x, which suggests that the thread and block distribution is no longer optimal.
Table 4.3: Comparison of running times for SCA (20 bands and 7 endmembers), ReIsoL2InvMuad (160 bands) and ReIsoL2InvertMuae (28 bands) for different compute capabilities.

sm     *InvMuad (ms)   *InvMuae (ms)   SCA (ms)
1.1    0.644           0.503           1.827
1.2    0.644           0.503           1.827
1.3    0.644           0.503           1.827
2.0    1.351           3.803           3.191
2.1    1.357           3.809           3.177
3.0    1.611           3.780           3.075
So far there has been no need for the features available in higher compute capabilities, and the program has been kept at 1.x even though the GPU in use supports a higher compute capability. A version of SCA using registers instead of shared memory was proposed. The result is shown in table 4.4.
Table 4.4: Comparison between SCA and SCAFast, 20 bands and 7 endmembers.

Function   Time (ms)   Shmem (kB)   Registers   Occ. (%)   Theor. occ. (%)
SCA        1.824       12.9         18          11.1       23.4
SCAFast    0.321       0.43         63          10.9       46.9
The register usage becomes very large with this change, comparable to the register usage of InvMuae. The performance boost, on the other hand, is huge, although the maintainability of the code becomes worse. Two different extremes are present: one of the methods uses an excessive amount of shared memory, while the other puts a huge strain on the registers. Straining the registers seems to result in a faster function than straining the shared memory. This will partially be due to the fact that registers are far faster than shared memory.
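The trade-off between the two variants can be put into numbers using the values in table 4.4. The sketch below is not taken from the program. It assumes a register file of 65536 registers and a limit of 64 resident warps per multiprocessor, as usually quoted for compute capability 3.0, and it ignores register allocation granularity. The function names are chosen here for illustration.

```python
WARP_SIZE = 32


def register_limited_warps(regs_per_thread, regs_per_sm=65536,
                           max_warps_per_sm=64):
    """Upper bound on resident warps per multiprocessor set by register usage.

    The device limits are assumed values for compute capability 3.0.
    """
    regs_per_warp = regs_per_thread * WARP_SIZE
    return min(max_warps_per_sm, regs_per_sm // regs_per_warp)


def speedup(old_ms, new_ms):
    """Ratio of the old running time to the new running time."""
    return old_ms / new_ms


# Values from table 4.4: SCA uses 18 registers, SCAFast uses 63.
for name, regs in [("SCA", 18), ("SCAFast", 63)]:
    print(f"{name}: at most {register_limited_warps(regs)} resident warps")

print(f"Speedup of SCAFast over SCA: {speedup(1.824, 0.321):.1f}x")
```

With 63 registers per thread the register file bounds SCAFast to half of the resident-warp limit, which is in line with its theoretical occupancy staying below 50% in table 4.4, while SCA with 18 registers is not register-bound under these assumptions.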
There is also some concern that the thread distribution might not be optimal. The performance boost of the monochromatic unmixing (MultVector()), for example, will be negligible, since the workload in each kernel is low and most of the computation time is spent on global memory access. This would scale better with the hardware if a higher number of threads per block were used, as was seen in table 4.2.
ISRA was found to use 2.5 s for the unmixing of one line at all three intervals, using 1000 iterations.