When working with a GPU, the first step is always to identify the GPU and set the device. Except on a single-GPU system with only one user, this ensures that a device is actually ready to execute kernels or perform memory transfers. To simplify the procedure we offer the simple tool DeviceSelector with the interface
DeviceSelector.h
class DeviceSelector {
public:
DeviceSelector() = delete;
static int setFirstAvailable();
};
Compiled with the nvcc compiler, it loops through all GPUs connected to the system and tries to occupy one of them. The return value is the index of the GPU that has been set successfully. If no device is available, it exits the program. For pure host compilers the function always returns zero.
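The selection loop described above can be sketched as follows. This is not the actual implementation of DeviceSelector: the function name setFirstAvailableSketch and the helper selectAndReport are hypothetical, and the use of cudaFree( nullptr ) to force context creation (so that a device occupied in exclusive compute mode is rejected immediately) is an assumption about how "occupying" a device might be tested.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Hypothetical stand-in for DeviceSelector::setFirstAvailable().
inline int setFirstAvailableSketch() {
#ifdef __CUDACC__
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) {
        deviceCount = 0;
    }
    for (int device = 0; device < deviceCount; ++device) {
        // Assumption: creating a context via cudaFree(nullptr) reveals
        // whether the device can really be used by this process.
        if (cudaSetDevice(device) == cudaSuccess &&
            cudaFree(nullptr) == cudaSuccess) {
            return device;
        }
        cudaGetLastError(); // reset the error state before the next attempt
    }
    std::fprintf(stderr, "No CUDA device available.\n");
    std::exit(EXIT_FAILURE);
#else
    return 0; // pure host build: there is no device to select
#endif
}

// Hypothetical usage: call once at program start, before any
// allocation or kernel launch.
inline int selectAndReport() {
    const int device = setFirstAvailableSketch();
    assert(device >= 0); // a negative index is never returned
    std::printf("Using device %d\n", device);
    return device;
}
```

With a pure host compiler the CUDA branch is compiled out, so the sketch returns zero as described in the text.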
To obtain more basic information about the device one may use the helper class DeviceInfo, which collects some data about the current device and can print it to std::cout. Its synopsis is
DeviceInfo.h
class DeviceInfo {
public:
static constexpr unsigned int hostBlockDimDefault = 1024;
static constexpr unsigned int hostWarpSize = 32;
CUDA_HOST DeviceInfo();
CUDA_HOST void update();
CUDA_HOST unsigned int getMaxBlockDim() const;
CUDA_HOST unsigned int getWarpSize() const;
CUDA_HOST void print() const;
};
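To illustrate what such a helper might collect, the following sketch queries the current device through the CUDA runtime and falls back to the two host constants otherwise. The class name DeviceInfoSketch, the private members, the printed labels and the helper roundUpToWarp are hypothetical; which properties the real DeviceInfo reads is an assumption.

```cpp
#include <cassert>
#include <iostream>
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Hypothetical stand-in; only the public names follow the synopsis above.
class DeviceInfoSketch {
public:
    static constexpr unsigned int hostBlockDimDefault = 1024;
    static constexpr unsigned int hostWarpSize = 32;

    DeviceInfoSketch() { update(); }

    // Re-reads the properties of the device that is current at call time.
    void update() {
#ifdef __CUDACC__
        int device = 0;
        cudaDeviceProp prop;
        if (cudaGetDevice(&device) == cudaSuccess &&
            cudaGetDeviceProperties(&prop, device) == cudaSuccess) {
            maxBlockDim_ = static_cast<unsigned int>(prop.maxThreadsPerBlock);
            warpSize_ = static_cast<unsigned int>(prop.warpSize);
            return;
        }
#endif
        // Host build or failed query: use the host defaults.
        maxBlockDim_ = hostBlockDimDefault;
        warpSize_ = hostWarpSize;
    }

    unsigned int getMaxBlockDim() const { return maxBlockDim_; }
    unsigned int getWarpSize() const { return warpSize_; }

    void print() const {
        std::cout << "max. block dimension: " << maxBlockDim_ << '\n'
                  << "warp size: " << warpSize_ << '\n';
    }

private:
    unsigned int maxBlockDim_ = hostBlockDimDefault;
    unsigned int warpSize_ = hostWarpSize;
};

// Hypothetical usage: round a requested thread count up to a full warp.
inline unsigned int roundUpToWarp(const DeviceInfoSketch& info,
                                  unsigned int threads) {
    const unsigned int warp = info.getWarpSize();
    assert(warp > 0);
    return ((threads + warp - 1) / warp) * warp;
}
```

A typical use would be to choose a block dimension that is a multiple of the warp size and does not exceed getMaxBlockDim().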
Perhaps the most difficult part of working with GPU code is proper memory handling. For this purpose we wrote the class SmartArray, which is our workhorse for the memory management of CUDA applications. Its public interface is
SmartArray.h
template<typename T, CudaHostDeviceType hostDeviceType = CudaHostDeviceType::HOST>
class SmartArray {
public:
using type = T;
static constexpr CudaHostDeviceType hostOrDevice = hostDeviceType;
friend class SmartArray<T, CudaHostDeviceType::HOST>;
#ifdef __CUDACC__
friend class SmartArray<T, CudaHostDeviceType::DEVICE>;
#endif // __CUDACC__
CUDA_HOST SmartArray();
CUDA_HOST explicit SmartArray( const unsigned int sizeIn );
template<class InputIter>
CUDA_HOST SmartArray( InputIter first, InputIter last );
CUDA_HOST SmartArray( const SmartArray<T, hostDeviceType>& orig );
template<CudaHostDeviceType SourceType>
CUDA_HOST SmartArray( const SmartArray<T, SourceType>& orig );
CUDA_HOST ~SmartArray();
CUDA_HOST SmartArray<T, hostDeviceType>& operator=( SmartArray<T, hostDeviceType> orig );
template<CudaHostDeviceType SourceType>
CUDA_HOST SmartArray<T, hostDeviceType>& operator=( const SmartArray<T, SourceType>& orig );
template<class Array>
CUDA_HOST SmartArray<T, hostDeviceType>& operator=( const Array& orig );
CUDA_HOST_DEVICE T& operator[]( const unsigned int index );
CUDA_HOST_DEVICE T& at( const unsigned int index );
CUDA_HOST_DEVICE const T& operator[]( const unsigned int index ) const;
CUDA_HOST_DEVICE const T& at( const unsigned int index ) const;
CUDA_HOST T getToHost( const unsigned int index ) const;
CUDA_HOST T getToHostWithIndexCheck( const unsigned int index ) const;
CUDA_HOST void newArray( const unsigned int newSize );
template<CudaHostDeviceType SourceType, class Array>
CUDA_HOST void newAssign( const Array& orig );
CUDA_HOST const T* data() const;
CUDA_HOST_DEVICE unsigned int size() const;
CUDA_HOST unsigned int useCount() const;
void swap( SmartArray<T, hostDeviceType>& orig );
};
It is evidently a mixture of a container class and a smart pointer. In principle it shares the basic idea of std::shared_ptr, being a reference-counting object. All instantiated objects live on the host, but manage memory on the host or the device depending on their second template argument hostDeviceType. Copy construction and assignment are extremely cheap if both objects have exactly the same type, because the reference-counting approach does not perform a deep copy. Copy construction or assignment between two SmartArrays with different CudaHostDeviceTypes creates a completely new object, copying the managed data from the host to the device or vice versa. To force such a deep copy when both objects are of the same type, one can use the function newAssign. It is important, however, that the template argument Array is a type that manages contiguous memory; otherwise the behavior is undefined. The same is true for the corresponding assignment operator and (of course) the constructor taking iterators. To check the number of objects referring to the same data, use the function useCount.
The part that is more related to a container class shows in typical member functions like the subscript operators, at, and size. They give access to the managed memory on the host or device, depending on the CudaHostDeviceType. For device arrays the member functions getToHost and getToHostWithIndexCheck additionally copy single elements back to the host. However, this is quite inefficient and should be used only if absolutely necessary. The index-checking functions print a warning via printf, because throwing an exception is not feasible on the device.
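The difference between the cheap reference-counted copy and the deep copy of newAssign can be illustrated with a host-only sketch built on std::shared_ptr. The class HostArraySketch and the function demoSharing are hypothetical and do not reproduce the implementation of SmartArray; in particular, nothing here touches device memory.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical host-only stand-in for a reference-counted array.
template <typename T>
class HostArraySketch {
public:
    HostArraySketch() : data_(std::make_shared<std::vector<T>>()) {}
    explicit HostArraySketch(unsigned int size)
        : data_(std::make_shared<std::vector<T>>(size)) {}

    // Copy construction and assignment are compiler-generated: they copy
    // only the shared_ptr, so both objects refer to the same memory.

    // Deep copy from any type with contiguous storage, data() and size().
    template <class Array>
    void newAssign(const Array& orig) {
        data_ = std::make_shared<std::vector<T>>(orig.data(),
                                                 orig.data() + orig.size());
    }

    T& operator[](unsigned int index) { return (*data_)[index]; }
    const T& operator[](unsigned int index) const { return (*data_)[index]; }

    const T* data() const { return data_->data(); }
    unsigned int size() const {
        return static_cast<unsigned int>(data_->size());
    }
    unsigned int useCount() const {
        return static_cast<unsigned int>(data_.use_count());
    }

private:
    std::shared_ptr<std::vector<T>> data_;
};

inline void demoSharing() {
    HostArraySketch<int> a(3);
    a[0] = 7;

    HostArraySketch<int> b = a; // shallow: b refers to the same memory
    b[0] = 42;
    assert(a[0] == 42);
    assert(a.useCount() == 2);

    HostArraySketch<int> c;
    c.newAssign(a);             // deep: c owns a fresh copy
    c[0] = 1;
    assert(a[0] == 42);
    assert(c.useCount() == 1);
}
```

Writing through b changes what a sees, whereas c is independent after newAssign; this is the behavior the text describes for objects of the same type.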
To make a full C++ feeling possible and avoid raw pointers everywhere, we also needed to write a stack object for arrays, similar to std::array, since SmartArray is a heap object like std::vector. We decided, however, to alter the interface a bit; it now reads
Array.h
template<typename T, unsigned int size>
class Array {
public:
using value_type = T;
static constexpr unsigned int SIZE = size;
CUDA_HOST_DEVICE Array();
explicit CUDA_HOST_DEVICE Array( const T& value );
CUDA_HOST_DEVICE Array( const T (&data_in)[size] );
CUDA_HOST_DEVICE void set( const unsigned int index, const T& value );
CUDA_HOST_DEVICE void setAt( const unsigned int index, const T& value );
CUDA_HOST_DEVICE void set( const T& value );
CUDA_HOST_DEVICE void set( const T (&data_in)[size] );
CUDA_HOST_DEVICE T& operator[]( const unsigned int index );
CUDA_HOST_DEVICE T& at( const unsigned int index );
CUDA_HOST_DEVICE const T& operator[]( const unsigned int index ) const;
CUDA_HOST_DEVICE const T& at( const unsigned int index ) const;
};
The obvious differences from std::array are the functions set and setAt and the additional constructors. They all aim at more convenient construction and modification of the data stored in the object. The function taking only a value assigns that value to all elements at once. The functions taking a built-in array are mainly there to allow behavior similar to std::initializer_list. The functions taking an index as well as a value are redundant given the standard access functions, but are provided because they complete the interface alongside the other set functions. Like the subscript operator and the at function, which give access to the managed data, set and setAt access the data without and with an index check, respectively.
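A minimal sketch of such a stack array may help to see how the set overloads interact. The names ArraySketch and SKETCH_HOST_DEVICE are hypothetical, the macro definition is an assumption about how CUDA_HOST_DEVICE might be defined, and the choice that setAt leaves the data untouched after printing its warning is likewise an assumption.

```cpp
#include <cassert>
#include <stdio.h>

// Assumption: the qualifier macro expands to nothing for host compilers.
#ifdef __CUDACC__
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// Hypothetical stand-in for the Array class described above.
template <typename T, unsigned int size>
class ArraySketch {
public:
    SKETCH_HOST_DEVICE ArraySketch() : data_{} {}
    explicit SKETCH_HOST_DEVICE ArraySketch(const T& value) { set(value); }
    SKETCH_HOST_DEVICE ArraySketch(const T (&dataIn)[size]) { set(dataIn); }

    // Assign one value to all elements.
    SKETCH_HOST_DEVICE void set(const T& value) {
        for (unsigned int i = 0; i < size; ++i) data_[i] = value;
    }
    // Copy all elements from a built-in array of matching length.
    SKETCH_HOST_DEVICE void set(const T (&dataIn)[size]) {
        for (unsigned int i = 0; i < size; ++i) data_[i] = dataIn[i];
    }
    // Unchecked single-element assignment.
    SKETCH_HOST_DEVICE void set(unsigned int index, const T& value) {
        data_[index] = value;
    }
    // Checked single-element assignment: warns instead of throwing.
    SKETCH_HOST_DEVICE void setAt(unsigned int index, const T& value) {
        if (index < size) {
            data_[index] = value;
        } else {
            printf("Warning: index %u out of range (size %u)\n", index, size);
        }
    }

    SKETCH_HOST_DEVICE T& operator[](unsigned int index) { return data_[index]; }
    SKETCH_HOST_DEVICE const T& operator[](unsigned int index) const {
        return data_[index];
    }

private:
    T data_[size];
};

inline void demoArraySketch() {
    ArraySketch<int, 3> a(5);  // all elements equal to 5
    assert(a[0] == 5 && a[1] == 5 && a[2] == 5);

    a.set({1, 2, 3});          // braces bind to the built-in array overload
    assert(a[0] == 1 && a[1] == 2 && a[2] == 3);

    a.setAt(7, 9);             // out of range: prints a warning only
    assert(a[2] == 3);
}
```

Because the overload takes a reference to a built-in array of exactly size elements, a braced list of the wrong length is rejected at compile time.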
Finally we missed also std::pair on the device, and thus defined the simple struct
Pair.h
template<typename T1, typename T2>
struct Pair {
using first_type = T1;
using second_type = T2;
T1 first;
T2 second;
CUDA_HOST_DEVICE Pair();
CUDA_HOST_DEVICE Pair( T1 firstInput, T2 secondInput );
};
We note that std::pair is sufficient to access the members (and of course the types) as long as no member functions are used on the device. However, this excludes constructors and assignment operators (except the copy constructor during the kernel call). To perform these quite desirable operations on the device as well, one has to use Pair.
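The point about constructors can be made concrete with a small sketch. PairSketch, SKETCH_HOST_DEVICE and makeIndexedValue are hypothetical names, and the macro definition is an assumption; the value-initializing default constructor is also an assumption, since the text does not say what Pair() does.

```cpp
#include <cassert>

// Assumption: the qualifier macro expands to nothing for host compilers.
#ifdef __CUDACC__
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// Hypothetical stand-in for Pair.
template <typename T1, typename T2>
struct PairSketch {
    using first_type = T1;
    using second_type = T2;

    T1 first;
    T2 second;

    SKETCH_HOST_DEVICE PairSketch() : first(), second() {}
    SKETCH_HOST_DEVICE PairSketch(T1 firstInput, T2 secondInput)
        : first(firstInput), second(secondInput) {}
};

// Because both constructors carry the qualifier, a function like this one
// may also be called from device code when compiled with nvcc.
SKETCH_HOST_DEVICE inline PairSketch<int, double> makeIndexedValue(int index,
                                                                  double value) {
    return PairSketch<int, double>(index, value);
}
```

The implicitly generated copy operations of such a struct are usable on both sides, so only the user-provided constructors need the qualifier.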
For the host only, we provide a simple tool to measure execution times, which has also been used for all measurements in chapter 2. The synopsis is
StopWatch.h
template<class stdChronoDuration>
class StopWatch {
public:
using clock = std::chrono::steady_clock;
StopWatch();
~StopWatch();
};
and it works quite simply: the constructor starts the time measurement, and the destructor stops it and prints the elapsed time to std::cout. Hence, to measure the run time of a scope one simply has to instantiate one StopWatch with the desired precision as template argument.
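The scope-based measurement can be sketched as follows. StopWatchSketch and timeSmallLoop are hypothetical names; the optional std::ostream parameter and the output format are assumptions added here so that the printed text can be captured, whereas the class described above always prints to std::cout.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for StopWatch.
template <class Duration>
class StopWatchSketch {
public:
    using clock = std::chrono::steady_clock;

    // The measurement starts on construction.
    explicit StopWatchSketch(std::ostream& out = std::cout)
        : out_(out), start_(clock::now()) {}

    // The measurement stops on destruction and the result is printed.
    ~StopWatchSketch() {
        const auto elapsed =
            std::chrono::duration_cast<Duration>(clock::now() - start_);
        out_ << "elapsed: " << elapsed.count() << " ticks\n";
    }

    StopWatchSketch(const StopWatchSketch&) = delete;
    StopWatchSketch& operator=(const StopWatchSketch&) = delete;

private:
    std::ostream& out_;
    clock::time_point start_;
};

// Times a small loop and returns what the stop watch printed.
inline std::string timeSmallLoop() {
    std::ostringstream log;
    {
        StopWatchSketch<std::chrono::microseconds> watch(log);
        volatile unsigned long sum = 0;
        for (unsigned long i = 0; i < 100000; ++i) {
            sum = sum + i;
        }
    } // the destructor of watch runs here and writes to log
    return log.str();
}
```

The reported count is given in units of the chosen duration type, here microseconds.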