Back-ends
Accelerator Implementations
The following table shows which native implementation or API is used to provide each alpaka functionality.
| alpaka | Serial | std::thread | OpenMP 2.0 | OpenMP 4.0 | CUDA 9.0+ |
|---|---|---|---|---|---|
| Devices | Host Core | Host Cores | Host Cores | Host Cores | NVIDIA GPUs |
| Lib/API | standard C++ | std::thread | OpenMP 2.0 | OpenMP 4.0 | CUDA 9.0+ |
| Kernel execution | sequential | std::thread(kernel) | omp_set_dynamic(0), #pragma omp parallel num_threads(iNumKernelsInBlock) | #pragma omp target, #pragma omp teams num_teams(…) thread_limit(…), #pragma omp distribute, #pragma omp parallel num_threads(…) | cudaConfigureCall, cudaSetupArgument, cudaLaunch |
| Execution strategy grid-blocks | sequential | sequential | sequential | undefined | undefined |
| Execution strategy block-kernels | sequential | preemptive multitasking | preemptive multitasking | preemptive multitasking | lock-step within warps |
| getIdx | emulated | block-kernel: mapping of std::this_thread::get_id(); grid-block: member variable | block-kernel: omp_get_thread_num() to 3D index mapping; grid-block: member variable | block-kernel: omp_get_thread_num() to 3D index mapping; grid-block: member variable | threadIdx, blockIdx |
| getExtents | member variables | member variables | member variables | member variables | gridDim, blockDim |
| getBlockSharedMemDynSizeBytes | allocated in memory prior to kernel execution | allocated in memory prior to kernel execution | allocated in memory prior to kernel execution | allocated in memory prior to kernel execution | `__shared__` |
| allocBlockSharedMem | master thread allocates | syncBlockKernels -> master thread allocates -> syncBlockKernels | syncBlockKernels -> master thread allocates -> syncBlockKernels | syncBlockKernels -> master thread allocates -> syncBlockKernels | `__shared__` |
| syncBlockKernels | not required | barrier | #pragma omp barrier | #pragma omp barrier | `__syncthreads` |
| atomicOp | hierarchy dependent | std::lock_guard< std::mutex > | #pragma omp critical | #pragma omp critical | atomicXXX |
| ALPAKA_FN_HOST_ACC, ALPAKA_FN_ACC, ALPAKA_FN_HOST | inline | inline | inline | inline | `__device__`, `__host__`, `__forceinline__` |
Serial
The serial accelerator only allows blocks with exactly one thread. Therefore it does not implement real synchronization or atomic primitives.
Threads
Execution
To prevent recreating the threads between the execution of different blocks in the grid, the threads are stored inside a thread pool. This thread pool is local to the invocation, because making it local to the KernelExecutor could lead to heavy memory usage and many idling kernel threads when multiple KernelExecutors are around. Because the default policy of the threads in the pool is to yield instead of to wait, this would also slow down the system immensely.
OpenMP
Execution
Parallel execution of the kernels in a block is required because, when syncBlockThreads is called, all of them have to have finished their work up to this line. So we have to spawn one real thread per kernel in a block. `omp for` is not useful because it is meant for cases where multiple iterations are executed by one thread, but in our case a 1:1 mapping is required. Therefore we use `omp parallel` with the specified number of threads in a block. Another reason for not using `omp for`, e.g. `#pragma omp parallel for collapse(3) num_threads(blockDim.x*blockDim.y*blockDim.z)`, is that `#pragma omp barrier`, which is used for intra-block synchronization, is not allowed inside `omp for` blocks.
Because OpenMP is designed for a 1:1 abstraction of hardware to software threads, the block size is restricted by the number of OpenMP threads allowed by the runtime. This could be as few as 2 or 4 kernels, but on a system with 4 cores and hyper-threading OpenMP can also allow 64 threads.
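A minimal sketch of this execution pattern (not the actual alpaka implementation; `runBlock` and `blockThreadCount` are illustrative names only):

```cpp
#include <omp.h>

// Sketch: one OpenMP thread per block-thread, with an explicit barrier for
// block synchronization. omp for could not provide this 1:1 mapping.
void runBlock(int blockThreadCount)
{
    omp_set_dynamic(0); // do not let the runtime shrink the thread count
    #pragma omp parallel num_threads(blockThreadCount)
    {
        // first part of the user kernel ...

        #pragma omp barrier // corresponds to syncBlockThreads

        // ... second part of the user kernel
    }
}
```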
Index
OpenMP only provides a linear thread index. This index is converted to a 3-dimensional index at runtime.
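A sketch of such a conversion, assuming [z][y][x] ordering with x as the fastest-running dimension (`linearTo3D` and `extent` are illustrative names, not the alpaka implementation):

```cpp
#include <array>
#include <cstddef>

// Sketch: convert a linear thread index into a 3-dimensional index.
std::array<std::size_t, 3> linearTo3D(std::size_t linearIdx, std::array<std::size_t, 3> const& extent)
{
    std::size_t const x = linearIdx % extent[2];
    std::size_t const y = (linearIdx / extent[2]) % extent[1];
    std::size_t const z = linearIdx / (extent[2] * extent[1]);
    return {z, y, x}; // [z][y][x]
}
```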
Atomic
We cannot use `#pragma omp atomic` because braces or calls to other functions directly after `#pragma omp atomic` are not allowed. Because we are implementing the CUDA atomic operations, which return the old value, `#pragma omp critical` has to be used instead. `omp_set_lock` is an alternative but is usually slower.
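A sketch of the resulting pattern (a hypothetical helper, not the alpaka implementation): returning the old value forces the read and the update into one critical section.

```cpp
// Sketch: a CUDA-style atomicAdd that returns the old value,
// implemented with an OpenMP critical section.
int atomicAddOldValue(int* address, int value)
{
    int old;
    #pragma omp critical
    {
        old = *address;          // read the old value ...
        *address = old + value;  // ... and update it within the same critical section
    }
    return old;
}
```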
CUDA
Nearly all CUDA functionality can be directly mapped to alpaka function calls. A major difference is that CUDA requires the block and grid sizes to be given in (x, y, z) order, whereas alpaka uses the mathematical C/C++ array indexing scheme [z][y][x]. In both cases x is the innermost / fastest-running index.
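For example, a CUDA `dim3{4, 2, 1}` block extent corresponds to an alpaka vector written in [z][y][x] order; a hedged sketch using `alpaka::Vec`:

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

// Sketch: the same block extent expressed in CUDA (x, y, z) order and in
// alpaka's [z][y][x] order. x (= 4) is the fastest-running index in both.
using Dim = alpaka::DimInt<3u>;
using Idx = std::size_t;

// CUDA:   dim3 blockExtent{4, 2, 1};                    // (x, y, z)
auto const blockExtent = alpaka::Vec<Dim, Idx>{1, 2, 4}; // [z][y][x]
```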
Furthermore, alpaka does not require the indices and extents to be 3-dimensional. The accelerators are templated on the dimensionality and support arbitrary dimensions. NOTE: Currently the CUDA implementation is restricted to a maximum of 3 dimensions!
NOTE: You have to be careful when mixing alpaka and non-alpaka CUDA code. The CUDA-accelerator back-end can change the current CUDA device and will NOT set the device back to the one that was active prior to the invocation of the alpaka function.
Function Attributes
Depending on the cmake argument `ALPAKA_ACC_GPU_CUDA_ONLY_MODE`, the function attributes are defined differently.
ALPAKA_ACC_GPU_CUDA_ONLY_MODE=OFF (default)
| CUDA | alpaka |
|---|---|
| `__host__` | `ALPAKA_FN_HOST` |
| `__device__` | – |
| `__global__` | – |
| `__host__ __device__` | `ALPAKA_FN_HOST_ACC`, `ALPAKA_FN_ACC` |
ALPAKA_ACC_GPU_CUDA_ONLY_MODE=ON
| CUDA | alpaka |
|---|---|
| `__host__` | `ALPAKA_FN_HOST` |
| `__device__` | `ALPAKA_FN_ACC` |
| `__global__` | – |
| `__host__ __device__` | `ALPAKA_FN_HOST_ACC` |
Note
There is no alpaka equivalent to `__global__` because the design of alpaka does not allow it. When running an alpaka kernel, alpaka creates a `__global__` kernel that performs some setup, such as creating the acc object, and then runs the user kernel, which must be a CUDA `__device__` function.
Note
You cannot call CUDA-only methods, except when `ALPAKA_ACC_GPU_CUDA_ONLY_MODE` is enabled.
Note
When calling a constexpr function from inside a device function, also mark the called function as a device function, e.g. by prepending `ALPAKA_FN_ACC`.
Note that some compilers do this by default, but not all. For details please refer to #1580 .
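A minimal sketch of this rule (the helper `square` and the kernel are illustrative only):

```cpp
// Sketch: a constexpr helper marked as a device function so that it can be
// called from inside a kernel on all compilers.
ALPAKA_FN_ACC constexpr int square(int x)
{
    return x * x;
}

struct SquareKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc) const
    {
        int const squared = square(2); // OK: square is usable in device code
        static_cast<void>(squared);
        static_cast<void>(acc);
    }
};
```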
Memory
| CUDA | alpaka |
|---|---|
| `__shared__` | `alpaka::declareSharedVar<T, __COUNTER__>(acc)` |
| `__constant__` | `ALPAKA_STATIC_ACC_MEM_CONSTANT` |
| `__device__` | `ALPAKA_STATIC_ACC_MEM_GLOBAL` |
-
alpaka::declareSharedVar
Declare a block shared variable.
The variable is uninitialized and not default constructed! The variable can be accessed by all threads within a block. Access to the variable is not thread-safe.
- Template Parameters:
T – The element type.
TuniqueId – An id that is unique inside a kernel.
TBlockSharedMemSt – The block shared allocator implementation type.
- Parameters:
blockSharedMemSt – The block shared allocator implementation.
- Returns:
An uninitialized variable stored in shared memory.
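A hedged usage sketch (the kernel and variable names are illustrative):

```cpp
struct SharedMemKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc) const
    {
        // One int shared by all threads of the block; uninitialized,
        // roughly equivalent to "__shared__ int blockSum;" in CUDA.
        auto& blockSum = alpaka::declareSharedVar<int, __COUNTER__>(acc);

        if(alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc)[0] == 0)
            blockSum = 0;
        alpaka::syncBlockThreads(acc); // make the initialization visible to all block threads
    }
};
```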
-
ALPAKA_STATIC_ACC_MEM_CONSTANT
This macro defines a variable lying in constant accelerator device memory.
Example: ALPAKA_STATIC_ACC_MEM_CONSTANT alpaka::DevGlobal<TAcc, const int> variable;
Those variables behave like ordinary variables when used in file scope, but inside kernels the get() method must be used to access the variable. They are declared inline to resolve to a single instance across multiple translation units. Like ordinary variables, only one definition is allowed (ODR); violating this might lead to linker errors.
In contrast to ordinary variables, you cannot define such variables as static compilation-unit-local variables with internal linkage, because this is forbidden by CUDA.
- Attention
It is not allowed to initialize the variable together with the declaration. To initialize the variable alpaka::memcpy must be used.
```cpp
ALPAKA_STATIC_ACC_MEM_CONSTANT alpaka::DevGlobal<TAcc, const int> foo;

struct DeviceMemoryKernel
{
    ALPAKA_NO_HOST_ACC_WARNING
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc) const
    {
        auto a = foo<TAcc>.get();
    }
};

void initFoo()
{
    auto extent = alpaka::Vec<alpaka::DimInt<1u>, size_t>{1};
    int initialValue = 42;
    alpaka::ViewPlainPtr<DevHost, int, alpaka::DimInt<1u>, size_t> bufHost(&initialValue, devHost, extent);
    alpaka::memcpy(queue, foo<Acc>, bufHost, extent);
}
```
-
ALPAKA_STATIC_ACC_MEM_GLOBAL
This macro defines a variable lying in global accelerator device memory.
Example: ALPAKA_STATIC_ACC_MEM_GLOBAL alpaka::DevGlobal<TAcc, int> variable;
Those variables behave like ordinary variables when used in file scope, but inside kernels the get() method must be used to access the variable. They are declared inline to resolve to a single instance across multiple translation units. Like ordinary variables, only one definition is allowed (ODR); violating this might lead to linker errors.
In contrast to ordinary variables, you cannot define such variables as static compilation-unit-local variables with internal linkage, because this is forbidden by CUDA.
- Attention
It is not allowed to initialize the variable together with the declaration. To initialize the variable alpaka::memcpy must be used.
```cpp
ALPAKA_STATIC_ACC_MEM_GLOBAL alpaka::DevGlobal<TAcc, int> foo;

struct DeviceMemoryKernel
{
    ALPAKA_NO_HOST_ACC_WARNING
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc) const
    {
        auto a = foo<TAcc>.get();
    }
};

void initFoo()
{
    auto extent = alpaka::Vec<alpaka::DimInt<1u>, size_t>{1};
    int initialValue = 42;
    alpaka::ViewPlainPtr<DevHost, int, alpaka::DimInt<1u>, size_t> bufHost(&initialValue, devHost, extent);
    alpaka::memcpy(queue, foo<Acc>, bufHost, extent);
}
```
Index / Work Division
| CUDA | alpaka |
|---|---|
| `threadIdx` | `alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc)` |
| `blockIdx` | `alpaka::getIdx<alpaka::Grid, alpaka::Blocks>(acc)` |
| `blockDim` | `alpaka::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)` |
| `gridDim` | `alpaka::getWorkDiv<alpaka::Grid, alpaka::Blocks>(acc)` |
| `warpSize` | `alpaka::warp::getSize(acc)` |
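A hedged sketch combining these calls to compute the CUDA-style global thread index `blockIdx.x * blockDim.x + threadIdx.x` (the kernel name is illustrative):

```cpp
struct GlobalIdxKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc) const
    {
        auto const blockIdxInGrid   = alpaka::getIdx<alpaka::Grid, alpaka::Blocks>(acc)[0];
        auto const threadIdxInBlock = alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc)[0];
        auto const threadsPerBlock  = alpaka::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)[0];

        auto const globalThreadIdx = blockIdxInGrid * threadsPerBlock + threadIdxInBlock;
        static_cast<void>(globalThreadIdx);
    }
};
```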
Types
| CUDA | alpaka |
|---|---|
| `dim3` | `alpaka::Vec<TDim, TIdx>` |
CUDA Runtime API
The following tables list the functions available in the CUDA Runtime API and their equivalent alpaka functions:
Device Management
Error Handling
| CUDA | alpaka |
|---|---|
| cudaGetErrorName | n/a (handled internally, available in exception message) |
| cudaGetErrorString | n/a (handled internally, available in exception message) |
| cudaGetLastError | n/a (handled internally) |
| cudaPeekAtLastError | n/a (handled internally) |
Queue Management
| CUDA | alpaka |
|---|---|
| cudaLaunchHostFunc, cudaStreamAddCallback | alpaka::enqueue(queue, [](){dosomething();}) |
| cudaStreamAttachMemAsync | – |
| cudaStreamCreate | alpaka::Queue<TAcc, TProperty> queue(dev); (constructor) |
| cudaStreamCreateWithFlags | see cudaStreamCreate (cudaStreamNonBlocking hard coded) |
| cudaStreamCreateWithPriority | – |
| cudaStreamDestroy | n/a (Destructor) |
| cudaStreamGetFlags | – |
| cudaStreamGetPriority | – |
| cudaStreamQuery | bool alpaka::empty(queue) |
| cudaStreamSynchronize | void alpaka::wait(queue) |
| cudaStreamWaitEvent | void alpaka::wait(queue, event) |
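A hedged sketch of the queue workflow, assuming an accelerator type `Acc` and its device `dev` have already been created:

```cpp
// Sketch: create a non-blocking queue, enqueue a host callback and wait for completion.
auto queue = alpaka::Queue<Acc, alpaka::NonBlocking>{dev}; // ~ cudaStreamCreate
alpaka::enqueue(queue, [](){ /* host-side callback */ });  // ~ cudaLaunchHostFunc
bool const finished = alpaka::empty(queue);                // ~ cudaStreamQuery
alpaka::wait(queue);                                       // ~ cudaStreamSynchronize
```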
Event Management
| CUDA | alpaka |
|---|---|
| cudaEventCreate | alpaka::Event<TQueue> event(dev); |
| cudaEventCreateWithFlags | – |
| cudaEventDestroy | n/a (Destructor) |
| cudaEventElapsedTime | – |
| cudaEventQuery | bool alpaka::isComplete(event) |
| cudaEventRecord | void alpaka::enqueue(queue, event) |
| cudaEventSynchronize | void alpaka::wait(event) |
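A hedged sketch, assuming `queue` and `dev` from the previous example:

```cpp
// Sketch: record an event in a queue and synchronize on it.
alpaka::Event<decltype(queue)> event(dev);   // ~ cudaEventCreate
alpaka::enqueue(queue, event);               // ~ cudaEventRecord
bool const done = alpaka::isComplete(event); // ~ cudaEventQuery
alpaka::wait(event);                         // ~ cudaEventSynchronize
```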
Memory Management
| CUDA | alpaka |
|---|---|
| cudaArrayGetInfo | – |
| cudaFree | n/a (automatic memory management with reference counted memory handles) |
| cudaFreeArray | – |
| cudaFreeAsync | n/a (automatic memory management with reference counted memory handles) |
| cudaFreeHost | n/a (automatic memory management with reference counted memory handles) |
| cudaFreeMipmappedArray | – |
| cudaGetMipmappedArrayLevel | – |
| cudaGetSymbolAddress | – |
| cudaGetSymbolSize | – |
| cudaHostAlloc | alpaka::allocMappedBuf<TElement, TIdx>(host, platform, extents) (1D, 2D, 3D supported) [1] |
| cudaHostGetDevicePointer | – |
| cudaHostGetFlags | – |
| cudaHostRegister | – |
| cudaHostUnregister | – |
| cudaMalloc | alpaka::allocBuf<TElement>(device, extents1D) |
| cudaMalloc3D | alpaka::allocBuf<TElement>(device, extents3D) |
| cudaMalloc3DArray | – |
| cudaMallocArray | – |
| cudaMallocAsync | alpaka::allocAsyncBuf<TElement>(queue, extents1D) |
| cudaMallocHost | alpaka::allocMappedBuf<TElement, TIdx>(host, platform, extents) (1D, 2D, 3D supported) [1] |
| cudaMallocManaged | – |
| cudaMallocMipmappedArray | – |
| cudaMallocPitch | alpaka::allocBuf<TElement>(device, extents2D) |
| cudaMemAdvise | – |
| cudaMemGetInfo | alpaka::getMemBytes(dev), alpaka::getFreeMemBytes(dev) |
| cudaMemPrefetchAsync | – |
| cudaMemRangeGetAttribute | – |
| cudaMemRangeGetAttributes | – |
| cudaMemcpy | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D) |
| cudaMemcpy2D | alpaka::memcpy(queue, memBufDst, memBufSrc, extents2D) |
| cudaMemcpy2DArrayToArray | – |
| cudaMemcpy2DAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents2D) |
| cudaMemcpy2DFromArray | – |
| cudaMemcpy2DFromArrayAsync | – |
| cudaMemcpy2DToArray | – |
| cudaMemcpy2DToArrayAsync | – |
| cudaMemcpy3D | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D) |
| cudaMemcpy3DAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D) |
| cudaMemcpy3DPeer | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D) |
| cudaMemcpy3DPeerAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D) |
| cudaMemcpyArrayToArray | – |
| cudaMemcpyAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D) |
| cudaMemcpyFromArray | – |
| cudaMemcpyFromArrayAsync | – |
| cudaMemcpyFromSymbol | – |
| cudaMemcpyFromSymbolAsync | – |
| cudaMemcpyPeer | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D) |
| cudaMemcpyPeerAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D) |
| cudaMemcpyToArray | – |
| cudaMemcpyToArrayAsync | – |
| cudaMemcpyToSymbol | – |
| cudaMemcpyToSymbolAsync | – |
| cudaMemset | alpaka::memset(queue, memBufDst, byte, extents1D) |
| cudaMemset2D | alpaka::memset(queue, memBufDst, byte, extents2D) |
| cudaMemset2DAsync | alpaka::memset(queue, memBufDst, byte, extents2D) |
| cudaMemset3D | alpaka::memset(queue, memBufDst, byte, extents3D) |
| cudaMemset3DAsync | alpaka::memset(queue, memBufDst, byte, extents3D) |
| cudaMemsetAsync | alpaka::memset(queue, memBufDst, byte, extents1D) |
| make_cudaExtent | – |
| make_cudaPitchedPtr | – |
| make_cudaPos | – |
| cudaMemcpyHostToDevice | n/a (direction of copy is determined automatically) |
| cudaMemcpyDeviceToHost | n/a (direction of copy is determined automatically) |
[1] Not every platform supports mapped buffers, so alpaka::allocMappedBufIfSupported<TElement, TIdx>(host, platform, extents) should be used instead to support these platforms as well.
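A hedged sketch tying several of these mappings together, assuming an accelerator type `Acc`, its device `dev`, the host device `devHost` and a `queue` have already been created as in the alpaka examples:

```cpp
using Idx = std::size_t;
using Dim = alpaka::DimInt<1u>;
auto const extent = alpaka::Vec<Dim, Idx>{1024};

auto devBuf  = alpaka::allocBuf<float, Idx>(dev, extent);     // ~ cudaMalloc
auto hostBuf = alpaka::allocBuf<float, Idx>(devHost, extent); // host allocation (not pinned)

alpaka::memset(queue, devBuf, 0, extent);       // ~ cudaMemsetAsync
alpaka::memcpy(queue, hostBuf, devBuf, extent); // ~ cudaMemcpyAsync (direction deduced)
alpaka::wait(queue);                            // ~ cudaStreamSynchronize
// No cudaFree needed: buffers are reference counted and freed automatically.
```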
Execution Control
| CUDA | alpaka |
|---|---|
| cudaFuncGetAttributes | – |
| cudaFuncSetCacheConfig | – |
| cudaFuncSetSharedMemConfig | – |
| cudaLaunchKernel | alpaka::exec<TAcc>(queue, workDiv, kernel, params…) |
| cudaSetDoubleForDevice | n/a (alpaka assumes double support) |
| cudaSetDoubleForHost | n/a (alpaka assumes double support) |
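A hedged sketch of a kernel launch with `alpaka::exec`, the rough equivalent of `cudaLaunchKernel` / the `<<<...>>>` syntax (assumes `Acc`, `dev`, `queue` and a kernel functor `MyKernel` exist; the work division values are arbitrary):

```cpp
using Dim = alpaka::DimInt<1u>;
using Idx = std::size_t;

auto const workDiv = alpaka::WorkDivMembers<Dim, Idx>{
    alpaka::Vec<Dim, Idx>{16},   // blocks per grid
    alpaka::Vec<Dim, Idx>{128},  // threads per block
    alpaka::Vec<Dim, Idx>{1}};   // elements per thread

alpaka::exec<Acc>(queue, workDiv, MyKernel{}, 42 /* kernel arguments */);
```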
Occupancy
| CUDA | alpaka |
|---|---|
| cudaOccupancyMaxActiveBlocksPerMultiprocessor | – |
| cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags | – |
Unified Addressing
| CUDA | alpaka |
|---|---|
| cudaPointerGetAttributes | – |
Peer Device Memory Access
| CUDA | alpaka |
|---|---|
| cudaDeviceCanAccessPeer | – |
| cudaDeviceDisablePeerAccess | – |
| cudaDeviceEnablePeerAccess | automatically done when required |
OpenGL, Direct3D, VDPAU, EGL, Graphics Interoperability
not available
Texture/Surface Reference/Object Management
not available
Version Management
not available
HIP
Warning
The HIP documentation is outdated and needs to be reworked.
Current Restrictions on HCC platform
- Workaround for unsupported `syncthreads_{count|and|or}`: uses a temporary shared value and atomics.
- Workaround for buggy `hipStreamQuery` and `hipStreamSynchronize`: introduces its own queue management; `hipStreamQuery` and `hipStreamSynchronize` do not work in a multithreaded environment.
- Workaround for missing `cuStreamWaitValue32`: polls the value every 10 ms.
- Device constant memory is not supported yet.
- Note that `printf` in kernels is still not supported in HIP.
- Exclude `hipMalloc3D` and `hipMallocPitch` when the size is zero, otherwise they throw an Unknown Error.
- `TestAccs` excludes the 3D specialization of the HIP back-end for now, because `verifyBytesSet` fails in `memView` for the 3D specialization.
- The `dim3` structure is not available on the device (use `alpaka::Vec` instead).
- Constructors' attributes are unified with destructors'. The host/device signature must match in HIP(HCC). A chain of functions must also provide correct host-device signatures, e.g. a host function cannot be called from a host-device function.
- Recompile your target when the HCC linker returns the error: "File format not recognized. clang-7: error: linker command failed with exit code 1". If a compile error occurred, the linker may still link, but without the device code.
- The AMD device architecture is currently hardcoded in `alpakaConfig.cmake`.
Compiling HIP from Source
Follow the HIP installation guide for installing HIP. HIP requires either nvcc or hcc to be installed on your system (see the guide for further details).
- If you want the HIP binaries to be located in a directory that does not require superuser access, be sure to change the install directory of HIP by modifying the `CMAKE_INSTALL_PREFIX` cmake variable.
- Also, after the installation is complete, add the following line to the `.profile` file in your home directory, in order to add the path to the HIP binaries to PATH: `PATH=$PATH:<path_to_binaries>`

```bash
git clone --recursive https://github.com/ROCm-Developer-Tools/HIP.git
cd HIP
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE="${CMAKE_BUILD_TYPE}" -DCMAKE_INSTALL_PREFIX=${YOUR_HIP_INSTALL_DIR} -DBUILD_TESTING=OFF ..
make
make install
```
Set the appropriate paths (edit the `${YOUR_**}` variables):

```bash
# HIP_PATH required by HIP tools
export HIP_PATH=${YOUR_HIP_INSTALL_DIR}
# Paths required by HIP tools
export CUDA_PATH=${YOUR_CUDA_ROOT}
# - if required, path to HCC compiler. Default /opt/rocm/hcc.
export HCC_HOME=${YOUR_HCC_ROOT}
# - if required, path to HSA include, lib. Default /opt/rocm/hsa.
export HSA_PATH=${YOUR_HSA_PATH}
# HIP binaries and libraries
export PATH=${HIP_PATH}/bin:$PATH
export LD_LIBRARY_PATH=${HIP_PATH}/lib64:${LD_LIBRARY_PATH}
```
Test the HIP binaries:

```bash
# calls nvcc or hcc
which hipcc
hipcc -V
which hipconfig
hipconfig -v
```
Verifying HIP Installation
If PATH points to the location of the HIP binaries, the following command should list several relevant environment variables, as well as the selected compiler on your system: `hipconfig -f`
Compile and run the square sample, as pointed out in the original HIP install guide.
Compiling Examples with HIP Back End
As of now, the back-end has only been tested on the NVIDIA platform.
NVIDIA Platform
One issue in this branch of alpaka is that the host compiler flags don't propagate to the device compiler, as they do in CUDA. This is because a counterpart to the `CUDA_PROPAGATE_HOST_FLAGS` cmake variable has not been defined in the FindHIP.cmake file. alpaka forwards the host compiler flags in cmake to the `HIP_NVCC_FLAGS` cmake variable, which also takes user-given flags. To add flags to this variable, toggle the advanced mode in `ccmake`.
Random Number Generator Library rocRAND for HIP Back End
rocRAND provides an interface for HIP, where the cuRAND or rocRAND API is called depending on the chosen HIP platform (can be configured with cmake in alpaka).
Clone the rocRAND repository, then build and install it:

```bash
git clone https://github.com/ROCmSoftwarePlatform/rocRAND
cd rocRAND
mkdir -p build
cd build
cmake -DCMAKE_INSTALL_PREFIX=${HIP_PATH} -DBUILD_BENCHMARK=OFF -DBUILD_TEST=OFF -DCMAKE_MODULE_PATH=${HIP_PATH}/cmake ..
make
```
`CMAKE_MODULE_PATH` is a cmake variable for locating module finding scripts such as FindHIP.cmake. The paths to the rocRAND library and include directories should be appended to the `CMAKE_PREFIX_PATH` variable.