Back-ends

Accelerator Implementations

The following overview shows which native implementation or information is used to provide each alpaka functionality in the different back-ends (Serial, std::thread, Boost.Fiber, OpenMP 2.0, OpenMP 4.0, CUDA 9.0+).

Devices
  • Serial: Host Core
  • std::thread: Host Cores
  • Boost.Fiber: Host Core
  • OpenMP 2.0: Host Cores
  • OpenMP 4.0: Host Cores
  • CUDA 9.0+: NVIDIA GPUs

Lib/API
  • Serial: standard C++
  • std::thread: std::thread
  • Boost.Fiber: boost::fibers::fiber
  • OpenMP 2.0: OpenMP 2.0
  • OpenMP 4.0: OpenMP 4.0
  • CUDA 9.0+: CUDA 9.0+

Kernel execution
  • Serial: sequential
  • std::thread: std::thread(kernel)
  • Boost.Fiber: boost::fibers::fiber(kernel)
  • OpenMP 2.0: omp_set_dynamic(0), #pragma omp parallel num_threads(iNumKernelsInBlock)
  • OpenMP 4.0: #pragma omp target, #pragma omp teams num_teams(…) thread_limit(…), #pragma omp distribute, #pragma omp parallel num_threads(…)
  • CUDA 9.0+: cudaConfigureCall, cudaSetupArgument, cudaLaunch

Execution strategy grid-blocks
  • Serial: sequential
  • std::thread: sequential
  • Boost.Fiber: sequential
  • OpenMP 2.0: sequential
  • OpenMP 4.0: undefined
  • CUDA 9.0+: undefined

Execution strategy block-kernels
  • Serial: sequential
  • std::thread: preemptive multitasking
  • Boost.Fiber: cooperative multithreading
  • OpenMP 2.0: preemptive multitasking
  • OpenMP 4.0: preemptive multitasking
  • CUDA 9.0+: lock-step within warps

getIdx
  • Serial: emulated
  • std::thread: block-kernel: mapping of std::this_thread::get_id(); grid-block: member variable
  • Boost.Fiber: block-kernel: mapping of std::this_fiber::get_id(); grid-block: member variable
  • OpenMP 2.0: block-kernel: omp_get_thread_num() to 3D index mapping; grid-block: member variable
  • OpenMP 4.0: block-kernel: omp_get_thread_num() to 3D index mapping; grid-block: member variable
  • CUDA 9.0+: threadIdx, blockIdx

getExtent
  • Serial: member variables
  • std::thread: member variables
  • Boost.Fiber: member variables
  • OpenMP 2.0: member variables
  • OpenMP 4.0: member variables
  • CUDA 9.0+: gridDim, blockDim

getBlockSharedMemDynSizeBytes
  • Serial: allocated in memory prior to kernel execution
  • std::thread: allocated in memory prior to kernel execution
  • Boost.Fiber: allocated in memory prior to kernel execution
  • OpenMP 2.0: allocated in memory prior to kernel execution
  • OpenMP 4.0: allocated in memory prior to kernel execution
  • CUDA 9.0+: __shared__

allocBlockSharedMem
  • Serial: master thread allocates
  • std::thread: syncBlockKernels -> master thread allocates -> syncBlockKernels
  • Boost.Fiber: syncBlockKernels -> master thread allocates -> syncBlockKernels
  • OpenMP 2.0: syncBlockKernels -> master thread allocates -> syncBlockKernels
  • OpenMP 4.0: syncBlockKernels -> master thread allocates -> syncBlockKernels
  • CUDA 9.0+: __shared__

syncBlockKernels
  • Serial: not required
  • std::thread: barrier
  • Boost.Fiber: barrier
  • OpenMP 2.0: #pragma omp barrier
  • OpenMP 4.0: #pragma omp barrier
  • CUDA 9.0+: __syncthreads

atomicOp
  • Serial: hierarchy dependent
  • std::thread: std::lock_guard<std::mutex>
  • Boost.Fiber: n/a
  • OpenMP 2.0: #pragma omp critical
  • OpenMP 4.0: #pragma omp critical
  • CUDA 9.0+: atomicXXX

ALPAKA_FN_HOST_ACC, ALPAKA_FN_ACC, ALPAKA_FN_HOST
  • Serial: inline
  • std::thread: inline
  • Boost.Fiber: inline
  • OpenMP 2.0: inline
  • OpenMP 4.0: inline
  • CUDA 9.0+: __device__, __host__, __forceinline__

Serial

The serial accelerator only allows blocks with exactly one thread. Therefore it does not implement real synchronization or atomic primitives.

Threads

Execution

To prevent recreating the threads between the execution of different blocks in the grid, the threads are stored inside a thread pool. This thread pool is local to the invocation, because making it local to the KernelExecutor could mean heavy memory usage and many idling kernel threads when multiple KernelExecutors are around. Because the default policy of the threads in the pool is to yield instead of to wait, this would also slow down the system immensely.

Fibers

Execution

To prevent recreating the fibers between the execution of different blocks in the grid, the fibers are stored inside a fiber pool. This fiber pool is local to the invocation, because making it local to the KernelExecutor could mean heavy memory usage when multiple KernelExecutors are around.

OpenMP

Execution

Parallel execution of the kernels in a block is required because, when syncBlockThreads is called, all of them have to have finished their work up to this line. So we have to spawn one real thread per kernel in a block. omp for is not useful because it is meant for cases where multiple iterations are executed by one thread, but in our case a 1:1 mapping is required. Therefore we use omp parallel with the specified number of threads in a block. Another reason for not using omp for, for example #pragma omp parallel for collapse(3) num_threads(blockDim.x*blockDim.y*blockDim.z), is that #pragma omp barrier, which is used for intra-block synchronization, is not allowed inside omp for blocks.
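
A minimal sketch of this strategy (not alpaka's actual source; the function and variable names are illustrative):

#include <omp.h>

void executeBlock(int const iNumKernelsInBlock)
{
    omp_set_dynamic(0); // enforce exactly the requested number of threads
#pragma omp parallel num_threads(iNumKernelsInBlock)
    {
        // Each OpenMP thread executes one kernel of the block here ...
#pragma omp barrier // legal here; it would not be allowed inside an "omp for" block
        // ... and continues after the block-wide synchronization point.
    }
}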

Because OpenMP is designed for a 1:1 abstraction of hardware to software threads, the block size is restricted by the number of OpenMP threads allowed by the runtime. This can be as few as 2 or 4 kernels, but on a system with 4 cores and hyper-threading OpenMP can also allow 64 threads.

Index

OpenMP only provides a linear thread index. This index is converted to a 3-dimensional index at runtime.
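
A minimal sketch of such a conversion, assuming x is the fastest-varying dimension (blockDimX and blockDimY are illustrative names for the block extent per dimension):

#include <omp.h>

void linearTo3D(int const blockDimX, int const blockDimY, int& x, int& y, int& z)
{
    int const linear = omp_get_thread_num(); // linear index provided by OpenMP
    x = linear % blockDimX;
    y = (linear / blockDimX) % blockDimY;
    z = linear / (blockDimX * blockDimY);
}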

Atomic

We cannot use #pragma omp atomic because braces or calls to other functions directly after #pragma omp atomic are not allowed. Because we are implementing the CUDA atomic operations, which return the old value, #pragma omp critical has to be used instead. omp_set_lock is an alternative but is usually slower.
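
A minimal sketch of the idea (an assumption for illustration, not alpaka's internal implementation; the critical-section name is made up): a CUDA-style atomicAdd that returns the old value, built on #pragma omp critical.

template<typename T>
T atomicAddReturnOld(T* const address, T const value)
{
    T old;
#pragma omp critical (IllustrativeAtomicOp)
    {
        old = *address;    // remember the old value, as the CUDA atomics do
        *address += value; // apply the update
    }
    return old;
}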

CUDA

Nearly all CUDA functionality can be directly mapped to alpaka function calls. A major difference is that CUDA requires the block and grid sizes to be given in (x, y, z) order, whereas alpaka uses the mathematical C/C++ array indexing scheme [z][y][x]. In both cases x is the innermost / fastest-running index.
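
The following sketch illustrates the ordering difference (the extent values are arbitrary examples):

#include <alpaka/alpaka.hpp>
#include <cstddef>

using Dim3 = alpaka::DimInt<3u>;
using Idx = std::size_t;

// CUDA:   dim3 blockSize(16, 8, 4);  // given as (x, y, z)
// alpaka: the same extent, written [z][y][x]
alpaka::Vec<Dim3, Idx> const blockSize(
    static_cast<Idx>(4),   // z
    static_cast<Idx>(8),   // y
    static_cast<Idx>(16)); // x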

Furthermore, alpaka does not require the indices and extents to be 3-dimensional. The accelerators are templated on the dimensionality and support arbitrary dimensions. NOTE: Currently the CUDA implementation is restricted to a maximum of 3 dimensions!

NOTE: You have to be careful when mixing alpaka and non-alpaka CUDA code. The CUDA-accelerator back-end can change the current CUDA device and will NOT set the device back to the one that was active prior to the invocation of the alpaka function.

Function Attributes

CUDA | alpaka
__host__ | ALPAKA_FN_HOST
__device__ | ALPAKA_FN_ACC
__global__ | ALPAKA_FN_ACC
__host__ __device__ | ALPAKA_FN_HOST_ACC

Note

You cannot call CUDA-only methods, except when ALPAKA_ACC_GPU_CUDA_ONLY_MODE is enabled.
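
A small usage sketch of the mapping above: a function callable from host and accelerator code alike, corresponding to a CUDA __host__ __device__ function (the function itself is made up for illustration).

#include <alpaka/alpaka.hpp>

ALPAKA_FN_HOST_ACC inline float square(float const x)
{
    return x * x;
}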

Memory

CUDA | alpaka
__shared__ | alpaka::declareSharedVar<std::uint32_t, __COUNTER__>(acc)
__constant__ | ALPAKA_STATIC_ACC_MEM_CONSTANT
__device__ | ALPAKA_STATIC_ACC_MEM_GLOBAL

template<typename T, std::size_t TuniqueId, typename TBlockSharedMemSt>
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::declareSharedVar(TBlockSharedMemSt const &blockSharedMemSt)

Declare a block shared variable.

The variable is uninitialized and not default constructed! The variable can be accessed by all threads within a block. Access to the variable is not thread safe.

Return

Uninitialized variable stored in shared memory.

Template Parameters
  • T: The element type.

  • TuniqueId: An id that is unique inside the kernel.

  • TBlockSharedMemSt: The block shared allocator implementation type.

Parameters
  • blockSharedMemSt: The block shared allocator implementation.
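
A hedged usage sketch: declaring and initializing a block shared variable inside a kernel functor (the kernel name and the initialization pattern are illustrative, not prescribed by alpaka).

#include <alpaka/alpaka.hpp>
#include <cstdint>

struct SharedCounterKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc) const
    {
        // One uninitialized std::uint32_t shared by all threads of the block.
        auto& counter = alpaka::declareSharedVar<std::uint32_t, __COUNTER__>(acc);
        if(alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc)[0] == 0u)
            counter = 0u; // the variable is not default constructed, so one thread initializes it
        alpaka::syncBlockThreads(acc); // make the initialization visible to the whole block
        // ... all threads of the block may now use `counter`.
    }
};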

ALPAKA_STATIC_ACC_MEM_CONSTANT

This macro defines a variable lying in constant accelerator device memory.

Example: ALPAKA_STATIC_ACC_MEM_CONSTANT int i;

Those variables behave like ordinary variables when used in file scope. They have external linkage (i.e. they are accessible from other compilation units). If you want to access such a variable from a different compilation unit, you have to declare it as extern: extern ALPAKA_STATIC_ACC_MEM_CONSTANT int i; As for ordinary variables, only one definition is allowed (ODR); failing to observe this might lead to linker errors.

In contrast to ordinary variables, you cannot define such variables as static compilation-unit-local variables with internal linkage, because this is forbidden by CUDA.

Attention

It is not allowed to initialize the variable together with the declaration. To initialize the variable, alpaka::createStaticDevMemView and alpaka::memcpy must be used.

ALPAKA_STATIC_ACC_MEM_CONSTANT int foo;

void initFoo() {
    // `device`, `devHost` and `queue` are assumed to exist in the surrounding code.
    auto extent = alpaka::Vec<alpaka::DimInt<1u>, size_t>{1};
    auto viewFoo = alpaka::createStaticDevMemView(&foo, device, extent);
    int initialValue = 42;
    alpaka::ViewPlainPtr<DevHost, int, alpaka::DimInt<1u>, size_t> bufHost(&initialValue, devHost, extent);
    // Copy the initial value into the constant device variable.
    alpaka::memcpy(queue, viewFoo, bufHost, extent);
}

ALPAKA_STATIC_ACC_MEM_GLOBAL

This macro defines a variable lying in global accelerator device memory.

Example: ALPAKA_STATIC_ACC_MEM_GLOBAL int i;

Those variables behave like ordinary variables when used in file scope. They have external linkage (i.e. they are accessible from other compilation units). If you want to access such a variable from a different compilation unit, you have to declare it as extern: extern ALPAKA_STATIC_ACC_MEM_GLOBAL int i; As for ordinary variables, only one definition is allowed (ODR); failing to observe this might lead to linker errors.

In contrast to ordinary variables, you cannot define such variables as static compilation-unit-local variables with internal linkage, because this is forbidden by CUDA.

Attention

It is not allowed to initialize the variable together with the declaration. To initialize the variable, alpaka::createStaticDevMemView and alpaka::memcpy must be used.

ALPAKA_STATIC_ACC_MEM_GLOBAL int foo;

void initFoo() {
    // `device`, `devHost` and `queue` are assumed to exist in the surrounding code.
    auto extent = alpaka::Vec<alpaka::DimInt<1u>, size_t>{1};
    auto viewFoo = alpaka::createStaticDevMemView(&foo, device, extent);
    int initialValue = 42;
    alpaka::ViewPlainPtr<DevHost, int, alpaka::DimInt<1u>, size_t> bufHost(&initialValue, devHost, extent);
    // Copy the initial value into the global device variable.
    alpaka::memcpy(queue, viewFoo, bufHost, extent);
}

Index / Work Division

CUDA | alpaka
threadIdx | alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc)
blockIdx | alpaka::getIdx<alpaka::Grid, alpaka::Blocks>(acc)
blockDim | alpaka::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)
gridDim | alpaka::getWorkDiv<alpaka::Grid, alpaka::Blocks>(acc)
warpSize | alpaka::warp::getSize(acc)
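
A sketch combining the mappings above to compute a linear global thread index (a 1-dimensional work division is assumed; the function name is illustrative):

#include <alpaka/alpaka.hpp>

template<typename TAcc>
ALPAKA_FN_ACC auto linearGlobalThreadIdx(TAcc const& acc)
{
    auto const blockIdxInGrid = alpaka::getIdx<alpaka::Grid, alpaka::Blocks>(acc)[0];
    auto const threadIdxInBlock = alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc)[0];
    auto const threadsPerBlock = alpaka::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)[0];
    return blockIdxInGrid * threadsPerBlock + threadIdxInBlock;
}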

Types

CUDA Runtime API

The following tables list the functions available in the CUDA Runtime API and their equivalent alpaka functions:

Device Management

CUDA | alpaka
cudaChooseDevice | –
cudaDeviceGetAttribute | –
cudaDeviceGetByPCIBusId | –
cudaDeviceGetCacheConfig | –
cudaDeviceGetLimit | –
cudaDeviceGetP2PAttribute | –
cudaDeviceGetPCIBusId | –
cudaDeviceGetSharedMemConfig | –
cudaDeviceGetQueuePriorityRange | –
cudaDeviceReset | alpaka::reset(device)
cudaDeviceSetCacheConfig | –
cudaDeviceSetLimit | –
cudaDeviceSetSharedMemConfig | –
cudaDeviceSynchronize | void alpaka::wait(device)
cudaGetDevice | n/a (no current device)
cudaGetDeviceCount | std::size_t alpaka::getDevCount<TPltf>()
cudaGetDeviceFlags | –
cudaGetDeviceProperties | alpaka::getAccDevProps(dev) (only some properties available)
cudaIpcCloseMemHandle | –
cudaIpcGetEventHandle | –
cudaIpcGetMemHandle | –
cudaIpcOpenEventHandle | –
cudaIpcOpenMemHandle | –
cudaSetDevice | n/a (no current device)
cudaSetDeviceFlags | –
cudaSetValidDevices | –
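
A sketch of the device-management equivalents listed above (TPltf and TAcc stand for the platform and accelerator types in use, e.g. the CUDA platform and a CUDA accelerator; the function is illustrative):

#include <alpaka/alpaka.hpp>

template<typename TPltf, typename TAcc>
void deviceManagementSketch()
{
    auto const devCount = alpaka::getDevCount<TPltf>();   // cudaGetDeviceCount
    auto const dev = alpaka::getDevByIdx<TPltf>(0u);      // no "current device"; devices are passed explicitly
    auto const props = alpaka::getAccDevProps<TAcc>(dev); // cudaGetDeviceProperties (subset of the properties)
    alpaka::wait(dev);                                    // cudaDeviceSynchronize
    alpaka::reset(dev);                                   // cudaDeviceReset
    (void)devCount;
    (void)props;
}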

Error Handling

CUDA | alpaka
cudaGetErrorName | n/a (handled internally, available in exception message)
cudaGetErrorString | n/a (handled internally, available in exception message)
cudaGetLastError | n/a (handled internally)
cudaPeekAtLastError | n/a (handled internally)

Queue Management

CUDA | alpaka
cudaLaunchHostFunc | –
cudaStreamAddCallback | alpaka::enqueue(queue, [](){dosomething();})
cudaStreamAttachMemAsync | –
cudaStreamCreate | queue = alpaka::QueueCudaRtNonBlocking(device); or queue = alpaka::QueueCudaRtBlocking(device);
cudaStreamCreateWithFlags | see cudaStreamCreate (cudaStreamNonBlocking hard coded)
cudaStreamCreateWithPriority | –
cudaStreamDestroy | n/a (Destructor)
cudaStreamGetFlags | –
cudaStreamGetPriority | –
cudaStreamQuery | bool alpaka::empty(queue)
cudaStreamSynchronize | void alpaka::wait(queue)
cudaStreamWaitEvent | void alpaka::wait(queue, event)
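
A sketch of the queue workflow from the table above (assumes a CUDA device dev of the matching device type has already been obtained; the function name is illustrative):

#include <alpaka/alpaka.hpp>

template<typename TDev>
void queueManagementSketch(TDev const& dev)
{
    auto queue = alpaka::QueueCudaRtNonBlocking(dev);    // cudaStreamCreate (non-blocking)
    alpaka::enqueue(queue, [](){ /* host callback */ }); // cudaStreamAddCallback
    bool const finished = alpaka::empty(queue);          // cudaStreamQuery
    (void)finished;
    alpaka::wait(queue);                                 // cudaStreamSynchronize
}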

Event Management

CUDA | alpaka
cudaEventCreate | alpaka::Event<TQueue> event(dev);
cudaEventCreateWithFlags | –
cudaEventDestroy | n/a (Destructor)
cudaEventElapsedTime | –
cudaEventQuery | bool alpaka::isComplete(event)
cudaEventRecord | void alpaka::enqueue(queue, event)
cudaEventSynchronize | void alpaka::wait(event)
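
A sketch of the event workflow from the table above (queue and device are assumed to exist; TQueue is the queue type used with the back-end):

#include <alpaka/alpaka.hpp>

template<typename TQueue, typename TDev>
void eventManagementSketch(TQueue& queue, TDev const& dev)
{
    alpaka::Event<TQueue> event(dev);            // cudaEventCreate
    alpaka::enqueue(queue, event);               // cudaEventRecord
    bool const done = alpaka::isComplete(event); // cudaEventQuery
    (void)done;
    alpaka::wait(event);                         // cudaEventSynchronize
}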

Memory Management

CUDA | alpaka
cudaArrayGetInfo | –
cudaFree | n/a (automatic memory management with reference counted memory handles)
cudaFreeArray | –
cudaFreeHost | n/a
cudaFreeMipmappedArray | –
cudaGetMipmappedArrayLevel | –
cudaGetSymbolAddress | –
cudaGetSymbolSize | –
cudaHostAlloc | n/a, the existing buffer can be pinned using alpaka::prepareForAsyncCopy(memBuf)
cudaHostGetDevicePointer | –
cudaHostGetFlags | –
cudaHostRegister | –
cudaHostUnregister | –
cudaMalloc | alpaka::allocBuf<TElement>(device, extents1D)
cudaMalloc3D | alpaka::allocBuf<TElement>(device, extents3D)
cudaMalloc3DArray | –
cudaMallocArray | –
cudaMallocHost | alpaka::allocBuf<TElement>(device, extents) 1D, 2D, 3D supported!
cudaMallocManaged | –
cudaMallocMipmappedArray | –
cudaMallocPitch | alpaka::allocBuf<TElement>(device, extents2D)
cudaMemAdvise | –
cudaMemGetInfo | alpaka::getMemBytes, alpaka::getFreeMemBytes
cudaMemPrefetchAsync | –
cudaMemRangeGetAttribute | –
cudaMemRangeGetAttributes | –
cudaMemcpy | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D)
cudaMemcpy2D | alpaka::memcpy(queue, memBufDst, memBufSrc, extents2D)
cudaMemcpy2DArrayToArray | –
cudaMemcpy2DAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents2D)
cudaMemcpy2DFromArray | –
cudaMemcpy2DFromArrayAsync | –
cudaMemcpy2DToArray | –
cudaMemcpy2DToArrayAsync | –
cudaMemcpy3D | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D)
cudaMemcpy3DAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D)
cudaMemcpy3DPeer | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D)
cudaMemcpy3DPeerAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D)
cudaMemcpyArrayToArray | –
cudaMemcpyAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D)
cudaMemcpyFromArray | –
cudaMemcpyFromArrayAsync | –
cudaMemcpyFromSymbol | –
cudaMemcpyFromSymbolAsync | –
cudaMemcpyPeer | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D)
cudaMemcpyPeerAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D)
cudaMemcpyToArray | –
cudaMemcpyToArrayAsync | –
cudaMemcpyToSymbol | –
cudaMemcpyToSymbolAsync | –
cudaMemset | alpaka::memset(queue, memBufDst, byte, extents1D)
cudaMemset2D | alpaka::memset(queue, memBufDst, byte, extents2D)
cudaMemset2DAsync | alpaka::memset(queue, memBufDst, byte, extents2D)
cudaMemset3D | alpaka::memset(queue, memBufDst, byte, extents3D)
cudaMemset3DAsync | alpaka::memset(queue, memBufDst, byte, extents3D)
cudaMemsetAsync | alpaka::memset(queue, memBufDst, byte, extents1D)
make_cudaExtent | –
make_cudaPitchedPtr | –
make_cudaPos | –
cudaMemcpyHostToDevice | n/a (direction of copy is determined automatically)
cudaMemcpyDeviceToHost | n/a (direction of copy is determined automatically)
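
A sketch combining the memory mappings above (devHost, devAcc and queue are assumed to exist; the element type and extent are arbitrary examples):

#include <alpaka/alpaka.hpp>
#include <cstddef>

template<typename TDevHost, typename TDevAcc, typename TQueue>
void memoryManagementSketch(TDevHost const& devHost, TDevAcc const& devAcc, TQueue& queue)
{
    using Dim = alpaka::DimInt<1u>;
    using Idx = std::size_t;
    auto const extent = alpaka::Vec<Dim, Idx>(static_cast<Idx>(1024));

    auto bufHost = alpaka::allocBuf<float, Idx>(devHost, extent); // cudaMallocHost
    auto bufAcc = alpaka::allocBuf<float, Idx>(devAcc, extent);   // cudaMalloc

    alpaka::memset(queue, bufAcc, 0, extent);       // cudaMemset
    alpaka::memcpy(queue, bufAcc, bufHost, extent); // cudaMemcpy, direction determined automatically
    alpaka::wait(queue);
    // No cudaFree equivalent is needed: the buffers are reference counted and release themselves.
}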

Execution Control

CUDA | alpaka
cudaFuncGetAttributes | –
cudaFuncSetCacheConfig | –
cudaFuncSetSharedMemConfig | –
cudaLaunchKernel | alpaka::exec<TAcc>(queue, workDiv, kernel, params…); auto byteDynSharedMem = alpaka::getBlockSharedMemDynSizeBytes(kernel, …)
cudaSetDoubleForDevice | n/a (alpaka assumes double support)
cudaSetDoubleForHost | n/a (alpaka assumes double support)
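
A sketch of the alpaka counterpart of cudaLaunchKernel (the accelerator type, the kernel functor and the work-division values are assumptions for illustration):

#include <alpaka/alpaka.hpp>

template<typename TAcc, typename TQueue, typename TKernel>
void launchSketch(TQueue& queue, TKernel const& kernel)
{
    using Dim = alpaka::Dim<TAcc>;
    using Idx = alpaka::Idx<TAcc>;
    auto const workDiv = alpaka::WorkDivMembers<Dim, Idx>(
        alpaka::Vec<Dim, Idx>::all(16), // blocks per grid
        alpaka::Vec<Dim, Idx>::all(8),  // threads per block
        alpaka::Vec<Dim, Idx>::all(1)); // elements per thread
    alpaka::exec<TAcc>(queue, workDiv, kernel /*, kernel arguments… */);
}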

Occupancy

CUDA | alpaka
cudaOccupancyMaxActiveBlocksPerMultiprocessor | –
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags | –

Unified Addressing

CUDA | alpaka
cudaPointerGetAttributes | –

Peer Device Memory Access

CUDA | alpaka
cudaDeviceCanAccessPeer | –
cudaDeviceDisablePeerAccess | –
cudaDeviceEnablePeerAccess | automatically done when required

OpenGL, Direct3D, VDPAU, EGL, Graphics Interoperability

not available

Texture/Surface Reference/Object Management

not available

Version Management

not available

HIP

Warning

The HIP documentation is outdated and needs to be reworked.

Current Restrictions on HCC platform

  • Workaround for unsupported syncthreads_{count|and|or}.

    • Uses temporary shared value and atomics

  • Workaround for buggy hipStreamQuery, hipStreamSynchronize.

    • Introduces own queue management

    • hipStreamQuery and hipStreamSynchronize do not work in a multithreaded environment

  • Workaround for missing cuStreamWaitValue32.

    • Polls the value every 10 ms

  • Device constant memory not supported yet

  • Note that printf in kernels is still not supported in HIP

  • Exclude hipMalloc3D and hipMallocPitch when the size is zero, otherwise they throw an Unknown Error

  • TestAccs excludes the 3D specialization of the HIP back-end for now because verifyBytesSet fails in memView for the 3D specialization

  • dim3 structure is not available on device (use alpaka::Vec instead)

  • Constructors’ attributes unified with destructors’.

    • Host/device signature must match in HIP(HCC)

  • A chain of functions must also provide correct host-device signatures

    • E.g. a host function cannot be called from a host-device function

  • Recompile your target if the HCC linker returns the error: “File format not recognized clang-7: error: linker command failed with exit code 1”

  • If a compile error occurred, the linker may still link, but without the device code

  • AMD device architecture currently hardcoded in alpakaConfig.cmake

Compiling HIP from Source

Follow the HIP installation guide for installing HIP. HIP requires either nvcc or hcc to be installed on your system (see the guide for further details).

  • If you want the HIP binaries to be located in a directory that does not require superuser access, be sure to change the install directory of HIP by modifying the CMAKE_INSTALL_PREFIX cmake variable.

  • Also, after the installation is complete, add the following line to the .profile file in your home directory, in order to add the path to the HIP binaries to PATH: PATH=$PATH:<path_to_binaries>

git clone --recursive https://github.com/ROCm-Developer-Tools/HIP.git
cd HIP
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE="${CMAKE_BUILD_TYPE}" -DCMAKE_INSTALL_PREFIX=${YOUR_HIP_INSTALL_DIR} -DBUILD_TESTING=OFF ..
make
make install
  • Set the appropriate paths (edit ${YOUR_**} variables)

# HIP_PATH required by HIP tools
export HIP_PATH=${YOUR_HIP_INSTALL_DIR}
# Paths required by HIP tools
export CUDA_PATH=${YOUR_CUDA_ROOT}
# - if required, path to HCC compiler. Default /opt/rocm/hcc.
export HCC_HOME=${YOUR_HCC_ROOT}
# - if required, path to HSA include, lib. Default /opt/rocm/hsa.
export HSA_PATH=${YOUR_HSA_PATH}
# HIP binaries and libraries
export PATH=${HIP_PATH}/bin:$PATH
export LD_LIBRARY_PATH=${HIP_PATH}/lib64:${LD_LIBRARY_PATH}
  • Test the HIP binaries

# calls nvcc or hcc
which hipcc
hipcc -V
which hipconfig
hipconfig -v

Verifying HIP Installation

  • If PATH points to the location of the HIP binaries, the following command should list several relevant environment variables as well as the selected compiler on your system: hipconfig -f

  • Compile and run the square sample, as pointed out in the original HIP install guide.

Compiling Examples with HIP Back End

As of now, the back-end has only been tested on the NVIDIA platform.

  • NVIDIA Platform

    • One issue in this branch of alpaka is that the host compiler flags don’t propagate to the device compiler, as they do in CUDA. This is because a counterpart to the CUDA_PROPAGATE_HOST_FLAGS cmake variable has not been defined in the FindHIP.cmake file. alpaka forwards the host compiler flags in cmake to the HIP_NVCC_FLAGS cmake variable, which also takes user-given flags. To add flags to this variable, toggle the advanced mode in ccmake.

Random Number Generator Library rocRAND for HIP Back End

rocRAND provides an interface for HIP, where the cuRAND or rocRAND API is called depending on the chosen HIP platform (can be configured with cmake in alpaka).

Clone the rocRAND repository, then build and install it

git clone https://github.com/ROCmSoftwarePlatform/rocRAND
cd rocRAND
mkdir -p build
cd build
cmake -DCMAKE_INSTALL_PREFIX=${HIP_PATH} -DBUILD_BENCHMARK=OFF -DBUILD_TEST=OFF -DCMAKE_MODULE_PATH=${HIP_PATH}/cmake ..
make

The CMAKE_MODULE_PATH is a cmake variable for locating module finding scripts like FindHIP.cmake. The paths to the rocRAND library and include directories should be appended to the CMAKE_PREFIX_PATH variable.