CUDA GPUs

Mapping the abstraction to GPUs supporting CUDA is straightforward because the hierarchy levels are identical up to the element level. Blocks of warps of threads are therefore mapped directly to their CUDA equivalents.

The element level is supported through an additional run-time variable containing the extent of elements per thread. This variable is accessible to all threads and is ideally placed in constant device memory for fast access.

Porting CUDA to alpaka

Nearly all CUDA functionality can be directly mapped to alpaka function calls. A major difference is that CUDA expects block and grid sizes in (x, y, z) order, whereas alpaka uses the C/C++ array indexing scheme [z][y][x]. In both cases, x is the innermost / fastest-running index.

Furthermore, alpaka does not require the indices and extents to be 3-dimensional. The accelerators are templated on the dimensionality and support arbitrary dimensions. NOTE: Currently the CUDA implementation is restricted to a maximum of 3 dimensions!

NOTE: Be careful when mixing alpaka and non-alpaka CUDA code. The CUDA-accelerator back-end can change the current CUDA device and will NOT set the device back to the one active prior to the invocation of the alpaka function.
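If surrounding non-alpaka code depends on the current CUDA device, one defensive pattern is to save and restore the device manually around alpaka calls. A minimal sketch using only CUDA runtime calls; callAlpakaCode() is a hypothetical placeholder:

```cpp
#include <cuda_runtime.h>

void callAlpakaCode(); // placeholder for code invoking alpaka functions

void callWithDeviceRestored()
{
    int device = 0;
    cudaGetDevice(&device); // remember the currently active CUDA device

    callAlpakaCode();       // may change the current CUDA device

    cudaSetDevice(device);  // restore it for subsequent non-alpaka CUDA code
}
```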

Programming Interface

Function Attributes

CUDA                   alpaka
-------------------    -------------------
__host__               ALPAKA_FN_HOST
__device__             ALPAKA_FN_ACC
__global__             ALPAKA_FN_ACC
__host__ __device__    ALPAKA_FN_HOST_ACC

Note

You cannot call CUDA-only methods, except when ALPAKA_ACC_GPU_CUDA_ONLY_MODE is enabled.
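As an illustration of the attribute mapping, a CUDA __global__ kernel becomes a function object whose operator() is marked ALPAKA_FN_ACC and receives the accelerator as its first argument. A sketch (the kernel itself is made up):

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

// CUDA: __global__ void axpy(float a, float const* x, float* y, std::size_t n);
struct AxpyKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(
        TAcc const & acc,
        float const a,
        float const * const x,
        float * const y,
        std::size_t const n) const
    {
        // Global thread index, equivalent to blockIdx.x * blockDim.x + threadIdx.x:
        auto const i = alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        if(i < n)
            y[i] = a * x[i] + y[i];
    }
};
```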

Memory

CUDA            alpaka
------------    ---------------------------------------------------------------------
__shared__      alpaka::block::shared::st::allocVar<std::uint32_t, __COUNTER__>(acc)
__constant__    ALPAKA_STATIC_ACC_MEM_CONSTANT
__device__      ALPAKA_STATIC_ACC_MEM_GLOBAL

template<typename T, std::size_t TuniqueId, typename TBlockSharedMemSt>
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::block::shared::st::allocVar(TBlockSharedMemSt const &blockSharedMemSt)

Allocates a variable in block shared static memory.

The allocated variable is uninitialized and not default constructed!

Template Parameters
  • T: The element type.

  • TuniqueId: An id that is unique inside the kernel.

  • TBlockSharedMemSt: The block shared allocator implementation type.

Parameters
  • blockSharedMemSt: The block shared allocator implementation.
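A usage sketch: since the variable is uninitialized, one thread of the block should initialize it and all threads must synchronize before reading it. The surrounding kernel is assumed; acc is the accelerator argument:

```cpp
// CUDA: __shared__ std::uint32_t counter;
auto & counter = alpaka::block::shared::st::allocVar<std::uint32_t, __COUNTER__>(acc);

// Not default constructed: initialize explicitly and synchronize the block.
if(alpaka::idx::getIdx<alpaka::Block, alpaka::Threads>(acc)[0] == 0u)
    counter = 0u;
alpaka::block::sync::syncBlockThreads(acc);
```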

ALPAKA_STATIC_ACC_MEM_CONSTANT

This macro defines a variable lying in constant accelerator device memory.

Example: ALPAKA_STATIC_ACC_MEM_CONSTANT int i;

These variables behave like ordinary variables when used at file scope. They have external linkage (are accessible from other compilation units). If you want to access such a variable from a different compilation unit, you have to declare it as extern: extern ALPAKA_STATIC_ACC_MEM_CONSTANT int i;. As with ordinary variables, only one definition is allowed (ODR); violating this can lead to linker errors.

In contrast to ordinary variables, you cannot define such variables as static compilation-unit-local variables with internal linkage, because this is forbidden by CUDA.

ALPAKA_STATIC_ACC_MEM_GLOBAL

This macro defines a variable lying in global accelerator device memory.

Example: ALPAKA_STATIC_ACC_MEM_GLOBAL int i;

These variables behave like ordinary variables when used at file scope. They have external linkage (are accessible from other compilation units). If you want to access such a variable from a different compilation unit, you have to declare it as extern: extern ALPAKA_STATIC_ACC_MEM_GLOBAL int i;. As with ordinary variables, only one definition is allowed (ODR); violating this can lead to linker errors.

In contrast to ordinary variables, you cannot define such variables as static compilation-unit-local variables with internal linkage, because this is forbidden by CUDA.

Index / Work Division

CUDA         alpaka
---------    -----------------------------------------------------------------
threadIdx    alpaka::idx::getIdx<alpaka::Block, alpaka::Threads>(acc)
blockIdx     alpaka::idx::getIdx<alpaka::Grid, alpaka::Blocks>(acc)
blockDim     alpaka::workdiv::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)
gridDim      alpaka::workdiv::getWorkDiv<alpaka::Grid, alpaka::Blocks>(acc)
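Putting these calls together, the common CUDA pattern for a global 1D thread index can be written as follows inside a kernel. A sketch; acc is the accelerator argument, and index 0 is the only dimension of a 1-dimensional accelerator:

```cpp
// CUDA: std::size_t const i = blockIdx.x * blockDim.x + threadIdx.x;
auto const gridBlockIdx   = alpaka::idx::getIdx<alpaka::Grid, alpaka::Blocks>(acc)[0];
auto const blockThreadDim = alpaka::workdiv::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)[0];
auto const blockThreadIdx = alpaka::idx::getIdx<alpaka::Block, alpaka::Threads>(acc)[0];
auto const i = gridBlockIdx * blockThreadDim + blockThreadIdx;

// The same value in a single call:
auto const iAlt = alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
```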

Types

CUDA    alpaka
----    ----------------------------
dim3    alpaka::vec::Vec<TDim, TVal>

CUDA Runtime API

The following tables list the functions available in the CUDA Runtime API and their equivalent alpaka functions:

Device Management

CUDA                                alpaka
--------------------------------    ---------------------------------------------------------------
cudaChooseDevice                    -
cudaDeviceGetAttribute              -
cudaDeviceGetByPCIBusId             -
cudaDeviceGetCacheConfig            -
cudaDeviceGetLimit                  -
cudaDeviceGetP2PAttribute           -
cudaDeviceGetPCIBusId               -
cudaDeviceGetSharedMemConfig        -
cudaDeviceGetStreamPriorityRange    -
cudaDeviceReset                     alpaka::dev::reset(device)
cudaDeviceSetCacheConfig            -
cudaDeviceSetLimit                  -
cudaDeviceSetSharedMemConfig        -
cudaDeviceSynchronize               void alpaka::wait::wait(device)
cudaGetDevice                       n/a (no current device)
cudaGetDeviceCount                  std::size_t alpaka::pltf::getDevCount<TPltf>()
cudaGetDeviceFlags                  -
cudaGetDeviceProperties             alpaka::acc::getAccDevProps(dev) (only some properties available)
cudaIpcCloseMemHandle               -
cudaIpcGetEventHandle               -
cudaIpcGetMemHandle                 -
cudaIpcOpenEventHandle              -
cudaIpcOpenMemHandle                -
cudaSetDevice                       n/a (no current device)
cudaSetDeviceFlags                  -
cudaSetValidDevices                 -

Error Handling

CUDA                   alpaka
-------------------    -------------------------------------------------------
cudaGetErrorName       n/a (handled internally, available in exception message)
cudaGetErrorString     n/a (handled internally, available in exception message)
cudaGetLastError       n/a (handled internally)
cudaPeekAtLastError    n/a (handled internally)

Queue Management

CUDA                            alpaka
----------------------------    -----------------------------------------------------
cudaStreamAddCallback           alpaka::queue::enqueue(queue, [](){ doSomething(); })
cudaStreamAttachMemAsync        -
cudaStreamCreate                queue = alpaka::queue::QueueCudaRtNonBlocking(device); or
                                queue = alpaka::queue::QueueCudaRtBlocking(device);
cudaStreamCreateWithFlags       see cudaStreamCreate (cudaStreamNonBlocking hard-coded)
cudaStreamCreateWithPriority    -
cudaStreamDestroy               n/a (destructor)
cudaStreamGetFlags              -
cudaStreamGetPriority           -
cudaStreamQuery                 bool alpaka::queue::empty(queue)
cudaStreamSynchronize           void alpaka::wait::wait(queue)
cudaStreamWaitEvent             void alpaka::wait::wait(queue, event)
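The queue operations above combine as in the following sketch, assuming the first CUDA device of the CUDA platform; the enqueued host task body is made up:

```cpp
#include <alpaka/alpaka.hpp>

int main()
{
    auto const dev = alpaka::pltf::getDevByIdx<alpaka::pltf::PltfCudaRt>(0u);

    // cudaStreamCreate with cudaStreamNonBlocking:
    auto queue = alpaka::queue::QueueCudaRtNonBlocking(dev);

    // cudaStreamAddCallback: enqueue an arbitrary host-side task.
    alpaka::queue::enqueue(queue, [](){ /* host-side work */ });

    // cudaStreamQuery / cudaStreamSynchronize:
    bool const finished = alpaka::queue::empty(queue);
    alpaka::wait::wait(queue);
    static_cast<void>(finished);

    // cudaStreamDestroy: implicit, when `queue` goes out of scope.
}
```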

Event Management

CUDA                        alpaka
------------------------    -----------------------------------------
cudaEventCreate             alpaka::event::Event<TQueue> event(dev);
cudaEventCreateWithFlags    -
cudaEventDestroy            n/a (destructor)
cudaEventElapsedTime        -
cudaEventQuery              bool alpaka::event::test(event)
cudaEventRecord             void alpaka::queue::enqueue(queue, event)
cudaEventSynchronize        void alpaka::wait::wait(event)

Memory Management

CUDA                          alpaka
--------------------------    --------------------------------------------------------------------------
cudaArrayGetInfo              -
cudaFree                      n/a (automatic memory management with reference-counted memory handles)
cudaFreeArray                 -
cudaFreeHost                  n/a
cudaFreeMipmappedArray        -
cudaGetMipmappedArrayLevel    -
cudaGetSymbolAddress          -
cudaGetSymbolSize             -
cudaHostAlloc                 n/a; an existing buffer can be pinned using alpaka::mem::buf::prepareForAsyncCopy(memBuf)
cudaHostGetDevicePointer      -
cudaHostGetFlags              -
cudaHostRegister              -
cudaHostUnregister            -
cudaMalloc                    alpaka::mem::buf::alloc<TElement>(device, extents1D)
cudaMalloc3D                  alpaka::mem::buf::alloc<TElement>(device, extents3D)
cudaMalloc3DArray             -
cudaMallocArray               -
cudaMallocHost                alpaka::mem::buf::alloc<TElement>(device, extents) (1D, 2D and 3D supported)
cudaMallocManaged             -
cudaMallocMipmappedArray      -
cudaMallocPitch               alpaka::mem::buf::alloc<TElement>(device, extents2D)
cudaMemAdvise                 -
cudaMemGetInfo                alpaka::dev::getMemBytes / alpaka::dev::getFreeMemBytes
cudaMemPrefetchAsync          -
cudaMemRangeGetAttribute      -
cudaMemRangeGetAttributes     -
cudaMemcpy                    alpaka::mem::view::copy(memBufDst, memBufSrc, extents1D)
cudaMemcpy2D                  alpaka::mem::view::copy(memBufDst, memBufSrc, extents2D)
cudaMemcpy2DArrayToArray      -
cudaMemcpy2DAsync             alpaka::mem::view::copy(memBufDst, memBufSrc, extents2D, queue)
cudaMemcpy2DFromArray         -
cudaMemcpy2DFromArrayAsync    -
cudaMemcpy2DToArray           -
cudaMemcpy2DToArrayAsync      -
cudaMemcpy3D                  alpaka::mem::view::copy(memBufDst, memBufSrc, extents3D)
cudaMemcpy3DAsync             alpaka::mem::view::copy(memBufDst, memBufSrc, extents3D, queue)
cudaMemcpy3DPeer              alpaka::mem::view::copy(memBufDst, memBufSrc, extents3D)
cudaMemcpy3DPeerAsync         alpaka::mem::view::copy(memBufDst, memBufSrc, extents3D, queue)
cudaMemcpyArrayToArray        -
cudaMemcpyAsync               alpaka::mem::view::copy(memBufDst, memBufSrc, extents1D, queue)
cudaMemcpyFromArray           -
cudaMemcpyFromArrayAsync      -
cudaMemcpyFromSymbol          -
cudaMemcpyFromSymbolAsync     -
cudaMemcpyPeer                alpaka::mem::view::copy(memBufDst, memBufSrc, extents1D)
cudaMemcpyPeerAsync           alpaka::mem::view::copy(memBufDst, memBufSrc, extents1D, queue)
cudaMemcpyToArray             -
cudaMemcpyToArrayAsync        -
cudaMemcpyToSymbol            -
cudaMemcpyToSymbolAsync       -
cudaMemset                    alpaka::mem::view::set(memBufDst, byte, extents1D)
cudaMemset2D                  alpaka::mem::view::set(memBufDst, byte, extents2D)
cudaMemset2DAsync             alpaka::mem::view::set(memBufDst, byte, extents2D, queue)
cudaMemset3D                  alpaka::mem::view::set(memBufDst, byte, extents3D)
cudaMemset3DAsync             alpaka::mem::view::set(memBufDst, byte, extents3D, queue)
cudaMemsetAsync               alpaka::mem::view::set(memBufDst, byte, extents1D, queue)
make_cudaExtent               -
make_cudaPitchedPtr           -
make_cudaPos                  -
cudaMemcpyHostToDevice        n/a (direction of copy is determined automatically)
cudaMemcpyDeviceToHost        n/a (direction of copy is determined automatically)
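A sketch combining the allocation, set, and copy rows above, using the signatures from the table; the device/queue setup and the extent construction are assumptions and may differ between alpaka versions:

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

int main()
{
    using Dim = alpaka::dim::DimInt<1u>;
    using Idx = std::size_t;

    auto const devHost = alpaka::pltf::getDevByIdx<alpaka::pltf::PltfCpu>(0u);
    auto const devAcc  = alpaka::pltf::getDevByIdx<alpaka::pltf::PltfCudaRt>(0u);
    auto queue = alpaka::queue::QueueCudaRtBlocking(devAcc);

    auto const extents1D = alpaka::vec::Vec<Dim, Idx>(Idx{1024});

    // cudaMallocHost / cudaMalloc:
    auto memBufHost = alpaka::mem::buf::alloc<float>(devHost, extents1D);
    auto memBufAcc  = alpaka::mem::buf::alloc<float>(devAcc, extents1D);

    // cudaMemsetAsync: zero the device buffer.
    alpaka::mem::view::set(memBufAcc, 0, extents1D, queue);

    // cudaMemcpyAsync: the copy direction is deduced from the buffer types.
    alpaka::mem::view::copy(memBufHost, memBufAcc, extents1D, queue);

    alpaka::wait::wait(queue);
    // No cudaFree: buffers are reference counted and freed automatically.
}
```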

Execution Control

CUDA                          alpaka
--------------------------    ------------------------------------------------------------
cudaFuncGetAttributes         -
cudaFuncSetCacheConfig        -
cudaFuncSetSharedMemConfig    -
cudaLaunchKernel              alpaka::kernel::exec<TAcc>(queue, workDiv, kernel, params…) and
                              alpaka::kernel::BlockSharedExternMemSizeBytes<TKernel<TAcc>>::getBlockSharedExternMemSizeBytes<…>(…)
cudaSetDoubleForDevice        n/a (alpaka assumes double support)
cudaSetDoubleForHost          n/a (alpaka assumes double support)
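The kernel launch row expands to the following sketch; the kernel, the work-division values, and the type aliases are illustrative, and the WorkDivMembers construction may differ between alpaka versions:

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

// A made-up kernel: each thread writes its global index.
struct WriteIdxKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const & acc, int * const out) const
    {
        auto const i = alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        out[i] = static_cast<int>(i);
    }
};

template<typename TQueue>
void launch(TQueue & queue, int * const out)
{
    using Dim = alpaka::dim::DimInt<1u>;
    using Idx = std::size_t;
    using Acc = alpaka::acc::AccGpuCudaRt<Dim, Idx>;
    using Vec = alpaka::vec::Vec<Dim, Idx>;

    // CUDA: writeIdxKernel<<<16, 256, 0, stream>>>(out);
    auto const workDiv = alpaka::workdiv::WorkDivMembers<Dim, Idx>(
        Vec(Idx{16}),   // blocks per grid
        Vec(Idx{256}),  // threads per block
        Vec(Idx{1}));   // elements per thread

    alpaka::kernel::exec<Acc>(queue, workDiv, WriteIdxKernel{}, out);
}
```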

Occupancy

CUDA                                                      alpaka
------------------------------------------------------    ------
cudaOccupancyMaxActiveBlocksPerMultiprocessor             -
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags    -

Unified Addressing

CUDA                        alpaka
------------------------    ------
cudaPointerGetAttributes    -

Peer Device Memory Access

CUDA                           alpaka
---------------------------    --------------------------------
cudaDeviceCanAccessPeer        -
cudaDeviceDisablePeerAccess    -
cudaDeviceEnablePeerAccess     automatically done when required

OpenGL, Direct3D, VDPAU, EGL, Graphics Interoperability

not available

Texture/Surface Reference/Object Management

not available

Version Management

not available