CUDA GPUs

Mapping the abstraction to GPUs supporting CUDA is straightforward because the hierarchy levels are identical up to the element level. Blocks of warps of threads are therefore mapped directly to their CUDA equivalents.

The element level is supported through an additional run-time variable containing the extent of elements per thread. This variable is accessible to all threads and is ideally placed in constant device memory for fast access.

Porting CUDA to alpaka

Nearly all CUDA functionality can be directly mapped to alpaka function calls. A major difference is that CUDA expects block and grid sizes in (x, y, z) order, whereas alpaka uses the C/C++ array indexing scheme [z][y][x]. In both cases, x is the innermost / fastest-running index.

Furthermore, alpaka does not require the indices and extents to be 3-dimensional. The accelerators are templated on the dimensionality and support arbitrary dimensions. NOTE: Currently the CUDA implementation is restricted to a maximum of 3 dimensions!

NOTE: Be careful when mixing alpaka and non-alpaka CUDA code. The CUDA-accelerator back-end can change the current CUDA device and will NOT set the device back to the one active prior to the invocation of the alpaka function.
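If surrounding non-alpaka code depends on the current CUDA device, one defensive pattern is to save and restore the device manually around alpaka calls. A minimal sketch using only CUDA runtime calls; callAlpakaCode() is a hypothetical placeholder:

```cpp
#include <cuda_runtime.h>

void callAlpakaCode(); // placeholder for code invoking alpaka functions

void callWithDeviceRestored()
{
    int device = 0;
    cudaGetDevice(&device); // remember the currently active CUDA device

    callAlpakaCode();       // may change the current CUDA device

    cudaSetDevice(device);  // restore it for subsequent non-alpaka CUDA code
}
```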

Programming Interface

Function Attributes

CUDA                   alpaka
-------------------    -------------------
__host__               ALPAKA_FN_HOST
__device__             ALPAKA_FN_ACC
__global__             ALPAKA_FN_ACC
__host__ __device__    ALPAKA_FN_HOST_ACC

Note

You cannot call CUDA-only methods, except when ALPAKA_ACC_GPU_CUDA_ONLY_MODE is enabled.
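As an illustration of the attribute mapping, a CUDA __global__ kernel becomes a function object whose operator() is marked ALPAKA_FN_ACC and receives the accelerator as its first argument. A sketch (the kernel itself is made up):

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

// CUDA: __global__ void axpy(float a, float const* x, float* y, std::size_t n);
struct AxpyKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(
        TAcc const & acc,
        float const a,
        float const * const x,
        float * const y,
        std::size_t const n) const
    {
        // Global thread index, equivalent to blockIdx.x * blockDim.x + threadIdx.x:
        auto const i = alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        if(i < n)
            y[i] = a * x[i] + y[i];
    }
};
```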

Memory

CUDA            alpaka
------------    ---------------------------------------------------------------------
__shared__      alpaka::block::shared::st::allocVar<std::uint32_t, __COUNTER__>(acc)
__constant__    ALPAKA_STATIC_ACC_MEM_CONSTANT
__device__      ALPAKA_STATIC_ACC_MEM_GLOBAL

template<typename T, std::size_t TuniqueId, typename TBlockSharedMemSt>
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::block::shared::st::allocVar(TBlockSharedMemSt const &blockSharedMemSt)

Allocates a variable in block shared static memory.

The allocated variable is uninitialized and not default constructed!

Template Parameters
  • T: The element type.

  • TuniqueId: An id that is unique inside the kernel.

  • TBlockSharedMemSt: The block shared allocator implementation type.

Parameters
  • blockSharedMemSt: The block shared allocator implementation.
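A usage sketch: since the variable is uninitialized, one thread of the block should initialize it and all threads must synchronize before reading it. The surrounding kernel is assumed; acc is the accelerator argument:

```cpp
// CUDA: __shared__ std::uint32_t counter;
auto & counter = alpaka::block::shared::st::allocVar<std::uint32_t, __COUNTER__>(acc);

// Not default constructed: initialize explicitly and synchronize the block.
if(alpaka::idx::getIdx<alpaka::Block, alpaka::Threads>(acc)[0] == 0u)
    counter = 0u;
alpaka::block::sync::syncBlockThreads(acc);
```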

ALPAKA_STATIC_ACC_MEM_CONSTANT

This macro defines a variable lying in constant accelerator device memory.

Example: ALPAKA_STATIC_ACC_MEM_CONSTANT int i;

These variables behave like ordinary variables when used at file scope. They have external linkage (are accessible from other compilation units). If you want to access such a variable from a different compilation unit, you have to declare it as extern: extern ALPAKA_STATIC_ACC_MEM_CONSTANT int i;. As with ordinary variables, only one definition is allowed (ODR); violating this can lead to linker errors.

In contrast to ordinary variables, you cannot define such variables as static compilation-unit-local variables with internal linkage, because this is forbidden by CUDA.

ALPAKA_STATIC_ACC_MEM_GLOBAL

This macro defines a variable lying in global accelerator device memory.

Example: ALPAKA_STATIC_ACC_MEM_GLOBAL int i;

These variables behave like ordinary variables when used at file scope. They have external linkage (are accessible from other compilation units). If you want to access such a variable from a different compilation unit, you have to declare it as extern: extern ALPAKA_STATIC_ACC_MEM_GLOBAL int i;. As with ordinary variables, only one definition is allowed (ODR); violating this can lead to linker errors.

In contrast to ordinary variables, you cannot define such variables as static compilation-unit-local variables with internal linkage, because this is forbidden by CUDA.

Index / Work Division

CUDA         alpaka
---------    -----------------------------------------------------------------
threadIdx    alpaka::idx::getIdx<alpaka::Block, alpaka::Threads>(acc)
blockIdx     alpaka::idx::getIdx<alpaka::Grid, alpaka::Blocks>(acc)
blockDim     alpaka::workdiv::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)
gridDim      alpaka::workdiv::getWorkDiv<alpaka::Grid, alpaka::Blocks>(acc)
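Putting these calls together, the common CUDA pattern for a global 1D thread index can be written as follows inside a kernel. A sketch; acc is the accelerator argument, and index 0 is the only dimension of a 1-dimensional accelerator:

```cpp
// CUDA: std::size_t const i = blockIdx.x * blockDim.x + threadIdx.x;
auto const gridBlockIdx   = alpaka::idx::getIdx<alpaka::Grid, alpaka::Blocks>(acc)[0];
auto const blockThreadDim = alpaka::workdiv::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)[0];
auto const blockThreadIdx = alpaka::idx::getIdx<alpaka::Block, alpaka::Threads>(acc)[0];
auto const i = gridBlockIdx * blockThreadDim + blockThreadIdx;

// The same value in a single call:
auto const iAlt = alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
```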

Types

CUDA    alpaka
----    ----------------------------
dim3    alpaka::vec::Vec<TDim, TVal>

CUDA Runtime API

The following tables list the functions available in the CUDA Runtime API and their equivalent alpaka functions:

Device Management

CUDA                                alpaka
--------------------------------    ---------------------------------------------------------------
cudaChooseDevice                    -
cudaDeviceGetAttribute              -
cudaDeviceGetByPCIBusId             -
cudaDeviceGetCacheConfig            -
cudaDeviceGetLimit                  -
cudaDeviceGetP2PAttribute           -
cudaDeviceGetPCIBusId               -
cudaDeviceGetSharedMemConfig        -
cudaDeviceGetStreamPriorityRange    -
cudaDeviceReset                     alpaka::dev::reset(device)
cudaDeviceSetCacheConfig            -
cudaDeviceSetLimit                  -
cudaDeviceSetSharedMemConfig        -
cudaDeviceSynchronize               void alpaka::wait::wait(device)
cudaGetDevice                       n/a (no current device)
cudaGetDeviceCount                  std::size_t alpaka::pltf::getDevCount<TPltf>()
cudaGetDeviceFlags                  -
cudaGetDeviceProperties             alpaka::acc::getAccDevProps(dev) (only some properties available)
cudaIpcCloseMemHandle               -
cudaIpcGetEventHandle               -
cudaIpcGetMemHandle                 -
cudaIpcOpenEventHandle              -
cudaIpcOpenMemHandle                -
cudaSetDevice                       n/a (no current device)
cudaSetDeviceFlags                  -
cudaSetValidDevices                 -

Error Handling

CUDA                   alpaka
-------------------    -------------------------------------------------------
cudaGetErrorName       n/a (handled internally, available in exception message)
cudaGetErrorString     n/a (handled internally, available in exception message)
cudaGetLastError       n/a (handled internally)
cudaPeekAtLastError    n/a (handled internally)

Queue Management

CUDA                            alpaka
----------------------------    -----------------------------------------------------
cudaStreamAddCallback           alpaka::queue::enqueue(queue, [](){ doSomething(); })
cudaStreamAttachMemAsync        -
cudaStreamCreate                queue = alpaka::queue::QueueCudaRtNonBlocking(device); or
                                queue = alpaka::queue::QueueCudaRtBlocking(device);
cudaStreamCreateWithFlags       see cudaStreamCreate (cudaStreamNonBlocking hard-coded)
cudaStreamCreateWithPriority    -
cudaStreamDestroy               n/a (destructor)
cudaStreamGetFlags              -
cudaStreamGetPriority           -
cudaStreamQuery                 bool alpaka::queue::empty(queue)
cudaStreamSynchronize           void alpaka::wait::wait(queue)
cudaStreamWaitEvent             void alpaka::wait::wait(queue, event)
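The queue operations above combine as in the following sketch, assuming the first CUDA device of the CUDA platform; the enqueued host task body is made up:

```cpp
#include <alpaka/alpaka.hpp>

int main()
{
    auto const dev = alpaka::pltf::getDevByIdx<alpaka::pltf::PltfCudaRt>(0u);

    // cudaStreamCreate with cudaStreamNonBlocking:
    auto queue = alpaka::queue::QueueCudaRtNonBlocking(dev);

    // cudaStreamAddCallback: enqueue an arbitrary host-side task.
    alpaka::queue::enqueue(queue, [](){ /* host-side work */ });

    // cudaStreamQuery / cudaStreamSynchronize:
    bool const finished = alpaka::queue::empty(queue);
    alpaka::wait::wait(queue);
    static_cast<void>(finished);

    // cudaStreamDestroy: implicit, when `queue` goes out of scope.
}
```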

Event Management

CUDA                        alpaka
------------------------    -----------------------------------------
cudaEventCreate             alpaka::event::Event<TQueue> event(dev);
cudaEventCreateWithFlags    -
cudaEventDestroy            n/a (destructor)
cudaEventElapsedTime        -
cudaEventQuery              bool alpaka::event::test(event)
cudaEventRecord             void alpaka::queue::enqueue(queue, event)
cudaEventSynchronize        void alpaka::wait::wait(event)

Memory Management

CUDA                          alpaka
--------------------------    --------------------------------------------------------------------------
cudaArrayGetInfo              -
cudaFree                      n/a (automatic memory management with reference-counted memory handles)
cudaFreeArray                 -
cudaFreeHost                  n/a
cudaFreeMipmappedArray        -
cudaGetMipmappedArrayLevel    -
cudaGetSymbolAddress          -
cudaGetSymbolSize             -
cudaHostAlloc                 n/a; an existing buffer can be pinned using alpaka::mem::buf::prepareForAsyncCopy(memBuf)
cudaHostGetDevicePointer      -
cudaHostGetFlags              -
cudaHostRegister              -
cudaHostUnregister            -
cudaMalloc                    alpaka::mem::buf::alloc<TElement>(device, extents1D)
cudaMalloc3D                  alpaka::mem::buf::alloc<TElement>(device, extents3D)
cudaMalloc3DArray             -
cudaMallocArray               -
cudaMallocHost                alpaka::mem::buf::alloc<TElement>(device, extents) (1D, 2D and 3D supported)
cudaMallocManaged             -
cudaMallocMipmappedArray      -
cudaMallocPitch               alpaka::mem::buf::alloc<TElement>(device, extents2D)
cudaMemAdvise                 -
cudaMemGetInfo                alpaka::dev::getMemBytes / alpaka::dev::getFreeMemBytes
cudaMemPrefetchAsync          -
cudaMemRangeGetAttribute      -
cudaMemRangeGetAttributes     -
cudaMemcpy                    alpaka::mem::view::copy(memBufDst, memBufSrc, extents1D)
cudaMemcpy2D                  alpaka::mem::view::copy(memBufDst, memBufSrc, extents2D)
cudaMemcpy2DArrayToArray      -
cudaMemcpy2DAsync             alpaka::mem::view::copy(memBufDst, memBufSrc, extents2D, queue)
cudaMemcpy2DFromArray         -
cudaMemcpy2DFromArrayAsync    -
cudaMemcpy2DToArray           -
cudaMemcpy2DToArrayAsync      -
cudaMemcpy3D                  alpaka::mem::view::copy(memBufDst, memBufSrc, extents3D)
cudaMemcpy3DAsync             alpaka::mem::view::copy(memBufDst, memBufSrc, extents3D, queue)
cudaMemcpy3DPeer              alpaka::mem::view::copy(memBufDst, memBufSrc, extents3D)
cudaMemcpy3DPeerAsync         alpaka::mem::view::copy(memBufDst, memBufSrc, extents3D, queue)
cudaMemcpyArrayToArray        -
cudaMemcpyAsync               alpaka::mem::view::copy(memBufDst, memBufSrc, extents1D, queue)
cudaMemcpyFromArray           -
cudaMemcpyFromArrayAsync      -
cudaMemcpyFromSymbol          -
cudaMemcpyFromSymbolAsync     -
cudaMemcpyPeer                alpaka::mem::view::copy(memBufDst, memBufSrc, extents1D)
cudaMemcpyPeerAsync           alpaka::mem::view::copy(memBufDst, memBufSrc, extents1D, queue)
cudaMemcpyToArray             -
cudaMemcpyToArrayAsync        -
cudaMemcpyToSymbol            -
cudaMemcpyToSymbolAsync       -
cudaMemset                    alpaka::mem::view::set(memBufDst, byte, extents1D)
cudaMemset2D                  alpaka::mem::view::set(memBufDst, byte, extents2D)
cudaMemset2DAsync             alpaka::mem::view::set(memBufDst, byte, extents2D, queue)
cudaMemset3D                  alpaka::mem::view::set(memBufDst, byte, extents3D)
cudaMemset3DAsync             alpaka::mem::view::set(memBufDst, byte, extents3D, queue)
cudaMemsetAsync               alpaka::mem::view::set(memBufDst, byte, extents1D, queue)
make_cudaExtent               -
make_cudaPitchedPtr           -
make_cudaPos                  -
cudaMemcpyHostToDevice        n/a (direction of copy is determined automatically)
cudaMemcpyDeviceToHost        n/a (direction of copy is determined automatically)
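A sketch combining the allocation, set, and copy rows above, using the signatures from the table; the device/queue setup and the extent construction are assumptions and may differ between alpaka versions:

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

int main()
{
    using Dim = alpaka::dim::DimInt<1u>;
    using Idx = std::size_t;

    auto const devHost = alpaka::pltf::getDevByIdx<alpaka::pltf::PltfCpu>(0u);
    auto const devAcc  = alpaka::pltf::getDevByIdx<alpaka::pltf::PltfCudaRt>(0u);
    auto queue = alpaka::queue::QueueCudaRtBlocking(devAcc);

    auto const extents1D = alpaka::vec::Vec<Dim, Idx>(Idx{1024});

    // cudaMallocHost / cudaMalloc:
    auto memBufHost = alpaka::mem::buf::alloc<float>(devHost, extents1D);
    auto memBufAcc  = alpaka::mem::buf::alloc<float>(devAcc, extents1D);

    // cudaMemsetAsync: zero the device buffer.
    alpaka::mem::view::set(memBufAcc, 0, extents1D, queue);

    // cudaMemcpyAsync: the copy direction is deduced from the buffer types.
    alpaka::mem::view::copy(memBufHost, memBufAcc, extents1D, queue);

    alpaka::wait::wait(queue);
    // No cudaFree: buffers are reference counted and freed automatically.
}
```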

Execution Control

CUDA                          alpaka
--------------------------    ------------------------------------------------------------
cudaFuncGetAttributes         -
cudaFuncSetCacheConfig        -
cudaFuncSetSharedMemConfig    -
cudaLaunchKernel              alpaka::kernel::exec<TAcc>(queue, workDiv, kernel, params…) and
                              alpaka::kernel::BlockSharedExternMemSizeBytes<TKernel<TAcc>>::getBlockSharedExternMemSizeBytes<…>(…)
cudaSetDoubleForDevice        n/a (alpaka assumes double support)
cudaSetDoubleForHost          n/a (alpaka assumes double support)
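The kernel launch row expands to the following sketch; the kernel, the work-division values, and the type aliases are illustrative, and the WorkDivMembers construction may differ between alpaka versions:

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

// A made-up kernel: each thread writes its global index.
struct WriteIdxKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const & acc, int * const out) const
    {
        auto const i = alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        out[i] = static_cast<int>(i);
    }
};

template<typename TQueue>
void launch(TQueue & queue, int * const out)
{
    using Dim = alpaka::dim::DimInt<1u>;
    using Idx = std::size_t;
    using Acc = alpaka::acc::AccGpuCudaRt<Dim, Idx>;
    using Vec = alpaka::vec::Vec<Dim, Idx>;

    // CUDA: writeIdxKernel<<<16, 256, 0, stream>>>(out);
    auto const workDiv = alpaka::workdiv::WorkDivMembers<Dim, Idx>(
        Vec(Idx{16}),   // blocks per grid
        Vec(Idx{256}),  // threads per block
        Vec(Idx{1}));   // elements per thread

    alpaka::kernel::exec<Acc>(queue, workDiv, WriteIdxKernel{}, out);
}
```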

Occupancy

CUDA                                                      alpaka
------------------------------------------------------    ------
cudaOccupancyMaxActiveBlocksPerMultiprocessor             -
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags    -

Unified Addressing

CUDA                        alpaka
------------------------    ------
cudaPointerGetAttributes    -

Peer Device Memory Access

CUDA                           alpaka
---------------------------    --------------------------------
cudaDeviceCanAccessPeer        -
cudaDeviceDisablePeerAccess    -
cudaDeviceEnablePeerAccess     automatically done when required

OpenGL, Direct3D, VDPAU, EGL, Graphics Interoperability

not available

Texture/Surface Reference/Object Management

not available

Version Management

not available