Introduction

The alpaka library defines and implements an abstract interface for the hierarchical redundant parallelism model. This model exploits task- and data-parallelism as well as memory hierarchies at all levels of current multi-core architectures. This allows to achieve performance portability across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a specific accelerator. All hardware types (CPUs, GPUs and other accelerators) are treated and can be programmed in the same way. The alpaka library provides back-ends for CUDA, OpenMP, HIP, SYCL and other technologies. The trait-based C++ template interface provided allows for straightforward user-defined extension of the library to support other accelerators.

The library name alpaka is an acronym standing for Abstraction Library for Parallel Kernel Acceleration.

About alpaka

alpaka is …

Abstract: It describes parallel execution on multiple hierarchy levels. It allows to implement a mapping to various hardware architectures but is no optimal mapping itself.
Sustainable: alpaka decouples the application from the availability of different accelerator frameworks in different versions, such as OpenMP, CUDA, HIP, etc. (50% on the way to reach full performance portability).
Heterogeneous: An identical algorithm / kernel can be executed on heterogeneous parallel systems by selecting the target device. This allows the best performance for each algorithm and/or a good utilization of the system without major code changes.
Maintainable: alpaka allows to provide a single version of the algorithm / kernel that can be used by all back-ends. There is no need for “copy and paste” kernels with different API calls for different accelerators. All the accelerator dependent implementation details are hidden within the alpaka library.
Testable: Due to the easy back-end switch, no special hardware is required for testing the kernels. Even if the simulation itself always uses the CUDA back-end, the tests can completely run on a CPU. As long as the alpaka library is thoroughly tested for compatibility between the acceleration back-ends, the user simulation code is guaranteed to generate identical results (ignoring rounding errors / non-determinism) and is portable without any changes.
Optimizable: Everything in alpaka can be replaced by user code to optimize for special use-cases.
Extensible: Every concept described by the alpaka abstraction can be implemented by users. Therefore it is possible to non-intrusively define new devices, queues, buffer types or even whole accelerator back-ends.
Data Structure Agnostic: The user can use and define arbitrary data structures.

alpaka does not …

Automatically provide an optimal mapping of kernels to various acceleration platforms: Except in trivial examples an optimal execution always depends on suitable selected data structures. An adaptive selection of data structures is a separate topic that has to be implemented in a distinct library.
Automatically optimize concurrent data access: alpaka does not provide feature to create optimized memory layouts.
Handle differences in arithmetic operations: For example, due to different rounding or different implementations of floating point operations, results can differ slightly between accelerators.
Guarantee determinism of results: Due to the freedom of the library to reorder or repartition the threads within the tasks it is not possible or even desired to preserve deterministic results. For example, the non-associativity of floating point operations give non-deterministic results within and across accelerators.

The alpaka library is aimed at parallelization on shared memory, i.e. within nodes of a cluster. It does not compete with libraries for distribution of processes across nodes and communication among those. For these purposes libraries like MPI (Message Passing Interface) or others should be used. MPI is situated one layer higher and can be combined with alpaka to facilitate the hardware of a whole heterogeneous cluster. The alpaka library can be used for parallelization within nodes, MPI for parallelization across nodes.