Library Interface¶
As described in the chapter about the Abstraction, the general design of the library is very similar to CUDA and OpenCL but extends both by some points, while not requiring any language extensions. General interface design as well as interface implementation decisions differentiating alpaka from those libraries are described in the Rationale section. It uses C++ because it is one of the most performant languages available on nearly all systems. Furthermore, C++14 allows to describe the concepts in a very abstract way that is not possible with many other languages. The alpaka library extensively makes use of advanced functional C++ template meta-programming techniques. The Implementation Details section discusses the C++ library and the way it provides extensibility and optimizability.
Structure¶
The alpaka library allows offloading of computations from the host execution domain to the accelerator execution domain, whereby they are allowed to be identical.
In the abstraction hierarchy the library code is interleaved with user supplied code as is depicted in the following figure.
User code invokes library functions, which in turn execute the user provided thread function (kernel) in parallel on the accelerator. The kernel in turn calls library functions when accessing accelerator properties and methods. Additionally, the user can enhance or optimize the library implementations by extending or replacing specific parts.
The alpaka abstraction itself only defines requirements a type has to fulfill to be usable with the template functions the library provides. These type constraints are called concepts in C++.
A concept is a set of requirements consisting of valid expressions, associated types, invariants, and complexity guarantees. A type that satisfies the requirements is said to model the concept. A concept can extend the requirements of another concept, which is called refinement. BoostConcepts
Concepts allow to safely define polymorphic algorithms that work with objects of many different types.
The alpaka library implements a stack of concepts and their interactions modeling the abstraction defined in the previous chapter. Furthermore, default implementations for various devices and accelerators modeling those are included in the library. The interaction of the main user facing concepts can be seen in the following figure.
For each type of Device
there is a Platform
for enumerating the available Device``s.
A ``Device
is the requirement for creating Queues
and Events
as it is for allocating Buffers
on the respective Device
. Buffers
can be copied, their memory be set and they can be pinned or mapped.
Copying and setting a buffer requires the corresponding Copy
and Set
tasks to be enqueued into the Queue
.
An Event
can be enqueued into a Queue
and its completion state can be queried by the user.
It is possible to wait for (synchronize with) a single Event
, a Queue
or a whole Device
.
An Executor
can be enqueued into a Queue
and will execute the Kernel
(after all previous tasks in the queue have been completed).
The Kernel
in turn has access to the Accelerator
it is running on.
The Accelerator
provides the Kernel
with its current index in the block or grid, their extents or other data as well as it allows to allocate shared memory, execute atomic operations and many more.
Interface Usage¶
Accelerator Functions¶
Functions that should be executable on an accelerator have to be annotated with the execution domain (one of ALPAKA_FN_HOST
, ALPAKA_FN_ACC
and ALPAKA_FN_HOST_ACC
).
They most probably also require access to the accelerator data and methods, such as indices and extents as well as functions to allocate shared memory and to synchronize all threads within a block.
Therefore the accelerator has to be passed in as a templated constant reference parameter as can be seen in the following code snippet.
template<
typename TAcc>
ALPAKA_FN_ACC auto doSomethingOnAccelerator(
TAcc const & acc/*,
...*/) // Arbitrary number of parameters
-> int // Arbitrary return type
{
//...
}
Kernel Definition¶
A kernel is a special function object which has to conform to the following requirements:
it has to fulfill the
std::is_trivially_copyable
trait (has to be copyable via memcpy)the
operator()
is the kernel entry point * it has to be an accelerator executable function * it has to returnvoid
. * its first argument has to be the accelerator (templated for arbitrary accelerator back-ends).
The following code snippet shows a basic example of a kernel function object.
struct MyKernel
{
template<
typename TAcc> // Templated on the accelerator type.
ALPAKA_FN_ACC // Macro marking the function to be executable on all accelerators.
auto operator()( // The function / kernel to execute.
TAcc const & acc/*, // The specific accelerator implementation.
...*/) const // Must be 'const'.
-> void
{
//...
}
// Class can have members but has to be std::is_trivially_copyable.
// Classes must not have pointers or references to host memory!
};
The kernel function object is shared across all threads in all blocks.
Due to the block execution order being undefined, there is no safe and consistent way of altering state that is stored inside of the function object.
Therefore, the operator()
of the kernel function object has to be const
and is not allowed to modify any of the object members.
Index and Work Division¶
The alpaka::getWorkDiv
and the alpaka::getIdx
functions both return a vector of the dimensionality the accelerator has been defined with.
They are parametrized by the origin of the calculation as well as the unit in which the values are calculated.
For example, alpaka::getWorkDiv<alpaka::Grid, alpaka::Threads>(acc)
returns a vector with the extents of the grid in units of threads.
Memory Management¶
The memory allocation function of the alpaka library (alpaka::allocBuf<TElem>(device, extents)
) is uniform for all devices, even for the host device.
It does not return raw pointers but reference counted memory buffer objects that remove the necessity for manual freeing and the possibility of memory leaks.
Additionally the memory buffer objects know their extents, their pitches as well as the device they reside on.
This allows buffers that possibly reside on different devices with different pitches to be copied only by providing the buffer objects as well as the extents of the region to copy (alpaka::memcpy(bufDevA, bufDevB, copyExtents
).
Kernel Execution¶
The following source code listing shows the execution of a kernel by enqueuing the execution task into a queue.
// Define the dimensionality of the task.
using Dim = alpaka::DimInt<1u>;
// Define the type of the indexes.
using Idx = std::size_t;
// Define the accelerator to use.
using Acc = alpaka::AccCpuSerial<Dim, Idx>;
// Select the queue type.
using Queue = alpaka::QueueCpuNonBlocking;
// Select a device to execute on.
auto devAcc(alpaka::getDevByIdx<alpaka::PltfCpu>(0));
// Create a queue to enqueue the execution into.
Queue queue(devAcc);
// Create a 1-dimensional work division with 256 blocks a 16 threads.
auto const workDiv(alpaka::WorkDivMembers<Dim, Idx>(256u, 16u);
// Create an instance of the kernel function object.
MyKernel kernel;
// Enqueue the execution task into the queue.
alpaka::exec<Acc>(queue, workDiv, kernel/*, arguments ...*/);
The dimensionality of the task as well as the type for index and extent have to be defined explicitly. Following this, the type of accelerator to execute on, as well as the type of the queue have to be defined. For both of these types instances have to be created. For the accelerator this has to be done indirectly by enumerating the required device via the device manager, whereas the queue can be created directly.
To execute the kernel, an instance of the kernel function object has to be constructed. Following this, an execution task combining the work division (grid and block sizes) with the kernel function object and the bound invocation arguments has to be created. After that this task can be enqueued into a queue for immediate or later execution (depending on the queue used).