Heterogeneous programming with Kokkos

Geert Jan Bex

Motivation

GPUs can offer significant speedup

But:

Programming GPUs is non-trivial
Not all algorithms are suitable
Code may not be portable
Various vendors: NVIDIA, AMD, Intel

L’embarras du choix

CUDA (NVIDIA specific)
HIP (AMD specific)
OpenCL
OpenMP
OpenACC
Kokkos
SYCL
Alpaka
…

Choices & why

Kokkos

Motivation

Vendor agnostic
Relatively widely used
Good/reasonable documentation
C++ based/Python bindings
Can target non-GPU accelerators (e.g., FPGAs)

Kokkos

Pros

Vendor agnostic
Supports many backends
- CPU: Serial, OpenMP, Pthreads, HPX
- GPU: CUDA, HIP, SYCL
Open source
Conceptually fairly straightforward
Fortran/Python interoperability

Cons

Mainly (non-trivial) C++
Builds for specific targets
Getting good performance can be tricky

Programming paradigms

Shared memory parallelism
Task & loop parallelism
Distributed computing
- MPI (Message Passing Interface)
- HPX (High Performance ParalleX)
- Legion (https://legion.stanford.edu/)
- PGAS (Partitioned Global Address Space) using Kokkos remote spaces

Main concepts

Memory spaces: place your data
Execution spaces: execute your code
Execution pattern: parallel constructs

Installation & building

Easy with CMake
Through Spack

Building with CMake:

find_package(Kokkos REQUIRED)
...
target_link_libraries(my_target Kokkos::kokkos)

Hello world

// include Kokkos header files

int main(int argc, char* argv[]) {
    // initialize Kokkos

    // print in parallel

    // finalize Kokkos
    return 0;
}

Headers

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    // initialize Kokkos

    // print in parallel

    // finalize Kokkos
    return 0;
}

Initialize/finalize

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);

    // print in parallel

    Kokkos::finalize();
    return 0;
}

1: No Kokkos can be used before Kokkos::initialize(...)
2: No Kokkos can be used after Kokkos::finalize()

Parallel for

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
3   Kokkos::parallel_for(15, KOKKOS_LAMBA(const int i) {
        Kokkos::printf("hello from i = %d\n", i);
    }
    Kokkos::finalize();
    return 0;
}

1: 15: number of iterations, lambda function to execute
2: i: iteration-number from 0 to 14

Kokkos views

View = \(N\)-dimensional arrays

contains
- basic types (int, float, double,…)
- struct of basic types
in memory space of host or device
memory layout can be controled, sane defaults
managed by reference counting

Views examples

Kokkos::View<double*> a("a", 10);
Kokkos::View<double[5]> b("b");
Kokkos::View<doucle**> c("c", 10, 15);
Kokkos::View<double*[20]> d("d", 10);
Kokkos::View<double[7][9]> e("e");
Kokkos::View<double***> f("f", 10, 10, 10);
Kokkos::View<double*[13][17]> f("f", 10);

1: 1D array of 10 elements, runtime
2: 1D array of 5 elements, compile time
3: 2D array of 10x15 elements, runtime
4: 2D array of 10x20 elements, rows runtime
5: 2D array of 7x9 elements, compile time
6: 3D array of 10x10x10 elements, runtime
7: 3D array of 10x13x17 elements, first dimension runtime

Accessing views

Views can be accessed using operator() with indices

a(0) = 1.0;
auto x {a(1) + b(2)};
c(2, 3) = 3.0;
f(8, 9, 10) = 6.0;

Views and memory spaces

By default in the default execution space
Can be placed in host or device memory space
- Kokkos::HostSpace on host
- Kokkos::CudaSpace on CUDA device
- Kokkos::CudaUVMSpace in CUDA UVM (Unified Virtual Memory)
- Kokkos::HipSpace on HIP device

Warning

Can only be accessed in execution space associated with memory space

Memory layout

Two main memory layouts

Kokkos::LayoutLeft = column-major
Kokkos::LayoutRight = row-major

Example:

Kokkos::View<double**, Kokkos::LayoutLeft> a("a", 10, 15);
Kokkos::View<double**, Kokkos::LayoutRight> b("b", 10, 15);

Default “best” for memory space, parallel access over leftmost index

Left for device memory spaces: coalesced memory access
Right for host memory space: cache-friendly access

Deep copy

Copy data between views
Bit copy, no conversion
- views must have the same layout and padding
Execution space can be specified

Warning

Kokkos::deep_copy between different memory spaces may not work for non-contiguous data, e.g., subviews

Mirror views

Required for, e.g.,

initializing views from file
writing views to file
non-paralellizable operations

Kokkos::View<double*> a("a", 10);
auto a_host = Kokkos::create_mirror(a);
Kokkos::deep_copy(a_host, a);

1: Create a mirror view on host
2: Copy data from device to host

Takes care of layout differences

Memory traits

Kokkos::RandomAccess = random access to view’s elements
- if view is const and CudaSpace, use texture memory/fetch
- no other memory spaces take advantage
Kokkos::Atomic = all operations on view’s elements are atomic
Combine traits: use|

Using memory traits

Kokkos::View<int*> a("a" , 100);
Kokkos::View<int*, Kokkos::MemoryTraits<Kokkos::Atomic> > a_atomic = a;
    
a_atomic(1)++;

1: Create ordinary view
2: Create atomic view from ordinary view
3: All operations on a_atomic are atomic

Unmanaged views

Useful for interfacing with external libraries
Views that do not manage memory
Can be created from raw pointers
No reference counting

Subviews

Similar to Fortran, MATLAB, Python slicing
Kokkos has no strided slices
Subviews are views
Same reference count as original view

Kokkos::View<double**> a("a", 10, 15);
auto b = Kokkos::subview(a, Kokkos::ALL, std::make_pair(2, 5));

Execution patterns

Parallel for:

Kokkos::parallel_for(10, KOKKOS_LAMBDA(int i) {
        a(i) = i;
    });

Parallel reduce:

double sum {0.0};
Kokkos::parallel_reduce(10, KOKKOS_LAMBDA(int i, double& sum) {
        sum += a(i);
    }, sum);

Parallel scan:

Kokkos::parallel_scan(10, KOKKOS_LAMBDA(int i, double& update, bool final) {
        update += a(i);
        if (final) a(i) = update;
    });

Execution policies

Kokkos::RangePolicy = range of indices

Kokkos::parallel_for(Kokkos::RangePolicy(0, 10),
    KOKKOS_LAMBDA(int i) {
        a(i) = i;
    });

Kokkos::MDRangePolicy = multi-dimensional range

Kokkos::parallel_for(
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {10, 15}),
    KOKKOS_LAMBDA(int i, int j) {
        a(i, j) = i + j;
    });

Tiling

Both can be defined with chunk size

Kokkos::parallel_for(Kokkos::RangePolicy(0, 10, 2),
    KOKKOS_LAMBDA(int i) {
        a(i) = i;
    });

Kokkos::parallel_for(
    Kokkos::MDRangePolicy( {0, 0}, {10, 15}, {2, 3}),
    KOKKOS_LAMBDA(int i, int j) {
        a(i, j) = i + j;
    });

Execution spaces

Kokkos::Cuda = parallel execution on CUDA device
Kokkos::Hip = parallel execution on HIP device
Kokkos::OpenMP = parallel execution on host
Kokkos::Serial = serial execution on host

Default execution space: highest enabled at compile time

Kokkos::DefaultExecutionSpace = default execution space
Kokkos::DefaultHostExecutionSpace = default host execution space

Using execution spaces

  Kokkos::View<int*, Kokkos::HostSpace> a("a", 10);
  Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::OpenMP>(0, 10),
      KOKKOS_LAMBDA(int i) {
          a(i) = i;
  });

Functors

Can be used instead of lambdas
For more complex operations
Can be reused
Can be templated
Have to implement some methods

Functor for parallel for

struct MyFunctor {
    private:
        Kokkos::View<double*> a_;
    public:
        explicit MyFunctor(Kokkos::View<double*> a_) : a(a_) {}
        KOKKOS_INLINE_FUNCTION
        void operator()(const int i) const {
            a(i) = i;
        }
};

1: View(s) to operator on
2: Constructor that takes view(s), initializes state
3: operator() that does the work, cfr. lambda function

Using parallel for functor

Kokkos::View<double*> a("a", 10);
Kokkos::View<double*> b("b", 10);
MyFunctor my_functor(a);
Kokkos::parallel_for(10, my_functor);
Kokkos::parallel_for(10, MyFunctor(b));

1: Create functor with view a to be used multiple times
2: Create functor with view b and use it directly

Functor for parallel reduce

struct MyReduceFunctor {
    private:
        Kokkos::View<double*> a_;
    public:
        explicit MyReduceFunctor(Kokkos::View<double*> a_) : a(a_) {}
        KOKKOS_INLINE_FUNCTION
        void operator()(const int i, double& sum) const {
            sum += a(i);
        }
        KOKKOS_INLINE_FUNCTION
        void join(double& dst, const double& src) const {
            dst += src;
        }
};

1: operator() with accumulator
2: join() to combine partial results

Using parallel reduce functor

Kokkos::View<double*> a("a", 10);
double sum {0.0};
MyReduceFunctor my_reduce_functor(a);
Kokkos::parallel_reduce(10, my_reduce_functor, sum);

Synchorization

Kokkos::parallel_for, Kokkos::parallel_scan: asynchronous (non-blocking)
Kokkos::parallel_reduce: asynchronous with Kokkos::View, blocking with scalar
Execution space has its own FIFO queue
Kokkos::deep_copy(dest, src): blocking copy
Kokkos::deep_copy(exec_space, dest, src): inserted in FIFO queue
Kokkos::fence() = wait for all parallel operations to finish
exec_space.fence() = wait for all parallel operations on exec_space to finish

Multiple parallel for

Multiple parallel for blocked

Parallel reduce on process scalar

Parallel reduce on `Kokkos::View`

Deep copy

Host and device execution space

Deep copy issue

Multiple execution spaces

Execution space deep copy

Asynchronous `Kokkos::deep_copy(exec_space, dest, src)`

Synchorization gotchas

Kokkos::deep_copy(dest, src) blcoks all execution spaces
Kokkos::fence() blocks all execution spaces
Kokkos::parallel_reduce with scalar blocks process
Kokkos::View that goes out of scope may block
Execution space Kokkos::OpenMP may block
Only one instance of Kokkos::OpenMP execution space

Nested parallelism

Challenge: how to divide the work?

Use Kokkos::TeamPolicy
- League: set of teams
- Team: set of threads (warp on NVIDIA/wavefront on AMD)
Use execution patterns for, reduce, scan
Threads in team synchronize with Kokkos::TeamThread::fence()
Team can share scratch pad memory
Threads can have private scratch pad

Nested execution policies

Top-level: TeamPolicy
Next level
- TeamThreadRange, TeamThreadMDRange
  - ThreadVectorRange, ThreadVectorMDRange
- TeamVectorRange, TeamVectorMDRange

Example 1: \(\vec{y} = A \cdot \vec{x}\)

Where \(\vec{y} \in \mathbb{R}^M\), \(A \in \mathbb{R}^{M \times N}\), \(\vec{x} \in \mathbb{R}^N\)

Kokkos::parallel_for("y=A*x", M,
    KOKKOS_LAMBDA(const int i) {
        for (int j = 0; j < N; ++j) {
            y(i) = A(i, j)*x(j);
        }
    });

Kokkos::parallel_for("y=A*x",
    Kokkos::TeamPolicy<>(M, Kokkos::AUTO),
    KOKKOS_LAMBDA(int i) {
        double row_sum = 0;
        Kokkos::parallel_reduce("y(i)"
            Kokkos::TeamThreadRange(team, N),
            [=] (int j, double& lsum) {
                lsum += A(i,j)*x(j);
            }, row_sum);
        y(i) = row_sum;
});

1: Create league of teams of threads
2: Exploit inner parallelism

Better performance for \(M \ll N\)

Example 2: \(s = \vec{y} \cdot A \cdot \vec{x}\)

Where \(s \in \mathbb{R}\), \(\vec{y} \in \mathbb{R}^M\), \(A \in \mathbb{R}^{M \times N}\), \(\vec{x} \in \mathbb{R}^N\)

using policy_t = Kokkos::TeamPolicy<>;
float s {0.0f};
Kokkos::parallel_reduce("y*A*x", policy_t(M, Kokkos::AUTO),
    KOKKOS_LAMBDA(const policy_t::member_type& team_member, float& sum) {
        const int i {team_member.league_rank()};
        float row_sum {0.0f};
        Kokkos::parallel_reduce(
            Kokkos::TeamThreadRange(team_member, N),
            [=] (const int j, float& row_sum) {
                row_sum += A(i, j)*x(j);
            }, row_sum);
        Kokkos::single(Kokkos::PerTeam(team_member),
            [&]() { result += y(i)*row_sum; });
    }, s);

1: Note: capture by value
2: Note: A is accessed row-wise!
3: Only one thread per team should update!
4: Note: capture by reference

Scratch pads

Level-0
- Limited size, fast
- “Manual L1 cache”
Level-1
- Larger, but less performance
- “Nearest fast storage”

Team or thread private memory
- Per work item temp storage
- Size \(\propto\) nr. threads
Manually managed cache
- Cache frequently used data
- exposes on-core scratch space (e.g., NVIDIA GPU shared memory)

How to use scratch pads?

TeamPolicy
Determine actual scratch pad size
Execution pattern
- Create scratch view
- Initialize scratch view using VectorTeamPolicy
- Use scratch view

Base version

Kokkos::View<double***> A("A", nr_elements, nr_qp, vector_size);
Kokkos::View<double**> B("B", nr_elements, vector_size);
Kokkos::View<double**> C("C", nr_elements, nr_qp);

Kokkos::parallel_for(
    "C=A*B", Kokkos::TeamPolicy<>(nr_elements, Kokkos::AUTO),
    KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type& team) {
      const int element{team.league_rank()};
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nr_qp),
                           [&](const int& qp) {
                             double total{0.0};
                             for (int i = 0; i < vector_size; ++i) {
                               total += A(element, qp, i) * B(element, i);
                             }
                             C(element, qp) = total;
                           });
    });

1: Note: each thread in a team uses same B(element, :)

Determine scratch pad size

using ScratchView =
    Kokkos::View<double*,
                 Kokkos::DefaultExecutionSpace::scratch_memory_space>;
size_t scratch_size = ScratchView::shmem_size(vector_size);

1: Define scratch view type based on execution space
2: Determine size of scratch space to store vector_size values of type double

Note

Required to ensure alignment taken into account!

Reserve scratch pad memory

const int level {0};
Kokkos::TeamPolicy<> policy(nr_elements, Kokkos::AUTO);
Kokkos::parallel_for(
    policy.set_scratch_size(level, Kokkos::PerTeam(scratch_size)),
    ...
);

1: Set the scratch pad level
2: Create the TeamPolicy to use
3: Allocate scratch pad at given level of given size

Scratch can be Kokkos::PerTeam and/or Kokkos::PerThread
Both levels can be used
Only allocates memory

Create scratch view and initialize

Kokkos::parallel_for(
    policy.set_scratch_size(level, Kokkos::PerTeam(scratch_size)),
    KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type& team) {
      const int element{team.league_rank()};
      ScratchView scratch(team.team_scratch(level), vector_size);
      Kokkos::parallel_for(
          Kokkos::TeamVectorRange(team, vector_size),
          [&](const int& i) { scratch(i) = B(element, i); });
      team.team_barrier();
      ...

1: Create scratch view at given level and of given size
2: Copy row of B in parallel
3: Barrier to ensure scratch is fully initialized

Warning

Kokkos::parallel_for can run asynchronously, threads will not synchronize

Putting it all together

Kokkos::parallel_for(
    "C=A*B", policy.set_scratch_size(level, Kokkos::PerTeam(scratch_size)),
    KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type& team) {
      const int element{team.league_rank()};
      ScratchView scratch(team.team_scratch(level), vector_size);
      Kokkos::parallel_for(
          Kokkos::TeamVectorRange(team, vector_size),
          [&](const int& i) { scratch(i) = B(element, i); });
      team.team_barrier();
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nr_qp),
                           [&](const int& qp) {
                             double total{0.0};
                             for (int i = 0; i < vector_size; ++i) {
                               total += A(element, qp, i) * scratch(i);
                             }
                             C(element, qp) = total;
                           });
    });

1: Each thread in team uses same scratch

All threads in team use per team scratch

Kokkos library

Algorithms
- Random numbers
- Sorting
Containers
- Unordered map
- Bitset

Warning

Experimental features!

Random numbers

#include <Kokkos_Random.hpp>
...
Kokkos::Random_XorShift64_Pool<Kokkos::DefaultExecutionSpace> random_pool(seed);
...
Kokkos::parallel_for("walks", nr_walks, KOKKOS_LAMBDA(const int walk) {
      auto generator = random_pool.get_state();
      int pos = max_steps;
      for (int step = 0; step < (walk % 2 == 0 ? max_steps : max_steps - 1);
           ++step) {
        pos += generator.drand() < 0.5 ? -1 : 1;
      }
      distance_atomic(pos)++;
    });

1: Include header file for random number generator
2: Create a pool of generators with given seed
3: For each thread, get generator from the pool
4: Get a random number \(x\) such that \(0 \le x \lt 1\)

Why kernels?

Don’t program your own matrix-matrix multiplication!
Portable performance
Can use vendor libraries

Note

Separate download/installation

Which kernels?

Dense linear algebra: BLAS, batched BLAS, some LAPACK
Sparse linear algebra, CRS (Compressed Row Storage)
Graph algorithms: distance-1/distance-2 graph coloring, bipartite graph coloring
Sparse solvers: variants of Gauss-Seidel
ODE solvers: explicit Runge-Kutta, implicit BDF, Newton

Heterogeneous programming with Kokkos

Motivation

L’embarras du choix

Choices & why

Kokkos

Pros

Cons

Programming paradigms

Main concepts

Installation & building

Hello world

Headers

Initialize/finalize

Parallel for

Kokkos views

Views examples

Accessing views

Views and memory spaces

Memory layout

Deep copy

Mirror views

Memory traits

Using memory traits

Unmanaged views

Subviews

Execution patterns

Execution policies

Tiling

Execution spaces

Using execution spaces

Functors

Functor for parallel for

Using parallel for functor

Functor for parallel reduce

Using parallel reduce functor

Synchorization

Multiple parallel for

Multiple parallel for blocked

Parallel reduce on process scalar

Parallel reduce on Kokkos::View

Deep copy

Host and device execution space

Deep copy issue

Multiple execution spaces

Execution space deep copy

Synchorization gotchas

Nested parallelism

Nested execution policies

Example 1: \(\vec{y} = A \cdot \vec{x}\)

Example 2: \(s = \vec{y} \cdot A \cdot \vec{x}\)

Scratch pads

How to use scratch pads?

Base version

Determine scratch pad size

Reserve scratch pad memory

Create scratch view and initialize

Putting it all together

Kokkos library

Random numbers

Why kernels?

Which kernels?

Reference & further reading

Parallel reduce on `Kokkos::View`