Heterogeneous programming with Kokkos

Motivation

GPUs can offer significant speedup

But:

  • Programming GPUs is non-trivial
  • Not all algorithms are suitable
  • Code may not be portable
  • Various vendors: NVIDIA, AMD, Intel

L’embarras du choix

  • CUDA (NVIDIA specific)
  • HIP (AMD specific)
  • OpenCL
  • OpenMP
  • OpenACC
  • Kokkos
  • SYCL
  • Alpaka

Choices & why

  • Kokkos

Motivation

  • Vendor agnostic
  • Relatively widely used
  • Good/reasonable documentation
  • C++ based/Python bindings
  • Can target non-GPU accelerators (e.g., FPGAs)

Kokkos

Pros

  • Vendor agnostic
  • Supports many backends
    • CPU: Serial, OpenMP, Pthreads, HPX
    • GPU: CUDA, HIP, SYCL
  • Open source
  • Conceptually fairly straightforward
  • Fortran/Python interoperability

Cons

  • Mainly (non-trivial) C++
  • Builds for specific targets
  • Getting good performance can be tricky

Programming paradigms

  • Shared memory parallelism
  • Task & loop parallelism
  • Distributed computing
    • MPI (Message Passing Interface)
    • HPX (High Performance ParalleX)
    • Legion (https://legion.stanford.edu/)
    • PGAS (Partitioned Global Address Space) using Kokkos remote spaces

Main concepts

  • Memory spaces: place your data
  • Execution spaces: execute your code
  • Execution pattern: parallel constructs

Installation & building

  • Easy with CMake
  • Through Spack

Building with CMake:

find_package(Kokkos REQUIRED)
...
target_link_libraries(my_target Kokkos::kokkos)

Hello world

// include Kokkos header files

int main(int argc, char* argv[]) {
    // initialize Kokkos

    // print in parallel

    // finalize Kokkos
    return 0;
}

Headers

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    // initialize Kokkos

    // print in parallel

    // finalize Kokkos
    return 0;
}

Initialize/finalize

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);

    // print in parallel

    Kokkos::finalize();
    return 0;
}
1
No Kokkos can be used before Kokkos::initialize(...)
2
No Kokkos can be used after Kokkos::finalize()

Parallel for

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
3   Kokkos::parallel_for(15, KOKKOS_LAMBA(const int i) {
        Kokkos::printf("hello from i = %d\n", i);
    }
    Kokkos::finalize();
    return 0;
}
1
15: number of iterations, lambda function to execute
2
i: iteration-number from 0 to 14

Kokkos views

View = \(N\)-dimensional arrays

  • contains
    • basic types (int, float, double,…)
    • struct of basic types
  • in memory space of host or device
  • memory layout can be controled, sane defaults
  • managed by reference counting

Views examples

Kokkos::View<double*> a("a", 10);
Kokkos::View<double[5]> b("b");
Kokkos::View<doucle**> c("c", 10, 15);
Kokkos::View<double*[20]> d("d", 10);
Kokkos::View<double[7][9]> e("e");
Kokkos::View<double***> f("f", 10, 10, 10);
Kokkos::View<double*[13][17]> f("f", 10);
1
1D array of 10 elements, runtime
2
1D array of 5 elements, compile time
3
2D array of 10x15 elements, runtime
4
2D array of 10x20 elements, rows runtime
5
2D array of 7x9 elements, compile time
6
3D array of 10x10x10 elements, runtime
7
3D array of 10x13x17 elements, first dimension runtime

Accessing views

Views can be accessed using operator() with indices

a(0) = 1.0;
auto x {a(1) + b(2)};
c(2, 3) = 3.0;
f(8, 9, 10) = 6.0;

Views and memory spaces

  • By default in the default execution space
  • Can be placed in host or device memory space
    • Kokkos::HostSpace on host
    • Kokkos::CudaSpace on CUDA device
    • Kokkos::CudaUVMSpace in CUDA UVM (Unified Virtual Memory)
    • Kokkos::HipSpace on HIP device

Warning

Can only be accessed in execution space associated with memory space

Memory layout

Two main memory layouts

  • Kokkos::LayoutLeft = column-major
  • Kokkos::LayoutRight = row-major

Example:

Kokkos::View<double**, Kokkos::LayoutLeft> a("a", 10, 15);
Kokkos::View<double**, Kokkos::LayoutRight> b("b", 10, 15);

Default “best” for memory space, parallel access over leftmost index

  • Left for device memory spaces: coalesced memory access
  • Right for host memory space: cache-friendly access

Deep copy

  • Copy data between views
  • Bit copy, no conversion
    • views must have the same layout and padding
  • Execution space can be specified

Warning

Kokkos::deep_copy between different memory spaces may not work for non-contiguous data, e.g., subviews

Mirror views

Required for, e.g.,

  • initializing views from file
  • writing views to file
  • non-paralellizable operations
Kokkos::View<double*> a("a", 10);
auto a_host = Kokkos::create_mirror(a);
Kokkos::deep_copy(a_host, a);
1
Create a mirror view on host
2
Copy data from device to host

Takes care of layout differences

Memory traits

  • Kokkos::RandomAccess = random access to view’s elements
    • if view is const and CudaSpace, use texture memory/fetch
    • no other memory spaces take advantage
  • Kokkos::Atomic = all operations on view’s elements are atomic
  • Combine traits: use|

Using memory traits

Kokkos::View<int*> a("a" , 100);
Kokkos::View<int*, Kokkos::MemoryTraits<Kokkos::Atomic> > a_atomic = a;
    
a_atomic(1)++;
1
Create ordinary view
2
Create atomic view from ordinary view
3
All operations on a_atomic are atomic

Unmanaged views

  • Useful for interfacing with external libraries
  • Views that do not manage memory
  • Can be created from raw pointers
  • No reference counting

Subviews

  • Similar to Fortran, MATLAB, Python slicing
  • Kokkos has no strided slices
  • Subviews are views
  • Same reference count as original view
Kokkos::View<double**> a("a", 10, 15);
auto b = Kokkos::subview(a, Kokkos::ALL, std::make_pair(2, 5));

Execution patterns

Parallel for:

Kokkos::parallel_for(10, KOKKOS_LAMBDA(int i) {
        a(i) = i;
    });

Parallel reduce:

double sum {0.0};
Kokkos::parallel_reduce(10, KOKKOS_LAMBDA(int i, double& sum) {
        sum += a(i);
    }, sum);

Parallel scan:

Kokkos::parallel_scan(10, KOKKOS_LAMBDA(int i, double& update, bool final) {
        update += a(i);
        if (final) a(i) = update;
    });

Execution policies

Kokkos::RangePolicy = range of indices

Kokkos::parallel_for(Kokkos::RangePolicy(0, 10),
    KOKKOS_LAMBDA(int i) {
        a(i) = i;
    });

Kokkos::MDRangePolicy = multi-dimensional range

Kokkos::parallel_for(
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {10, 15}),
    KOKKOS_LAMBDA(int i, int j) {
        a(i, j) = i + j;
    });

Tiling

Both can be defined with chunk size

Kokkos::parallel_for(Kokkos::RangePolicy(0, 10, 2),
    KOKKOS_LAMBDA(int i) {
        a(i) = i;
    });
Kokkos::parallel_for(
    Kokkos::MDRangePolicy( {0, 0}, {10, 15}, {2, 3}),
    KOKKOS_LAMBDA(int i, int j) {
        a(i, j) = i + j;
    });

Execution spaces

  1. Kokkos::Cuda = parallel execution on CUDA device
  2. Kokkos::Hip = parallel execution on HIP device
  3. Kokkos::OpenMP = parallel execution on host
  4. Kokkos::Serial = serial execution on host

Default execution space: highest enabled at compile time

  • Kokkos::DefaultExecutionSpace = default execution space
  • Kokkos::DefaultHostExecutionSpace = default host execution space

Using execution spaces

  Kokkos::View<int*, Kokkos::HostSpace> a("a", 10);
  Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::OpenMP>(0, 10),
      KOKKOS_LAMBDA(int i) {
          a(i) = i;
  });

Functors

  • Can be used instead of lambdas
  • For more complex operations
  • Can be reused
  • Can be templated
  • Have to implement some methods

Functor for parallel for

struct MyFunctor {
    private:
        Kokkos::View<double*> a_;
    public:
        explicit MyFunctor(Kokkos::View<double*> a_) : a(a_) {}
        KOKKOS_INLINE_FUNCTION
        void operator()(const int i) const {
            a(i) = i;
        }
};
1
View(s) to operator on
2
Constructor that takes view(s), initializes state
3
operator() that does the work, cfr. lambda function

Using parallel for functor

Kokkos::View<double*> a("a", 10);
Kokkos::View<double*> b("b", 10);
MyFunctor my_functor(a);
Kokkos::parallel_for(10, my_functor);
Kokkos::parallel_for(10, MyFunctor(b));
1
Create functor with view a to be used multiple times
2
Create functor with view b and use it directly

Functor for parallel reduce

struct MyReduceFunctor {
    private:
        Kokkos::View<double*> a_;
    public:
        explicit MyReduceFunctor(Kokkos::View<double*> a_) : a(a_) {}
        KOKKOS_INLINE_FUNCTION
        void operator()(const int i, double& sum) const {
            sum += a(i);
        }
        KOKKOS_INLINE_FUNCTION
        void join(double& dst, const double& src) const {
            dst += src;
        }
};
1
operator() with accumulator
2
join() to combine partial results

Using parallel reduce functor

Kokkos::View<double*> a("a", 10);
double sum {0.0};
MyReduceFunctor my_reduce_functor(a);
Kokkos::parallel_reduce(10, my_reduce_functor, sum);

Synchorization

  • Kokkos::parallel_for, Kokkos::parallel_scan: asynchronous (non-blocking)
  • Kokkos::parallel_reduce: asynchronous with Kokkos::View, blocking with scalar
  • Execution space has its own FIFO queue
  • Kokkos::deep_copy(dest, src): blocking copy
  • Kokkos::deep_copy(exec_space, dest, src): inserted in FIFO queue
  • Kokkos::fence() = wait for all parallel operations to finish
  • exec_space.fence() = wait for all parallel operations on exec_space to finish

Multiple parallel for

Multiple parallel for

Multiple parallel for blocked

Parallel for blocked by process

Parallel reduce on process scalar

Parallel reduce on process scalar

Parallel reduce on Kokkos::View

Parallel reduce on `Kokkos::View`

Deep copy

`Kokkos::deep_copy(dest, src) blocks

Host and device execution space

Using both host and device execution spaces

Deep copy issue

`Kokkos::deep_copy(dest, src)` issue

Multiple execution spaces

Using multiple execution spaces on device

Execution space deep copy

Asynchronous `Kokkos::deep_copy(exec_space, dest, src)`

Synchorization gotchas

  • Kokkos::deep_copy(dest, src) blcoks all execution spaces
  • Kokkos::fence() blocks all execution spaces
  • Kokkos::parallel_reduce with scalar blocks process
  • Kokkos::View that goes out of scope may block
  • Execution space Kokkos::OpenMP may block
  • Only one instance of Kokkos::OpenMP execution space

Nested parallelism

Challenge: how to divide the work?

  • Use Kokkos::TeamPolicy
    • League: set of teams
    • Team: set of threads (warp on NVIDIA/wavefront on AMD)
  • Use execution patterns for, reduce, scan
  • Threads in team synchronize with Kokkos::TeamThread::fence()
  • Team can share scratch pad memory
  • Threads can have private scratch pad

Nested execution policies

  • Top-level: TeamPolicy
  • Next level
    • TeamThreadRange, TeamThreadMDRange
      • ThreadVectorRange, ThreadVectorMDRange
    • TeamVectorRange, TeamVectorMDRange

Example 1: \(\vec{y} = A \cdot \vec{x}\)

Where \(\vec{y} \in \mathbb{R}^M\), \(A \in \mathbb{R}^{M \times N}\), \(\vec{x} \in \mathbb{R}^N\)

Kokkos::parallel_for("y=A*x", M,
    KOKKOS_LAMBDA(const int i) {
        for (int j = 0; j < N; ++j) {
            y(i) = A(i, j)*x(j);
        }
    });
Kokkos::parallel_for("y=A*x",
    Kokkos::TeamPolicy<>(M, Kokkos::AUTO),
    KOKKOS_LAMBDA(int i) {
        double row_sum = 0;
        Kokkos::parallel_reduce("y(i)"
            Kokkos::TeamThreadRange(team, N),
            [=] (int j, double& lsum) {
                lsum += A(i,j)*x(j);
            }, row_sum);
        y(i) = row_sum;
});
1
Create league of teams of threads
2
Exploit inner parallelism

Better performance for \(M \ll N\)

Example 2: \(s = \vec{y} \cdot A \cdot \vec{x}\)

Where \(s \in \mathbb{R}\), \(\vec{y} \in \mathbb{R}^M\), \(A \in \mathbb{R}^{M \times N}\), \(\vec{x} \in \mathbb{R}^N\)

using policy_t = Kokkos::TeamPolicy<>;
float s {0.0f};
Kokkos::parallel_reduce("y*A*x", policy_t(M, Kokkos::AUTO),
    KOKKOS_LAMBDA(const policy_t::member_type& team_member, float& sum) {
        const int i {team_member.league_rank()};
        float row_sum {0.0f};
        Kokkos::parallel_reduce(
            Kokkos::TeamThreadRange(team_member, N),
            [=] (const int j, float& row_sum) {
                row_sum += A(i, j)*x(j);
            }, row_sum);
        Kokkos::single(Kokkos::PerTeam(team_member),
            [&]() { result += y(i)*row_sum; });
    }, s);
1
Note: capture by value
2
Note: A is accessed row-wise!
3
Only one thread per team should update!
4
Note: capture by reference

Scratch pads

  • Level-0
    • Limited size, fast
    • “Manual L1 cache”
  • Level-1
    • Larger, but less performance
    • “Nearest fast storage”
  • Team or thread private memory
    • Per work item temp storage
    • Size \(\propto\) nr. threads
  • Manually managed cache
    • Cache frequently used data
    • exposes on-core scratch space (e.g., NVIDIA GPU shared memory)

How to use scratch pads?

  • TeamPolicy
  • Determine actual scratch pad size
  • Execution pattern
    • Create scratch view
    • Initialize scratch view using VectorTeamPolicy
    • Use scratch view

Base version

Kokkos::View<double***> A("A", nr_elements, nr_qp, vector_size);
Kokkos::View<double**> B("B", nr_elements, vector_size);
Kokkos::View<double**> C("C", nr_elements, nr_qp);

Kokkos::parallel_for(
    "C=A*B", Kokkos::TeamPolicy<>(nr_elements, Kokkos::AUTO),
    KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type& team) {
      const int element{team.league_rank()};
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nr_qp),
                           [&](const int& qp) {
                             double total{0.0};
                             for (int i = 0; i < vector_size; ++i) {
                               total += A(element, qp, i) * B(element, i);
                             }
                             C(element, qp) = total;
                           });
    });
1
Note: each thread in a team uses same B(element, :)

Determine scratch pad size

using ScratchView =
    Kokkos::View<double*,
                 Kokkos::DefaultExecutionSpace::scratch_memory_space>;
size_t scratch_size = ScratchView::shmem_size(vector_size);
1
Define scratch view type based on execution space
2
Determine size of scratch space to store vector_size values of type double

Note

Required to ensure alignment taken into account!

Reserve scratch pad memory

const int level {0};
Kokkos::TeamPolicy<> policy(nr_elements, Kokkos::AUTO);
Kokkos::parallel_for(
    policy.set_scratch_size(level, Kokkos::PerTeam(scratch_size)),
    ...
);
1
Set the scratch pad level
2
Create the TeamPolicy to use
3
Allocate scratch pad at given level of given size
  • Scratch can be Kokkos::PerTeam and/or Kokkos::PerThread
  • Both levels can be used
  • Only allocates memory

Create scratch view and initialize

Kokkos::parallel_for(
    policy.set_scratch_size(level, Kokkos::PerTeam(scratch_size)),
    KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type& team) {
      const int element{team.league_rank()};
      ScratchView scratch(team.team_scratch(level), vector_size);
      Kokkos::parallel_for(
          Kokkos::TeamVectorRange(team, vector_size),
          [&](const int& i) { scratch(i) = B(element, i); });
      team.team_barrier();
      ...
1
Create scratch view at given level and of given size
2
Copy row of B in parallel
3
Barrier to ensure scratch is fully initialized

Warning

Kokkos::parallel_for can run asynchronously, threads will not synchronize

Putting it all together

Kokkos::parallel_for(
    "C=A*B", policy.set_scratch_size(level, Kokkos::PerTeam(scratch_size)),
    KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type& team) {
      const int element{team.league_rank()};
      ScratchView scratch(team.team_scratch(level), vector_size);
      Kokkos::parallel_for(
          Kokkos::TeamVectorRange(team, vector_size),
          [&](const int& i) { scratch(i) = B(element, i); });
      team.team_barrier();
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nr_qp),
                           [&](const int& qp) {
                             double total{0.0};
                             for (int i = 0; i < vector_size; ++i) {
                               total += A(element, qp, i) * scratch(i);
                             }
                             C(element, qp) = total;
                           });
    });
1
Each thread in team uses same scratch

All threads in team use per team scratch

Kokkos library

  • Algorithms
    • Random numbers
    • Sorting
  • Containers
    • Unordered map
    • Bitset

Warning

Experimental features!

Random numbers

#include <Kokkos_Random.hpp>
...
Kokkos::Random_XorShift64_Pool<Kokkos::DefaultExecutionSpace> random_pool(seed);
...
Kokkos::parallel_for("walks", nr_walks, KOKKOS_LAMBDA(const int walk) {
      auto generator = random_pool.get_state();
      int pos = max_steps;
      for (int step = 0; step < (walk % 2 == 0 ? max_steps : max_steps - 1);
           ++step) {
        pos += generator.drand() < 0.5 ? -1 : 1;
      }
      distance_atomic(pos)++;
    });
1
Include header file for random number generator
2
Create a pool of generators with given seed
3
For each thread, get generator from the pool
4
Get a random number \(x\) such that \(0 \le x \lt 1\)

Why kernels?

  • Don’t program your own matrix-matrix multiplication!
  • Portable performance
  • Can use vendor libraries

Note

Separate download/installation

Which kernels?

  • Dense linear algebra: BLAS, batched BLAS, some LAPACK
  • Sparse linear algebra, CRS (Compressed Row Storage)
  • Graph algorithms: distance-1/distance-2 graph coloring, bipartite graph coloring
  • Sparse solvers: variants of Gauss-Seidel
  • ODE solvers: explicit Runge-Kutta, implicit BDF, Newton

Reference & further reading