OpenMP GPU offloading

Motivation

GPUs can offer significant speedup

But:

  • Programming GPUs is non-trivial
  • Not all algorithms are suitable
  • Code may not be portable
  • Various vendors: NVIDIA, AMD, Intel

L’embarras du choix

  • CUDA (NVIDIA specific)
  • HIP (AMD specific)
  • OpenCL
  • OpenMP
  • OpenACC
  • Kokkos
  • SYCL
  • Alpaka

Choices & why

  • OpenMP

Motivation

  • Vendor agnostic
  • Relatively widely used
  • Support for C, C++ and Fortran
  • Good/reasonable documentation

Introduction

  • OpenMP for shared memory programming
  • OpenMP 4.5+ offloading to accelarators
  • Annotate code with directives
    • #pragma omp ... for C/C++
    • !$omp ... for Fortran
  • Runtime library
  • Environment variables

Pros

  • Gradual path to parallel code
  • Implemented by GCC, Intel OneAPI, Clang, NVIDIA
  • Can be used in C, C++, Fortran
  • Conceptually fairly easy

Cons

  • Efficient code not so easy
  • OpenMP 5.x not fully implemented

Caveats

Warning

These slides only cover device offloading using OpenMP

Warning

Familarity with C-type raw pointers helps (a lot)

Hello world

float *A = (float *) malloc(N*sizeof(float));
float *B = (float *) malloc(N*sizeof(float));
float *C = (float *) malloc(N*sizeof(float));
...
#pragma omp target teams loop map(to:A[:N], B[:N]) map(tofrom:C[:N])
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < n; j++) {
            C[i] += A[i]*B[i];
        }
    }
1
Allocate and initialize data on host
2
Move A and B to device, move C to device and back to host at end of target region
3
Kernel executing on device

Warning

Data scope is scope of #pragma omp target teams loop

Targetting a device

  • #pragma omp target selects a device
  • Default device is GPU:0 if there is one, CPU otherwise
  • device clause allows to explicitly select device
  • omp_get_num_devices(), omp_get_default_device()
  • Devices numbered from 0 to omp_get_num_devices() - 1

Creating teams

  • #pragma omp teams directive
  • Creates league of teams
  • Teams have threads as members
  • num_teams clause to set maximum size of league
  • thread_limit clause to set maximum team size
  • One thread per team outside of parallel region or loop
  • Threads in team can synchronize
  • Teams don’t synchronize

Work sharing

  • #pragma omp loop
    • Simple
    • Portable
  • #pragma omp distributed parallel for
    • More control for fine-tuning
    • Parallel region may cause overhead

Warning

Do not forget distributed!

Moving data around

  • Move data explicitely to, from, tofrom, alloc, release, delete
    • As part of #pragma omp target teams loop
    • Structured data region
    • Unstructured data region
  • Allocate data on device

Specify slice to copy: `array[:]

  • C++: <end> not included
  • Fortran: <end> included

Tip

Shortcut: array[:<end>]

Map types

  • to: allocate on device, copy from host to device
  • from: allocate on device, copy from device to host
  • tofrom: allocate on device, copy from host to device, copy from device to host
  • alloc: allocate on device
  • release: decrease reference count on device
  • delete: deallocate on device

Warning

alloc expects data to be allocated on host

Structured data region

float *vecA = (float *) malloc(N*sizeof(float));
float *vecB = (float *) malloc(N*sizeof(float));
float *vecC = (float *) malloc(N*sizeof(float));
...
#pragma omp target data map(to: vecA[:N], vecB[:N]) map(tofrom:: vecC[:N])
{
    ...
}
1
Data A, B, C allocated on host
2
Data A, B, C copied to device at start of scope
3
Data C copied back to host at end of scope
  • Can be nested
  • Can not overlap

Unstructured data region

float *a = (float *) malloc(n*n*sizeof(float));
...
#pragma omp target enter data map(to:a[0:n*n])
...
#pragma omp target exit data map(from:a[0:n*n])
1
Enter data region for a, copy to device
2
Exit data region for a, copy back to host
  • Can be nested

  • Can overlap

  • Can update

    #pragma omp target update from(a[0:n*n])

Allocate on device

OpenMP function

#include <omp.h>
...
float *b = (float *) omp_target_alloc(n*n*sizeof(float), omp_get_default_device());
#pragma omp target teams distribute parallel for is_device_ptr(b)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            b[i*n + j] = ((float) (i*n + j))/(n*n);
        }
    }
...
omp_target_free(b, omp_get_default_device());
1
omp_target_alloc and omp_target_free are declared in omp.h
2
Allocation on device
3
Declare b as pointer to data on device
4
Free b

Tip

No data movements!

Functions as kernels

void iterate_z(const float complex *z, int *count, int n,
               const float complex C);

#pragma omp declare target to(iterate_z)
1
Function declaration or definition
2
Declare function as kernel

References