OpenMP GPU offloading

Geert Jan Bex

Motivation

GPUs can offer significant speedup

But:

Programming GPUs is non-trivial
Not all algorithms are suitable
Code may not be portable
Various vendors: NVIDIA, AMD, Intel

L’embarras du choix

CUDA (NVIDIA specific)
HIP (AMD specific)
OpenCL
OpenMP
OpenACC
Kokkos
SYCL
Alpaka
…

Choices & why

OpenMP

Motivation

Vendor agnostic
Relatively widely used
Support for C, C++ and Fortran
Good/reasonable documentation

Introduction

OpenMP for shared memory programming
OpenMP 4.5+ offloading to accelarators
Annotate code with directives
- #pragma omp ... for C/C++
- !$omp ... for Fortran
Runtime library
Environment variables

Pros

Gradual path to parallel code
Implemented by GCC, Intel OneAPI, Clang, NVIDIA
Can be used in C, C++, Fortran
Conceptually fairly easy

Cons

Efficient code not so easy
OpenMP 5.x not fully implemented

Caveats

Warning

These slides only cover device offloading using OpenMP

Warning

Familarity with C-type raw pointers helps (a lot)

Hello world

float *A = (float *) malloc(N*sizeof(float));
float *B = (float *) malloc(N*sizeof(float));
float *C = (float *) malloc(N*sizeof(float));
...
#pragma omp target teams loop map(to:A[:N], B[:N]) map(tofrom:C[:N])
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < n; j++) {
            C[i] += A[i]*B[i];
        }
    }

1: Allocate and initialize data on host
2: Move A and B to device, move C to device and back to host at end of target region
3: Kernel executing on device

Warning

Data scope is scope of #pragma omp target teams loop

Targetting a device

#pragma omp target selects a device
Default device is GPU:0 if there is one, CPU otherwise
device clause allows to explicitly select device
omp_get_num_devices(), omp_get_default_device()
Devices numbered from 0 to omp_get_num_devices() - 1

Creating teams

#pragma omp teams directive
Creates league of teams
Teams have threads as members
num_teams clause to set maximum size of league
thread_limit clause to set maximum team size
One thread per team outside of parallel region or loop
Threads in team can synchronize
Teams don’t synchronize

Moving data around

Move data explicitely to, from, tofrom, alloc, release, delete
- As part of #pragma omp target teams loop
- Structured data region
- Unstructured data region
Allocate data on device

Specify slice to copy: `array[:]

C++: <end> not included
Fortran: <end> included

Tip

Shortcut: array[:<end>]

Map types

to: allocate on device, copy from host to device
from: allocate on device, copy from device to host
tofrom: allocate on device, copy from host to device, copy from device to host
alloc: allocate on device
release: decrease reference count on device
delete: deallocate on device

Warning

alloc expects data to be allocated on host

Structured data region

float *vecA = (float *) malloc(N*sizeof(float));
float *vecB = (float *) malloc(N*sizeof(float));
float *vecC = (float *) malloc(N*sizeof(float));
...
#pragma omp target data map(to: vecA[:N], vecB[:N]) map(tofrom:: vecC[:N])
{
    ...
}

1: Data A, B, C allocated on host
2: Data A, B, C copied to device at start of scope
3: Data C copied back to host at end of scope

Can be nested
Can not overlap

Unstructured data region

float *a = (float *) malloc(n*n*sizeof(float));
...
#pragma omp target enter data map(to:a[0:n*n])
...
#pragma omp target exit data map(from:a[0:n*n])

1: Enter data region for a, copy to device
2: Exit data region for a, copy back to host

Can be nested
Can overlap

Can update

#pragma omp target update from(a[0:n*n])

Allocate on device

OpenMP function

#include <omp.h>
...
float *b = (float *) omp_target_alloc(n*n*sizeof(float), omp_get_default_device());
#pragma omp target teams distribute parallel for is_device_ptr(b)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            b[i*n + j] = ((float) (i*n + j))/(n*n);
        }
    }
...
omp_target_free(b, omp_get_default_device());

1: omp_target_alloc and omp_target_free are declared in omp.h
2: Allocation on device
3: Declare b as pointer to data on device
4: Free b

Tip

No data movements!

Functions as kernels

void iterate_z(const float complex *z, int *count, int n,
               const float complex C);

#pragma omp declare target to(iterate_z)

1: Function declaration or definition
2: Declare function as kernel

References

OpenMP secification

OpenMP GPU offloading

Motivation

L’embarras du choix

Choices & why

Introduction

Pros

Cons

Caveats

Hello world

Targetting a device

Creating teams

Work sharing

Moving data around

Map types

Structured data region

Unstructured data region

Allocate on device

Functions as kernels

References