
OpenMP GPU offloading

Training material for the session 'OpenMP GPU offloading'.

Leveraging the compute power of GPU accelerators for scientific computing is becoming increasingly important. Of the many programming models for GPUs, OpenMP offers a vendor-agnostic approach: the same code can run on a wide variety of GPU hardware. This training covers GPU programming with OpenMP and how to optimize the performance of offloaded code.

Learning outcomes

When you complete this training you will

have a good understanding of the GPU hardware and programming model;
be able to write OpenMP code that offloads to the GPU;
understand how to optimize performance of OpenMP GPU code;
be able to apply this knowledge to your own scientific computing applications.
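
As a preview of the second outcome, the following is a minimal sketch of an offloaded loop in C. The function name saxpy_offload is hypothetical and does not come from the training repository. The sketch assumes a compiler with OpenMP target offloading enabled; without one, or without an available device, the loop typically just runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: computes y = a*x + y element-wise.
   The target construct asks OpenMP to run the loop on the default device. */
void saxpy_offload(size_t n, float a, const float *x, float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    /* map(to: ...) copies x to the device before the region starts;
       map(tofrom: ...) copies y to the device and back to the host. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The map clauses make the data movement between host and device explicit, which is one of the topics the training covers in detail.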

Schedule

Total duration: 4 hours.

Subject	Duration
introduction and motivation	5 min.
GPU hardware/programming model	80 min.
OpenMP worksharing	25 min.
coffee break	10 min.
OpenMP data movements	30 min.
OpenMP kernels	10 min.
examples	60 min.
wrap up	10 min.

Training materials

The slide deck is available as a Quarto RevealJS presentation.

Slides, source code, and supporting material are available in the GitHub repository. The repository contains C and Fortran examples with CMake build files.

Target audience

This training is for you if you want to do GPU programming in C, C++, or Fortran and want a vendor-agnostic approach.

Prerequisites

You need programming experience in C, C++, or Fortran, and you should be familiar with the OpenMP programming model.

If you plan to do GPU programming in a Linux or HPC environment, you should be familiar with that environment as well.

More concretely, participants should already be comfortable with the following:

writing and reading small to medium programs in C, C++, or Fortran;
arrays, loops, functions or procedures, and separate compilation at a basic level;
compiling and running programs from the command line;
basic OpenMP shared-memory concepts such as parallel loops, reductions, and the fact that race conditions can occur when threads update shared data;
working from the shell: navigating directories, editing files, and running commands;
basic HPC concepts such as login nodes, compute nodes, and batch jobs if the examples are run on a cluster.

Familiarity with C-style raw pointers helps a lot for the C and C++ examples. You do not need prior experience with GPU programming, OpenMP target offloading, device memory management, target data regions, or GPU-specific performance tuning. Those are part of the training itself.
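
As a rough indication of the assumed OpenMP level, here is a small sketch of a shared-memory parallel loop with a reduction. The function name sum_of_squares is hypothetical and only serves as an illustration. If code like this reads comfortably, the OpenMP prerequisite is probably met.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum of the squares of the n elements of v,
   computed with an OpenMP parallel loop. */
double sum_of_squares(const double *v, size_t n)
{
    assert(n == 0 || v != NULL);

    double sum = 0.0;
    /* Without reduction(+:sum), all threads would update the shared
       variable sum at the same time, which is a race condition.
       With the clause, each thread accumulates a private partial sum
       and the partial sums are combined at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; ++i) {
        sum += v[i] * v[i];
    }
    return sum;
}
```

With GCC or Clang this would typically be compiled with -fopenmp; if OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result.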

Quick self-assessment

If you can do most of the tasks below without looking up basic language, OpenMP, or shell syntax, you are likely ready for this training.

compile and run a small C, C++, or Fortran program from the command line;
read a short loop nest over arrays and explain what it computes;
understand what an OpenMP parallel loop or reduction does at a high level;
understand at a high level why updating a shared variable from many threads can cause incorrect results;
read a short program that allocates arrays and passes them to a function;
make a small change to an existing source file and rebuild it;
understand at a high level that data on a GPU may need to be transferred to and from host memory.

If several of these items still feel difficult, the training will probably move too fast. In that case, it is better to first refresh your base language and basic OpenMP shared-memory programming.
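
To make the items on loop nests and array allocation concrete, the sketch below shows the kind of code you should be able to read without effort. The names matvec and matvec_demo are hypothetical and are not taken from the course examples.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical example: y = A x for a row-major n_rows x n_cols matrix a. */
void matvec(size_t n_rows, size_t n_cols,
            const double *a, const double *x, double *y)
{
    assert(a != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n_rows; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n_cols; ++j) {
            acc += a[i * n_cols + j] * x[j];
        }
        y[i] = acc;
    }
}

/* Allocates an n x n matrix of ones and a vector x with x[j] = j,
   computes y = A x, and returns y[0]. Returns -1.0 if n is zero or
   an allocation fails. */
double matvec_demo(size_t n)
{
    if (n == 0) return -1.0;
    double *a = malloc(n * n * sizeof *a);
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    double result = -1.0;
    if (a != NULL && x != NULL && y != NULL) {
        for (size_t k = 0; k < n * n; ++k) a[k] = 1.0;
        for (size_t j = 0; j < n; ++j) x[j] = (double) j;
        matvec(n, n, a, x, y);
        result = y[0];
    }
    free(a);
    free(x);
    free(y);
    return result;
}
```

A useful check is to predict what matvec_demo returns for a small n before running it: every row of the matrix sums the elements of x.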

Software and access requirements

To follow the hands-on parts, you need a system with GPU hardware and a compiler toolchain that supports OpenMP target offloading for that hardware. The example code is organized as C and Fortran CMake projects.

More concretely, you need:

a laptop or desktop with internet access;
a terminal environment that can connect to the system where the examples will run;
an account on an HPC system if the GPU hardware is provided by a cluster;
compute credits or project allocation if that is required by the HPC system;
a C compiler for the C examples, and a Fortran compiler for the Fortran examples;
CMake for configuring and building the examples;
an OpenMP offloading-capable compiler stack, such as the NVIDIA HPC SDK for NVIDIA GPU examples;
enough familiarity with the local module or environment system to load the compiler, GPU runtime, and build tools.

The C and Fortran example directories contain local README files with notes on setting compiler environment variables such as CC and FC when using the NVIDIA HPC SDK.

Level of the material

For participants who already have programming experience in C, C++, or Fortran, and basic OpenMP experience, the material in this training is approximately

Introductory: 15 %
Intermediate: 35 %
Advanced: 50 %

These percentages describe the level of the GPU-offloading and OpenMP topics covered in the training, not the required entry level in the base programming language itself.

Trainer(s)

Geert Jan Bex (geertjan.bex@uhasselt.be)