GPUs can offer significant speedup
But:
Motivation
#pragma omp ... for C/C++
!$omp ... for Fortran
Warning
These slides only cover device offloading using OpenMP
Warning
Familiarity with C-type raw pointers helps (a lot)
Move A and B to device; move C to device and back to host at end of target region
Warning
Data scope is scope of #pragma omp target teams loop
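As an illustration of the mapping described above, here is a minimal sketch of a matrix multiplication where the data scope is the `target teams loop` construct itself. The function name `matmul_offload` and the row-major layout are assumptions made for this example, not taken from the slides.

```c
#include <assert.h>

/* C = C + A * B for n x n row-major matrices (hypothetical helper). */
void matmul_offload(const float *A, const float *B, float *C, int n)
{
    assert(A && B && C && n > 0);

    /* A and B are only read on the device: map(to).
       C is read and written: map(tofrom), copied back when the region ends. */
    #pragma omp target teams loop \
        map(to: A[:n*n], B[:n*n]) map(tofrom: C[:n*n])
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = C[i*n + j];
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
    }
}
```

When built without OpenMP support the pragma is ignored and the loop runs serially on the host, which is convenient for checking results.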
#pragma omp target selects a device
device clause explicitly selects a device
omp_get_num_devices(), omp_get_default_device()
omp_get_num_devices() - 1
#pragma omp teams directive
num_teams clause to set maximum size of league
thread_limit clause to set maximum team size
loop
#pragma omp loop
#pragma omp distribute parallel for
Warning
Do not forget distribute!
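The clauses above can be combined on a single construct. The following is a sketch only: the helper names (`available_devices`, `default_device_or_zero`, `saxpy_on_device`) and the values 4 and 64 are assumptions chosen for illustration. `distribute` spreads the iterations across the teams of the league, and `parallel for` spreads each team's share across its threads.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of offload devices (0 when built without OpenMP). */
int available_devices(void)
{
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

/* Default device number (0 when built without OpenMP). */
int default_device_or_zero(void)
{
#ifdef _OPENMP
    return omp_get_default_device();
#else
    return 0;
#endif
}

/* y = alpha * x + y on device `dev` (hypothetical helper). */
void saxpy_on_device(float alpha, const float *x, float *y, int n, int dev)
{
    assert(x && y && n > 0);

    /* device: which device to use
       num_teams: upper bound on the league size (4 is an arbitrary choice)
       thread_limit: upper bound on threads per team (64 is arbitrary) */
    #pragma omp target teams distribute parallel for \
        device(dev) num_teams(4) thread_limit(64) \
        map(to: x[:n]) map(tofrom: y[:n])
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

A typical call would be `saxpy_on_device(2.0f, x, y, n, default_device_or_zero());`.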
to, from, tofrom, alloc, release, delete
#pragma omp target teams loop
Specify slice to copy: `array[
<end> not included
<end> included
Tip
Shortcut: array[:<end>]
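A short sketch of the shortcut form in C, assuming a hypothetical helper `double_prefix`. In the OpenMP C/C++ syntax the second field of an array section is a length; when the start is omitted (0) that length coincides with the exclusive end, so `a[:end]` covers elements 0 to end-1.

```c
#include <assert.h>

/* Double the first `end` elements of a on the device; the rest of the
   array is neither copied nor touched (hypothetical helper). */
void double_prefix(float *a, int end)
{
    assert(a && end > 0);

    /* Only a[0] .. a[end-1] are mapped to the device and back. */
    #pragma omp target teams distribute parallel for map(tofrom: a[:end])
    for (int i = 0; i < end; i++)
        a[i] *= 2.0f;
}
```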
to: allocate on device, copy from host to device
from: allocate on device, copy from device to host
tofrom: allocate on device, copy from host to device, copy from device to host
alloc: allocate on device
release: decrease reference count on device
delete: deallocate on device
Warning
alloc expects data to be allocated on host
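To illustrate `alloc`, here is a sketch that uses a scratch buffer living only on the device. The function name `sum_of_squares` and its interface are assumptions for this example. The host array `tmp` must exist so that the device copy has something to correspond to, but its contents are never transferred in either direction.

```c
#include <assert.h>

/* Returns the sum of x[i]^2, using tmp as device-side scratch space
   (hypothetical helper). tmp must be allocated on the host by the caller. */
float sum_of_squares(const float *x, float *tmp, int n)
{
    assert(x && tmp && n > 0);
    float sum = 0.0f;

    /* x is copied to the device; tmp is only allocated there. */
    #pragma omp target data map(to: x[:n]) map(alloc: tmp[:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            tmp[i] = x[i] * x[i];

        /* sum is mapped explicitly so the reduced value returns to the host. */
        #pragma omp target teams distribute parallel for \
            reduction(+:sum) map(tofrom: sum)
        for (int i = 0; i < n; i++)
            sum += tmp[i];
    }
    return sum;
}
```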
A, B, C allocated on host
A, B, C copied to device at start of scope
C copied back to host at end of scope
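The scope described above can be sketched with a `target data` region that encloses several kernels. The name `add_then_scale` is hypothetical. A, B and C stay resident on the device between the two loops, so there is one transfer in at the opening brace and one transfer out at the closing brace.

```c
#include <assert.h>

/* C = 2 * (A + B), computed in two device kernels that share one
   data region (hypothetical helper). */
void add_then_scale(const float *A, const float *B, float *C, int n)
{
    assert(A && B && C && n > 0);

    /* A and B are copied to the device here; C is only allocated. */
    #pragma omp target data map(to: A[:n], B[:n]) map(from: C[:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];

        /* No transfer between the kernels: C is still on the device. */
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            C[i] *= 2.0f;
    }   /* C is copied back to the host here. */
}
```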
a, copy to device
a, copy back to host
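One way to express these two steps without a structured block is the `target enter data` / `target exit data` pair. This is a sketch under the assumption that unstructured data mapping is what the two lines above refer to; the helper name `square_in_place` is hypothetical.

```c
#include <assert.h>

/* Square every element of a on the device (hypothetical helper). */
void square_in_place(float *a, int n)
{
    assert(a && n > 0);

    /* Allocate a on the device and copy host -> device. */
    #pragma omp target enter data map(to: a[:n])

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
        a[i] *= a[i];

    /* Copy device -> host and release the device copy. */
    #pragma omp target exit data map(from: a[:n])
}
```

Because enter and exit are separate directives, they may live in different functions, for example in a constructor and destructor style pair.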
Can be nested
Can overlap
Can update
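Updating inside a data region can be sketched with `target update`, which refreshes one side from the other without ending the region. The helper name `roundtrip_update` and the arithmetic are assumptions made for illustration.

```c
#include <assert.h>

/* Computes a = (2*a + 1) * 2, alternating between device and host work
   inside a single data region (hypothetical helper). */
void roundtrip_update(float *a, int n)
{
    assert(a && n > 0);

    #pragma omp target data map(tofrom: a[:n])
    {
        /* Device: a = 2*a */
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            a[i] *= 2.0f;

        /* Refresh the host copy from the device. */
        #pragma omp target update from(a[:n])

        /* Host: a = a + 1 */
        for (int i = 0; i < n; i++)
            a[i] += 1.0f;

        /* Push the host changes back to the device. */
        #pragma omp target update to(a[:n])

        /* Device: a = 2*a */
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            a[i] *= 2.0f;
    }
}
```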
OpenMP function
#include <omp.h>
...
// Allocate n*n floats directly on the default device
float *b = (float *) omp_target_alloc(n*n*sizeof(float), omp_get_default_device());
// b already points to device memory, so no map clause is needed
#pragma omp target teams distribute parallel for is_device_ptr(b)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        b[i*n + j] = ((float) (i*n + j))/(n*n);
    }
}
...
// Release the device allocation
omp_target_free(b, omp_get_default_device());
omp_target_alloc and omp_target_free are declared in omp.h
b as pointer to data on device
b
Tip
No data movements!
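Since nothing is moved automatically, results have to be fetched explicitly if the host needs them. One possible way, sketched here, is `omp_target_memcpy`, which is also declared in omp.h. The helper name `fill_and_fetch` and the serial fallback for builds without OpenMP are assumptions added for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Fill an n x n matrix in device memory, then copy it into the host
   buffer `out` (hypothetical helper). Returns 0 on success. */
int fill_and_fetch(float *out, int n)
{
    assert(out && n > 0);
    size_t bytes = (size_t) n * n * sizeof(float);
#ifdef _OPENMP
    int dev  = omp_get_default_device();
    int host = omp_get_initial_device();
    float *b = (float *) omp_target_alloc(bytes, dev);
#else
    float *b = (float *) malloc(bytes);   /* serial fallback, assumption */
#endif
    if (!b)
        return 1;

    /* b is a device pointer: no mapping, no implicit transfer. */
    #pragma omp target teams distribute parallel for is_device_ptr(b)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            b[i*n + j] = ((float) (i*n + j))/(n*n);
        }
    }

    int rc = 0;
#ifdef _OPENMP
    /* Explicit device -> host copy: (dst, src, length, dst_offset,
       src_offset, dst_device, src_device). */
    rc = omp_target_memcpy(out, b, bytes, 0, 0, host, dev);
    omp_target_free(b, dev);
#else
    memcpy(out, b, bytes);
    free(b);
#endif
    return rc;
}
```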