Best practices for data science on HPC

This repository contains the training material for the VSC training on best practices for data science on HPC infrastructures.

There are many good reasons to run data science workloads on a High Performance Computing (HPC) system. However, the transition from a laptop to an HPC system can be daunting. This training will help you make that transition.

The training covers not only what works well on HPC systems, but also the potential pitfalls and how to avoid them.

Learning outcomes

When you complete this training you will

be able to judge when to switch to an HPC environment;
be able to prepare your environment for R and Python;
be able to run a job on an HPC cluster that uses that environment;
be able to determine how long your computation will take;
be able to determine how much memory your computation will need;
be able to estimate the efficiency of your computation;
know the basics of how to run your computations efficiently;
know when it makes sense to use parallelization;
understand the basics and pitfalls of I/O on HPC systems;
be aware of potential pitfalls and how to avoid them.

Schedule

Total duration: 4 hours

Subject	Duration
introduction and motivation	5 min.
setting up environments on an HPC system	25 min.
walltime & memory requirements	30 min.
efficiency	30 min.
to parallelize or not to parallelize?	30 min.
I/O on HPC systems	60 min.
pitfalls and how to avoid them	30 min.
wrap up	5 min.

Training materials

All training materials are available in a GitHub repository.

Target audience

This training is for you if you need to use R or Python for data science on HPC systems.

Prerequisites

This training does not start from scratch. You should have followed an HPC introduction training session and have a basic understanding of how to work on the Bash command line.

You should also have experience with R or Python.

Quick self-assessment

If you can do most of the tasks below, you are likely ready for this training.

log in to an HPC system and navigate the filesystem from a shell;
submit a simple batch job and inspect whether it completed successfully;
activate or load a software environment for R or Python;
run a small R or Python script from the command line;
estimate roughly how long a small computation takes on your own machine;
recognize when a dataset or intermediate result is large enough to stress memory or storage;
explain at a high level why reading many small files can be inefficient;
make a small change to an existing script or job script and run it again.

If several of these items still feel difficult, the training will probably move too fast. In that case, it is better to first take an introductory HPC session and refresh basic command-line use.
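
As an illustration of the self-assessment items on estimating run time and memory use, the following is a minimal Python sketch and not part of the training material itself. The function `sum_of_squares` is a hypothetical stand-in for your own computation, and the helper names `measure_time` and `measure_peak_memory` are assumptions made for this example. Note that `tracemalloc` only tracks memory allocated through Python's allocator and slows execution down, so time and memory are measured in separate runs.

```python
import time
import tracemalloc


def measure_time(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def measure_peak_memory(func, *args, **kwargs):
    """Run func once and return (result, peak traced memory in bytes).

    Only allocations made through Python's memory allocator are counted,
    so this is a rough lower bound rather than the full process footprint.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def sum_of_squares(n):
    # Hypothetical stand-in for a real computation. The list is built
    # on purpose so that the memory use is visible in the measurement.
    values = [i * i for i in range(n)]
    return sum(values)


if __name__ == "__main__":
    n = 1_000_000
    _, elapsed = measure_time(sum_of_squares, n)
    _, peak = measure_peak_memory(sum_of_squares, n)
    print(f"elapsed: {elapsed:.3f} s")
    print(f"peak traced memory: {peak / 1024**2:.1f} MiB")
```

Repeating the measurement for a few increasing values of `n` gives a rough idea of how time and memory scale with problem size on your own machine.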

Software and access requirements

To follow along hands-on, you need

a laptop or desktop with internet access, set up so that you can connect to an HPC system;
an account on an HPC system (e.g., VSC, CECI, …);
compute credits, if those are required to run jobs on the HPC system.

Level

Introductory: 40 %
Intermediate: 50 %
Advanced: 10 %

Trainer(s)

Geert Jan Bex (geertjan.bex@uhasselt.be)