View on GitHub

Best-practices-for-data-science-on-HPC

This repository contains the training material for the VSC training about best practices for data science on HPC infrastructures

There are many good reasons to run data science workloads on a High Performance Computing (HPC) system. However, the transition from a laptop to an HPC system can be daunting. This training will help you make that transition.

You will also learn about potential pitfalls and how to avoid them. This training is not just about the good parts, but also about how to avoid the bad parts.

Learning outcomes

When you complete this training you will

Schedule

Total duration: 4 hours

Subject Duration
introduction and motivation 5 min.
setting up environments on an HPC system 25 min.
walltime & memory requirements 30 min.
efficiency 30 min.
to parallelize or not to parallelize? 30 min.
I/O on HPC systems 60 min.
pitfalls and how to avoid them 30 min.
wrap up 5 min.

Training materials

All training materials are available in a GitHub repository.

Target audience

This training is for you if you need to use R on HPC systems.

Prerequisites

This is not a training that starts from scratch. You have followed an HPC introduction training session and you have a basic understanding of how to work on the Bash command line.

You have experience with R or Python.

Trainer(s)