Data science learning path

If you want to do data science on HPC, consider following the training sessions below.

graph TD
    Best_practices_for_scientific_computing[Best practices for scientific computing] --> Best_practices_for_data_science_on_HPC[Best practices for data science on HPC]
    Linux_intro[Linux introduction] --> HPC_intro[HPC introduction]
    HPC_intro --> Best_practices_for_data_science_on_HPC
    Best_practices_for_data_science_on_HPC --> Version_control_with_git[Version control with Git]
    Version_control_with_git --> MLOps_on_HPC[MLOps on HPC]
    Python_for_beginners[Python for beginners] --> Python_for_programmers[Python for programmers]
    Python_for_programmers --> Best_practices_for_data_science_on_HPC
    Best_practices_for_data_science_on_HPC --> Containers_on_HPC[Containers on HPC]
    Best_practices_for_data_science_on_HPC --> Workflows_for_HPC[Workflows for HPC]
    Best_practices_for_data_science_on_HPC --> Scientific_Python[Scientific Python]
    Python_for_programmers --> Python_software_engineering[Python software engineering]
    Python_for_programmers --> Scientific_Python[Scientific Python]
    Scientific_Python --> Python_for_data_science[Python for data science]
    Python_for_data_science --> Python_dashboards[Python dashboards]
    Scientific_Python --> Python_for_HPC[Python for HPC]
    Python_for_data_science --> Python_for_HPC
    Python_for_data_science --> Generative_ai_for_software_engineering_and_data_analysis[Generative AI for software\nengineering and data analysis]
    Python_for_data_science --> Machine_learning_with_Python[Machine learning with Python]
    MLOps_on_HPC --> Machine_learning_with_Python
    click Best_practices_for_scientific_computing "https://gjbex.github.io/Best-practices-for-scientific-computing/" "Best practices for scientific computing"
    click Best_practices_for_data_science_on_HPC "https://gjbex.github.io/Best-practices-for-data-science-on-HPC/" "Best practices for data science on HPC"
    click Version_control_with_git "https://gjbex.github.io/Version-control-with-git" "Version control with Git"
    click Linux_intro "https://gjbex.github.io/Training-sessions/linux_intro" "Linux introduction"
    click HPC_intro "https://gjbex.github.io/Training-sessions/hpc_intro" "HPC introduction"
    click Containers_on_HPC "https://gjbex.github.io/Containers-for-HPC/" "Containers on HPC"
    click Workflows_for_HPC "https://gjbex.github.io/Workflows-for-HPC/" "Workflows for HPC"
    click MLOps_on_HPC "https://gjbex.github.io/MLOps-on-HPC/" "MLOps on HPC"
    click Python_for_beginners "https://gjbex.github.io/Python-for-beginners/" "Python for beginners"
    click Python_for_programmers "https://gjbex.github.io/Python-for-programmers/" "Python for programmers"
    click Python_software_engineering "https://gjbex.github.io/Python-software-engineering/" "Python software engineering"
    click Scientific_Python "https://gjbex.github.io/Scientific-Python/" "Scientific Python"
    click Generative_ai_for_software_engineering_and_data_analysis "https://gjbex.github.io/Training-sessions/generative_ai_for_software_engineering_and_data_analysis" "Generative AI for software engineering and data analysis"
    click Machine_learning_with_Python "https://gjbex.github.io/Training-sessions/machine_learning_with_python" "Machine learning with Python"
    click MLOps_on_HPC "https://gjbex.github.io/Training-sessions/mlops_on_hpc" "MLOps on HPC"
    click Python_dashboards "https://gjbex.github.io/Python-dashboards/" "Python dashboards"
    click Python_for_HPC "https://gjbex.github.io/Python-for-HPC/" "Python for HPC"
    click Python_for_data_science "https://gjbex.github.io/Python-for-data-science/" "Python for data science"
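
The arrows in the graph above can be read as prerequisites, so any order that respects them is a valid way through the sessions. As a minimal sketch, assuming Python 3.9 or later (for the standard-library `graphlib` module), the snippet below derives one such order for a subset of the sessions and edges. The `prerequisites` dictionary and the `learning_order` function are illustrative names, not part of any training material.

```python
from graphlib import TopologicalSorter

# A subset of the sessions and arrows from the graph (illustrative, not complete).
# Each key maps to the set of sessions that have an arrow pointing to it.
prerequisites = {
    "HPC introduction": {"Linux introduction"},
    "Python for programmers": {"Python for beginners"},
    "Best practices for data science on HPC": {
        "Best practices for scientific computing",
        "HPC introduction",
        "Python for programmers",
    },
    "Version control with Git": {"Best practices for data science on HPC"},
    "MLOps on HPC": {"Version control with Git"},
    "Scientific Python": {"Python for programmers"},
    "Python for data science": {"Scientific Python"},
    "Machine learning with Python": {"Python for data science", "MLOps on HPC"},
}


def learning_order(prereqs):
    """Return one order in which every session follows its prerequisites."""
    return list(TopologicalSorter(prereqs).static_order())


order = learning_order(prerequisites)
for position, session in enumerate(order, start=1):
    print(f"{position:2d}. {session}")
```

Several orders usually satisfy the same prerequisites, so the printed list is one possibility rather than a prescribed sequence.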

If you are new to scientific computing, you may want to start with "Best practices for scientific computing".

The next step is to familiarize yourself with the basics of working on the Linux command line and with the HPC infrastructure.

Since you need some programming skills to do data science, you may want to start with "Python for beginners", followed by "Python for programmers".

Since quite a few best practices are specific to data science on HPC, you may want to follow the "Best practices for data science on HPC" training session.

Learn how to manage your code with version control in the "Version control with Git" training session, and how to manage your data and experiments with MLOps in the "MLOps on HPC" training session.

Containers are useful in data science both for creating a complete, stable and portable development environment and for distributing your software. For more information on this topic, see "Containers on HPC".

Workflows are essential to automate your data science tasks. For more information on this topic, see "Workflows for HPC".

"Scientific Python" introduces the Python libraries commonly used in scientific computing, while "Python for data science" introduces those commonly used in data science.

Performance is of course important in data science on HPC, so you may want to follow the "Python for HPC" training session.