Course: HPC133 Introduction to GPU Programming (Feb 2023)

Topic outline

General

- Course Description
  
  HPC133 Introduction to GPU Programming (Feb 2023)
  An overview of GPUs and their use in supercomputers. This workshop will explain what GPUs are and cover the basic ideas of GPU use in scientific computing. We will introduce several GPU programming frameworks and demonstrate how to accelerate the solution of a scientific problem using a GPU. Python or C++ can be used for the assignment.
  Format: In person, but also broadcast.
  Instructor: Yohai Meiron
  Start date: 21 Feb. 2023
  End date: 24 Feb. 2023
  Number of credits - high-performance computing: 6
  Events:
  Intro to GPU Programming #1 - Tuesday, February 21, 12:30 to 14:00
  Intro to GPU Programming #2 - Wednesday, February 22, 12:30 to 14:00
  Intro to GPU Programming #3 - Friday, February 24, 13:30 to 15:00
- Location
  
  Location: SciNet Teaching Room, 11th floor on the MaRS West tower, 661 University Ave., Suite 1140, Toronto, ON M5G 1M1
  Dates and times: Feb. 21, 22, & 24, from 12:30 to 14:00 EST
- Announcements
  
  Announcements Forum
- Feedback
  
  Feedback
  
  Please fill in this very short survey
  
  Not available unless you are a member of Active participants
Course material
- Slides
  
  Slides (file)
  
  [updated with Friday's slides]
- Source code
  
  Source code (file)
  
  This tarball contains the source code for the vector addition examples, gravitational potential exercise, and sample code for the diffusion equation homework exercise.
  
  The src/gravity directory contains two Python programs to test and benchmark the different solutions for the gravitational potential exercise; the solution functions are in the gravity_calculators subdirectory. The Sapporo N-body library has to be downloaded, patched, and built before it can be used here; this is done automatically by the get_sapporo.sh script in the sapporo2 directory (tested on Mist; some tweaks required to get this program to work on other systems).
- What we learned on Tuesday
  
  What we learned on Tuesday:
  
  Modern CPUs have SIMD hardware.
  
  We (usually) depend on the compiler to vectorize serial code.
  
  A GPU is a co-processor.
  
  It’s similar to a CPU but with more parallel machinery.
  
  It has its own instruction stream and memory.
  
  Therefore, an application will have both host code and device code.
  
  Similarly, variables can live either in RAM (CPU side) or in video memory (GPU side).
  
  Copying data back and forth is needed.
  
  Copying data can be explicit or implicit.
  
  The thread is the basic unit of parallelism (thread hierarchy).
  
  Threads are grouped into blocks, and the blocks make up the grid.
  
  The total number of threads should be much larger than (>>) the number of CUDA “cores” / stream processors.
  
  We looked at a basic CUDA program:
  
  We created pointers to device memory.
  
  Copied memory between host and device.
  
  Wrote and launched a kernel.
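The thread hierarchy summarized above can be illustrated without a GPU. The following is a minimal CPU-only sketch in plain Python, not the CUDA program from the slides: it emulates how a 1D launch of blocks and threads covers the elements of a vector addition. The function name `launch_vector_add` and the block size of 4 are assumptions chosen for illustration.

```python
import math

def launch_vector_add(a, b, threads_per_block=4):
    """Emulate on the CPU how a 1D kernel launch covers an array."""
    n = len(a)
    out = [0.0] * n
    # Enough blocks to cover all n elements (round up).
    blocks_per_grid = math.ceil(n / threads_per_block)
    for block_idx in range(blocks_per_grid):         # blocks in the grid
        for thread_idx in range(threads_per_block):  # threads in a block
            # Global index, analogous to blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i < n:  # the last block may have surplus threads
                out[i] = a[i] + b[i]
    return out, blocks_per_grid

result, blocks = launch_vector_add([1.0, 2.0, 3.0, 4.0, 5.0], [10.0] * 5)
print(result, blocks)
```

On a real GPU the two loops do not exist: every (block, thread) pair runs the loop body concurrently, which is why the bounds check on `i` is needed when the array length is not a multiple of the block size.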
- What we learned on Wednesday
  
  What we learned on Wednesday:
  
  We discussed the diffusion equation.
  
  Finite-difference method, stencils.
  
  We showcased a few more frameworks:
  
  Thrust is a high-level wrapper around CUDA that provides useful abstractions, such as containers (vectors) and algorithms (transformations, reductions, sorting, ...).
  
  HIP is a clone of CUDA; the difference is just the branding.
  
  It’s part of AMD’s open source ROCm platform.
  
  OpenCL is an open standard.
  
  It’s similar to CUDA/HIP but with different nomenclature.
  
  Device code is a string, compiled at runtime.
  
  It depends on vendor implementation, which is problematic.
  
  Tonnes of boilerplate code.
  
  Directive-based approaches use #pragma directives to transform loops into GPU code.
  
  Easy to get started.
  
  Complex code may require extensive optimization.
  
  SYCL is a modern C++-based standard.
  
  Pushed by Intel.
  
  Numba is a JIT compiler for Python.
  
  Can be used on Nvidia GPUs.
  
  Has kernel and ufunc modes and reduction operations.
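Reductions were mentioned for both Thrust and Numba. As a rough illustration of why a reduction can be parallelized at all, the sketch below performs a pairwise (tree) reduction on the CPU in plain Python. It only illustrates the general idea and is not meant to describe how Thrust or Numba implement reductions internally; `tree_reduce` is a hypothetical name.

```python
import operator

def tree_reduce(values, op=operator.add):
    """Pairwise reduction: each pass combines independent pairs."""
    data = list(values)
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    passes = 0
    while len(data) > 1:
        # Every pair in this pass is independent of the others, so on a GPU
        # each pair could be handled by a different thread.
        paired = [op(data[i], data[i + 1]) for i in range(0, len(data) - 1, 2)]
        if len(data) % 2:  # an odd element carries over unchanged
            paired.append(data[-1])
        data = paired
        passes += 1
    return data[0], passes

total, passes = tree_reduce(range(1, 9))
print(total, passes)
```

The number of passes grows with the logarithm of the input length, whereas a serial loop needs one step per element.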
Recordings
- [Please log in for the recordings]
Assignment
- 2D diffusion equation
  
  2D diffusion equation (assignment)
  
  Due: Saturday, March 11, 2023, 00:00
  
  The assignment is to numerically solve the diffusion (heat) equation in two dimensions, using GPU acceleration, in either Python or C++.
  
  You can find serial, CPU-based solutions in both languages in the course’s source tarball.
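For orientation, here is a minimal sketch of what one serial time step can look like. It is not the course's diff2d.py, which may be organized differently. It assumes an explicit forward-time, centred-space scheme with a five-point stencil, a square grid stored as a list of lists, and fixed boundary values; the names `diffusion_step`, `D`, `dx` and `dt` are chosen here for illustration.

```python
def diffusion_step(u, D, dx, dt):
    """One explicit time step of du/dt = D * laplacian(u) on a square grid."""
    n_rows, n_cols = len(u), len(u[0])
    new = [row[:] for row in u]  # boundary values are kept fixed
    coeff = D * dt / dx**2
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            # Five-point stencil approximating the Laplacian (times dx^2)
            lap = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                   - 4.0 * u[i][j])
            new[i][j] = u[i][j] + coeff * lap
    return new

# 5x5 grid with a single hot cell in the centre
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
D, dx = 1.0, 1.0
dt = 0.2 * dx**2 / D  # below the explicit stability limit dx^2 / (4 D)
grid = diffusion_step(grid, D, dx, dt)
print(grid[2][2], grid[1][2])
```

Each interior cell is updated independently from the old grid, which is what makes the body of the double loop a natural candidate for a GPU kernel.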
  
  If you choose Python, you can follow the instructions on the appropriate slide in the handouts, titled Setting up the environment (Python). If you want graphics on Mist, please also install the matplotlib package in your Conda environment using conda install (this is not needed on Graham, where the package is provided by the scipy-stack module). You can modify the file diff2d.py such that the bulk of the calculation (within the time loop) is done on the GPU. You can follow the gravitational potential calculation example shown in class; Numba and/or CuPy can be used in the solution. Note that the sample solution in Python is equivalent to the “naïve” solution for the gravitational potential problem, and therefore performs very poorly.
  
  If you choose C++, we assume you are familiar with how to compile source code. You can modify the file diff2d.cpp such that the bulk of the calculation (within the time loop) is done on the GPU. On both Mist and Graham, load the following modules: cuda, gcc, pgplot (you may skip PGPLOT, but then please remove the plotting calls from the main source file and do not compile diff2dplot.cpp). If you choose to work with HIP instead of CUDA, on Mist you can load the hip module in addition to the cuda module (which is still necessary since HIP uses the CUDA compiler under the hood when compiling for Nvidia GPUs). HIP is not currently available on Graham, but you can try to install it locally there.
  
  The sample C++ code uses rarray; you can install it locally or just pull the header.
  
  The usual suffix of CUDA files is .cu, and the nvcc command is used to compile the source (instead of g++, for example). You can keep the .cpp suffix, but then you have to pass --x=cu to nvcc (this is not recommended for files that contain a kernel launch with triple angle brackets, as that is not legal C++ syntax). The usual suffix of HIP files is just .cpp, and the hipcc command is used to compile. GPU kernels can be in the same file as the main function, as we saw in the vector addition example, but in more complex applications the GPU code (including kernel launches) would generally be separated out into one or more compilation units that are linked to the rest of the code during the build process.
  
  The Mist login node has four GPUs that can be used for the assignment; on Graham one has to submit a job to the scheduler. Unlike for submitted jobs, there is no guarantee that a GPU on the Mist login node will be free when you launch your application, as the node is shared by everyone. You can use the nvidia-smi command to see the current usage of the GPUs. By default, the first device (number 0) is used, but this behaviour can be changed by setting an environment variable as shown below. For example, if you want to use device number 1:
  
  CUDA_VISIBLE_DEVICES=1 python code.py
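If the program is launched from another Python script (for example a benchmark driver), the variable can be passed through the child's environment instead. The sketch below shows only that mechanism: `run_on_device` is a hypothetical helper, and the child process is a stand-in that reports the value it sees rather than running real GPU code.

```python
import os
import subprocess
import sys

def run_on_device(device, argv):
    """Run a command with CUDA_VISIBLE_DEVICES set for that process only."""
    env = dict(os.environ)  # copy, so our own environment is left alone
    env["CUDA_VISIBLE_DEVICES"] = str(device)
    return subprocess.run(argv, env=env, capture_output=True,
                          text=True, check=True)

# Stand-in for "python code.py": the child just reports what it sees.
child = [sys.executable, "-c",
         "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
seen = run_on_device(1, child).stdout.strip()
print(seen)
```

The variable is read when CUDA is initialized, so it has to be in place before the GPU library starts up; setting it for the child process at launch satisfies that.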
  
  There are three “bonus” tasks that you can try for your own amusement (2 & 3 are beyond the scope of this workshop):
  
  1. The smaller Δx, the more accurate and computationally heavy the solution. Plot the timing of your solution and of the serial CPU-based solution (and possibly improved CPU-based solutions) as a function of Δx.
  
  2. Decompose the domain and solve the problem with multiple GPUs on the same node.
  
  3. Use a distributed memory library to deploy your solution on multiple nodes.
  
  Hint: for a single node you could use multiprocessing in Python and thread or OpenMP in C++. For multiple nodes you could use mpi4py (Python) or MPI (C++).
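For the first bonus task, a small timing harness may be useful. The sketch below is one possible way to collect timings and is not part of the course material: `time_solver` and `dummy_solver` are hypothetical names, the placeholder solver only imitates work that grows with the grid size, and Δx = 1/n assumes a domain of unit length.

```python
import time

def time_solver(solver, grid_sizes, repeats=3):
    """Return (n, best wall-clock seconds) for each grid size n."""
    results = []
    for n in grid_sizes:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            solver(n)
            best = min(best, time.perf_counter() - start)
        results.append((n, best))
    return results

def dummy_solver(n):
    """Placeholder with O(n^2) work, standing in for a real diffusion solver."""
    return sum(i * j for i in range(n) for j in range(n))

timings = time_solver(dummy_solver, [16, 32, 64])
for n, seconds in timings:
    print(f"n = {n:4d}  dx = {1.0 / n:.4f}  time = {seconds:.6f} s")
```

When timing GPU code, make sure the device has finished its work before reading the clock, since kernel launches are asynchronous with respect to the host.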
  
  Not available unless you are a member of Active participants
- Example solution https://educatio...
  
  Example solution
- Example solution
  
  Example solution (file)
  
  Not available unless you are a member of Active participants
