Topic outline

    • [slides available for enrolled students]

    • What we learned on Monday:

      • Modern CPUs have SIMD hardware.
        • We (usually) depend on the compiler to vectorize serial code.
      • A GPU is a co-processor.
        • It’s similar to a CPU but with more parallel machinery.
        • It has its own instruction stream and memory.
        • Therefore, an application will have both host code and device code.
        • Similarly, variables can live either in host RAM (CPU side) or in device memory (GPU side).
          • Copying data back and forth is needed.
          • Copying data can be explicit or implicit.
      • The thread is the basic unit of parallelism (thread hierarchy).
        • Threads are collected into blocks, the blocks make up the grid.
        • The total number of threads should be >> number of CUDA “cores” / stream processors.
      • We looked at basic CUDA and Thrust programs.
        • In CUDA,
          • We created pointers to device memory.
          • We copied data between host and device.
          • We wrote and launched a kernel.
        • In Thrust,
          • We created host and device vectors.
          • We launched a kernel implicitly using the thrust::transform algorithm.
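A minimal CUDA sketch of the steps above (a device pointer, host/device copies, a kernel launch). The kernel name, array size, and block shape are illustrative choices, and running it requires a CUDA-capable GPU and nvcc:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Kernel (device code): each thread scales one array element.
__global__ void scale(float *x, float a, int n) {
    // Map the thread hierarchy to a global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n];            // host buffer
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = nullptr;                 // pointer to device memory
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);        // expect 2.0
    cudaFree(d);
    delete[] h;
}
```

The Thrust version of the same program would replace the explicit allocation and copies with thrust::host_vector/thrust::device_vector and launch the kernel implicitly through thrust::transform.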

    • What we learned on Wednesday:

      • We discussed the diffusion equation.
        • Finite-difference method, stencils.
      • We showcased a few more frameworks:
        • HIP is essentially a clone of CUDA; the differences are mostly branding (API names start with hip instead of cuda).
          • It’s part of AMD’s open-source ROCm platform.
        • OpenCL is an open standard.
          • It’s similar to CUDA/HIP but with different nomenclature.
          • Device code is a string, compiled at runtime.
          • Behavior and performance depend on each vendor’s implementation, which can be problematic.
          • Tonnes of boilerplate code.
        • Directive-based approaches use #pragma directives to transform loops into GPU code.
          • Easy to get started.
          • Complex code may require extensive optimization.
        • SYCL is a modern C++-based standard.
          • Championed primarily by Intel (e.g., in its oneAPI/DPC++ toolchain).
        • Numba is a JIT compiler for Python.
          • It can target Nvidia GPUs.
          • It offers a kernel mode and a ufunc/vectorize mode, plus reduction operations.