Luca Parisi edited this page Aug 15, 2024 · 9 revisions

Main Topics

Basic Functionality

  • Parallelisation over SMs ( teams )
  • Parallelisation over warps ( parallel )
  • Data Transfers ( map directives )
  • #pragma omp declare
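
As a rough illustration of how the basic directives above fit together, here is a minimal sketch, not course material. The names `scale_add` and `axpy` are hypothetical. It assumes a compiler with OpenMP offloading support; without it the pragmas are ignored and the loop runs serially on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: "declare target" makes it callable from device code.
#pragma omp declare target
inline double scale_add(double a, double x, double y) { return a * x + y; }
#pragma omp end declare target

// Computes y[i] = a * x[i] + y[i] for all i (hypothetical example kernel).
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const double* xp = x.data();
    double* yp = y.data();

    // teams distribute: split the iterations across teams.
    // parallel for:     split each team's share across its threads.
    // map(to)/map(tofrom): copy x to the device, copy y in and back out.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = scale_add(a, xp[i], yp[i]);
    }
}
```

Calling `axpy(2.0, x, y)` updates `y` in place. Raw pointers are used in the map clauses because array sections need a pointer or array base.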

Advanced Topics

  • Custom mappers ( #1 )
    • mapping of a class containing pointers
  • Custom memory allocators ( #4 )
    • Per-thread allocations ( firstprivate ) vs shared variables ( which may live in shared, global or local memory, or be passed as a kernel argument )
    • Pinned memory allocation on CPU
    • shared memory allocations
  • Concurrency
    • Submit kernels from multiple threads (#3). Demonstrates using different CUDA/HIP streams.
    • Use OpenMP tasking (#7). Demonstrates overlapping memory transfers with execution, or running multiple small kernels that would be difficult to merge into one bigger kernel.
  • Interoperability ( #2 )
    • Demonstrate how to use CUDA with a variable mapped from OpenMP, and how to use a variable allocated from CUDA in OpenMP.
    • Example of using cuFFT ( or any CUDA/ROCm numerical library ) together with OpenMP
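
The tasking item under Concurrency could look roughly like the sketch below. It is one possible approach, not the content of #7. The function name `scale_in_chunks` and the chunking scheme are assumptions. Each chunk is launched as a deferred target task with `nowait`, so the runtime is free to overlap the transfers and kernels of different chunks; whether it actually does is implementation dependent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Doubles every element of v, one independent target task per chunk
// (hypothetical example; name and chunking scheme are illustrative).
void scale_in_chunks(std::vector<double>& v, std::size_t num_chunks) {
    assert(num_chunks > 0);
    double* p = v.data();
    const std::size_t n = v.size();
    const std::size_t chunk = (n + num_chunks - 1) / num_chunks;  // ceiling division

    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t begin = c * chunk;
        if (begin >= n) break;
        const std::size_t len = (begin + chunk <= n) ? chunk : n - begin;

        // nowait turns the target region into a deferred task, so the host
        // thread can go on to submit the next chunk immediately.
        #pragma omp target teams distribute parallel for nowait \
            map(tofrom: p[begin:len])
        for (std::size_t i = begin; i < begin + len; ++i) {
            p[i] *= 2.0;
        }
    }

    // Wait for all chunk tasks submitted above before returning.
    #pragma omp taskwait
}
```

The chunks touch disjoint ranges, so no `depend` clauses are needed here; dependent kernels would need them to enforce ordering.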

Performance

  • OpenMP generic ( CPU ) mode
  • Occupancy
  • Memory bandwidth ( global/shared ), roofline plot
  • Coalesced access ( global memory )
  • Bank Conflicts ( shared memory )

Exercises

Jacobi tutorial

  • Naive implementation [ teams only, teams + parallel ]
  • Add mapping directives to control memory transfers
  • Use a custom mapper for transferring the whole class
  • Split into subdomains to overlap transfer and computation ( use streams )
  • Use shared memory to improve effective memory bandwidth ( custom memory allocators ? )
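
A possible starting point for the first two steps of the tutorial is sketched below. It is an assumption about what the exercise might look like, not the reference solution. The function name `jacobi_sweeps`, the row-major n x n layout and the 2D Laplace stencil are all illustrative choices. The `target data` region keeps both grids on the device for the whole iteration loop, so the per-sweep map clauses find the data already present and trigger no transfers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Runs `iters` Jacobi sweeps of the 2D Laplace stencil on an n x n grid
// stored row-major in u. Boundary values are kept fixed.
// (Hypothetical sketch; name, layout and stencil are assumptions.)
void jacobi_sweeps(std::vector<double>& u, std::size_t n, int iters) {
    assert(u.size() == n * n);
    if (n < 3) return;  // no interior points to update

    std::vector<double> tmp(u);  // second buffer, same boundary values
    double* a = u.data();
    double* b = tmp.data();
    const std::size_t total = n * n;

    // Keep both buffers resident on the device across all sweeps.
    #pragma omp target data map(tofrom: a[0:total]) map(tofrom: b[0:total])
    {
        for (int it = 0; it < iters; ++it) {
            // Alternate read/write roles instead of swapping mapped pointers.
            const double* src = (it % 2 == 0) ? a : b;
            double* dst = (it % 2 == 0) ? b : a;

            #pragma omp target teams distribute parallel for collapse(2) \
                map(to: src[0:total]) map(from: dst[0:total])
            for (std::size_t i = 1; i < n - 1; ++i) {
                for (std::size_t j = 1; j < n - 1; ++j) {
                    dst[i * n + j] = 0.25 * (src[(i - 1) * n + j] + src[(i + 1) * n + j] +
                                             src[i * n + j - 1] + src[i * n + j + 1]);
                }
            }
        }
    }

    // After an odd number of sweeps the latest result is in tmp.
    if (iters % 2 == 1) u.swap(tmp);
}
```

Dropping the `target data` region gives the naive version, where both grids are transferred on every sweep; comparing the two is a simple way to see the effect of the mapping directives.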
