Home


Luca Parisi edited this page Aug 15, 2024 · 9 revisions

Main Topics

Basic Functionality

Parallelisation over SMs (teams)
Parallelisation over warps (parallel)
Data transfers (map clauses)
#pragma omp declare
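The basic constructs above are often combined into a single directive. Below is a minimal sketch, assuming a simple vector update y = a*x + y. The names `axpy_elem` and `saxpy` are illustrative only. `teams distribute` spreads the loop iterations across teams, `parallel for` spreads each team's share across its threads, the `map` clauses control the copies, and `declare target` (one of the `#pragma omp declare` directives) compiles a helper function for the device. When OpenMP is disabled the pragmas are ignored and the loop simply runs serially on the host.

```cpp
#include <cassert>
#include <vector>

// declare target: also compile this helper for the device, so it can be
// called from inside a target region.
#pragma omp declare target
inline double axpy_elem(double a, double x, double y) { return a * x + y; }
#pragma omp end declare target

// y = a * x + y, offloaded to the default device.
void saxpy(double a, const std::vector<double>& x, std::vector<double>& y) {
  assert(x.size() == y.size());
  const int n = static_cast<int>(x.size());
  // Map the underlying buffers, not the std::vector objects themselves.
  const double* xp = x.data();
  double* yp = y.data();
  // target: run on the device; teams distribute: split iterations across
  // teams; parallel for: split each team's chunk across its threads.
  // map(to:) copies x to the device only; map(tofrom:) copies y both ways.
  #pragma omp target teams distribute parallel for \
      map(to: xp[0:n]) map(tofrom: yp[0:n])
  for (int i = 0; i < n; ++i)
    yp[i] = axpy_elem(a, xp[i], yp[i]);
}
```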

Advanced Topics

Custom mappers (#1)
- Mapping a class that contains pointer members
Custom memory allocators (#4)
- Per-thread allocations (firstprivate) vs. shared variables (which may live in shared, global, or local memory, or instead be passed as a kernel argument)
- Pinned (page-locked) memory allocation on the host (CPU)
- Shared memory allocations
Concurrency
- Submitting kernels from multiple host threads (#3), demonstrating the use of different CUDA/HIP streams
- Using OpenMP tasking (#7) to overlap memory transfers with execution, or to overlap multiple small kernels that would be difficult to merge into one larger kernel
Interoperability (#2)
- Demonstrate how to use a variable mapped by OpenMP from CUDA code, and how to use a variable allocated by CUDA from OpenMP
- Example of using cuFFT (or any other CUDA/ROCm numerical library) together with OpenMP
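For the "class containing pointers" case, OpenMP 5.0 added `#pragma omp declare mapper`. The sketch below assumes a hypothetical `Field` type with a raw `data` pointer and a length `n`; `Field` and `scale` are illustrative names, not part of any existing code. The mapper tells the runtime to map the struct together with the `n` elements that `data` points to, so a plain `map(tofrom: f)` moves the whole object. It needs a compiler that implements `declare mapper`; with OpenMP disabled the pragmas are ignored and the code runs on the host.

```cpp
#include <cassert>

// Hypothetical class with a pointer member (public, so the mapper can see it).
struct Field {
  double* data;  // points to n doubles owned elsewhere
  int n;
};

// Default mapper for Field: map the struct itself plus the array section
// data[0:n]; the device copy of `data` is attached to the device buffer.
#pragma omp declare mapper(Field f) map(f, f.data[0:f.n])

// Multiply every element of the field by s on the device.
void scale(Field& f, double s) {
  assert(f.n >= 0 && (f.data != nullptr || f.n == 0));
  // map(tofrom: f) invokes the default Field mapper declared above.
  #pragma omp target teams distribute parallel for map(tofrom: f)
  for (int i = 0; i < f.n; ++i)
    f.data[i] *= s;
}
```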
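For the tasking item, one common pattern, sketched here under stated assumptions, splits an array into chunks and offloads each chunk as its own `target ... nowait` task, so the runtime may overlap one chunk's transfers with another chunk's kernel. The function `scale_chunked` and its chunking scheme are hypothetical. Whether the chunks actually run concurrently (for example on separate CUDA/HIP streams) depends on the OpenMP implementation.

```cpp
#include <cassert>

// Hypothetical example: scale a[0:n] by s in nchunks pieces. Each piece is
// its own deferred target task (nowait), so the runtime may overlap the
// copies of one chunk with the kernel of another.
void scale_chunked(double* a, int n, int nchunks, double s) {
  assert(n >= 0 && nchunks > 0);
  const int chunk = (n + nchunks - 1) / nchunks;  // ceiling division
  #pragma omp parallel
  #pragma omp single
  {
    for (int c = 0; c < nchunks; ++c) {
      const int lo = c * chunk;
      const int len = (lo + chunk <= n) ? chunk : n - lo;
      if (len <= 0) break;  // fewer non-empty chunks than requested
      // Each task maps only its own slice: copy in, compute, copy out.
      #pragma omp target teams distribute parallel for nowait \
          map(tofrom: a[lo:len])
      for (int i = lo; i < lo + len; ++i)
        a[i] *= s;
    }
    // Wait for all chunk tasks generated above.
    #pragma omp taskwait
  }
}
```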

Performance

OpenMP generic (CPU) mode
Occupancy
Memory bandwidth (global/shared), roofline plot
Coalesced access (global memory)
Bank conflicts (shared memory)
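For the coalesced-access item, the two hypothetical kernels below copy an `n` x `n` row-major matrix and produce the same result. In `copy_coalesced` the innermost collapsed index walks contiguous memory; in `copy_strided` the loops are swapped, so consecutive iterations are `n` elements apart. This assumes, as is typical of current GPU offloading compilers but not guaranteed by the OpenMP specification, that consecutive iterations are handed to consecutive threads of a warp. Under that assumption only the first version lets a warp's loads combine into a few wide memory transactions.

```cpp
#include <cassert>

// Copy an n x n row-major matrix with the innermost collapsed loop index j
// walking contiguous addresses: iterations (i, j) and (i, j + 1) touch
// neighbouring doubles.
void copy_coalesced(const double* in, double* out, int n) {
  assert(n >= 0);
  #pragma omp target teams distribute parallel for collapse(2) \
      map(to: in[0:n * n]) map(from: out[0:n * n])
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      out[i * n + j] = in[i * n + j];
}

// Same result, loops swapped: consecutive iterations (j, i) and (j, i + 1)
// are n doubles apart, so a warp's accesses are strided.
void copy_strided(const double* in, double* out, int n) {
  assert(n >= 0);
  #pragma omp target teams distribute parallel for collapse(2) \
      map(to: in[0:n * n]) map(from: out[0:n * n])
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i)
      out[i * n + j] = in[i * n + j];
}
```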

Exercises

Jacobi tutorial

Naive implementation [teams only, teams + parallel]
Add mapping directives to control memory transfers
Use a custom mapper to transfer the whole class
Split the domain into subdomains to overlap transfers with computation (use streams)
Use shared memory to improve effective bandwidth (custom memory allocators?)
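A minimal sketch of the first two Jacobi steps (teams + parallel, plus explicit mapping), assuming a 2D Laplace problem on a row-major grid of `ny` rows and `nx` columns with fixed boundary values. `jacobi` and its parameters are hypothetical names. The `target data` region keeps both grids on the device across all iterations, so data moves only at the start and end rather than on every sweep.

```cpp
#include <cassert>

// Hypothetical Jacobi solver for the 2D Laplace equation on an ny x nx
// row-major grid. u holds the initial guess including the fixed boundary;
// unew is scratch space of the same size.
void jacobi(double* u, double* unew, int nx, int ny, int iters) {
  assert(nx >= 3 && ny >= 3 && iters >= 0);
  const int size = nx * ny;
  // Keep both grids resident on the device for all iterations:
  // u is copied in and back out once; unew is only allocated on the device.
  #pragma omp target data map(tofrom: u[0:size]) map(alloc: unew[0:size])
  {
    for (int it = 0; it < iters; ++it) {
      // Interior update; teams + parallel over the collapsed (i, j) space.
      #pragma omp target teams distribute parallel for collapse(2)
      for (int i = 1; i < ny - 1; ++i)
        for (int j = 1; j < nx - 1; ++j)
          unew[i * nx + j] = 0.25 * (u[(i - 1) * nx + j] + u[(i + 1) * nx + j] +
                                     u[i * nx + j - 1] + u[i * nx + j + 1]);
      // Copy the interior back into u (instead of swapping pointers), so the
      // result always ends up in the buffer copied out at the end.
      #pragma omp target teams distribute parallel for collapse(2)
      for (int i = 1; i < ny - 1; ++i)
        for (int j = 1; j < nx - 1; ++j)
          u[i * nx + j] = unew[i * nx + j];
    }
  }
}
```

The later exercises (custom mapper, subdomains with streams, shared memory) would each modify this baseline.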
