Commit 99c749d

Added GPU debugging and update cursor rules (#952)
Co-authored-by: Spencer Bryngelson <[email protected]>
1 parent 658f188 commit 99c749d

3 files changed

+167
-8
lines changed

.cursor/rules/mfc-agent-rules.mdc

Lines changed: 10 additions & 8 deletions
@@ -16,7 +16,7 @@ Written primarily for Fortran/Fypp; the OpenACC and style sections matter only w
1616
- Most sources are `.fpp`; CMake transpiles them to `.f90`.
1717
- **Fypp macros** live in `src/<subprogram>/include/`; you should scan these first.
1818
`<subprogram>` ∈ {`simulation`,`common`,`pre_process`,`post_process`}.
19-
- Only `simulation` (+ its `common` calls) is GPU-accelerated via **OpenACC**.
19+
- Only `simulation` (+ its `common` calls) is GPU-accelerated via **OpenACC** or **OpenMP**.
2020
- Assume free-form Fortran 2008+, `implicit none`, explicit `intent`, and modern
2121
intrinsics.
2222
- Prefer `module … contains … subroutine foo()`; avoid `COMMON` blocks and
@@ -56,27 +56,29 @@ Written primarily for Fortran/Fypp; the OpenACC and style sections matter only w
5656
* Every variable: `intent(in|out|inout)` + appropriate `dimension` / `allocatable`
5757
/ `pointer`.
5858
* Use `s_mpi_abort(<msg>)` for errors, not `stop`.
59-
* Mark OpenACC-callable helpers that are called from OpenACC parallel loops immediately after declaration:
59+
* Mark GPU-callable helpers that are called from GPU parallel loops immediately after declaration:
6060
```fortran
6161
subroutine s_flux_update(...)
62-
!$acc routine seq
62+
$:GPU_ROUTINE(function_name='s_flux_update', parallelism='[seq]')
6363
...
6464
end subroutine
6565
```
6666

6767
---
6868

69-
# 3 OpenACC Programming Guidelines (for kernels)
69+
# 3 FYPP Macros for GPU Acceleration Programming Guidelines (for kernels)
70+
71+
Do not use OpenACC or OpenMP directives directly. Instead, use the FYPP macros in `src/common/include/parallel_macros.fpp`.
7072

7173
Wrap tight loops with
7274

7375
```fortran
74-
!$acc parallel loop gang vector default(present) reduction(...)
76+
$:GPU_PARALLEL_FOR(private='[...]', copy='[...]')
7577
```
76-
* Add `collapse(n)` to merge nested loops when safe.
77-
* Declare loop-local variables with `private(...)`.
78+
* Add `collapse=n` to merge nested loops when safe.
79+
* Declare loop-local variables with `private='[...]'`.
7880
* Allocate large arrays with `managed` or move them into a persistent
79-
`!$acc enter data` region at start-up.
81+
`$:GPU_ENTER_DATA(...)` region at start-up.
8082
* **Do not** place `stop` / `error stop` inside device code.
8183
* Must compile with Cray `ftn` and NVIDIA `nvfortran` for GPU offloading; also build CPU-only with
8284
GNU `gfortran` and Intel `ifx`/`ifort`.

docs/documentation/gpuDebugging.md

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
1+
# Debugging Tools and Tips for GPUs
2+
3+
## Compiler-agnostic tools
4+
5+
### OpenMP tools
6+
```bash
7+
OMP_DISPLAY_ENV=true | false | verbose
8+
```
9+
- Prints out the internal control values and environment variables at the beginning of the program if `true` or `verbose`
10+
- `verbose` will also print out vendor-specific internal control values and environment variables
11+
12+
```bash
13+
OMP_TARGET_OFFLOAD=MANDATORY | DISABLED | DEFAULT
14+
```
15+
- Quick way to turn off offload (`DISABLED`) or make the program abort if a GPU isn't found (`MANDATORY`)
16+
- Great first test: does the problem disappear when you drop back to the CPU?
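The CPU-versus-GPU comparison suggested above can be scripted. The following is a minimal sketch, not part of MFC's tooling: `run_offload_ab` is a hypothetical helper name, and the `sh -c 'echo ...'` command is only a stand-in so the snippet runs anywhere. Substitute your own launch line.

```bash
# Run the same command twice: once with offload disabled, once with it mandatory.
# run_offload_ab is a hypothetical helper, not an MFC or OpenMP tool.
run_offload_ab() {
    local mode
    for mode in DISABLED MANDATORY; do
        echo "== OMP_TARGET_OFFLOAD=${mode} =="
        # env sets the variable for this one command only
        env OMP_TARGET_OFFLOAD="${mode}" "$@"
    done
}

# Stand-in command that prints the value the child process sees.
# Replace it with your own launch line (hypothetical example):
#   run_offload_ab srun -n 1 ./my_app
ab_output=$(run_offload_ab sh -c 'echo "child sees: $OMP_TARGET_OFFLOAD"')
echo "$ab_output"
```

If the failure disappears in the `DISABLED` run but not in the `MANDATORY` run, the problem is likely in device code or data movement rather than in the algorithm itself.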
17+
18+
```bash
19+
OMP_THREAD_LIMIT=<positive_integer>
20+
```
21+
- Sets the maximum number of OpenMP threads to use in a contention group
22+
- Might be useful in checking for issues with contention or race conditions
23+
24+
```bash
25+
OMP_DISPLAY_AFFINITY=TRUE
26+
```
27+
- Will display affinity bindings for each OpenMP thread, containing hostname, process identifier, OS thread identifier, OpenMP thread identifier, and affinity binding.
28+
29+
## Cray Compiler Tools
30+
31+
### Cray General Options
32+
33+
```bash
34+
CRAY_ACC_DEBUG=0 | 1 | 2 | 3   # 0 = off, 3 = very noisy
35+
```
36+
- Dumps a time-stamped log line (`ACC: ...`) for every allocation, data transfer, kernel launch, wait, etc. Great first stop when "nothing seems to run on the GPU."
37+
38+
- Outputs on STDERR by default. Can be changed by setting `CRAY_ACC_DEBUG_FILE`.
39+
- Recognizes `stderr`, `stdout`, and `process`.
40+
- `process` automatically generates a new file based on `pid` (each MPI process will have a different file)
41+
42+
- Although the variable name says ACC, it applies to both OpenACC and OpenMP
43+
44+
```bash
45+
CRAY_ACC_FORCE_EARLY_INIT=1
46+
```
47+
- Force full GPU initialization at program start so you can see start-up hangs immediately
48+
- Default behavior when the variable is unset is to defer initialization until first use
49+
- Device initialization includes initializing the GPU vendor’s low-level device runtime library (e.g., libcuda for NVIDIA GPUs) and establishing all necessary software contexts for interacting with the device
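The Cray general options above can be combined in a job-script preamble. This is a minimal sketch that assumes verbosity level 2 and per-process log files suit your case. It only sets and prints the variables and does not launch anything.

```bash
# Moderate verbosity; level 3 is described above as very noisy.
export CRAY_ACC_DEBUG=2
# One log file per process, so output from MPI ranks does not interleave.
export CRAY_ACC_DEBUG_FILE=process
# Initialize the GPU at program start so start-up hangs show up immediately.
export CRAY_ACC_FORCE_EARLY_INIT=1

# Confirm what the application will inherit.
env | grep '^CRAY_ACC_' | sort
```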
50+
51+
### Cray OpenACC Options
52+
53+
```bash
54+
CRAY_ACC_PRESENT_DUMP_SAVE_NAMES=1
55+
```
56+
- Will cause `acc_present_dump()` to output variable names and file locations in addition to variable mappings
57+
- Add `acc_present_dump()` around hotspots to help find problems with data movements
58+
- Most helpful when combined with the `CRAY_ACC_DEBUG` environment variable
59+
60+
## NVHPC Compiler Options
61+
62+
### NVHPC General Options
63+
64+
```bash
65+
STATIC_RANDOM_SEED=1
66+
```
67+
- Forces the seed returned by `RANDOM_SEED` to be constant, so it generates the same sequence of random numbers
68+
- Useful for testing issues with randomized data
69+
70+
```bash
71+
NVCOMPILER_TERM=option[,option]
72+
```
73+
- `[no]debug`: Enables/disables just-in-time debugging (debugging invoked on error)
74+
- `[no]trace`: Enables/disables stack traceback on error
75+
76+
### NVHPC OpenACC Options
77+
78+
```bash
79+
NVCOMPILER_ACC_NOTIFY=<bitmask>
80+
```
81+
- Set this variable to a bitmask to print information to stderr for the following events:
82+
- kernel launches: 1
83+
- data transfers: 2
84+
- region entry/exit: 4
85+
- wait operations or synchronizations with the device: 8
86+
- device memory allocations and deallocations: 16
87+
- 1 (kernels only) is the usual first step. 3 (kernels + copies) is great for "why is it so slow?"
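Because the value is a bitmask, the bits listed above can be OR-ed together in the shell rather than added by hand. A small sketch follows. The `NOTIFY_*` names are illustrative shell variables, not part of NVHPC.

```bash
# Bit values taken from the list above; the NOTIFY_* names are illustrative only.
NOTIFY_KERNELS=1
NOTIFY_TRANSFERS=2
NOTIFY_REGIONS=4
NOTIFY_WAITS=8
NOTIFY_ALLOCS=16

# Kernels + data transfers: the "why is it so slow?" combination.
export NVCOMPILER_ACC_NOTIFY=$((NOTIFY_KERNELS | NOTIFY_TRANSFERS))
echo "NVCOMPILER_ACC_NOTIFY=${NVCOMPILER_ACC_NOTIFY}"

# Every event type at once.
notify_all=$((NOTIFY_KERNELS | NOTIFY_TRANSFERS | NOTIFY_REGIONS | NOTIFY_WAITS | NOTIFY_ALLOCS))
echo "all events: ${notify_all}"
```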
88+
89+
```bash
90+
NVCOMPILER_ACC_TIME=1
91+
```
92+
- Lightweight profiler
93+
- Prints a tidy end-of-run table with per-region and per-kernel times and bytes moved
94+
- Do not use with CUDA profiler at the same time
95+
96+
```bash
97+
NVCOMPILER_ACC_DEBUG=1
98+
```
99+
- Spews everything the runtime sees: host/device addresses, mapping events, present-table look-ups, etc.
100+
- Great for "partially present" or "pointer went missing" errors.
101+
- [Doc for NVCOMPILER_ACC_DEBUG](https://docs.nvidia.com/hpc-sdk/archive/20.9/pdf/hpc209openacc_gs.pdf)
102+
- Ctrl+F for `NVCOMPILER_ACC_DEBUG`
103+
104+
### NVHPC OpenMP Options
105+
106+
```bash
107+
LIBOMPTARGET_PROFILE=run.json
108+
```
109+
- Emits a Chrome-trace (JSON) timeline you can open in chrome://tracing or Speedscope
110+
- Great lightweight profiler when Nsight is overkill.
111+
- Granularity in µs via `LIBOMPTARGET_PROFILE_GRANULARITY` (default 500).
112+
113+
```bash
114+
LIBOMPTARGET_INFO=<bitmask>
115+
```
116+
- Prints out different types of runtime information
117+
- Human-readable log of data-mapping inserts/updates, kernel launches, copies, waits.
118+
- Perfect first stop for "why is nothing copied?"
119+
- Flags
120+
- Print all data arguments upon entering an OpenMP device kernel: 0x01
121+
- Indicate when a mapped address already exists in the device mapping table: 0x02
122+
- Dump the contents of the device pointer map at kernel exit: 0x04
123+
- Indicate when an entry is changed in the device mapping table: 0x08
124+
- Print OpenMP kernel information from device plugins: 0x10
125+
- Indicate when data is copied to and from the device: 0x20
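The flags above combine the same way. A minimal sketch, assuming you want kernel arguments plus data copies. The `INFO_*` names are illustrative shell variables, not part of the runtime.

```bash
# Flag values taken from the list above; the INFO_* names are illustrative only.
INFO_KERNEL_ARGS=0x01
INFO_DATA_COPIES=0x20

# OR the flags together; the shell stores the result in decimal.
export LIBOMPTARGET_INFO=$((INFO_KERNEL_ARGS | INFO_DATA_COPIES))
printf 'LIBOMPTARGET_INFO=%d (0x%x)\n' "$LIBOMPTARGET_INFO" "$LIBOMPTARGET_INFO"
```

Printing both the decimal and hexadecimal forms makes it easier to check the mask against the flag list.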
126+
127+
```bash
128+
LIBOMPTARGET_DEBUG=1
129+
```
130+
- Developer-level trace (host-side)
131+
- Much noisier than `INFO`
132+
- Only works if the runtime was built with `-DOMPTARGET_DEBUG`.
133+
134+
```bash
135+
LIBOMPTARGET_JIT_OPT_LEVEL=-O{0,1,2,3}
136+
```
137+
- This environment variable can be used to change the optimization pipeline used to optimize the embedded device code as part of the device JIT.
138+
- The value corresponds to the `-O{0,1,2,3}` command line argument passed to clang.
139+
140+
```bash
141+
LIBOMPTARGET_JIT_SKIP_OPT=1
142+
```
143+
- This environment variable can be used to skip the optimization pipeline during JIT compilation.
144+
- If set, the image will only be passed through the backend.
145+
- The backend is invoked with the `LIBOMPTARGET_JIT_OPT_LEVEL` flag.
146+
147+
## Compiler Documentation
148+
149+
- [Cray & OpenMP Docs](https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openmp.7.html#environment-variables)
150+
- [Cray & OpenACC Docs](https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openacc.7.html#environment-variables)
151+
- [NVHPC & OpenACC Docs](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html?highlight=NVCOMPILER_#environment-variables)
152+
- [NVHPC & OpenMP Docs](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html?highlight=NVCOMPILER_#id2)
153+
- [LLVM & OpenMP Docs](https://openmp.llvm.org/design/Runtimes.html)
154+
- NVHPC is built on top of LLVM
155+
- [OpenMP Docs](https://www.openmp.org/spec-html/5.1/openmp.html)
156+
- [OpenACC Docs](https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.7.pdf)

docs/documentation/readme.md

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@
1010
- [Flow Visualization](visualization.md)
1111
- [Performance](expectedPerformance.md)
1212
- [GPU Parallelization](gpuParallelization.md)
13+
- [GPU Debugging](gpuDebugging.md)
1314
- [MFC's Authors](authors.md)
1415
- [References](references.md)
1516
