| 1 | +# Debugging Tools and Tips for GPUs |
| 2 | + |
| 3 | +## Compiler agnostic tools |
| 4 | + |
| 5 | +## OpenMP tools |
| 6 | +```bash |
| 7 | +OMP_DISPLAY_ENV=true | false | verbose |
| 8 | +``` |
| 9 | +- Prints out the internal control values and environment variables at the beginning of the program if `true` or `verbose` |
| 10 | +- `verbose` will also print out vendor-specific internal control values and environment variables |
| 11 | + |
| 12 | +```bash |
| 13 | +OMP_TARGET_OFFLOAD=MANDATORY | DISABLED | DEFAULT |
| 14 | +``` |
| 15 | +- Quick way to turn off offloading (`DISABLED`) or make the program abort if a GPU isn't found (`MANDATORY`) |
| 16 | +- Great first test: does the problem disappear when you drop back to the CPU? |
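
A minimal sketch of that first test, under stated assumptions: `APP` is a hypothetical placeholder for your own executable, and it defaults to the shell built-in `true` only so the snippet runs as written. It runs the program once with offload disabled and once with offload mandatory, then reports both exit codes.

```bash
#!/usr/bin/env bash
# APP is a placeholder (hypothetical); point it at your own executable.
APP="${APP:-true}"

# Host-only run: offload is switched off.
OMP_TARGET_OFFLOAD=DISABLED "$APP"
cpu_status=$?

# Device-required run: per the note above, this aborts if no GPU is found.
OMP_TARGET_OFFLOAD=MANDATORY "$APP"
gpu_status=$?

echo "host-only exit code: $cpu_status, offload exit code: $gpu_status"
```

If the two exit codes (or the program's results) differ, the problem is likely tied to the offloaded code path rather than to the algorithm itself.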
| 17 | + |
| 18 | +```bash |
| 19 | +OMP_THREAD_LIMIT=<positive_integer> |
| 20 | +``` |
| 21 | +- Sets the maximum number of OpenMP threads to use in a contention group |
| 22 | +- Might be useful in checking for issues with contention or race conditions |
| 23 | + |
| 24 | +```bash |
| 25 | +OMP_DISPLAY_AFFINITY=TRUE |
| 26 | +``` |
| 27 | +- Will display affinity bindings for each OpenMP thread, containing hostname, process identifier, OS thread identifier, OpenMP thread identifier, and affinity binding. |
| 28 | + |
| 29 | +## Cray Compiler Tools |
| 30 | + |
| 31 | +### Cray General Options |
| 32 | + |
| 33 | +```bash |
| 34 | +CRAY_ACC_DEBUG: 0 (off), 1, 2, 3 (very noisy) |
| 35 | +``` |
| 36 | +- Dumps a time-stamped log line (`ACC: ...`) for every allocation, data transfer, kernel launch, wait, etc. Great first stop when "nothing seems to run on the GPU." |
| 37 | + |
| 38 | +- Writes to stderr by default. Set `CRAY_ACC_DEBUG_FILE` to send the output elsewhere. |
| 39 | + - Recognizes `stderr`, `stdout`, and `process`. |
| 40 | + - `process` automatically generates a new file based on `pid` (each MPI process will have a different file) |
| 41 | + |
| 42 | +- Despite the `ACC` in its name, this environment variable works for both OpenACC and OpenMP |
| 43 | + |
| 44 | +```bash |
| 45 | +CRAY_ACC_FORCE_EARLY_INIT=1 |
| 46 | +``` |
| 47 | +- Force full GPU initialization at program start so you can see start-up hangs immediately |
| 48 | +- Without this environment variable, the default behavior is to defer initialization until first use |
| 49 | +- Device initialization includes initializing the GPU vendor’s low-level device runtime library (e.g., libcuda for NVIDIA GPUs) and establishing all necessary software contexts for interacting with the device |
| 50 | + |
| 51 | +### Cray OpenACC Options |
| 52 | + |
| 53 | +```bash |
| 54 | +CRAY_ACC_PRESENT_DUMP_SAVE_NAMES=1 |
| 55 | +``` |
| 56 | +- Makes `acc_present_dump()` print variable names and file locations in addition to variable mappings |
| 57 | +- Add `acc_present_dump()` calls around hotspots to help find problems with data movement |
| 58 | + - Even more helpful when combined with the `CRAY_ACC_DEBUG` environment variable |
| 59 | + |
| 60 | +## NVHPC Compiler Options |
| 61 | + |
| 62 | +### NVHPC General Options |
| 63 | + |
| 64 | +```bash |
| 65 | +STATIC_RANDOM_SEED=1 |
| 66 | +``` |
| 67 | +- Forces the seed returned by `RANDOM_SEED` to be constant, so it generates the same sequence of random numbers |
| 68 | +- Useful for testing issues with randomized data |
| 69 | + |
| 70 | +```bash |
| 71 | +NVCOMPILER_TERM=option[,option] |
| 72 | +``` |
| 73 | +- `[no]debug`: Enables/disables just-in-time debugging (debugging invoked on error) |
| 74 | +- `[no]trace`: Enables/disables stack traceback on error |
| 75 | + |
| 76 | +### NVHPC OpenACC Options |
| 77 | + |
| 78 | +```bash |
| 79 | +NVCOMPILER_ACC_NOTIFY=<bitmask> |
| 80 | +``` |
| 81 | +- Set the environment variable to a bitmask to print information to stderr for the following events |
| 82 | + - kernel launches: 1 |
| 83 | + - data transfers: 2 |
| 84 | + - region entry/exit: 4 |
| 85 | + - wait operations or synchronizations with the device: 8 |
| 86 | + - device memory allocations and deallocations: 16 |
| 87 | +- 1 (kernels only) is the usual first step. 3 (kernels + copies) is great for "why is it so slow?" |
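
Because the value is a bitmask, event classes combine with a bitwise OR. The sketch below builds the "kernels + copies" mask from the values in the list above; the shell variable names (`KERNEL_LAUNCHES` and so on) are illustrative labels introduced here, not part of NVHPC.

```bash
# Bit values copied from the list above; the names are illustrative only.
KERNEL_LAUNCHES=1
DATA_TRANSFERS=2
REGION_ENTRY_EXIT=4
WAITS=8
DEVICE_ALLOCS=16

# Kernels + copies: the "why is it so slow?" combination.
export NVCOMPILER_ACC_NOTIFY=$((KERNEL_LAUNCHES | DATA_TRANSFERS))

# Every event class at once, for comparison.
ALL_EVENTS=$((KERNEL_LAUNCHES | DATA_TRANSFERS | REGION_ENTRY_EXIT | WAITS | DEVICE_ALLOCS))

echo "NVCOMPILER_ACC_NOTIFY=$NVCOMPILER_ACC_NOTIFY (all events would be $ALL_EVENTS)"
```

Computing the mask from named bits keeps a job script readable when you switch between event classes during a debugging session.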
| 88 | + |
| 89 | +```bash |
| 90 | +NVCOMPILER_ACC_TIME=1 |
| 91 | +``` |
| 92 | +- Lightweight profiler |
| 93 | +- Prints a tidy end-of-run table with per-region and per-kernel times and bytes moved |
| 94 | +- Do not use it at the same time as a CUDA profiler |
| 95 | + |
| 96 | +```bash |
| 97 | +NVCOMPILER_ACC_DEBUG=1 |
| 98 | +``` |
| 99 | +- Spews everything the runtime sees: host/device addresses, mapping events, present-table look-ups, etc. |
| 100 | +- Great for "partially present" or "pointer went missing" errors. |
| 101 | +- [Doc for NVCOMPILER_ACC_DEBUG](https://docs.nvidia.com/hpc-sdk/archive/20.9/pdf/hpc209openacc_gs.pdf) |
| 102 | + - Ctrl+F for `NVCOMPILER_ACC_DEBUG` |
| 103 | + |
| 104 | +### NVHPC OpenMP Options |
| 105 | + |
| 106 | +```bash |
| 107 | +LIBOMPTARGET_PROFILE=run.json |
| 108 | +``` |
| 109 | +- Emits a Chrome-trace (JSON) timeline you can open in chrome://tracing or Speedscope |
| 110 | +- Great lightweight profiler when Nsight is overkill. |
| 111 | +- Set the time granularity in µs with `LIBOMPTARGET_PROFILE_GRANULARITY` (default 500). |
| 112 | + |
| 113 | +```bash |
| 114 | +LIBOMPTARGET_INFO=<bitmask> |
| 115 | +``` |
| 116 | +- Prints out different types of runtime information |
| 117 | +- Human-readable log of data-mapping inserts/updates, kernel launches, copies, waits. |
| 118 | +- Perfect first stop for "why is nothing copied?" |
| 119 | +- Flags |
| 120 | + - Print all data arguments upon entering an OpenMP device kernel: 0x01 |
| 121 | + - Indicate when a mapped address already exists in the device mapping table: 0x02 |
| 122 | + - Dump the contents of the device pointer map at kernel exit: 0x04 |
| 123 | + - Indicate when an entry is changed in the device mapping table: 0x08 |
| 124 | + - Print OpenMP kernel information from device plugins: 0x10 |
| 125 | + - Indicate when data is copied to and from the device: 0x20 |
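
The flags above are hexadecimal bits, so they also combine with a bitwise OR. A minimal sketch, assuming you want kernel arguments plus copy events; the shell variable names are illustrative labels introduced here, not part of the runtime.

```bash
# Flag values copied from the list above; the names are illustrative only.
KERNEL_ARGS=0x01
DATA_COPIES=0x20

# Bash arithmetic accepts hex literals and yields a decimal result.
export LIBOMPTARGET_INFO=$((KERNEL_ARGS | DATA_COPIES))

echo "LIBOMPTARGET_INFO=$LIBOMPTARGET_INFO"
```

Export the variable before launching the application so that the runtime reads it at start-up.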
| 126 | + |
| 127 | +```bash |
| 128 | +LIBOMPTARGET_DEBUG=1 |
| 129 | +``` |
| 130 | +- Developer-level trace (host-side) |
| 131 | +- Much noisier than `INFO` |
| 132 | +- Only works if the runtime was built with `-DOMPTARGET_DEBUG`. |
| 133 | + |
| 134 | +```bash |
| 135 | +LIBOMPTARGET_JIT_OPT_LEVEL={0,1,2,3} |
| 136 | +``` |
| 137 | +- This environment variable can be used to change the optimization pipeline used to optimize the embedded device code as part of the device JIT. |
| 138 | +- The value corresponds to the `-O{0,1,2,3}` command line argument passed to clang. |
| 139 | + |
| 140 | +```bash |
| 141 | +LIBOMPTARGET_JIT_SKIP_OPT=1 |
| 142 | +``` |
| 143 | +- This environment variable can be used to skip the optimization pipeline during JIT compilation. |
| 144 | +- If set, the image will only be passed through the backend. |
| 145 | +- The backend is invoked with the `LIBOMPTARGET_JIT_OPT_LEVEL` flag. |
| 146 | + |
| 147 | +## Compiler Documentation |
| 148 | + |
| 149 | +- [Cray & OpenMP Docs](https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openmp.7.html#environment-variables) |
| 150 | +- [Cray & OpenACC Docs](https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openacc.7.html#environment-variables) |
| 151 | +- [NVHPC & OpenACC Docs](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html?highlight=NVCOMPILER_#environment-variables) |
| 152 | +- [NVHPC & OpenMP Docs](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html?highlight=NVCOMPILER_#id2) |
| 153 | +- [LLVM & OpenMP Docs](https://openmp.llvm.org/design/Runtimes.html) |
| 154 | + - NVHPC is built on top of LLVM |
| 155 | +- [OpenMP Docs](https://www.openmp.org/spec-html/5.1/openmp.html) |
| 156 | +- [OpenACC Docs](https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.7.pdf) |