Commit 35fba19
[UR][CUDA] Avoid unnecessary calls to cuFuncSetAttribute (#16928)
Calling `cuFuncSetAttribute` to set
`CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` is required to launch
kernels using more than 48 kB of local memory[1] (CUDA dynamic shared
memory). Without this, `cuLaunchKernel` fails with
`CUDA_ERROR_INVALID_VALUE`. However, calling `cuFuncSetAttribute`
introduces synchronisation in the CUDA runtime, which blocks until all
H2D/D2H memory copies have finished (the reason is unknown) and so
effectively prevents kernel launches from overlapping with memory
copies. This causes significant performance degradation in some
workflows, specifically in applications launching overlapping memory
copies and kernels from multiple host threads into multiple CUDA
streams on the same GPU.
Avoid the CUDA runtime synchronisation that causes the poor performance
by making the `cuFuncSetAttribute` call only when it is strictly
necessary: when the user requests a specific carveout (using env
variables) or when the kernel launch would fail without it (local memory
size > 48 kB). Good performance is recovered for default settings with
kernels using little or no local memory.
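The condition described above can be sketched as a small predicate. This is
only an illustration, not the adapter's actual code: the names
`kDefaultMaxDynamicSharedBytes` and `needsSharedSizeAttribute` are
hypothetical, and the way the carveout request is read from the environment
is left out and passed in as a plain flag.

```cpp
#include <cstddef>

// Hypothetical names, for illustration only; not the adapter's identifiers.
// 48 kB is the dynamic shared memory size a kernel may use without opting in.
constexpr std::size_t kDefaultMaxDynamicSharedBytes = 48 * 1024;

// Returns true when the cuFuncSetAttribute call cannot be skipped:
// either the user asked for a specific carveout, or the launch would fail
// because the requested local (dynamic shared) memory exceeds 48 kB.
constexpr bool needsSharedSizeAttribute(std::size_t localSizeBytes,
                                        bool carveoutRequested) {
  return carveoutRequested || localSizeBytes > kDefaultMaxDynamicSharedBytes;
}

// Kernels with no local memory and no carveout request skip the call,
// which is the case where copy/kernel overlap is recovered.
static_assert(!needsSharedSizeAttribute(0, false), "default path skips the call");
```

With a check like this, the launch path would call `cuFuncSetAttribute`
only when the predicate is true, so kernels with default settings never
reach the synchronising call.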
No performance effects were observed for kernel execution time after
removing the attribute across a wide range of tested kernels using
various amounts of local memory.
[1] Related to the 48 kB static shared memory limit, see the footnote
for "Maximum amount of shared memory per thread block" in
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability
1 parent 671468a commit 35fba19
1 file changed: +3 -1 lines (original line 293 replaced by new lines 293-295; diff content not captured)