I've left my WIP PR up for this v2 work, which includes some plots re kernel launch overhead, e.g.: https://github.com/NVIDIA/numba-cuda/pull/727/changes#diff-0a7ee1c443b1d66342149e5f9b416d2817f22c97d8c6664da3507c15a0e5454b |
Pull request overview
This PR adds launch-configuration infrastructure to support compile-time access to launch parameters (and pre-launch hooks), and introduces launch-config-sensitive (LCS) compilation/caching so kernels whose codegen depends on launch config are specialized and cached correctly.
Changes:
- Add a `numba.cuda.launchconfig` API to access the current launch config during compilation and to capture it from Python.
- Add dispatcher/compiler plumbing to mark kernels as launch-config sensitive and to specialize runtime and disk-cache behavior per launch config (including `.lcs` marker handling).
- Add tests and docs covering compile-time visibility, LCS recompilation, and LCS vs non-LCS disk-cache keying.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| numba_cuda/numba/cuda/launchconfig.py | New API for reading/ensuring current launch config and capturing it during compilation. |
| numba_cuda/numba/cuda/dispatcher.py | Adds launch-config metadata on configured launches, pre-launch callbacks, LCS specialization routing, and LCS disk-cache key/marker logic. |
| numba_cuda/numba/cuda/compiler.py | Promotes LCS marking into compile metadata for downstream dispatcher/cache behavior. |
| numba_cuda/numba/cuda/cext/_dispatcher.cpp | Implements low-overhead TLS storage of active launch config during compilation via C-extension plumbing. |
| numba_cuda/numba/cuda/tests/cudapy/test_dispatcher.py | Tests that launch config is visible during compilation and that capture works. |
| numba_cuda/numba/cuda/tests/cudapy/test_launch_config_sensitive.py | New test ensuring LCS marking forces recompilation across differing launch configs. |
| numba_cuda/numba/cuda/tests/cudapy/test_caching.py | Extends caching tests to validate LCS vs non-LCS cache key behavior across processes. |
| numba_cuda/numba/cuda/tests/cudapy/cache_launch_config_sensitive_usecases.py | New cache usecase module that marks a cached kernel as LCS via rewrite. |
| numba_cuda/numba/cuda/tests/cudapy/cache_launch_config_insensitive_usecases.py | New cache usecase module for a cached kernel that remains launch-config-insensitive. |
| docs/source/reference/kernel.rst | Documents advanced launch-config access, pre-launch callbacks, and LCS compilation semantics. |
```python
self.pre_launch_callbacks = []
self.args = None
self._kernel_launch_config_sensitive = None
```
CUDADispatcher.configure() is lru_cached, so the same _LaunchConfiguration instance can be shared across threads. Adding the mutable per-call args field (set in CUDADispatcher.call) makes launches/compilations racy: concurrent launches can overwrite launch_config.args while another thread’s rewrite/callback reads it, leading to wrong argument handling. Consider keeping launch config objects immutable and storing per-call args in thread-local storage (or passing args into callbacks explicitly), or returning a fresh launch-config object per call when args need to be attached.
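The reviewer's suggested fix can be sketched in isolation. This is a hypothetical, minimal illustration (not numba-cuda's actual classes) of keeping the shared launch-config object immutable and routing per-call `args` through thread-local storage, so concurrent launches cannot clobber each other:

```python
import threading

# Hypothetical sketch: the shared (lru_cached) launch-config object carries no
# mutable per-call state; per-call args live in thread-local storage instead.
_tls = threading.local()


class LaunchConfiguration:
    """Immutable launch parameters; per-call args are read from TLS."""

    def __init__(self, griddim, blockdim, sharedmem=0):
        self.griddim = griddim
        self.blockdim = blockdim
        self.sharedmem = sharedmem

    @property
    def args(self):
        # Each thread sees only the args it attached for its current call.
        return getattr(_tls, "args", None)


def attach_args(args):
    """Attach per-call args for the current thread only."""
    _tls.args = args


def detach_args():
    _tls.args = None
```

With this shape, a rewrite or pre-launch callback reading `cfg.args` in one thread cannot observe another thread's arguments, even though `cfg` itself is shared.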
I'm going to defer to numba-cuda developers here on whether this is actually problematic; I vaguely recall numba-cuda not being thread-safe on this particular code path in the first place.
```python
launch_config.args = args
try:
    dispatcher = self._select_launch_config_dispatcher(launch_config)
    if dispatcher is not self:
        return dispatcher.call(args, launch_config)
```
Setting launch_config.args = args before selecting a launch-config specialization / compiling is not thread-safe when the configured launch object is shared (it is cached by configure()), and can be overwritten by another thread while compilation is in progress (global compiler lock is acquired deeper in the call chain). This can make cfg.args observed in rewrites/callbacks incorrect. Prefer attaching args via thread-local storage for the duration of compilation/launch, or passing args to callbacks without mutating shared state.
```python
if self._cache.is_launch_config_sensitive():
    if launch_config is None:
        key = None
    else:
        key = self._launch_config_key(launch_config)
    self._cache.set_launch_config_key(key)
    return
```
When the on-disk cache is marked launch-config-sensitive (.lcs exists) but there is no active launch config (e.g. the user calls dispatcher.compile(sig) directly), _configure_cache_for_launch_config() sets the cache key to None. That can allow loading legacy/unsafe cache entries created before .lcs keying (or entries compiled without a launch config), undermining the safety guarantee. Consider forcing a cache miss / disabling disk-cache loads when .lcs is present but current_launch_config() is None (or using a dedicated sentinel key that cannot match pre-LCS entries).
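The "dedicated sentinel key" idea can be illustrated standalone. This is a hypothetical sketch (the `LCSIndex` class and attribute names are assumptions, not numba-cuda internals): a unique sentinel object stands in for "no active launch config" and loads under it are forced to miss, so legacy pre-LCS entries can never be matched:

```python
# Unique sentinel: never equal to any real launch-config key, and never equal
# to the key of a legacy entry written before .lcs keying existed.
_NO_LAUNCH_CONFIG = object()


def launch_config_cache_key(launch_config):
    if launch_config is None:
        return _NO_LAUNCH_CONFIG
    # Assumed attributes on the launch-config object.
    return (
        tuple(launch_config.griddim),
        tuple(launch_config.blockdim),
        launch_config.sharedmem,
    )


class LCSIndex:
    """Minimal in-memory index keyed by (signature, launch-config key)."""

    def __init__(self):
        self._entries = {}

    def save(self, sig, launch_config, artifact):
        self._entries[(sig, launch_config_cache_key(launch_config))] = artifact

    def load(self, sig, launch_config):
        key = (sig, launch_config_cache_key(launch_config))
        if key[1] is _NO_LAUNCH_CONFIG:
            # Force a miss rather than risk loading an unsafe entry.
            return None
        return self._entries.get(key)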
```python
@contextlib.contextmanager
def capture_compile_config(dispatcher):
    """Capture the launch config seen during compilation for a dispatcher.

    The returned dict has a single key, ``"config"``, which is populated when a
    compilation is triggered by a kernel launch. If the kernel is already
    compiled, the dict value may remain ``None``.
    """
    if dispatcher is None:
        raise TypeError("dispatcher is required")

    record = {"config": None}
    original = dispatcher._compile_for_args

    @wraps(original)
    def wrapped(*args, **kws):
        record["config"] = current_launch_config()
        return original(*args, **kws)

    dispatcher._compile_for_args = wrapped
    try:
        yield record
    finally:
        dispatcher._compile_for_args = original
```
capture_compile_config() temporarily monkeypatches dispatcher._compile_for_args on the dispatcher object. Since dispatchers are typically shared, this is not thread-safe: a concurrent compilation/launch in another thread could observe the wrapped method (or have its own wrapper clobbered), leading to incorrect behavior or leaked wrappers if interleaved. Consider implementing capture via a dispatcher-level hook mechanism (e.g. a stack of callbacks), or at minimum scoping the capture using thread-local state so only the current thread records the config.
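The "dispatcher-level hook mechanism" alternative can be sketched like this. This is a hypothetical illustration, not numba-cuda's actual dispatcher: hooks live on a thread-local stack, so one thread's capture never affects compilations happening concurrently in other threads, and nested captures unwind cleanly:

```python
import contextlib
import threading


class Dispatcher:
    """Toy dispatcher with a thread-local stack of compile hooks."""

    def __init__(self):
        self._hooks = threading.local()

    def _hook_stack(self):
        if not hasattr(self._hooks, "stack"):
            self._hooks.stack = []
        return self._hooks.stack

    def compile_for_args(self, *args):
        # Only hooks pushed by the *current* thread observe this compilation.
        for hook in self._hook_stack():
            hook(args)
        return f"compiled{args}"


@contextlib.contextmanager
def capture_compile_config(dispatcher):
    record = {"config": None}

    def hook(args):
        # Stand-in for reading current_launch_config() at compile time.
        record["config"] = args

    stack = dispatcher._hook_stack()
    stack.append(hook)
    try:
        yield record
    finally:
        stack.remove(hook)
```

Because the stack is thread-local, no monkeypatched method is ever visible to other threads, and interleaved captures cannot clobber each other.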
Summary
This PR introduces two related pieces of launch-config infrastructure needed by the `cuda.coop` single-phase work:

1. Compile-time access to the active launch configuration from numba-cuda compilation stages (including rewrites), plus pre-launch callback registration on configured launches.
2. Launch-config-sensitive (LCS) compilation and caching, so kernels whose codegen depends on launch configuration are specialized and cached correctly across launch configs.
Background and motivation
`cuda.coop` single-phase rewriting needs compile-time access to launch configuration details (grid/block/shared memory/launch args) and a way to register pre-launch hooks from rewrite time (for launch-time kernel argument handling without requiring `@cuda.jit(extensions=...)`).

An earlier implementation (PR #288) provided this via Python `contextvars`, but review feedback showed launch overhead was too high. This branch reimplements the mechanism through C-extension TLS plumbing in `_dispatcher.cpp`, with negligible overhead in the launch micro-benchmark.
From `bench-launch-overhead.out` (µs/launch, baseline vs contextvar vs v2):

| Baseline | contextvar | v2 |
|---|---|---|
| 5.56 | 7.29 (+31.1%) | 5.56 (+0.0%) |
| 7.53 | 9.18 (+21.8%) | 7.55 (+0.2%) |
| 8.90 | 10.64 (+19.5%) | 8.97 (+0.8%) |
| 10.31 | 12.50 (+21.3%) | 10.37 (+0.5%) |
| 11.82 | 13.56 (+14.7%) | 11.92 (+0.8%) |

What this PR adds
1) Launch-config API with low launch overhead

- The C extension (`numba_cuda/numba/cuda/cext/_dispatcher.cpp`) now carries the active launch config in thread-local storage only during compilation paths.
- New module `numba_cuda/numba/cuda/launchconfig.py` provides `current_launch_config()`, `ensure_current_launch_config()`, and `capture_compile_config()`.
- Launch config objects expose the launch parameters (`griddim`, `blockdim`, `sharedmem`, `args`, `dispatcher`) and `pre_launch_callbacks` for just-in-time launch-time hook registration.

2) Launch-config-sensitive compilation/caching
- `_LaunchConfiguration` (`numba_cuda/numba/cuda/dispatcher.py`) gains `mark_kernel_as_launch_config_sensitive()`, `get_kernel_launch_config_sensitive()`, and `is_kernel_launch_config_sensitive()`.
- The compiler (`numba_cuda/numba/cuda/compiler.py`) promotes that mark into compile metadata (`state.metadata["launch_config_sensitive"] = True`).
- An `.lcs` marker file indicates launch-config-sensitive cache entries.

Why the LCS piece is required
Without LCS, cache keys are signature-based only, so a kernel compiled once for
launch config A can be reused for launch config B without rerunning rewrite.
That breaks launch-config-dependent rewrite behavior.
Concrete observed behavior:

- First launch with `[1, 32]`: rewrite runs, callback registered.
- Relaunch with `[1, 64]` without LCS: existing kernel reused, rewrite does not run, callback for the 64-config path is never registered.
- With LCS: the differing launch config selects a new specialization, so rewrite/callback registration runs for 64.
- Disk cache: first process compiles and caches with `[1, 32]`.
- Second process launches with `[1, 64]` without LCS: the 32-config artifact can be reused from disk (no rewrite for the 64 path).
- With LCS: the launch-config cache key forces compilation of a 64-specific variant.
- LCS does not force recompilation when the launch-config key already matches.
So the LCS plumbing is what makes launch-config-dependent rewrite decisions
correct under both in-memory and disk cache reuse.
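The behavior above can be condensed into a tiny mock. This is an illustrative sketch only (a stand-in, not numba-cuda's dispatcher or cache): keying by signature alone silently reuses the 32-thread artifact for a 64-thread launch, while including the launch config in the key forces a re-specialization:

```python
class MockDispatcher:
    """Toy stand-in showing signature-only vs launch-config-aware cache keys."""

    def __init__(self, launch_config_sensitive):
        self.launch_config_sensitive = launch_config_sensitive
        self.cache = {}
        self.compile_count = 0

    def _compile(self, sig, blockdim):
        self.compile_count += 1
        # Models a rewrite whose output depends on the launch config.
        return f"kernel[{sig}, blockdim={blockdim}]"

    def launch(self, sig, blockdim):
        key = (sig, blockdim) if self.launch_config_sensitive else (sig,)
        if key not in self.cache:
            self.cache[key] = self._compile(sig, blockdim)
        return self.cache[key]
```

Without LCS, `launch("f(i4)", 64)` after a 32-thread launch returns the cached `blockdim=32` artifact; with LCS it compiles a second, 64-specific variant.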
Scope note for `cuda.coop` today:

- `cuda.coop` frequently injects LTO-IR/linking files during compilation; for those paths the immediate LCS correctness benefit is runtime/in-memory cache behavior across launch configs.
- LCS disk-cache keying still matters for kernels that are otherwise disk-cacheable (and remains relevant for future linked-code cache support).
Safety behavior

- If the `.lcs` marker is missing for a cache that should be launch-config sensitive, we treat that cache state as unsafe, force recompile, and re-mark.
- When the cache cannot be used safely, disk caching is disabled (fallback to `NullCache`) to avoid unsafe reuse.

Out of scope
- Cache invalidation on `numba_cuda.__version__` changes (handled by PR #800, "Fix cache invalidation logic").

Note that PR #800 should be merged, and presumably a release cut, before this PR is merged -- that allows downstream projects like `cuda.coop` to pin accordingly.