I've left my WIP PR up for this v2 work, which includes some plots re kernel launch overhead, e.g.: https://github.com/NVIDIA/numba-cuda/pull/727/changes#diff-0a7ee1c443b1d66342149e5f9b416d2817f22c97d8c6664da3507c15a0e5454b |
Pull request overview
This PR adds launch-configuration infrastructure to support compile-time access to launch parameters (and pre-launch hooks), and introduces launch-config-sensitive (LCS) compilation/caching so kernels whose codegen depends on launch config are specialized and cached correctly.
Changes:
- Add a `numba.cuda.launchconfig` API to access the current launch config during compilation and to capture it from Python.
- Add dispatcher/compiler plumbing to mark kernels as launch-config sensitive and to specialize runtime and disk-cache behavior per launch config (including `.lcs` marker handling).
- Add tests and docs covering compile-time visibility, LCS recompilation, and LCS vs non-LCS disk-cache keying.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| numba_cuda/numba/cuda/launchconfig.py | New API for reading/ensuring current launch config and capturing it during compilation. |
| numba_cuda/numba/cuda/dispatcher.py | Adds launch-config metadata on configured launches, pre-launch callbacks, LCS specialization routing, and LCS disk-cache key/marker logic. |
| numba_cuda/numba/cuda/compiler.py | Promotes LCS marking into compile metadata for downstream dispatcher/cache behavior. |
| numba_cuda/numba/cuda/cext/_dispatcher.cpp | Implements low-overhead TLS storage of active launch config during compilation via C-extension plumbing. |
| numba_cuda/numba/cuda/tests/cudapy/test_dispatcher.py | Tests that launch config is visible during compilation and that capture works. |
| numba_cuda/numba/cuda/tests/cudapy/test_launch_config_sensitive.py | New test ensuring LCS marking forces recompilation across differing launch configs. |
| numba_cuda/numba/cuda/tests/cudapy/test_caching.py | Extends caching tests to validate LCS vs non-LCS cache key behavior across processes. |
| numba_cuda/numba/cuda/tests/cudapy/cache_launch_config_sensitive_usecases.py | New cache usecase module that marks a cached kernel as LCS via rewrite. |
| numba_cuda/numba/cuda/tests/cudapy/cache_launch_config_insensitive_usecases.py | New cache usecase module for a cached kernel that remains launch-config-insensitive. |
| docs/source/reference/kernel.rst | Documents advanced launch-config access, pre-launch callbacks, and LCS compilation semantics. |
```python
self.pre_launch_callbacks = []
self.args = None
self._kernel_launch_config_sensitive = None
```
CUDADispatcher.configure() is lru_cached, so the same _LaunchConfiguration instance can be shared across threads. Adding the mutable per-call args field (set in CUDADispatcher.call) makes launches/compilations racy: concurrent launches can overwrite launch_config.args while another thread’s rewrite/callback reads it, leading to wrong argument handling. Consider keeping launch config objects immutable and storing per-call args in thread-local storage (or passing args into callbacks explicitly), or returning a fresh launch-config object per call when args need to be attached.
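The reviewer's suggested fix can be sketched in isolation. This is a hypothetical, minimal illustration (not numba-cuda's actual classes) of keeping the shared launch-config object immutable and routing per-call `args` through thread-local storage, so concurrent launches cannot clobber each other:

```python
import threading

# Hypothetical sketch: the shared (lru_cached) launch-config object carries no
# mutable per-call state; per-call args live in thread-local storage instead.
_tls = threading.local()


class LaunchConfiguration:
    """Immutable launch parameters; per-call args are read from TLS."""

    def __init__(self, griddim, blockdim, sharedmem=0):
        self.griddim = griddim
        self.blockdim = blockdim
        self.sharedmem = sharedmem

    @property
    def args(self):
        # Each thread sees only the args it attached for its current call.
        return getattr(_tls, "args", None)


def attach_args(args):
    """Attach per-call args for the current thread only."""
    _tls.args = args


def detach_args():
    _tls.args = None
```

With this shape, a rewrite or pre-launch callback reading `cfg.args` in one thread cannot observe another thread's arguments, even though `cfg` itself is shared.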
I'm going to defer to numba-cuda developers here on whether this is actually problematic; I vaguely recall numba-cuda not being thread-safe on this particular code path in the first place.
```python
launch_config.args = args
try:
    dispatcher = self._select_launch_config_dispatcher(launch_config)
    if dispatcher is not self:
        return dispatcher.call(args, launch_config)
```
Setting launch_config.args = args before selecting a launch-config specialization / compiling is not thread-safe when the configured launch object is shared (it is cached by configure()), and can be overwritten by another thread while compilation is in progress (global compiler lock is acquired deeper in the call chain). This can make cfg.args observed in rewrites/callbacks incorrect. Prefer attaching args via thread-local storage for the duration of compilation/launch, or passing args to callbacks without mutating shared state.
```python
if self._cache.is_launch_config_sensitive():
    if launch_config is None:
        key = None
    else:
        key = self._launch_config_key(launch_config)
    self._cache.set_launch_config_key(key)
    return
```
When the on-disk cache is marked launch-config-sensitive (.lcs exists) but there is no active launch config (e.g. the user calls dispatcher.compile(sig) directly), _configure_cache_for_launch_config() sets the cache key to None. That can allow loading legacy/unsafe cache entries created before .lcs keying (or entries compiled without a launch config), undermining the safety guarantee. Consider forcing a cache miss / disabling disk-cache loads when .lcs is present but current_launch_config() is None (or using a dedicated sentinel key that cannot match pre-LCS entries).
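The "dedicated sentinel key" idea can be illustrated standalone. This is a hypothetical sketch (the `LCSIndex` class and attribute names are assumptions, not numba-cuda internals): a unique sentinel object stands in for "no active launch config" and loads under it are forced to miss, so legacy pre-LCS entries can never be matched:

```python
# Unique sentinel: never equal to any real launch-config key, and never equal
# to the key of a legacy entry written before .lcs keying existed.
_NO_LAUNCH_CONFIG = object()


def launch_config_cache_key(launch_config):
    if launch_config is None:
        return _NO_LAUNCH_CONFIG
    # Assumed attributes on the launch-config object.
    return (
        tuple(launch_config.griddim),
        tuple(launch_config.blockdim),
        launch_config.sharedmem,
    )


class LCSIndex:
    """Minimal in-memory index keyed by (signature, launch-config key)."""

    def __init__(self):
        self._entries = {}

    def save(self, sig, launch_config, artifact):
        self._entries[(sig, launch_config_cache_key(launch_config))] = artifact

    def load(self, sig, launch_config):
        key = (sig, launch_config_cache_key(launch_config))
        if key[1] is _NO_LAUNCH_CONFIG:
            # Force a miss rather than risk loading an unsafe entry.
            return None
        return self._entries.get(key)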
```python
@contextlib.contextmanager
def capture_compile_config(dispatcher):
    """Capture the launch config seen during compilation for a dispatcher.

    The returned dict has a single key, ``"config"``, which is populated when a
    compilation is triggered by a kernel launch. If the kernel is already
    compiled, the dict value may remain ``None``.
    """
    if dispatcher is None:
        raise TypeError("dispatcher is required")

    record = {"config": None}
    original = dispatcher._compile_for_args

    @wraps(original)
    def wrapped(*args, **kws):
        record["config"] = current_launch_config()
        return original(*args, **kws)

    dispatcher._compile_for_args = wrapped
    try:
        yield record
    finally:
        dispatcher._compile_for_args = original
```
capture_compile_config() temporarily monkeypatches dispatcher._compile_for_args on the dispatcher object. Since dispatchers are typically shared, this is not thread-safe: a concurrent compilation/launch in another thread could observe the wrapped method (or have its own wrapper clobbered), leading to incorrect behavior or leaked wrappers if interleaved. Consider implementing capture via a dispatcher-level hook mechanism (e.g. a stack of callbacks), or at minimum scoping the capture using thread-local state so only the current thread records the config.
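The "dispatcher-level hook mechanism" alternative can be sketched like this. This is a hypothetical illustration, not numba-cuda's actual dispatcher: hooks live on a thread-local stack, so one thread's capture never affects compilations happening concurrently in other threads, and nested captures unwind cleanly:

```python
import contextlib
import threading


class Dispatcher:
    """Toy dispatcher with a thread-local stack of compile hooks."""

    def __init__(self):
        self._hooks = threading.local()

    def _hook_stack(self):
        if not hasattr(self._hooks, "stack"):
            self._hooks.stack = []
        return self._hooks.stack

    def compile_for_args(self, *args):
        # Only hooks pushed by the *current* thread observe this compilation.
        for hook in self._hook_stack():
            hook(args)
        return f"compiled{args}"


@contextlib.contextmanager
def capture_compile_config(dispatcher):
    record = {"config": None}

    def hook(args):
        # Stand-in for reading current_launch_config() at compile time.
        record["config"] = args

    stack = dispatcher._hook_stack()
    stack.append(hook)
    try:
        yield record
    finally:
        stack.remove(hook)
```

Because the stack is thread-local, no monkeypatched method is ever visible to other threads, and interleaved captures cannot clobber each other.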
Summary
This PR introduces two related pieces of launch-config infrastructure needed by the `cuda.coop` single-phase work:

1. Compile-time access to the active launch configuration from numba-cuda compilation stages (including rewrites), plus pre-launch callback registration on configured launches.
2. Launch-config-sensitive (LCS) compilation and caching, so kernels whose codegen depends on launch configuration are specialized and cached correctly across launch configs.
Background and motivation
`cuda.coop` single-phase rewriting needs compile-time access to launch configuration details (grid/block/shared memory/launch args) and a way to register pre-launch hooks from rewrite time (for launch-time kernel argument handling without requiring `@cuda.jit(extensions=...)`).

An earlier implementation (PR #288) provided this via Python `contextvars`, but review feedback showed launch overhead was too high. This branch reimplements the mechanism through C-extension TLS plumbing in `_dispatcher.cpp`, with negligible overhead in the launch micro-benchmark.
From `bench-launch-overhead.out` (µs/launch, baseline vs contextvar vs v2):

| Baseline | contextvar | v2 |
|---|---|---|
| 5.56 | 7.29 (+31.1%) | 5.56 (+0.0%) |
| 7.53 | 9.18 (+21.8%) | 7.55 (+0.2%) |
| 8.90 | 10.64 (+19.5%) | 8.97 (+0.8%) |
| 10.31 | 12.50 (+21.3%) | 10.37 (+0.5%) |
| 11.82 | 13.56 (+14.7%) | 11.92 (+0.8%) |

What this PR adds
1) Launch-config API with low launch overhead

- The C extension (`numba_cuda/numba/cuda/cext/_dispatcher.cpp`) now carries the active launch config in thread-local storage only during compilation paths.
- New module `numba_cuda/numba/cuda/launchconfig.py` provides `current_launch_config()`, `ensure_current_launch_config()`, and `capture_compile_config()`.
- Launch config objects expose the launch parameters (`griddim`, `blockdim`, `sharedmem`, `args`, `dispatcher`) and `pre_launch_callbacks` for just-in-time launch-time hook registration.

2) Launch-config-sensitive compilation/caching
- `_LaunchConfiguration` (`numba_cuda/numba/cuda/dispatcher.py`) gains `mark_kernel_as_launch_config_sensitive()`, `get_kernel_launch_config_sensitive()`, and `is_kernel_launch_config_sensitive()`.
- The compiler (`numba_cuda/numba/cuda/compiler.py`) promotes that mark into compile metadata (`state.metadata["launch_config_sensitive"] = True`).
- An `.lcs` marker file indicates launch-config-sensitive cache entries.

Why the LCS piece is required
Without LCS, cache keys are signature-based only, so a kernel compiled once for
launch config A can be reused for launch config B without rerunning rewrite.
That breaks launch-config-dependent rewrite behavior.
Concrete observed behavior:

- First launch with `[1, 32]`: rewrite runs, callback registered.
- Relaunch with `[1, 64]` without LCS: existing kernel reused, rewrite does not run, callback for the 64-config path is never registered.
- With LCS: the differing launch config selects a new specialization, so rewrite/callback registration runs for 64.
- Disk cache: first process compiles and caches with `[1, 32]`.
- Second process launches with `[1, 64]` without LCS: the 32-config artifact can be reused from disk (no rewrite for the 64 path).
- With LCS: the launch-config cache key forces compilation of a 64-specific variant.
- LCS does not force recompilation when the launch-config key already matches.
So the LCS plumbing is what makes launch-config-dependent rewrite decisions
correct under both in-memory and disk cache reuse.
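The behavior above can be condensed into a tiny mock. This is an illustrative sketch only (a stand-in, not numba-cuda's dispatcher or cache): keying by signature alone silently reuses the 32-thread artifact for a 64-thread launch, while including the launch config in the key forces a re-specialization:

```python
class MockDispatcher:
    """Toy stand-in showing signature-only vs launch-config-aware cache keys."""

    def __init__(self, launch_config_sensitive):
        self.launch_config_sensitive = launch_config_sensitive
        self.cache = {}
        self.compile_count = 0

    def _compile(self, sig, blockdim):
        self.compile_count += 1
        # Models a rewrite whose output depends on the launch config.
        return f"kernel[{sig}, blockdim={blockdim}]"

    def launch(self, sig, blockdim):
        key = (sig, blockdim) if self.launch_config_sensitive else (sig,)
        if key not in self.cache:
            self.cache[key] = self._compile(sig, blockdim)
        return self.cache[key]
```

Without LCS, `launch("f(i4)", 64)` after a 32-thread launch returns the cached `blockdim=32` artifact; with LCS it compiles a second, 64-specific variant.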
Scope note for `cuda.coop` today:

- `cuda.coop` frequently injects LTO-IR/linking files during compilation; for those paths the immediate LCS correctness benefit is runtime/in-memory cache behavior across launch configs.
- LCS disk-cache keying still matters for kernels that are otherwise disk-cacheable (and remains relevant for future linked-code cache support).
Safety behavior

- If the `.lcs` marker is missing for a cache that should be launch-config sensitive, we treat that cache state as unsafe, force recompile, and re-mark.
- When the cache cannot be used safely, disk caching is disabled (fallback to `NullCache`) to avoid unsafe reuse.

Out of scope
- Cache invalidation on `numba_cuda.__version__` changes (handled by PR #800, "Fix cache invalidation logic").

Note that PR #800 should be merged, and presumably a release cut, before this PR is merged -- that allows downstream projects like `cuda.coop` to pin accordingly.