
Implement launch config infrastructure. #804

Open
tpn wants to merge 1 commit into NVIDIA:main from tpn:280-launch-config-v2-pr

Conversation

@tpn
Contributor

@tpn tpn commented Feb 20, 2026

Summary

This PR introduces two related pieces of launch-config infrastructure needed by
cuda.coop single-phase work:

  1. A low-overhead launch-config API that can be consumed from arbitrary
    numba-cuda compilation stages (including rewrites), plus pre-launch callback
    registration on configured launches.
  2. Launch-config-sensitive (LCS) plumbing so kernels that depend on launch
    configuration are specialized and cached correctly across launch configs.

Background and motivation

cuda.coop single-phase rewriting needs compile-time access to launch
configuration details (grid/block/shared memory/launch args) and a way to
register pre-launch hooks from rewrite time (for launch-time kernel argument
handling without requiring @cuda.jit(extensions=...)).

An earlier implementation (PR #288) provided this via Python contextvars, but
review feedback showed launch overhead was too high. This branch reimplements
the mechanism through C-extension TLS plumbing in _dispatcher.cpp, with
negligible overhead in the launch micro-benchmark.

From bench-launch-overhead.out (µs/launch, baseline vs contextvar vs v2):

  • 0 args: 5.56 vs 7.29 (+31.1%) vs 5.56 (+0.0%)
  • 1 arg: 7.53 vs 9.18 (+21.8%) vs 7.55 (+0.2%)
  • 2 args: 8.90 vs 10.64 (+19.5%) vs 8.97 (+0.8%)
  • 3 args: 10.31 vs 12.50 (+21.3%) vs 10.37 (+0.5%)
  • 4 args: 11.82 vs 13.56 (+14.7%) vs 11.92 (+0.8%)

What this PR adds

1) Launch-config API with low launch overhead

  • C-extension (numba_cuda/numba/cuda/cext/_dispatcher.cpp) now carries the
    active launch config in thread-local storage only during compilation paths.
  • Python API in numba_cuda/numba/cuda/launchconfig.py:
    • current_launch_config()
    • ensure_current_launch_config()
    • capture_compile_config()
  • Configured launches expose:
    • launch metadata (griddim, blockdim, sharedmem, args, dispatcher)
    • pre_launch_callbacks for just-in-time launch-time hook registration.
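A GPU-free sketch of the configured-launch shape described above (field and method names mirror the PR's description, but this class itself is illustrative, not the PR's code):

```python
# Illustrative model of a configured launch: launch metadata plus a list of
# pre-launch callbacks that run just before the kernel is launched.
class ConfiguredLaunch:
    def __init__(self, dispatcher, griddim, blockdim, sharedmem=0):
        self.dispatcher = dispatcher
        self.griddim = griddim
        self.blockdim = blockdim
        self.sharedmem = sharedmem
        self.args = None
        self.pre_launch_callbacks = []

    def call(self, args):
        # Attach the per-call args, then run registered hooks before launch.
        self.args = args
        for callback in self.pre_launch_callbacks:
            callback(self)
        # ... the real dispatcher would launch the kernel here ...
```

A rewrite stage can append to `pre_launch_callbacks` at compile time; each subsequent launch through that configuration then runs the hook with full launch metadata, without requiring `@cuda.jit(extensions=...)`.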

2) Launch-config-sensitive compilation/caching

  • Explicit LCS marker API on _LaunchConfiguration
    (numba_cuda/numba/cuda/dispatcher.py):
    • mark_kernel_as_launch_config_sensitive()
    • get_kernel_launch_config_sensitive()
    • is_kernel_launch_config_sensitive()
  • CUDA backend (numba_cuda/numba/cuda/compiler.py) promotes that mark into
    compile metadata (state.metadata["launch_config_sensitive"] = True).
  • Dispatcher/cache behavior for LCS kernels:
    • per-launch-config dispatcher specialization routing
    • per-launch-config disk-cache keys
    • .lcs marker file indicating launch-config-sensitive cache entries.

Why the LCS piece is required

Without LCS, cache keys are signature-based only, so a kernel compiled once for
launch config A can be reused for launch config B without rerunning rewrite.
That breaks launch-config-dependent rewrite behavior.

Concrete observed behavior:

  • Runtime cache (single process):
    • Launch [1, 32]: rewrite runs, callback registered.
    • Launch [1, 64] without LCS: existing kernel reused, rewrite does not run,
      callback for the 64-config path is never registered.
    • With LCS marking: second launch recompiles under a distinct launch-config
      specialization, so rewrite/callback registration runs for 64.
  • Disk cache (cross process):
    • Process 1 compiles and caches launch [1, 32].
    • Process 2 launches [1, 64] without LCS: 32-config artifact can be reused
      from disk (no rewrite for 64 path).
    • With LCS marking: process 2 misses on 64-specific cache key and compiles a
      64-specific variant.
  • LCS intentionally preserves exact cache hits for matching launch configs. It
    does not force recompilation when the launch-config key already matches.

So the LCS plumbing is what makes launch-config-dependent rewrite decisions
correct under both in-memory and disk cache reuse.
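The caching consequence above can be modeled in a few lines (a GPU-free sketch; the key structure is illustrative, not the PR's actual key format):

```python
# Sketch of signature-only vs launch-config-aware cache keying.
compile_count = {"n": 0}
cache = {}

def get_or_compile(sig, launch_config, lcs):
    # Non-LCS kernels key on signature alone; LCS kernels fold the launch
    # config into the key, so config B cannot reuse config A's artifact.
    key = (sig, launch_config) if lcs else (sig,)
    if key not in cache:
        compile_count["n"] += 1
        cache[key] = f"kernel[{key}]"
    return cache[key]

# Without LCS: [1, 32] then [1, 64] -> one compile; rewrite never reruns.
get_or_compile("f(int32)", ((1,), (32,)), lcs=False)
get_or_compile("f(int32)", ((1,), (64,)), lcs=False)
assert compile_count["n"] == 1

# With LCS: each distinct launch config compiles its own specialization,
# while an exact config match still hits the cache.
cache.clear(); compile_count["n"] = 0
get_or_compile("f(int32)", ((1,), (32,)), lcs=True)
get_or_compile("f(int32)", ((1,), (64,)), lcs=True)
get_or_compile("f(int32)", ((1,), (64,)), lcs=True)  # exact match: cache hit
assert compile_count["n"] == 2
```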

Scope note for cuda.coop today:

  • cuda.coop frequently injects LTO-IR/linking files during compilation.
  • numba-cuda currently does not disk-cache kernels with linking files, so for
    those paths the immediate LCS correctness benefit is runtime/in-memory cache
    behavior across launch configs.
  • Disk-cache LCS behavior applies to launch-config-sensitive kernels that are
    otherwise disk-cacheable (and remains relevant for future linked-code cache
    support).

Safety behavior

  • If an LCS kernel is loaded from disk but the .lcs marker is missing, we
    treat that cache state as unsafe, force recompile, and re-mark.
  • If marking fails (e.g. filesystem error), disk caching is disabled for safety
    (fallback to NullCache) to avoid unsafe reuse.
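A sketch of that safety check (hypothetical filenames and return values; the real code's NullCache fallback and marker handling live in the dispatcher/cache plumbing):

```python
import os
import tempfile

# Hypothetical sketch of the .lcs-marker safety check for an LCS kernel.
def check_lcs_cache(cache_dir, kernel_stem):
    marker = os.path.join(cache_dir, kernel_stem + ".lcs")
    if not os.path.exists(marker):
        # Marker missing: treat the cached state as unsafe, force a
        # recompile, and attempt to re-mark the cache entry.
        try:
            with open(marker, "w"):
                pass
        except OSError:
            # Marking failed (e.g. filesystem error): disable disk caching
            # entirely rather than risk unsafe reuse.
            return "disable-disk-cache"
        return "recompile"
    return "reuse-allowed"
```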

Out of scope



@tpn
Contributor Author

tpn commented Feb 21, 2026

I've left my WIP PR up for this v2 work, which includes some plots re kernel launch overhead, e.g.: https://github.com/NVIDIA/numba-cuda/pull/727/changes#diff-0a7ee1c443b1d66342149e5f9b416d2817f22c97d8c6664da3507c15a0e5454b


Copilot AI left a comment


Pull request overview

This PR adds launch-configuration infrastructure to support compile-time access to launch parameters (and pre-launch hooks), and introduces launch-config-sensitive (LCS) compilation/caching so kernels whose codegen depends on launch config are specialized and cached correctly.

Changes:

  • Add a numba.cuda.launchconfig API to access the current launch config during compilation and to capture it from Python.
  • Add dispatcher/compiler plumbing to mark kernels as launch-config sensitive and to specialize runtime and disk-cache behavior per launch config (including .lcs marker handling).
  • Add tests and docs covering compile-time visibility, LCS recompilation, and LCS vs non-LCS disk-cache keying.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
numba_cuda/numba/cuda/launchconfig.py New API for reading/ensuring current launch config and capturing it during compilation.
numba_cuda/numba/cuda/dispatcher.py Adds launch-config metadata on configured launches, pre-launch callbacks, LCS specialization routing, and LCS disk-cache key/marker logic.
numba_cuda/numba/cuda/compiler.py Promotes LCS marking into compile metadata for downstream dispatcher/cache behavior.
numba_cuda/numba/cuda/cext/_dispatcher.cpp Implements low-overhead TLS storage of active launch config during compilation via C-extension plumbing.
numba_cuda/numba/cuda/tests/cudapy/test_dispatcher.py Tests that launch config is visible during compilation and that capture works.
numba_cuda/numba/cuda/tests/cudapy/test_launch_config_sensitive.py New test ensuring LCS marking forces recompilation across differing launch configs.
numba_cuda/numba/cuda/tests/cudapy/test_caching.py Extends caching tests to validate LCS vs non-LCS cache key behavior across processes.
numba_cuda/numba/cuda/tests/cudapy/cache_launch_config_sensitive_usecases.py New cache usecase module that marks a cached kernel as LCS via rewrite.
numba_cuda/numba/cuda/tests/cudapy/cache_launch_config_insensitive_usecases.py New cache usecase module for a cached kernel that remains launch-config-insensitive.
docs/source/reference/kernel.rst Documents advanced launch-config access, pre-launch callbacks, and LCS compilation semantics.


Comment on lines +705 to +707
self.pre_launch_callbacks = []
self.args = None
self._kernel_launch_config_sensitive = None

Copilot AI Feb 21, 2026


CUDADispatcher.configure() is lru_cached, so the same _LaunchConfiguration instance can be shared across threads. Adding the mutable per-call args field (set in CUDADispatcher.call) makes launches/compilations racy: concurrent launches can overwrite launch_config.args while another thread’s rewrite/callback reads it, leading to wrong argument handling. Consider keeping launch config objects immutable and storing per-call args in thread-local storage (or passing args into callbacks explicitly), or returning a fresh launch-config object per call when args need to be attached.

Contributor Author


I'm going to defer to numba-cuda developers here regarding whether or not this is actually problematic; I vaguely recall numba-cuda not being thread-safe re this particular code path in the first place.

Comment on lines +1811 to +1815
launch_config.args = args
try:
    dispatcher = self._select_launch_config_dispatcher(launch_config)
    if dispatcher is not self:
        return dispatcher.call(args, launch_config)

Copilot AI Feb 21, 2026


Setting launch_config.args = args before selecting a launch-config specialization / compiling is not thread-safe when the configured launch object is shared (it is cached by configure()), and can be overwritten by another thread while compilation is in progress (global compiler lock is acquired deeper in the call chain). This can make cfg.args observed in rewrites/callbacks incorrect. Prefer attaching args via thread-local storage for the duration of compilation/launch, or passing args to callbacks without mutating shared state.

Comment on lines +1705 to +1711
if self._cache.is_launch_config_sensitive():
    if launch_config is None:
        key = None
    else:
        key = self._launch_config_key(launch_config)
    self._cache.set_launch_config_key(key)
    return

Copilot AI Feb 21, 2026


When the on-disk cache is marked launch-config-sensitive (.lcs exists) but there is no active launch config (e.g. the user calls dispatcher.compile(sig) directly), _configure_cache_for_launch_config() sets the cache key to None. That can allow loading legacy/unsafe cache entries created before .lcs keying (or entries compiled without a launch config), undermining the safety guarantee. Consider forcing a cache miss / disabling disk-cache loads when .lcs is present but current_launch_config() is None (or using a dedicated sentinel key that cannot match pre-LCS entries).

Comment on lines +29 to +52
@contextlib.contextmanager
def capture_compile_config(dispatcher):
    """Capture the launch config seen during compilation for a dispatcher.

    The returned dict has a single key, ``"config"``, which is populated when
    a compilation is triggered by a kernel launch. If the kernel is already
    compiled, the dict value may remain ``None``.
    """
    if dispatcher is None:
        raise TypeError("dispatcher is required")

    record = {"config": None}
    original = dispatcher._compile_for_args

    @wraps(original)
    def wrapped(*args, **kws):
        record["config"] = current_launch_config()
        return original(*args, **kws)

    dispatcher._compile_for_args = wrapped
    try:
        yield record
    finally:
        dispatcher._compile_for_args = original

Copilot AI Feb 21, 2026


capture_compile_config() temporarily monkeypatches dispatcher._compile_for_args on the dispatcher object. Since dispatchers are typically shared, this is not thread-safe: a concurrent compilation/launch in another thread could observe the wrapped method (or have its own wrapper clobbered), leading to incorrect behavior or leaked wrappers if interleaved. Consider implementing capture via a dispatcher-level hook mechanism (e.g. a stack of callbacks), or at minimum scoping the capture using thread-local state so only the current thread records the config.

@gmarkall gmarkall added the 3 - Ready for Review Ready for review by team label Feb 24, 2026
