
Conversation

@yashwantbezawada

Purpose

Fixes #29608

When using async KV connectors with layerwise operations (wait_for_layer_load, save_kv_layer), these calls are made inside the @maybe_transfer_kv_layer decorator. During CUDA graph replay, only the captured GPU ops run - the decorator's Python code is skipped entirely. This means wait_for_layer_load() never executes, causing data races where attention kernels read KV cache data that hasn't finished loading.
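
To see why replay skips the hook, here is a minimal standalone sketch (not vLLM code; wait_for_layer_load below is a stand-in for the connector hook, and running it requires a CUDA GPU):

import torch

def wait_for_layer_load():
    # Stand-in for the connector's host-side sync; hypothetical body.
    print("waiting for layer load")

def layer_forward(x):
    wait_for_layer_load()   # plain Python, not a GPU kernel
    return x * 2            # GPU op that gets captured

x = torch.ones(4, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = layer_forward(x)    # prints once, during capture

g.replay()                  # replays only the captured kernel: no print,
                            # so the connector's sync would silently be skipped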

This is a follow-up to #29755 based on @NickLucche's feedback to implement approach (b): restricting layerwise ops to piecewise mode rather than trying to trace them into CUDA graphs.

Changes

  1. Added requires_piecewise_for_cudagraph(cls, extra_config) class method to KVConnectorBase_V1 (see the sketch after this list)

    • Returns False by default (most connectors don't use layerwise async ops)
    • Connectors override this if they need PIECEWISE mode
  2. LMCacheConnectorV1 overrides to return True when use_layerwise=True

    • Other connectors (Nixl, P2P, etc.) have no-op layerwise implementations, so they're fine with FULL mode
  3. VllmConfig.__post_init__() checks this method and auto-downgrades to PIECEWISE when needed

    • Follows the same pattern used for DCP, pooler models, encoder-decoder models
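
A rough sketch of the shape of these hooks (names follow the list above; exact signatures, the extra-config layout, and the VllmConfig call site are assumptions, not the merged code):

class KVConnectorBase_V1:
    @classmethod
    def requires_piecewise_for_cudagraph(cls, extra_config: dict) -> bool:
        # Default: no layerwise async ops, so FULL CUDA graphs are safe.
        return False

class LMCacheConnectorV1(KVConnectorBase_V1):
    @classmethod
    def requires_piecewise_for_cudagraph(cls, extra_config: dict) -> bool:
        # Layerwise loads add Python sync points between layers that a
        # full graph would skip on replay, so require PIECEWISE.
        return bool(extra_config.get("use_layerwise", False))

# Inside VllmConfig.__post_init__ (assumed shape of the check):
#     if connector_cls.requires_piecewise_for_cudagraph(extra_config):
#         self.compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE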

Test Plan

The fix is config-level validation that runs at engine initialization. Manual testing with LMCache + use_layerwise=True + CUDA graphs should show the warning message and the mode being downgraded.

Test Result

Pre-commit checks pass. The implementation follows existing patterns in the codebase for handling CUDA graph incompatibilities.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a mechanism to automatically downgrade to PIECEWISE CUDA graph mode when a KV connector uses layerwise async operations, which are incompatible with full CUDA graphs. This is a good fix that prevents potential data races. The implementation adds a new class method requires_piecewise_for_cudagraph to the base KV connector and overrides it in LMCacheConnectorV1. The logic in VllmConfig to use this check is also sound.

I've found one critical issue related to the MultiConnector, which could fail to propagate this requirement from its child connectors, leading to the very data races this PR aims to fix. I've provided a detailed comment and a suggested fix for this.
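
For context, the propagation the review asks for could look roughly like this (a sketch only; the child-connector config layout and the class-lookup helper are assumptions):

class MultiConnector(KVConnectorBase_V1):
    @classmethod
    def requires_piecewise_for_cudagraph(cls, extra_config: dict) -> bool:
        # Require PIECEWISE if any child connector requires it.
        for child in extra_config.get("connectors", []):
            child_cls = resolve_connector_class(child["kv_connector"])  # hypothetical lookup helper
            child_extra = child.get("kv_connector_extra_config", {})
            if child_cls.requires_piecewise_for_cudagraph(child_extra):
                return True
        return False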


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@yashwantbezawada force-pushed the fix/kv-connector-piecewise-cudagraph branch from 0b093dc to e830e03 on December 20, 2025 03:07
@robertgshaw2-redhat
Collaborator

seems reasonable

@robertgshaw2-redhat
Collaborator

just note that this will have a severe negative impact on DBO

…e async ops

When KV connectors use layerwise async operations (wait_for_layer_load,
save_kv_layer), these calls happen inside the @maybe_transfer_kv_layer
decorator which gets skipped during CUDA graph replay. This causes data
races where attention kernels read incompletely loaded KV cache data.

This change adds a class method `requires_piecewise_for_cudagraph` to
KVConnectorBase_V1 that connectors can override to indicate they need
PIECEWISE mode. LMCache overrides this to return True when use_layerwise
is enabled. MultiConnector propagates the requirement from its child
connectors. VllmConfig then auto-downgrades to PIECEWISE during init.

Fixes vllm-project#29608

Signed-off-by: Yashwant Bezawada <[email protected]>
@yashwantbezawada force-pushed the fix/kv-connector-piecewise-cudagraph branch from e830e03 to 49a322d on December 20, 2025 19:24
@yashwantbezawada
Author

Good catch, thanks for pointing this out.

Dug into this a bit - the issue is that DBO needs FULL graphs for the overlap schedule to work, but layerwise KV transfer needs PIECEWISE since it has Python sync points between layers. So yeah, they don't play well together.

That said, if users set use_layerwise=false, the batch loading path should work fine with FULL graphs.

One option would be to add an explicit check that errors out when both DBO and layerwise are enabled, so users know they need to pick one. Thoughts? Open to other ideas if you have something else in mind.

@ProExpertProg
Collaborator

One option would be to add an explicit check that errors out when both DBO and layerwise are enabled, so users know they need to pick one. Thoughts? Open to other ideas if you have something else in mind.

@yashwantbezawada you're already warning when downgrading, I think that's enough. You could also check if DBO is enabled and emit a second warning that DBO will be affected if we want to be clearer. But there might be other cases where downgrading from FULL -> PIECEWISE hurts performance so not sure that's scalable. I think either is fine

@yashwantbezawada
Author

Sounds good, thanks for the feedback! Will leave it as-is with the current warning.

Comment on lines +737 to +743
logger.warning_once(
    "KV connector %s requires PIECEWISE CUDA graph mode "
    "due to layerwise async operations that cannot be "
    "captured in CUDA graphs. "
    "Overriding cudagraph_mode to PIECEWISE.",
    connector_cls.__name__,
)
Collaborator


Let's slightly improve the warning, something like this, feel free to edit:

Suggested change (replacing the warning quoted above; note the added %s for the original mode needs a second format argument, whose attribute path is assumed here):

logger.warning_once(
    "KV connector %s requires PIECEWISE CUDA graph mode "
    "due to layerwise async operations that cannot be "
    "captured in CUDA graphs. "
    "Overriding cudagraph_mode from %s to PIECEWISE, which "
    "might reduce performance. "
    "You may want to use a different kvcache connector "
    "with support for full cudagraphs for better performance.",
    connector_cls.__name__,
    self.compilation_config.cudagraph_mode,  # assumed attribute path for the original mode
)

@github-project-automation (bot) moved this to Ready in NVIDIA on Jan 5, 2026
@yashwantbezawada
Author

Hey @ProExpertProg, any chance this can be merged? Approved on Jan 5 and no conflicts. Thanks!

@ProExpertProg
Collaborator

Sorry I missed this, can you merge from main?

@ProExpertProg added the ready label (ONLY add when PR is ready to merge / full CI is needed)
@ProExpertProg
Collaborator

Nvm looks like it was missing CI

