[KVConnector] Auto-downgrade to PIECEWISE cudagraph mode for layerwise async ops #31057
base: main
Conversation
Code Review
This pull request introduces a mechanism to automatically downgrade to PIECEWISE CUDA graph mode when a KV connector uses layerwise async operations, which are incompatible with full CUDA graphs. This is a good fix that prevents potential data races. The implementation adds a new class method requires_piecewise_for_cudagraph to the base KV connector and overrides it in LMCacheConnectorV1. The logic in VllmConfig to use this check is also sound.
I've found one critical issue related to the MultiConnector, which could fail to propagate this requirement from its child connectors, leading to the very data races this PR aims to fix. I've provided a detailed comment and a suggested fix for this.
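For context, here is a minimal self-contained sketch of the propagation the review is asking for. Only `requires_piecewise_for_cudagraph` comes from the PR; `LayerwiseConnector` and the `(class, config)` modeling of children are illustrative stand-ins for the real connector factory lookup:

```python
from typing import Any


class KVConnectorBase_V1:
    @classmethod
    def requires_piecewise_for_cudagraph(cls, extra_config: dict[str, Any]) -> bool:
        # Default: no layerwise async ops, so full CUDA graphs stay safe.
        return False


class LayerwiseConnector(KVConnectorBase_V1):
    @classmethod
    def requires_piecewise_for_cudagraph(cls, extra_config: dict[str, Any]) -> bool:
        return bool(extra_config.get("use_layerwise", False))


class MultiConnector(KVConnectorBase_V1):
    @classmethod
    def requires_piecewise_for_cudagraph(cls, extra_config: dict[str, Any]) -> bool:
        # Propagate the requirement: if any child connector needs PIECEWISE,
        # the composite connector does too. Children are modeled here as
        # (class, extra_config) pairs; the real code resolves classes via
        # the connector factory.
        return any(
            child_cls.requires_piecewise_for_cudagraph(child_cfg)
            for child_cls, child_cfg in extra_config.get("connectors", [])
        )


# One layerwise child is enough to force PIECEWISE for the whole stack.
cfg = {"connectors": [(KVConnectorBase_V1, {}),
                      (LayerwiseConnector, {"use_layerwise": True})]}
assert MultiConnector.requires_piecewise_for_cudagraph(cfg)
```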
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed from 0b093dc to e830e03
seems reasonable

just note that this will have a severe negative impact on DBO
…e async ops

When KV connectors use layerwise async operations (wait_for_layer_load, save_kv_layer), these calls happen inside the @maybe_transfer_kv_layer decorator, which gets skipped during CUDA graph replay. This causes data races where attention kernels read incompletely loaded KV cache data.

This change adds a class method `requires_piecewise_for_cudagraph` to KVConnectorBase_V1 that connectors can override to indicate they need PIECEWISE mode. LMCache overrides this to return True when use_layerwise is enabled. MultiConnector propagates the requirement from its child connectors. VllmConfig then auto-downgrades to PIECEWISE during init.

Fixes vllm-project#29608

Signed-off-by: Yashwant Bezawada <[email protected]>
Force-pushed from e830e03 to 49a322d
Good catch, thanks for pointing this out. Dug into this a bit - the issue is that DBO needs FULL graphs for the overlap schedule to work, but layerwise KV transfer needs PIECEWISE since it has Python sync points between layers. So yeah, they don't play well together. That said, if users do set both, one option would be to add an explicit check that errors out when both DBO and layerwise are enabled, so users know they need to pick one. Thoughts? Open to other ideas if you have something else in mind.
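For concreteness, a minimal sketch of the explicit check floated in this comment; the function and flag names are hypothetical, not vLLM's actual config surface:

```python
def check_dbo_vs_layerwise(dbo_enabled: bool, needs_piecewise: bool) -> None:
    # Hypothetical guard, not vLLM code: DBO wants FULL graphs for its
    # overlap schedule, layerwise KV transfer wants PIECEWISE for its
    # Python sync points, so fail fast when both are requested.
    if dbo_enabled and needs_piecewise:
        raise ValueError(
            "DBO requires FULL CUDA graphs, but the configured KV connector "
            "requires PIECEWISE mode for layerwise async ops; disable one."
        )
```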
@yashwantbezawada you're already warning when downgrading, I think that's enough. You could also check if DBO is enabled and emit a second warning that DBO will be affected, if we want to be clearer. But there might be other cases where downgrading from FULL -> PIECEWISE hurts performance, so I'm not sure that's scalable. I think either is fine.
Sounds good, thanks for the feedback! Will leave it as-is with the current warning.
```python
logger.warning_once(
    "KV connector %s requires PIECEWISE CUDA graph mode "
    "due to layerwise async operations that cannot be "
    "captured in CUDA graphs. "
    "Overriding cudagraph_mode to PIECEWISE.",
    connector_cls.__name__,
)
```
Let's slightly improve the warning, something like this, feel free to edit:
```python
logger.warning_once(
    "KV connector %s requires PIECEWISE CUDA graph mode "
    "due to layerwise async operations that cannot be "
    "captured in CUDA graphs. "
    "Overriding cudagraph_mode from %s to PIECEWISE, which "
    "might reduce performance. "
    "You may want to use a different kvcache connector "
    "with support for full cudagraphs for better performance",
    connector_cls.__name__,
    # note: the new "from %s" placeholder also needs the current
    # cudagraph mode passed here as a second format argument
)
```
Hey @ProExpertProg, any chance this can be merged? Approved on Jan 5 and no conflicts. Thanks!

Sorry I missed this, can you merge from main?

Nvm, looks like it was missing CI.
Purpose
Fixes #29608
When using async KV connectors with layerwise operations (`wait_for_layer_load`, `save_kv_layer`), these calls are made inside the `@maybe_transfer_kv_layer` decorator. During CUDA graph replay, only the captured GPU ops run - the decorator's Python code is skipped entirely. This means `wait_for_layer_load()` never executes, causing data races where attention kernels read KV cache data that hasn't finished loading.

This is a follow-up to #29755, based on @NickLucche's feedback, to implement approach (b): restricting layerwise ops to piecewise mode rather than trying to trace them into CUDA graphs.
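To make the failure mode concrete, here is a simplified sketch of the decorator pattern described above (not the actual vLLM implementation; `get_kv_connector` and `layer_name` are illustrative stand-ins):

```python
import functools


def get_kv_connector():
    # Stand-in for the forward-context lookup the real decorator performs.
    return None


def maybe_transfer_kv_layer(fn):
    # Simplified stand-in for the real decorator. Both transfer hooks are
    # plain Python: they run in eager mode and during graph *capture*, but
    # a full-graph *replay* re-executes only the recorded GPU kernels, so
    # wait_for_layer_load() is silently skipped and attention can read a
    # KV cache that is still being written.
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        connector = get_kv_connector()
        if connector is not None:
            connector.wait_for_layer_load(self.layer_name)    # skipped on replay
        output = fn(self, *args, **kwargs)                    # captured GPU work
        if connector is not None:
            connector.save_kv_layer(self.layer_name, output)  # skipped on replay
        return output

    return wrapper
```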
Changes
- Added `requires_piecewise_for_cudagraph(cls, extra_config)` class method to `KVConnectorBase_V1`
  - Returns `False` by default (most connectors don't use layerwise async ops)
  - `LMCacheConnectorV1` overrides to return `True` when `use_layerwise=True`
- `VllmConfig.__post_init__()` checks this method and auto-downgrades to PIECEWISE when needed (see the sketch below)
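As referenced in the list above, a rough sketch of the downgrade step, with a simplified two-value stand-in for vLLM's actual `CUDAGraphMode` enum:

```python
from enum import Enum, auto


class CUDAGraphMode(Enum):
    # Simplified two-mode stand-in for vLLM's actual enum.
    FULL = auto()
    PIECEWISE = auto()


def resolve_cudagraph_mode(requested: CUDAGraphMode,
                           connector_needs_piecewise: bool) -> CUDAGraphMode:
    # Mirrors the VllmConfig.__post_init__() behavior described above:
    # if the configured KV connector reports that it needs PIECEWISE,
    # downgrade a FULL request instead of letting graph replay skip the
    # Python-side layer waits.
    if requested is CUDAGraphMode.FULL and connector_needs_piecewise:
        return CUDAGraphMode.PIECEWISE
    return requested


# e.g. LMCacheConnectorV1 with use_layerwise=True reports True here:
assert resolve_cudagraph_mode(CUDAGraphMode.FULL, True) is CUDAGraphMode.PIECEWISE
assert resolve_cudagraph_mode(CUDAGraphMode.FULL, False) is CUDAGraphMode.FULL
```

Test Plan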
The fix is config-level validation that runs at engine initialization. Manual testing with LMCache + `use_layerwise=True` + CUDA graphs should show the warning message and the mode being downgraded.
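A hedged sketch of that manual test via the Python API; the model name and the LMCache-side layerwise toggle are assumptions, while `KVTransferConfig` and `LMCacheConnectorV1` come from the PR context:

```python
# Assumes LMCache is installed and layerwise mode is switched on via
# LMCache's own configuration (its use_layerwise setting).
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model for illustration
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
)
# With layerwise mode enabled on the LMCache side, startup logs should show
# the "Overriding cudagraph_mode to PIECEWISE" warning.
```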
Test Result

Pre-commit checks pass. The implementation follows existing patterns in the codebase for handling CUDA graph incompatibilities.