[KVConnector] Auto-downgrade to PIECEWISE cudagraph mode for layerwise async ops #31057
base: main
Conversation
Code Review
This pull request introduces a mechanism to automatically downgrade to PIECEWISE CUDA graph mode when a KV connector uses layerwise async operations, which are incompatible with full CUDA graphs. This is a good fix that prevents potential data races. The implementation adds a new class method requires_piecewise_for_cudagraph to the base KV connector and overrides it in LMCacheConnectorV1. The logic in VllmConfig to use this check is also sound.
I've found one critical issue related to the MultiConnector, which could fail to propagate this requirement from its child connectors, leading to the very data races this PR aims to fix. I've provided a detailed comment and a suggested fix for this.
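For context, here is a minimal self-contained sketch of the propagation the review is asking for. Only `requires_piecewise_for_cudagraph` comes from the PR; `LayerwiseConnector` and the `(class, config)` modeling of children are illustrative stand-ins for the real connector factory lookup:

```python
from typing import Any


class KVConnectorBase_V1:
    @classmethod
    def requires_piecewise_for_cudagraph(cls, extra_config: dict[str, Any]) -> bool:
        # Default: no layerwise async ops, so full CUDA graphs stay safe.
        return False


class LayerwiseConnector(KVConnectorBase_V1):
    @classmethod
    def requires_piecewise_for_cudagraph(cls, extra_config: dict[str, Any]) -> bool:
        return bool(extra_config.get("use_layerwise", False))


class MultiConnector(KVConnectorBase_V1):
    @classmethod
    def requires_piecewise_for_cudagraph(cls, extra_config: dict[str, Any]) -> bool:
        # Propagate the requirement: if any child connector needs PIECEWISE,
        # the composite connector does too. Children are modeled here as
        # (class, extra_config) pairs; the real code resolves classes via
        # the connector factory.
        return any(
            child_cls.requires_piecewise_for_cudagraph(child_cfg)
            for child_cls, child_cfg in extra_config.get("connectors", [])
        )


# One layerwise child is enough to force PIECEWISE for the whole stack.
cfg = {"connectors": [(KVConnectorBase_V1, {}),
                      (LayerwiseConnector, {"use_layerwise": True})]}
assert MultiConnector.requires_piecewise_for_cudagraph(cfg)
```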
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed from 0b093dc to e830e03
seems reasonable

just note that this will have a severe negative impact on DBO
…e async ops

When KV connectors use layerwise async operations (wait_for_layer_load, save_kv_layer), these calls happen inside the @maybe_transfer_kv_layer decorator, which gets skipped during CUDA graph replay. This causes data races where attention kernels read incompletely loaded KV cache data.

This change adds a class method `requires_piecewise_for_cudagraph` to KVConnectorBase_V1 that connectors can override to indicate they need PIECEWISE mode. LMCache overrides this to return True when use_layerwise is enabled. MultiConnector propagates the requirement from its child connectors. VllmConfig then auto-downgrades to PIECEWISE during init.

Fixes vllm-project#29608

Signed-off-by: Yashwant Bezawada <[email protected]>
Force-pushed from e830e03 to 49a322d
Good catch, thanks for pointing this out. Dug into this a bit - the issue is that DBO needs FULL graphs for the overlap schedule to work, but layerwise KV transfer needs PIECEWISE since it has Python sync points between layers. So yeah, they don't play well together. That said, if users do set both, one option would be to add an explicit check that errors out when both DBO and layerwise are enabled, so users know they need to pick one. Thoughts? Open to other ideas if you have something else in mind.
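For concreteness, a minimal sketch of the explicit check floated in this comment; the function and flag names are hypothetical, not vLLM's actual config surface:

```python
def check_dbo_vs_layerwise(dbo_enabled: bool, needs_piecewise: bool) -> None:
    # Hypothetical guard, not vLLM code: DBO wants FULL graphs for its
    # overlap schedule, layerwise KV transfer wants PIECEWISE for its
    # Python sync points, so fail fast when both are requested.
    if dbo_enabled and needs_piecewise:
        raise ValueError(
            "DBO requires FULL CUDA graphs, but the configured KV connector "
            "requires PIECEWISE mode for layerwise async ops; disable one."
        )
```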
@yashwantbezawada you're already warning when downgrading, I think that's enough. You could also check if DBO is enabled and emit a second warning that DBO will be affected, if we want to be clearer. But there might be other cases where downgrading from FULL -> PIECEWISE hurts performance, so I'm not sure that's scalable. I think either is fine.
Sounds good, thanks for the feedback! Will leave it as-is with the current warning.
```python
logger.warning_once(
    "KV connector %s requires PIECEWISE CUDA graph mode "
    "due to layerwise async operations that cannot be "
    "captured in CUDA graphs. "
    "Overriding cudagraph_mode to PIECEWISE.",
    connector_cls.__name__,
)
```
Let's slightly improve the warning, something like this, feel free to edit:
```python
logger.warning_once(
    "KV connector %s requires PIECEWISE CUDA graph mode "
    "due to layerwise async operations that cannot be "
    "captured in CUDA graphs. "
    "Overriding cudagraph_mode from %s to PIECEWISE, which "
    "might reduce performance. "
    "You may want to use a different kvcache connector "
    "with support for full cudagraphs for better performance",
    connector_cls.__name__,
    # note: the new "from %s" placeholder also needs the current
    # cudagraph mode passed here as a second format argument
)
```
Hey @ProExpertProg, any chance this can be merged? Approved on Jan 5 and no conflicts. Thanks!

Sorry I missed this, can you merge from main?

Nvm, looks like it was missing CI.
Purpose
Fixes #29608
When using async KV connectors with layerwise operations (`wait_for_layer_load`, `save_kv_layer`), these calls are made inside the `@maybe_transfer_kv_layer` decorator. During CUDA graph replay, only the captured GPU ops run - the decorator's Python code is skipped entirely. This means `wait_for_layer_load()` never executes, causing data races where attention kernels read KV cache data that hasn't finished loading.

This is a follow-up to #29755, based on @NickLucche's feedback, to implement approach (b): restricting layerwise ops to piecewise mode rather than trying to trace them into CUDA graphs.
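To make the failure mode concrete, here is a simplified sketch of the decorator pattern described above (not the actual vLLM implementation; `get_kv_connector` and `layer_name` are illustrative stand-ins):

```python
import functools


def get_kv_connector():
    # Stand-in for the forward-context lookup the real decorator performs.
    return None


def maybe_transfer_kv_layer(fn):
    # Simplified stand-in for the real decorator. Both transfer hooks are
    # plain Python: they run in eager mode and during graph *capture*, but
    # a full-graph *replay* re-executes only the recorded GPU kernels, so
    # wait_for_layer_load() is silently skipped and attention can read a
    # KV cache that is still being written.
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        connector = get_kv_connector()
        if connector is not None:
            connector.wait_for_layer_load(self.layer_name)    # skipped on replay
        output = fn(self, *args, **kwargs)                    # captured GPU work
        if connector is not None:
            connector.save_kv_layer(self.layer_name, output)  # skipped on replay
        return output

    return wrapper
```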
Changes
- Added `requires_piecewise_for_cudagraph(cls, extra_config)` class method to `KVConnectorBase_V1`
  - Returns `False` by default (most connectors don't use layerwise async ops)
  - `LMCacheConnectorV1` overrides to return `True` when `use_layerwise=True`
- `VllmConfig.__post_init__()` checks this method and auto-downgrades to PIECEWISE when needed (see the sketch below)
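As referenced in the list above, a rough sketch of the downgrade step, with a simplified two-value stand-in for vLLM's actual `CUDAGraphMode` enum:

```python
from enum import Enum, auto


class CUDAGraphMode(Enum):
    # Simplified two-mode stand-in for vLLM's actual enum.
    FULL = auto()
    PIECEWISE = auto()


def resolve_cudagraph_mode(requested: CUDAGraphMode,
                           connector_needs_piecewise: bool) -> CUDAGraphMode:
    # Mirrors the VllmConfig.__post_init__() behavior described above:
    # if the configured KV connector reports that it needs PIECEWISE,
    # downgrade a FULL request instead of letting graph replay skip the
    # Python-side layer waits.
    if requested is CUDAGraphMode.FULL and connector_needs_piecewise:
        return CUDAGraphMode.PIECEWISE
    return requested


# e.g. LMCacheConnectorV1 with use_layerwise=True reports True here:
assert resolve_cudagraph_mode(CUDAGraphMode.FULL, True) is CUDAGraphMode.PIECEWISE
assert resolve_cudagraph_mode(CUDAGraphMode.FULL, False) is CUDAGraphMode.FULL
```

Test Plan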
The fix is config-level validation that runs at engine initialization. Manual testing with LMCache + `use_layerwise=True` + CUDA graphs should show the warning message and the mode being downgraded.
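A hedged sketch of that manual test via the Python API; the model name and the LMCache-side layerwise toggle are assumptions, while `KVTransferConfig` and `LMCacheConnectorV1` come from the PR context:

```python
# Assumes LMCache is installed and layerwise mode is switched on via
# LMCache's own configuration (its use_layerwise setting).
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model for illustration
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
)
# With layerwise mode enabled on the LMCache side, startup logs should show
# the "Overriding cudagraph_mode to PIECEWISE" warning.
```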
Test Result

Pre-commit checks pass. The implementation follows existing patterns in the codebase for handling CUDA graph incompatibilities.