[Core] Move EngineCoreRequest to Request conversion out of EngineCore #21627
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request effectively moves the EngineCoreRequest to Request conversion out of the engine's critical path, which, as shown by the profiling data, significantly improves performance. The implementation is clean and the logic is sound.
My main concern is about the robustness of the process_input_sockets thread. By moving more complex logic into this daemon thread, the risk of silent failures that could hang the server increases. I've added a high-severity comment with a suggestion to implement more robust error handling to ensure the engine fails gracefully.
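To make the concern concrete, here is a minimal sketch of the kind of fail-fast handling being suggested. All names here (`EngineCoreSketch`, `process_input_sockets`, the sentinel protocol) are illustrative stand-ins, not vLLM's actual implementation: the point is only that an exception in the input daemon thread should wake the core loop rather than let it block forever on an empty queue.

```python
import queue
import threading


class EngineCoreSketch:
    """Illustrative only: fail fast when the input daemon thread dies."""

    def __init__(self, raw_inputs):
        self._raw = iter(raw_inputs)       # stand-in for a socket
        self.input_queue = queue.Queue()
        self.errored = threading.Event()

    def _recv(self):
        return next(self._raw)

    def _convert(self, raw):
        # Stand-in for the EngineCoreRequest -> Request conversion,
        # which may raise on malformed input.
        if raw == "bad":
            raise ValueError("malformed request")
        return raw.upper()

    def process_input_sockets(self):
        try:
            while True:
                self.input_queue.put(self._convert(self._recv()))
        except StopIteration:
            pass  # input exhausted (stand-in for clean socket shutdown)
        except Exception:
            # Surface the failure instead of dying silently: the sentinel
            # wakes up a core loop that may be blocked on the queue.
            self.errored.set()
            self.input_queue.put(None)

    def core_loop_step(self):
        item = self.input_queue.get()
        if item is None:
            raise RuntimeError("input processing thread failed")
        return item
```

The `None` sentinel matters because `queue.Queue.get()` blocks indefinitely; without it, a dead producer thread would leave the consumer hung rather than erroring out.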
vllm/v1/engine/core.py
Is this additional protection, or did we already have such error handling?
This is additional protection, in response to gemini's review above #21627 (comment)
I think it would be better to keep the PR focused and not introduce new error handling here.
Looks good to me!
Thanks @linzebing this looks good to me for the most part!
vllm/v1/engine/core.py
Maybe rename this:
- def add_request(self, request: Request, current_wave: int = 0):
+ def add_request(self, request: Request, request_wave: int = 0):
force-pushed from 67ee54e to fdca348
force-pushed from c6732fe to 1698a9d
Thanks @linzebing @Jialin!
One more thing: do you think you could also add a short comment in the code next to each of those two things confirming that they are thread-safe?
Thanks @njhill for the review! I have added comments around thread safety.
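For illustration, the kind of thread-safety comment being requested might look like the following. This is a hypothetical single-producer/single-consumer hand-off, not the actual vLLM code; the factual claim in the comments (that `deque.append` and `deque.popleft` are individually atomic in CPython) is what such a comment would document.

```python
from collections import deque

# Single-producer/single-consumer hand-off between the input thread
# and the core loop. Illustrative stand-in for the queues in the PR.
pending: deque = deque()


def producer(item) -> None:
    # Thread-safe: in CPython, deque.append is a single atomic
    # operation under the GIL, so no lock is needed here.
    pending.append(item)


def consumer():
    # Thread-safe for a single consumer: deque.popleft is likewise
    # atomic in CPython. (The emptiness check and the pop are not
    # atomic as a pair, which is fine only because there is exactly
    # one consumer thread.)
    return pending.popleft() if pending else None
```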
Thanks @linzebing!
@linzebing I think some of the CI test failures are related: https://buildkite.com/vllm/ci/builds/25313#01985782-1b21-4082-9849-1642052665bc/212-1966
@njhill: thanks for pointing out! I forgot to examine the callsites of
I manually inspected the codebase; I should have changed all callsites of
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: linzebing <[email protected]>
Looks like the v1-test failure is CI-related. I tested locally, no issue:
@linzebing If you've confirmed the failing test is not related, you could either rebase to kick off the CI or request a force merge in the sig-ci Slack channel.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
Purpose
In the engine core thread, we spent 18us converting EngineCoreRequest to Request, which is on the model-forward critical path.
Ideally, we should be able to move the conversion from the engine core thread to the input request thread, taking this logic off the critical path.
There's an extra benefit to making the change: Request becomes available in the input processing threads, which would significantly simplify the pending changes for the block hashing optimization mentioned in #21247
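The hand-off pattern described above can be sketched as follows. The dataclasses here are drastically simplified stand-ins for vLLM's `EngineCoreRequest` and `Request` (the real classes carry far more state), and the function names are illustrative; the point is that the conversion cost is paid on the input thread, so the core loop only pops fully constructed `Request` objects.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class EngineCoreRequest:  # simplified stand-in
    request_id: str
    prompt_token_ids: list


@dataclass
class Request:  # simplified stand-in
    request_id: str
    num_prompt_tokens: int


def convert(req: EngineCoreRequest) -> Request:
    # The ~18us conversion now runs here, on the input thread,
    # overlapping with model forward instead of blocking it.
    return Request(req.request_id, len(req.prompt_token_ids))


input_queue: queue.Queue = queue.Queue()


def input_thread(raw_requests) -> None:
    for raw in raw_requests:
        input_queue.put(convert(raw))


def drain_core_queue() -> list:
    # The engine core thread now pops ready Request objects; no
    # conversion work remains on the critical path.
    out = []
    while True:
        try:
            out.append(input_queue.get_nowait())
        except queue.Empty:
            return out
```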
Test Plan
Test Result
With the change, handle_client_request in the engine core thread dropped from 35us to 7us.

As expected, the Request conversion now executes in parallel with model forward:

(Optional) Documentation Update