
Conversation

Jialin
Contributor

@Jialin Jialin commented Jul 21, 2025

…thread

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

In the engine core thread, we spend 18us converting EngineCoreRequest to Request, which is on the model forward critical path.

Ideally, we should move the conversion from the engine core thread to the input request thread to take this logic off the critical path.

There's an extra benefit to making the change: Request becomes available in the input processing threads, which would significantly simplify the pending changes for the block hashing optimization mentioned in #21247.
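The idea above can be sketched as follows. This is a minimal illustration, not the actual vLLM code: the class fields, `convert` helper, and `input_thread` function are all hypothetical stand-ins for the real EngineCoreRequest-to-Request conversion being moved off the engine core loop.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class EngineCoreRequest:
    # Wire-level request (illustrative fields only, not vLLM's definition).
    request_id: str
    prompt_token_ids: list


@dataclass
class Request:
    # Scheduler-side request; cheap to consume on the critical path.
    request_id: str
    num_tokens: int


def convert(ecr: EngineCoreRequest) -> Request:
    # Stand-in for the ~18us EngineCoreRequest -> Request conversion.
    return Request(ecr.request_id, len(ecr.prompt_token_ids))


ready: queue.Queue = queue.Queue()


def input_thread(raw_requests):
    # After this change, the conversion runs here, overlapping with
    # model forward instead of blocking the engine core loop.
    for ecr in raw_requests:
        ready.put(convert(ecr))


t = threading.Thread(target=input_thread,
                     args=([EngineCoreRequest("r0", [1, 2, 3])],))
t.start()
t.join()

# The engine core loop now only dequeues fully formed Request objects.
req = ready.get_nowait()
print(req.request_id, req.num_tokens)  # r0 3
```

The engine core thread's per-request work shrinks to a queue pop, which is what the profiles below demonstrate.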

[Screenshots: profiler traces of the engine core thread before the change]

Test Plan

Profile with the change

Test Result

With the change, handle_client_request in the engine core thread dropped from 35us to 7us.
[Screenshot: profiler trace of handle_client_request after the change]

As expected, the Request conversion now executes in parallel with model forward.
[Screenshot: profiler trace showing Request conversion overlapping model forward]

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Jul 21, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the request processing pipeline to improve performance by moving the conversion of EngineCoreRequest to Request off the main engine thread's critical path. The changes are well-structured and the performance gains are clearly demonstrated.

I've identified one potential high-severity issue: a race condition on the multi-modal input cache (mm_input_cache_server) introduced by accessing it from multiple threads without synchronization. I've provided details in a specific comment. Addressing this will ensure the thread safety of the new implementation.
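To illustrate the kind of fix the review is asking for: the sketch below is a hypothetical stand-in for mm_input_cache_server (the real vLLM class differs), showing one standard way to make a shared cache safe for access from multiple threads, holding a lock around the check-then-insert.

```python
import threading


class MMInputCache:
    """Illustrative stand-in for mm_input_cache_server; not vLLM's class.

    Shows a lock-guarded cache so concurrent threads cannot race on
    the same key.
    """

    def __init__(self):
        self._cache = {}
        self._lock = threading.Lock()  # serializes all cache access

    def get_or_insert(self, key, value):
        # Without the lock, two threads could both miss on `key` and
        # race to insert; setdefault under the lock makes the
        # check-then-insert atomic, so the first insert always wins.
        with self._lock:
            return self._cache.setdefault(key, value)


cache = MMInputCache()
first = cache.get_or_insert("img-0", "decoded-A")
second = cache.get_or_insert("img-0", "decoded-B")  # first insert wins
```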

@robertgshaw2-redhat
Collaborator

nice idea! I wonder what else we can put into that thread...

@Jialin
Contributor Author

Jialin commented Jul 21, 2025

nice idea! I wonder what else we can put into that thread...

block hashes for sure, as we discussed in a separate PR.

Personally, the main benefit of this change is to come up with a centralized place to put all these future ideas :)

Member


Instead of adding this to the internal Request, maybe convert EngineCoreRequest to a tuple[Request, int]?

Contributor Author


Hmmm, I might lean toward adding it in Request instead.

  1. current_wave is an existing field in EngineCoreRequest
  2. If we go with tuple[Request, int], I'm afraid we might end up having tuple[Request, A, B, C, D, ...] in the future :/

WDYT?

Member


current_wave is logically separate from the request, it's only used for coordination purposes at the point that the request is received. Request is the scheduler's state for the request so it doesn't really belong in there. So I don't think what you mentioned will be a concern.

Contributor Author


How about wrapping current_wave in a new class (e.g. RequestEnv)? The interface would then become tuple[Request, RequestEnv].

In this way,

  • non-request data (i.e. current_wave) goes into RequestEnv
  • when similar new data comes in, we have a place for it without touching the interface

WDYT?
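For concreteness, the two return shapes under discussion can be sketched like this. RequestEnv is hypothetical (proposed in this thread, not existing vLLM code), and the converter functions and fields are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Minimal stand-in for the scheduler-side Request.
    request_id: str


# Shape A: a plain tuple carrying current_wave alongside the Request.
# Cheap to allocate, but positional: a new field changes the arity.
def convert_as_tuple(request_id: str, current_wave: int) -> tuple:
    return Request(request_id), current_wave


# Shape B: the hypothetical RequestEnv wrapper proposed above. Non-request
# data gets a named home, so new fields never change the tuple shape.
@dataclass
class RequestEnv:
    current_wave: int


def convert_as_env(request_id: str, current_wave: int) -> tuple:
    return Request(request_id), RequestEnv(current_wave)


req, wave = convert_as_tuple("r0", 2)
req2, env = convert_as_env("r1", 3)
```

The trade-off debated below is allocation cost (tuples reuse small allocs) versus extensibility of the named wrapper.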

Contributor Author


Ditto, gentle nudge @njhill for your thoughts :)

Member


A new class would be more overhead; tuples are very cheap (small allocs are reused). I don't think we have to worry about a place for other values, since it's unlikely there will be any, and it's better to do that if/when needed in the future. This isn't an external interface.

Contributor Author


@njhill Appreciate your input.

I've handed over the idea to @linzebing, and most of your comments should have been addressed. Let's move the discussion there: #21627

Collaborator


where else are we calling add_request such that we need to keep a union of 2 types here?

Contributor Author


At least 2 places I noticed. Especially for the latter one, I'm not fully sure if we need to further update the interface upstream.

An alternative approach is to:

  • Add a non-Union API, add_request(self, request: Request)
  • Expose the EngineCoreRequest to Request conversion as an API (e.g. preprocess_add_request(EngineCoreRequest) -> Request) and update the logic on the caller side

WDYT?
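The alternative proposed in the two bullets above can be sketched as follows. All names here are hypothetical stand-ins taken from the comment, not actual vLLM code: add_request accepts only the already-converted Request, and the conversion helper is exposed for callers on the sync path.

```python
from dataclasses import dataclass


@dataclass
class EngineCoreRequest:
    # Wire-level request (illustrative fields only).
    request_id: str
    prompt_token_ids: list


@dataclass
class Request:
    request_id: str
    num_tokens: int


def preprocess_add_request(ecr: EngineCoreRequest) -> Request:
    # Conversion exposed as its own API; callers invoke it explicitly
    # before add_request, so add_request itself stays non-Union.
    return Request(ecr.request_id, len(ecr.prompt_token_ids))


class EngineCore:
    def __init__(self):
        self.requests = []

    def add_request(self, request: Request) -> None:
        # Accepts only the already-converted Request type.
        self.requests.append(request)


core = EngineCore()
core.add_request(preprocess_add_request(EngineCoreRequest("r0", [1, 2])))
```

This keeps the type signature narrow at the cost of updating each caller to run the conversion first.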

Collaborator


those should be for the sync engine. we should be able to trigger them directly using the pythonic API or bench throughput. do we see similar benefits?

Jialin added a commit to Jialin/vllm that referenced this pull request Jul 22, 2025

mergify bot commented Jul 24, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Jialin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@Jialin
Contributor Author

Jialin commented Jul 25, 2025

Handed over to @linzebing in #21627

@Jialin Jialin closed this Jul 25, 2025