[CORE] concurrent partial prefills #2372
base: main
Conversation
Code Review
This pull request introduces a mechanism to limit concurrent partial prefills for long prompts in the AscendScheduler, which is a great feature for improving Time To First Token (TTFT) in mixed-load scenarios. The implementation looks solid and correctly follows the logic described. I've found one high-severity issue regarding configuration validation that should be addressed.
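For illustration, here is a minimal sketch of the kind of gating such a mechanism adds to a prefill scheduling loop. The names `max_long_partial_prefills` and `long_prefill_token_threshold` are assumptions borrowed from the analogous upstream vLLM feature (vllm-project/vllm#10235), not necessarily the identifiers used in this PR:

```python
# Minimal sketch of limiting concurrent partial prefills for long prompts.
# NOTE: max_long_partial_prefills and long_prefill_token_threshold are assumed
# names mirroring the upstream vLLM options; the actual AscendScheduler fields
# added by this PR may differ.
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    num_prompt_tokens: int


@dataclass
class PrefillSchedulerSketch:
    max_long_partial_prefills: int = 1        # long prompts allowed to prefill at once
    long_prefill_token_threshold: int = 8192  # prompts at or above this are "long"
    waiting: list = field(default_factory=list)

    def schedule_prefills(self, token_budget: int) -> list:
        scheduled = []
        long_in_flight = 0
        for req in list(self.waiting):
            is_long = req.num_prompt_tokens >= self.long_prefill_token_threshold
            if is_long and long_in_flight >= self.max_long_partial_prefills:
                continue  # defer this long prompt so short requests keep getting budget
            num_tokens = min(req.num_prompt_tokens, token_budget)
            if num_tokens == 0:
                break  # token budget exhausted
            token_budget -= num_tokens
            long_in_flight += int(is_long)
            scheduled.append(req)
            self.waiting.remove(req)
        return scheduled
```

With `max_long_partial_prefills=1`, a second long prompt in the waiting queue is skipped for the current step, leaving the remaining token budget to short requests — which is what improves TTFT under mixed load.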
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
@wangxiyuan Please take a look. If anything needs adjustment, I can update it promptly.
Nice work.
good!
good job
Thanks for the PR! Can you rebase to main to make CI pass?
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Modify assert according to code review comments Signed-off-by: Csrayz <[email protected]>
Force-pushed from cff106e to af40bb2
Force-pushed from 0c92117 to 413af46
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2372 +/- ##
==========================================
- Coverage 78.49% 72.65% -5.84%
==========================================
Files 132 147 +15
Lines 17806 21845 +4039
==========================================
+ Hits 13976 15871 +1895
- Misses 3830 5974 +2144
pipeline [multicard e2e test (linux-aarch64-a2-2, v0.10.1.1) (pull_request)]: Failing after 93m. Error: (VllmWorker TP0 pid=64935) ERROR 08-25 07:26:59 [multiproc_executor.py:559] [ERROR] 2025-08-25-07:26:58 (PID:64935, Device:0, RankID:-1) ERR02200 DIST call hccl api failed. Run again?
Can this pipeline be rerun? Based on the error, the failure seems unrelated to the code changes. @wangxiyuan
Force-pushed from 413af46 to f1365e9
Force-pushed from f1365e9 to 2fe1204
This pipeline has various issues; non-code-related errors are causing it to fail. The same code previously failed the multi-card e2e test, and after rerunning the pipeline it now fails because the single-card e2e job fails to start.
What this PR does / why we need it?
When processing a mix of large and small requests, this change significantly reduces the TTFT of responses. Please refer to vllm-project/vllm#10235, which achieves the same effect by simply limiting the number of concurrent prefills for long requests. This solution can be applied to both the AscendScheduler (V0) and the vLLM Scheduler (V1). Tests show that TTFT improves significantly when handling such mixed requests; however, this capability is currently missing when the Ascend Scheduler is enabled.
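For reference, a plausible shape of the user-facing configuration is sketched below. The `ascend_scheduler_config.enabled` switch is the existing vllm-ascend way of turning on the Ascend Scheduler via `additional_config`; the two limit keys are hypothetical names mirroring vllm-project/vllm#10235 and may not match the options actually added by this PR:

```python
# Hypothetical configuration sketch. Only ascend_scheduler_config.enabled is a
# known vllm-ascend option; the two limit keys below are assumed names.
additional_config = {
    "ascend_scheduler_config": {
        "enabled": True,
        # Assumed: prompts with at least this many tokens count as "long".
        "long_prefill_token_threshold": 8192,
        # Assumed: at most this many long prompts may prefill concurrently.
        "max_long_partial_prefills": 1,
    },
}
```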
This benchmark used the Qwen3-8B model, with a context length of 128K, running on a single card.
Regarding dataset selection, the sharegpt_clean dataset is used, with its content concatenated and cropped. Small requests of 50 tokens and medium requests of 10240 tokens were constructed (there were also large requests of 102400 tokens, but these were ignored because, with the prefill-first scheduling strategy, max_num_batched_tokens would never be set that large). When loading vLLM, max_num_batched_tokens was set to 22000. This budget can accommodate two medium requests plus some short requests, reflecting an extreme scenario where the budget is almost entirely occupied by longer requests.
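A rough sketch of this engine setup, assuming the vLLM offline `LLM` API and the `additional_config` shape sketched earlier; apart from the model name, the 128K context, and max_num_batched_tokens=22000, the details are assumptions rather than the exact benchmark script:

```python
from vllm import LLM

# Benchmark engine setup as described above (sketch, not the exact script).
llm = LLM(
    model="Qwen/Qwen3-8B",
    max_model_len=131072,          # 128K context window
    max_num_batched_tokens=22000,  # fits two 10240-token medium prefills plus short requests
    additional_config={
        "ascend_scheduler_config": {"enabled": True},
    },
)
```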
Next, we mixed 990 small requests and 100 medium requests into one load scenario (hereinafter referred to as 10%), and similarly generated load scenarios with 5% and 1% medium requests.
Performance tests were conducted separately for the vLLM Scheduler, the Ascend Scheduler, and the Ascend Scheduler with long-prompt concurrency limited to 1. The benchmark results are as follows.