[CORE] concurrent partial prefills #2372
What this PR does / why we need it?
When processing a mix of large and small requests, the TTFT of responses can be significantly reduced. Please refer to vllm-project/vllm#10235, which achieves the same effect by simply limiting the number of concurrent prefills for long requests. This approach can be applied to both AscendScheduler (V0) and the vLLM scheduler (V1). Tests show that TTFT improves significantly when handling such mixed requests; however, this capability is currently missing when AscendScheduler is enabled.
This benchmark used the Qwen3-8B model, with a context length of 128K, running on a single card.
For the dataset, sharegpt_clean was used, with its content concatenated and cropped. Small requests of 50 tokens and medium requests of 10240 tokens were constructed. (Large requests of 102400 tokens were also built but ignored, because under the Prefill First scheduling strategy max_num_batched_tokens would not be set to such a large value.) vLLM was loaded with max_num_batched_tokens=22000. This budget accommodates two medium requests plus a few short ones, reflecting an extreme scenario in which the budget is almost entirely occupied by the longer requests.
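To make the budget arithmetic above concrete, here is a small illustrative calculation (the request sizes and budget are the ones stated above; the variable names are just for illustration):

```python
# Token-budget arithmetic for the benchmark setup described above.
MAX_NUM_BATCHED_TOKENS = 22000
MEDIUM_PROMPT_TOKENS = 10240
SMALL_PROMPT_TOKENS = 50

# Two medium prompts nearly fill the per-step prefill budget...
remaining = MAX_NUM_BATCHED_TOKENS - 2 * MEDIUM_PROMPT_TOKENS

# ...leaving room for only a handful of small prompts per step.
small_slots = remaining // SMALL_PROMPT_TOKENS

print(remaining)     # 1520
print(small_slots)   # 30
```

Without a cap on concurrent long prefills, two medium requests can monopolize almost the entire 22000-token budget each step, which is what drives up TTFT for the small requests queued behind them.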
Next, 990 small requests and 100 medium requests were mixed into one load scenario (hereinafter referred to as 10%); load scenarios with 5% and 1% medium requests were generated in the same way.
Performance tests were conducted separately with the vLLM scheduler, AscendScheduler, and AscendScheduler with long-prompt concurrency set to 1. The benchmark results are as follows.
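As a sketch of how the third configuration might be launched, assuming the upstream vLLM flag names for concurrent partial prefills (the exact option exposed by this PR for AscendScheduler may differ):

```shell
# Hypothetical serving command for the benchmark setup above.
# The partial-prefill flags mirror upstream vLLM's concurrent partial
# prefill support (vllm-project/vllm#10235); the Ascend-side knob added
# by this PR may be named differently.
vllm serve Qwen/Qwen3-8B \
  --max-model-len 131072 \
  --max-num-batched-tokens 22000 \
  --max-num-partial-prefills 2 \
  --max-long-partial-prefills 1 \
  --long-prefill-token-threshold 10240
```

With `--max-long-partial-prefills 1`, at most one request above the long-prompt threshold is prefilled per step, so the remaining budget stays available for small requests.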