[Qwen3-TTS] Remove hardcoded distributed_executor_backend to improve single-GPU performance#2604

Open
iancarrasco-b10 wants to merge 5 commits intovllm-project:mainfrom
basetenlabs:uniproc-executor

Conversation

@iancarrasco-b10 iancarrasco-b10 commented Apr 8, 2026

Summary

  • Remove hardcoded distributed_executor_backend: "mp" from qwen3_tts.yaml stage config

This improves single-GPU performance by avoiding the multiprocessing overhead of the mp executor when only one device is in use, while preserving the current behavior of defaulting to mp when world_size > 1.
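A sketch of the YAML change; the surrounding keys are illustrative, only the removed key is taken from the PR description:

```yaml
# qwen3_tts.yaml stage config (surrounding keys are illustrative)
stage_args:
  tensor_parallel_size: 1
  # distributed_executor_backend: "mp"   # removed: vLLM picks uniproc on 1 GPU, mp otherwise
```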

Test Plan

Tested Qwen3-TTS with both the uniproc and mp executors; both worked in the single-GPU case. More results can be found in #2603.

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository-wide code reviews.

@linyueqian (Collaborator)

fix pre-commit please

@linyueqian (Collaborator)

fix DCO please

@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 8, 2026
@iancarrasco-b10 iancarrasco-b10 force-pushed the uniproc-executor branch 2 times, most recently from b122235 to dab9604 on April 8, 2026 at 16:52
@iancarrasco-b10 (Author) commented Apr 8, 2026

vLLM already defaults to the uniproc executor when distributed_executor_backend is None and world_size == 1, so this is really just a config change. Similarly, mp is the default when world_size > 1.
https://github.com/vllm-project/vllm/blob/main/vllm/config/parallel.py#L825
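For reference, the default selection linked above behaves roughly like this simplified sketch; the function name and structure are a paraphrase, not vLLM's actual code:

```python
def resolve_executor_backend(configured, world_size):
    """Simplified paraphrase of vLLM's default selection (see
    vllm/config/parallel.py): an explicit setting always wins;
    otherwise a single-process executor is chosen on one device
    and the multiprocessing executor when world_size > 1."""
    if configured is not None:
        return configured  # e.g. "mp" hardcoded in a YAML stage config
    return "uni" if world_size == 1 else "mp"


# Removing the hardcoded "mp" from the YAML lets the world-size default apply:
print(resolve_executor_backend(None, 1))   # uni
print(resolve_executor_backend(None, 4))   # mp
print(resolve_executor_backend("mp", 1))   # mp (explicit setting still honored)
```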

Made-with: Cursor
Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>
@iancarrasco-b10 iancarrasco-b10 changed the title Default to UniProcExecutor for single-GPU stages [Qwen3-TTS] Remove Hardcoded distributed_executor_backend to improve single-GPU performance Apr 8, 2026
@iancarrasco-b10 iancarrasco-b10 changed the title [Qwen3-TTS] Remove Hardcoded distributed_executor_backend to improve single-GPU performance [Qwen3-TTS] Remove hardcoded distributed_executor_backend to improve single-GPU performance Apr 8, 2026
@linyueqian (Collaborator)

Thanks for the investigation! We ran the same benchmark on H20 (141GB) to verify the claim generalizes beyond H100.

Setup: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, bs16 config, single GPU, 50 prompts per concurrency level.

Results (H20):

| Concurrency | Mean RTF (mp) | Mean RTF (uni) | Delta |
|-------------|---------------|----------------|-------|
| 1           | 0.155         | 0.151          | ~tied |
| 4           | 0.221         | 0.308          | mp 39.6% better |
| 10          | 0.372         | 0.423          | mp 13.7% better |
| 16          | 0.430         | 0.541          | mp 25.8% better |

On H20, the mp executor is consistently faster at concurrency 4+, which is the opposite of the H100 results in #2603. The throughput gap is significant: at concurrency 4, mp delivers 18.12 audio s/s vs 12.84 for uni (uni is ~29% lower).

This suggests the performance tradeoff is hardware-dependent. Auto-defaulting to uni for all single-GPU stages would regress performance on H20.

A few observations on the PR itself:

  1. The Python code change (_default_executor_backend + setdefault) is redundant. vLLM already defaults to uni when distributed_executor_backend is None and world_size=1. Just removing the YAML line would have the same effect.
  2. The scope is narrow. Only qwen3_tts.yaml is updated, but 80+ other YAML configs also hardcode "mp".

Suggestion: keep the default as "mp" in the configs, and let users opt into "uni" explicitly if their hardware benefits from it. Or, could you share more details about the H100 setup so we can understand what drives the difference?

@tzhouam @hsliuustc0106 Could you help confirm these findings or share any thoughts on the mp vs uni tradeoff?

@iancarrasco-b10 (Author) commented Apr 8, 2026

Very interesting to see that it doesn't hold up on H20. I think it is fair to leave this to users to flip based on their desired throughput/latency targets and hardware setup. Beyond what I shared about the setup in the related issue, what else would help in getting a better sense of the discrepancy?

@iancarrasco-b10 (Author) commented Apr 8, 2026

Have you tried running the Base cloning task? That is the task I actually benchmarked, rather than CustomVoice, so arguably it could be playing a role. I'll also run CustomVoice on my setup to see whether I observe the same.

@linyueqian (Collaborator)

Follow-up: Base (voice cloning) task shows different results on the same H20 hardware.

| Concurrency | Mean RTF (mp) | Mean RTF (uni) | Delta |
|-------------|---------------|----------------|-------|
| 1           | 0.297         | 0.240          | uni 19.3% better |
| 4           | 0.626         | 0.607          | ~tied |
| 10          | 1.193         | 1.044          | uni 12.5% better |
| 16          | 1.936         | 1.388          | uni 28.3% better |

For the Base task, uni is consistently better, matching the original findings in #2603. Using CustomVoice on the same H20 GPU, however, we saw the opposite (mp winning at concurrency 4+).

This suggests the tradeoff is task-dependent, not just hardware-dependent. The Base task involves heavier per-request processing (reference audio encoding), making IPC serialization overhead a larger fraction of the total cost, which favors uni. CustomVoice is lighter per-request, so the process-level parallelism of mp dominates.

Given this, a blanket default change seems risky. Keeping mp as the explicit default in configs and letting users opt into uni for their specific workload would be the safer path.
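If that path is taken, the opt-in stays a one-line per-deployment override; the key name is vLLM's, but the stage layout here is illustrative:

```yaml
# illustrative stage config override for hardware/tasks where uni wins
distributed_executor_backend: "uni"   # default remains "mp" in shipped configs
```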

@iancarrasco-b10 (Author) commented Apr 8, 2026

Thanks for running these, @linyueqian! Agreed on not merging the blanket change. It may be worth adding this to the docs, or commenting somewhere more permanent, so future deployments can take advantage of these perf gains. Perhaps a task-specific stage config?

@hsliuustc0106 (Collaborator)

Does this apply to qwen-omni as well? @ZeldaHuang @amy-why-3459

@hsliuustc0106 hsliuustc0106 added the nightly-test label to trigger buildkite nightly test CI label Apr 9, 2026


Development

Successfully merging this pull request may close these issues.

[Performance]: Uniproc executor has much better performance at higher concurrency on Qwen3-TTS (Single GPU)

3 participants