[Qwen3-TTS] Remove hardcoded distributed_executor_backend to improve single-GPU performance#2604

Open
iancarrasco-b10 wants to merge 5 commits intovllm-project:mainfrom
basetenlabs:uniproc-executor

Conversation

@iancarrasco-b10 iancarrasco-b10 commented Apr 8, 2026

Summary

  • Remove hardcoded distributed_executor_backend: "mp" from qwen3_tts.yaml stage config

This improves single-GPU performance by avoiding the multiprocessing overhead of the mp executor when only one device is in use, while preserving the current behavior of defaulting to mp when world_size > 1.
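A sketch of the YAML change; the surrounding keys are illustrative, only the removed key is taken from the PR description:

```yaml
# qwen3_tts.yaml stage config (surrounding keys are illustrative)
stage_args:
  tensor_parallel_size: 1
  # distributed_executor_backend: "mp"   # removed: vLLM picks uniproc on 1 GPU, mp otherwise
```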

Test Plan

Tested Qwen3-TTS with both the uniproc and mp executors; both worked in the single-GPU case. More results can be found in #2603.

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository-wide code reviews.

@linyueqian (Collaborator)

fix pre-commit please

@linyueqian (Collaborator)

fix DCO please

@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 8, 2026
@iancarrasco-b10 iancarrasco-b10 force-pushed the uniproc-executor branch 2 times, most recently from b122235 to dab9604 on April 8, 2026 at 16:52
@iancarrasco-b10 (Author) commented Apr 8, 2026

vLLM already defaults to the uniproc executor when distributed_executor_backend is None and world_size == 1, so this is really just a config change. Similarly, mp is the default when world_size > 1.
https://github.com/vllm-project/vllm/blob/main/vllm/config/parallel.py#L825
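For reference, the default selection linked above behaves roughly like this simplified sketch; the function name and structure are a paraphrase, not vLLM's actual code:

```python
def resolve_executor_backend(configured, world_size):
    """Simplified paraphrase of vLLM's default selection (see
    vllm/config/parallel.py): an explicit setting always wins;
    otherwise a single-process executor is chosen on one device
    and the multiprocessing executor when world_size > 1."""
    if configured is not None:
        return configured  # e.g. "mp" hardcoded in a YAML stage config
    return "uni" if world_size == 1 else "mp"


# Removing the hardcoded "mp" from the YAML lets the world-size default apply:
print(resolve_executor_backend(None, 1))   # uni
print(resolve_executor_backend(None, 4))   # mp
print(resolve_executor_backend("mp", 1))   # mp (explicit setting still honored)
```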

Made-with: Cursor
Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>
@iancarrasco-b10 iancarrasco-b10 changed the title Default to UniProcExecutor for single-GPU stages [Qwen3-TTS] Remove Hardcoded distributed_executor_backend to improve single-GPU performance Apr 8, 2026
@iancarrasco-b10 iancarrasco-b10 changed the title [Qwen3-TTS] Remove Hardcoded distributed_executor_backend to improve single-GPU performance [Qwen3-TTS] Remove hardcoded distributed_executor_backend to improve single-GPU performance Apr 8, 2026
@linyueqian (Collaborator)

Thanks for the investigation! We ran the same benchmark on H20 (141GB) to verify the claim generalizes beyond H100.

Setup: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, bs16 config, single GPU, 50 prompts per concurrency level.

Results (H20):

| Concurrency | Mean RTF (mp) | Mean RTF (uni) | Delta |
|-------------|---------------|----------------|-------|
| 1           | 0.155         | 0.151          | ~tied |
| 4           | 0.221         | 0.308          | mp 39.6% better |
| 10          | 0.372         | 0.423          | mp 13.7% better |
| 16          | 0.430         | 0.541          | mp 25.8% better |

On H20, the mp executor is consistently faster at concurrency 4+, which is the opposite of the H100 results in #2603. The throughput gap is significant: at concurrency 4, mp delivers 18.12 audio s/s vs 12.84 for uni (uni is ~29% lower).

This suggests the performance tradeoff is hardware-dependent. Auto-defaulting to uni for all single-GPU stages would regress performance on H20.

A few observations on the PR itself:

  1. The Python code change (_default_executor_backend + setdefault) is redundant. vLLM already defaults to uni when distributed_executor_backend is None and world_size=1. Just removing the YAML line would have the same effect.
  2. The scope is narrow. Only qwen3_tts.yaml is updated, but 80+ other YAML configs also hardcode "mp".

Suggestion: keep the default as "mp" in the configs, and let users opt into "uni" explicitly if their hardware benefits from it. Or, could you share more details about the H100 setup so we can understand what drives the difference?

@tzhouam @hsliuustc0106 Could you help confirm these findings or share any thoughts on the mp vs uni tradeoff?

@iancarrasco-b10 (Author) commented Apr 8, 2026

Very interesting to see that it doesn't hold up on H20. I think it is fair to leave this to users to flip based on their desired throughput/latency targets and hardware setup. Beyond what I shared about the setup in the related issue, what else would help in getting a better sense of the discrepancy?

@iancarrasco-b10 (Author) commented Apr 8, 2026

Have you tried running the Base cloning task? That is the task I actually benchmarked, rather than CustomVoice, so arguably it could be playing a role. I'll also run CustomVoice on my setup to see whether I observe the same.

@linyueqian (Collaborator)

Follow-up: Base (voice cloning) task shows different results on the same H20 hardware.

| Concurrency | Mean RTF (mp) | Mean RTF (uni) | Delta |
|-------------|---------------|----------------|-------|
| 1           | 0.297         | 0.240          | uni 19.3% better |
| 4           | 0.626         | 0.607          | ~tied |
| 10          | 1.193         | 1.044          | uni 12.5% better |
| 16          | 1.936         | 1.388          | uni 28.3% better |

For the Base task, uni is consistently better, matching the original findings in #2603. Using CustomVoice on the same H20 GPU, however, we saw the opposite (mp winning at concurrency 4+).

This suggests the tradeoff is task-dependent, not just hardware-dependent. The Base task involves heavier per-request processing (reference audio encoding), making IPC serialization overhead a larger fraction of the total cost, which favors uni. CustomVoice is lighter per-request, so the process-level parallelism of mp dominates.

Given this, a blanket default change seems risky. Keeping mp as the explicit default in configs and letting users opt into uni for their specific workload would be the safer path.
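If that path is taken, the opt-in stays a one-line per-deployment override; the key name is vLLM's, but the stage layout here is illustrative:

```yaml
# illustrative stage config override for hardware/tasks where uni wins
distributed_executor_backend: "uni"   # default remains "mp" in shipped configs
```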

@iancarrasco-b10 (Author) commented Apr 8, 2026

Thanks for running these, @linyueqian! Agreed on not merging the blanket change. It may be worth adding this to the docs, or commenting somewhere more permanent, so future deployments can take advantage of these perf gains. Perhaps a task-specific stage config?

@hsliuustc0106 (Collaborator)

Does this apply to qwen-omni as well? @ZeldaHuang @amy-why-3459

@hsliuustc0106 hsliuustc0106 added the nightly-test label to trigger buildkite nightly test CI label Apr 9, 2026


Development

Successfully merging this pull request may close these issues.

[Performance]: Uniproc executor has much better performance at higher concurrency on Qwen3-TTS (Single GPU)

3 participants