[Qwen3-TTS] Remove hardcoded distributed_executor_backend to improve single-GPU performance #2604

iancarrasco-b10 wants to merge 5 commits into vllm-project:main

Conversation
fix pre-commit please

fix dco please
vllm already defaults to the uniproc executor when distributed_executor_backend is None and world_size == 1, so this is effectively just a config change. Similarly, mp is used by default when world_size > 1.
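That fallback can be sketched in a few lines of Python. This is a simplified illustration of the behavior described above, not vLLM's actual code; `resolve_executor_backend` is a hypothetical helper name:

```python
def resolve_executor_backend(backend: "str | None", world_size: int) -> str:
    """Pick an executor backend the way the comment above describes:
    an explicit setting always wins; otherwise a single-GPU run gets
    the in-process "uni" executor and multi-GPU runs get "mp"."""
    if backend is not None:
        return backend  # user-specified value (e.g. "mp", "ray") is respected
    return "uni" if world_size == 1 else "mp"
```

Under this logic, removing the hardcoded `"mp"` from the config changes nothing for multi-GPU deployments but lets single-GPU runs skip the multiprocessing layer entirely.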
Made-with: Cursor
Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>
Thanks for the investigation! We ran the same benchmark on H20 (141GB) to verify that the claim generalizes beyond H100.

Setup:
Results (H20):

On H20, this suggests the performance tradeoff is hardware-dependent, so auto-defaulting to uni may not be a universal win. A few observations on the PR itself:

Suggestion: keep the default as mp. @tzhouam @hsliuustc0106 Could you help confirm these findings or share any thoughts on the mp vs uni tradeoff?
Very interesting to see that it doesn't hold up on H20. I think it is fair to leave this up to users to flip based on their desired throughput/latency targets and hardware setup. Beyond what I shared about the setup in the related issue, what would be helpful here for getting a better sense of the discrepancy?
Have you tried running the Base cloning task? That is what I actually got the results for, vs. CustomVoice, so arguably that could be playing a role. I'll also run CustomVoice on my setup to see if I observe the same.
Follow-up: the Base (voice cloning) task shows different results on the same H20 hardware.

This suggests the tradeoff is task-dependent, not just hardware-dependent. The Base task involves heavier per-request processing (reference audio encoding), making IPC serialization overhead a larger fraction of the total cost, which favors uni. Given this, a blanket default change seems risky; keeping the current default seems safer.
Thanks for running this, @linyueqian! Agreed on not merging the blanket change. I think it would be worth adding this to the docs, or capturing it somewhere more permanent, so that future deployments can take advantage of these perf gains. Perhaps a task-specific stage config?
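To make the idea concrete, a task-specific stage config could look something like the sketch below. All stage names and the nesting here are invented for illustration; only the `distributed_executor_backend` key and the mp-vs-uni findings come from this thread:

```yaml
# Hypothetical per-task stage config (names and structure are illustrative)
stages:
  qwen3_tts_custom_voice:
    engine_args:
      # CustomVoice benchmarked better with mp on H20, so pin it explicitly.
      distributed_executor_backend: "mp"
  qwen3_tts_base_cloning:
    engine_args: {}
      # Key omitted: Base (voice cloning) has heavy per-request IPC
      # serialization, so the single-GPU "uni" fallback was faster.
```

This would let each task pick the executor that its own benchmarks favor without changing the global default.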
Does this also apply to qwen-omni? @ZeldaHuang @amy-why-3459
Summary

Removes the hardcoded distributed_executor_backend: "mp" from the qwen3_tts.yaml stage config. This improves single-GPU performance by avoiding unnecessary multiprocessing overhead from the mp executor when only one device is in use, while preserving the current behavior of using mp when world_size > 1.

Test Plan

Tested Qwen3-TTS with both the uniproc and mp executors; both worked in the single-GPU case. More results can be found here: #2603
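Illustratively, the change amounts to deleting a single key from the stage config. The surrounding structure below is a guess; only the removed key itself is taken from the PR description:

```yaml
# qwen3_tts.yaml stage config (sketch; surrounding structure hypothetical)
engine_args:
  # Removed by this PR:
  # distributed_executor_backend: "mp"
  #
  # With the key absent, vLLM falls back to its own default: the in-process
  # "uni" executor when world_size == 1, and "mp" when world_size > 1.
```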