[Model] Fun-CosyVoice3-0.5B-2512 #498
divyanshsinghvi wants to merge 119 commits into vllm-project:main from
Conversation
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
Resolved review threads (outdated):
- examples/offline_inference/text_to_speech/verify_e2e_cosyvoice.py (2 threads)
- vllm_omni/model_executor/models/cosyvoice3/cosyvoice3_talker.py
@linyueqian All the comments are addressed. cc: @hsliuustc0106
Great work! A few things worth tracking as follow-ups:
Could you share a quick benchmark with just these metrics? Please report for:
The same prompt/audio setup for both is enough.
Don't the ones in the description (#498 (comment)) suffice? (Check the Performance Benchmarks section.)
Thanks, I saw the E2E and stage-time benchmarks in the PR description.
Could you also share TTFA and RTF (same setup) for completeness?
Will update.
I am just using the previous benchmark numbers to compute these results; let me know if I misunderstood it:
enforce_eager = False
enforce_eager = True
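For concreteness, TTFA and RTF can be derived from wall-clock timings as below. This is a minimal sketch; the function and variable names are illustrative, not from this PR.

```python
def compute_tts_metrics(request_start: float, first_audio_time: float,
                        generation_end: float, audio_duration_s: float):
    """Derive TTS latency metrics from wall-clock timestamps (seconds)."""
    # Time To First Audio: delay until the first audible chunk is ready.
    ttfa = first_audio_time - request_start
    # Real-Time Factor: synthesis time per second of generated audio.
    rtf = (generation_end - request_start) / audio_duration_s
    return ttfa, rtf

# Example: 3.0 s end-to-end to synthesize 6.0 s of speech, first audio at 0.4 s.
ttfa, rtf = compute_tts_metrics(0.0, 0.4, 3.0, 6.0)
print(ttfa, rtf)  # 0.4 0.5 -> RTF < 1.0 means faster than real time
```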
We expect RTF < 1.0 as a minimal requirement.
Wait, this depends on which GPU you are using; if you run it on an H200 you will get a better number. I'm unsure how an RTF < 1.0 requirement can be standardised across GPUs. RTF also makes more sense to track in an async TTS implementation of this model, which should be relatively faster since it will not have the empty stages present right now. Please note the above experiments were run on a 3070; I don't have the capacity to rent better hardware right now, so the current requirement feels slightly arbitrary. What would you recommend, @hsliuustc0106 @linyueqian? Should I close this PR given the current stats? I don't see any direct further improvements: I have done most of what I could see, and this is probably among the few models in vllm-omni that support the CUDA graph optimization. If you have any recommendations for performance optimization, I can try those.
Have you tried adapting this model to async_chunk with streaming output?
As already discussed, converting the sync path to async will be a follow-up PR. Any other suggestions regarding speed-up?
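To make the follow-up concrete, a chunked streaming path could look roughly like this. This is an illustrative sketch only: `fake_talker_tokens` and `decode_to_audio` are hypothetical stand-ins for the Talker and Code2Wav stages, not vllm-omni APIs.

```python
import asyncio

async def fake_talker_tokens(text):
    # Stand-in for the Talker (LLM) stage: one "token" per character.
    for ch in text:
        await asyncio.sleep(0)  # simulate asynchronous token arrival
        yield ch

def decode_to_audio(tokens):
    # Stand-in for the Code2Wav (Flow Matching + HiFiGAN) stage.
    return "".join(tokens)

async def stream_tts_chunks(text, chunk_tokens=4):
    """Yield audio chunks as soon as enough talker tokens arrive,
    rather than waiting for the full sequence (the current sync path)."""
    buf = []
    async for tok in fake_talker_tokens(text):
        buf.append(tok)
        if len(buf) >= chunk_tokens:
            yield decode_to_audio(buf)
            buf = []
    if buf:  # flush the final partial chunk
        yield decode_to_audio(buf)

async def main():
    return [c async for c in stream_tts_chunks("hello world!")]

chunks = asyncio.run(main())
print(chunks)  # ['hell', 'o wo', 'rld!']
```

Because chunks are emitted as tokens arrive, the first audio is available after `chunk_tokens` tokens instead of after the whole sequence, which is what would drive TTFA (and effective RTF) down.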
Purpose
Resolves #315
This PR integrates the CosyVoice3 text-to-speech model into vllm-omni, implementing both the "Talker" (LLM) and "Code2Wav" (Flow Matching + HiFiGAN) stages. It includes critical
architectural enhancements to ensure stability and correctness within the vLLM execution engine.
Model Implementation
Current Limitations:
Tracked as a good first issue with minimal changes required.
Test Plan
python examples/offline_inference/text_to_speech/verify_e2e_cosyvoice.py --model pretrained_models/Fun-CosyVoice3-0.5B --tokenizer pretrained_models/Fun-CosyVoice3-0.5B/CosyVoice-BlankEN
Test Result
Input: prompt.wav
Output: output_0.wav
Performance Benchmarks:
Run: NVIDIA GeForce RTX 3070
CUDA Version : 13.0
Driver Version : 580.95.05
Stats [latest to earlier]:
After fixing code to allow enforce_eager=False
Integration of vLLM Qwen2Model for Stage 0.
Stage 0: memory spiked when integrating the vLLM implementation (`from vllm.model_executor.models.qwen2 import Qwen2Model`) compared to using `from transformers import Qwen2ForCausalLM`. Unsure why; no effect on runtime.
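One way to localize such a spike is to measure peak allocation around the suspect construction step. The pattern is sketched below with the stdlib `tracemalloc` so it is runnable anywhere; on GPU, `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()` play the analogous role. `measure_peak` is an illustrative helper, not part of this PR.

```python
import tracemalloc

def measure_peak(fn, *args, **kwargs):
    """Run fn and report the peak Python-heap allocation it caused.
    (For CUDA memory, torch.cuda.reset_peak_memory_stats() before the call
    and torch.cuda.max_memory_allocated() after it are the analogue.)"""
    tracemalloc.start()
    result = fn(*args, **kwargs)
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, peak

# Example: allocating a 1M-element list shows up as the peak.
result, peak = measure_peak(lambda: list(range(1_000_000)))
print(f"peak heap usage: {peak / 2**20:.1f} MiB")
```

Wrapping the `Qwen2Model` construction versus the `Qwen2ForCausalLM` construction in such a helper would show whether the spike comes from weight loading itself or from a transient buffer.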
Before integration of vllm Qwen2Model
For E2E time metrics, memory profiling was off.
Memory :
Memory Profiling:
Stage 0 :
Stage 1:

Essential Elements of an Effective PR Description Checklist
- Updated `supported_models.md` and `examples` for a new model.