
[Model] Fun cosy voice3-0.5-b-2512 (#498)

Open · divyanshsinghvi wants to merge 119 commits into vllm-project:main from divyanshsinghvi:Fun-CosyVoice3-0.5B-2512

Conversation


@divyanshsinghvi commented Dec 27, 2025

Purpose

Resolves #315

This PR integrates the CosyVoice3 text-to-speech model into vllm-omni, implementing both the "Talker" (LLM) and "Code2Wav" (Flow Matching + HiFiGAN) stages. It also includes architectural changes needed for stability and correctness within the vLLM execution engine.

Model Implementation

  • Added CosyVoice3Model supporting multi-stage execution:
    • Talker Stage: Generates speech tokens from text using a Qwen2 backbone.
    • Code2Wav Stage: Converts speech tokens to waveforms using a DiT-based Flow Matching decoder and HiFiGAN vocoder.
  • Integrated CosyVoice3MultiModalProcessor for handling audio inputs and feature extraction.
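
The two-stage flow described above can be sketched structurally as follows. This is a minimal illustration with hypothetical stub classes and made-up numbers, not the PR's actual API:

```python
# Minimal structural sketch of the two-stage TTS flow. Class and method
# names are hypothetical stand-ins for illustration, not the PR's code.
class TalkerStub:
    """Stage 0 stand-in: text -> discrete speech tokens (a Qwen2 backbone in the PR)."""

    def generate_speech_tokens(self, text):
        # Hypothetical tokenization: one illustrative token id per character.
        return [ord(c) % 100 for c in text]


class Code2WavStub:
    """Stage 1 stand-in: speech tokens -> waveform (Flow Matching DiT + HiFiGAN in the PR)."""

    SAMPLES_PER_TOKEN = 480  # illustrative hop length, not the model's real value

    def synthesize(self, tokens):
        # A real decoder runs flow matching and a vocoder; here we emit silence.
        return [0.0] * (len(tokens) * self.SAMPLES_PER_TOKEN)


def tts_pipeline(text):
    tokens = TalkerStub().generate_speech_tokens(text)  # Talker stage
    return Code2WavStub().synthesize(tokens)            # Code2Wav stage


wav = tts_pipeline("hello")
print(len(wav))  # 5 tokens * 480 samples -> 2400
```

The point of the split is that each stage can run (and be profiled) independently, which is how the benchmarks below report per-stage timings.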

Current Limitations:

  • Single-sample support: the implementation currently supports a batch size of 1 only. Proper batching logic and verification can be handled in a separate follow-up PR.
  • Scope: limited to zero-shot mode; support for other CosyVoice variants is not yet implemented, but adding it would be a good first issue requiring minimal changes.

Test Plan

python examples/offline_inference/text_to_speech/verify_e2e_cosyvoice.py --model pretrained_models/Fun-CosyVoice3-0.5B --tokenizer pretrained_models/Fun-CosyVoice3-0.5B/CosyVoice-BlankEN

Test Result

Input: prompt.wav
Output: output_0.wav

Performance Benchmarks:

GPU: NVIDIA GeForce RTX 3070
CUDA version: 13.0
Driver version: 580.95.05

Stats (latest to earliest):

After fixing the code to allow enforce_eager=False:

======================================================================
BENCHMARK RESULTS (10/10 successful runs)
======================================================================

END-TO-END METRICS:
----------------------------------------
  Time (ms):     mean=4648.44, std=152.61, min=4457.91, max=5031.88, median=4637.42
  Total tokens:  mean=171.0, std=0.0, min=171, max=171

STAGE 0 (LLM) METRICS:
----------------------------------------
  Time (ms):     mean=4005.45, std=140.03, min=3834.02, max=4360.00, median=3989.22
  Tokens out:    mean=88.0, std=0.0, min=88, max=88

STAGE 1 (AUDIO SYNTHESIS) METRICS:
----------------------------------------
  Time (ms):     mean=641.88, std=14.44, min=622.78, max=670.87, median=643.62

TRANSFER METRICS (Stage 0 -> Stage 1):
----------------------------------------
  Time (ms):     mean=2.301, std=0.084, min=2.166, max=2.438
  Throughput:    mean=2924.54, std=107.69, min=2756.57, max=3102.47 Mbps

TIME BREAKDOWN (average):
----------------------------------------
  Stage 0:       4005.45 ms (86.2%)
  Stage 1:       641.88 ms (13.8%)
  Transfer:      2.301 ms (0.05%)
  Total E2E:     4648.44 ms

======================================================================
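
As a quick arithmetic cross-check (not part of the PR), the TIME BREAKDOWN percentages follow directly from the stage means reported above:

```python
# Reproduce the TIME BREAKDOWN percentages from the enforce_eager=False run.
# The millisecond values are copied verbatim from the benchmark output above.
stage0_ms = 4005.45   # Talker (LLM)
stage1_ms = 641.88    # audio synthesis (DiT + HiFiGAN)
transfer_ms = 2.301   # Stage 0 -> Stage 1 handoff
e2e_ms = 4648.44      # end-to-end mean

def pct(part_ms, total_ms):
    return 100.0 * part_ms / total_ms

print(f"Stage 0:  {pct(stage0_ms, e2e_ms):.1f}%")    # 86.2%
print(f"Stage 1:  {pct(stage1_ms, e2e_ms):.1f}%")    # 13.8%
print(f"Transfer: {pct(transfer_ms, e2e_ms):.2f}%")  # 0.05%
```

The stage means sum to slightly more than the E2E mean because each metric is averaged independently across runs.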

Integration of vLLM Qwen2Model for Stage 0

Stage 0: memory spiked after switching to the vLLM implementation (from vllm.model_executor.models.qwen2 import Qwen2Model) compared to using from transformers import Qwen2ForCausalLM. The cause is unclear.

There was no effect on runtime.

Screenshot from 2026-01-28 13-49-49

Before integration of vLLM Qwen2Model

For the E2E time metrics, memory profiling was off.

======================================================================
BENCHMARK RESULTS (10/10 successful runs)
======================================================================

END-TO-END METRICS:
----------------------------------------
  Time (ms):     mean=6297.54, std=163.46, min=6050.48, max=6581.21, median=6306.11
  Total tokens:  mean=171.0, std=0.0, min=171, max=171

STAGE 0 (LLM) METRICS:
----------------------------------------
  Time (ms):     mean=5442.82, std=153.97, min=5217.16, max=5722.37, median=5449.38
  Tokens out:    mean=88.0, std=0.0, min=88, max=88

STAGE 1 (AUDIO SYNTHESIS) METRICS:
----------------------------------------
  Time (ms):     mean=853.59, std=20.55, min=832.21, max=904.91, median=847.82

TRANSFER METRICS (Stage 0 -> Stage 1):
----------------------------------------
  Time (ms):     mean=2.356, std=0.158, min=2.176, max=2.749
  Throughput:    mean=2862.73, std=176.95, min=2444.57, max=3087.85 Mbps

TIME BREAKDOWN (average):
----------------------------------------
  Stage 0:       5442.82 ms (86.4%)
  Stage 1:       853.59 ms (13.6%)
  Transfer:      2.356 ms (0.04%)
  Total E2E:     6297.54 ms

======================================================================

Memory:

  Summary - Total VRAM Usage:
  ┌─────────────────────────┬────────────────────────────────────┐
  │          Stage          │            Peak Memory             │
  ├─────────────────────────┼────────────────────────────────────┤
  │ Stage-0 (LLM)           │ ~2.0 GB                            │
  ├─────────────────────────┼────────────────────────────────────┤
  │ Stage-1 (DiT + HiFiGAN) │ ~1.68 GB                           │
  ├─────────────────────────┼────────────────────────────────────┤
  │ Combined                │ ~3.7 GB (sequential, not additive) │
  └─────────────────────────┴────────────────────────────────────┘

Memory Profiling:

Stage 0:

Screenshot from 2026-01-28 12-24-19

Stage 1:
Screenshot from 2026-01-28 12-30-25


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


@divyanshsinghvi divyanshsinghvi changed the title [Draft] [Model] Fun cosy voice3-0.5-b-2512 [WIP] [Model] Fun cosy voice3-0.5-b-2512 Dec 30, 2025
@divyanshsinghvi divyanshsinghvi marked this pull request as ready for review December 30, 2025 14:16

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@divyanshsinghvi

@lishunyang12 I have a few comments to address, as it was recently reviewed. It should be ready to merge in the next few days.

@divyanshsinghvi

@linyueqian All the comments have been addressed. I also made it compatible with the vLLM 0.16.0 API changes.

cc: @hsliuustc0106


@linyueqian linyueqian left a comment


LGTM

@linyueqian

@hsliuustc0106

@hsliuustc0106 added the "ready label to trigger buildkite CI" label Feb 26, 2026
@divyanshsinghvi

@hsliuustc0106

@linyueqian

Great work! A few things worth tracking as follow-ups:

  1. Async chunk streaming - code2wav currently waits for all tokens; adding async_chunk + a frame-aligned stage processor (like Qwen3-TTS) would reduce TTFP significantly.
  2. vLLM-native talker - swapping HF Qwen2ForCausalLM for vLLM's native model with stacked_params_mapping would unlock quantization and memory efficiency.
  3. CUDA graph for code2wav - worth investigating if runtime_info access can be refactored to allow graph capture.
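
The frame-aligned chunking idea in point 1 can be illustrated with a small generator. This is an illustrative sketch, not code from this PR; the function name and frame_size parameter are hypothetical:

```python
# Illustrative sketch of "frame-aligned chunking": instead of code2wav
# waiting for all speech tokens, it could consume fixed-size, frame-aligned
# chunks as they stream in from the talker. All names here are hypothetical.
from typing import Iterable, Iterator, List

def frame_aligned_chunks(tokens: Iterable[int], frame_size: int) -> Iterator[List[int]]:
    """Yield full frames of `frame_size` tokens; flush any remainder at the end."""
    buf: List[int] = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) == frame_size:
            yield buf
            buf = []
    if buf:
        yield buf

chunks = list(frame_aligned_chunks(range(10), 4))
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Emitting each full frame to the synthesis stage as soon as it is buffered is what would cut time-to-first-audio, since the vocoder no longer waits for the final token.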

@divyanshsinghvi

divyanshsinghvi commented Mar 1, 2026

> Great work! A few things worth tracking as follow-ups:
>
>   1. Async chunk streaming - code2wav currently waits for all tokens; adding async_chunk + a frame-aligned stage processor (like Qwen3-TTS) would reduce TTFP significantly.
>   2. vLLM-native talker - swapping HF Qwen2ForCausalLM for vLLM's native model with stacked_params_mapping would unlock quantization and memory efficiency.
>   3. CUDA graph for code2wav - worth investigating if runtime_info access can be refactored to allow graph capture.

  1. Yes, I will work on this in a follow-up PR after this one is merged.
  2. I updated to vLLM's native model here, but I still have to check stacked_params_mapping.
  3. CUDA graph support is enabled; I added multiple fixes to support it a while ago. Was there an issue? I have tested with enforce_eager: False, and it works and is significantly faster.

@linyueqian

linyueqian commented Mar 2, 2026

Could you share a quick benchmark with just these metrics?

  • TTFA (time to first audio)
  • RTF (real-time factor)

Please report for:

  1. enforce_eager=True
  2. enforce_eager=False

Same prompt/audio setup for both is enough.
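
For reference, TTFA and RTF are conventionally computed as below. This is a hedged sketch; the timing values (including the audio duration) are illustrative, not measurements from this PR:

```python
# Conventional definitions of the two requested TTS metrics.
# All numeric inputs below are illustrative placeholders.

def ttfa_ms(request_start_s: float, first_audio_s: float) -> float:
    """Time to first audio: latency until the first audio chunk is available."""
    return (first_audio_s - request_start_s) * 1000.0

def rtf(synthesis_time_s: float, audio_duration_s: float) -> float:
    """Real-time factor: synthesis time / audio duration; < 1.0 is faster than real time."""
    return synthesis_time_s / audio_duration_s

print(ttfa_ms(0.0, 1.25))         # 1250.0 (ms)
print(round(rtf(4.648, 7.0), 3))  # 0.664
```

Without streaming, TTFA effectively equals the full E2E time, which is why the async-chunk follow-up matters for this metric.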

@divyanshsinghvi

divyanshsinghvi commented Mar 2, 2026

> Could you share a quick benchmark with just these metrics?
>
>   • TTFA (time to first audio)
>   • RTF (real-time factor)
>
> Please report for:
>
>   1. enforce_eager=True
>   2. enforce_eager=False
>
> Same prompt/audio setup for both is enough.

Don't the benchmarks in the description (#498 (comment)) suffice? (See the "Performance Benchmarks" section.)
If not, I can put together a clearer side-by-side comparison, but it already shows the jump for enforce_eager=False versus the previous setup, with a stage-wise breakdown and transfer time.

@linyueqian

Thanks, I saw the E2E and stage-time benchmarks in the PR description. Could you also share TTFA and RTF (same setup) for completeness?

@linyueqian

@hsliuustc0106

@divyanshsinghvi

> Thanks, I saw the E2E and stage-time benchmarks in the PR description. Could you also share TTFA and RTF (same setup) for completeness?

Will update.


Labels

ready label to trigger buildkite CI


Development

Successfully merging this pull request may close these issues.

[New Model]: Fun-CosyVoice3-0.5B
