
[Model] Fun cosy voice3-0.5-b-2512 (#498)

Open · divyanshsinghvi wants to merge 119 commits into vllm-project:main from divyanshsinghvi:Fun-CosyVoice3-0.5B-2512

Conversation


@divyanshsinghvi commented Dec 27, 2025

Purpose

Resolves #315

This PR integrates the CosyVoice3 text-to-speech model into vllm-omni, implementing both the "Talker" (LLM) and "Code2Wav" (Flow Matching + HiFiGAN) stages. It also includes architectural changes needed for stability and correctness within the vLLM execution engine.

Model Implementation

  • Added CosyVoice3Model supporting multi-stage execution:
    • Talker Stage: Generates speech tokens from text using a Qwen2 backbone.
    • Code2Wav Stage: Converts speech tokens to waveforms using a DiT-based Flow Matching decoder and HiFiGAN vocoder.
  • Integrated CosyVoice3MultiModalProcessor for handling audio inputs and feature extraction.
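
The two-stage flow described above can be sketched structurally as follows. This is a minimal illustration with hypothetical stub classes and made-up numbers, not the PR's actual API:

```python
# Minimal structural sketch of the two-stage TTS flow. Class and method
# names are hypothetical stand-ins for illustration, not the PR's code.
class TalkerStub:
    """Stage 0 stand-in: text -> discrete speech tokens (a Qwen2 backbone in the PR)."""

    def generate_speech_tokens(self, text):
        # Hypothetical tokenization: one illustrative token id per character.
        return [ord(c) % 100 for c in text]


class Code2WavStub:
    """Stage 1 stand-in: speech tokens -> waveform (Flow Matching DiT + HiFiGAN in the PR)."""

    SAMPLES_PER_TOKEN = 480  # illustrative hop length, not the model's real value

    def synthesize(self, tokens):
        # A real decoder runs flow matching and a vocoder; here we emit silence.
        return [0.0] * (len(tokens) * self.SAMPLES_PER_TOKEN)


def tts_pipeline(text):
    tokens = TalkerStub().generate_speech_tokens(text)  # Talker stage
    return Code2WavStub().synthesize(tokens)            # Code2Wav stage


wav = tts_pipeline("hello")
print(len(wav))  # 5 tokens * 480 samples -> 2400
```

The point of the split is that each stage can run (and be profiled) independently, which is how the benchmarks below report per-stage timings.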

Current Limitations:

  • Single-sample support: the implementation currently supports a batch size of 1 only. Proper batching logic and verification can be handled in a separate follow-up PR.
  • Scope: limited to zero-shot mode; support for other CosyVoice variants is not yet implemented, but adding it would be a good first issue requiring minimal changes.

Test Plan

python examples/offline_inference/text_to_speech/verify_e2e_cosyvoice.py --model pretrained_models/Fun-CosyVoice3-0.5B --tokenizer pretrained_models/Fun-CosyVoice3-0.5B/CosyVoice-BlankEN

Test Result

Input: prompt.wav
Output: output_0.wav

Performance Benchmarks:

GPU: NVIDIA GeForce RTX 3070
CUDA version: 13.0
Driver version: 580.95.05

Stats (latest to earliest):

After fixing the code to allow enforce_eager=False:

======================================================================
BENCHMARK RESULTS (10/10 successful runs)
======================================================================

END-TO-END METRICS:
----------------------------------------
  Time (ms):     mean=4648.44, std=152.61, min=4457.91, max=5031.88, median=4637.42
  Total tokens:  mean=171.0, std=0.0, min=171, max=171

STAGE 0 (LLM) METRICS:
----------------------------------------
  Time (ms):     mean=4005.45, std=140.03, min=3834.02, max=4360.00, median=3989.22
  Tokens out:    mean=88.0, std=0.0, min=88, max=88

STAGE 1 (AUDIO SYNTHESIS) METRICS:
----------------------------------------
  Time (ms):     mean=641.88, std=14.44, min=622.78, max=670.87, median=643.62

TRANSFER METRICS (Stage 0 -> Stage 1):
----------------------------------------
  Time (ms):     mean=2.301, std=0.084, min=2.166, max=2.438
  Throughput:    mean=2924.54, std=107.69, min=2756.57, max=3102.47 Mbps

TIME BREAKDOWN (average):
----------------------------------------
  Stage 0:       4005.45 ms (86.2%)
  Stage 1:       641.88 ms (13.8%)
  Transfer:      2.301 ms (0.05%)
  Total E2E:     4648.44 ms

======================================================================
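
As a quick arithmetic cross-check (not part of the PR), the TIME BREAKDOWN percentages follow directly from the stage means reported above:

```python
# Reproduce the TIME BREAKDOWN percentages from the enforce_eager=False run.
# The millisecond values are copied verbatim from the benchmark output above.
stage0_ms = 4005.45   # Talker (LLM)
stage1_ms = 641.88    # audio synthesis (DiT + HiFiGAN)
transfer_ms = 2.301   # Stage 0 -> Stage 1 handoff
e2e_ms = 4648.44      # end-to-end mean

def pct(part_ms, total_ms):
    return 100.0 * part_ms / total_ms

print(f"Stage 0:  {pct(stage0_ms, e2e_ms):.1f}%")    # 86.2%
print(f"Stage 1:  {pct(stage1_ms, e2e_ms):.1f}%")    # 13.8%
print(f"Transfer: {pct(transfer_ms, e2e_ms):.2f}%")  # 0.05%
```

The stage means sum to slightly more than the E2E mean because each metric is averaged independently across runs.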

Integration of vLLM Qwen2Model for Stage 0

Stage 0: memory spiked after switching to the vLLM implementation (from vllm.model_executor.models.qwen2 import Qwen2Model) compared to using from transformers import Qwen2ForCausalLM. The cause is unclear.

There was no effect on runtime.

Screenshot from 2026-01-28 13-49-49

Before integration of vLLM Qwen2Model

For the E2E time metrics, memory profiling was off.

======================================================================
BENCHMARK RESULTS (10/10 successful runs)
======================================================================

END-TO-END METRICS:
----------------------------------------
  Time (ms):     mean=6297.54, std=163.46, min=6050.48, max=6581.21, median=6306.11
  Total tokens:  mean=171.0, std=0.0, min=171, max=171

STAGE 0 (LLM) METRICS:
----------------------------------------
  Time (ms):     mean=5442.82, std=153.97, min=5217.16, max=5722.37, median=5449.38
  Tokens out:    mean=88.0, std=0.0, min=88, max=88

STAGE 1 (AUDIO SYNTHESIS) METRICS:
----------------------------------------
  Time (ms):     mean=853.59, std=20.55, min=832.21, max=904.91, median=847.82

TRANSFER METRICS (Stage 0 -> Stage 1):
----------------------------------------
  Time (ms):     mean=2.356, std=0.158, min=2.176, max=2.749
  Throughput:    mean=2862.73, std=176.95, min=2444.57, max=3087.85 Mbps

TIME BREAKDOWN (average):
----------------------------------------
  Stage 0:       5442.82 ms (86.4%)
  Stage 1:       853.59 ms (13.6%)
  Transfer:      2.356 ms (0.04%)
  Total E2E:     6297.54 ms

======================================================================

Memory:

  Summary - Total VRAM Usage:
  ┌─────────────────────────┬────────────────────────────────────┐
  │          Stage          │            Peak Memory             │
  ├─────────────────────────┼────────────────────────────────────┤
  │ Stage-0 (LLM)           │ ~2.0 GB                            │
  ├─────────────────────────┼────────────────────────────────────┤
  │ Stage-1 (DiT + HiFiGAN) │ ~1.68 GB                           │
  ├─────────────────────────┼────────────────────────────────────┤
  │ Combined                │ ~3.7 GB (sequential, not additive) │
  └─────────────────────────┴────────────────────────────────────┘

Memory Profiling:

Stage 0:

Screenshot from 2026-01-28 12-24-19

Stage 1:
Screenshot from 2026-01-28 12-30-25


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


@divyanshsinghvi divyanshsinghvi changed the title [Draft] [Model] Fun cosy voice3-0.5-b-2512 [WIP] [Model] Fun cosy voice3-0.5-b-2512 Dec 30, 2025
@divyanshsinghvi divyanshsinghvi marked this pull request as ready for review December 30, 2025 14:16

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@divyanshsinghvi

@lishunyang12 I have a few comments to address, as it was recently reviewed. It should be ready to merge in the next few days.

@divyanshsinghvi

@linyueqian All the comments have been addressed. I also made it compatible with the vLLM 0.16.0 API changes.

cc: @hsliuustc0106


@linyueqian linyueqian left a comment


LGTM

@linyueqian

@hsliuustc0106

@hsliuustc0106 added the "ready label to trigger buildkite CI" label Feb 26, 2026
@divyanshsinghvi

@hsliuustc0106

@linyueqian

Great work! A few things worth tracking as follow-ups:

  1. Async chunk streaming - code2wav currently waits for all tokens; adding async_chunk + a frame-aligned stage processor (like Qwen3-TTS) would reduce TTFP significantly.
  2. vLLM-native talker - swapping HF Qwen2ForCausalLM for vLLM's native model with stacked_params_mapping would unlock quantization and memory efficiency.
  3. CUDA graph for code2wav - worth investigating if runtime_info access can be refactored to allow graph capture.
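
The frame-aligned chunking idea in point 1 can be illustrated with a small generator. This is an illustrative sketch, not code from this PR; the function name and frame_size parameter are hypothetical:

```python
# Illustrative sketch of "frame-aligned chunking": instead of code2wav
# waiting for all speech tokens, it could consume fixed-size, frame-aligned
# chunks as they stream in from the talker. All names here are hypothetical.
from typing import Iterable, Iterator, List

def frame_aligned_chunks(tokens: Iterable[int], frame_size: int) -> Iterator[List[int]]:
    """Yield full frames of `frame_size` tokens; flush any remainder at the end."""
    buf: List[int] = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) == frame_size:
            yield buf
            buf = []
    if buf:
        yield buf

chunks = list(frame_aligned_chunks(range(10), 4))
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Emitting each full frame to the synthesis stage as soon as it is buffered is what would cut time-to-first-audio, since the vocoder no longer waits for the final token.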

@divyanshsinghvi

divyanshsinghvi commented Mar 1, 2026

> Great work! A few things worth tracking as follow-ups:
>
>   1. Async chunk streaming - code2wav currently waits for all tokens; adding async_chunk + a frame-aligned stage processor (like Qwen3-TTS) would reduce TTFP significantly.
>   2. vLLM-native talker - swapping HF Qwen2ForCausalLM for vLLM's native model with stacked_params_mapping would unlock quantization and memory efficiency.
>   3. CUDA graph for code2wav - worth investigating if runtime_info access can be refactored to allow graph capture.

  1. Yes, I will work on this in a follow-up PR after this one is merged.
  2. I updated to vLLM's native model here, but I still have to check stacked_params_mapping.
  3. CUDA graph support is enabled; I added multiple fixes to support it a while ago. Was there an issue? I have tested with enforce_eager: False, and it works and is significantly faster.

@linyueqian

linyueqian commented Mar 2, 2026

Could you share a quick benchmark with just these metrics?

  • TTFA (time to first audio)
  • RTF (real-time factor)

Please report for:

  1. enforce_eager=True
  2. enforce_eager=False

Same prompt/audio setup for both is enough.
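
For reference, TTFA and RTF are conventionally computed as below. This is a hedged sketch; the timing values (including the audio duration) are illustrative, not measurements from this PR:

```python
# Conventional definitions of the two requested TTS metrics.
# All numeric inputs below are illustrative placeholders.

def ttfa_ms(request_start_s: float, first_audio_s: float) -> float:
    """Time to first audio: latency until the first audio chunk is available."""
    return (first_audio_s - request_start_s) * 1000.0

def rtf(synthesis_time_s: float, audio_duration_s: float) -> float:
    """Real-time factor: synthesis time / audio duration; < 1.0 is faster than real time."""
    return synthesis_time_s / audio_duration_s

print(ttfa_ms(0.0, 1.25))         # 1250.0 (ms)
print(round(rtf(4.648, 7.0), 3))  # 0.664
```

Without streaming, TTFA effectively equals the full E2E time, which is why the async-chunk follow-up matters for this metric.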

@divyanshsinghvi

divyanshsinghvi commented Mar 2, 2026

> Could you share a quick benchmark with just these metrics?
>
>   • TTFA (time to first audio)
>   • RTF (real-time factor)
>
> Please report for:
>
>   1. enforce_eager=True
>   2. enforce_eager=False
>
> Same prompt/audio setup for both is enough.

Don't the benchmarks in the description (#498 (comment)) suffice? (See the "Performance Benchmarks" section.)
If not, I can put together a clearer side-by-side comparison, but it already shows the jump for enforce_eager=False versus the previous setup, with a stage-wise breakdown and transfer time.

@linyueqian

Thanks, I saw the E2E and stage-time benchmarks in the PR description. Could you also share TTFA and RTF (same setup) for completeness?

@linyueqian

@hsliuustc0106

@divyanshsinghvi

> Thanks, I saw the E2E and stage-time benchmarks in the PR description. Could you also share TTFA and RTF (same setup) for completeness?

Will update.


Labels

ready label to trigger buildkite CI


Development

Successfully merging this pull request may close these issues.

[New Model]: Fun-CosyVoice3-0.5B
