119 commits
00cf5fe
cosyvoice 3 initiated
divyanshsinghvi Dec 20, 2025
0ddcbbf
cosyvoice base file started
divyanshsinghvi Dec 20, 2025
7c8054d
Initialized model paths
divyanshsinghvi Dec 20, 2025
83526ad
Initialized model paths
divyanshsinghvi Dec 20, 2025
2c41af0
renaming
divyanshsinghvi Dec 23, 2025
34bea22
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Dec 23, 2025
769aaeb
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Dec 23, 2025
c1130e6
stages are ready;
divyanshsinghvi Dec 23, 2025
bf29392
Major fixes to stage e2e working; not correct audio yet but good prog…
divyanshsinghvi Dec 27, 2025
4f3cda3
cosyvoice.yaml stage_configs
divyanshsinghvi Dec 27, 2025
2826cec
stage input processor link function
divyanshsinghvi Dec 27, 2025
5888ae7
yaml updates and input handling corrected
divyanshsinghvi Dec 27, 2025
ca2ee7f
fixes to inputs prompt token embed and token embed fixed; speech toke…
divyanshsinghvi Dec 27, 2025
c103c49
Getting finally multimodal embeds correct
divyanshsinghvi Dec 29, 2025
3ce5ddc
llm part fixed now
divyanshsinghvi Dec 29, 2025
3a2df52
almost stage 1 half completed
divyanshsinghvi Dec 30, 2025
46429c0
e2e working
divyanshsinghvi Dec 30, 2025
f481b38
e2e cosyvoice script some weird noise at end rest seems good
divyanshsinghvi Dec 30, 2025
f2b62af
cosyvoice llm done
divyanshsinghvi Dec 30, 2025
4ae2bea
cosyvoice config refactored
divyanshsinghvi Dec 30, 2025
c74249f
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Dec 30, 2025
7f54763
registry cosyvoice everything upgraded
divyanshsinghvi Dec 30, 2025
45002af
Merge branch 'Fun-CosyVoice3-0.5B-2512' of github.com:divyanshsinghvi…
divyanshsinghvi Dec 30, 2025
cabb289
fix error
divyanshsinghvi Dec 30, 2025
ae8fca0
small fixes
divyanshsinghvi Dec 30, 2025
db23cfc
fun cosy voice 3 0.5b
divyanshsinghvi Dec 30, 2025
89ea6a6
Utils functions refactored
divyanshsinghvi Dec 30, 2025
7c78d69
Utils functions refactored
divyanshsinghvi Dec 30, 2025
11a6d89
remvoe clutter
divyanshsinghvi Dec 30, 2025
ecb7949
remove unused code
divyanshsinghvi Dec 30, 2025
2694153
Fixes for final .wav
divyanshsinghvi Dec 30, 2025
245553a
close fix
divyanshsinghvi Dec 30, 2025
29b6065
Making the project self sustainable
divyanshsinghvi Dec 31, 2025
b787363
rename to cosyvoice to cosyvoice3
divyanshsinghvi Dec 31, 2025
5c14f46
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Dec 31, 2025
cea72ca
Fixed eos token bug
divyanshsinghvi Jan 1, 2026
cfae813
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
hsliuustc0106 Jan 1, 2026
4ea8bb7
Need to extend word due to typos check in pre-commit.
divyanshsinghvi Jan 1, 2026
fad1647
Merge branch 'Fun-CosyVoice3-0.5B-2512' of github.com:divyanshsinghvi…
divyanshsinghvi Jan 1, 2026
b3d4903
Delete typos.toml
divyanshsinghvi Jan 1, 2026
b655d39
Fix precommit
divyanshsinghvi Jan 1, 2026
4ca4b65
Remove whisper dependency
divyanshsinghvi Jan 1, 2026
ca2a7b2
Added instructures in examples....md file
divyanshsinghvi Jan 1, 2026
ddf4387
Fix import order and clean up commented code
divyanshsinghvi Jan 1, 2026
f5b2ab6
Signoff fix
divyanshsinghvi Jan 1, 2026
237cc7c
To remove docs error
divyanshsinghvi Jan 1, 2026
954fe3b
args tokenizer added
divyanshsinghvi Jan 1, 2026
914c737
Organize import statements in verify_e2e_cosyvoice.py
divyanshsinghvi Jan 5, 2026
a0b4e49
removed extra prints
divyanshsinghvi Jan 5, 2026
64e547f
remove or moved to debug logs logger in cosyvoice.py
divyanshsinghvi Jan 5, 2026
1ad27c5
fixes to flow.py remove dups
divyanshsinghvi Jan 5, 2026
1d59000
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Jan 5, 2026
88d9c93
remove log comments
divyanshsinghvi Jan 5, 2026
2b94e3e
Pyproject extend words
divyanshsinghvi Jan 5, 2026
1d7d3cd
init .py added
divyanshsinghvi Jan 5, 2026
55439c3
docs improved
divyanshsinghvi Jan 5, 2026
9a98e76
torch mask
divyanshsinghvi Jan 5, 2026
cbb4c21
update supported model
divyanshsinghvi Jan 5, 2026
acca1b7
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Jan 5, 2026
7f0eadc
t
divyanshsinghvi Jan 5, 2026
fa0c87f
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Jan 5, 2026
52bc5b4
t removed
divyanshsinghvi Jan 5, 2026
5c48f00
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Jan 10, 2026
fb010d7
fix copyright sentences and remove random seed function
divyanshsinghvi Jan 10, 2026
80141c0
clean imports
divyanshsinghvi Jan 10, 2026
0fd6732
commit multimodal
divyanshsinghvi Jan 10, 2026
c6d93e4
fix repo names for adoption
divyanshsinghvi Jan 10, 2026
f467a01
remove unwanted comments
divyanshsinghvi Jan 10, 2026
05c8cf7
llm -> cosyvoice3_talker
divyanshsinghvi Jan 10, 2026
1d523a6
cosyvoice3.yaml
divyanshsinghvi Jan 10, 2026
6f99e36
fixes
divyanshsinghvi Jan 10, 2026
919777b
Reorder import statements in verify_e2e_cosyvoice.py
divyanshsinghvi Jan 10, 2026
25ed1ad
working
divyanshsinghvi Jan 10, 2026
2821dcd
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Jan 11, 2026
955c4c4
remove streaming erlated excess code; currently not supported; also a…
divyanshsinghvi Jan 27, 2026
f1ec0fb
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Jan 27, 2026
3c75364
Fixed with new versions
divyanshsinghvi Jan 27, 2026
7dbe52f
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Jan 28, 2026
e214cf4
cosyvoice3 e2e empty token at end fix
divyanshsinghvi Jan 28, 2026
bcf4e43
vllm integration for qwen2model
divyanshsinghvi Jan 28, 2026
54db8d7
remove batch input form yaml; remove extra logging
divyanshsinghvi Jan 28, 2026
e215b97
support cuda graph compilation
divyanshsinghvi Jan 28, 2026
fad1b71
remove unwanted batch items
divyanshsinghvi Jan 28, 2026
2a33319
remove unwanted batch items
divyanshsinghvi Jan 28, 2026
4c60905
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Jan 28, 2026
5cdc9a2
cosyvoice3_audio dit added
divyanshsinghvi Jan 28, 2026
864689b
added to diffusion
divyanshsinghvi Jan 28, 2026
645393f
final fixes and refactoring
divyanshsinghvi Jan 28, 2026
ec9f6b4
Refactor
divyanshsinghvi Jan 28, 2026
2c37db6
Refactors
divyanshsinghvi Jan 28, 2026
08036e1
Renaming files
divyanshsinghvi Jan 28, 2026
9d68d8d
cosyvoice3 tests added
divyanshsinghvi Jan 28, 2026
6ec7bef
added tests for cosyvoice
divyanshsinghvi Jan 28, 2026
bf27241
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Jan 29, 2026
0bd3902
Aucto config registration; remove hardcoded paths
divyanshsinghvi Feb 2, 2026
2bb958c
Updated readme
divyanshsinghvi Feb 2, 2026
be3e075
speed improvement
divyanshsinghvi Feb 2, 2026
588b4dc
cosyvoice3.md
divyanshsinghvi Feb 2, 2026
6e64675
pyproject toml updated
divyanshsinghvi Feb 2, 2026
3ef758d
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Feb 2, 2026
fe10e30
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Feb 4, 2026
e70f714
few comments addressed
divyanshsinghvi Feb 21, 2026
1de710c
download should be done by user; removed network dependency
divyanshsinghvi Feb 21, 2026
9dc4fc2
mel basis lru cache
divyanshsinghvi Feb 21, 2026
0185592
remove preallocation to on the fly allocation
divyanshsinghvi Feb 21, 2026
cca7ee4
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Feb 24, 2026
1d9241a
requirements updated
divyanshsinghvi Feb 24, 2026
0da9551
update to readme
divyanshsinghvi Feb 24, 2026
7a31f96
api changes
divyanshsinghvi Feb 24, 2026
19301a0
get data parser changed
divyanshsinghvi Feb 24, 2026
7fb162b
ok
divyanshsinghvi Feb 24, 2026
803dd23
remove precheck fil
divyanshsinghvi Feb 24, 2026
dc662db
utils
divyanshsinghvi Feb 24, 2026
014bf60
cached runtime components
divyanshsinghvi Feb 24, 2026
7b4a1db
multimodal output not present?
divyanshsinghvi Feb 24, 2026
165177c
cosyvoice3.py
divyanshsinghvi Feb 24, 2026
3150aec
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
hsliuustc0106 Feb 26, 2026
3c66b36
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
divyanshsinghvi Mar 1, 2026
757313a
Merge branch 'main' into Fun-CosyVoice3-0.5B-2512
hsliuustc0106 Mar 2, 2026
1 change: 1 addition & 0 deletions docs/models/supported_models.md
@@ -34,6 +34,7 @@ th {
|`LongcatImagePipeline` | LongCat-Image | `meituan-longcat/LongCat-Image` |
|`LongCatImageEditPipeline` | LongCat-Image-Edit | `meituan-longcat/LongCat-Image-Edit` |
|`StableDiffusion3Pipeline` | Stable-Diffusion-3 | `stabilityai/stable-diffusion-3.5-medium` |
|`CosyVoice3Model` | CosyVoice3 | `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` |
|`Flux2KleinPipeline` | FLUX.2-klein | `black-forest-labs/FLUX.2-klein-4B`, `black-forest-labs/FLUX.2-klein-9B` |
|`FluxPipeline` | FLUX.1-dev | `black-forest-labs/FLUX.1-dev` |
|`OmniGen2Pipeline` | OmniGen2 | `OmniGen2/OmniGen2` |
59 changes: 59 additions & 0 deletions examples/offline_inference/text_to_speech/cosyvoice3.md
@@ -0,0 +1,59 @@
## Setup

Install dependencies:
```bash
uv pip install -e .
```

> **Note:** This includes required libraries such as `librosa`, `soundfile`,
> `onnxruntime`, `x-transformers`, and `einops` via
> `requirements/common.txt` and platform-specific requirements files.

Download the model snapshot:
```python
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
```

Add `config.json` in `pretrained_models/Fun-CosyVoice3-0.5B/`:
```json
{
  "model_type": "cosyvoice3",
  "architectures": [
    "CosyVoice3Model"
  ]
}
```

> **Why `config.json` is required:**
> `AutoConfig.register("cosyvoice3", CosyVoice3Config)` only registers a class mapping.
> The loader still needs `model_type: "cosyvoice3"` from `config.json` to select that class.
> If no `config.json` is present, model type cannot be inferred automatically.
> If your downloaded checkpoint already includes a valid `config.json` with
> `model_type: "cosyvoice3"`, this manual step can be skipped.
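
As a minimal sketch of the manual step above, the helper below writes the two-field `config.json` only when the file is missing. The function name `ensure_cosyvoice3_config` is hypothetical and not part of this repository; the JSON contents are the ones given in this guide.

```python
import json
from pathlib import Path

# Minimal config taken from the setup step above.
MINIMAL_CONFIG = {
    "model_type": "cosyvoice3",
    "architectures": ["CosyVoice3Model"],
}


def ensure_cosyvoice3_config(model_dir: str) -> bool:
    """Write config.json if it is absent; return True when a file was written.

    Hypothetical helper for illustration only.
    """
    config_path = Path(model_dir) / "config.json"
    if config_path.exists():
        # Leave a config that shipped with the checkpoint untouched.
        return False
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(MINIMAL_CONFIG, indent=2) + "\n", encoding="utf-8")
    return True
```

Calling `ensure_cosyvoice3_config("pretrained_models/Fun-CosyVoice3-0.5B")` once after the download should be enough. Note that this sketch does not check whether an existing `config.json` actually declares `model_type: "cosyvoice3"`.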

Run the offline verification script:
```bash
python examples/offline_inference/text_to_speech/verify_e2e_cosyvoice.py \
  --model pretrained_models/Fun-CosyVoice3-0.5B \
  --tokenizer pretrained_models/Fun-CosyVoice3-0.5B/CosyVoice-BlankEN
```

## Implementation Overview

CosyVoice3 runs as a 2-stage Omni pipeline:
- Stage 0 (text_speech_lm) converts text + prompt audio to speech tokens.
- Stage 1 (chunk_aware_flow_matching) converts speech tokens + prompt features to audio.

Key components in `vllm_omni/model_executor/models/cosyvoice3/cosyvoice3.py`:
- `CosyVoice3MultiModalProcessor` builds the multimodal inputs:
  - Tokenizes `prompt` and `prompt_text`.
  - Extracts speech tokens and mel features from the prompt audio.
  - Extracts a speaker embedding.
- `CosyVoice3Model` implements both stages:
  - Stage 0 uses `CosyVoice3LM` and outputs speech tokens + conditioning features.
  - Stage 1 runs the flow model (DiT-based CFM) and HiFiGAN to synthesize the waveform.

Stage wiring is configured in `vllm_omni/model_executor/stage_configs/cosyvoice3.yaml`:
- Stage 0 emits latent speech tokens.
- Stage 1 consumes them via `custom_process_input_func` and outputs audio.
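
To make the stage wiring concrete, here is a toy sketch of the data flow in plain Python. It is not the vllm-omni API: `stage0_speech_lm`, `link_stage0_to_stage1`, and `stage1_flow_and_vocoder` are hypothetical stand-ins, and the token and sample values are placeholders rather than model outputs. It only illustrates the shape of the pipeline: stage 0 produces speech tokens, a link function (the role `custom_process_input_func` plays in the YAML) repackages them, and stage 1 returns audio samples.

```python
from dataclasses import dataclass


@dataclass
class Stage0Output:
    speech_tokens: list[int]      # discrete speech tokens from the LM
    prompt_features: list[float]  # conditioning derived from the prompt audio


@dataclass
class Stage1Input:
    tokens: list[int]
    conditioning: list[float]


def stage0_speech_lm(text: str, prompt_audio: list[float]) -> Stage0Output:
    # Placeholder: one fake token per character instead of real LM sampling.
    tokens = [ord(ch) % 100 for ch in text]
    return Stage0Output(speech_tokens=tokens, prompt_features=prompt_audio[:4])


def link_stage0_to_stage1(out: Stage0Output) -> Stage1Input:
    # Plays the role of the stage-input link function: repackage stage 0
    # outputs into the structure stage 1 expects.
    return Stage1Input(tokens=out.speech_tokens, conditioning=out.prompt_features)


def stage1_flow_and_vocoder(inp: Stage1Input, samples_per_token: int = 2) -> list[float]:
    # Placeholder: emit a fixed number of samples per token instead of
    # running flow matching and a vocoder.
    audio: list[float] = []
    for tok in inp.tokens:
        audio.extend([tok / 100.0] * samples_per_token)
    return audio


def run_pipeline(text: str, prompt_audio: list[float]) -> list[float]:
    # Chain the two stages through the link function.
    stage0_out = stage0_speech_lm(text, prompt_audio)
    return stage1_flow_and_vocoder(link_stage0_to_stage1(stage0_out))
```

Keeping the link step as its own function mirrors the design described above, where the hand-off between stages is configured separately from the stages themselves.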
201 changes: 201 additions & 0 deletions examples/offline_inference/text_to_speech/verify_e2e_cosyvoice.py
@@ -0,0 +1,201 @@
import argparse
import os
from pathlib import Path

import librosa
import numpy as np
import soundfile as sf
from vllm import SamplingParams
from vllm.assets.audio import AudioAsset

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.model_executor.models.cosyvoice3.config import CosyVoice3Config
from vllm_omni.model_executor.models.cosyvoice3.tokenizer import get_qwen_tokenizer
from vllm_omni.model_executor.models.cosyvoice3.utils import extract_text_token


def _ensure_mel_filters_asset() -> None:
    repo_root = Path(__file__).resolve().parents[3]
    filters_path = repo_root / "vllm_omni" / "model_executor" / "models" / "cosyvoice3" / "assets" / "mel_filters.npz"
    if filters_path.exists():
        return

    source_url = "https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/mel_filters.npz"
    raise FileNotFoundError(
        "Missing CosyVoice3 mel filter asset:\n"
        f" {filters_path}\n"
        "Download it with:\n"
        f" mkdir -p {filters_path.parent} && "
        f"curl -L {source_url} -o {filters_path}"
    )


def run_e2e():
    parser = argparse.ArgumentParser()
    # Expects a local snapshot of FunAudioLLM/Fun-CosyVoice3-0.5B-2512.
    parser.add_argument(
        "--model",
        type=str,
        required=True,
        help="Path to CosyVoice3 model directory (e.g., pretrained_models/Fun-CosyVoice3-0.5B/).",
    )
    parser.add_argument("--stage-config", type=str, default="vllm_omni/model_executor/stage_configs/cosyvoice3.yaml")
    parser.add_argument("--prompt", type=str, default="Hello, this is a test of the CosyVoice system capability.")
    parser.add_argument(
        "--prompt-text",
        type=str,
        default="You are a helpful assistant.<|endofprompt|>Testing my voices. Why should I not?",
    )
    parser.add_argument("--audio-path", type=str, default="prompt.wav")
    parser.add_argument(
        "--tokenizer",
        type=str,
        required=True,
        help="Path to tokenizer directory (e.g., <model_path>/CosyVoice-BlankEN).",
    )
    args = parser.parse_args()
    _ensure_mel_filters_asset()
    # Ensure tokenizer directory exists
    if not os.path.exists(args.tokenizer):
        raise FileNotFoundError(f"{args.tokenizer} does not exist!")

    # Ensure stage config exists
    if not os.path.exists(args.stage_config):
        raise FileNotFoundError(f"{args.stage_config} does not exist!")

    print(f"Initializing cosyvoice E2E with model={args.model}")

    # Initialize Omni.
    # This spins up the engine(s) based on the stage config.
    # We pass trust_remote_code=True, as in the Qwen examples.
    omni = Omni(
        model=args.model,
        stage_configs_path=args.stage_config,
        trust_remote_code=True,
        tokenizer=args.tokenizer,
        log_stats=True,
    )

    # Map CosyVoice sampling config into vLLM SamplingParams for stage 0.
    try:
        # TODO: This is not working correctly right now.
        hf_config = omni.instance.stage_list[0].vllm_config.model_config.hf_config
        sampling_cfg = hf_config.llm["sampling"]
    except Exception:
        # Fall back to hard-coded sampling defaults when the config lookup fails.
        sampling_cfg = {"top_p": 0.8, "top_k": 25, "eos_token_id": 6561 + 1}

    print("Model initialized. Preparing inputs...")
    if args.audio_path:
        if not os.path.exists(args.audio_path):
            raise FileNotFoundError(f"Audio file not found: {args.audio_path}")
        # Load at native sample rate
        audio_signal, sr = librosa.load(args.audio_path, sr=None)

        # Validate sample rate before processing (similar to original CosyVoice)
        min_sr = 16000
        if sr < min_sr:
            raise ValueError(
                f"Audio sample rate {sr} Hz is too low. "
                f"Minimum required: {min_sr} Hz. "
                f"Please provide audio with sample rate >= {min_sr} Hz."
            )

        audio_data = (audio_signal.astype(np.float32), sr)
    else:
        audio_data = AudioAsset("mary_had_lamb").audio_and_sample_rate

    prompts = {
        "prompt": args.prompt,
        "multi_modal_data": {
            "audio": audio_data,
        },
        "mm_processor_kwargs": {
            "prompt_text": args.prompt_text,
            "sample_rate": audio_data[1],
        },
    }

    print(f"Generating for prompt: {args.prompt}")

    # Derive min/max generation lengths from the text token count.
    config = CosyVoice3Config()
    tokenizer = get_qwen_tokenizer(
        token_path=args.tokenizer,
        skip_special_tokens=config.skip_special_tokens,
        version=config.version,
    )
    _, text_token_len = extract_text_token(args.prompt, tokenizer, config.allowed_special)
    base_len = int(text_token_len)
    min_len = int(base_len * config.min_token_text_ratio)
    max_len = int(base_len * config.max_token_text_ratio)

    # Build SamplingParams for each stage (stage 0: speech LM, stage 1: flow matching)
    gpt_sampling = SamplingParams(
        temperature=1.0,
        top_p=sampling_cfg["top_p"],
        top_k=sampling_cfg["top_k"],
        repetition_penalty=2.0,
        min_tokens=min_len,
        max_tokens=max_len,
        stop_token_ids=[sampling_cfg["eos_token_id"]],
        # allowed_token_ids=list(range(6561+3)),
        detokenize=False,
    )
    # Not used
    s2mel_sampling = SamplingParams(
        temperature=1.0,
        top_p=1.0,
        top_k=-1,
        repetition_penalty=2.0,
        max_tokens=256,
        detokenize=False,
    )

    sampling_params_list = [gpt_sampling, s2mel_sampling]

    # Start profiling (requires VLLM_TORCH_PROFILER_DIR env var)
    if os.environ.get("VLLM_TORCH_PROFILER_DIR"):
        print("Starting profiler...")
        omni.start_profile()

    # Generate (Omni orchestrator requires a per-stage SamplingParams list)
    outputs = list(omni.generate(prompts, sampling_params_list=sampling_params_list[:2]))

    # Stop profiling and get results
    if os.environ.get("VLLM_TORCH_PROFILER_DIR"):
        print("Stopping profiler...")
        profile_results = omni.stop_profile()
        print(f"Profile traces saved to: {profile_results}")

    print(outputs)
    # Verify outputs
    print(f"Received {len(outputs)} outputs.")
    for i, output in enumerate(outputs):
        try:
            ro_list = output.request_output or []
            if not ro_list:
                print("No request_output found.")
                continue

            for ro in ro_list:
                # Multimodal output may be attached to RequestOutput or CompletionOutput.
                mm = getattr(ro, "multimodal_output", None)
                if not mm and ro.outputs:
                    mm = getattr(ro.outputs[0], "multimodal_output", None)

                if mm:
                    print(f"Multimodal output keys: {mm.keys()}")
                    if "audio" in mm:
                        audio_out = mm["audio"]
                        print(f"Generated Audio Shape: {audio_out.shape}")
                        out_path = f"output_{i}.wav"
                        sf.write(out_path, audio_out.cpu().numpy().squeeze(), 22050)
                        print(f"Saved audio to {out_path}")
                else:
                    print("No multimodal output found.")
        except Exception as e:
            print(f"Error inspecting output: {e}")
    omni.close()


if __name__ == "__main__":
    run_e2e()
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -202,3 +202,5 @@ extend-ignore-identifiers-re = [
    ".*[Oo]no_[Aa]nna.*",
    ".*cann.*",
]
[tool.typos.default.extend-words]
ue = "ue"
2 changes: 2 additions & 0 deletions requirements/common.txt
@@ -12,6 +12,8 @@ torchsde>=0.2.6
openai-whisper>=20250625
imageio[ffmpeg]>=2.37.2
sox>=1.5.0
x-transformers>=2.12.2
einops>=0.8.1
prettytable>=3.8.0
aenum==3.1.16
pyzmq>=25.0.0