[Model] Fun cosy voice3-0.5-b-2512 #498
Open
divyanshsinghvi wants to merge 119 commits into vllm-project:main from divyanshsinghvi:Fun-CosyVoice3-0.5B-2512
+4,261
−0
| Commit | Message | Author |
|---|---|---|
| 00cf5fe | cosyvoice 3 initiated | divyanshsinghvi |
| 0ddcbbf | cosyvoice base file started | divyanshsinghvi |
| 7c8054d | Initialized model paths | divyanshsinghvi |
| 83526ad | Initialized model paths | divyanshsinghvi |
| 2c41af0 | renaming | divyanshsinghvi |
| 34bea22 | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| 769aaeb | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| c1130e6 | stages are ready; | divyanshsinghvi |
| bf29392 | Major fixes to stage e2e working; not correct audio yet but good prog… | divyanshsinghvi |
| 4f3cda3 | cosyvoice.yaml stage_configs | divyanshsinghvi |
| 2826cec | stage input processor link function | divyanshsinghvi |
| 5888ae7 | yaml updates and input handling corrected | divyanshsinghvi |
| ca2ee7f | fixes to inputs prompt token embed and token embed fixed; speech toke… | divyanshsinghvi |
| c103c49 | Getting finally multimodal embeds correct | divyanshsinghvi |
| 3ce5ddc | llm part fixed now | divyanshsinghvi |
| 3a2df52 | almost stage 1 half completed | divyanshsinghvi |
| 46429c0 | e2e working | divyanshsinghvi |
| f481b38 | e2e cosyvoice script some weird noise at end rest seems good | divyanshsinghvi |
| f2b62af | cosyvoice llm done | divyanshsinghvi |
| 4ae2bea | cosyvoice config refactored | divyanshsinghvi |
| c74249f | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| 7f54763 | registry cosyvoice everything upgraded | divyanshsinghvi |
| 45002af | Merge branch 'Fun-CosyVoice3-0.5B-2512' of github.com:divyanshsinghvi… | divyanshsinghvi |
| cabb289 | fix error | divyanshsinghvi |
| ae8fca0 | small fixes | divyanshsinghvi |
| db23cfc | fun cosy voice 3 0.5b | divyanshsinghvi |
| 89ea6a6 | Utils functions refactored | divyanshsinghvi |
| 7c78d69 | Utils functions refactored | divyanshsinghvi |
| 11a6d89 | remvoe clutter | divyanshsinghvi |
| ecb7949 | remove unused code | divyanshsinghvi |
| 2694153 | Fixes for final .wav | divyanshsinghvi |
| 245553a | close fix | divyanshsinghvi |
| 29b6065 | Making the project self sustainable | divyanshsinghvi |
| b787363 | rename to cosyvoice to cosyvoice3 | divyanshsinghvi |
| 5c14f46 | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| cea72ca | Fixed eos token bug | divyanshsinghvi |
| cfae813 | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | hsliuustc0106 |
| 4ea8bb7 | Need to extend word due to typos check in pre-commit. | divyanshsinghvi |
| fad1647 | Merge branch 'Fun-CosyVoice3-0.5B-2512' of github.com:divyanshsinghvi… | divyanshsinghvi |
| b3d4903 | Delete typos.toml | divyanshsinghvi |
| b655d39 | Fix precommit | divyanshsinghvi |
| 4ca4b65 | Remove whisper dependency | divyanshsinghvi |
| ca2a7b2 | Added instructures in examples....md file | divyanshsinghvi |
| ddf4387 | Fix import order and clean up commented code | divyanshsinghvi |
| f5b2ab6 | Signoff fix | divyanshsinghvi |
| 237cc7c | To remove docs error | divyanshsinghvi |
| 954fe3b | args tokenizer added | divyanshsinghvi |
| 914c737 | Organize import statements in verify_e2e_cosyvoice.py | divyanshsinghvi |
| a0b4e49 | removed extra prints | divyanshsinghvi |
| 64e547f | remove or moved to debug logs logger in cosyvoice.py | divyanshsinghvi |
| 1ad27c5 | fixes to flow.py remove dups | divyanshsinghvi |
| 1d59000 | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| 88d9c93 | remove log comments | divyanshsinghvi |
| 2b94e3e | Pyproject extend words | divyanshsinghvi |
| 1d7d3cd | init .py added | divyanshsinghvi |
| 55439c3 | docs improved | divyanshsinghvi |
| 9a98e76 | torch mask | divyanshsinghvi |
| cbb4c21 | update supported model | divyanshsinghvi |
| acca1b7 | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| 7f0eadc | t | divyanshsinghvi |
| fa0c87f | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| 52bc5b4 | t removed | divyanshsinghvi |
| 5c48f00 | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| fb010d7 | fix copyright sentences and remove random seed function | divyanshsinghvi |
| 80141c0 | clean imports | divyanshsinghvi |
| 0fd6732 | commit multimodal | divyanshsinghvi |
| c6d93e4 | fix repo names for adoption | divyanshsinghvi |
| f467a01 | remove unwanted comments | divyanshsinghvi |
| 05c8cf7 | llm -> cosyvoice3_talker | divyanshsinghvi |
| 1d523a6 | cosyvoice3.yaml | divyanshsinghvi |
| 6f99e36 | fixes | divyanshsinghvi |
| 919777b | Reorder import statements in verify_e2e_cosyvoice.py | divyanshsinghvi |
| 25ed1ad | working | divyanshsinghvi |
| 2821dcd | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| 955c4c4 | remove streaming erlated excess code; currently not supported; also a… | divyanshsinghvi |
| f1ec0fb | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| 3c75364 | Fixed with new versions | divyanshsinghvi |
| 7dbe52f | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| e214cf4 | cosyvoice3 e2e empty token at end fix | divyanshsinghvi |
| bcf4e43 | vllm integration for qwen2model | divyanshsinghvi |
| 54db8d7 | remove batch input form yaml; remove extra logging | divyanshsinghvi |
| e215b97 | support cuda graph compilation | divyanshsinghvi |
| fad1b71 | remove unwanted batch items | divyanshsinghvi |
| 2a33319 | remove unwanted batch items | divyanshsinghvi |
| 4c60905 | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| 5cdc9a2 | cosyvoice3_audio dit added | divyanshsinghvi |
| 864689b | added to diffusion | divyanshsinghvi |
| 645393f | final fixes and refactoring | divyanshsinghvi |
| ec9f6b4 | Refactor | divyanshsinghvi |
| 2c37db6 | Refactors | divyanshsinghvi |
| 08036e1 | Renaming files | divyanshsinghvi |
| 9d68d8d | cosyvoice3 tests added | divyanshsinghvi |
| 6ec7bef | added tests for cosyvoice | divyanshsinghvi |
| bf27241 | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| 0bd3902 | Aucto config registration; remove hardcoded paths | divyanshsinghvi |
| 2bb958c | Updated readme | divyanshsinghvi |
| be3e075 | speed improvement | divyanshsinghvi |
| 588b4dc | cosyvoice3.md | divyanshsinghvi |
| 6e64675 | pyproject toml updated | divyanshsinghvi |
| 3ef758d | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| fe10e30 | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| e70f714 | few comments addressed | divyanshsinghvi |
| 1de710c | download should be done by user; removed network dependency | divyanshsinghvi |
| 9dc4fc2 | mel basis lru cache | divyanshsinghvi |
| 0185592 | remove preallocation to on the fly allocation | divyanshsinghvi |
| cca7ee4 | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| 1d9241a | requirements updated | divyanshsinghvi |
| 0da9551 | update to readme | divyanshsinghvi |
| 7a31f96 | api changes | divyanshsinghvi |
| 19301a0 | get data parser changed | divyanshsinghvi |
| 7fb162b | ok | divyanshsinghvi |
| 803dd23 | remove precheck fil | divyanshsinghvi |
| dc662db | utils | divyanshsinghvi |
| 014bf60 | cached runtime components | divyanshsinghvi |
| 7b4a1db | multimodal output not present? | divyanshsinghvi |
| 165177c | cosyvoice3.py | divyanshsinghvi |
| 3150aec | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | hsliuustc0106 |
| 3c66b36 | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | divyanshsinghvi |
| 757313a | Merge branch 'main' into Fun-CosyVoice3-0.5B-2512 | hsliuustc0106 |
@@ -0,0 +1,59 @@

## Setup

Install dependencies:
```bash
uv pip install -e .
```

> **Note:** This installs the required libraries, including `librosa`, `soundfile`,
> `onnxruntime`, `x-transformers`, and `einops`, via
> `requirements/common.txt` and the platform-specific requirements files.

Download the model snapshot:
```python
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
```

Add `config.json` in `pretrained_models/Fun-CosyVoice3-0.5B/`:
```json
{
  "model_type": "cosyvoice3",
  "architectures": [
    "CosyVoice3Model"
  ]
}
```

> **Why `config.json` is required:**
> `AutoConfig.register("cosyvoice3", CosyVoice3Config)` only registers a class mapping.
> The loader still needs `model_type: "cosyvoice3"` from `config.json` to select that class.
> If no `config.json` is present, the model type cannot be inferred automatically.
> If your downloaded checkpoint already includes a valid `config.json` with
> `model_type: "cosyvoice3"`, this manual step can be skipped.

Run the offline verification script:
```bash
python examples/offline_inference/text_to_speech/verify_e2e_cosyvoice.py \
    --model pretrained_models/Fun-CosyVoice3-0.5B \
    --tokenizer pretrained_models/Fun-CosyVoice3-0.5B/CosyVoice-BlankEN
```

## Implementation Overview

CosyVoice3 runs as a 2-stage Omni pipeline:
- Stage 0 (`text_speech_lm`) converts text + prompt audio to speech tokens.
- Stage 1 (`chunk_aware_flow_matching`) converts speech tokens + prompt features to audio.

Key components in `vllm_omni/model_executor/models/cosyvoice3/cosyvoice3.py`:
- `CosyVoice3MultiModalProcessor` builds the multimodal inputs:
  - Tokenizes `prompt` and `prompt_text`.
  - Extracts speech tokens and mel features from the prompt audio.
  - Extracts a speaker embedding.
- `CosyVoice3Model` implements both stages:
  - Stage 0 uses `CosyVoice3LM` and outputs speech tokens + conditioning features.
  - Stage 1 runs the flow model (DiT-based CFM) and HiFiGAN to synthesize the waveform.

Stage wiring is configured in `vllm_omni/model_executor/stage_configs/cosyvoice3.yaml`:
- Stage 0 emits latent speech tokens.
- Stage 1 consumes them via `custom_process_input_func` and outputs audio.
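
To make the data flow concrete, here is a toy sketch of the two-stage hand-off in plain Python. It does not use the vLLM-Omni API or the real models: every function name, dictionary key, and the fake token and sample arithmetic are stand-ins invented for illustration. The point is only the shape of the wiring, in which stage 0 produces speech tokens, a linking function (the role `custom_process_input_func` plays in the YAML) reshapes them into stage 1 inputs, and stage 1 returns audio.

```python
from typing import Any


def stage0_text_speech_lm(request: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for stage 0: emit one fake speech token per prompt character."""
    tokens = [ord(ch) % 100 for ch in request["prompt"]]
    return {"speech_tokens": tokens, "prompt_features": request.get("prompt_features")}


def link_stage0_to_stage1(stage0_out: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for the linking function: pick what stage 1 needs from stage 0."""
    return {
        "speech_tokens": stage0_out["speech_tokens"],
        "prompt_features": stage0_out["prompt_features"],
    }


def stage1_flow_matching(inputs: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for flow model + vocoder: two silent samples per speech token."""
    return {"audio": [0.0] * (2 * len(inputs["speech_tokens"]))}


def run_pipeline(request: dict[str, Any]) -> dict[str, Any]:
    """Run stage 0, adapt its output, then run stage 1."""
    stage0_out = stage0_text_speech_lm(request)
    return stage1_flow_matching(link_stage0_to_stage1(stage0_out))


if __name__ == "__main__":
    result = run_pipeline({"prompt": "hello"})
    print(len(result["audio"]))
```

Keeping the adapter as its own function mirrors the configuration-driven design described above: the stages stay independent, and only the linking step knows both output and input formats.
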
examples/offline_inference/text_to_speech/verify_e2e_cosyvoice.py (201 changes: 201 additions & 0 deletions)
@@ -0,0 +1,201 @@

```python
import argparse
import os
from pathlib import Path

import librosa
import numpy as np
import soundfile as sf
from vllm import SamplingParams
from vllm.assets.audio import AudioAsset

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.model_executor.models.cosyvoice3.config import CosyVoice3Config
from vllm_omni.model_executor.models.cosyvoice3.tokenizer import get_qwen_tokenizer
from vllm_omni.model_executor.models.cosyvoice3.utils import extract_text_token


def _ensure_mel_filters_asset() -> None:
    repo_root = Path(__file__).resolve().parents[3]
    filters_path = repo_root / "vllm_omni" / "model_executor" / "models" / "cosyvoice3" / "assets" / "mel_filters.npz"
    if filters_path.exists():
        return

    source_url = "https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/mel_filters.npz"
    raise FileNotFoundError(
        "Missing CosyVoice3 mel filter asset:\n"
        f"  {filters_path}\n"
        "Download it with:\n"
        f"  mkdir -p {filters_path.parent} && "
        f"curl -L {source_url} -o {filters_path}"
    )


def run_e2e():
    parser = argparse.ArgumentParser()
    # e.g. a local snapshot of "FunAudioLLM/Fun-CosyVoice3-0.5B-2512"
    parser.add_argument(
        "--model",
        type=str,
        required=True,
        help="Path to CosyVoice3 model directory (e.g., pretrained_models/Fun-CosyVoice3-0.5B/).",
    )
    parser.add_argument("--stage-config", type=str, default="vllm_omni/model_executor/stage_configs/cosyvoice3.yaml")
    parser.add_argument("--prompt", type=str, default="Hello, this is a test of the CosyVoice system capability.")
    parser.add_argument(
        "--prompt-text",
        type=str,
        default="You are a helpful assistant.<|endofprompt|>Testing my voices. Why should I not?",
    )
    parser.add_argument("--audio-path", type=str, default="prompt.wav")
    parser.add_argument(
        "--tokenizer",
        type=str,
        required=True,
        help="Path to tokenizer directory (e.g., <model_path>/CosyVoice-BlankEN).",
    )
    args = parser.parse_args()
    _ensure_mel_filters_asset()
    # Ensure tokenizer directory exists
    if not os.path.exists(args.tokenizer):
        raise FileNotFoundError(f"{args.tokenizer} does not exist!")

    # Ensure stage config exists
    if not os.path.exists(args.stage_config):
        raise FileNotFoundError(f"{args.stage_config} does not exist!")

    print(f"Initializing cosyvoice E2E with model={args.model}")

    # Initialize Omni
    # This spins up the engine(s) based on the stage config
    # We pass trust_remote_code=True same as Qwen examples
    omni = Omni(
        model=args.model,
        stage_configs_path=args.stage_config,
        trust_remote_code=True,
        tokenizer=args.tokenizer,
        log_stats=True,
    )

    # Map CosyVoice sampling config into vLLM SamplingParams for stage 0.
    try:
        # TODO: This is not working correctly right now.
        hf_config = omni.instance.stage_list[0].vllm_config.model_config.hf_config
        sampling_cfg = hf_config.llm["sampling"]
    except Exception:
        # Fall back to hard-coded defaults when the config lookup fails.
        sampling_cfg = {"top_p": 0.8, "top_k": 25, "eos_token_id": 6561 + 1}

    print("Model initialized. Preparing inputs...")
    if args.audio_path:
        if not os.path.exists(args.audio_path):
            raise FileNotFoundError(f"Audio file not found: {args.audio_path}")
        # Load at native sample rate
        audio_signal, sr = librosa.load(args.audio_path, sr=None)

        # Validate sample rate before processing (similar to original CosyVoice)
        min_sr = 16000
        if sr < min_sr:
            raise ValueError(
                f"Audio sample rate {sr} Hz is too low. "
                f"Minimum required: {min_sr} Hz. "
                f"Please provide audio with sample rate >= {min_sr} Hz."
            )

        audio_data = (audio_signal.astype(np.float32), sr)
    else:
        audio_data = AudioAsset("mary_had_lamb").audio_and_sample_rate

    prompts = {
        "prompt": args.prompt,
        "multi_modal_data": {
            "audio": audio_data,
        },
        "mm_processor_kwargs": {
            "prompt_text": args.prompt_text,
            "sample_rate": audio_data[1],
        },
    }

    print(f"Generating for prompt: {args.prompt}")

    config = CosyVoice3Config()
    tokenizer = get_qwen_tokenizer(
        token_path=args.tokenizer,
        skip_special_tokens=config.skip_special_tokens,
        version=config.version,
    )
    _, text_token_len = extract_text_token(args.prompt, tokenizer, config.allowed_special)
    base_len = int(text_token_len)
    min_len = int(base_len * config.min_token_text_ratio)
    max_len = int(base_len * config.max_token_text_ratio)

    # Build SamplingParams for each stage (stage 0: speech-token LM, stage 1: flow matching + vocoder)
    gpt_sampling = SamplingParams(
        temperature=1.0,
        top_p=sampling_cfg["top_p"],
        top_k=sampling_cfg["top_k"],
        repetition_penalty=2.0,
        min_tokens=min_len,
        max_tokens=max_len,
        stop_token_ids=[sampling_cfg["eos_token_id"]],
        # allowed_token_ids=list(range(6561+3)),
        detokenize=False,
    )
    # Not used by stage 1, but the orchestrator expects one SamplingParams per stage.
    s2mel_sampling = SamplingParams(
        temperature=1.0,
        top_p=1.0,
        top_k=-1,
        repetition_penalty=2.0,
        max_tokens=256,
        detokenize=False,
    )

    sampling_params_list = [gpt_sampling, s2mel_sampling]

    # Start profiling (requires VLLM_TORCH_PROFILER_DIR env var)
    if os.environ.get("VLLM_TORCH_PROFILER_DIR"):
        print("Starting profiler...")
        omni.start_profile()

    # Generate (Omni orchestrator requires a per-stage SamplingParams list)
    outputs = list(omni.generate(prompts, sampling_params_list=sampling_params_list[:2]))

    # Stop profiling and get results
    if os.environ.get("VLLM_TORCH_PROFILER_DIR"):
        print("Stopping profiler...")
        profile_results = omni.stop_profile()
        print(f"Profile traces saved to: {profile_results}")

    print(outputs)
    # Verify outputs
    print(f"Received {len(outputs)} outputs.")
    for i, output in enumerate(outputs):
        try:
            ro_list = output.request_output or []
            if not ro_list:
                print("No request_output found.")
                continue

            for ro in ro_list:
                # Multimodal output may be attached to RequestOutput or CompletionOutput.
                mm = getattr(ro, "multimodal_output", None)
                if not mm and ro.outputs:
                    mm = getattr(ro.outputs[0], "multimodal_output", None)

                if mm:
                    print(f"Multimodal output keys: {mm.keys()}")
                    if "audio" in mm:
                        audio_out = mm["audio"]
                        print(f"Generated Audio Shape: {audio_out.shape}")
                        out_path = f"output_{i}.wav"
                        sf.write(out_path, audio_out.cpu().numpy().squeeze(), 22050)
                        print(f"Saved audio to {out_path}")
                else:
                    print("No multimodal output found.")
        except Exception as e:
            print(f"Error inspecting output: {e}")
    omni.close()


if __name__ == "__main__":
    run_e2e()
```
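
The script derives `min_tokens` and `max_tokens` for stage 0 by scaling the text token count with `config.min_token_text_ratio` and `config.max_token_text_ratio`. The sketch below isolates that arithmetic so it can be checked without loading a model. The function name `token_length_bounds` is hypothetical, and the ratios `2.0` and `20.0` in the usage lines are illustrative assumptions only; the actual values come from `CosyVoice3Config`, which is not shown on this page.

```python
def token_length_bounds(text_token_len: int, min_ratio: float, max_ratio: float) -> tuple[int, int]:
    """Return (min_tokens, max_tokens) for speech-token generation.

    Mirrors the script: each bound is the text token count scaled by a ratio
    and truncated toward zero with int().
    """
    if text_token_len < 0:
        raise ValueError("text_token_len must be non-negative")
    if min_ratio > max_ratio:
        raise ValueError("min_ratio must not exceed max_ratio")
    min_len = int(text_token_len * min_ratio)
    max_len = int(text_token_len * max_ratio)
    return min_len, max_len


if __name__ == "__main__":
    # Illustrative ratios, not taken from CosyVoice3Config.
    print(token_length_bounds(12, 2.0, 20.0))
```

Tying both bounds to the text length keeps the speech-token LM from stopping too early on long prompts and from running away on short ones.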