Commit 081845b

周逸轩 authored and committed
FX: readme description
1 parent 6684b54 commit 081845b

2 files changed

+41
-34
lines changed

README.md

Lines changed: 40 additions & 33 deletions
@@ -1,25 +1,25 @@
 ## 🎙️ VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning


-[![Project Page](https://img.shields.io/badge/Project%20Page-GitHub-blue)](https://github.com/OpenBMB/VoxCPM/) [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-OpenBMB-yellow)](hhttps://huggingface.co/openbmb/VoxCPM-0.5B) [![Live Playground](https://img.shields.io/badge/Live%20PlayGround-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) [![Samples](https://img.shields.io/badge/Page-Samples-red)](https://thuhcsi.github.io/VoxCPM/)
+[![Project Page](https://img.shields.io/badge/Project%20Page-GitHub-blue)](https://github.com/OpenBMB/VoxCPM/) [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-OpenBMB-yellow)](https://huggingface.co/openbmb/VoxCPM-0.5B) [![Live Playground](https://img.shields.io/badge/Live%20PlayGround-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) [![Samples](https://img.shields.io/badge/Page-Samples-red)](https://thuhcsi.github.io/VoxCPM/)


 <div align="center">
 <img src="assets/voxcpm_logo.png" alt="VoxCPM Logo" width="40%">
 </div>

 ## News
-* [2025.09.16] 🔥 🔥 🔥 We Open Source the VoxCPM-0.5B weights!
+* [2025.09.16] 🔥 🔥 🔥 We Open Source the VoxCPM-0.5B [weights](https://huggingface.co/openbmb/VoxCPM-0.5B)!
 * [2025.09.16] 🎉 🎉 🎉 We Provide the [Gradio PlayGround](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) for VoxCPM-0.5B, try it now!

 ## Overview

 VoxCPM is a novel tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and enables two flagship capabilities: context-aware speech generation and true-to-life zero-shot voice cloning.

-Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on [MiniCPM-4](https://huggingface.co/openbmb/MiniCPM4-0.5B), it achieves implicit semantic-acoustic decoupling through hierachical language modeling and FSQ constraints, greatly enhancing both expressiveness and generation stability.
+Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on [MiniCPM-4](https://huggingface.co/openbmb/MiniCPM4-0.5B) backbone, it achieves implicit semantic-acoustic decoupling through hierachical language modeling and FSQ constraints, greatly enhancing both expressiveness and generation stability.

 <div align="center">
-<img src="assets/voxcpm_model.png" alt="VoxCPM Model Architecture" width="500">
+<img src="assets/voxcpm_model.png" alt="VoxCPM Model Architecture" width="90%">
 </div>


@@ -30,6 +30,13 @@ Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses



+
+
+
+
+
+
+
 ## Quick Start

 ### 🔧 Install from PyPI
@@ -61,13 +68,13 @@ wav = model.generate(
 text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
 prompt_wav_path=None, # optional: path to a prompt speech for voice cloning
 prompt_text=None, # optional: reference text
-cfg_value=2.0,
-inference_timesteps=10,
-normalize=True,
-denoise=True,
-retry_badcase=True, # optional: enable retrying mode
-retry_badcase_max_times=3,
-retry_badcase_ratio_threshold=6.0,
+cfg_value=2.0, # LM guidance on LocDiT, higher for better adherence to the prompt, but maybe worse
+inference_timesteps=10, # LocDiT inference timesteps, higher for better result, lower for fast speed
+normalize=True, # enable external TN tool
+denoise=True, # enable external Denoise tool
+retry_badcase=True, # enable retrying mode for some bad cases (unstoppable)
+retry_badcase_max_times=3, # maximum retrying times
+retry_badcase_ratio_threshold=6.0, # maximum length restriction for bad case detection (simple but effective), it could be adjusted for slow pace speech
 )

 sf.write("output.wav", wav, 16000)
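
The `retry_badcase*` arguments in the hunk above describe a bounded retry loop driven by a length check. The sketch below shows one way such a loop could work; it is not VoxCPM's implementation. The names `generate_with_retry` and `synthesize`, and the ratio definition (output length divided by input text length), are assumptions made for illustration.

```python
from typing import Callable, List


def generate_with_retry(
    synthesize: Callable[[str], List[float]],  # hypothetical TTS callable
    text: str,
    max_times: int = 3,
    ratio_threshold: float = 6.0,
) -> List[float]:
    """Run synthesize once, then retry while the output looks too long for the text."""
    output = synthesize(text)
    for _ in range(max_times):
        # Assumed bad-case signal: output units per input character.
        ratio = len(output) / max(len(text), 1)
        if ratio <= ratio_threshold:
            break
        output = synthesize(text)
    return output


# Toy stand-in for a model: the first call runs away, later calls behave.
calls: List[str] = []


def fake_synthesize(text: str) -> List[float]:
    calls.append(text)
    length = 100 if len(calls) == 1 else 10
    return [0.0] * length


audio = generate_with_retry(fake_synthesize, "hello")
print(len(audio), len(calls))
```

Under this reading, raising `ratio_threshold` tolerates longer output per character, which matches the comment's note about slow-paced speech.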
@@ -175,41 +182,41 @@ VoxCPM achieves competitive results on public zero-shot TTS benchmarks:
 | Model | Parameters | Open-Source | test-EN | | test-ZH | | test-Hard | |
 |------|------|------|:------------:|:--:|:------------:|:--:|:-------------:|:--:|
 | | | | WER/%⬇ | SIM/%⬆| CER/%⬇| SIM/%⬆ | CER/%⬇ | SIM/%⬆ |
+| MegaTTS3 | 0.5B || 2.79 | 77.1 | 1.52 | 79.0 | - | - |
+| DiTAR | 0.6B || 1.69 | 73.5 | 1.02 | 75.3 | - | - |
+| CosyVoice3 | 0.5B || 2.02 | 71.8 | 1.16 | 78.0 | 6.08 | 75.8 |
+| CosyVoice3 | 1.5B || 2.22 | 72.0 | 1.12 | 78.1 | 5.83 | 75.8 |
+| Seed-TTS | - || 2.25 | 76.2 | 1.12 | 79.6 | 7.59 | 77.6 |
+| MiniMax-Speech | - || 1.65 | 69.2 | 0.83 | 78.3 | - | - |
 | CosyVoice | 0.3B || 4.29 | 60.9 | 3.63 | 72.3 | 11.75 | 70.9 |
-| CosyVoice2 | 0.5B || 3.09 | 65.9 | 1.38 | 75.7 | 6.83 | 72.4 |
+| CosyVoice2 | 0.5B || 3.09 | 65.9 | 1.38 | 75.7 | **6.83** | 72.4 |
 | F5-TTS | 0.3B || 2.00 | 67.0 | 1.53 | 76.0 | 8.67 | 71.3 |
 | SparkTTS | 0.5B || 3.14 | 57.3 | 1.54 | 66.0 | - | - |
 | FireRedTTS | 0.5B || 3.82 | 46.0 | 1.51 | 63.5 | 17.45 | 62.1 |
 | FireRedTTS-2 | 1.5B || 1.95 | 66.5 | 1.14 | 73.6 | - | - |
-| Qwen2.5-Omni | 7B || 2.72 | 63.2 | 1.70 | 75.2 | 7.97 | 74.7 |
+| Qwen2.5-Omni | 7B || 2.72 | 63.2 | 1.70 | 75.2 | 7.97 | **74.7** |
 | OpenAudio-s1-mini | 0.5B || 1.94 | 55.0 | 1.18 | 68.5 | - | - |
 | IndexTTS2 | 1.5B || 2.23 | 70.6 | 1.03 | 76.5 | - | - |
 | VibeVoice | 1.5B || 3.04 | 68.9 | 1.16 | 74.4 | - | - |
 | HiggsAudio-v2 | 3B || 2.44 | 67.7 | 1.50 | 74.0 | - | - |
-| CosyVoice3 | 0.5B || 2.02 | 71.8 | 1.16 | 78.0 | 6.08 | 75.8 |
-| CosyVoice3 | 1.5B || 2.22 | 72.0 | 1.12 | 78.1 | 5.83 | 75.8 |
-| MegaTTS3 | 0.5B || 2.79 | 77.1 | 1.52 | 79.0 | - | - |
-| DiTAR | 0.6B || 1.69 | 73.5 | 1.02 | 75.3 | - | - |
-| Seed-TTS | - || 2.25 | 76.2 | 1.12 | 79.6 | 7.59 | 77.6 |
-| MiniMax-Speech | - || 1.65 | 69.2 | 0.83 | 78.3 | - | - |
-| **VoxCPM** | **0.5B** | **** | **1.85** | **72.9** | **0.93** | **77.2** | 8.87 | 73.0 |
+| **VoxCPM** | 0.5B || **1.85** | **72.9** | **0.93** | **77.2** | 8.87 | 73.0 |


 ### CV3-eval Benchmark

-| Model | zh | en | hard-zh | | | hard-en | | | |
-|-------|:--:|:--:|:-------:|:--:|:--:|:-------:|:--:|:--:|:--:|
-| | CER/%⬇ | WER/%⬇ | CER/%⬇ | SIM/%⬆ | DNSMOS⬆ | WER/%⬇ | SIM/%⬆ | DNSMOS⬆ | |
-| F5-TTS | 5.47 | 8.90 | - | - | - | - | - | - | |
-| SparkTTS | 5.15 | 11.0 | - | - | - | - | - | - | |
-| GPT-SoVits | 7.34 | 12.5 | - | - | - | - | - | - | |
-| CosyVoice2 | 4.08 | 6.32 | 12.58 | 72.6 | 3.81 | 11.96 | 66.7 | 3.95 | |
-| OpenAudio-s1-mini | 4.00 | 5.54 | 18.1 | 58.2 | 3.77 | 12.4 | 55.7 | 3.89 | |
-| IndexTTS2 | 3.58 | 4.45 | 12.8 | 74.6 | 3.65 | fail | fail | fail | |
-| HiggsAudio-v2 | 9.54 | 7.89 | 41.0 | 60.2 | 3.39 | 10.3 | 61.8 | 3.68 | |
-| CosyVoice3-0.5B | 3.89 | 5.24 | 14.15 | 78.6 | 3.75 | 9.04 | 75.9 | 3.92 | |
-| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 78.5 | 3.79 | 10.55 | 76.1 | 3.95 | |
-| **VoxCPM** | **3.40** | **4.04** | 12.9 | 66.1 | 3.59 | **7.89** | 64.3 | 3.74 | |
+| Model | zh | en | hard-zh | | | hard-en | | |
+|-------|:--:|:--:|:-------:|:--:|:--:|:-------:|:--:|:--:|
+| | CER/%⬇ | WER/%⬇ | CER/%⬇ | SIM/%⬆ | DNSMOS⬆ | WER/%⬇ | SIM/%⬆ | DNSMOS⬆ |
+| F5-TTS | 5.47 | 8.90 | - | - | - | - | - | - |
+| SparkTTS | 5.15 | 11.0 | - | - | - | - | - | - |
+| GPT-SoVits | 7.34 | 12.5 | - | - | - | - | - | - |
+| CosyVoice2 | 4.08 | 6.32 | 12.58 | 72.6 | 3.81 | 11.96 | 66.7 | 3.95 |
+| OpenAudio-s1-mini | 4.00 | 5.54 | 18.1 | 58.2 | 3.77 | 12.4 | 55.7 | 3.89 |
+| IndexTTS2 | 3.58 | 4.45 | 12.8 | 74.6 | 3.65 | - | - | - |
+| HiggsAudio-v2 | 9.54 | 7.89 | 41.0 | 60.2 | 3.39 | 10.3 | 61.8 | 3.68 |
+| CosyVoice3-0.5B | 3.89 | 5.24 | 14.15 | 78.6 | 3.75 | 9.04 | 75.9 | 3.92 |
+| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 78.5 | 3.79 | 10.55 | 76.1 | 3.95 |
+| **VoxCPM** | **3.40** | **4.04** | 12.9 | 66.1 | 3.59 | **7.89** | 64.3 | 3.74 |



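
The WER and CER columns in the benchmark tables above are edit-distance error rates over words and characters respectively. As a rough illustration of how such a percentage is computed, here is a minimal sketch. The benchmark's actual pipeline (the ASR model that transcribes the audio, and any text normalization) is not described in this diff, and the helper names `edit_distance` and `error_rate` are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# WER: tokens are words. One of six reference words is missing.
wer = error_rate("the cat sat on the mat".split(), "the cat sat on mat".split())

# CER: tokens are characters. One of five characters is substituted.
cer = error_rate(list("今天天气好"), list("今天天汽好"))

print(f"WER = {wer:.2f}%, CER = {cer:.2f}%")
```

Lower is better for both, which is what the ⬇ marker in the column headers indicates.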

src/voxcpm/modules/locdit/unified_cfm.py

Lines changed: 1 addition & 1 deletion
@@ -88,7 +88,7 @@ def solve_euler(
 shape: (n_timesteps + 1,)
 mu (torch.Tensor): output of encoder
 shape: (batch_size, n_feats)
-cond: Not used but kept for future purposes
+cond: condition -- prefix prompt
 cfg_value (float, optional): cfg value for guidance. Defaults to 1.0.
 """
 t, _, dt = t_span[0], t_span[-1], t_span[0] - t_span[1]
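
The docstring and the `dt` line above suggest a fixed-step Euler solver with classifier-free guidance. The sketch below is a pure-Python illustration under stated assumptions, not the code in `unified_cfm.py`: the `velocity` callable and its signature are hypothetical, the unguided branch is assumed to zero `mu` while keeping `cond`, and the update sign assumes `t_span` runs from 1 down to 0.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical estimator signature: (x, t, mu, cond) -> dx/dt.
Velocity = Callable[
    [List[float], float, List[float], Optional[List[float]]], List[float]
]


def solve_euler_sketch(
    x: Sequence[float],
    t_span: Sequence[float],
    mu: Sequence[float],
    cond: Optional[Sequence[float]],
    velocity: Velocity,
    cfg_value: float = 1.0,
) -> List[float]:
    """Fixed-step Euler integration along t_span with classifier-free guidance."""
    state = list(x)
    cond_in = None if cond is None else list(cond)
    t = t_span[0]
    dt = t_span[0] - t_span[1]  # same step definition as the diffed line
    null_mu = [0.0] * len(mu)   # assumption: the unguided branch zeroes mu
    for _ in range(len(t_span) - 1):
        v_cond = velocity(state, t, list(mu), cond_in)
        v_null = velocity(state, t, null_mu, cond_in)
        # Guidance: extrapolate from the unguided velocity toward the guided one.
        v = [u + cfg_value * (c - u) for c, u in zip(v_cond, v_null)]
        state = [s - dt * vi for s, vi in zip(state, v)]
        t -= dt
    return state


def toy_velocity(x, t, mu, cond):
    # Constant field equal to mu, so the result is easy to check by hand.
    return list(mu)


t_span = [1.0 - i / 10 for i in range(11)]  # 1.0 down to 0.0 in 10 steps
result = solve_euler_sketch([1.0], t_span, [0.25], None, toy_velocity, cfg_value=2.0)
print(result)
```

With the toy field, the guided velocity is `2.0 * 0.25 = 0.5`, so integrating over a total time of 1 moves the state from 1.0 to approximately 0.5. Setting `cfg_value=1.0` reduces the update to the guided velocity alone.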
