-* [2025.09.16] 🔥 🔥 🔥 We Open Source the VoxCPM-0.5B weights!
+* [2025.09.16] 🔥 🔥 🔥 We Open Source the VoxCPM-0.5B [weights](https://huggingface.co/openbmb/VoxCPM-0.5B)!
* [2025.09.16] 🎉 🎉 🎉 We Provide the [Gradio PlayGround](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) for VoxCPM-0.5B, try it now!

## Overview
VoxCPM is a novel tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and enables two flagship capabilities: context-aware speech generation and true-to-life zero-shot voice cloning.

-Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on [MiniCPM-4](https://huggingface.co/openbmb/MiniCPM4-0.5B), it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing both expressiveness and generation stability.
+Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on [MiniCPM-4](https://huggingface.co/openbmb/MiniCPM4-0.5B) backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing both expressiveness and generation stability.

<div align="center">
-<img src="assets/voxcpm_model.png" alt="VoxCPM Model Architecture" width="500">
+<img src="assets/voxcpm_model.png" alt="VoxCPM Model Architecture" width="90%">
</div>
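
The overview mentions FSQ (finite scalar quantization) constraints without showing what such a constraint does. The sketch below illustrates generic FSQ only: each coordinate is squashed into a bounded range and snapped to a small fixed grid. The function name `fsq_quantize`, the `levels` argument, and the restriction to odd level counts are assumptions made for illustration; this is not taken from the VoxCPM code and may differ from how the model applies FSQ internally.

```python
import math


def fsq_quantize(z, levels):
    """Snap each coordinate of z onto a fixed grid of levels[i] values in [-1, 1].

    Hypothetical helper for illustration; not part of the VoxCPM API.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    quantized = []
    for value, n_levels in zip(z, levels):
        if n_levels < 3 or n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts >= 3")
        half = (n_levels - 1) / 2            # 5 levels -> grid indices -2..2
        bounded = math.tanh(value) * half    # squash into (-half, half)
        quantized.append(round(bounded) / half)  # nearest grid point, rescaled to [-1, 1]
    return quantized


if __name__ == "__main__":
    print(fsq_quantize([0.0, 10.0, -10.0, 0.6], [5, 5, 5, 3]))
```

Because the grid is fixed rather than learned, there is no codebook to train; the bounded, coarse grid is what acts as the constraint on the representation.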
@@ -30,6 +30,13 @@ Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses
## Quick Start
### 🔧 Install from PyPI
@@ -61,13 +68,13 @@ wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
    prompt_wav_path=None,      # optional: path to a prompt speech for voice cloning
    cfg_value=2.0,             # LM guidance on LocDiT, higher for better adherence to the prompt, but maybe worse
+    inference_timesteps=10,   # LocDiT inference timesteps, higher for better result, lower for fast speed
+    normalize=True,           # enable external TN tool
+    denoise=True,             # enable external Denoise tool
+    retry_badcase=True,       # enable retrying mode for some bad cases (unstoppable)
+    retry_badcase_max_times=3,  # maximum retrying times
+    retry_badcase_ratio_threshold=6.0,  # maximum length restriction for bad case detection (simple but effective), it could be adjusted for slow pace speech
)

sf.write("output.wav", wav, 16000)
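
The three `retry_badcase*` parameters added above describe a length-based retry: regenerate when the output is too long relative to the input. The following is a minimal sketch of that idea, not VoxCPM's actual implementation. The helper names `looks_like_bad_case` and `generate_with_retry`, the `generate_once` callable, and the choice to measure output length with `len()` and input length in text units are all assumptions made for illustration.

```python
def looks_like_bad_case(audio_len, text_len, ratio_threshold=6.0):
    """Flag an output whose length is implausibly large for the given text.

    Hypothetical check: the units of both lengths are assumptions.
    """
    return audio_len > ratio_threshold * max(text_len, 1)


def generate_with_retry(generate_once, text_len, max_retries=3, ratio_threshold=6.0):
    """Call generate_once() until the output passes the check or retries run out."""
    result = generate_once()
    for _ in range(max_retries):
        if not looks_like_bad_case(len(result), text_len, ratio_threshold):
            break
        result = generate_once()  # sampling is stochastic, so a retry can differ
    return result


if __name__ == "__main__":
    # Stand-in generator: two runaway outputs, then one of plausible length.
    fake_outputs = iter([[0] * 100, [0] * 100, [0] * 20])
    out = generate_with_retry(lambda: next(fake_outputs), text_len=10)
    print(len(out))
```

Under this reading, slow-paced speech produces more output per unit of text, which is why the threshold would need to be raised for it.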
@@ -175,41 +182,41 @@ VoxCPM achieves competitive results on public zero-shot TTS benchmarks: