VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers preserve audio fidelity while dramatically reducing the computational cost of processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
The model can synthesize speech up to 90 minutes long with up to 4 distinct speakers, surpassing the typical 1-2 speaker limits of many prior models.
Try it out via Demo.
| Model | Context Length | Generation Length | Weight |
|---|---|---|---|
| VibeVoice-0.5B-Streaming | - | - | On the way |
| VibeVoice-1.5B | 64K | ~90 min | HF link |
| VibeVoice-7B | 32K | ~45 min | On the way |
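The 7.5 Hz frame rate is what lets these generation lengths fit in the context windows above. A minimal back-of-the-envelope check in Python (counting speech frames only; text tokens and speaker tags consume additional context, so the real budget is tighter):

```python
# Rough sanity check: speech frames produced per session at 7.5 Hz,
# compared against each model's context length from the table above.
# (Counts speech frames only; text tokens use extra context.)
FRAME_RATE_HZ = 7.5

for context_tokens, minutes in [(64 * 1024, 90), (32 * 1024, 45)]:
    frames = minutes * 60 * FRAME_RATE_HZ
    print(f"{minutes} min -> {frames:,.0f} frames vs {context_tokens:,}-token context")
    # 90 min -> 40,500 frames vs 65,536-token context
    # 45 min -> 20,250 frames vs 32,768-token context
```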
We recommend using the NVIDIA Deep Learning Container to manage the CUDA environment.
- Launch docker:

```bash
# NVIDIA PyTorch Container 24.07 / 24.10 / 24.12 verified.
# Later versions are also compatible.
sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it nvcr.io/nvidia/pytorch:24.07-py3

# If flash attention is not included in your docker environment, install it manually.
# Refer to https://github.com/Dao-AILab/flash-attention for installation instructions.
# pip install flash-attn --no-build-isolation
```
- Install from GitHub:

```bash
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/
pip install -e .
apt update && apt install ffmpeg -y  # for demo
```

- Launch the Gradio demo:

```bash
python demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B --share
```
- Inference from file:

```bash
# We provide some LLM-generated example scripts under demo/text_examples/;
# a scripted sketch for custom transcripts follows below.
# 1 speaker
python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/1p_abs.txt --speaker_names Alice
# or more speakers
python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/2p_zh.txt --speaker_names Alice Yunfan
```
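To drive synthesis from your own transcript rather than the bundled examples, a minimal sketch is below. It assumes the speaker-labeled line format used by the files in demo/text_examples/ ("Speaker 1:", "Speaker 2:", ... with names mapped in order via --speaker_names); check those files for the exact conventions before relying on it.

```python
# Hypothetical sketch: write a two-speaker transcript in the
# "Speaker N:" format (assumed from demo/text_examples/) and run
# the bundled inference script on it.
from pathlib import Path
import subprocess

script = """\
Speaker 1: Welcome back to the show. Today we're talking about long-form speech synthesis.
Speaker 2: Thanks for having me. Ninety minutes of audio from a single prompt still feels surreal.
"""
Path("my_podcast.txt").write_text(script, encoding="utf-8")

subprocess.run(
    [
        "python", "demo/inference_from_file.py",
        "--model_path", "microsoft/VibeVoice-1.5B",
        "--txt_path", "my_podcast.txt",
        # Names map in order to Speaker 1 / Speaker 2 in the transcript.
        "--speaker_names", "Alice", "Frank",
    ],
    check=True,
)
```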
- **Potential for Deepfakes and Disinformation:** High-quality synthetic speech can be misused to create convincing fake audio for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.
- **English and Chinese only:** Transcripts in languages other than English or Chinese may produce unexpected audio outputs.
- **Non-Speech Audio:** The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
- **Overlapping Speech:** The current model does not explicitly model or generate overlapping speech segments in conversations.
We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.