
VibeVoice-RT-Bridge

Windows SAPI5 Bridge for Microsoft's VibeVoice Realtime TTS

Based on VibeVoice Windows SAPI5

Demo


VibeVoice TTS in Diablo 4 - Narrating game events via SAPI5

What is this?

This project adds Windows SAPI5 integration to Microsoft's VibeVoice-Realtime-0.5B model, allowing any SAPI-compatible Windows application to use VibeVoice voices for text-to-speech.

Features:

  • SAPI5 COM DLL bridge (C++) that connects to the VibeVoice model
  • Named pipe server for efficient IPC between SAPI and Python
  • System tray app for auto-starting on Windows login
  • Installer/Manager UI for easy setup
  • 7 high-quality voices: Carter, Davis, Emma, Frank, Grace, Mike, Samuel

Architecture

Windows App (SAPI) --> VibeVoiceSAPI.dll --> Named Pipe --> Python Server --> VibeVoice Model (GPU)
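The exact wire format between VibeVoiceSAPI.dll and the Python server is not documented here; as an illustration of the named-pipe IPC step, a minimal length-prefixed JSON framing (an assumption, not the bridge's actual protocol) could look like:

```python
import json
import struct

def encode_request(text: str, voice: str) -> bytes:
    """Frame a hypothetical TTS request as length-prefixed UTF-8 JSON.

    The real VibeVoiceSAPI.dll protocol may differ; this only sketches
    how a SAPI request could travel over the named pipe.
    """
    payload = json.dumps({"voice": voice, "text": text}).encode("utf-8")
    # 4-byte little-endian length header, then the JSON body.
    return struct.pack("<I", len(payload)) + payload

def decode_request(data: bytes) -> dict:
    """Parse one framed message back into a request dict."""
    (length,) = struct.unpack_from("<I", data, 0)
    return json.loads(data[4:4 + length].decode("utf-8"))
```

Length-prefixed framing is a common choice for pipe IPC because it lets the reader know exactly how many bytes belong to each message.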

Quick Start

  1. Install dependencies:

    pip install -e .
  2. Start the SAPI server:

    python demo/sapi_pipe_server.py --device cuda:0
  3. Run the installer (as Administrator) to register voices:

    python sapi/install/vibevoice_installer.py
  4. Enable auto-start (optional):

    python sapi/install/vibevoice_tray.py --add-startup
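After registration, you can sanity-check the install from any SAPI client. A quick way (assuming pywin32 is available; the snippet falls back to an empty list on non-Windows platforms, and the voice names shown in the comment are illustrative):

```python
def list_sapi_voices() -> list:
    """Return the descriptions of all installed SAPI5 voices.

    Requires Windows with pywin32 installed; returns an empty list
    elsewhere so the snippet stays importable on any platform.
    """
    try:
        import win32com.client  # provided by the pywin32 package
    except ImportError:
        return []
    voice = win32com.client.Dispatch("SAPI.SpVoice")
    return [v.GetDescription() for v in voice.GetVoices()]

if __name__ == "__main__":
    # After a successful install, descriptions containing the seven
    # voice names (e.g. Carter, Emma, Grace) should appear here.
    print(list_sapi_voices())
```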

Credits


Original VibeVoice README

VibeVoice: Open-Source Frontier Voice AI

Project Page · Hugging Face · Technical Report


Overview

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

VibeVoice currently includes two model variants:

  • Long-form multi-speaker model: Synthesizes conversational/single-speaker speech up to 90 minutes with up to 4 distinct speakers, surpassing the typical 1–2 speaker limits of many prior models.
  • Realtime streaming TTS model: Produces initial audible speech in ~300 ms and supports streaming text input for single-speaker real-time speech generation; designed for low-latency generation.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
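The practical effect of the 7.5 Hz frame rate is easy to quantify: even a full 90-minute session reduces to a modest number of tokenizer frames for the LLM to process.

```python
# Frame count for long-form synthesis at the stated tokenizer rate.
frame_rate_hz = 7.5        # continuous tokenizer frame rate (from the paper)
duration_s = 90 * 60       # 90-minute long-form session, in seconds

frames = int(frame_rate_hz * duration_s)
print(frames)  # 40500 frames per tokenizer for 90 minutes of audio
```

At 40,500 frames, an hour-and-a-half of audio fits comfortably within a modern LLM context window, which is what makes the long-form multi-speaker setting tractable.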

Figures: MOS preference results and VibeVoice architecture overview.

🎵 Demo Examples

Video Demo

We produced this video with Wan2.2. We sincerely appreciate the Wan-Video team for their great work.

English

ES_._3.mp4

Chinese

default.mp4

Cross-Lingual

1p_EN2CH.mp4

Spontaneous Singing

2p_see_u_again.mp4

Long Conversation with 4 people

4p_climate_45min.mp4

For more examples, see the Project Page.

Risks and limitations

While efforts have been made to optimize the model through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions of its base model (Qwen2.5-1.5B in this release).

Potential for deepfakes and disinformation: High-quality synthetic speech can be misused to create convincing fake audio for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models lawfully, in full compliance with all applicable laws and regulations in the relevant jurisdictions. Disclosing the use of AI when sharing AI-generated content is best practice.

English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs.

Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.

Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.

We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.
