# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview

VibeVoice-Realtime is a lightweight (0.5B-parameter) real-time text-to-speech model that supports streaming text input and long-form speech generation. It produces the first audible speech in ~300 ms and is designed for low-latency generation. This repository contains only the streaming/realtime variant (single speaker, English).
## Commands

Install:

```bash
pip install -e .
```

Run the real-time WebSocket demo:

```bash
python demo/vibevoice_realtime_demo.py --model_path microsoft/VibeVoice-Realtime-0.5B --port 3000 --device cuda
```

On Windows, with the batch file:

```bash
run_demo.bat
```

Batch inference from a text file:

```bash
python demo/realtime_model_inference_from_file.py \
    --model_path microsoft/VibeVoice-Realtime-0.5B \
    --txt_path demo/text_examples/1p_vibevoice.txt \
    --speaker_name Carter \
    --device cuda
```

## Architecture

**Model (`vibevoice/modular/`):**
- `modeling_vibevoice_streaming_inference.py` - Main inference model (`VibeVoiceStreamingForConditionalGenerationInference`). Uses a staged forward approach: `forward_lm()` for the base text LM, `forward_tts_lm()` for the TTS LM with diffusion, and `generate()` for the full streaming pipeline
- `modeling_vibevoice_streaming.py` - Base model definition (`VibeVoiceStreamingModel`)
- `configuration_vibevoice_streaming.py` - Model configuration
- `modular_vibevoice_diffusion_head.py` - Diffusion head for speech generation
- `modular_vibevoice_tokenizer.py` - Acoustic tokenizer with streaming cache support
- `streamer.py` - `AudioStreamer` for real-time audio chunk delivery
**Processor (`vibevoice/processor/`):**

- `vibevoice_streaming_processor.py` - `VibeVoiceStreamingProcessor`, which handles text tokenization and audio I/O
- `vibevoice_tokenizer_processor.py` - Audio normalization and file handling
**Demo (`demo/`):**

- `vibevoice_realtime_demo.py` - Launches the FastAPI/uvicorn WebSocket server
- `web/app.py` - WebSocket endpoint (`/stream`) and the `StreamingTTSService` class
- `realtime_model_inference_from_file.py` - Batch inference from text files; the `VoiceMapper` class handles speaker-name-to-voice-file-path mapping
## Generation loop

The model uses windowed text prefill with interleaved speech generation:

- Text is fed in windows of 5 tokens (`TTS_TEXT_WINDOW_SIZE`)
- After each text window, 6 speech latents are sampled (`TTS_SPEECH_WINDOW_SIZE`)
- Speech latents are decoded to audio chunks via the acoustic tokenizer
- Audio chunks are streamed via `AudioStreamer` if one is provided
- A binary EOS classifier determines when to stop generation
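The interleaving schedule can be sketched in plain Python. This is a schematic only: the real model calls (`forward_lm()`, `forward_tts_lm()`, latent decoding) are stubbed out, and only the window scheduling is shown.

```python
# Schematic of the windowed text-prefill / speech-generation interleave.
# Model work is stubbed out; only the scheduling logic is illustrated.
TTS_TEXT_WINDOW_SIZE = 5    # text tokens fed per window
TTS_SPEECH_WINDOW_SIZE = 6  # speech latents sampled after each window

def interleave(text_tokens, eos_after=None):
    """Yield ("text", window) and ("speech", latents) events in order."""
    windows = [text_tokens[i:i + TTS_TEXT_WINDOW_SIZE]
               for i in range(0, len(text_tokens), TTS_TEXT_WINDOW_SIZE)]
    for step, window in enumerate(windows):
        yield ("text", window)                      # prefill one text window
        latents = [f"latent_{step}_{j}"             # stand-in for sampled latents
                   for j in range(TTS_SPEECH_WINDOW_SIZE)]
        yield ("speech", latents)                   # decode -> audio chunk -> streamer
        if eos_after is not None and step + 1 >= eos_after:
            break                                   # stand-in for the EOS classifier firing

events = list(interleave(list(range(12))))          # 12 tokens -> 3 text windows
```

Twelve input tokens split into windows of 5, 5, and 2, each followed by a 6-latent speech window, so text prefill and audio generation alternate rather than running sequentially.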
Key parameters:

- Sample rate: 24 kHz
- Acoustic tokenizer frame rate: 7.5 Hz
- Default diffusion inference steps: 5
- Default CFG scale: 1.5
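Combined with the 6-latent speech window from the generation loop, these numbers pin down the audio chunk sizes. A quick back-of-envelope check (pure arithmetic; assumes each latent frame decodes to a fixed-length chunk):

```python
SAMPLE_RATE = 24_000   # Hz
FRAME_RATE = 7.5       # acoustic-tokenizer latent frames per second
SPEECH_WINDOW = 6      # latents sampled per text window (TTS_SPEECH_WINDOW_SIZE)

samples_per_frame = int(SAMPLE_RATE / FRAME_RATE)       # 3200 samples per latent
seconds_per_window = SPEECH_WINDOW / FRAME_RATE         # 0.8 s of audio per window
samples_per_window = SPEECH_WINDOW * samples_per_frame  # 19200 samples per window
```

So every text window of 5 tokens yields roughly 0.8 seconds of audio, which is what makes the ~300 ms first-audio latency plausible.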
Device support:

- CUDA: bfloat16 + flash_attention_2 (recommended)
- MPS (Apple Silicon): float32 + SDPA
- CPU: float32 + SDPA
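A minimal helper matching the mapping above. The function name and string return values are illustrative, not the repo's actual loader; the strings would typically be translated to `torch_dtype` and `attn_implementation` arguments for `from_pretrained`:

```python
def select_backend(device: str) -> tuple[str, str]:
    """Map a device string to (dtype name, attention implementation)."""
    if device.startswith("cuda"):
        # flash_attention_2 requires the optional flash-attn package
        return "bfloat16", "flash_attention_2"
    # MPS (Apple Silicon) and CPU both fall back to float32 + SDPA
    return "float32", "sdpa"
```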
Voice presets are `.pt` files in `demo/voices/streaming_model/` containing a pre-computed KV cache for each voice prompt. Available voices: Carter, Davis, Emma, Frank, Grace, Mike (English), Samuel (Indian English).
## WebSocket API

Endpoint: `ws://localhost:3000/stream`

Query parameters:

- `text` - Text to synthesize
- `voice` - Voice preset name (e.g., "en-Carter_man")
- `cfg` - CFG scale (default: 1.5)
- `steps` - Diffusion inference steps (default: 5)

REST endpoints:

- `GET /` - Web UI
- `GET /config` - Returns available voices and the default voice
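A client builds the stream URL from those query parameters. A stdlib-only sketch (the parameter names and defaults come from the list above; the helper itself is illustrative):

```python
from urllib.parse import urlencode

def stream_url(text: str, voice: str = "en-Carter_man",
               cfg: float = 1.5, steps: int = 5,
               host: str = "localhost", port: int = 3000) -> str:
    """Build the ws://.../stream URL with properly encoded query parameters."""
    query = urlencode({"text": text, "voice": voice, "cfg": cfg, "steps": steps})
    return f"ws://{host}:{port}/stream?{query}"

url = stream_url("Hello world")
# A WebSocket client (e.g. the third-party `websockets` package) would then
# connect to this URL and read binary audio chunks as they arrive.
```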
## Limitations

- English only (other languages produce unpredictable results)
- Very short inputs (<3 words) may cause instability
- Does not support code, mathematical formulas, or uncommon symbols
- Single speaker only (use the multi-speaker variants for conversations)
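These limitations suggest a pre-flight check before sending text to the model. A hedged sketch: the <3-word threshold comes from the list above, but the symbol filter is a crude illustrative heuristic, not anything the model defines:

```python
def check_input(text: str) -> list[str]:
    """Return warnings for inputs the model is known to handle poorly."""
    warnings = []
    if len(text.split()) < 3:
        warnings.append("very short input (<3 words) may cause instability")
    # Crude heuristic for code/math/uncommon symbols; tune for your content.
    if any(ch in text for ch in "{}[]<>\\^_=+|~$#"):
        warnings.append("code, formulas, or uncommon symbols are unsupported")
    return warnings
```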
## SAPI integration (Windows)

The `sapi/` directory contains a Windows SAPI5 TTS engine that lets any SAPI-compatible application use VibeVoice voices.

```
SAPI Application → VibeVoiceSAPI.dll (C++) → Named Pipe → Python Server → VibeVoice Model
```
**C++ SAPI DLL (`sapi/VibeVoiceSAPI/`):**

- `VibeVoiceSAPI.cpp` - Implements the `ISpTTSEngine` and `ISpObjectWithToken` interfaces
- `VibeVoiceSAPI.h` - Header with the `PipeClient` class for named pipe communication
- Communicates with the Python server via `\\.\pipe\vibevoice`
**Python Pipe Server (`demo/sapi_pipe_server.py`):**

- Named pipe server using win32pipe
- Reuses `StreamingTTSService` from `web/app.py`
- Protocol: 4-byte length prefix + UTF-16LE text + 32-byte voice ID
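The request framing can be packed with `struct`. A sketch under stated assumptions: this doc does not specify the length prefix's byte order, whether the prefix counts text bytes or the whole message, or how the voice ID is padded — little-endian, text-bytes-only, and null padding are assumed here.

```python
import struct

VOICE_ID_BYTES = 32

def pack_request(text: str, voice_id: str) -> bytes:
    """Frame one request: 4-byte length prefix + UTF-16LE text + 32-byte voice ID.

    Assumes a little-endian prefix counting only the text payload bytes,
    and a null-padded ASCII voice ID.
    """
    payload = text.encode("utf-16-le")
    vid = voice_id.encode("ascii")
    if len(vid) > VOICE_ID_BYTES:
        raise ValueError("voice id too long")
    vid = vid.ljust(VOICE_ID_BYTES, b"\x00")  # null-pad to the fixed width
    return struct.pack("<I", len(payload)) + payload + vid

msg = pack_request("Hello", "Carter")
```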
**Windows Service (`service/vibevoice_service.py`):**

- Runs the pipe server as a Windows service
- Install: `python vibevoice_service.py install`
- Start: `net start VibeVoiceTTS`
Setup:

1. Build the DLL in Visual Studio (Release x64)
2. Run `sapi/install/install.bat` as Administrator
3. Start the service or run `run_sapi_server.bat`
```bash
# Start the pipe server manually
python demo/sapi_pipe_server.py --model_path microsoft/VibeVoice-Realtime-0.5B

# Test with the pipe client
python demo/test_pipe_client.py --text "Hello world" --voice Carter --output test.wav
```

After installation, these voices appear in Windows Settings > Speech:
- VibeVoice Carter (Male)
- VibeVoice Davis (Male)
- VibeVoice Emma (Female)
- VibeVoice Frank (Male)
- VibeVoice Grace (Female)
- VibeVoice Mike (Male)
- VibeVoice Samuel (Male)
## Dependencies

Key dependencies from `pyproject.toml`:
- transformers==4.51.3 (specific version required)
- torch, accelerate, diffusers
- fastapi, uvicorn (for WebSocket demo)
- flash-attn (optional, for CUDA acceleration)
- pywin32 (for SAPI pipe server and Windows service)