Based on BreezyVoice by MediaTek Labs.
A zero-shot voice cloning TTS system for Taiwanese-accented Mandarin. Give it a short audio clip of any speaker, and it generates natural speech in that voice — with phonetic control via 注音 (bopomofo).
BreezyVoiceX wraps MediaTek's BreezyVoice with a streamlined two-step workflow (cache speaker → synthesize), Windows support, and performance profiling. No Linux-only dependencies required.
- Fast zero-shot voice synthesis via prompt caching
- Built-in time profiler for each major inference step
- Fully runnable without the Linux-only ttsfrd dependency
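
To illustrate the per-step timing idea, here is a minimal sketch of a step profiler. It is not BreezyVoiceX's actual profiler: the `StepProfiler` class and the step names are hypothetical, and `time.sleep` stands in for real inference work.

```python
import time
from contextlib import contextmanager


class StepProfiler:
    """Hypothetical helper: accumulates wall-clock time per named step."""

    def __init__(self):
        self.timings = {}  # step name -> accumulated seconds

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the wrapped step raises.
            elapsed = time.perf_counter() - start
            self.timings[name] = self.timings.get(name, 0.0) + elapsed

    def report(self):
        return "\n".join(
            f"{name}: {secs:.3f}s" for name, secs in self.timings.items()
        )


profiler = StepProfiler()
with profiler.step("load_prompt_cache"):  # placeholder step name
    time.sleep(0.01)
with profiler.step("synthesize"):  # placeholder step name
    time.sleep(0.02)
print(profiler.report())
```

Wrapping each stage in its own `with` block keeps the timing code out of the inference logic, so stages can be added or removed without touching the measurement code.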
Python 3.11 is required. CUDA 12.1 is recommended for GPU users.
git clone https://github.com/Docat0209/BreezyVoiceX.git
cd BreezyVoiceX
pip install -r requirements.txt
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install WeTextProcessing --no-deps

UTF-8 encoding is required:
export PYTHONUTF8=1

This version separates the process into two explicit steps: caching the speaker prompt, then synthesizing speech from that cache.
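
Before running either step, it can help to confirm the setup requirements above. The following is a minimal sketch, not part of the repository: `check_environment` is a hypothetical helper, and it only inspects the `PYTHONUTF8` environment variable (UTF-8 mode enabled another way, such as `python -X utf8`, would not be detected).

```python
import os
import sys


def check_environment(version_info=sys.version_info, environ=os.environ):
    """Hypothetical helper: return a list of problems; empty means OK."""
    problems = []
    if tuple(version_info[:2]) != (3, 11):
        found = ".".join(str(part) for part in version_info[:2])
        problems.append(f"Python 3.11 is required, found {found}")
    if environ.get("PYTHONUTF8") != "1":
        # On Windows, `export` is unavailable: use `set PYTHONUTF8=1` in cmd
        # or `$env:PYTHONUTF8 = "1"` in PowerShell.
        problems.append("PYTHONUTF8 is not set to 1")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```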
Run single_inference.py with the following arguments:

Step 1, cache the speaker (`--mode cache`):

| Argument | Description |
|---|---|
| `--speaker_prompt_audio_path` | Required. Path to the speaker reference audio. |
| `--speaker_prompt_text_transcription` | Optional. Manual transcription. If not provided, Whisper will be used. |
| `--prompt_feature_path` | Optional. Output cache file path. Default: cache/prompt.pt. |
| `--model_path` | Optional. HF model ID or directory. Default: MediaTek-Research/BreezyVoice-300M. |

Step 2, synthesize (`--mode synthesize`):

| Argument | Description |
|---|---|
| `--content_to_synthesize` | Required. The target text for TTS. |
| `--prompt_feature_path` | Required. Path to previously saved speaker cache (.pt). |
| `--output_path` | Optional. Output WAV file path. Default: results/output.wav. |
| `--model_path` | Optional. HF model ID or directory. Default: MediaTek-Research/BreezyVoice-300M. |
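
The two argument sets above lend themselves to scripting: cache a speaker once, then reuse the cache for several sentences. The sketch below is an illustration only. The helper names (`cache_command`, `synthesize_command`, `clone_and_speak`) and the `results/output_{i}.wav` naming are hypothetical, the `--mode` flag is taken from the example usage, and the script assumes it is run from the repository root.

```python
import shlex
import subprocess
import sys


def cache_command(audio_path, feature_path="cache/prompt.pt", transcription=None):
    """Build the step 1 command (speaker caching)."""
    cmd = [sys.executable, "single_inference.py", "--mode", "cache",
           "--speaker_prompt_audio_path", audio_path,
           "--prompt_feature_path", feature_path]
    if transcription is not None:
        cmd += ["--speaker_prompt_text_transcription", transcription]
    return cmd


def synthesize_command(text, feature_path, output_path="results/output.wav"):
    """Build the step 2 command (synthesis from a cached speaker)."""
    return [sys.executable, "single_inference.py", "--mode", "synthesize",
            "--content_to_synthesize", text,
            "--prompt_feature_path", feature_path,
            "--output_path", output_path]


def clone_and_speak(audio_path, sentences, feature_path="cache/example.pt",
                    run=subprocess.run):
    """Cache the speaker once, then synthesize each sentence from the cache."""
    run(cache_command(audio_path, feature_path), check=True)
    for i, text in enumerate(sentences):
        out = f"results/output_{i}.wav"  # hypothetical naming scheme
        run(synthesize_command(text, feature_path, out), check=True)


# Print the commands instead of executing them, so this runs anywhere.
print(shlex.join(cache_command("data/example.wav", "cache/example.pt")))
print(shlex.join(synthesize_command("您好", "cache/example.pt")))
```

Passing the arguments as a list avoids shell quoting problems with non-ASCII text, and the injectable `run` parameter makes the workflow easy to dry-run or test.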
Example Usage:
python single_inference.py --mode cache --speaker_prompt_audio_path data/example.wav --prompt_feature_path cache/example.pt
python single_inference.py --mode synthesize --content_to_synthesize "您好,這是一段生成測試語音。" --prompt_feature_path cache/example.pt --output_path results/output.wav

This project is based on BreezyVoice by MediaTek Labs, a voice-cloning TTS system tailored for Taiwanese Mandarin with phonetic control via 注音 (bopomofo).
The original project was derived in part from CosyVoice, and is part of the Breeze2 model family.
We appreciate the efforts of the original authors, and this repository continues that work by providing deployment-ready infrastructure, Windows compatibility, and modular serving enhancements.
For official demo, model, and paper, please refer to: