|
1 | 1 | # BreezyVoiceX |
2 | 2 |
|
3 | 3 | > Based on [BreezyVoice](https://github.com/mtkresearch/BreezyVoice) by MediaTek Labs. |
4 | | -> This repository will be updated gradually with deployment optimizations and feature extensions. |
5 | 4 |
|
6 | | -Documentation and acknowledgements will be completed after core functionality is refactored. |
| 5 | +BreezyVoiceX is an enhanced version of MediaTek's [BreezyVoice](https://github.com/mtkresearch/BreezyVoice), focused on usability. |
| 6 | + |
| 7 | +## Key Improvements |
| 8 | +- Fast zero-shot voice synthesis via prompt caching |
| 9 | +- Built-in time profiler for each major inference step |
| 10 | +- Fully runnable without the Linux-only ttsfrd dependency |
| 11 | + |
| 12 | +## Install |
| 13 | + |
| 14 | +> Python 3.11 is required. CUDA 12.1 is recommended for GPU users. |
| 15 | +
|
| 16 | +### Clone the repo |
| 17 | +```bash |
| 18 | +git clone https://github.com/Docat0209/BreezyVoiceX.git |
| 19 | +cd BreezyVoiceX |
| 20 | +``` |
| 21 | + |
| 22 | +### Linux |
| 23 | +```bash |
| 24 | +pip install -r requirements.txt |
| 25 | +``` |
| 26 | + |
| 27 | +### Windows |
| 28 | +```bash |
| 29 | +pip install -r requirements.txt |
| 30 | +pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121 |
| 31 | +pip install WeTextProcessing --no-deps |
| 32 | +``` |
| 33 | + |
| 34 | +## Inference |
| 35 | + |
| 36 | +UTF-8 encoding is required: |
| 37 | + |
| 38 | +``` sh |
| 39 | +export PYTHONUTF8=1 |
| 40 | +``` |
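
The `export` line above uses POSIX shell syntax. To confirm that the setting reached the interpreter, Python reports UTF-8 mode through `sys.flags.utf8_mode`. A minimal sketch follows; the helper name `utf8_mode_enabled` is illustrative and not part of the repository:

```python
import sys


def utf8_mode_enabled() -> bool:
    """Return True when the interpreter runs in UTF-8 mode.

    UTF-8 mode is switched on by PYTHONUTF8=1 or by the -X utf8 option.
    """
    return bool(sys.flags.utf8_mode)


if __name__ == "__main__":
    # Hypothetical helper for a quick sanity check before running inference.
    print("UTF-8 mode enabled:", utf8_mode_enabled())
```

Running this in the same shell session used for inference shows whether the environment variable was picked up.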
| 41 | + |
| 42 | +--- |
| 43 | +> This version separates inference into two explicit steps: caching the speaker prompt, then synthesizing audio from the cached prompt. |
| 44 | +
|
| 45 | +**Run single_inference.py with the following arguments:** |
| 46 | + |
| 47 | +### `--mode cache` (Generate speaker prompt cache) |
| 48 | +| Argument | Description | |
| 49 | +| ------------------------------------- | ---------------------------------------------------------------------------------- | |
| 50 | +| `--speaker_prompt_audio_path` | Required. Path to the speaker reference audio. | |
| 51 | +| `--speaker_prompt_text_transcription` | Optional. Manual transcription. If not provided, Whisper will be used. | |
| 52 | +| `--prompt_feature_path` | Optional. Output cache file path. Default: `cache/prompt.pt`. | |
| 53 | +| `--model_path` | Optional. HF model ID or directory. Default: `MediaTek-Research/BreezyVoice-300M`. | |
| 54 | + |
| 55 | + |
| 56 | +### `--mode synthesize` (Generate audio) |
| 57 | + |
| 58 | +| Argument | Description | |
| 59 | +|----------|-------------| |
| 60 | +| `--content_to_synthesize` | Required. The target text for TTS. | |
| 61 | +| `--prompt_feature_path` | Required. Path to previously saved speaker cache (`.pt`). | |
| 62 | +| `--output_path` | Optional. Output WAV file path. Default: `results/output.wav`. | |
| 63 | +| `--model_path` | Optional. HF model ID or directory. Default: `MediaTek-Research/BreezyVoice-300M`. | |
| 64 | + |
| 65 | +**Example Usage:** |
| 66 | + |
| 67 | +### Step 1: Cache Speaker Prompt |
| 68 | +```bash |
| 69 | +python single_inference.py --mode cache --speaker_prompt_audio_path data/example.wav --prompt_feature_path cache/example.pt |
| 70 | +``` |
| 71 | + |
| 72 | +### Step 2: Synthesize Voice from Text |
| 73 | +```bash |
| 74 | +python single_inference.py --mode synthesize --content_to_synthesize "您好,這是一段生成測試語音。" --prompt_feature_path cache/example.pt --output_path results/output.wav |
| 75 | +``` |
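
Because the speaker prompt is cached once and then reused, the two steps above lend themselves to scripting: one `cache` call followed by any number of `synthesize` calls. The sketch below builds the same command lines from Python using only the flags listed in the tables above. The function names and the `results/output_{i}.wav` naming scheme are assumptions made for illustration, not part of the repository.

```python
import subprocess
import sys
from typing import List, Optional

SCRIPT = "single_inference.py"  # script name as given in the README


def cache_command(audio_path: str,
                  cache_path: str = "cache/prompt.pt",
                  transcription: Optional[str] = None) -> List[str]:
    """Build the argv for `--mode cache` (hypothetical helper)."""
    cmd = [sys.executable, SCRIPT, "--mode", "cache",
           "--speaker_prompt_audio_path", audio_path,
           "--prompt_feature_path", cache_path]
    if transcription is not None:
        # Optional flag; per the README, Whisper is used when it is omitted.
        cmd += ["--speaker_prompt_text_transcription", transcription]
    return cmd


def synthesize_command(text: str,
                       cache_path: str,
                       output_path: str = "results/output.wav") -> List[str]:
    """Build the argv for `--mode synthesize` (hypothetical helper)."""
    return [sys.executable, SCRIPT, "--mode", "synthesize",
            "--content_to_synthesize", text,
            "--prompt_feature_path", cache_path,
            "--output_path", output_path]


def run_pipeline(audio_path: str, texts: List[str],
                 cache_path: str = "cache/example.pt") -> None:
    """Cache the speaker prompt once, then synthesize every text with it."""
    subprocess.run(cache_command(audio_path, cache_path), check=True)
    for i, text in enumerate(texts):
        # Assumed naming scheme: one WAV file per input text.
        out = f"results/output_{i}.wav"
        subprocess.run(synthesize_command(text, cache_path, out), check=True)
```

Calling `run_pipeline("data/example.wav", ["您好,這是一段生成測試語音。"])` from the repository root would issue the Step 1 command once and the Step 2 command once per text, with `check=True` stopping the run on the first failing command.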
| 76 | + |
7 | 77 |
|
8 | 78 | ## Credits & Acknowledgement |
9 | 79 |
|
|