docs(readme): update instructions for two-step CLI and Windows support

Docat0209 · Docat0209 · commit 6e5083c14539 · 2025-06-08T21:27:12.000+08:00
diff --git a/README.md b/README.md
@@ -1,9 +1,79 @@
 # BreezyVoiceX
 
 > Based on [BreezyVoice](https://github.com/mtkresearch/BreezyVoice) by MediaTek Labs.  
-> This repository will be updated gradually with deployment optimizations and feature extensions.
 
-Documentation and acknowledgements will be completed after core functionality is refactored.
+BreezyVoiceX extends MediaTek's [BreezyVoice](https://github.com/mtkresearch/BreezyVoice) with a focus on usability.
+
+## Key Improvements
+- Fast zero-shot voice synthesis via prompt caching
+- Built-in timing profiler for each major inference step
+- Runs without the Linux-only `ttsfrd` dependency
+
+## Install
+
+> Python 3.11 is required. CUDA 12.1 is recommended for GPU users.
+
+### Clone the repo
+```bash
+git clone https://github.com/Docat0209/BreezyVoiceX.git
+cd BreezyVoiceX
+```
+
+### Linux
+```bash
+pip install -r requirements.txt
+```
+
+### Windows
+```bash
+pip install -r requirements.txt
+pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
+pip install WeTextProcessing --no-deps
+```
+
+## Inference
+
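+Python's UTF-8 mode must be enabled before running inference (on Windows, use `set PYTHONUTF8=1` in cmd or `$env:PYTHONUTF8 = "1"` in PowerShell instead):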
+
+```bash
+export PYTHONUTF8=1
+```
+
+---
+> This version splits inference into two explicit steps: caching the speaker prompt, then synthesizing audio from it.
+
+**Run `single_inference.py` with the following arguments:**
+
+### `--mode cache` (Generate Speaker Prompt Cache)
+| Argument                              | Description                                                                        |
+| ------------------------------------- | ---------------------------------------------------------------------------------- |
+| `--speaker_prompt_audio_path`         | Required. Path to the speaker reference audio.                                     |
+| `--speaker_prompt_text_transcription` | Optional. Transcript of the reference audio. If omitted, Whisper transcribes it.   |
+| `--prompt_feature_path`               | Optional. Output cache file path. Default: `cache/prompt.pt`.                      |
+| `--model_path`                        | Optional. HF model ID or directory. Default: `MediaTek-Research/BreezyVoice-300M`. |
+
+
+### `--mode synthesize` (Generate Audio)
+
+| Argument | Description |
+|----------|-------------|
+| `--content_to_synthesize` | Required. The target text for TTS. |
+| `--prompt_feature_path` | Required. Path to previously saved speaker cache (`.pt`). |
+| `--output_path` | Optional. Output WAV file path. Default: `results/output.wav`. |
+| `--model_path` | Optional. HF model ID or directory. Default: `MediaTek-Research/BreezyVoice-300M`. |
+
+**Example Usage:**
+
+### Step 1: Cache Speaker Prompt
+```bash
+python single_inference.py --mode cache --speaker_prompt_audio_path data/example.wav --prompt_feature_path cache/example.pt
+```
+
+### Step 2: Synthesize Voice from Text
+```bash
+python single_inference.py --mode synthesize --content_to_synthesize "您好，這是一段生成測試語音。" --prompt_feature_path cache/example.pt --output_path results/output.wav
+```
+
 
 ## Credits & Acknowledgement