Production-quality FastAPI backend for Qwen3-TTS voice cloning.
- ✅ Voice Profile Management - Create, update, delete voice profiles with multi-sample support
- ✅ Voice Cloning - Generate speech using voice profiles with caching
- ✅ Generation History - Full history tracking with search and filtering
- ✅ Transcription - Whisper-based audio transcription
- ✅ Multi-Sample Profiles - Combine multiple reference samples for better quality
- ✅ Voice Prompt Caching - Dual memory + disk caching for fast generation
- ✅ Audio Validation - Automatic validation of reference audio quality
- ✅ Model Management - Lazy loading and VRAM management
```
backend/
├── main.py                 # FastAPI app with all routes
├── models.py               # Pydantic request/response models
├── platform_detect.py      # Platform detection for backend selection
├── tts.py                  # TTS backend abstraction (delegates to MLX or PyTorch)
├── transcribe.py           # STT backend abstraction (delegates to MLX or PyTorch)
├── backends/               # Backend implementations
│   ├── __init__.py         # Backend factory and protocols
│   ├── mlx_backend.py      # MLX backend (Apple Silicon)
│   └── pytorch_backend.py  # PyTorch backend (Windows/Linux/Intel Mac)
├── profiles.py             # Voice profile CRUD
├── history.py              # Generation history
├── studio.py               # Audio editing (TODO)
├── database.py             # SQLite ORM
└── utils/
    ├── audio.py            # Audio processing utilities
    ├── cache.py            # Voice prompt caching
    └── validation.py       # Input validation
```
Voicebox automatically selects the best backend based on platform:
- Apple Silicon (M1/M2/M3): Uses MLX backend with native Metal acceleration (4-5x faster)
- Windows/Linux/Intel Mac: Uses PyTorch backend (CUDA GPU if available, CPU fallback)
The backend is detected at runtime via platform_detect.py. Both backends implement the same interface, so the API remains consistent across platforms.
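A minimal sketch of that selection rule (illustrative only — the actual `platform_detect.py` may check more, such as whether MLX is importable):

```python
import platform
from typing import Optional

def select_backend(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Return "mlx" on Apple Silicon, "pytorch" everywhere else.

    Parameters default to the current machine so the logic is easy to test.
    """
    system = system or platform.system()     # e.g. "Darwin", "Linux", "Windows"
    machine = machine or platform.machine()  # e.g. "arm64", "x86_64"
    if system == "Darwin" and machine == "arm64":
        return "mlx"      # Apple Silicon: Metal acceleration via MLX
    return "pytorch"      # everything else: CUDA if available, CPU fallback
```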
Root endpoint with version info.
Health check with model status.
Response:

```json
{
  "status": "healthy",
  "model_loaded": true,
  "gpu_available": true,
  "gpu_type": "Metal (Apple Silicon via MLX)",
  "backend_type": "mlx",
  "vram_used_mb": null
}
```

Backend Types:

- `"mlx"` - MLX backend (Apple Silicon with Metal acceleration)
- `"pytorch"` - PyTorch backend (Windows/Linux/Intel Mac)
Note: The database is automatically initialized when the server starts. No manual setup required.
Create a new voice profile.
Request:

```json
{
  "name": "My Voice",
  "description": "Optional description",
  "language": "en"
}
```

Response:

```json
{
  "id": "uuid",
  "name": "My Voice",
  "description": "Optional description",
  "language": "en",
  "created_at": "2024-01-01T00:00:00Z",
  "updated_at": "2024-01-01T00:00:00Z"
}
```

List all voice profiles.
Get a specific profile.
Update a profile.
Delete a profile and all associated samples.
Add a sample to a profile.
Form Data:

- `file`: Audio file (WAV, MP3, etc.)
- `reference_text`: Transcript of the audio
Response:

```json
{
  "id": "sample-uuid",
  "profile_id": "profile-uuid",
  "audio_path": "/path/to/sample.wav",
  "reference_text": "This is my voice"
}
```

List all samples for a profile.
Delete a specific sample.
Generate speech from text using a voice profile.
Request:

```json
{
  "profile_id": "uuid",
  "text": "Hello, this is a test.",
  "language": "en",
  "seed": 42
}
```

Response:

```json
{
  "id": "generation-uuid",
  "profile_id": "profile-uuid",
  "text": "Hello, this is a test.",
  "language": "en",
  "audio_path": "/path/to/audio.wav",
  "duration": 2.5,
  "seed": 42,
  "created_at": "2024-01-01T00:00:00Z"
}
```

List generation history with optional filters.
Query Parameters:

- `profile_id` (optional): Filter by profile
- `search` (optional): Search in text content
- `limit` (default: 50): Results per page
- `offset` (default: 0): Pagination offset
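The filter semantics can be pictured with a small in-memory version (illustrative only — the real endpoint queries SQLite):

```python
from typing import List, Optional

def filter_history(items: List[dict], profile_id: Optional[str] = None,
                   search: Optional[str] = None,
                   limit: int = 50, offset: int = 0) -> List[dict]:
    """Apply the history filters described above to a list of generations."""
    if profile_id is not None:
        items = [g for g in items if g["profile_id"] == profile_id]
    if search is not None:
        # Case-insensitive substring match on the generated text
        items = [g for g in items if search.lower() in g["text"].lower()]
    return items[offset:offset + limit]
```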
Get a specific generation.
Delete a generation.
Get generation statistics.
Response:

```json
{
  "total_generations": 100,
  "total_duration_seconds": 250.5,
  "generations_by_profile": {
    "profile-uuid-1": 50,
    "profile-uuid-2": 50
  }
}
```

Download generated audio file.
Returns WAV file with appropriate headers.
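The `duration` values in responses are the audio length in seconds; for a PCM WAV file that can be computed with the standard library (a sketch, not necessarily how the backend does it):

```python
import wave

def wav_duration(path: str) -> float:
    """Length of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()
```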
Transcribe audio file to text.
Form Data:

- `file`: Audio file
- `language` (optional): Language hint (`en` or `zh`)
Response:

```json
{
  "text": "Transcribed text here",
  "duration": 5.5
}
```

Manually load TTS model.
Query Parameters:

- `model_size`: Model size (`1.7B` or `0.6B`)
Unload TTS model to free memory.
Profiles:

- `id`: UUID primary key
- `name`: Profile name (unique)
- `description`: Optional description
- `language`: Language code (en/zh)
- `created_at`: Creation timestamp
- `updated_at`: Last update timestamp

Samples:

- `id`: UUID primary key
- `profile_id`: Foreign key to profiles
- `audio_path`: Path to audio file
- `reference_text`: Transcript

Generations:

- `id`: UUID primary key
- `profile_id`: Foreign key to profiles
- `text`: Generated text
- `language`: Language code
- `audio_path`: Path to audio file
- `duration`: Duration in seconds
- `seed`: Random seed (optional)
- `created_at`: Creation timestamp

Projects:

- `id`: UUID primary key
- `name`: Project name
- `data`: JSON data
- `created_at`: Creation timestamp
- `updated_at`: Last update timestamp
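A sketch of the corresponding SQLite DDL (hypothetical — the actual `database.py` may name columns or constraints differently), showing how deleting a profile cascades to its samples and generations:

```python
import sqlite3

SCHEMA = """
CREATE TABLE profiles (
    id          TEXT PRIMARY KEY,
    name        TEXT UNIQUE NOT NULL,
    description TEXT,
    language    TEXT NOT NULL DEFAULT 'en',
    created_at  TEXT NOT NULL,
    updated_at  TEXT NOT NULL
);
CREATE TABLE samples (
    id             TEXT PRIMARY KEY,
    profile_id     TEXT NOT NULL REFERENCES profiles(id) ON DELETE CASCADE,
    audio_path     TEXT NOT NULL,
    reference_text TEXT NOT NULL
);
CREATE TABLE generations (
    id         TEXT PRIMARY KEY,
    profile_id TEXT NOT NULL REFERENCES profiles(id) ON DELETE CASCADE,
    text       TEXT NOT NULL,
    language   TEXT NOT NULL,
    audio_path TEXT NOT NULL,
    duration   REAL NOT NULL,
    seed       INTEGER,
    created_at TEXT NOT NULL
);
"""

def connect(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE
    conn.executescript(SCHEMA)
    return conn
```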
```
data/
├── profiles/
│   └── {profile_id}/
│       ├── {sample_id}.wav
│       └── ...
├── generations/
│   └── {generation_id}.wav
├── cache/
│   └── {hash}.prompt
├── projects/
│   └── {project_id}.json
└── voicebox.db
```
```bash
pip install -r requirements.txt
```

Note: On Apple Silicon, also install MLX dependencies for faster inference:

```bash
pip install -r requirements-mlx.txt
```

The Qwen3-TTS models are automatically downloaded from HuggingFace Hub on first use, similar to how Whisper models work.

No manual download required! The models will be cached locally after the first download.

Available models:

- 1.7B (recommended): `Qwen/Qwen3-TTS-12Hz-1.7B-Base` (~4GB)
- 0.6B (faster): `Qwen/Qwen3-TTS-12Hz-0.6B-Base` (~2GB)

Note: The first generation will take longer as the model downloads. Subsequent generations will use the cached model.
If you prefer to download models manually or have limited internet during runtime:
```bash
# Install huggingface-cli
pip install huggingface_hub

# Download 1.7B model
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base

# Or use Python
python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-TTS-12Hz-1.7B-Base')"
```

Models are cached in `~/.cache/huggingface/hub/` by default.
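huggingface_hub stores each repo in a folder named after the repo id, with `/` replaced by `--` and a `models--` prefix. A small helper to locate a cached model (the default cache root is an assumption; `HF_HOME` can move it):

```python
from pathlib import Path

def hub_cache_dir(repo_id: str, cache_root: str = "~/.cache/huggingface/hub") -> Path:
    """Local cache folder for a Hub repo, e.g. models--Qwen--Qwen3-TTS-12Hz-1.7B-Base."""
    folder = "models--" + repo_id.replace("/", "--")
    return Path(cache_root).expanduser() / folder
```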
```bash
# Development (local only)
python -m backend.main

# Production (allow remote access)
python -m backend.main --host 0.0.0.0 --port 8000
```

```bash
# 1. Create profile
curl -X POST http://localhost:8000/profiles \
  -H "Content-Type: application/json" \
  -d '{"name": "My Voice", "language": "en"}'
# Response: {"id": "abc-123", ...}

# 2. Add sample
curl -X POST http://localhost:8000/profiles/abc-123/samples \
  -F "file=@sample.wav" \
  -F "reference_text=This is my voice sample"
```

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "abc-123",
    "text": "Hello, this is a test.",
    "language": "en",
    "seed": 42
  }'
# Response: {"id": "gen-456", "audio_path": "/path/to/audio.wav", ...}

# Download audio
curl http://localhost:8000/audio/gen-456 -o output.wav
```

```bash
curl -X POST http://localhost:8000/transcribe \
  -F "file=@audio.wav" \
  -F "language=en"
# Response: {"text": "Transcribed text", "duration": 5.5}
```

Add multiple samples to a profile for better quality:
```bash
# Add first sample
curl -X POST http://localhost:8000/profiles/abc-123/samples \
  -F "file=@sample1.wav" \
  -F "reference_text=First sample"

# Add second sample
curl -X POST http://localhost:8000/profiles/abc-123/samples \
  -F "file=@sample2.wav" \
  -F "reference_text=Second sample"

# Generation will automatically combine all samples
```

Voice prompts are automatically cached for faster generation:
- First generation: ~5-10 seconds (creates prompt)
- Subsequent generations: ~1-2 seconds (uses cached prompt)
Cache is stored in data/cache/ and persists across server restarts.
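The dual memory + disk layout can be sketched like this (illustrative; the real `utils/cache.py` likely keys on more inputs, such as the reference audio bytes):

```python
import hashlib
import pickle
from pathlib import Path

class PromptCache:
    """Two-tier cache: hot prompts in memory, all prompts on disk."""

    def __init__(self, cache_dir: str = "data/cache"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.memory = {}

    @staticmethod
    def key(profile_id: str, sample_texts: list) -> str:
        # Hash the inputs that determine the voice prompt.
        raw = profile_id + "|" + "|".join(sample_texts)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, key: str):
        if key in self.memory:            # tier 1: in-memory hit
            return self.memory[key]
        path = self.dir / f"{key}.prompt"
        if path.exists():                 # tier 2: disk hit
            value = pickle.loads(path.read_bytes())
            self.memory[key] = value      # promote to memory
            return value
        return None

    def put(self, key: str, value) -> None:
        self.memory[key] = value
        (self.dir / f"{key}.prompt").write_bytes(pickle.dumps(value))
```

Because the disk tier survives process restarts, a fresh server instance still avoids recomputing prompts.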
Models are lazy-loaded and can be manually unloaded:

```bash
# Unload TTS model
curl -X POST http://localhost:8000/models/unload

# Load specific model size
curl -X POST "http://localhost:8000/models/load?model_size=0.6B"
```

All endpoints return proper HTTP status codes:

- `200 OK`: Success
- `400 Bad Request`: Invalid input
- `404 Not Found`: Resource not found
- `500 Internal Server Error`: Server error
Error responses include details:
{
"detail": "Profile not found"
}- Use multi-sample profiles - Better quality than single sample
- Let caching work - Voice prompts are cached automatically
- Use 0.6B model on CPU - Faster than 1.7B with acceptable quality
- Use 1.7B model on GPU - Best quality, still fast
- Unload Whisper after transcription - Frees VRAM for TTS
- WebSocket support for generation progress
- Batch generation endpoint
- Audio effects (M3GAN, etc.)
- Voice design (text-to-voice)
- Audio studio timeline features
- Project management
- Authentication & rate limiting
- Export/import profiles
See main project LICENSE.