Production-quality FastAPI backend for Qwen3-TTS voice cloning.
- ✅ Voice Profile Management - Create, update, delete voice profiles with multi-sample support
- ✅ Voice Cloning - Generate speech using voice profiles with caching
- ✅ Generation History - Full history tracking with search and filtering
- ✅ Transcription - Whisper-based audio transcription
- ✅ Multi-Sample Profiles - Combine multiple reference samples for better quality
- ✅ Voice Prompt Caching - Dual memory + disk caching for fast generation
- ✅ Audio Validation - Automatic validation of reference audio quality
- ✅ Model Management - Lazy loading and VRAM management
```
backend/
├── main.py                 # FastAPI app with all routes
├── models.py               # Pydantic request/response models
├── platform_detect.py      # Platform detection for backend selection
├── tts.py                  # TTS backend abstraction (delegates to MLX or PyTorch)
├── transcribe.py           # STT backend abstraction (delegates to MLX or PyTorch)
├── backends/               # Backend implementations
│   ├── __init__.py         # Backend factory and protocols
│   ├── mlx_backend.py      # MLX backend (Apple Silicon)
│   └── pytorch_backend.py  # PyTorch backend (Windows/Linux/Intel Mac)
├── profiles.py             # Voice profile CRUD
├── history.py              # Generation history
├── studio.py               # Audio editing (TODO)
├── database.py             # SQLite ORM
└── utils/
    ├── audio.py            # Audio processing utilities
    ├── cache.py            # Voice prompt caching
    └── validation.py       # Input validation
```
Voicebox automatically selects the best backend based on platform:
- Apple Silicon (M1/M2/M3): Uses MLX backend with native Metal acceleration (4-5x faster)
- Windows/Linux/Intel Mac: Uses PyTorch backend (CUDA GPU if available, CPU fallback)
The backend is detected at runtime via platform_detect.py. Both backends implement the same interface, so the API remains consistent across platforms.
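A minimal sketch of that selection rule (illustrative only — the actual `platform_detect.py` may check more, such as whether MLX is importable):

```python
import platform
from typing import Optional

def select_backend(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Return "mlx" on Apple Silicon, "pytorch" everywhere else.

    Parameters default to the current machine so the logic is easy to test.
    """
    system = system or platform.system()     # e.g. "Darwin", "Linux", "Windows"
    machine = machine or platform.machine()  # e.g. "arm64", "x86_64"
    if system == "Darwin" and machine == "arm64":
        return "mlx"      # Apple Silicon: Metal acceleration via MLX
    return "pytorch"      # everything else: CUDA if available, CPU fallback
```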
Root endpoint with version info.
Health check with model status.
Response:

```json
{
  "status": "healthy",
  "model_loaded": true,
  "gpu_available": true,
  "gpu_type": "Metal (Apple Silicon via MLX)",
  "backend_type": "mlx",
  "vram_used_mb": null
}
```

Backend Types:

- `"mlx"` - MLX backend (Apple Silicon with Metal acceleration)
- `"pytorch"` - PyTorch backend (Windows/Linux/Intel Mac)
Note: The database is automatically initialized when the server starts. No manual setup required.
Create a new voice profile.
Request:

```json
{
  "name": "My Voice",
  "description": "Optional description",
  "language": "en"
}
```

Response:

```json
{
  "id": "uuid",
  "name": "My Voice",
  "description": "Optional description",
  "language": "en",
  "created_at": "2024-01-01T00:00:00Z",
  "updated_at": "2024-01-01T00:00:00Z"
}
```

List all voice profiles.
Get a specific profile.
Update a profile.
Delete a profile and all associated samples.
Add a sample to a profile.
Form Data:

- `file`: Audio file (WAV, MP3, etc.)
- `reference_text`: Transcript of the audio
Response:

```json
{
  "id": "sample-uuid",
  "profile_id": "profile-uuid",
  "audio_path": "/path/to/sample.wav",
  "reference_text": "This is my voice"
}
```

List all samples for a profile.
Delete a specific sample.
Generate speech from text using a voice profile.
Request:

```json
{
  "profile_id": "uuid",
  "text": "Hello, this is a test.",
  "language": "en",
  "seed": 42
}
```

Response:

```json
{
  "id": "generation-uuid",
  "profile_id": "profile-uuid",
  "text": "Hello, this is a test.",
  "language": "en",
  "audio_path": "/path/to/audio.wav",
  "duration": 2.5,
  "seed": 42,
  "created_at": "2024-01-01T00:00:00Z"
}
```

List generation history with optional filters.
Query Parameters:

- `profile_id` (optional): Filter by profile
- `search` (optional): Search in text content
- `limit` (default: 50): Results per page
- `offset` (default: 0): Pagination offset
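The filter semantics can be pictured with a small in-memory version (illustrative only — the real endpoint queries SQLite):

```python
from typing import List, Optional

def filter_history(items: List[dict], profile_id: Optional[str] = None,
                   search: Optional[str] = None,
                   limit: int = 50, offset: int = 0) -> List[dict]:
    """Apply the history filters described above to a list of generations."""
    if profile_id is not None:
        items = [g for g in items if g["profile_id"] == profile_id]
    if search is not None:
        # Case-insensitive substring match on the generated text
        items = [g for g in items if search.lower() in g["text"].lower()]
    return items[offset:offset + limit]
```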
Get a specific generation.
Delete a generation.
Get generation statistics.
Response:

```json
{
  "total_generations": 100,
  "total_duration_seconds": 250.5,
  "generations_by_profile": {
    "profile-uuid-1": 50,
    "profile-uuid-2": 50
  }
}
```

Download generated audio file.
Returns WAV file with appropriate headers.
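The `duration` values in responses are the audio length in seconds; for a PCM WAV file that can be computed with the standard library (a sketch, not necessarily how the backend does it):

```python
import wave

def wav_duration(path: str) -> float:
    """Length of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()
```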
Transcribe audio file to text.
Form Data:

- `file`: Audio file
- `language` (optional): Language hint (`en` or `zh`)
Response:

```json
{
  "text": "Transcribed text here",
  "duration": 5.5
}
```

Manually load TTS model.
Query Parameters:

- `model_size`: Model size (`1.7B` or `0.6B`)
Unload TTS model to free memory.
Profiles:

- `id`: UUID primary key
- `name`: Profile name (unique)
- `description`: Optional description
- `language`: Language code (en/zh)
- `created_at`: Creation timestamp
- `updated_at`: Last update timestamp

Samples:

- `id`: UUID primary key
- `profile_id`: Foreign key to profiles
- `audio_path`: Path to audio file
- `reference_text`: Transcript

Generations:

- `id`: UUID primary key
- `profile_id`: Foreign key to profiles
- `text`: Generated text
- `language`: Language code
- `audio_path`: Path to audio file
- `duration`: Duration in seconds
- `seed`: Random seed (optional)
- `created_at`: Creation timestamp

Projects:

- `id`: UUID primary key
- `name`: Project name
- `data`: JSON data
- `created_at`: Creation timestamp
- `updated_at`: Last update timestamp
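A sketch of the corresponding SQLite DDL (hypothetical — the actual `database.py` may name columns or constraints differently), showing how deleting a profile cascades to its samples and generations:

```python
import sqlite3

SCHEMA = """
CREATE TABLE profiles (
    id          TEXT PRIMARY KEY,
    name        TEXT UNIQUE NOT NULL,
    description TEXT,
    language    TEXT NOT NULL DEFAULT 'en',
    created_at  TEXT NOT NULL,
    updated_at  TEXT NOT NULL
);
CREATE TABLE samples (
    id             TEXT PRIMARY KEY,
    profile_id     TEXT NOT NULL REFERENCES profiles(id) ON DELETE CASCADE,
    audio_path     TEXT NOT NULL,
    reference_text TEXT NOT NULL
);
CREATE TABLE generations (
    id         TEXT PRIMARY KEY,
    profile_id TEXT NOT NULL REFERENCES profiles(id) ON DELETE CASCADE,
    text       TEXT NOT NULL,
    language   TEXT NOT NULL,
    audio_path TEXT NOT NULL,
    duration   REAL NOT NULL,
    seed       INTEGER,
    created_at TEXT NOT NULL
);
"""

def connect(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE
    conn.executescript(SCHEMA)
    return conn
```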
```
data/
├── profiles/
│   └── {profile_id}/
│       ├── {sample_id}.wav
│       └── ...
├── generations/
│   └── {generation_id}.wav
├── cache/
│   └── {hash}.prompt
├── projects/
│   └── {project_id}.json
└── voicebox.db
```
```bash
pip install -r requirements.txt
```

Note: On Apple Silicon, also install MLX dependencies for faster inference:

```bash
pip install -r requirements-mlx.txt
```

The Qwen3-TTS models are automatically downloaded from HuggingFace Hub on first use, similar to how Whisper models work.

No manual download required! The models will be cached locally after the first download.

Available models:

- 1.7B (recommended): `Qwen/Qwen3-TTS-12Hz-1.7B-Base` (~4GB)
- 0.6B (faster): `Qwen/Qwen3-TTS-12Hz-0.6B-Base` (~2GB)

Note: The first generation will take longer as the model downloads. Subsequent generations will use the cached model.
If you prefer to download models manually or have limited internet during runtime:
```bash
# Install huggingface-cli
pip install huggingface_hub

# Download 1.7B model
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base

# Or use Python
python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-TTS-12Hz-1.7B-Base')"
```

Models are cached in `~/.cache/huggingface/hub/` by default.
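huggingface_hub stores each repo in a folder named after the repo id, with `/` replaced by `--` and a `models--` prefix. A small helper to locate a cached model (the default cache root is an assumption; `HF_HOME` can move it):

```python
from pathlib import Path

def hub_cache_dir(repo_id: str, cache_root: str = "~/.cache/huggingface/hub") -> Path:
    """Local cache folder for a Hub repo, e.g. models--Qwen--Qwen3-TTS-12Hz-1.7B-Base."""
    folder = "models--" + repo_id.replace("/", "--")
    return Path(cache_root).expanduser() / folder
```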
```bash
# Development (local only)
python -m backend.main

# Production (allow remote access)
python -m backend.main --host 0.0.0.0 --port 8000
```

```bash
# 1. Create profile
curl -X POST http://localhost:8000/profiles \
  -H "Content-Type: application/json" \
  -d '{"name": "My Voice", "language": "en"}'
# Response: {"id": "abc-123", ...}

# 2. Add sample
curl -X POST http://localhost:8000/profiles/abc-123/samples \
  -F "file=@sample.wav" \
  -F "reference_text=This is my voice sample"
```

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "abc-123",
    "text": "Hello, this is a test.",
    "language": "en",
    "seed": 42
  }'
# Response: {"id": "gen-456", "audio_path": "/path/to/audio.wav", ...}

# Download audio
curl http://localhost:8000/audio/gen-456 -o output.wav
```

```bash
curl -X POST http://localhost:8000/transcribe \
  -F "file=@audio.wav" \
  -F "language=en"
# Response: {"text": "Transcribed text", "duration": 5.5}
```

Add multiple samples to a profile for better quality:
```bash
# Add first sample
curl -X POST http://localhost:8000/profiles/abc-123/samples \
  -F "file=@sample1.wav" \
  -F "reference_text=First sample"

# Add second sample
curl -X POST http://localhost:8000/profiles/abc-123/samples \
  -F "file=@sample2.wav" \
  -F "reference_text=Second sample"

# Generation will automatically combine all samples
```

Voice prompts are automatically cached for faster generation:
- First generation: ~5-10 seconds (creates prompt)
- Subsequent generations: ~1-2 seconds (uses cached prompt)
Cache is stored in data/cache/ and persists across server restarts.
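The dual memory + disk layout can be sketched like this (illustrative; the real `utils/cache.py` likely keys on more inputs, such as the reference audio bytes):

```python
import hashlib
import pickle
from pathlib import Path

class PromptCache:
    """Two-tier cache: hot prompts in memory, all prompts on disk."""

    def __init__(self, cache_dir: str = "data/cache"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.memory = {}

    @staticmethod
    def key(profile_id: str, sample_texts: list) -> str:
        # Hash the inputs that determine the voice prompt.
        raw = profile_id + "|" + "|".join(sample_texts)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, key: str):
        if key in self.memory:            # tier 1: in-memory hit
            return self.memory[key]
        path = self.dir / f"{key}.prompt"
        if path.exists():                 # tier 2: disk hit
            value = pickle.loads(path.read_bytes())
            self.memory[key] = value      # promote to memory
            return value
        return None

    def put(self, key: str, value) -> None:
        self.memory[key] = value
        (self.dir / f"{key}.prompt").write_bytes(pickle.dumps(value))
```

Because the disk tier survives process restarts, a fresh server instance still avoids recomputing prompts.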
Models are lazy-loaded and can be manually unloaded:

```bash
# Unload TTS model
curl -X POST http://localhost:8000/models/unload

# Load specific model size
curl -X POST "http://localhost:8000/models/load?model_size=0.6B"
```

All endpoints return proper HTTP status codes:

- `200 OK`: Success
- `400 Bad Request`: Invalid input
- `404 Not Found`: Resource not found
- `500 Internal Server Error`: Server error
Error responses include details:
{
"detail": "Profile not found"
}- Use multi-sample profiles - Better quality than single sample
- Let caching work - Voice prompts are cached automatically
- Use 0.6B model on CPU - Faster than 1.7B with acceptable quality
- Use 1.7B model on GPU - Best quality, still fast
- Unload Whisper after transcription - Frees VRAM for TTS
- WebSocket support for generation progress
- Batch generation endpoint
- Audio effects (M3GAN, etc.)
- Voice design (text-to-voice)
- Audio studio timeline features
- Project management
- Authentication & rate limiting
- Export/import profiles
See main project LICENSE.