
Voice Upload & Library

Travis Van Nimwegen edited this page Jun 22, 2025 · 1 revision

Voice Library Management

🎵 Overview

The Chatterbox TTS API now includes a voice library management system for uploading, managing, and using custom voices across all speech generation endpoints. Uploaded voices form a persistent collection that can be referenced by name in API calls.

✨ Key Features

  • Persistent Voice Storage: Uploaded voices are stored persistently and survive container restarts
  • Voice Selection by Name: Reference uploaded voices by name in any speech generation endpoint
  • Multiple Audio Formats: Support for MP3, WAV, FLAC, M4A, and OGG files
  • RESTful Voice Management: Full CRUD operations for voice management
  • Docker & Local Support: Works seamlessly with both Docker and direct Python installations
  • Frontend Integration: Complete voice management UI in the web frontend

🚀 Getting Started

For Docker Users

The voice library is automatically configured when using Docker. Voices are stored in a persistent volume:

# Start with voice library enabled
docker-compose up -d

# Your voices will be persisted in the "chatterbox-voices" Docker volume

For Local Python Users

Create a voice library directory (default: ./voices):

# Create voices directory
mkdir voices

# Or set custom location
export VOICE_LIBRARY_DIR="/path/to/your/voices"

📚 API Endpoints

List Voices

GET /v1/voices

Get a list of all voices in the library.

curl -X GET "http://localhost:4123/v1/voices"

Response:

{
  "voices": [
    {
      "name": "sarah_professional",
      "filename": "sarah_professional.mp3",
      "original_filename": "sarah_recording.mp3",
      "file_extension": ".mp3",
      "file_size": 1024768,
      "upload_date": "2024-01-15T10:30:00Z",
      "path": "/voices/sarah_professional.mp3"
    }
  ],
  "count": 1
}
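Clients often need only the voice names from this response. A minimal Python sketch, assuming the response shape shown above; `voice_names` and `fetch_voice_names` are hypothetical helper names, and the base URL is the local default used in this page's examples:

```python
import json
from urllib.request import urlopen


def voice_names(payload: dict) -> list[str]:
    """Return the sorted voice names from a GET /v1/voices response body."""
    return sorted(voice["name"] for voice in payload.get("voices", []))


def fetch_voice_names(base_url: str = "http://localhost:4123") -> list[str]:
    """Fetch the voice list from a running server and return the names."""
    with urlopen(f"{base_url}/v1/voices") as response:
        return voice_names(json.load(response))
```

Calling `fetch_voice_names()` requires a running server; `voice_names` can be used on any already-decoded response.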

Upload Voice

POST /v1/voices

Upload a new voice to the library.

curl -X POST "http://localhost:4123/v1/voices" \
  -F "voice_name=sarah_professional" \
  -F "voice_file=@/path/to/voice.mp3"

Parameters:

  • voice_name (string): Name for the voice (used in API calls)
  • voice_file (file): Audio file (MP3, WAV, FLAC, M4A, OGG, max 10MB)
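Checking a file on the client before uploading can avoid a rejected request. The sketch below mirrors the documented format list and size limit; `check_voice_file` is a hypothetical helper, and reading "10MB" as 10 × 1024 × 1024 bytes is an assumption about how the server counts:

```python
from pathlib import Path
from typing import Optional

# Formats as documented for the voice_file parameter.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}
# Assumption: "10MB" means 10 * 1024 * 1024 bytes; the server may count differently.
MAX_SIZE_BYTES = 10 * 1024 * 1024


def check_voice_file(filename: str, size_bytes: int) -> Optional[str]:
    """Return a problem description, or None if the file looks uploadable."""
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return f"unsupported format: {extension or '(none)'}"
    if size_bytes > MAX_SIZE_BYTES:
        return f"file too large: {size_bytes} bytes"
    return None
```

This only inspects the file name and size; the server still performs its own validation of the content.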

Delete Voice

DELETE /v1/voices/{voice_name}

Delete a voice from the library.

curl -X DELETE "http://localhost:4123/v1/voices/sarah_professional"

Rename Voice

PUT /v1/voices/{voice_name}

Rename an existing voice.

curl -X PUT "http://localhost:4123/v1/voices/sarah_professional" \
  -F "new_name=sarah_business"

Get Voice Info

GET /v1/voices/{voice_name}

Get detailed information about a specific voice.

curl -X GET "http://localhost:4123/v1/voices/sarah_professional"

Download Voice

GET /v1/voices/{voice_name}/download

Download the original voice file.

curl -X GET "http://localhost:4123/v1/voices/sarah_professional/download" \
  --output voice.mp3
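The delete, rename, info, and download endpoints all share the `/v1/voices/{voice_name}` path pattern. A small Python sketch for building those URLs follows; `voice_url` is a hypothetical helper, and percent-encoding the name is a defensive choice for client code rather than a documented requirement:

```python
from urllib.parse import quote


def voice_url(base_url: str, voice_name: str, download: bool = False) -> str:
    """Build the URL for one voice, percent-encoding the name for the path."""
    url = f"{base_url.rstrip('/')}/v1/voices/{quote(voice_name, safe='')}"
    return f"{url}/download" if download else url
```

For example, `voice_url("http://localhost:4123", "sarah_professional", download=True)` yields the download URL used in the curl command above.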

🎤 Using Voices in Speech Generation

JSON API (Recommended)

Use the voice name in the voice parameter:

curl -X POST "http://localhost:4123/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello! This is using my custom voice.",
    "voice": "sarah_professional",
    "exaggeration": 0.7,
    "temperature": 0.8
  }' \
  --output speech.wav
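The same JSON request can be issued from Python using only the standard library. This is a sketch under the assumptions of the curl example above (endpoint path, field names, local base URL); `build_speech_request` and `save_speech` are hypothetical helper names:

```python
import json
from typing import Optional
from urllib.request import Request, urlopen


def build_speech_request(
    text: str,
    voice: Optional[str] = None,
    base_url: str = "http://localhost:4123",
    **params: float,
) -> Request:
    """Build (but do not send) a POST /v1/audio/speech request."""
    payload = {"input": text, **params}
    if voice is not None:
        payload["voice"] = voice
    return Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(request: Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

With a server running, `save_speech(build_speech_request("Hello!", voice="sarah_professional", exaggeration=0.7), "speech.wav")` would reproduce the curl call.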

Form Data API

curl -X POST "http://localhost:4123/v1/audio/speech/upload" \
  -F "input=Hello! This is using my custom voice." \
  -F "voice=sarah_professional" \
  -F "exaggeration=0.7" \
  --output speech.wav

Streaming API

curl -X POST "http://localhost:4123/v1/audio/speech/stream" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "This will stream with my custom voice.",
    "voice": "sarah_professional"
  }' \
  --output stream.wav
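When consuming the streaming endpoint from Python, the response body can be read in chunks instead of being loaded all at once. A minimal sketch; `copy_stream` is a hypothetical helper that works with any binary file-like object, such as the response returned by `urllib.request.urlopen`:

```python
from typing import BinaryIO


def copy_stream(source: BinaryIO, sink: BinaryIO, chunk_size: int = 8192) -> int:
    """Copy source to sink chunk by chunk; return the number of bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means the stream has ended
            break
        sink.write(chunk)
        total += len(chunk)
    return total
```

The chunk size of 8192 bytes is an arbitrary illustrative value, not a setting taken from the API.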

🔧 Configuration

Environment Variables

# Voice library directory (default: ./voices for local, /voices for Docker)
VOICE_LIBRARY_DIR=/path/to/voices

# For Docker, this is typically set to /voices and mounted as a volume
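A script that needs to locate the same directory could resolve it as sketched below. This illustrates the documented default for local installs and is not the project's actual code; `voice_library_dir` is a hypothetical name:

```python
import os
from pathlib import Path
from typing import Mapping


def voice_library_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Return VOICE_LIBRARY_DIR if set and non-empty, else the local default ./voices."""
    return Path(env.get("VOICE_LIBRARY_DIR") or "./voices")
```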

Docker Configuration

The voice library is automatically configured in Docker with a persistent volume:

volumes:
  - chatterbox-voices:/voices

📝 Voice Naming Guidelines

Valid Characters

  • Letters (a-z, A-Z)
  • Numbers (0-9)
  • Underscores (_)
  • Hyphens (-)
  • Spaces (converted to underscores)

Invalid Characters

  • Forward/backward slashes (/, \)
  • Colons (:)
  • Asterisks (*)
  • Question marks (?)
  • Quotes (", ')
  • Angle brackets (<, >)
  • Pipes (|)

Examples

✅ Good names:
- "sarah_professional"
- "john-voice-2024"
- "female_american"
- "narration_style"

❌ Invalid names:
- "sarah/professional"  # Contains slash
- "voice:sample"        # Contains colon
- "my voice?"           # Contains question mark
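These rules can be checked on the client before an upload or rename. The sketch below is one strict reading of the guidelines: spaces become underscores, and anything outside the listed valid characters is rejected, including characters that appear in neither list (an assumption). `normalize_voice_name` is a hypothetical helper, not the server's sanitizer:

```python
import re

# Characters the guidelines list as valid, after spaces become underscores.
_VALID_NAME = re.compile(r"[A-Za-z0-9_-]+")


def normalize_voice_name(name: str) -> str:
    """Convert spaces to underscores and reject names with other characters."""
    candidate = name.replace(" ", "_")
    if not _VALID_NAME.fullmatch(candidate):
        raise ValueError(f"invalid voice name: {name!r}")
    return candidate
```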

🎯 Best Practices

Voice Quality

  • Use high-quality audio samples (16-48kHz sample rate)
  • Aim for 10-30 seconds of clean speech
  • Avoid background noise and music
  • Choose samples with consistent volume
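For WAV samples, the sample rate and duration can be checked with Python's standard `wave` module before uploading. This sketch covers uncompressed PCM WAV files only; other formats would need a different tool. `wav_stats` and `meets_guidelines` are hypothetical helper names:

```python
import wave


def wav_stats(path: str) -> tuple[int, float]:
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


def meets_guidelines(rate_hz: int, seconds: float) -> bool:
    """Check the 16-48 kHz and 10-30 second ranges suggested above."""
    return 16_000 <= rate_hz <= 48_000 and 10.0 <= seconds <= 30.0
```

Background noise and volume consistency are not covered by this check and still need to be judged by ear.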

File Management

  • Use descriptive voice names
  • Keep file sizes reasonable (< 10MB)
  • Organize voices by speaker or style
  • Clean up unused voices periodically

API Usage

  • Use the JSON API for better performance
  • Cache voice lists on the client side
  • Handle voice-not-found errors gracefully
  • Test voices before production use

🔍 Troubleshooting

Voice Not Found

{
  "error": {
    "message": "Voice 'my_voice' not found in voice library. Use /voices endpoint to list available voices.",
    "type": "voice_not_found_error"
  }
}

Solution: Check available voices with GET /v1/voices or upload the voice first.

Upload Failed

{
  "error": {
    "message": "Unsupported audio format: .txt. Supported formats: .mp3, .wav, .flac, .m4a, .ogg",
    "type": "invalid_request_error"
  }
}

Solution: Use a supported audio format and ensure the file is valid.

Voice Already Exists

{
  "error": {
    "message": "Voice 'sarah_professional' already exists",
    "type": "voice_exists_error"
  }
}

Solution: Use a different name or delete the existing voice first.
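Client code can map the `type` field of these error bodies to the suggested fixes. A minimal sketch, assuming the error shape shown in the examples above; `describe_error` is a hypothetical helper, and the hint for `invalid_request_error` may not fit every error of that type:

```python
import json

# Hints taken from the solutions listed above, keyed by the error "type".
_HINTS = {
    "voice_not_found_error": "Check available voices with GET /v1/voices or upload the voice first.",
    "invalid_request_error": "Use a supported audio format and ensure the file is valid.",
    "voice_exists_error": "Use a different name or delete the existing voice first.",
}


def describe_error(body: str) -> tuple[str, str]:
    """Return (message, hint) for an error response body; hint may be empty."""
    error = json.loads(body).get("error", {})
    message = error.get("message", "unknown error")
    hint = _HINTS.get(error.get("type"), "")
    return message, hint
```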

🎛️ Frontend Integration

The web frontend includes a complete voice library management interface:

  • Voice Library Panel: Browse and manage voices
  • Upload Modal: Easy voice upload with drag-and-drop
  • Voice Selection: Choose voices in the TTS interface
  • Preview Playback: Listen to voice samples before use
  • Rename/Delete: Manage voice metadata

📊 Migration from Client-Side Storage

If you were previously using the client-side voice library (localStorage), you'll need to re-upload your voices to the new server-side library for persistence and cross-device access.

🔗 API Aliases

All voice endpoints are reachable under several URL prefixes:

  • /v1/voices (recommended)
  • /voices
  • /voice-library
  • /voice_library

🏷️ OpenAI Compatibility

The voice parameter also accepts OpenAI voice names for compatibility:

  • alloy, echo, fable, onyx, nova, shimmer

These names map to the default configured voice sample; custom names are looked up among the uploaded voices in the library.
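The distinction can be expressed as a simple lookup. This is an illustrative sketch only: `voice_source` is a hypothetical name, the match is assumed to be case-sensitive, and how the server treats an uploaded voice that shares an OpenAI name is not documented here:

```python
OPENAI_VOICE_NAMES = frozenset({"alloy", "echo", "fable", "onyx", "nova", "shimmer"})


def voice_source(voice: str) -> str:
    """Return "default" for an OpenAI voice name, else "library"."""
    return "default" if voice in OPENAI_VOICE_NAMES else "library"
```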

🛡️ Security Considerations

  • Voice files are stored on the server filesystem
  • File uploads are validated for type and size
  • Voice names are sanitized to prevent path traversal
  • No authentication required (same as other endpoints)

📈 Performance Notes

  • Voice library operations are fast (< 100ms typical)
  • Voice files are loaded on-demand for TTS generation
  • Large voice files may increase TTS processing time
  • Consider voice file size vs. quality trade-offs

🆙 Future Enhancements

Planned features for future releases:

  • Voice categorization and tagging
  • Bulk voice operations
  • Voice sharing between users
  • Advanced voice metadata
  • Voice quality analysis
  • Automatic voice optimization

Voice Upload Feature Implementation Summary

🎤 Overview

Voice file upload functionality has been implemented for the Chatterbox TTS API. Users can upload a custom voice sample with each request, and full backward compatibility is maintained.

📋 Changes Made

1. Core Dependencies Added

python-multipart>=0.0.6 - Required for FastAPI multipart/form-data support

Files Updated:

  • requirements.txt - Added python-multipart dependency
  • pyproject.toml - Added python-multipart to project dependencies
  • All Docker files - Added python-multipart to pip install commands

2. Enhanced Speech Endpoint (app/api/endpoints/speech.py)

New Features:

  • Voice file upload support - Optional voice_file parameter
  • Multiple endpoint formats - Both JSON and form data support
  • File validation - Format, size, and content validation
  • Temporary file handling - Secure file processing with automatic cleanup
  • Backward compatibility - Existing JSON requests continue to work

Supported File Formats:

  • MP3 (.mp3)
  • WAV (.wav)
  • FLAC (.flac)
  • M4A (.m4a)
  • OGG (.ogg)
  • Maximum size: 10MB

New Endpoints:

  • POST /v1/audio/speech - Multipart form data (supports voice upload)
  • POST /v1/audio/speech/json - Legacy JSON endpoint (backward compatibility)

3. Comprehensive Testing

New Test Files:

  • tests/test_voice_upload.py - Dedicated voice upload testing
  • Updated tests/test_api.py - Tests both JSON and form data endpoints

Test Coverage:

  • ✅ Default voice (both endpoints)
  • ✅ Custom voice upload
  • ✅ File format validation
  • ✅ Error handling
  • ✅ Parameter validation
  • ✅ Backward compatibility

4. Updated Documentation

README.md Updates:

  • Added voice upload examples
  • Documented supported file formats
  • Provided usage examples in multiple languages (Python, cURL)
  • Added file requirements and best practices

🚀 Usage Examples

Basic Usage (Default Voice)

# JSON (legacy)
curl -X POST http://localhost:4123/v1/audio/speech/json \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world!"}' \
  --output output.wav

# Form data (new)
curl -X POST http://localhost:4123/v1/audio/speech \
  -F "input=Hello world!" \
  --output output.wav

Custom Voice Upload

curl -X POST http://localhost:4123/v1/audio/speech \
  -F "input=Hello with my custom voice!" \
  -F "exaggeration=0.8" \
  -F "voice_file=@my_voice.mp3" \
  --output custom_voice.wav

Python Example

import requests

# Upload a custom voice sample together with the text to synthesize
with open("my_voice.mp3", "rb") as voice_file:
    response = requests.post(
        "http://localhost:4123/v1/audio/speech",
        data={
            "input": "Hello with my custom voice!",
            "exaggeration": 0.8,
            "temperature": 1.0
        },
        files={
            "voice_file": ("my_voice.mp3", voice_file, "audio/mpeg")
        }
    )

# Fail loudly on an error response instead of saving it as audio
response.raise_for_status()

# Save the generated audio
with open("output.wav", "wb") as f:
    f.write(response.content)

🐳 Docker Support

All Docker files updated with python-multipart:

  • docker/Dockerfile - Standard Docker image
  • docker/Dockerfile.cpu - CPU-only image
  • docker/Dockerfile.gpu - GPU-enabled image
  • docker/Dockerfile.uv - uv-optimized image
  • docker/Dockerfile.uv.gpu - uv + GPU image

Docker Usage:

# Build and run with voice upload support
docker compose -f docker/docker-compose.yml up -d

# Test voice upload
curl -X POST http://localhost:4123/v1/audio/speech \
  -F "input=Hello from Docker!" \
  -F "voice_file=@my_voice.mp3" \
  --output docker_test.wav

🔧 Technical Implementation

File Processing Flow

  1. Upload - Receive multipart form data with optional voice file
  2. Validate - Check file format, size, and content
  3. Store - Create temporary file with secure naming
  4. Process - Use uploaded file or default voice sample for TTS
  5. Cleanup - Automatically remove temporary files

Memory Management

  • Temporary files are automatically cleaned up in finally blocks
  • File validation prevents oversized uploads
  • Secure temporary file creation with unique names
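The store, process, and cleanup steps described above can be sketched with Python's `tempfile` module and a `finally` block. This illustrates the pattern, not the project's actual implementation; `with_temp_voice` and its `use` callback are hypothetical names:

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def with_temp_voice(data: bytes, suffix: str, use: Callable[[str], T]) -> T:
    """Write data to a uniquely named temp file, call use(path), always clean up."""
    handle = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with handle:
            handle.write(data)  # file is flushed and closed when this block exits
        return use(handle.name)
    finally:
        # Runs on success and on error, so the file never outlives the request.
        if os.path.exists(handle.name):
            os.remove(handle.name)
```

Here `use` stands in for the TTS step that reads the voice sample from disk; the temporary file is removed whether that step returns or raises.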

Error Handling

  • File format validation with helpful error messages
  • File size limits (10MB maximum)
  • Graceful fallback to default voice on upload errors
  • Comprehensive error responses with error codes

🧪 Testing

Quick Test

# Start the API
python main.py

# Run comprehensive tests
python tests/test_voice_upload.py
python tests/test_api.py

Expected Test Results

  • ✅ Health check
  • ✅ API documentation endpoints
  • ✅ Legacy JSON endpoint compatibility
  • ✅ New form data endpoint
  • ✅ Voice file upload functionality
  • ✅ Error handling and validation

📚 API Documentation

The API documentation is automatically updated and now includes:

  • Multipart form data support
  • File upload parameters
  • Example requests and responses
  • Error codes and descriptions

✅ Backward Compatibility

100% backward compatible:

  • Existing JSON requests work unchanged
  • All previous API behavior preserved
  • Legacy endpoint (/v1/audio/speech/json) maintains exact same interface
  • No breaking changes to existing functionality

🔐 Security Considerations

  • File type validation rejects uploads in unsupported formats
  • File size limits (10MB) reduce exposure to resource-exhaustion (DoS) attacks
  • Temporary files use secure random naming
  • Automatic cleanup prevents file system bloat
  • Files uploaded per request are not stored persistently

📈 Performance Impact

  • Minimal overhead for JSON requests (unchanged code path)
  • Temporary file I/O only when voice files are uploaded
  • Efficient memory management with automatic cleanup
  • FastAPI's built-in multipart handling is highly optimized

Status: ✅ Complete and Production Ready

The voice upload feature is fully implemented, tested, and documented. Users can now upload custom voice files for personalized text-to-speech generation while maintaining full backward compatibility with existing implementations.
