TTS Studio - AI Voice Cloning

Professional voice cloning and text-to-speech system with hexagonal architecture, powered by Qwen3-TTS. Clone any voice with just a few audio samples and generate natural-sounding speech from text.

Desktop application coming soon! The core Python library is production-ready and can be integrated into your applications today.

Features

🎤 Voice Cloning: Clone any voice using 1-3 audio samples
🗣️ Text-to-Speech: Generate speech from text in the cloned voice
🎯 High Quality: Powered by Qwen3-TTS for natural-sounding results
⚡ Fast Processing: Optimized for Apple Silicon (MPS) and CUDA GPUs
📦 Batch Processing: Process multiple text segments at once
🏗️ Hexagonal Architecture: Clean, testable, maintainable code
🔧 Python API: Easy-to-use Python library for integration
🖥️ Desktop App: Native Tauri desktop application (coming soon)
📥 Model Management: Download and manage TTS models on-demand
🔒 Privacy-First: Everything runs locally, no cloud required

Architecture

TTS Studio uses a monorepo structure with hexagonal architecture (Ports & Adapters):

tts-studio/
├── apps/
│   ├── core/          # Python core library (hexagonal architecture)
│   │   ├── src/
│   │   │   ├── domain/      # Business logic (pure, no dependencies)
│   │   │   ├── app/         # Use cases and orchestration
│   │   │   ├── infra/       # Adapters (Qwen3, audio, storage)
│   │   │   ├── api/         # Python API entry point
│   │   │   └── shared/      # Shared utilities
│   │   ├── tests/           # Comprehensive test suite
│   │   ├── config/          # Configuration files
│   │   ├── data/            # Data directory (gitignored)
│   │   └── .env.example     # Environment variables template
│   └── desktop/       # Tauri desktop app (coming soon)
└── docs/              # Documentation

Hexagonal Architecture

The core library follows hexagonal architecture principles for maximum flexibility and testability:

Domain Layer: Pure business logic with zero external dependencies
- Entities (VoiceProfile, AudioSample)
- Ports (interfaces for TTS engines, audio processors, storage)
- Domain services (voice cloning logic)
Application Layer: Use cases that orchestrate domain logic
- CreateVoiceProfile, GenerateAudio, ValidateSamples
- DTOs for data transfer
- No business logic, only coordination
Infrastructure Layer: Concrete implementations (adapters)
- Qwen3 TTS engine adapter
- Librosa audio processor adapter
- File-based profile repository
- YAML configuration provider
API Layer: Entry point for external consumers
- TTSStudio class (main Python API)
- Dependency injection and wiring

This architecture makes the code:

✅ Easy to test: Domain logic testable without infrastructure
✅ Easy to maintain: Clear separation of concerns
✅ Easy to extend: Swap TTS engines without changing business logic
✅ Easy to understand: Follows SOLID principles

See docs/HEXAGONAL_ARCHITECTURE.md for detailed architecture documentation.

Quick Start

Installation

# Clone the repository
git clone https://github.com/bryanstevensacosta/tts-studio.git
cd tts-studio

# Navigate to core library
cd apps/core

# Run the automated setup script
./setup.sh

The setup script will:

Create a Python virtual environment
Install all dependencies
Set up pre-commit hooks for development

Model Download

TTS Studio uses an on-demand model download system. Models are not included in the installation to keep the package size small.

First-time setup:

from api.studio import TTSStudio

# Initialize the API (will prompt for model download if needed)
studio = TTSStudio()

# The Qwen3-TTS model (~3.4GB) will download automatically on first use
# This happens once and takes 10-15 minutes depending on your connection

Model storage locations:

macOS: ~/Library/Application Support/TTS Studio/models/
Windows: %LOCALAPPDATA%\TTS Studio\models\
Linux: ~/.local/share/tts-studio/models/

You can delete models anytime to free disk space and re-download them later.

Python API Usage

from api.studio import TTSStudio

# Initialize the API
studio = TTSStudio()

# 1. Validate audio samples
validation = studio.validate_samples(
    sample_paths=["./apps/core/data/samples/neutral_01.wav", "./apps/core/data/samples/happy_01.wav"]
)

if validation["all_valid"]:
    # 2. Create voice profile
    profile = studio.create_voice_profile(
        name="my_voice",
        sample_paths=["./apps/core/data/samples/neutral_01.wav", "./apps/core/data/samples/happy_01.wav"],
        language="es"
    )

    if profile["status"] == "success":
        # 3. Generate audio from text
        result = studio.generate_audio(
            profile_id=profile["profile"]["id"],
            text="Hola, esta es una prueba de mi voz clonada.",
            temperature=0.75,
            speed=1.0
        )

        if result["status"] == "success":
            print(f"Audio generated: {result['output_path']}")

See examples/api_usage.py for complete examples.

Installation Options

Option 1: Automated Setup (Recommended)

cd apps/core
./setup.sh

Option 2: Manual Setup

cd apps/core

# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate  # macOS/Linux
# or
venv\Scripts\activate  # Windows

# Install dependencies
pip install --upgrade pip
pip install -e ".[dev]"

# Install pre-commit hooks (for development)
pre-commit install
pre-commit install --hook-type commit-msg
pre-commit install --hook-type pre-push

Python API Reference

TTSStudio Class

Main API entry point for TTS Studio.

Methods

create_voice_profile(name, sample_paths, language="es", reference_text="")

Creates a voice profile from audio samples
Returns: {"status": "success|error", "profile": {...}, "error": None|str}

generate_audio(profile_id, text, temperature=0.75, speed=1.0, mode="clone")

Generates audio from text using a voice profile
Returns: {"status": "success|error", "output_path": str, "duration": float, ...}

list_voice_profiles()

Lists all available voice profiles
Returns: {"status": "success|error", "profiles": [...], "count": int, ...}

delete_voice_profile(profile_id)

Deletes a voice profile
Returns: {"status": "success|error", "deleted": bool, ...}

validate_samples(sample_paths)

Validates audio samples for quality
Returns: {"status": "success|error", "results": [...], "all_valid": bool, ...}

See docs/api.md for complete API documentation.

Audio Sample Requirements

For best results, your audio samples should be:

Format: WAV, 12000 Hz, mono, 16-bit
Duration: 3-30 seconds per sample
Quantity: 1-3 samples (Qwen3-TTS requires fewer samples)
Quality: Clear speech, no background noise
Variety: Different emotions and tones
Content: Natural speech, complete sentences

Sample Recording Tips

Environment: Record in a quiet room
Microphone: Use a decent quality mic (built-in MacBook mic is acceptable)
Distance: 15-20cm from microphone
Volume: Natural speaking volume (not whispering or shouting)
Emotions: Include neutral, happy, serious, calm tones
Avoid: Background noise, echo, mouth clicks, breathing sounds

Configuration

Create a apps/core/config/config.yaml file to customize settings (see apps/core/config/config.yaml.example):

# Model configuration
model:
  name: "Qwen/Qwen3-TTS-12Hz-1.7B-Base"
  device: "auto"  # auto, mps (M1/M2), or cpu
  dtype: "float32"  # Required for MPS

# Generation parameters
generation:
  language: "Spanish"
  temperature: 0.75  # 0.5-1.0 (consistency vs variety)
  speed: 1.0  # 0.8-1.2 (speaking speed)
  max_new_tokens: 2048  # Maximum tokens to generate

# Audio settings
audio:
  sample_rate: 12000  # Qwen3-TTS native
  channels: 1  # Mono
  bit_depth: 16

# Paths
paths:
  samples: "./data/samples"
  outputs: "./data/outputs"
  profiles: "./data/profiles"
  models: "./data/models"

Documentation

For detailed documentation, see:

Installation Guide - Detailed installation instructions
Usage Guide - Comprehensive usage examples
Development Guide - Contributing and development setup
API Documentation - API reference and integration guide
Hexagonal Architecture - Architecture overview

Requirements

Python 3.10 or 3.11
8GB+ RAM (16GB recommended for M1 Pro)
GPU recommended for faster processing:
- NVIDIA GPU with CUDA (Linux/Windows)
- Apple Silicon M1/M2 with MPS (macOS)
- CPU-only mode supported (slower)

Hardware Performance

Hardware	Generation Speed	Notes
M1 Pro (16GB)	~15-25s per minute	Native MPS acceleration
RTX 3060 (12GB)	~10-20s per minute	CUDA acceleration
Intel i7 (CPU)	~2-3 min per minute	CPU-only, slower

Development

We welcome contributions! See CONTRIBUTING.md for guidelines.

Quick Development Setup

# Clone and setup
git clone https://github.com/bryanstevensacosta/tts-studio.git
cd tts-studio/apps/core
./setup.sh

# Run tests from core library
cd apps/core
pytest

# Or from root
pytest apps/core/

# Run linting and formatting
make pre-commit

# See all available commands
make help

Code Quality

This project uses:

Black for code formatting
Ruff for linting
MyPy for type checking
pytest for testing
pre-commit for automated checks

All checks run automatically via pre-commit hooks.

Git Workflow

This project enforces a strict rebase workflow to maintain a clean, linear history. See docs/git-workflow.md for details.

Project Structure

tts-studio/
├── apps/
│   ├── core/                    # Python core library
│   │   ├── src/
│   │   │   ├── domain/         # Domain layer (business logic)
│   │   │   ├── app/            # Application layer (use cases)
│   │   │   ├── infra/          # Infrastructure layer (adapters)
│   │   │   ├── api/            # API layer (entry points)
│   │   │   └── shared/         # Shared utilities
│   │   ├── tests/              # Test suite
│   │   ├── config/             # Configuration files
│   │   ├── data/               # Data directory (gitignored)
│   │   │   ├── samples/        # Audio samples
│   │   │   ├── profiles/       # Voice profiles
│   │   │   ├── models/         # Cached models
│   │   │   └── outputs/        # Generated audio
│   │   ├── .env.example        # Environment variables template
│   │   ├── setup.py            # Package setup
│   │   └── requirements.txt    # Dependencies
│   └── desktop/                 # Tauri desktop app (coming soon)
├── docs/                        # Documentation
└── examples/                    # Usage examples

Troubleshooting

Common Issues

Import errors: Make sure you've activated the virtual environment:

cd apps/core
source venv/bin/activate  # macOS/Linux
# or
venv\Scripts\activate  # Windows

Model download fails: The Qwen3-TTS model (~3.4GB) downloads automatically on first use. Ensure you have:

Stable internet connection
At least 10GB free disk space
Patience (first download takes 10-15 minutes)

Model storage: Models are stored in OS-specific directories:

macOS: ~/Library/Application Support/TTS Studio/models/
Windows: %LOCALAPPDATA%\TTS Studio\models\
Linux: ~/.local/share/tts-studio/models/

You can delete models to free space and re-download them later.

Audio quality issues: Ensure your input samples are:

12000 Hz sample rate (or will be converted)
Mono (single channel)
16-bit depth
Clear speech without background noise
3-30 seconds duration each
At least 1 sample (1-3 recommended)

Generation is slow:

First generation is slower (model loading ~30-60 seconds)
CPU-only mode is significantly slower than GPU
For M1/M2 Mac: Ensure PyTorch has MPS support and dtype is set to float32
For NVIDIA GPU: Ensure CUDA is properly installed

Voice sounds robotic:

Add more samples with emotional variety
Ensure samples are high quality
Try adjusting temperature (0.7-0.9)
Record samples with natural expression

"Out of memory" errors:

Close other applications
Reduce batch size
Use shorter text chunks
Consider upgrading RAM (16GB recommended)

Getting Help

Check docs/development.md for detailed troubleshooting
Review the steering guides for workflow tips
Open an issue on GitHub with:
- Error message
- Python version
- Hardware specs
- Steps to reproduce

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built with Qwen3-TTS by Alibaba Cloud
Inspired by the open-source TTS community

Support

Roadmap

Note: This is a personal project for educational and research purposes. Please respect voice rights and obtain proper consent before cloning someone's voice.

Name		Name	Last commit message	Last commit date
Latest commit History 145 Commits
.github		.github
.kiro		.kiro
apps		apps
docs		docs
examples		examples
scripts		scripts
terraform		terraform
.editorconfig		.editorconfig
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
setup.sh		setup.sh

Folders and files

Latest commit

History

Repository files navigation

TTS Studio - AI Voice Cloning

Features

Architecture

Hexagonal Architecture

Quick Start

Installation

Model Download

Python API Usage

Installation Options

Option 1: Automated Setup (Recommended)

Option 2: Manual Setup

Python API Reference

TTSStudio Class

Methods

Audio Sample Requirements

Sample Recording Tips

Configuration

Documentation

Requirements

Hardware Performance

Development

Quick Development Setup

Code Quality

Git Workflow

Project Structure

Troubleshooting

Common Issues

Getting Help

License

Acknowledgments

Support

Roadmap

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages