Erio-Harrison/kokorotts_service

kokorotts_service - Text-to-Speech API Service

kokorotts_service is a Text-to-Speech (TTS) API service written in Rust. It is built on the Kokoro v1.0 model and runs inference through ONNX Runtime, aiming for fast synthesis with a low memory footprint. The service exposes a REST API with 10+ voice styles (American and British), real-time style mixing, automatic chunking of long text, and a choice of output format (WAV or MP3). It is intended as a lightweight way to add voice synthesis to local LLM workflows and real-time applications.

Based on:

  1. https://github.com/lucasjinreal/Kokoros
  2. https://huggingface.co/hexgrad/Kokoro-82M

📦 Model Information

  • Model: Kokoro v1.0 ONNX
  • Sample Rate: 24kHz
  • Voices: 10+ styles (American & British accents)
  • Languages: Multi-language support (primary: English)

Models are automatically downloaded on first run to the models/ directory.

🚀 Quick Start

1. Configuration

Edit config.toml:

[server]
host = "0.0.0.0"
port = 8080

[api]
keys = [
    "your-secret-api-key-here"
]

2. Run the Service

cargo run --release

The service will automatically download required models on first run.


📑 API Endpoints

Base URL: http://localhost:8080

Authentication: All API endpoints (except /health) require Bearer token authentication:

-H "Authorization: Bearer your-secret-api-key-here"
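For clients that do not use curl, the same header can be attached programmatically. The following is a minimal Python sketch using only the standard library; `build_request` is a hypothetical helper name, and the base URL and key are the placeholder values used throughout this README.

```python
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8080"      # base URL used in this README
API_KEY = "your-secret-api-key-here"    # placeholder key from config.toml


def build_request(path: str, api_key: Optional[str] = API_KEY) -> urllib.request.Request:
    """Build a GET request and attach the Bearer token unless api_key is None."""
    req = urllib.request.Request(BASE_URL + path, method="GET")
    if api_key is not None:
        # Same header that curl sends with -H "Authorization: Bearer ..."
        req.add_header("Authorization", f"Bearer {api_key}")
    return req
```

Passing the returned object to `urllib.request.urlopen` sends the request. For `/health`, pass `api_key=None`, since that endpoint is public.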

Available Endpoints

1. Health Check (Public)

Check if the service is running.

Endpoint: GET /health

Request:

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "timestamp": 1732723200
}

2. Get Available Voice Styles

List all available voice styles.

Endpoint: GET /api/styles

Request:

curl -H "Authorization: Bearer your-secret-api-key-here" \
     http://localhost:8080/api/styles

Response:

{
  "styles": [
    "af_bella",
    "af_nicole",
    "af_sarah",
    "af_sky",
    "am_adam",
    "am_michael",
    "bf_emma",
    "bf_isabella",
    "bm_george",
    "bm_lewis"
  ]
}

3. Get Service Status

Check current service load and capacity.

Endpoint: GET /api/status

Request:

curl -H "Authorization: Bearer your-secret-api-key-here" \
     http://localhost:8080/api/status

Response:

{
  "available_slots": 3,
  "max_concurrent": 4,
  "service_healthy": true,
  "estimated_wait_time_seconds": null
}

When the service is busy (no free slots):

{
  "available_slots": 0,
  "max_concurrent": 4,
  "service_healthy": true,
  "estimated_wait_time_seconds": 20
}
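A client can use these two responses to decide whether to send a request now or wait. The sketch below is one possible interpretation in Python, not part of the service: `seconds_to_wait` is a hypothetical helper, and the 5-second fallback for a missing estimate is an arbitrary assumption.

```python
import json

FALLBACK_WAIT_SECONDS = 5  # assumption: used only if the service gives no estimate


def seconds_to_wait(status_body: str) -> int:
    """Return 0 if a slot is free, otherwise the suggested wait in seconds."""
    status = json.loads(status_body)
    if status["available_slots"] > 0:
        return 0
    # estimated_wait_time_seconds is null when the service is not busy,
    # so guard against None even in the busy branch.
    wait = status.get("estimated_wait_time_seconds")
    return wait if wait is not None else FALLBACK_WAIT_SECONDS
```

A caller would typically sleep for the returned number of seconds and then poll `/api/status` again before submitting to `/api/tts`.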

4. Text-to-Speech Synthesis

Synthesize speech from text.

Endpoint: POST /api/tts

Request Body

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | ✅ Yes | - | Text to synthesize (max 10,000 characters) |
| language | string | No | "en" | Language code: en, zh, ja, etc. |
| style | string | No | "default" | Voice style (see /api/styles) |
| speed | float | No | 1.0 | Speech speed (0.5 - 2.0) |
| initial_silence | integer | No | null | Number of silence tokens at start |
| mono | boolean | No | true | Mono audio (false for stereo) |
| mp3 | boolean | No | false | Output as MP3 (true) or WAV (false) |
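The limits in the table can be checked on the client before a request is sent. The following Python sketch shows one way to do that; `build_tts_payload` is a hypothetical helper, and counting characters with `len()` is an assumption about how the server measures text length.

```python
from typing import Any, Dict, Optional

MAX_TEXT_LENGTH = 10_000   # documented maximum for "text"
MIN_SPEED, MAX_SPEED = 0.5, 2.0  # documented range for "speed"


def build_tts_payload(
    text: str,
    language: Optional[str] = None,
    style: Optional[str] = None,
    speed: Optional[float] = None,
    initial_silence: Optional[int] = None,
    mono: Optional[bool] = None,
    mp3: Optional[bool] = None,
) -> Dict[str, Any]:
    """Validate the documented limits and return a JSON-ready request body."""
    if not text:
        raise ValueError("text must not be empty")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    if speed is not None and not (MIN_SPEED <= speed <= MAX_SPEED):
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")

    optional = {
        "language": language,
        "style": style,
        "speed": speed,
        "initial_silence": initial_silence,
        "mono": mono,
        "mp3": mp3,
    }
    # Omit unset fields so the server-side defaults apply.
    payload: Dict[str, Any] = {"text": text}
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

Serializing the result with `json.dumps` gives a body equivalent to the `-d` argument in the curl examples.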

Response

  • Success: Binary audio file (WAV or MP3)

    • Content-Type: audio/wav or audio/mpeg
    • Content-Disposition: attachment; filename="tts_output.wav" or "tts_output.mp3"
  • Error: JSON error response

    {
      "success": false,
      "message": "Error description"
    }
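Because success and failure use different content types, a client can branch on the Content-Type header before writing anything to disk. A minimal Python sketch, assuming the header values listed above; `output_filename` is a hypothetical helper name.

```python
from typing import Optional

# Content types and file names documented for successful responses.
FILENAMES = {
    "audio/wav": "tts_output.wav",
    "audio/mpeg": "tts_output.mp3",
}


def output_filename(content_type: str) -> Optional[str]:
    """Return the file name for an audio response, or None for anything else.

    A None result means the body is not audio and should be treated as the
    JSON error object instead of being saved.
    """
    # Drop any parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";")[0].strip().lower()
    return FILENAMES.get(mime)
```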

Usage Examples

Example 1: Basic WAV Output

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a test. Congratulations! You did it!"
  }' \
  --output output.wav

Example 2: MP3 with Custom Voice

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is an MP3 output with a custom voice.",
    "style": "af_sky",
    "mp3": true
  }' \
  --output output.mp3

Example 3: Fast Speech

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "I speak fast! Can you follow my speed?",
    "style": "af_sarah",
    "speed": 1.5,
    "mp3": true
  }' \
  --output fast.mp3

Example 4: Slow Speech with Stereo

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "I speak slowly and clearly.",
    "style": "am_adam",
    "speed": 0.8,
    "mono": false
  }' \
  --output slow_stereo.wav

Example 5: Long Text

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is a very long text. The service will automatically split it into chunks and process them sequentially. Each chunk is processed independently to handle texts longer than the model token limit. The final audio is seamlessly concatenated to produce a smooth output.",
    "style": "bf_emma",
    "mp3": true
  }' \
  --output long.mp3

Example 6: Voice Style Mixing (Advanced)

You can blend multiple voice styles by using the + operator:

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This uses a mixed voice style.",
    "style": "af_sky.5+af_sarah.5",
    "mp3": true
  }' \
  --output mixed.mp3

Style Mixing Format: style1.weight+style2.weight

  • The number after the dot is a weight in tenths: it is multiplied by 0.1, so 5 means 0.5
  • Example: af_sky.5+af_sarah.5 = 50% af_sky + 50% af_sarah
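The format can be built and decoded with a few lines of string handling. The Python sketch below illustrates the notation only; both function names are hypothetical, and whether the server requires the weights to sum to 1.0 is not stated in this README.

```python
from typing import Dict


def format_style_mix(weights: Dict[str, int]) -> str:
    """Build a mix string from style name -> weight in tenths (5 means 0.5)."""
    return "+".join(f"{name}.{tenths}" for name, tenths in weights.items())


def parse_style_mix(spec: str) -> Dict[str, float]:
    """Decode 'style1.weight+style2.weight' into style name -> fraction."""
    result: Dict[str, float] = {}
    for part in spec.split("+"):
        name, dot, tenths = part.rpartition(".")
        if not dot or not name or not tenths.isdigit():
            raise ValueError(f"expected 'style.weight', got {part!r}")
        result[name] = int(tenths) / 10  # weight is expressed in tenths
    return result
```

For example, `parse_style_mix("af_sky.5+af_sarah.5")` yields 0.5 for each of the two styles, matching the 50/50 blend described above.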

🎀 Available Voice Styles

American Female (af_*)

  • af_bella - Bella (Female, American)
  • af_nicole - Nicole (Female, American)
  • af_sarah - Sarah (Female, American)
  • af_sky - Sky (Female, American)

American Male (am_*)

  • am_adam - Adam (Male, American)
  • am_michael - Michael (Male, American)

British Female (bf_*)

  • bf_emma - Emma (Female, British)
  • bf_isabella - Isabella (Female, British)

British Male (bm_*)

  • bm_george - George (Male, British)
  • bm_lewis - Lewis (Male, British)
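The identifiers follow a regular pattern: the first letter encodes the accent, the second the gender, and the part after the underscore is the voice name. The Python sketch below decodes that pattern for the ten voices listed here; `describe_style` is a hypothetical helper, and the mapping is inferred from this list rather than taken from the service.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "Female", "m": "Male"}


def describe_style(style_id: str) -> str:
    """Turn an identifier such as 'bm_george' into 'George (Male, British)'."""
    prefix, sep, name = style_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected style id: {style_id!r}")
    accent_code, gender_code = prefix
    if accent_code not in ACCENTS or gender_code not in GENDERS:
        raise ValueError(f"unknown prefix in style id: {style_id!r}")
    return f"{name.capitalize()} ({GENDERS[gender_code]}, {ACCENTS[accent_code]})"
```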

⚙️ Configuration Reference

Server Configuration

[server]
host = "0.0.0.0"  # Bind address
port = 8080        # Port number

TTS Configuration

[tts]
sample_rate = 24000              # Audio sample rate (Hz)
max_concurrent_requests = 4      # Maximum concurrent TTS requests
max_text_length = 10000          # Maximum input text length (characters)
max_tokens_per_chunk = 300       # Maximum tokens per processing chunk
default_language = "en"          # Default language
default_style = "default"        # Default voice style
default_speed = 1.0              # Default speech speed

Audio Configuration

[audio]
default_mono = true    # Default to mono audio
default_mp3 = false    # Default to WAV output
wav_bit_depth = 32     # WAV bit depth (16, 24, or 32)
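These settings determine how large uncompressed WAV output is. As a rough worked example in Python, the data rate is sample rate × bytes per sample × channels; the function name is hypothetical and the WAV header (a few dozen bytes) is ignored.

```python
def wav_bytes_per_second(sample_rate: int = 24000, bit_depth: int = 32, mono: bool = True) -> int:
    """Approximate uncompressed WAV data rate, ignoring the file header."""
    channels = 1 if mono else 2
    bytes_per_sample = bit_depth // 8
    return sample_rate * bytes_per_sample * channels


# With the defaults above: 24000 * 4 * 1 = 96000 bytes per second of audio.
default_rate = wav_bytes_per_second()
# One minute of default output is 96000 * 60 = 5,760,000 bytes.
one_minute = default_rate * 60
```

Lowering `wav_bit_depth` to 16 halves the data rate, while stereo output doubles it.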

Execution Provider

[execution]
provider = "cpu"  # Options: "cpu", "cuda", "coreml"

API Keys

[api]
keys = [
    "your-first-key",
    "your-second-key"
]

Service Settings

[service]
estimated_wait_seconds = 20  # Estimated wait time when queue is full

🔒 Security Notes

  • Never commit your API keys to version control
  • Use environment variables for production deployments
  • Consider implementing rate limiting per API key
  • Use HTTPS in production environments

🚦 Error Responses

400 Bad Request

{
  "success": false,
  "message": "Text cannot be empty"
}

or

{
  "success": false,
  "message": "Text too long. Maximum 10000 characters allowed."
}

401 Unauthorized

{
  "success": false,
  "message": "Invalid or missing API key"
}

500 Internal Server Error

{
  "success": false,
  "message": "Audio synthesis failed: [error details]"
}
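All three error types share the same JSON shape, so a client can convert them into a single exception type. The Python sketch below is one possible approach; `TtsApiError`, `error_from_response`, and `is_retryable` are hypothetical names, and treating only 5xx responses as retryable is an assumption, not documented service behavior.

```python
import json


class TtsApiError(Exception):
    """Error returned by the service, carrying the HTTP status and message."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def error_from_response(status: int, body: bytes) -> TtsApiError:
    """Extract the "message" field from a JSON error body."""
    try:
        message = json.loads(body)["message"]
    except (ValueError, KeyError, TypeError):
        # Body was not the documented {"success": false, "message": ...} shape.
        message = "unparseable error body"
    return TtsApiError(status, message)


def is_retryable(err: TtsApiError) -> bool:
    """Assumption: 400/401 need a fixed request or key; 5xx may be transient."""
    return err.status >= 500
```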

📊 Performance Tips

  1. Use MP3 for smaller file sizes (typically 10x smaller than WAV)
  2. Adjust max_concurrent_requests based on your CPU cores
  3. Monitor /api/status to check service load
  4. Split very long texts into multiple requests for better responsiveness
  5. Use speed parameter carefully (0.8-1.2 recommended for natural speech)
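Tip 4 can be applied on the client by splitting text at sentence boundaries before sending it. The following Python sketch shows one simple approach; `split_text` and the 1000-character default are illustrative assumptions and are unrelated to the service's own token-based chunking.

```python
import re
from typing import List


def split_text(text: str, max_chars: int = 1000) -> List[str]:
    """Group whole sentences into pieces of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current piece.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned piece can then be sent as its own `/api/tts` request, and the resulting audio files played or joined in order.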
