kokorotts_service - Text-to-Speech API Service

kokorotts_service is a Text-to-Speech (TTS) API service written in Rust. It is built on the Kokoro v1.0 model and runs inference through ONNX Runtime, aiming for fast synthesis with a low memory footprint. The REST API supports 10+ voice styles (American and British), real-time style mixing, automatic chunking of long text, and WAV or MP3 output. It is intended as a lightweight way to add voice synthesis to local LLM workflows and real-time applications.

Based on:

📦 Model Information

- Model: Kokoro v1.0 ONNX
- Sample Rate: 24 kHz
- Voices: 10+ styles (American & British accents)
- Languages: Multi-language support (primary: English)

Models are automatically downloaded on first run to the models/ directory.

🚀 Quick Start

1. Configuration

Edit config.toml:

[server]
host = "0.0.0.0"
port = 8080

[api]
keys = [
    "your-secret-api-key-here"
]

2. Run the Service

cargo run --release

The service will automatically download required models on first run.

📡 API Endpoints

Base URL: http://localhost:8080

Authentication: All API endpoints (except /health) require Bearer token authentication:

-H "Authorization: Bearer your-secret-api-key-here"
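The same header can be attached from a script. The following is a minimal sketch using only Python's standard library; `build_request` is a hypothetical helper name, and the base URL and key are the placeholders used throughout this README.

```python
import urllib.request

BASE_URL = "http://localhost:8080"      # base URL from this README
API_KEY = "your-secret-api-key-here"    # placeholder key from config.toml


def build_request(path: str, api_key: str = API_KEY) -> urllib.request.Request:
    """Return a GET request for `path` that carries the Bearer token header."""
    request = urllib.request.Request(BASE_URL + path)
    request.add_header("Authorization", f"Bearer {api_key}")
    return request
```

Passing the result to `urllib.request.urlopen` sends the request; that step needs the service to be running.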

🔍 Available Endpoints

1. Health Check (Public)

Check if the service is running.

Endpoint: GET /health

Request:

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "timestamp": 1732723200
}
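A monitoring script might read this response as follows. This is a sketch only: it assumes `timestamp` is seconds since the Unix epoch, which this README does not state, and `parse_health` is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone

# Sample body copied from the response shown above.
SAMPLE = '{"status": "healthy", "timestamp": 1732723200}'


def parse_health(body: str) -> tuple[bool, datetime]:
    """Return (is_healthy, server_time) from a /health response body."""
    data = json.loads(body)
    # Assumption: "timestamp" is seconds since the Unix epoch (UTC).
    server_time = datetime.fromtimestamp(data["timestamp"], tz=timezone.utc)
    return data["status"] == "healthy", server_time
```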

2. Get Available Voice Styles

List all available voice styles.

Endpoint: GET /api/styles

Request:

curl -H "Authorization: Bearer your-secret-api-key-here" \
     http://localhost:8080/api/styles

Response:

{
  "styles": [
    "af_bella",
    "af_nicole",
    "af_sarah",
    "af_sky",
    "am_adam",
    "am_michael",
    "bf_emma",
    "bf_isabella",
    "bm_george",
    "bm_lewis"
  ]
}

3. Get Service Status

Check current service load and capacity.

Endpoint: GET /api/status

Request:

curl -H "Authorization: Bearer your-secret-api-key-here" \
     http://localhost:8080/api/status

Response:

{
  "available_slots": 3,
  "max_concurrent": 4,
  "service_healthy": true,
  "estimated_wait_time_seconds": null
}

When the service is busy:

{
  "available_slots": 0,
  "max_concurrent": 4,
  "service_healthy": true,
  "estimated_wait_time_seconds": 20
}
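A client can use these fields to decide whether to submit a request now or back off first. The sketch below works on the two sample bodies above; `seconds_to_wait` is a hypothetical helper, and treating a missing estimate as zero is an assumption.

```python
import json

# Sample bodies copied from the two responses shown above.
IDLE = ('{"available_slots": 3, "max_concurrent": 4, '
        '"service_healthy": true, "estimated_wait_time_seconds": null}')
BUSY = ('{"available_slots": 0, "max_concurrent": 4, '
        '"service_healthy": true, "estimated_wait_time_seconds": 20}')


def seconds_to_wait(body: str) -> int:
    """Return 0 when a slot is free, otherwise the server's wait estimate."""
    status = json.loads(body)
    if not status["service_healthy"]:
        raise RuntimeError("service reports itself unhealthy")
    if status["available_slots"] > 0:
        return 0
    # The estimate is null while slots are free; fall back to 0 if absent.
    return status["estimated_wait_time_seconds"] or 0
```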

4. Text-to-Speech Synthesis

Synthesize speech from text.

Endpoint: POST /api/tts

Request Body

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `text` | string | ✅ Yes | - | Text to synthesize (max 10,000 characters) |
| `language` | string | No | `"en"` | Language code: `en`, `zh`, `ja`, etc. |
| `style` | string | No | `"default"` | Voice style (see `/api/styles`) |
| `speed` | float | No | `1.0` | Speech speed (0.5 - 2.0) |
| `initial_silence` | integer | No | `null` | Number of silence tokens at start |
| `mono` | boolean | No | `true` | Mono audio (`false` for stereo) |
| `mp3` | boolean | No | `false` | Output as MP3 (`true`) or WAV (`false`) |
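Checking the documented limits on the client side avoids a round trip for requests that would be rejected. The sketch below builds a JSON body from the parameters listed above; `build_tts_payload` is a hypothetical helper, and the server may apply checks that differ from these.

```python
import json

MAX_TEXT_LENGTH = 10_000            # documented limit for "text"
MIN_SPEED, MAX_SPEED = 0.5, 2.0     # documented range for "speed"
OPTIONAL = {"language", "style", "speed", "initial_silence", "mono", "mp3"}


def build_tts_payload(text: str, **options) -> str:
    """Return a JSON body for POST /api/tts after client-side checks."""
    unknown = set(options) - OPTIONAL
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    if not text.strip():
        raise ValueError("text cannot be empty")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    speed = options.get("speed")
    if speed is not None and not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
    # Omitted options are left out so the server-side defaults apply.
    return json.dumps({"text": text, **options})
```

For example, `build_tts_payload("Hello", style="af_sky", mp3=True)` yields a body with only those three fields.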

Response

Success: Binary audio file (WAV or MP3)
- Content-Type: audio/wav or audio/mpeg
- Content-Disposition: attachment; filename="tts_output.wav" or "tts_output.mp3"

Error: JSON error response

{
  "success": false,
  "message": "Error description"
}
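A client can pick the output file extension from the Content-Type header described above. This is a minimal sketch with the standard library; `output_filename` and `synthesize` are hypothetical helpers, and the base URL is the one used in this README.

```python
import json
import urllib.request

EXTENSIONS = {"audio/wav": ".wav", "audio/mpeg": ".mp3"}


def output_filename(content_type: str, stem: str = "tts_output") -> str:
    """Map a response Content-Type to a file name such as tts_output.wav."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in EXTENSIONS:
        raise ValueError(f"unexpected Content-Type: {content_type!r}")
    return stem + EXTENSIONS[media_type]


def synthesize(text: str, api_key: str,
               base_url: str = "http://localhost:8080", **options) -> str:
    """POST to /api/tts, save the audio, and return the file name."""
    request = urllib.request.Request(
        base_url + "/api/tts",
        data=json.dumps({"text": text, **options}).encode("utf-8"),
        method="POST",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request) as response:
        filename = output_filename(response.headers.get("Content-Type", ""))
        with open(filename, "wb") as audio_file:
            audio_file.write(response.read())
    return filename
```

Calling `synthesize("Hello", "your-secret-api-key-here", mp3=True)` requires the service to be running.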

📝 Usage Examples

Example 1: Basic WAV Output

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a test. Congratulations! You did it!"
  }' \
  --output output.wav

Example 2: MP3 with Custom Voice

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is an MP3 output with a custom voice.",
    "style": "af_sky",
    "mp3": true
  }' \
  --output output.mp3

Example 3: Fast Speech

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "I speak fast! Can you follow my speed?",
    "style": "af_sarah",
    "speed": 1.5,
    "mp3": true
  }' \
  --output fast.mp3

Example 4: Slow Speech with Stereo

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "I speak slowly and clearly.",
    "style": "am_adam",
    "speed": 0.8,
    "mono": false
  }' \
  --output slow_stereo.wav

Example 5: Long Text

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is a very long text. The service will automatically split it into chunks and process them sequentially. Each chunk is processed independently to handle texts longer than the model token limit. The final audio is seamlessly concatenated to produce a smooth output.",
    "style": "bf_emma",
    "mp3": true
  }' \
  --output long.mp3

Example 6: Voice Style Mixing (Advanced)

You can blend multiple voice styles by using the + operator:

curl -X POST http://localhost:8080/api/tts \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This uses a mixed voice style.",
    "style": "af_sky.5+af_sarah.5",
    "mp3": true
  }' \
  --output mixed.mp3

Style mixing format: style1.weight+style2.weight

- Each weight is multiplied by 0.1, so a weight of 5 means 0.5.
- Example: af_sky.5+af_sarah.5 = 50% af_sky + 50% af_sarah
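The format can be built and read back with two small helpers. This sketch only mirrors the rule stated above (name, dot, weight in tenths, joined with +); `mix_styles` and `parse_mix` are hypothetical names, and the sketch does not check that weights sum to 10 because this README does not say whether the service requires it.

```python
def mix_styles(weights: dict[str, int]) -> str:
    """Build a style string such as "af_sky.5+af_sarah.5" from tenths."""
    for name, tenths in weights.items():
        # Assumption: weights are positive whole numbers of tenths.
        if not isinstance(tenths, int) or tenths <= 0:
            raise ValueError(f"weight for {name} must be a positive integer")
    return "+".join(f"{name}.{tenths}" for name, tenths in weights.items())


def parse_mix(style: str) -> dict[str, float]:
    """Turn "af_sky.5+af_sarah.5" into {"af_sky": 0.5, "af_sarah": 0.5}."""
    fractions = {}
    for part in style.split("+"):
        name, tenths = part.rsplit(".", 1)
        fractions[name] = int(tenths) / 10
    return fractions
```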

🎤 Available Voice Styles

American Female (af_*)

af_bella - Bella (Female, American)
af_nicole - Nicole (Female, American)
af_sarah - Sarah (Female, American)
af_sky - Sky (Female, American)

American Male (am_*)

am_adam - Adam (Male, American)
am_michael - Michael (Male, American)

British Female (bf_*)

bf_emma - Emma (Female, British)
bf_isabella - Isabella (Female, British)

British Male (bm_*)

bm_george - George (Male, British)
bm_lewis - Lewis (Male, British)

⚙️ Configuration Reference

Server Configuration

[server]
host = "0.0.0.0"  # Bind address
port = 8080        # Port number

TTS Configuration

[tts]
sample_rate = 24000              # Audio sample rate (Hz)
max_concurrent_requests = 4      # Maximum concurrent TTS requests
max_text_length = 10000          # Maximum input text length (characters)
max_tokens_per_chunk = 300       # Maximum tokens per processing chunk
default_language = "en"          # Default language
default_style = "default"        # Default voice style
default_speed = 1.0              # Default speech speed

Audio Configuration

[audio]
default_mono = true    # Default to mono audio
default_mp3 = false    # Default to WAV output
wav_bit_depth = 32     # WAV bit depth (16, 24, or 32)
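These settings determine how large uncompressed WAV output is. As a rough guide, audio data size is sample rate × bytes per sample × channels × duration. The sketch below works that out for the defaults in this README; `wav_data_bytes` is a hypothetical helper and it ignores the WAV header, whose size depends on the encoder.

```python
def wav_data_bytes(seconds: float, sample_rate: int = 24000,
                   bit_depth: int = 32, mono: bool = True) -> int:
    """Estimate the size of the raw audio data in a WAV file, in bytes."""
    channels = 1 if mono else 2
    bytes_per_sample = bit_depth // 8
    return int(seconds * sample_rate * bytes_per_sample * channels)


# Ten seconds of audio with the default settings (24 kHz, 32-bit, mono).
print(wav_data_bytes(10))                   # 960000
print(wav_data_bytes(10, bit_depth=16))     # 480000
print(wav_data_bytes(10, mono=False))       # 1920000
```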

Execution Provider

[execution]
provider = "cpu"  # Options: "cpu", "cuda", "coreml"

API Keys

[api]
keys = [
    "your-first-key",
    "your-second-key"
]

Service Settings

[service]
estimated_wait_seconds = 20  # Estimated wait time when queue is full

🔒 Security Notes

- Never commit your API keys to version control
- Use environment variables for production deployments
- Consider implementing rate limiting per API key
- Use HTTPS in production environments
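On the client side, one way to keep keys out of source files is to read them from the environment. This is a sketch under assumptions: `KOKOROTTS_API_KEY` is a hypothetical variable name chosen for illustration, and this README does not document whether the service itself reads keys from environment variables.

```python
import os

# Hypothetical variable name; the service does not define it.
ENV_VAR = "KOKOROTTS_API_KEY"


def auth_headers_from_env(var_name: str = ENV_VAR) -> dict[str, str]:
    """Build the Authorization header from a key stored in the environment."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"set {var_name} before calling the API")
    return {"Authorization": f"Bearer {key}"}
```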

🚦 Error Responses

400 Bad Request

{
  "success": false,
  "message": "Text cannot be empty"
}

or

{
  "success": false,
  "message": "Text too long. Maximum 10000 characters allowed."
}

401 Unauthorized

{
  "success": false,
  "message": "Invalid or missing API key"
}

500 Internal Server Error

{
  "success": false,
  "message": "Audio synthesis failed: [error details]"
}
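Because every error shares the same JSON shape, a client can turn any of them into a single readable message. The sketch below is illustrative: `describe_error` is a hypothetical helper, and the hints are suggestions based on the status codes listed above rather than behavior defined by the service.

```python
import json

HINTS = {
    400: "fix the request body",
    401: "check the API key",
    500: "synthesis failed on the server",
}


def describe_error(status: int, body: bytes) -> str:
    """Summarize an error response as one line of text."""
    try:
        message = json.loads(body)["message"]
    except (ValueError, KeyError, TypeError):
        # Fall back to the raw body if it is not the documented JSON shape.
        message = body.decode("utf-8", errors="replace")
    hint = HINTS.get(status, "unexpected status")
    return f"HTTP {status}: {message} ({hint})"
```

With `urllib.request`, the status and body are available as `error.code` and `error.read()` on the raised `urllib.error.HTTPError`.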

📊 Performance Tips

- Use MP3 for smaller files (typically about 10x smaller than WAV).
- Adjust max_concurrent_requests based on the number of CPU cores available.
- Monitor /api/status to check service load.
- Split very long texts into multiple requests for better responsiveness.
- Use the speed parameter carefully (0.8-1.2 is recommended for natural-sounding speech).
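Splitting long text into several requests can be done at sentence boundaries on the client. This is a simple sketch, not the service's own chunking logic: `split_text` is a hypothetical helper, the 500-character default is an arbitrary choice, and a single sentence longer than the limit is kept whole.

```python
import re


def split_text(text: str, max_chars: int = 500) -> list[str]:
    """Group sentences into parts of at most max_chars characters each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    parts: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # The next sentence would overflow, so start a new part.
            parts.append(current)
            current = sentence
    if current:
        parts.append(current)
    return parts
```

Each returned part can then be sent as the `text` of its own `/api/tts` request.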
