kokorotts_service is a Text-to-Speech (TTS) API service written in Rust. It runs the Kokoro v1.0 model on ONNX Runtime and is designed for fast synthesis with a low memory footprint. The REST API supports 10+ voice styles (American and British), real-time style mixing, automatic chunking of long text, and WAV or MP3 output. It is a lightweight way to add voice synthesis to local LLM workflows and real-time applications.
- Model: Kokoro v1.0 ONNX
- Sample Rate: 24kHz
- Voices: 10+ styles (American & British accents)
- Languages: Multi-language support (primary: English)
Models are automatically downloaded on first run to the models/ directory.
Edit config.toml:
[server]
host = "0.0.0.0"
port = 8080
[api]
keys = [
"your-secret-api-key-here"
]

Start the service:

cargo run --release

The service will automatically download required models on first run.
Base URL: http://localhost:8080
Authentication: All API endpoints (except /health) require Bearer token authentication:
-H "Authorization: Bearer your-secret-api-key-here"

Check if the service is running.
Endpoint: GET /health
Request:
curl http://localhost:8080/health

Response:
{
"status": "healthy",
"timestamp": 1732723200
}

List all available voice styles.
Endpoint: GET /api/styles
Request:
curl -H "Authorization: Bearer your-secret-api-key-here" \
http://localhost:8080/api/styles

Response:
{
"styles": [
"af_bella",
"af_nicole",
"af_sarah",
"af_sky",
"am_adam",
"am_michael",
"bf_emma",
"bf_isabella",
"bm_george",
"bm_lewis"
]
}

Check current service load and capacity.
Endpoint: GET /api/status
Request:
curl -H "Authorization: Bearer your-secret-api-key-here" \
http://localhost:8080/api/status

Response:
{
"available_slots": 3,
"max_concurrent": 4,
"service_healthy": true,
"estimated_wait_time_seconds": null
}

When the service is busy:
{
"available_slots": 0,
"max_concurrent": 4,
"service_healthy": true,
"estimated_wait_time_seconds": 20
}

Synthesize speech from text.
Endpoint: POST /api/tts
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `text` | string | Yes | - | Text to synthesize (max 10,000 characters) |
| `language` | string | No | `"en"` | Language code: `en`, `zh`, `ja`, etc. |
| `style` | string | No | `"default"` | Voice style (see `/api/styles`) |
| `speed` | float | No | `1.0` | Speech speed (0.5 - 2.0) |
| `initial_silence` | integer | No | `null` | Number of silence tokens at start |
| `mono` | boolean | No | `true` | Mono audio (`false` for stereo) |
| `mp3` | boolean | No | `false` | Output as MP3 (`true`) or WAV (`false`) |
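
The documented limits can be checked on the client before a request is sent. The following is a minimal Python sketch, not part of the service: `build_tts_payload` is a hypothetical helper name, and the limits are copied from the parameter descriptions (10,000 characters, speed 0.5 to 2.0). The server remains the authority on what it accepts.

```python
MAX_TEXT_LENGTH = 10_000      # documented maximum for "text"
SPEED_RANGE = (0.5, 2.0)      # documented range for "speed"


def build_tts_payload(text, language=None, style=None, speed=None,
                      initial_silence=None, mono=None, mp3=None):
    """Return a JSON-ready dict for POST /api/tts (hypothetical helper).

    Optional fields left as None are omitted so the server defaults apply.
    """
    if not text:
        raise ValueError("text cannot be empty")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text too long: {len(text)} > {MAX_TEXT_LENGTH} characters")
    if speed is not None and not (SPEED_RANGE[0] <= speed <= SPEED_RANGE[1]):
        raise ValueError(f"speed must be between {SPEED_RANGE[0]} and {SPEED_RANGE[1]}")

    payload = {"text": text}
    optional = {
        "language": language,
        "style": style,
        "speed": speed,
        "initial_silence": initial_silence,
        "mono": mono,
        "mp3": mp3,
    }
    # Keep only the fields the caller actually set.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

The returned dict can be serialized with `json.dumps` and sent as the request body.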
Response:

- Success: Binary audio file (WAV or MP3)
  - Content-Type: `audio/wav` or `audio/mpeg`
  - Content-Disposition: `attachment; filename="tts_output.wav"` or `"tts_output.mp3"`
- Error: JSON error response

{ "success": false, "message": "Error description" }
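
Because the endpoint returns either binary audio or a JSON error, a client has to branch on the response type. The sketch below shows one way to do that with the Python standard library. `TTSError`, `parse_tts_response`, and `synthesize` are hypothetical names, and treating every non-audio Content-Type as an error body is an assumption rather than documented behavior.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8080"


class TTSError(Exception):
    """Raised when the service answers with a JSON error instead of audio."""


def parse_tts_response(content_type, body):
    """Return (file_extension, audio_bytes) or raise TTSError."""
    ctype = content_type.split(";")[0].strip().lower()
    if ctype == "audio/wav":
        return "wav", body
    if ctype == "audio/mpeg":
        return "mp3", body
    # Assumption: anything else is the documented JSON error shape.
    try:
        data = json.loads(body.decode("utf-8"))
    except ValueError:
        data = None
    message = data.get("message") if isinstance(data, dict) else None
    raise TTSError(message or "unexpected response from service")


def synthesize(payload, api_key, base_url=BASE_URL):
    """POST a payload dict to /api/tts and return (extension, audio_bytes)."""
    request = urllib.request.Request(
        base_url + "/api/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return parse_tts_response(
                response.headers.get("Content-Type", ""), response.read()
            )
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a body; surface its message.
        return parse_tts_response(err.headers.get("Content-Type", ""), err.read())
```

A caller would write the returned bytes to a file named with the returned extension, for example `output.wav` or `output.mp3`.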
curl -X POST http://localhost:8080/api/tts \
-H "Authorization: Bearer your-secret-api-key-here" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello, this is a test. Congratulations! You did it!"
}' \
--output output.wav

curl -X POST http://localhost:8080/api/tts \
-H "Authorization: Bearer your-secret-api-key-here" \
-H "Content-Type: application/json" \
-d '{
"text": "This is an MP3 output with a custom voice.",
"style": "af_sky",
"mp3": true
}' \
--output output.mp3

curl -X POST http://localhost:8080/api/tts \
-H "Authorization: Bearer your-secret-api-key-here" \
-H "Content-Type: application/json" \
-d '{
"text": "I speak fast! Can you follow my speed?",
"style": "af_sarah",
"speed": 1.5,
"mp3": true
}' \
--output fast.mp3

curl -X POST http://localhost:8080/api/tts \
-H "Authorization: Bearer your-secret-api-key-here" \
-H "Content-Type: application/json" \
-d '{
"text": "I speak slowly and clearly.",
"style": "am_adam",
"speed": 0.8,
"mono": false
}' \
--output slow_stereo.wav

curl -X POST http://localhost:8080/api/tts \
-H "Authorization: Bearer your-secret-api-key-here" \
-H "Content-Type: application/json" \
-d '{
"text": "This is a very long text. The service will automatically split it into chunks and process them sequentially. Each chunk is processed independently to handle texts longer than the model token limit. The final audio is seamlessly concatenated to produce a smooth output.",
"style": "bf_emma",
"mp3": true
}' \
--output long.mp3

You can blend multiple voice styles by using the + operator:
curl -X POST http://localhost:8080/api/tts \
-H "Authorization: Bearer your-secret-api-key-here" \
-H "Content-Type: application/json" \
-d '{
"text": "This uses a mixed voice style.",
"style": "af_sky.5+af_sarah.5",
"mp3": true
}' \
--output mixed.mp3

Style Mixing Format: `style1.weight+style2.weight`
- Weights are multiplied by 0.1 (so `5` means 0.5)
- Example: `af_sky.5+af_sarah.5` = 50% af_sky + 50% af_sarah
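
The mixing format can be built and inspected on the client. This Python sketch follows the `style.weight` rule described above. `format_style_mix` and `parse_style_mix` are hypothetical helper names, and treating a bare style name as weight 1.0 is an assumption, since the format description only covers weighted entries.

```python
def format_style_mix(weights):
    """Build a mix string from {style: integer weight}, e.g. {"af_sky": 5}."""
    if not weights:
        raise ValueError("at least one style is required")
    return "+".join(f"{name}.{int(weight)}" for name, weight in weights.items())


def parse_style_mix(spec):
    """Parse 'af_sky.5+af_sarah.5' into {style: fraction}."""
    fractions = {}
    for part in spec.split("+"):
        part = part.strip()
        name, sep, digits = part.rpartition(".")
        if not sep:
            # Assumption: a bare style name counts as full weight.
            name, fraction = part, 1.0
        else:
            # Documented rule: the weight is multiplied by 0.1, so 5 means 0.5.
            fraction = int(digits) / 10
        if not name:
            raise ValueError(f"missing style name in {part!r}")
        fractions[name] = fraction
    return fractions
```

For example, `parse_style_mix("af_sky.5+af_sarah.5")` yields equal fractions of 0.5 for both styles.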
- `af_bella` - Bella (Female, American)
- `af_nicole` - Nicole (Female, American)
- `af_sarah` - Sarah (Female, American)
- `af_sky` - Sky (Female, American)
- `am_adam` - Adam (Male, American)
- `am_michael` - Michael (Male, American)
- `bf_emma` - Emma (Female, British)
- `bf_isabella` - Isabella (Female, British)
- `bm_george` - George (Male, British)
- `bm_lewis` - Lewis (Male, British)
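
The voice IDs listed above appear to follow a two-letter prefix pattern: the first letter for accent (`a` American, `b` British) and the second for gender (`f` female, `m` male). The Python sketch below decodes an ID on that assumption. The pattern is inferred from the list, not a documented contract, and `describe_voice` is a hypothetical helper.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "Female", "m": "Male"}


def describe_voice(voice_id):
    """Decode an ID such as 'bm_george' into name, gender, and accent."""
    prefix, _, name = voice_id.partition("_")
    if (len(prefix) != 2 or prefix[0] not in ACCENTS
            or prefix[1] not in GENDERS or not name):
        raise ValueError(f"unrecognized voice id: {voice_id!r}")
    return {
        "name": name.capitalize(),
        "gender": GENDERS[prefix[1]],
        "accent": ACCENTS[prefix[0]],
    }
```

This can be handy for grouping the output of `/api/styles` in a voice picker.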
[server]
host = "0.0.0.0"               # Bind address
port = 8080                    # Port number

[tts]
sample_rate = 24000            # Audio sample rate (Hz)
max_concurrent_requests = 4    # Maximum concurrent TTS requests
max_text_length = 10000        # Maximum input text length (characters)
max_tokens_per_chunk = 300     # Maximum tokens per processing chunk
default_language = "en"        # Default language
default_style = "default"      # Default voice style
default_speed = 1.0            # Default speech speed

[audio]
default_mono = true            # Default to mono audio
default_mp3 = false            # Default to WAV output
wav_bit_depth = 32             # WAV bit depth (16, 24, or 32)

[execution]
provider = "cpu"               # Options: "cpu", "cuda", "coreml"

[api]
keys = [
"your-first-key",
"your-second-key"
]

[service]
estimated_wait_seconds = 20    # Estimated wait time when queue is full

- Never commit your API keys to version control
- Use environment variables for production deployments
- Consider implementing rate limiting per API key
- Use HTTPS in production environments
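
On the client side, one way to keep keys out of source code is to read them from the environment. The Python sketch below assumes a variable named `KOKOROTTS_API_KEY`, which is a hypothetical name chosen for illustration. It covers only the client and does not describe how the service itself loads keys.

```python
import os


def load_api_key(env=None, var_name="KOKOROTTS_API_KEY"):
    """Read the API key from the environment (variable name is hypothetical)."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set")
    return key


def auth_headers(api_key):
    """Build the Bearer header required by every endpoint except /health."""
    return {"Authorization": f"Bearer {api_key}"}
```

Passing a plain dict as `env` makes the helper easy to exercise in tests without touching the real environment.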
{
"success": false,
"message": "Text cannot be empty"
}

or
{
"success": false,
"message": "Text too long. Maximum 10000 characters allowed."
}

{
"success": false,
"message": "Invalid or missing API key"
}

{
"success": false,
"message": "Audio synthesis failed: [error details]"
}

- Use MP3 for smaller file sizes (typically 10x smaller than WAV)
- Adjust `max_concurrent_requests` based on your CPU cores
- Monitor `/api/status` to check service load
- Split very long texts into multiple requests for better responsiveness
- Use the `speed` parameter carefully (0.8-1.2 recommended for natural speech)
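
Splitting long text on the client, as suggested above, can be sketched in Python as follows. The service already chunks long input on its own, so this only serves to get the first audio back sooner. `split_text` is a hypothetical helper, the 1,000-character default is an arbitrary choice, and the sentence detection is a simple punctuation heuristic.

```python
import re


def split_text(text, max_chars=1000):
    """Split text into sentence-aligned pieces of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is cut at fixed offsets.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each piece can then be sent as its own `/api/tts` request, with the resulting audio files played or joined in order.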