This document describes the model management system that automatically handles loading and unloading of the Kokoro TTS model to optimize memory usage in Docker deployments.
The TorchTS backend now includes an intelligent model lifecycle manager that:
- Loads models on-demand when TTS requests are made
- Automatically unloads models after a period of inactivity to free memory
- Handles concurrent requests safely during model loading/unloading
- Provides API endpoints for monitoring and controlling model state
| Variable | Default | Description |
|---|---|---|
MODEL_UNLOAD_TIMEOUT |
300 |
Seconds of inactivity before unloading model |
MODEL_DEVICE |
auto-detect | Device to load model on (cuda, cpu, or empty for auto) |
TORCHTS_DB_URL |
sqlite:///data/torchts.db |
Database connection string |
LOG_LEVEL |
INFO |
Logging level |
FORCE_GC_AFTER_REQUEST |
false |
Force garbage collection after requests |
CLEAR_CUDA_CACHE |
true |
Clear CUDA cache after model unload |
For CPU-only deployment:
environment:
- MODEL_UNLOAD_TIMEOUT=300
- MODEL_DEVICE=cpuFor GPU deployment:
environment:
- MODEL_UNLOAD_TIMEOUT=180 # Shorter timeout for GPU
- MODEL_DEVICE=cuda- Server starts with no model loaded in memory
- First TTS request triggers model loading
- Loading takes 10-30 seconds depending on device
- Model remains in memory while processing requests
- Each request resets the inactivity timer
- Multiple concurrent requests are handled safely
- After
MODEL_UNLOAD_TIMEOUTseconds of inactivity - Model is automatically unloaded from memory
- GPU memory is cleared (if using CUDA)
- Subsequent requests trigger reload
GET /model/statusReturns current model state and memory usage:
{
"model_loaded": true,
"device": "cuda",
"unload_timeout": 300,
"time_since_last_activity": 45.2,
"is_loading": false,
"available_languages": ["a", "b", "e", "f", "h", "i", "j", "p", "z"],
"gpu_memory_allocated": 1234567890,
"gpu_memory_reserved": 2345678901
}POST /model/unloadImmediately unloads the model to free memory:
{
"message": "Model unloaded successfully"
}POST /model/timeout
Content-Type: application/json
{
"timeout_seconds": 600
}Updates the unload timeout (minimum 60 seconds):
{
"message": "Timeout updated to 600 seconds"
}# Aggressive unloading (2 minutes)
MODEL_UNLOAD_TIMEOUT=120
FORCE_GC_AFTER_REQUEST=true# Keep model loaded longer (10 minutes)
MODEL_UNLOAD_TIMEOUT=600
FORCE_GC_AFTER_REQUEST=false# Balance performance and memory
MODEL_UNLOAD_TIMEOUT=300
MODEL_DEVICE=cuda
CLEAR_CUDA_CACHE=trueMonitor model memory usage through the /model/status endpoint:
curl http://localhost:5005/model/status | jq '.gpu_memory_allocated'Monitor container memory usage:
docker stats torchts-backend-1Model loading/unloading events are logged:
[green]Model loaded successfully in 12.34s[/green]
[yellow]Unloading model due to inactivity[/yellow]
[green]Model unloaded successfully[/green]
- Check device availability:
nvidia-smifor GPU - Verify memory availability (model needs ~2-4GB)
- Check logs for specific error messages
- Reduce
MODEL_UNLOAD_TIMEOUT - Enable
FORCE_GC_AFTER_REQUEST - Monitor with
/model/statusendpoint
- This is expected behavior - model loading takes time
- Consider keeping model loaded with higher timeout
- Use
/model/statusto check loading progress
- Reduce
MODEL_UNLOAD_TIMEOUT - Ensure
CLEAR_CUDA_CACHE=true - Force unload with
POST /model/unload
| Scenario | Model Load Time | Memory Usage | First Request Latency |
|---|---|---|---|
| CPU (4 cores) | 20-30s | ~2GB | 25-35s |
| GPU (RTX 3080) | 10-15s | ~3GB VRAM | 15-20s |
| GPU (A100) | 5-10s | ~3GB VRAM | 8-15s |
- Set appropriate timeout: Balance memory savings vs. response time
- Monitor memory usage: Use
/model/statusand Docker stats - Consider usage patterns: Frequent use = higher timeout
- Test your hardware: Measure load times for your specific setup
- Use health checks: Monitor
/model/statusin production