TorchTS Model Management

This document describes the model management system that automatically handles loading and unloading of the Kokoro TTS model to optimize memory usage in Docker deployments.

Overview

The TorchTS backend now includes an intelligent model lifecycle manager that:

Loads models on-demand when TTS requests are made
Automatically unloads models after a period of inactivity to free memory
Handles concurrent requests safely during model loading/unloading
Provides API endpoints for monitoring and controlling model state

Configuration

Environment Variables

Variable	Default	Description
`MODEL_UNLOAD_TIMEOUT`	`300`	Seconds of inactivity before unloading model
`MODEL_DEVICE`	auto-detect	Device to load model on (`cuda`, `cpu`, or empty for auto)
`TORCHTS_DB_URL`	`sqlite:///data/torchts.db`	Database connection string
`LOG_LEVEL`	`INFO`	Logging level
`FORCE_GC_AFTER_REQUEST`	`false`	Force garbage collection after requests
`CLEAR_CUDA_CACHE`	`true`	Clear CUDA cache after model unload

Docker Compose Configuration

For CPU-only deployment:

environment:
  - MODEL_UNLOAD_TIMEOUT=300
  - MODEL_DEVICE=cpu

For GPU deployment:

environment:
  - MODEL_UNLOAD_TIMEOUT=180  # Shorter timeout for GPU
  - MODEL_DEVICE=cuda

Model Lifecycle

1. Initial State

Server starts with no model loaded in memory
First TTS request triggers model loading
Loading takes 10-30 seconds depending on device

2. Active State

Model remains in memory while processing requests
Each request resets the inactivity timer
Multiple concurrent requests are handled safely

3. Unload State

After MODEL_UNLOAD_TIMEOUT seconds of inactivity
Model is automatically unloaded from memory
GPU memory is cleared (if using CUDA)
Subsequent requests trigger reload

API Endpoints

Get Model Status

GET /model/status

Returns current model state and memory usage:

{
  "model_loaded": true,
  "device": "cuda",
  "unload_timeout": 300,
  "time_since_last_activity": 45.2,
  "is_loading": false,
  "available_languages": ["a", "b", "e", "f", "h", "i", "j", "p", "z"],
  "gpu_memory_allocated": 1234567890,
  "gpu_memory_reserved": 2345678901
}

Force Model Unload

POST /model/unload

Immediately unloads the model to free memory:

{
  "message": "Model unloaded successfully"
}

Update Timeout

POST /model/timeout
Content-Type: application/json

{
  "timeout_seconds": 600
}

Updates the unload timeout (minimum 60 seconds):

{
  "message": "Timeout updated to 600 seconds"
}

Memory Optimization Tips

For Low-Memory Systems

# Aggressive unloading (2 minutes)
MODEL_UNLOAD_TIMEOUT=120
FORCE_GC_AFTER_REQUEST=true

For High-Performance Systems

# Keep model loaded longer (10 minutes)
MODEL_UNLOAD_TIMEOUT=600
FORCE_GC_AFTER_REQUEST=false

For GPU Systems

# Balance performance and memory
MODEL_UNLOAD_TIMEOUT=300
MODEL_DEVICE=cuda
CLEAR_CUDA_CACHE=true

Monitoring

Memory Usage

Monitor model memory usage through the /model/status endpoint:

curl http://localhost:5005/model/status | jq '.gpu_memory_allocated'

Docker Stats

Monitor container memory usage:

docker stats torchts-backend-1

Logs

Model loading/unloading events are logged:

[green]Model loaded successfully in 12.34s[/green]
[yellow]Unloading model due to inactivity[/yellow]
[green]Model unloaded successfully[/green]

Troubleshooting

Model Won't Load

Check device availability: nvidia-smi for GPU
Verify memory availability (model needs ~2-4GB)
Check logs for specific error messages

High Memory Usage

Reduce MODEL_UNLOAD_TIMEOUT
Enable FORCE_GC_AFTER_REQUEST
Monitor with /model/status endpoint

Slow First Request

This is expected behavior - model loading takes time
Consider keeping model loaded with higher timeout
Use /model/status to check loading progress

CUDA Out of Memory

Reduce MODEL_UNLOAD_TIMEOUT
Ensure CLEAR_CUDA_CACHE=true
Force unload with POST /model/unload

Performance Characteristics

Scenario	Model Load Time	Memory Usage	First Request Latency
CPU (4 cores)	20-30s	~2GB	25-35s
GPU (RTX 3080)	10-15s	~3GB VRAM	15-20s
GPU (A100)	5-10s	~3GB VRAM	8-15s

Best Practices

Set appropriate timeout: Balance memory savings vs. response time
Monitor memory usage: Use /model/status and Docker stats
Consider usage patterns: Frequent use = higher timeout
Test your hardware: Measure load times for your specific setup
Use health checks: Monitor /model/status in production

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TorchTS Model Management

Overview

Configuration

Environment Variables

Docker Compose Configuration

Model Lifecycle

1. Initial State

2. Active State

3. Unload State

API Endpoints

Get Model Status

Force Model Unload

Update Timeout

Memory Optimization Tips

For Low-Memory Systems

For High-Performance Systems

For GPU Systems

Monitoring

Memory Usage

Docker Stats

Logs

Troubleshooting

Model Won't Load

High Memory Usage

Slow First Request

CUDA Out of Memory

Performance Characteristics

Best Practices

FilesExpand file tree

MODEL_MANAGEMENT.md

Latest commit

History

MODEL_MANAGEMENT.md

File metadata and controls

TorchTS Model Management

Overview

Configuration

Environment Variables

Docker Compose Configuration

Model Lifecycle

1. Initial State

2. Active State

3. Unload State

API Endpoints

Get Model Status

Force Model Unload

Update Timeout

Memory Optimization Tips

For Low-Memory Systems

For High-Performance Systems

For GPU Systems

Monitoring

Memory Usage

Docker Stats

Logs

Troubleshooting

Model Won't Load

High Memory Usage

Slow First Request

CUDA Out of Memory

Performance Characteristics

Best Practices