Fast, reliable speech recognition using OpenAI's Whisper models, optimized for Kubernetes deployments.
- Multiple Model Sizes: Choose from 5 model sizes (tiny to large) balancing speed vs. accuracy
- GPU Acceleration: NVIDIA GPU support for faster processing
- Multi-language: Supports 90+ languages with language auto-detection
- Translation: Translates non-English speech to English
- Production-Ready: Built for Kubernetes with monitoring, health checks, and scaling
- Flexible Configuration: Runtime configuration via API or environment variables
Deploy on Kubernetes with KServe:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper-service
spec:
  predictor:
    containers:
      - image: ghcr.io/bnallapeta/kube-whisperer:0.0.1
        env:
          - name: WHISPER_MODEL
            value: "base"  # Options: tiny, base, small, medium, large
          - name: DEVICE
            value: "cuda"  # Use "cpu" for CPU-only
        resources:
          limits:
            nvidia.com/gpu: "1"  # Remove for CPU deployment
```

Alternatively, run the container directly with Docker:

```bash
docker run -p 8000:8000 \
  -e WHISPER_MODEL=small \
  -e DEVICE=cuda \
  -e COMPUTE_TYPE=float16 \
  ghcr.io/bnallapeta/kube-whisperer:0.0.1
```

For local development:

```bash
# Clone repository
git clone https://github.com/bnallapeta/kube-whisperer
cd kube-whisperer

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run locally with configuration
WHISPER_MODEL=medium DEVICE=cpu python -m src.serve
```

The service accepts the following configuration parameters:

| Parameter | Description | Default | Options |
|---|---|---|---|
| `whisper_model` | Model size | `base` | `tiny`, `base`, `small`, `medium`, `large` |
| `device` | Compute device | `cpu` | `cpu`, `cuda`, `mps` |
| `compute_type` | Precision | `int8` | `int8`, `float16`, `float32` |
| `cpu_threads` | CPU threads | `4` | Any positive integer |
| `num_workers` | Workers | `1` | Any positive integer |
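The defaults above can be mirrored client-side when you need to reason about a deployment's effective configuration. The `ServiceConfig` class below is an illustrative sketch of reading these environment variables, not the service's actual implementation:

```python
import os
from dataclasses import dataclass


@dataclass
class ServiceConfig:
    """Illustrative mirror of the service configuration table."""
    whisper_model: str
    device: str
    compute_type: str
    cpu_threads: int
    num_workers: int

    @classmethod
    def from_env(cls) -> "ServiceConfig":
        # Defaults match the configuration table above
        return cls(
            whisper_model=os.getenv("WHISPER_MODEL", "base"),
            device=os.getenv("DEVICE", "cpu"),
            compute_type=os.getenv("COMPUTE_TYPE", "int8"),
            cpu_threads=int(os.getenv("CPU_THREADS", "4")),
            num_workers=int(os.getenv("NUM_WORKERS", "1")),
        )
```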
Per-request transcription options:

| Parameter | Description | Default | Options |
|---|---|---|---|
| `language` | Language code | `en` | Any ISO language code (e.g., `en`, `es`, `fr`) |
| `task` | Task type | `transcribe` | `transcribe`, `translate` |
| `beam_size` | Beam search size | `5` | Integer between 1 and 10 |
| `temperature` | Sampling temperature | `[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]` | Float values 0.0-1.0 |
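The constraints in this table can be checked client-side before a request is sent. The helper below is a hypothetical sketch based only on the table above; the service performs its own validation, and `validate_options` is not part of its API:

```python
VALID_TASKS = {"transcribe", "translate"}


def validate_options(options: dict) -> list:
    """Return a list of problems found in a per-request options dict."""
    problems = []
    task = options.get("task", "transcribe")
    if task not in VALID_TASKS:
        problems.append(f"unknown task: {task!r}")
    beam_size = options.get("beam_size", 5)
    if not (isinstance(beam_size, int) and 1 <= beam_size <= 10):
        problems.append(f"beam_size must be an integer in [1, 10], got {beam_size!r}")
    language = options.get("language", "en")
    if not (isinstance(language, str) and 2 <= len(language) <= 3):
        problems.append(f"language should be an ISO code such as 'en', got {language!r}")
    return problems
```

An empty return value means the options passed every check, e.g. `validate_options({"language": "fr", "task": "translate"})` returns `[]`.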
For Kubernetes deployments, configuration values are specified directly in the YAML file:
```yaml
# In your InferenceService YAML
spec:
  predictor:
    containers:
      - image: ghcr.io/bnallapeta/kube-whisperer:0.0.1
        env:
          - name: WHISPER_MODEL
            value: "small"
          - name: DEVICE
            value: "cuda"
          - name: COMPUTE_TYPE
            value: "float16"
          - name: CPU_THREADS
            value: "8"
          - name: NUM_WORKERS
            value: "2"
```

To modify these configurations:
- Edit the YAML file before applying
- Use `kubectl edit inferenceservice whisper-service` to modify an existing deployment
- Use Kustomize or Helm for managing different configurations
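As one way to manage per-environment configuration, a Kustomize overlay could patch the model size. This is an illustrative sketch only: the manifest file name and the env-entry index in the patch path are assumptions about your layout, not part of this repo:

```yaml
# kustomization.yaml (illustrative; file names and patch paths are assumptions)
resources:
  - whisper-service.yaml   # the InferenceService manifest shown above

patches:
  - target:
      kind: InferenceService
      name: whisper-service
    patch: |-
      # Switch WHISPER_MODEL (assumed to be the first env entry) to "medium"
      - op: replace
        path: /spec/predictor/containers/0/env/0/value
        value: "medium"
```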
When using Docker, pass environment variables with the `-e` flag:
```bash
docker run -p 8000:8000 \
  -e WHISPER_MODEL=small \
  -e DEVICE=cuda \
  -e COMPUTE_TYPE=float16 \
  -e CPU_THREADS=8 \
  ghcr.io/bnallapeta/kube-whisperer:0.0.1
```

For local development, set environment variables before running:
```bash
# Set environment variables
export WHISPER_MODEL=small
export DEVICE=cuda
export COMPUTE_TYPE=float16

# Then run the service
python -m src.serve
```

Update configuration without restarting:
```bash
curl -X POST http://whisper-service/config \
  -H "Content-Type: application/json" \
  -d '{
    "whisper_model": "small",
    "device": "cuda",
    "compute_type": "float16"
  }'
```

Configure options for a specific request:
```bash
curl -X POST http://whisper-service/transcribe \
  -F "file=@audio.wav" \
  -F 'options={"language":"es", "task":"translate"}'
```

| Endpoint | Method | Description |
|---|---|---|
| `/transcribe` | POST | Transcribe a single audio file |
| `/batch_transcribe` | POST | Transcribe multiple audio files |
| `/config` | POST | Update service configuration |
| `/ready` | GET | Readiness check |
| `/live` | GET | Liveness check |
| `/metrics` | GET | Prometheus metrics |
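For calling these endpoints from Python, a thin client can wrap the multipart upload. The sketch below is illustrative: it assumes the `requests` package and a service reachable at `base_url`, and the helper names are not part of this project:

```python
import json

import requests


def build_transcribe_request(base_url, audio_bytes, filename="audio.wav", **options):
    """Prepare (without sending) a multipart POST for the /transcribe endpoint."""
    req = requests.Request(
        "POST",
        f"{base_url}/transcribe",
        files={"file": (filename, audio_bytes, "audio/wav")},
        # Options travel as a JSON string in the "options" form field
        data={"options": json.dumps(options)} if options else None,
    )
    return req.prepare()


def transcribe(base_url, audio_path, **options):
    """Send an audio file to /transcribe and return the parsed JSON response."""
    with open(audio_path, "rb") as f:
        prepared = build_transcribe_request(base_url, f.read(),
                                            filename=audio_path, **options)
    with requests.Session() as session:
        response = session.send(prepared)
    response.raise_for_status()
    return response.json()
```

Separating request construction from sending makes the multipart encoding easy to inspect or test without a running service.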
Basic transcription:

```bash
curl -X POST http://whisper-service/transcribe \
  -F "file=@audio.wav"
```

Transcription with options (here, translating French audio to English):

```bash
curl -X POST http://whisper-service/transcribe \
  -F "file=@audio.wav" \
  -F 'options={"language":"fr", "task":"translate"}'
```

Batch transcription of multiple files:

```bash
curl -X POST http://whisper-service/batch_transcribe \
  -H "Content-Type: application/json" \
  -d '{
    "files": ["/path/to/file1.wav", "/path/to/file2.mp3"],
    "options": {"language": "es"}
  }'
```
- Out of Memory: Choose a smaller model or lower precision, or increase GPU memory

  ```bash
  # Use a smaller model
  -e WHISPER_MODEL=small
  # Use lower precision
  -e COMPUTE_TYPE=int8
  ```

- Slow Performance: Check your compute configuration

  ```bash
  # Check if the GPU is being used
  curl http://whisper-service/ready | grep device
  # For CPU deployments, increase threads
  -e CPU_THREADS=8
  ```

- Language Issues: Specify the language explicitly

  ```bash
  # Specify the language in the request
  -F 'options={"language":"fr"}'
  ```

- Check logs: `kubectl logs -f deployment/whisper-service`
- File issues: GitHub Issues
This project is licensed under the MIT License - see the LICENSE file for details.