The runpod-whisper worker provides fast and accurate audio transcriptions. Built for deployment on RunPod as a serverless endpoint, the worker automatically scales with demand. It supports multiple languages and configurable model selection, and runs on either CPU or GPU.
- Multilingual support – Supports any language that the configured model supports
- Hallucination detection – An LLM assigns a hallucination score (0.0–1.0) with reasoning to assess reliability
- Optional LiteLLM translation – Automatically translates transcriptions that are not in the requested language
- Word-level timestamps – Set `enable_timestamps: true` to include start and end times for each segment
Configure the worker behavior through these environment variables:
- `WHISPER_MODEL_NAME`: The Hugging Face model name for Faster Whisper
  - Default: `"Systran/faster-whisper-large-v1"`
- `BATCH_SIZE`: Number of audio segments processed per batch
  - Default: `"8"`
- `USE_CPU`: Force CPU execution even when a GPU is available
  - Default: Not set (auto-detect)
  - Set to `"1"` to force CPU usage
- `DEBUG`: Enable detailed debug logging
  - Default: `"false"`
  - Set to `"true"`, `"1"`, or `"yes"` to enable
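The boolean-style variables accept several spellings. A minimal sketch of how such values might be parsed (the `env_flag` helper is illustrative, not taken from the worker's source):

```python
import os

def env_flag(name: str, truthy: tuple = ("true", "1", "yes")) -> bool:
    """Return True if the environment variable is set to a truthy value."""
    return os.environ.get(name, "").strip().lower() in truthy

# DEBUG accepts "true", "1", or "yes" (case-insensitive); USE_CPU accepts "1"
os.environ["DEBUG"] = "Yes"
debug_enabled = env_flag("DEBUG")            # True
force_cpu = env_flag("USE_CPU", ("1",))      # False when unset
```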
Configure these variables to enable translation and hallucination detection:
- `LITELLM_MODEL`: LiteLLM model identifier (required for LiteLLM features)
- `LITELLM_API_KEY`: API key for the LiteLLM service (required)
- `LITELLM_API_BASE`: Base URL for the LiteLLM API (optional)
- `LITELLM_API_VERSION`: API version for LiteLLM (optional)
The worker exposes a single handler that expects a JSON payload.
The input object in the JSON payload contains:
```json
{
  "audio_base_64": "string (optional)",
  "audio": "string (URL, optional)",
  "language": "string (optional, e.g., 'en', 'nl')",
  "hotwords": "string (optional)",
  "enable_timestamps": false,
  "metadata_str": "string (optional)",
  "disable_hallucination_detection": false,
  "disable_translation": false
}
```

- `audio_base_64` (string, optional): Base64-encoded audio data
- `audio` (string, optional): URL or local path pointing to an audio file
  - Note: Either `audio_base_64` or `audio` must be provided
- `language` (string, optional): Target language code. If the target language differs from the detected language, the transcription is translated into the target language.
- `hotwords` (string, optional): Comma-separated list of hotwords to aid transcription. Use this to inform the model of proper nouns, technical terms, or other words that are important to the conversation.
- `enable_timestamps` (boolean, optional): If `true`, the response includes word-level timestamp data
- `metadata_str` (string, optional): Metadata field echoed back in the response
- `disable_hallucination_detection` (boolean, optional): If `true`, hallucination detection is disabled
- `disable_translation` (boolean, optional): If `true`, translation is disabled
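For inline audio, a payload can be assembled by Base64-encoding a local file. A sketch (the `build_payload` helper and the file path are illustrative; only one of `audio` or `audio_base_64` is needed):

```python
import base64

def build_payload(audio_path: str, language: str = "en") -> dict:
    """Encode a local audio file and wrap it in the worker's input schema."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "input": {
            "audio_base_64": audio_b64,
            "language": language,
            "enable_timestamps": True,
        }
    }
```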
```json
{
  "input": {
    "audio": "https://github.com/runpod-workers/sample-inputs/raw/refs/heads/main/audio/Arthur.mp3",
    "language": "nl",
    "hotwords": "RunPod,Directus,Sameer,Dembrane",
    "metadata_str": "This is a test metadata string",
    "enable_timestamps": true,
    "disable_hallucination_detection": false,
    "disable_translation": false
  }
}
```

A successful response looks like:

```json
{
  "metadata_str": "optional string",
  "enable_timestamps": true,
  "language": "nl",
  "detected_language": "nl",
  "detected_language_confidence": 0.9805044531822205,
  "joined_text": "... full transcription ...",
  "translation_text": "... full translation ...",
  "translation_error": false,
  "hallucination_score": 0.2,
  "hallucination_reason": "Minor repetitions detected",
  "segments": [
    {
      "text": "Segment text",
      "start": 0.0,
      "end": 2.5
    }
  ]
}
```

- `joined_text`: Complete transcription text (translated if needed)
- `translation_error`: `true` if any translation failed or timed out
- `hallucination_score`: Float 0.0–1.0 indicating severity:
  - 0.0: No hallucination detected
  - 0.1–0.3: Minor errors, meaning intact
  - 0.4–0.6: Moderate errors, partial distortion
  - 0.7–0.9: Severe errors, strong distortion
  - 1.0: Complete hallucination/nonsense
- `hallucination_reason`: Brief explanation (max 20 words) when score > 0
- `segments`: Array of segment objects (only when `enable_timestamps: true`)
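As a sketch, the example request could be sent to a deployed endpoint over RunPod's serverless HTTP API. The endpoint ID and API key below are placeholders; verify the URL scheme against RunPod's current documentation:

```python
import json
import urllib.request

ENDPOINT_ID = "your-endpoint-id"   # placeholder: your RunPod endpoint ID
API_KEY = "your-runpod-api-key"    # placeholder: your RunPod API key

# /runsync blocks until the job finishes; /run would return a job ID instead.
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

payload = {
    "input": {
        "audio": "https://github.com/runpod-workers/sample-inputs/raw/refs/heads/main/audio/Arthur.mp3",
        "language": "nl",
        "enable_timestamps": True,
    }
}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)

# Uncomment with real credentials:
# with urllib.request.urlopen(request) as resp:
#     result = json.loads(resp.read())
#     print(result.get("joined_text"))
```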
```json
{
  "metadata_str": "",
  "enable_timestamps": false,
  "language": "en",
  "error": "No audio input provided",
  "message": "An unhandled error occurred while processing the request."
}
```

1. Clone the repository:

   ```bash
   git clone https://github.com/dembrane/runpod-whisper.git
   cd runpod-whisper
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up the environment variables and modify the `test_input.json` file.

4. Run the handler:

   ```bash
   python handler.py
   ```
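A minimal `test_input.json` might look like the following (fields follow the input schema above; the values are placeholders):

```json
{
  "input": {
    "audio": "https://github.com/runpod-workers/sample-inputs/raw/refs/heads/main/audio/Arthur.mp3",
    "language": "en",
    "enable_timestamps": true
  }
}
```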
The handler automatically detects available hardware:
- GPU: Uses the `cuda` device with `compute_type="float16"`
- CPU with MPS (Apple Silicon): Uses the `cpu` device with `compute_type="float32"`
- CPU without MPS: Uses the `cpu` device with `compute_type="int8"`
CPU threads are set to the available CPU count.
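The selection rules above can be sketched as a pure function; the `select_device` name is illustrative and the worker's actual implementation may differ:

```python
import os

def select_device(cuda_available: bool, mps_available: bool,
                  force_cpu: bool = False) -> tuple:
    """Pick (device, compute_type) following the rules described above."""
    if cuda_available and not force_cpu:
        return ("cuda", "float16")   # GPU: half-precision kernels
    if mps_available:
        return ("cpu", "float32")    # Apple Silicon: CPU with float32
    return ("cpu", "int8")           # Plain CPU: int8 quantization

# CPU threads follow the available core count
cpu_threads = os.cpu_count()
```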
When LiteLLM is configured and the detected language differs from the requested language, the transcription is translated into the requested language.
When LiteLLM is configured, the system analyzes the complete transcription after translation and evaluates it for common hallucination patterns:
- Excessive word/phrase repetition
- Nonsensical or contradictory sequences
- Abrupt topic changes
- Misplaced technical terms
- Transcribed filler sounds
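The resulting 0.0–1.0 score can be bucketed for downstream filtering. A minimal sketch whose thresholds mirror the bands documented above (the function name is illustrative):

```python
def hallucination_band(score: float) -> str:
    """Map a hallucination score (0.0-1.0) to the documented severity band."""
    if score <= 0.0:
        return "none"        # no hallucination detected
    if score <= 0.3:
        return "minor"       # minor errors, meaning intact
    if score <= 0.6:
        return "moderate"    # moderate errors, partial distortion
    if score < 1.0:
        return "severe"      # severe errors, strong distortion
    return "complete"        # complete hallucination/nonsense
```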
GitHub Actions workflow (`.github/workflows/ci.yml`):

- Triggers: Push to the `main` branch or manual dispatch
- Actions:
  - Builds the Docker image
  - Pushes it to Azure Container Registry
  - Tags it with the Git commit SHA
  - Uses layer caching for efficiency

Required repository secrets:

- `AZURE_REGISTRY_LOGIN_SERVER`
- `AZURE_REGISTRY_USERNAME`
- `AZURE_REGISTRY_PASSWORD`
- Audio URLs must be publicly accessible
- Temporary files are automatically cleaned up when a public audio URL is used
- API keys should be stored securely as environment variables
- Consider network policies for production deployments