Dembrane RunPod Whisper

Overview

The runpod-whisper worker provides fast and accurate audio transcriptions. Built for deployment on RunPod as a serverless endpoint, the worker automatically scales with demand. The worker supports multiple languages, configurable model selection, and runs on CPU or GPU.

Features

Core Capabilities

Multilingual support – Transcribes any language that the configured model supports
Hallucination detection – An LLM assigns a hallucination score (0.0–1.0) with a short reason so you can assess reliability
Optional LiteLLM translation – Automatically translates the transcription when the detected language differs from the requested language
Word-level timestamps – Set enable_timestamps: true to include start and end times for each segment

Environment Variables

Configure the worker behavior through these environment variables:

Core Configuration

WHISPER_MODEL_NAME: The Hugging Face model name for Faster Whisper
- Default: "Systran/faster-whisper-large-v1"

Performance Settings

BATCH_SIZE: Number of audio segments processed per batch
- Default: "8"
USE_CPU: Force CPU execution even when GPU is available
- Default: Not set (auto-detect)
- Set to "1" to force CPU usage
DEBUG: Enable detailed debug logging
- Default: "false"
- Set to "true", "1", or "yes" to enable
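
As an illustration of how these variables and defaults could be read, here is a minimal sketch. `WorkerSettings` and `load_settings` are hypothetical names, not taken from handler.py, and the case-insensitive handling of DEBUG is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSettings:
    model_name: str
    batch_size: int
    force_cpu: bool
    debug: bool


def load_settings(env=None):
    """Read the documented variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return WorkerSettings(
        # Defaults mirror the values listed above.
        model_name=env.get("WHISPER_MODEL_NAME", "Systran/faster-whisper-large-v1"),
        batch_size=int(env.get("BATCH_SIZE", "8")),
        # Only the literal "1" forces CPU; unset means auto-detect.
        force_cpu=env.get("USE_CPU") == "1",
        # Assumption: the comparison ignores case and surrounding whitespace.
        debug=env.get("DEBUG", "false").strip().lower() in {"true", "1", "yes"},
    )
```

Passing a plain dict, for example `load_settings({"BATCH_SIZE": "16"})`, makes the parsing easy to check without touching the real environment.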

LiteLLM Integration

Configure these variables to enable translation and hallucination detection:

LITELLM_MODEL: LiteLLM model identifier (required for LiteLLM features)
LITELLM_API_KEY: API key for LiteLLM service (required)
LITELLM_API_BASE: Base URL for LiteLLM API (optional)
LITELLM_API_VERSION: API version for LiteLLM (optional)
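
The sketch below shows one way to decide whether the LiteLLM features can run at all. It assumes that a missing LITELLM_MODEL or LITELLM_API_KEY simply turns those features off; `litellm_settings` is a hypothetical helper, and the returned keys are chosen to resemble the keyword arguments commonly passed to `litellm.completion`.

```python
import os


def litellm_settings(env=None):
    """Return LiteLLM settings, or None when a required variable is missing."""
    env = os.environ if env is None else env
    model = env.get("LITELLM_MODEL")
    api_key = env.get("LITELLM_API_KEY")
    if not model or not api_key:
        # Assumption: translation and hallucination detection are skipped.
        return None
    settings = {"model": model, "api_key": api_key}
    # The optional variables are included only when they are set.
    if env.get("LITELLM_API_BASE"):
        settings["api_base"] = env["LITELLM_API_BASE"]
    if env.get("LITELLM_API_VERSION"):
        settings["api_version"] = env["LITELLM_API_VERSION"]
    return settings
```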

Handler API Usage

The worker exposes a single handler that expects a JSON payload.

Input Payload

The input object in the JSON payload contains:

{
    "audio_base_64": "string (optional)",
    "audio": "string (URL, optional)",
    "language": "string (optional, e.g., 'en', 'nl')",
    "hotwords": "string (optional)",
    "enable_timestamps": false,
    "metadata_str": "string (optional)",
    "disable_hallucination_detection": false,
    "disable_translation": false
}

Field Descriptions

audio_base_64 (string, optional): Base64 encoded audio data
audio (string, optional): URL or local path of an audio file
Note: Either audio_base_64 or audio must be provided
language (string, optional): Target language code. If it differs from the detected language, the transcription is translated into the target language.
hotwords (string, optional): Comma-separated list of hotwords that guide the transcription. Use this to tell the model about proper nouns, technical terms, or other words that matter in the conversation.
enable_timestamps (boolean, optional): If true, the response includes word-level timestamp data
metadata_str (string, optional): Metadata string echoed back in the response
disable_hallucination_detection (boolean, optional): If true, hallucination detection will be disabled
disable_translation (boolean, optional): If true, translation will be disabled
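
A client can catch the most common mistakes before submitting a job. The following is a minimal sketch of such a check, not the handler's own validation: `validate_input` is a hypothetical helper, and only the "No audio input provided" message comes from this README; the other messages are made up for illustration.

```python
import base64
import binascii

BOOLEAN_FIELDS = (
    "enable_timestamps",
    "disable_hallucination_detection",
    "disable_translation",
)


def validate_input(job_input):
    """Return an error message for an unusable payload, or None if it looks fine."""
    audio_b64 = job_input.get("audio_base_64")
    audio = job_input.get("audio")
    # Either audio_base_64 or audio must be provided.
    if not audio_b64 and not audio:
        return "No audio input provided"
    if audio_b64:
        try:
            base64.b64decode(audio_b64, validate=True)
        except (binascii.Error, ValueError):
            return "audio_base_64 is not valid Base64"
    for field in BOOLEAN_FIELDS:
        if not isinstance(job_input.get(field, False), bool):
            return f"{field} must be a boolean"
    return None
```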

Example Input

{
  "input": {
    "audio": "https://github.com/runpod-workers/sample-inputs/raw/refs/heads/main/audio/Arthur.mp3",
    "language": "nl",
    "hotwords": "RunPod,Directus,Sameer,Dembrane",
    "metadata_str": "This is a test metadata string",
    "enable_timestamps": true,
    "disable_hallucination_detection": false,
    "disable_translation": false
  }
}
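
To send a payload like this to a deployed endpoint, a request can be assembled along the following lines. This sketch assumes the standard RunPod serverless `runsync` route and Bearer authentication; the endpoint ID and API key are placeholders you supply, and `build_runsync_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_runsync_request(endpoint_id, api_key, job_input):
    """Build (but do not send) a synchronous job request for a RunPod endpoint."""
    # Assumption: the endpoint is reachable on RunPod's public serverless API.
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending is then a matter of `urllib.request.urlopen(request)` and decoding the JSON body. RunPod usually wraps the handler's result in an `output` field of its own response envelope, so check the raw response before indexing into it.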

Output Payload

Success Response

{
    "metadata_str": "optional string",
    "enable_timestamps": true,
    "language": "nl",
    "detected_language": "nl",
    "detected_language_confidence": 0.9805044531822205,
    "joined_text": "... full transcription ...",
    "translation_text": "...full translation...",
    "translation_error": false,
    "hallucination_score": 0.2,
    "hallucination_reason": "Minor repetitions detected",
    "segments": [
        {
            "text": "Segment text",
            "start": 0.0,
            "end": 2.5
        }
    ]
}

Response Fields

joined_text: Complete transcription text (translated if needed)
translation_error: true if any translation failed or timed out
hallucination_score: Float 0.0–1.0 indicating severity:
- 0.0: No hallucination detected
- 0.1–0.3: Minor errors, meaning intact
- 0.4–0.6: Moderate errors, partial distortion
- 0.7–0.9: Severe errors, strong distortion
- 1.0: Complete hallucination/nonsense
hallucination_reason: Brief explanation (max 20 words) when score > 0
segments: Array of segment objects (only when enable_timestamps: true)
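
Downstream code often wants a label rather than a raw score. The sketch below maps a score to the bands listed above. The README does not say how values between bands (for example 0.35) are classified, so this sketch assumes they fall into the next higher band; `hallucination_severity` and its labels are hypothetical.

```python
def hallucination_severity(score):
    """Map a hallucination_score in [0.0, 1.0] to a coarse severity label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("hallucination_score must be between 0.0 and 1.0")
    if score == 0.0:
        return "none"
    if score <= 0.3:
        return "minor"
    # Assumption: in-between values such as 0.35 count as the higher band.
    if score <= 0.6:
        return "moderate"
    if score < 1.0:
        return "severe"
    return "complete"
```

A caller might, for instance, flag anything labeled "moderate" or worse for human review.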

Error Response

{
    "metadata_str": "",
    "enable_timestamps": false,
    "language": "en",
    "error": "No audio input provided",
    "message": "An unhandled error occurred while processing the request."
}
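
Because success and error responses share several fields, a client has to look for the `error` key explicitly. Here is a minimal sketch of that check, assuming `output` is the handler's response already parsed into a dict; `TranscriptionError` and `read_output` are hypothetical names.

```python
class TranscriptionError(Exception):
    """Raised when the worker output carries an "error" field."""


def read_output(output):
    """Return (text, translation_failed) or raise TranscriptionError."""
    if "error" in output:
        detail = output["error"]
        message = output.get("message", "")
        raise TranscriptionError(f"{detail}: {message}" if message else detail)
    # translation_error is treated as False when the field is absent.
    return output["joined_text"], bool(output.get("translation_error", False))
```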

Local Development

Clone the repository:

git clone https://github.com/dembrane/runpod-whisper.git
cd runpod-whisper

Install dependencies:
```
pip install -r requirements.txt
```
Set up the environment variables and modify the test_input.json file.
Run the handler:
```
python handler.py
```

Compute Resources

The handler automatically detects available hardware:

GPU: Uses cuda device with compute_type="float16"
CPU with MPS (Apple Silicon): Uses cpu device with compute_type="float32"
CPU without MPS: Uses cpu device with compute_type="int8"

CPU threads are set to the available CPU count.
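
The selection rules above can be sketched as a small pure function. How the handler actually probes for CUDA or MPS is not stated here, so the two booleans are inputs (they could come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, for instance). `select_compute` is a hypothetical name; the returned keys match the `device`, `compute_type`, and `cpu_threads` arguments of `faster_whisper.WhisperModel`.

```python
import os


def select_compute(has_cuda, has_mps, force_cpu=False):
    """Pick device settings following the rules in this section."""
    # os.cpu_count() can return None, so fall back to a single thread.
    cpu_threads = os.cpu_count() or 1
    if has_cuda and not force_cpu:
        return {"device": "cuda", "compute_type": "float16", "cpu_threads": cpu_threads}
    # On CPU, Apple Silicon (MPS present) uses float32; everything else int8.
    compute_type = "float32" if has_mps else "int8"
    return {"device": "cpu", "compute_type": compute_type, "cpu_threads": cpu_threads}
```

The `force_cpu` flag corresponds to setting USE_CPU to "1".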

Translation Behavior

When LiteLLM is configured and the detected language differs from the requested language, the transcription is translated into the requested language.
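
Put as a predicate, the rule looks roughly like this. The sketch assumes that language codes are compared case-insensitively and that a request without a `language` never triggers translation; neither detail is confirmed by this README, and `needs_translation` is a hypothetical name.

```python
def needs_translation(requested, detected, litellm_configured, disable_translation=False):
    """Decide whether the transcription should be translated."""
    if not litellm_configured or disable_translation:
        return False
    if not requested:
        # Assumption: no target language means nothing to translate into.
        return False
    return requested.strip().lower() != detected.strip().lower()
```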

Hallucination Detection

When LiteLLM is configured, the system analyzes the complete transcription after translation and checks it for common hallucination patterns:

Excessive word/phrase repetition
Nonsensical or contradictory sequences
Abrupt topic changes
Misplaced technical terms
Transcribed filler sounds

CI/CD

GitHub Actions workflow (.github/workflows/ci.yml):

Triggers: Push to main branch or manual dispatch
Actions:
- Builds Docker image
- Pushes to Azure Container Registry
- Tags with Git commit SHA
- Uses layer caching for efficiency

Required Secrets

AZURE_REGISTRY_LOGIN_SERVER
AZURE_REGISTRY_USERNAME
AZURE_REGISTRY_PASSWORD

Security Considerations

Audio URLs must be publicly accessible
Temporary files are cleaned up automatically when a public audio URL is used
API keys should be stored securely as environment variables
Consider network policies for production deployments
