Python: >=3.10, <3.13
A CLI + Python package for automatic speech recognition (ASR) on audio or video using pluggable backends (currently WhisperX).
This tool:
- Accepts audio (e.g.,
wav/mp3/m4a) and video (e.g.,mp4/mkv/mov) - Automatically extracts audio from video inputs (requires ffmpeg)
- Writes a canonical
transcript.jsonplus optional sidecars:txt/srt/vtt/tsv - Optionally creates a subtitled MP4 with burned-in subtitles
ffmpeg is required for:
- extracting audio from video inputs
- creating subtitled videos
Install examples:
- macOS:
brew install ffmpeg - Conda env (recommended inside conda):
conda install -c conda-forge ffmpeg - Ubuntu/Debian:
sudo apt-get update && sudo apt-get install -y ffmpeg
pip install transcribe-audiopython -m transcribe_audio.transcribe -h
# or
python -m transcribe_audio.transcribe --helppython -c "from transcribe_audio.backends import available_backends; print('\\n'.join(available_backends()))"python -m transcribe_audio.transcribe \
-i in.wav \
--backend whisperx \
--model small \
--task transcribe \
-o out \
--run-name demo_transcribe \
-f all \
--save-subtitled-videopython -m transcribe_audio.transcribe \
-i in.wav \
--backend whisperx \
--model small \
--task transcribe \
-o out \
--run-name demo_diarize \
-f all \
--diarize \
--hf_token "$HF_TOKEN" \
--save-subtitled-videoOutputs land in:
out/<run-name>/
transcript.json
transcript.txt
transcript.srt
transcript.vtt
transcript.tsv
*_subtitled.mp4
Diarization models are often gated (require accepting terms) and may require a Hugging Face token.
When is an HF token needed?
- Usually only when models are not already cached (first download, cache miss, or when the library checks the Hub).
- For gated repos (common for diarization), a token + acceptance is required for downloads.
- If you run fully offline with everything cached (and
--model_cache_only truefor ASR), a token is often not needed.
- Go to HF settings → Access Tokens: https://huggingface.co/settings/tokens
- Create a New token with Role: Read
- Copy the token (starts with
hf_...)
Docs (tokens): https://huggingface.co/docs/hub/en/security-tokens
When you use diarization, you may be prompted to accept access for pyannote models. Open the model page mentioned in your errors (or the default diarization pipeline) and click “Agree and access”.
Browse diarization models: https://huggingface.co/pyannote/models
Common default diarization pipeline used by WhisperX: https://huggingface.co/pyannote/speaker-diarization-3.1
export HF_TOKEN="hf_XXXXXXXXXXXXXXXXXXXX"Then pass it to the CLI:
--hf_token "$HF_TOKEN"Security note: rotate/revoke tokens you’ve pasted anywhere public.
--input,-i <path ...>(required): One or more audio/video files.- Video inputs are auto-converted to WAV via ffmpeg into the run directory.
--output-dir,-o <dir>: Root output directory (defaultout).--run-name <name>: Run folder name under--output-dir.- If omitted, derived as
<first_input_stem>_<backend>.
- If omitted, derived as
--output-format,-f <format>: Transcript sidecars to generate:json(canonical output, always written)txt(plain text, includes speaker prefix when available)srt/vtt(subtitle formats)tsv(tab-separated: start/end/speaker/text)all(json + txt + srt + vtt + tsv)
--save-subtitled-video [out.mp4]: Create an MP4 with burned-in subtitles.- If a video input exists: subtitles burned onto the first video.
- If audio-only: creates a small black-strip video with subtitles.
- With no filename, auto-names output inside the run folder.
--backend <name>: Which backend to use (choices come fromavailable_backends()).- Current default:
whisperx.
- Current default:
--model <name>: Whisper/faster-whisper model name (defaultsmall).- Larger models are more accurate but slower and require more memory.
- Model list/reference: https://huggingface.co/collections/Systran/faster-whisper
--model_cache_only true|false: If true, do not download models (offline mode).- Will fail if models are not already cached.
--model_dir <path>: Directory for model downloads/cache.--device <cpu|cuda|...>: Device override. If omitted, backend auto-selects.--device_index <int>: GPU index (default 0).--batch_size <int>: Decoding batch size (default 8).- Higher can be faster but uses more VRAM. Lower if you hit OOM.
--compute_type <float16|float32|int8>:float16: faster on GPU, uses less VRAM (may be less stable on some setups)float32: safest/stableint8: lower memory, can be good on CPU; may impact accuracy
--task <transcribe|translate>:transcribe: speech → text in the same languagetranslate: speech → English (if supported by the model)
--language <code>: Force language (e.g.,en,es).- If omitted, auto-detect is used (slower but convenient).
--no_align: Disable alignment.- Faster, but timestamps may be less precise and word timings may be missing.
--align_model <name>: Optional alignment model override (usually auto-selected).- Alignment model reference list: https://docs.pytorch.org/audio/0.12.0/pipelines.html#wav2vec-2-0-hubert-fine-tuned-asr
--interpolate_method <nearest|linear|ignore>:nearest: fill missing word times using nearest neighbors (safe default)linear: interpolate missing timesignore: leave missing (can create gaps)
--return_char_alignments: Include character-level alignment (more detail, heavier output).
--vad_method <pyannote|silero>: Voice activity detection backend.- Changes segmentation and can affect diarization quality.
--vad_onset <float>: Speech-start threshold.- Higher = fewer false positives, but may miss quiet speech.
--vad_offset <float>: Speech-end threshold.- Higher = cuts off sooner; lower = keeps trailing audio longer.
--chunk_size <seconds>: Chunk size used in transcription (default 30).- Smaller chunks can reduce memory usage; larger can be faster.
--diarize: Enable speaker diarization.- Requires HF token + accepted model access in many setups.
- Browse diarization models: https://huggingface.co/pyannote/models
--min_speakers <int>: Optional lower bound on speakers.--max_speakers <int>: Optional upper bound on speakers.--diarize_model <repo>: Diarization pipeline to use.- Default:
pyannote/speaker-diarization-3.1
- Default:
--speaker_embeddings: Compute/return speaker embeddings.- Can improve speaker consistency but increases compute/memory.
These knobs mainly affect ASR decoding and stability:
--temperature <float>: 0.0 is deterministic; higher may help difficult audio but can introduce errors.--best_of <int>: More candidates; slower; sometimes better.--beam_size <int>: Larger beam can be more accurate; slower.--patience <float>: Beam search patience; higher explores more; slower.--length_penalty <float>: Penalizes/encourages longer outputs.--suppress_tokens <csv>: Token IDs to suppress (-1 = default behavior).--suppress_numerals: Suppress numeral tokens (may reduce number hallucinations but can remove real numbers).--initial_prompt <text>: Steer style/vocabulary (domain phrases).--hotwords <text>: Bias toward certain words (backend-dependent).--condition_on_previous_text true|false: Better continuity, but can compound errors.--fp16 true|false: Prefer fp16 when supported (mostly affects GPU).
Fallback / thresholds:
--temperature_increment_on_fallback <float>: Increase temperature when decoding fails.--compression_ratio_threshold <float>: Lower filters repetitive output more aggressively.--logprob_threshold <float>: Increase to drop low-confidence segments (can remove content).--no_speech_threshold <float>: Raise to skip more silence/background.
--max_line_width <int>: Wrap subtitle lines to this width.--max_line_count <int>: Limit subtitle lines per cue.--highlight_words true|false: Word highlighting in subtitle-like outputs (backend-dependent).--segment_resolution <sentence|chunk>:sentence: generally more readable subtitle segmentschunk: follows chunk/VAD boundaries more closely
--threads <int>: CPU thread hint (0 = backend default).--hf_token <token>: Hugging Face token for gated models (diarization, some VAD).--print_progress true|false: Show progress bars/logs.--verbose true|false: Print backend info logs and warnings (recommended while setting up).
After install:
python -c "from transcribe_audio import transcribe; print(transcribe)"from transcribe_audio import transcribe
res = transcribe(
inputs="in.wav",
backend="whisperx",
model="small",
output_format="all",
output_dir="out",
run_name="py_demo",
save_subtitled_video="__AUTO__", # or "my_subs.mp4"
)
print("Run dir:", res["run_dir"])
print("JSON:", res["outputs"]["json"])from transcribe_audio import transcribe
import os
res = transcribe(
inputs="in.wav",
backend="whisperx",
model="small",
output_format="all",
output_dir="out",
run_name="py_diarize",
save_subtitled_video="__AUTO__",
diarize=True,
hf_token=os.environ.get("HF_TOKEN"),
)
print(res["outputs"])Install uv (Astral): https://docs.astral.sh/uv/getting-started/installation/#standalone-installer
Verify:
uv --versionClone + install deps:
git clone https://github.com/Surya-Rayala/transcribe-audio.git
cd transcribe-audio
uv syncCLI help (uv environment):
uv run python -m transcribe_audio.transcribe -hBasic transcription (uv):
uv run python -m transcribe_audio.transcribe \
-i in.wav \
--backend whisperx \
--model small \
--task transcribe \
-o out \
--run-name demo_transcribe \
-f all \
--save-subtitled-videoTranscription + diarization (uv):
uv run python -m transcribe_audio.transcribe \
-i in.wav \
--backend whisperx \
--model small \
--task transcribe \
-o out \
--run-name demo_diarize \
-f all \
--diarize \
--hf_token "$HF_TOKEN" \
--save-subtitled-videoThis project is licensed under the MIT License. See the LICENSE file for details.