Download, transcribe, diarize, and translate audio/video in 12 Indian languages — entirely on your machine.
A Claude Code skill set that gives your AI assistant the ability to download videos from 1000+ sites, transcribe speech to text using state-of-the-art fine-tuned Whisper models (or Alibaba's Qwen3-ASR), identify speakers with pyannote diarization, and translate between languages. No cloud APIs, no data leaving your machine, no API keys.
- Speaker Diarization — Identify who spoke when using pyannote-audio. Add `--diarize` and get `[Speaker 1]`, `[Speaker 2]` labels in every output file.
- Qwen3-ASR Engine — Switch to Alibaba's Qwen3-ASR-1.7B as an alternative ASR engine with `--engine qwen`. Supports Hindi + ~30 languages. Drop-in comparison with Whisper, same output formats.
- Multi-engine Architecture — Choose the best engine per task. Whisper for Indian languages, Qwen for broader multilingual coverage. Diarization works with both.
Indian language content is exploding — YouTube alone has 500M+ Indian language users. But tooling for working with this content programmatically is fragmented:
- Standard Whisper works well for English but struggles with Telugu, Kannada, or Odia
- Fine-tuned models exist but are scattered across HuggingFace repos and obscure ZIP downloads
- Downloading videos, extracting audio, transcribing, and translating requires stitching together 5+ different tools
- Multi-speaker content (interviews, podcasts, meetings) loses all attribution without diarization
Indic Voice Pipeline solves this. One install, one natural language command, and Claude handles the entire pipeline — from URL to speaker-attributed translated text.
You: "Download this Telugu interview and transcribe it with speaker labels"
Claude: [downloads] → [extracts audio] → [loads fine-tuned Telugu model] → [transcribes] → [diarizes speakers] → [saves .txt + .srt + .json]
Download videos, audio, and playlists from 1000+ websites using yt-dlp.
| Feature | Details |
|---|---|
| Sites | YouTube, Vimeo, Twitter/X, TikTok, Instagram, Reddit, Twitch, and 1000+ more |
| Formats | Video (mp4, webm), Audio-only (mp3, m4a, opus) |
| Quality | 360p, 480p, 720p, 1080p, 4K |
| Extras | Subtitle download, playlist support, format listing |
Transcribe and translate audio/video using OpenAI Whisper + fine-tuned Indian language models, with optional speaker diarization.
| Feature | Details |
|---|---|
| Languages | 99+ languages, with fine-tuned models for 12 Indian languages |
| Models | Standard Whisper (tiny → large) + HuggingFace fine-tuned + AI4Bharat IndicWhisper |
| Output | Plain text (.txt), Subtitles (.srt), Structured JSON (.json) |
| Hardware | Apple Silicon (MPS), NVIDIA GPU (CUDA), or CPU |
| Accuracy | Intelligent 25s chunking with 5s overlap — no words lost at boundaries |
| Diarization | Optional speaker identification via pyannote-audio (--diarize) |
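The 25-second chunking with 5-second overlap mentioned above can be pictured with a small sketch. This is illustrative only; `chunk_windows` is a hypothetical name, not a function in this repo:

```python
# Illustrative sketch (hypothetical helper, not the pipeline's actual code):
# cover an audio file with 25s windows that each overlap the previous by 5s,
# so no word at a chunk boundary is lost.
def chunk_windows(duration_s, chunk_s=25.0, overlap_s=5.0):
    """Yield (start, end) windows covering [0, duration_s]."""
    step = chunk_s - overlap_s  # each window advances by 20s
    start = 0.0
    while start < duration_s:
        yield (start, min(start + chunk_s, duration_s))
        if start + chunk_s >= duration_s:
            break
        start += step

print(list(chunk_windows(60)))
# [(0.0, 25.0), (20.0, 45.0), (40.0, 60.0)]
```

Each pair of adjacent windows shares 5 seconds of audio; the overlap is later deduplicated by the merge step.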
Alternative ASR engine using Alibaba's Qwen3-ASR-1.7B for broader multilingual coverage.
| Feature | Details |
|---|---|
| Languages | Hindi + ~30 languages (Arabic, French, German, Japanese, Korean, Russian, Spanish, Thai, Turkish, Vietnamese, and more) |
| Model | Qwen/Qwen3-ASR-1.7B from HuggingFace (~3.5 GB, downloaded and cached once) |
| Output | Same .txt, .srt, .json formats — drop-in comparison with Whisper |
| Fallback | Auto-falls back to Whisper for unsupported languages (Telugu, Kannada, etc.) |
| Diarization | Full speaker diarization support (--diarize works with Qwen too) |
| License | Apache 2.0 |
| Language | Code | Model Source | Base |
|---|---|---|---|
| Telugu | te | vasista22/whisper-telugu-large-v2 | Whisper Large-v2 (1.5B) |
| Hindi | hi | vasista22/whisper-hindi-large-v2 | Whisper Large-v2 (1.5B) |
| Kannada | kn | vasista22/whisper-kannada-medium | Whisper Medium (769M) |
| Gujarati | gu | vasista22/whisper-gujarati-medium | Whisper Medium (769M) |
| Tamil | ta | vasista22/whisper-tamil-medium | Whisper Medium (769M) |
| Bengali | bn | AI4Bharat IndicWhisper | Whisper Medium (769M) |
| Malayalam | ml | AI4Bharat IndicWhisper | Whisper Medium (769M) |
| Marathi | mr | AI4Bharat IndicWhisper | Whisper Medium (769M) |
| Odia | or | AI4Bharat IndicWhisper | Whisper Medium (769M) |
| Punjabi | pa | AI4Bharat IndicWhisper | Whisper Medium (769M) |
| Sanskrit | sa | AI4Bharat IndicWhisper | Whisper Medium (769M) |
| Urdu | ur | AI4Bharat IndicWhisper | Whisper Medium (769M) |
Model sources:
- vasista22 — IIT Madras Speech Lab, funded by Bhashini / MeitY
- AI4Bharat IndicWhisper — IIT Madras, trained on 10,700+ hours across 12 languages, MIT licensed
- Python 3.10+
- ffmpeg — `brew install ffmpeg` (macOS) / `sudo apt install ffmpeg` (Linux)
- Claude Code — Install Claude Code
- HuggingFace Token (optional, only for speaker diarization) — Free read-only token from huggingface.co/settings/tokens
One-line install:
```bash
git clone https://github.com/humancto/indic-voice-pipeline.git
cd indic-voice-pipeline && bash install.sh
```

Manual install:
```bash
# Install Python dependencies
pip install -r requirements.txt

# (Optional) Install speaker diarization support
pip install pyannote.audio

# Copy skills to Claude Code
cp -r skills/video-downloader ~/.claude/skills/
cp -r skills/whisper-transcribe ~/.claude/skills/
```

Note: If you skip `pip install pyannote.audio`, everything else works normally. Diarization is the only feature that requires it. The pipeline degrades gracefully — transcription always works regardless.
```bash
bash uninstall.sh
```

Once installed, just talk to Claude naturally. The skills are triggered automatically based on your intent.
| Flag | What it does | Default | Example |
|---|---|---|---|
| `--language` | Set source language (skips auto-detection, loads best model) | Auto-detect | `--language te` |
| `--model` | Whisper model size: tiny, base, small, medium, large | `large` | `--model base` |
| `--engine` | ASR engine: whisper or qwen | `whisper` | `--engine qwen` |
| `--diarize` | Enable speaker diarization (who spoke when) | Off | `--diarize` |
| `--num-speakers` | Exact speaker count (improves diarization accuracy) | Auto | `--num-speakers 2` |
| `--hf-token` | HuggingFace token for diarization | `$HF_TOKEN` env var | `--hf-token hf_...` |
| `--output-dir` | Where to save output files | `~/Downloads` | `--output-dir ./out` |
| `--hf-model` | Override with any HuggingFace Whisper model | Auto-selected | `--hf-model vasista22/whisper-telugu-large-v2` |
Diarization note: `--diarize` is off by default. When enabled, it requires a HuggingFace token (via `--hf-token` or `$HF_TOKEN`). If no token is found, the pipeline prints a helpful message and continues transcription without speaker labels — it never fails.
"Download this video: https://youtube.com/watch?v=..."
"Download audio only from this URL as MP3"
"Download this entire playlist"
"What formats are available for this video?"
"Transcribe ~/Downloads/speech.mp4"
"Transcribe this Telugu video --language te"
"Transcribe ~/Downloads/podcast.mp3 --model medium"
"Transcribe ~/Downloads/interview.mp4 --diarize"
"Transcribe this Hindi podcast with speaker labels --language hi --diarize"
"Transcribe meeting.wav --diarize --num-speakers 3"
"Transcribe ~/Downloads/speech.mp3 --engine qwen"
"Transcribe this Hindi video --engine qwen --language hi"
"Transcribe meeting.wav --engine qwen --diarize"
"Translate this Hindi audio to English"
"Translate ~/Downloads/telugu_speech.mp4 --language te"
"What language is this audio file?"
"Detect the language of ~/Downloads/unknown_speech.wav"
"Download this Telugu YouTube video and transcribe it"
"Download https://youtube.com/shorts/abc123 and translate to English"
"Download this podcast and transcribe with speaker labels --diarize"
Speaker diarization answers the question: "Who spoke when?" It labels each segment of the transcript with a speaker identity — [Speaker 1], [Speaker 2], etc.
The pipeline uses pyannote/speaker-diarization-3.1, the gold standard for neural speaker diarization. Diarization runs as a post-processing step after transcription, so it works with both the Whisper and Qwen3-ASR engines.
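As a rough illustration of that post-processing step, each transcript segment can be labeled with the diarization turn that overlaps it the most. This is a minimal sketch with hypothetical names, not the pipeline's actual code:

```python
# Illustrative sketch (hypothetical names, not the pipeline's actual code):
# diarization runs after transcription, so each transcript segment gets the
# speaker whose diarization turns overlap it for the most total time.
def assign_speakers(segments, turns):
    """segments: [{'start','end','text'}]; turns: [(start, end, speaker)]."""
    for seg in segments:
        overlaps = {}
        for t_start, t_end, speaker in turns:
            ov = min(seg["end"], t_end) - max(seg["start"], t_start)
            if ov > 0:  # accumulate positive overlap per speaker
                overlaps[speaker] = overlaps.get(speaker, 0.0) + ov
        seg["speaker"] = max(overlaps, key=overlaps.get) if overlaps else "Unknown"
    return segments

segments = [{"start": 0.0, "end": 5.3, "text": "..."},
            {"start": 5.3, "end": 8.1, "text": "..."}]
turns = [(0.0, 5.2, "Speaker 1"), (5.2, 8.1, "Speaker 2")]
assign_speakers(segments, turns)
# first segment is labeled Speaker 1, second Speaker 2
```

Because the labels are attached after the fact, the same logic works no matter which ASR engine produced the segments.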
Step 1: Install pyannote.audio
```bash
pip install pyannote.audio
```

Step 2: Create a free HuggingFace token (read-only access is sufficient)
Go to huggingface.co/settings/tokens and create a token.
Step 3: Accept the pyannote model licenses
You must accept both gated model licenses on HuggingFace (free, instant approval):
- Speaker Diarization 3.1 — huggingface.co/pyannote/speaker-diarization-3.1 → Click "Agree and access repository"
- Segmentation 3.0 — huggingface.co/pyannote/segmentation-3.0 → Click "Agree and access repository"
Important: Both licenses are required. The diarization pipeline internally depends on the segmentation model. If you only accept the first one, you'll get a `403 Forbidden` error.
Step 4: Provide your token (choose one method)
```bash
# Option A: Pass directly as a flag
--hf-token hf_...

# Option B: Set as an environment variable
export HF_TOKEN="hf_..."
```

The token is required on every run, not just for the initial model download — pyannote uses it for authentication on each load. The recommended approach is to set it permanently in your shell profile:

```bash
echo 'export HF_TOKEN="hf_your_token_here"' >> ~/.zshrc
source ~/.zshrc
```

The pyannote models (~300 MB total) are downloaded once on first use and cached permanently at `~/.cache/torch/pyannote/`.
| Flag | Description | Example |
|---|---|---|
| `--diarize` | Enable speaker diarization | `--diarize` |
| `--hf-token` | HuggingFace token for pyannote model access | `--hf-token hf_abc123` |
| `--num-speakers` | Exact number of speakers (if known) | `--num-speakers 2` |
| `--min-speakers` | Minimum expected speakers | `--min-speakers 2` |
| `--max-speakers` | Maximum expected speakers | `--max-speakers 5` |
Plain Text (.txt) with speaker labels:
```
[Speaker 1] నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.
[Speaker 2] అవును, చాలా మంచి విషయం. ప్రారంభిద్దాం.
[Speaker 1] శ్రీరాముడు అయోధ్యకు తిరిగి వచ్చిన రోజు...
```
SRT Subtitles (.srt) with speaker labels:
```
1
00:00:00,000 --> 00:00:05,320
[Speaker 1] నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.

2
00:00:05,320 --> 00:00:08,100
[Speaker 2] అవును, చాలా మంచి విషయం. ప్రారంభిద్దాం.
```
Structured JSON (.json) with speaker metadata:
```json
{
  "file": "/Users/you/Downloads/interview.mp4",
  "language": "te",
  "task": "transcribe",
  "model": "vasista22/whisper-telugu-large-v2",
  "diarization": true,
  "speakers_found": 2,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 5.32,
      "speaker": "Speaker 1",
      "text": "నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం."
    }
  ]
}
```

| Platform | Diarization Device | Notes |
|---|---|---|
| Apple Silicon | CPU | MPS is not supported by pyannote; CPU is used automatically |
| NVIDIA GPU | CUDA | Full GPU acceleration |
| CPU-only | CPU | Works, but slower on long audio files |
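The device choice in the table above can be summarized as a small decision function. This is an assumed sketch of the logic, not code from this repo:

```python
# Illustrative sketch (assumed logic, not the pipeline's actual code):
# choose a compute device, forcing CPU for diarization because pyannote
# does not support Apple's MPS backend.
def pick_device(cuda_available, mps_available, for_diarization=False):
    if cuda_available:
        return "cuda"  # NVIDIA GPU: full acceleration for everything
    if mps_available and not for_diarization:
        return "mps"   # Apple Silicon: fine for ASR, not for pyannote
    return "cpu"       # universal fallback

print(pick_device(False, True, for_diarization=True))  # Apple Silicon + diarize
# cpu
```

In practice the availability flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.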
If pyannote.audio is not installed and you pass `--diarize`, the pipeline prints a warning and proceeds with transcription only — no crash, no error. Speaker labels will simply be absent from the output. Install pyannote when you need diarization; skip it if you never do.
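That graceful-degradation behavior boils down to treating pyannote as an optional import. A minimal sketch with a hypothetical structure, not the actual script:

```python
# Illustrative sketch (hypothetical structure, not the actual script):
# pyannote.audio is an optional dependency, so transcription proceeds
# even when it is missing and --diarize was requested.
def maybe_diarize(segments, diarize=False):
    if not diarize:
        return segments
    try:
        from pyannote.audio import Pipeline  # optional dependency
    except ImportError:
        print("Warning: pyannote.audio not installed; skipping speaker labels")
        return segments
    # ... otherwise run the diarization pipeline and attach speaker labels ...
    return segments
```

The key point is that the `ImportError` path returns the unlabeled segments instead of raising.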
Qwen3-ASR-1.7B is Alibaba's open-source automatic speech recognition model, offering strong multilingual performance as an alternative to Whisper.
| Scenario | Recommended Engine |
|---|---|
| Telugu, Kannada, Odia, or other Indian languages with fine-tuned Whisper models | Whisper (--engine whisper, default) |
| Hindi with desire for a second opinion or comparison | Either — try both |
| Arabic, French, German, Japanese, Korean, Russian, Spanish, Thai, Turkish, Vietnamese | Qwen (--engine qwen) |
| Benchmarking ASR engines against each other | Both — run once with each engine |
Hindi, Arabic, Cantonese, Dutch, English, French, German, Indonesian, Italian, Japanese, Korean, Malay, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian, Vietnamese, and more (~30 total).
Auto-fallback: If you request a language that Qwen3-ASR does not support (e.g., Telugu, Kannada), the pipeline automatically falls back to Whisper. No manual intervention needed.
```bash
# Transcribe with Qwen3-ASR
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe audio.mp3 --engine qwen

# Qwen + diarization
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe interview.mp4 --engine qwen --diarize --hf-token hf_...

# Qwen with specific language
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe audio.mp3 --engine qwen --language hi
```

| Property | Value |
|---|---|
| Model | Qwen/Qwen3-ASR-1.7B |
| Parameters | 1.7 billion |
| Size on disk | ~3.5 GB (downloaded once, cached permanently) |
| License | Apache 2.0 |
| Source | HuggingFace |
- Subtitle generation — Auto-generate `.srt` subtitle files for YouTube videos in any Indian language
- Content repurposing — Download a Telugu podcast, transcribe it, translate to English, create blog posts
- Multi-language publishing — Transcribe Hindi content, translate to English for wider reach
- Lecture transcription — Transcribe university lectures in Tamil, Kannada, or Hindi
- Oral history preservation — Digitize oral traditions in Sanskrit, Odia, or Punjabi
- Linguistic research — Analyze speech patterns across 12 Indian languages locally
- Interview transcription — Transcribe interviews with `--diarize` to get speaker-attributed text. Know exactly who said what.
- Podcast transcription — Multi-speaker podcasts with automatic speaker attribution. Each host and guest is labeled.
- Meeting notes — Transcribe team meetings with speaker identification. Instantly generate attributed minutes.
- Interview processing — Download and transcribe interviews from any platform with full speaker attribution
- Evidence documentation — Transcribe audio/video evidence with timestamps and speaker labels
- Accessibility — Generate subtitles for hearing-impaired audiences
- Voice notes — Convert voice memos in your native language to searchable text
- Religious content — Transcribe pravachans, kirtans, and spiritual discourses
- Family archives — Digitize family recordings in regional languages
- Training data generation — Create parallel corpora (audio + text) for ML models
- ASR benchmarking — Compare transcription quality across Whisper and Qwen3-ASR engines
- Pipeline automation — Build downstream NLP workflows on top of transcriptions
┌──────────────────────────────────────────────────────────────────────┐
│ Claude Code (CLI) │
│ │
│ "Download and transcribe this Telugu interview with speaker labels" │
│ │ │
│ ▼ │
│ ┌─────────────────┐ ┌──────────────────────────────────────┐ │
│ │ video-downloader │ │ whisper-transcribe │ │
│ │ │ │ │ │
│ │ yt-dlp engine │────▶│ Audio Extraction (ffmpeg, 16kHz) │ │
│ │ 1000+ sites │ │ │ │ │
│ │ Any quality │ │ ▼ │ │
│ │ Audio extract │ │ Engine Selection │ │
│ └─────────────────┘ │ ┌───────┴────────┐ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ Whisper Qwen3-ASR │ │
│ │ (default) (--engine qwen) │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ Language Detection │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Model Router │ │
│ │ ┌──────┴──────┐ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ vasista22 IndicWhisper / Qwen │ │
│ │ (HuggingFace) (ZIP / HF cache) │ │
│ │ │ │ │ │
│ │ └──────┬──────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Chunked Inference │ │
│ │ 25s windows, 5s overlap │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Smart Merge │ │
│ │ (no words lost) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────┐ │ │
│ │ │ Speaker Diarization │ │ │
│ │ │ (pyannote, optional) │ │ │
│ │ │ --diarize flag │ │ │
│ │ └─────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ .txt .srt .json │ │
│ │ (with speaker labels if diarized) │ │
│ └──────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
URL → yt-dlp Download → Audio Extraction (ffmpeg) → Engine Selection (Whisper / Qwen)
→ Language Detection → Model Router → Chunked Inference → Smart Merge
→ [Optional: Speaker Diarization (pyannote)] → .txt .srt .json
- Audio extraction — Video files have audio extracted via ffmpeg (16kHz mono WAV)
- Engine selection — Whisper (default) or Qwen3-ASR (`--engine qwen`)
- Model selection — Language code maps to the best available fine-tuned model
- Chunked inference — Long audio is split into 25-second chunks with 5-second overlaps
- Smart merge — Overlapping regions are deduplicated using a 3-tier algorithm:
  - Exact suffix-prefix word matching
  - Fuzzy anchor word matching
  - Heuristic proportional skip
- Speaker diarization (optional) — If `--diarize` is set, pyannote identifies speakers and labels each segment
- Multi-format output — Results saved as plain text, SRT subtitles, and structured JSON
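Tier 1 of the merge, exact suffix-prefix word matching, can be sketched as follows. This is illustrative only; the names and the `max_overlap` cap are assumptions, not the repo's actual code:

```python
# Illustrative sketch (hypothetical names, not the pipeline's actual code):
# tier 1 of the merge. Find the longest run of words that both ends the
# previous chunk's transcript and begins the next one, and emit it once.
def merge_exact(prev_words, next_words, max_overlap=12):
    for n in range(min(max_overlap, len(prev_words), len(next_words)), 0, -1):
        if prev_words[-n:] == next_words[:n]:
            return prev_words + next_words[n:]
    # no exact match: real code would fall through to the fuzzy/heuristic tiers
    return prev_words + next_words

a = "we will discuss the Ramayana today and".split()
b = "today and its history in detail".split()
print(" ".join(merge_exact(a, b)))
# we will discuss the Ramayana today and its history in detail
```

The fuzzy-anchor and proportional-skip tiers only come into play when the two chunks transcribed the shared 5 seconds slightly differently, so no exact run exists.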
```
User specifies --engine qwen?  → Use Qwen3-ASR-1.7B
                                 → Auto-fallback to Whisper if language unsupported
User specifies --hf-model?     → Use that exact Whisper model
User specifies --language?     → Check vasista22 (HuggingFace) first
                                 → Fall back to IndicWhisper (ZIP download)
Neither specified?             → Use standard Whisper with auto-detection
```
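The routing order above can be sketched as a small function. This is illustrative; the language sets are abbreviated and all names are hypothetical:

```python
# Illustrative sketch (hypothetical names, abbreviated language sets):
# the router's precedence order, as described in the decision tree above.
VASISTA22 = {"te", "hi", "kn", "gu", "ta"}                 # HuggingFace fine-tuned
INDICWHISPER = {"bn", "ml", "mr", "or", "pa", "sa", "ur"}  # AI4Bharat ZIP
QWEN_LANGS = {"hi", "en", "fr", "de", "ja", "ko", "ru", "es"}  # subset shown

def route(engine="whisper", hf_model=None, language=None):
    if engine == "qwen":
        if language is None or language in QWEN_LANGS:
            return "qwen3-asr-1.7b"
        engine = "whisper"  # auto-fallback for unsupported languages
    if hf_model:
        return hf_model     # explicit override wins
    if language in VASISTA22:
        return f"vasista22 model for '{language}'"
    if language in INDICWHISPER:
        return f"IndicWhisper model for '{language}'"
    return "standard whisper + auto-detect"

print(route(engine="qwen", language="te"))
# vasista22 model for 'te'  (Qwen does not support Telugu, so it falls back)
```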
The scripts work independently — you don't need Claude Code to use them.
Download a video:
```bash
python3 skills/video-downloader/scripts/video_downloader.py download "https://youtube.com/watch?v=..." --output-dir ~/Downloads
```

Transcribe with Telugu model:

```bash
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe ~/Downloads/video.mp4 --language te --output-dir ~/Downloads
```

Transcribe with speaker diarization:

```bash
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe ~/Downloads/interview.mp4 \
  --language te --diarize --hf-token hf_... --output-dir ~/Downloads
```

Transcribe with Qwen3-ASR engine:

```bash
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe ~/Downloads/audio.mp3 \
  --engine qwen --language hi --output-dir ~/Downloads
```

Transcribe with Qwen3-ASR + diarization:

```bash
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe ~/Downloads/meeting.mp4 \
  --engine qwen --diarize --num-speakers 3 --hf-token hf_... --output-dir ~/Downloads
```

Translate to English:

```bash
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py translate ~/Downloads/audio.mp3 --language hi --output-dir ~/Downloads
```

Detect language:

```bash
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py detect ~/Downloads/audio.mp3
```

List video formats:

```bash
python3 skills/video-downloader/scripts/video_downloader.py formats "https://youtube.com/watch?v=..."
```

Use any Whisper fine-tuned model from HuggingFace:

```bash
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe audio.mp3 \
  --hf-model "openai/whisper-large-v3"
```

| Model | VRAM | Speed | Accuracy | Best For |
|---|---|---|---|---|
| `tiny` | ~1 GB | Fastest | Low | Quick drafts, clear speech |
| `base` | ~1 GB | Fast | Good | Default — good balance |
| `small` | ~2 GB | Moderate | Better | Noisy audio, accented speech |
| `medium` | ~5 GB | Slow | Great | Non-English, complex audio |
| `large` | ~10 GB | Slowest | Best | Maximum accuracy, rare languages |
| Property | Whisper | Qwen3-ASR-1.7B |
|---|---|---|
| Indian language models | 12 fine-tuned (vasista22 + IndicWhisper) | Hindi only (others auto-fallback) |
| Global languages | 99+ | ~30 |
| Model sizes | tiny (39M) to large (1.5B) | 1.7B (single model) |
| Disk space | Varies by model size | ~3.5 GB |
| License | MIT (Whisper) / varies (fine-tuned) | Apache 2.0 |
| Diarization support | Yes | Yes |
Every transcription produces three files:
Plain Text (.txt):

```
నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.
```
With `--diarize`:

```
[Speaker 1] నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.
[Speaker 2] అవును, చాలా మంచి విషయం.
```
SRT Subtitles (.srt):

```
1
00:00:00,000 --> 00:00:05,320
నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.

2
00:00:05,320 --> 00:00:12,800
శ్రీరాముడు అయోధ్యకు తిరిగి వచ్చిన రోజు...
```
With `--diarize`:

```
1
00:00:00,000 --> 00:00:05,320
[Speaker 1] నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.

2
00:00:05,320 --> 00:00:08,100
[Speaker 2] అవును, చాలా మంచి విషయం. ప్రారంభిద్దాం.
```
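The `HH:MM:SS,mmm` timestamps in the cues above follow the standard SubRip convention. A minimal sketch of formatting a cue from a segment's float seconds (hypothetical helpers, not the repo's code):

```python
# Illustrative sketch (hypothetical helpers, not the repo's code):
# format float seconds as the HH:MM:SS,mmm timestamps SRT requires,
# and assemble one numbered cue, optionally with a speaker label.
def srt_timestamp(seconds):
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index, start, end, text, speaker=None):
    label = f"[{speaker}] " if speaker else ""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{label}{text}\n"

print(srt_cue(1, 0.0, 5.32, "నమస్కారం.", speaker="Speaker 1"))
# 1
# 00:00:00,000 --> 00:00:05,320
# [Speaker 1] నమస్కారం.
```

Note the comma before the milliseconds: SRT uses `,` where most other timestamp formats use `.`.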
Structured JSON (.json):

```json
{
  "file": "/Users/you/Downloads/video.mp4",
  "language": "te",
  "task": "transcribe",
  "model": "vasista22/whisper-telugu-large-v2",
  "text": "నమస్కారం. ఈ రోజు మనం...",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 5.32,
      "start_ts": "00:00:00.000",
      "end_ts": "00:00:05.320",
      "text": "నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం."
    }
  ]
}
```

With `--diarize`:
```json
{
  "file": "/Users/you/Downloads/interview.mp4",
  "language": "te",
  "task": "transcribe",
  "model": "vasista22/whisper-telugu-large-v2",
  "diarization": true,
  "speakers_found": 2,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 5.32,
      "speaker": "Speaker 1",
      "start_ts": "00:00:00.000",
      "end_ts": "00:00:05.320",
      "text": "నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం."
    }
  ]
}
```

indic-voice-pipeline/
├── README.md # This file
├── LICENSE # MIT License
├── install.sh # One-command installer
├── uninstall.sh # Clean uninstaller
├── requirements.txt # Python dependencies
├── skills/
│ ├── video-downloader/
│ │ ├── SKILL.md # Claude Code skill definition
│ │ └── scripts/
│ │ ├── check_deps.py # Dependency checker
│ │ └── video_downloader.py # Download engine (yt-dlp wrapper)
│ └── whisper-transcribe/
│ ├── SKILL.md # Claude Code skill definition
│ └── scripts/
│ ├── check_deps.py # Dependency checker
│ └── whisper_transcribe.py # Transcription engine (Whisper + Qwen + Diarization)
├── docs/
│ ├── MODELS.md # Detailed model documentation
│ └── TROUBLESHOOTING.md # Common issues and fixes
└── examples/
└── pipeline_example.sh # Example end-to-end pipeline
See docs/TROUBLESHOOTING.md for detailed solutions. Quick fixes:
| Problem | Solution |
|---|---|
| `torch_dtype is deprecated` | Ignore — cosmetic warning, doesn't affect output |
| `transformers requires torch >= 2.6` | Pin transformers: `pip install transformers==4.46.3` |
| `numpy >= 2.0` incompatible | Downgrade: `pip install "numpy<2"` |
| `suppress_tokens` index error | Already fixed in the script — update to latest version |
| IndicWhisper download fails | Check internet connection, retry, or download ZIP manually |
| MPS out of memory | Use a smaller model: `--model small` or `--model base` |
| pyannote not installed warning | `pip install pyannote.audio` — only needed for `--diarize` |
| pyannote license not accepted / `403 Forbidden` | Accept both licenses: speaker-diarization-3.1 AND segmentation-3.0 |
| `HF_TOKEN` not set for diarization | `export HF_TOKEN="hf_..."` or pass `--hf-token hf_...` |
| Qwen unsupported language fallback | Expected behavior — pipeline auto-switches to Whisper |
This project stands on the shoulders of remarkable open-source work:
- OpenAI Whisper — The foundation model for speech recognition
- Qwen3-ASR-1.7B — Alibaba's multilingual ASR model (Apache 2.0)
- pyannote-audio — State-of-the-art neural speaker diarization
- vasista22 / IIT Madras Speech Lab — Fine-tuned Whisper models for Telugu, Hindi, Kannada, Gujarati, Tamil. Funded by Bhashini / MeitY (Government of India)
- AI4Bharat / IIT Madras — IndicWhisper models for 12 Indian languages, trained on the Vistaar dataset (10,700+ hours)
- yt-dlp — The backbone for video downloading from 1000+ sites
- HuggingFace Transformers — Model loading and inference infrastructure
- Claude Code — The AI assistant that orchestrates the entire pipeline
Contributions welcome! Areas where help is needed:
- Real-time streaming — Support live audio transcription
- Batch processing — Process entire folders of audio/video files
- Quality benchmarks — WER comparisons across models, languages, and engines
- New language models — Add fine-tuned Whisper models for more Indian languages
- Translation pipeline — Integrate translation with diarized output
MIT License. See LICENSE for details.
The fine-tuned models and dependencies have their own licenses:
- vasista22 models: Check individual model cards on HuggingFace
- AI4Bharat IndicWhisper: MIT License
- Qwen3-ASR-1.7B: Apache 2.0 License
- pyannote-audio: MIT License (model weights require license acceptance on HuggingFace)
Built with care for Indian languages.
By HumanCTO