12 Indian Languages · Whisper + Qwen3-ASR · Speaker Diarization · 100% Local · No API Keys · MIT License

Indic Voice Pipeline

Download, transcribe, diarize, and translate audio/video in 12 Indian languages — entirely on your machine.

A Claude Code skill set that gives your AI assistant the ability to download videos from 1000+ sites, transcribe speech to text using state-of-the-art fine-tuned Whisper models (or Alibaba's Qwen3-ASR), identify speakers with pyannote diarization, and translate between languages. No cloud APIs, no data leaving your machine, no API keys.


What's New

  • Speaker Diarization — Identify who spoke when using pyannote-audio. Add --diarize and get [Speaker 1], [Speaker 2] labels in every output file.
  • Qwen3-ASR Engine — Switch to Alibaba's Qwen3-ASR-1.7B as an alternative ASR engine with --engine qwen. Supports Hindi + ~30 languages. Drop-in comparison with Whisper, same output formats.
  • Multi-engine Architecture — Choose the best engine per task. Whisper for Indian languages, Qwen for broader multilingual coverage. Diarization works with both.

Why This Exists

Indian language content is exploding — YouTube alone has 500M+ Indian language users. But tooling for working with this content programmatically is fragmented:

  • Standard Whisper works well for English but struggles with Telugu, Kannada, or Odia
  • Fine-tuned models exist but are scattered across HuggingFace repos and obscure ZIP downloads
  • Downloading videos, extracting audio, transcribing, and translating requires stitching together 5+ different tools
  • Multi-speaker content (interviews, podcasts, meetings) loses all attribution without diarization

Indic Voice Pipeline solves this. One install, one natural language command, and Claude handles the entire pipeline — from URL to speaker-attributed translated text.

You:    "Download this Telugu interview and transcribe it with speaker labels"
Claude: [downloads] → [extracts audio] → [loads fine-tuned Telugu model] → [transcribes] → [diarizes speakers] → [saves .txt + .srt + .json]

What's Included

Skill 1: Video Downloader

Download videos, audio, and playlists from 1000+ websites using yt-dlp.

Feature Details
Sites YouTube, Vimeo, Twitter/X, TikTok, Instagram, Reddit, Twitch, and 1000+ more
Formats Video (mp4, webm), Audio-only (mp3, m4a, opus)
Quality 360p, 480p, 720p, 1080p, 4K
Extras Subtitle download, playlist support, format listing

Skill 2: Whisper Transcribe

Transcribe and translate audio/video using OpenAI Whisper + fine-tuned Indian language models, with optional speaker diarization.

Feature Details
Languages 99+ languages, with fine-tuned models for 12 Indian languages
Models Standard Whisper (tiny → large) + HuggingFace fine-tuned + AI4Bharat IndicWhisper
Output Plain text (.txt), Subtitles (.srt), Structured JSON (.json)
Hardware Apple Silicon (MPS), NVIDIA GPU (CUDA), or CPU
Accuracy Intelligent 25s chunking with 5s overlap — no words lost at boundaries
Diarization Optional speaker identification via pyannote-audio (--diarize)

Skill 3: Qwen3-ASR Engine

Alternative ASR engine using Alibaba's Qwen3-ASR-1.7B for broader multilingual coverage.

Feature Details
Languages Hindi + ~30 languages (Arabic, French, German, Japanese, Korean, Russian, Spanish, Thai, Turkish, Vietnamese, and more)
Model Qwen/Qwen3-ASR-1.7B from HuggingFace (~3.5 GB, downloaded and cached once)
Output Same .txt, .srt, .json formats — drop-in comparison with Whisper
Fallback Auto-falls back to Whisper for unsupported languages (Telugu, Kannada, etc.)
Diarization Full speaker diarization support (--diarize works with Qwen too)
License Apache 2.0

Supported Indian Languages

Language   Code  Model Source                        Base
Telugu     te    vasista22/whisper-telugu-large-v2   Whisper Large-v2 (1.5B)
Hindi      hi    vasista22/whisper-hindi-large-v2    Whisper Large-v2 (1.5B)
Kannada    kn    vasista22/whisper-kannada-medium    Whisper Medium (769M)
Gujarati   gu    vasista22/whisper-gujarati-medium   Whisper Medium (769M)
Tamil      ta    vasista22/whisper-tamil-medium      Whisper Medium (769M)
Bengali    bn    AI4Bharat IndicWhisper              Whisper Medium (769M)
Malayalam  ml    AI4Bharat IndicWhisper              Whisper Medium (769M)
Marathi    mr    AI4Bharat IndicWhisper              Whisper Medium (769M)
Odia       or    AI4Bharat IndicWhisper              Whisper Medium (769M)
Punjabi    pa    AI4Bharat IndicWhisper              Whisper Medium (769M)
Sanskrit   sa    AI4Bharat IndicWhisper              Whisper Medium (769M)
Urdu       ur    AI4Bharat IndicWhisper              Whisper Medium (769M)

Model sources:

  • vasista22 — IIT Madras Speech Lab, funded by Bhashini / MeitY
  • AI4Bharat IndicWhisper — IIT Madras, trained on 10,700+ hours across 12 languages, MIT licensed

Quick Start

Prerequisites

Install

One-line install:

git clone https://github.com/humancto/indic-voice-pipeline.git
cd indic-voice-pipeline && bash install.sh

Manual install:

# Install Python dependencies
pip install -r requirements.txt

# (Optional) Install speaker diarization support
pip install pyannote.audio

# Copy skills to Claude Code
cp -r skills/video-downloader ~/.claude/skills/
cp -r skills/whisper-transcribe ~/.claude/skills/

Note: If you skip pip install pyannote.audio, everything else works normally. Diarization is the only feature that requires it. The pipeline degrades gracefully — transcription always works regardless.

Uninstall

bash uninstall.sh

Usage

Once installed, just talk to Claude naturally. The skills are triggered automatically based on your intent.

Quick Reference — Top Flags

Flag What it does Default Example
--language Set source language (skips auto-detection, loads best model) Auto-detect --language te
--model Whisper model size: tiny, base, small, medium, large large --model base
--engine ASR engine: whisper or qwen whisper --engine qwen
--diarize Enable speaker diarization (who spoke when) Off --diarize
--num-speakers Exact speaker count (improves diarization accuracy) Auto --num-speakers 2
--hf-token HuggingFace token for diarization $HF_TOKEN env var --hf-token hf_...
--output-dir Where to save output files ~/Downloads --output-dir ./out
--hf-model Override with any HuggingFace Whisper model Auto-selected --hf-model vasista22/whisper-telugu-large-v2

Diarization note: --diarize is off by default. When enabled, it requires a HuggingFace token (via --hf-token or $HF_TOKEN). If no token is found, the pipeline prints a helpful message and continues transcription without speaker labels — it never fails.

Download a Video

"Download this video: https://youtube.com/watch?v=..."
"Download audio only from this URL as MP3"
"Download this entire playlist"
"What formats are available for this video?"

Transcribe Audio / Video

"Transcribe ~/Downloads/speech.mp4"
"Transcribe this Telugu video --language te"
"Transcribe ~/Downloads/podcast.mp3 --model medium"

Transcribe with Speaker Diarization

"Transcribe ~/Downloads/interview.mp4 --diarize"
"Transcribe this Hindi podcast with speaker labels --language hi --diarize"
"Transcribe meeting.wav --diarize --num-speakers 3"

Transcribe with Qwen3-ASR Engine

"Transcribe ~/Downloads/speech.mp3 --engine qwen"
"Transcribe this Hindi video --engine qwen --language hi"
"Transcribe meeting.wav --engine qwen --diarize"

Translate to English

"Translate this Hindi audio to English"
"Translate ~/Downloads/telugu_speech.mp4 --language te"

Detect Language

"What language is this audio file?"
"Detect the language of ~/Downloads/unknown_speech.wav"

Full Pipeline (Download + Transcribe)

"Download this Telugu YouTube video and transcribe it"
"Download https://youtube.com/shorts/abc123 and translate to English"
"Download this podcast and transcribe with speaker labels --diarize"

Speaker Diarization

Speaker diarization answers the question: "Who spoke when?" It labels each segment of the transcript with a speaker identity — [Speaker 1], [Speaker 2], etc.

How It Works

The pipeline uses pyannote/speaker-diarization-3.1, the gold standard for neural speaker diarization. Diarization runs as a post-processing step after transcription, so it works with both the Whisper and Qwen3-ASR engines.
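Because diarization runs after transcription, the two outputs must be merged: each transcript segment gets the label of the speaker whose diarization turns overlap it the most. A minimal sketch of that alignment (function and field names here are illustrative, not the script's actual API):

```python
def assign_speakers(transcript_segments, speaker_turns):
    """Label each transcript segment with the speaker whose diarization
    turns overlap it the most (illustrative sketch, not the real script)."""
    labeled = []
    for seg in transcript_segments:
        overlaps = {}
        for turn in speaker_turns:
            # Overlap between [seg.start, seg.end] and [turn.start, turn.end]
            ov = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if ov > 0:
                overlaps[turn["speaker"]] = overlaps.get(turn["speaker"], 0) + ov
        speaker = max(overlaps, key=overlaps.get) if overlaps else "Unknown"
        labeled.append({**seg, "speaker": speaker})
    return labeled

segments = [{"start": 0.0, "end": 5.3, "text": "..."},
            {"start": 5.3, "end": 8.1, "text": "..."}]
turns = [{"start": 0.0, "end": 5.0, "speaker": "Speaker 1"},
         {"start": 5.0, "end": 8.5, "speaker": "Speaker 2"}]
print([s["speaker"] for s in assign_speakers(segments, turns)])
# → ['Speaker 1', 'Speaker 2']
```

Max-overlap assignment is what makes diarization engine-agnostic: it only needs segment timestamps, which both Whisper and Qwen3-ASR produce.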

Setup (One-Time)

Step 1: Install pyannote.audio

pip install pyannote.audio

Step 2: Create a free HuggingFace token (read-only access is sufficient)

Go to huggingface.co/settings/tokens and create a token.

Step 3: Accept the pyannote model licenses

You must accept both gated model licenses on HuggingFace (free, instant approval):

  1. Speaker Diarization 3.1: huggingface.co/pyannote/speaker-diarization-3.1 → Click "Agree and access repository"
  2. Segmentation 3.0: huggingface.co/pyannote/segmentation-3.0 → Click "Agree and access repository"

Important: Both licenses are required. The diarization pipeline internally depends on the segmentation model. If you only accept the first one, you'll get a 403 Forbidden error.

Step 4: Provide your token (choose one method)

# Option A: Pass directly as a flag
--hf-token hf_...

# Option B: Set as an environment variable
export HF_TOKEN="hf_..."

The token is required on every run, not just for the initial model download; pyannote uses it to authenticate each time the pipeline loads. The recommended approach is to set it permanently in your shell profile:

echo 'export HF_TOKEN="hf_your_token_here"' >> ~/.zshrc
source ~/.zshrc

The pyannote models (~300 MB total) are downloaded once on first use and cached permanently at ~/.cache/torch/pyannote/.
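The precedence between the two token methods (flag wins over environment variable) can be sketched as a small helper; `resolve_hf_token` is a hypothetical name illustrating the documented lookup order, not the script's exact code:

```python
import os

def resolve_hf_token(cli_token=None):
    """Return the HuggingFace token to use: the --hf-token flag takes
    precedence, then the HF_TOKEN environment variable, else None.
    (Hypothetical helper illustrating the documented precedence.)"""
    if cli_token:
        return cli_token
    return os.environ.get("HF_TOKEN")

print(resolve_hf_token("hf_from_flag"))  # flag takes precedence over the env var
```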

Diarization Flags

Flag Description Example
--diarize Enable speaker diarization --diarize
--hf-token HuggingFace token for pyannote model access --hf-token hf_abc123
--num-speakers Exact number of speakers (if known) --num-speakers 2
--min-speakers Minimum expected speakers --min-speakers 2
--max-speakers Maximum expected speakers --max-speakers 5
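pyannote's diarization pipeline accepts `num_speakers`, `min_speakers`, and `max_speakers` keyword arguments, and the speaker-count flags map onto them directly. A sketch of that mapping (an assumption about how the script wires flags to pyannote, not its verbatim code):

```python
def diarization_kwargs(num_speakers=None, min_speakers=None, max_speakers=None):
    """Map the CLI speaker-count flags onto the keyword arguments that
    pyannote's diarization pipeline accepts (illustrative sketch)."""
    kwargs = {}
    if num_speakers is not None:
        kwargs["num_speakers"] = num_speakers      # exact count, if known
    else:
        if min_speakers is not None:
            kwargs["min_speakers"] = min_speakers  # lower bound
        if max_speakers is not None:
            kwargs["max_speakers"] = max_speakers  # upper bound
    return kwargs

print(diarization_kwargs(num_speakers=2))
# → {'num_speakers': 2}
```

Passing an exact `--num-speakers` when you know the count generally improves accuracy, since the pipeline no longer has to estimate it.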

Diarized Output Examples

Plain Text (.txt) with speaker labels:

[Speaker 1] నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.
[Speaker 2] అవును, చాలా మంచి విషయం. ప్రారంభిద్దాం.
[Speaker 1] శ్రీరాముడు అయోధ్యకు తిరిగి వచ్చిన రోజు...

SRT Subtitles (.srt) with speaker labels:

1
00:00:00,000 --> 00:00:05,320
[Speaker 1] నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.

2
00:00:05,320 --> 00:00:08,100
[Speaker 2] అవును, చాలా మంచి విషయం. ప్రారంభిద్దాం.

Structured JSON (.json) with speaker metadata:

{
  "file": "/Users/you/Downloads/interview.mp4",
  "language": "te",
  "task": "transcribe",
  "model": "vasista22/whisper-telugu-large-v2",
  "diarization": true,
  "speakers_found": 2,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 5.32,
      "speaker": "Speaker 1",
      "text": "నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం."
    }
  ]
}
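The JSON output is straightforward to post-process. For example, grouping everything each speaker said (using a shortened, transliterated sample of the structure above):

```python
import json

diarized = json.loads("""{
  "speakers_found": 2,
  "segments": [
    {"start": 0.0, "end": 5.32, "speaker": "Speaker 1", "text": "Namaskaram."},
    {"start": 5.32, "end": 8.1, "speaker": "Speaker 2", "text": "Avunu."}
  ]
}""")

# Collect each speaker's lines in order of appearance
by_speaker = {}
for seg in diarized["segments"]:
    by_speaker.setdefault(seg["speaker"], []).append(seg["text"])

for speaker, lines in by_speaker.items():
    print(speaker, "->", " ".join(lines))
```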

Hardware Notes for Diarization

Platform Diarization Device Notes
Apple Silicon CPU MPS is not supported by pyannote; CPU is used automatically
NVIDIA GPU CUDA Full GPU acceleration
CPU-only CPU Works, but slower on long audio files

Graceful Degradation

If pyannote.audio is not installed and you pass --diarize, the pipeline prints a warning and proceeds with transcription only: no crash, no error. Speaker labels are simply absent from the output. Install pyannote when you need diarization; skip it if you never do.
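This kind of graceful degradation is typically implemented with an optional-import check; a minimal sketch of the pattern (the function name is illustrative):

```python
def diarization_available():
    """True if pyannote.audio can be imported. A pipeline would check this
    before honoring --diarize and warn instead of crashing when it is absent
    (illustrative sketch of the optional-dependency pattern)."""
    try:
        import pyannote.audio  # noqa: F401
        return True
    except ImportError:
        return False

if not diarization_available():
    print("Warning: pyannote.audio not installed; continuing without speaker labels")
```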


Qwen3-ASR Engine

Qwen3-ASR-1.7B is Alibaba's open-source automatic speech recognition model, offering strong multilingual performance as an alternative to Whisper.

When to Use Qwen3-ASR

Scenario Recommended Engine
Telugu, Kannada, Odia, or other Indian languages with fine-tuned Whisper models Whisper (--engine whisper, default)
Hindi with desire for a second opinion or comparison Either — try both
Arabic, French, German, Japanese, Korean, Russian, Spanish, Thai, Turkish, Vietnamese Qwen (--engine qwen)
Benchmarking ASR engines against each other Both — run once with each engine

Supported Languages (Qwen3-ASR)

Hindi, Arabic, Cantonese, Dutch, English, French, German, Indonesian, Italian, Japanese, Korean, Malay, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian, Vietnamese, and more (~30 total).

Auto-fallback: If you request a language that Qwen3-ASR does not support (e.g., Telugu, Kannada), the pipeline automatically falls back to Whisper. No manual intervention needed.

Usage

# Transcribe with Qwen3-ASR
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe audio.mp3 --engine qwen

# Qwen + diarization
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe interview.mp4 --engine qwen --diarize --hf-token hf_...

# Qwen with specific language
python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe audio.mp3 --engine qwen --language hi

Model Details

Property Value
Model Qwen/Qwen3-ASR-1.7B
Parameters 1.7 billion
Size on disk ~3.5 GB (downloaded once, cached permanently)
License Apache 2.0
Source HuggingFace

Use Cases

Content Creators & Media

  • Subtitle generation — Auto-generate .srt subtitle files for YouTube videos in any Indian language
  • Content repurposing — Download a Telugu podcast, transcribe it, translate to English, create blog posts
  • Multi-language publishing — Transcribe Hindi content, translate to English for wider reach

Education & Research

  • Lecture transcription — Transcribe university lectures in Tamil, Kannada, or Hindi
  • Oral history preservation — Digitize oral traditions in Sanskrit, Odia, or Punjabi
  • Linguistic research — Analyze speech patterns across 12 Indian languages locally

Interviews & Podcasts

  • Interview transcription — Transcribe interviews with --diarize to get speaker-attributed text. Know exactly who said what.
  • Podcast transcription — Multi-speaker podcasts with automatic speaker attribution. Each host and guest is labeled.
  • Meeting notes — Transcribe team meetings with speaker identification. Instantly generate attributed minutes.

Journalism & Documentation

  • Interview processing — Download and transcribe interviews from any platform with full speaker attribution
  • Evidence documentation — Transcribe audio/video evidence with timestamps and speaker labels
  • Accessibility — Generate subtitles for hearing-impaired audiences

Personal & Productivity

  • Voice notes — Convert voice memos in your native language to searchable text
  • Religious content — Transcribe pravachans, kirtans, and spiritual discourses
  • Family archives — Digitize family recordings in regional languages

Developers & AI Engineers

  • Training data generation — Create parallel corpora (audio + text) for ML models
  • ASR benchmarking — Compare transcription quality across Whisper and Qwen3-ASR engines
  • Pipeline automation — Build downstream NLP workflows on top of transcriptions

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        Claude Code (CLI)                             │
│                                                                      │
│  "Download and transcribe this Telugu interview with speaker labels"  │
│           │                                                          │
│           ▼                                                          │
│  ┌─────────────────┐     ┌──────────────────────────────────────┐   │
│  │ video-downloader │     │         whisper-transcribe            │   │
│  │                  │     │                                       │   │
│  │  yt-dlp engine   │────▶│  Audio Extraction (ffmpeg, 16kHz)    │   │
│  │  1000+ sites     │     │         │                            │   │
│  │  Any quality     │     │         ▼                            │   │
│  │  Audio extract   │     │  Engine Selection                    │   │
│  └─────────────────┘     │  ┌───────┴────────┐                  │   │
│                           │  │                │                  │   │
│                           │  ▼                ▼                  │   │
│                           │ Whisper          Qwen3-ASR           │   │
│                           │ (default)        (--engine qwen)     │   │
│                           │  │                │                  │   │
│                           │  ▼                ▼                  │   │
│                           │  Language Detection                  │   │
│                           │         │                            │   │
│                           │         ▼                            │   │
│                           │  Model Router                        │   │
│                           │  ┌──────┴──────┐                    │   │
│                           │  │             │                    │   │
│                           │  ▼             ▼                    │   │
│                           │ vasista22    IndicWhisper / Qwen    │   │
│                           │ (HuggingFace) (ZIP / HF cache)     │   │
│                           │  │             │                    │   │
│                           │  └──────┬──────┘                    │   │
│                           │         │                            │   │
│                           │         ▼                            │   │
│                           │  Chunked Inference                   │   │
│                           │  25s windows, 5s overlap             │   │
│                           │         │                            │   │
│                           │         ▼                            │   │
│                           │  Smart Merge                         │   │
│                           │  (no words lost)                     │   │
│                           │         │                            │   │
│                           │         ▼                            │   │
│                           │  ┌─────────────────────────┐        │   │
│                           │  │  Speaker Diarization     │        │   │
│                           │  │  (pyannote, optional)    │        │   │
│                           │  │  --diarize flag          │        │   │
│                           │  └─────────────────────────┘        │   │
│                           │         │                            │   │
│                           │         ▼                            │   │
│                           │  .txt  .srt  .json                   │   │
│                           │  (with speaker labels if diarized)   │   │
│                           └──────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────────┘

Pipeline Flow (Simplified)

URL → yt-dlp Download → Audio Extraction (ffmpeg) → Engine Selection (Whisper / Qwen)
  → Language Detection → Model Router → Chunked Inference → Smart Merge
  → [Optional: Speaker Diarization (pyannote)] → .txt  .srt  .json

How Transcription Works

  1. Audio extraction — Video files have audio extracted via ffmpeg (16kHz mono WAV)
  2. Engine selection — Whisper (default) or Qwen3-ASR (--engine qwen)
  3. Model selection — Language code maps to the best available fine-tuned model
  4. Chunked inference — Long audio is split into 25-second chunks with 5-second overlaps
  5. Smart merge — Overlapping regions are deduplicated using a 3-tier algorithm:
    • Exact suffix-prefix word matching
    • Fuzzy anchor word matching
    • Heuristic proportional skip
  6. Speaker diarization (optional) — If --diarize is set, pyannote identifies speakers and labels each segment
  7. Multi-format output — Results saved as plain text, SRT subtitles, and structured JSON
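Steps 4 and 5 can be sketched in a few lines. This shows the window arithmetic and the first (exact suffix-prefix) merge tier only; the real script also has fuzzy and heuristic tiers, and these helper names are illustrative:

```python
def chunk_windows(duration, window=25.0, overlap=5.0):
    """Yield (start, end) windows: 25 s long, each starting 20 s after the
    previous one, so consecutive windows share 5 s of audio."""
    step = window - overlap
    start = 0.0
    while start < duration:
        yield (start, min(start + window, duration))
        start += step

def merge_exact(prev_words, next_words, max_overlap=12):
    """Tier 1: drop the longest suffix of the previous chunk that exactly
    matches a prefix of the next chunk (words duplicated by the overlap)."""
    for n in range(min(max_overlap, len(prev_words), len(next_words)), 0, -1):
        if prev_words[-n:] == next_words[:n]:
            return prev_words + next_words[n:]
    return prev_words + next_words

print(list(chunk_windows(60)))
# → [(0.0, 25.0), (20.0, 45.0), (40.0, 60.0)]
print(merge_exact("a b c d".split(), "c d e f".split()))
# → ['a', 'b', 'c', 'd', 'e', 'f']
```

The overlap is what prevents words from being cut at chunk boundaries; the merge tiers then remove the duplicates the overlap introduces.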

Model Priority

User specifies --engine qwen?  → Use Qwen3-ASR-1.7B
                                  → Auto-fallback to Whisper if language unsupported

User specifies --hf-model?     → Use that exact Whisper model
User specifies --language?     → Check vasista22 (HuggingFace) first
                                → Fall back to IndicWhisper (ZIP download)
Neither specified?             → Use standard Whisper with auto-detection
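The priority chain above can be sketched as a routing function. The language sets below are illustrative subsets and the return strings are placeholders, not the script's actual identifiers:

```python
QWEN_LANGS = {"hi", "ar", "fr", "de", "ja", "ko", "ru", "es", "th", "tr", "vi"}  # subset, illustrative
VASISTA22 = {"te", "hi", "kn", "gu", "ta"}                # vasista22 fine-tuned models
INDICWHISPER = {"bn", "ml", "mr", "or", "pa", "sa", "ur"} # AI4Bharat IndicWhisper

def pick_model(engine="whisper", hf_model=None, language=None):
    """Illustrative sketch of the documented priority order."""
    if engine == "qwen":
        if language is None or language in QWEN_LANGS:
            return "Qwen/Qwen3-ASR-1.7B"
        engine = "whisper"  # auto-fallback for languages Qwen doesn't support
    if hf_model:
        return hf_model  # exact HuggingFace model requested
    if language in VASISTA22:
        return f"vasista22 model for '{language}'"
    if language in INDICWHISPER:
        return f"IndicWhisper model for '{language}'"
    return "standard whisper (auto-detect)"

print(pick_model(engine="qwen", language="te"))
# → vasista22 model for 'te'   (Qwen unsupported, fell back to Whisper)
```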

Advanced Usage

Standalone Scripts (Without Claude Code)

The scripts work independently — you don't need Claude Code to use them.

Download a video:

python3 skills/video-downloader/scripts/video_downloader.py download "https://youtube.com/watch?v=..." --output-dir ~/Downloads

Transcribe with Telugu model:

python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe ~/Downloads/video.mp4 --language te --output-dir ~/Downloads

Transcribe with speaker diarization:

python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe ~/Downloads/interview.mp4 \
    --language te --diarize --hf-token hf_... --output-dir ~/Downloads

Transcribe with Qwen3-ASR engine:

python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe ~/Downloads/audio.mp3 \
    --engine qwen --language hi --output-dir ~/Downloads

Transcribe with Qwen3-ASR + diarization:

python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe ~/Downloads/meeting.mp4 \
    --engine qwen --diarize --num-speakers 3 --hf-token hf_... --output-dir ~/Downloads

Translate to English:

python3 skills/whisper-transcribe/scripts/whisper_transcribe.py translate ~/Downloads/audio.mp3 --language hi --output-dir ~/Downloads

Detect language:

python3 skills/whisper-transcribe/scripts/whisper_transcribe.py detect ~/Downloads/audio.mp3

List video formats:

python3 skills/video-downloader/scripts/video_downloader.py formats "https://youtube.com/watch?v=..."

Custom HuggingFace Models

Use any Whisper fine-tuned model from HuggingFace:

python3 skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe audio.mp3 \
    --hf-model "openai/whisper-large-v3"

Model Sizes

Model VRAM Speed Accuracy Best For
tiny ~1 GB Fastest Low Quick drafts, clear speech
base ~1 GB Fast Good Default — good balance
small ~2 GB Moderate Better Noisy audio, accented speech
medium ~5 GB Slow Great Non-English, complex audio
large ~10 GB Slowest Best Maximum accuracy, rare languages

Engine Comparison

Property Whisper Qwen3-ASR-1.7B
Indian language models 12 fine-tuned (vasista22 + IndicWhisper) Hindi only (others auto-fallback)
Global languages 99+ ~30
Model sizes tiny (39M) to large (1.5B) 1.7B (single model)
Disk space Varies by model size ~3.5 GB
License MIT (Whisper) / varies (fine-tuned) Apache 2.0
Diarization support Yes Yes

Output Formats

Every transcription produces three files:

Plain Text (.txt)

నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.

With --diarize:

[Speaker 1] నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.
[Speaker 2] అవును, చాలా మంచి విషయం.

SRT Subtitles (.srt)

1
00:00:00,000 --> 00:00:05,320
నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.

2
00:00:05,320 --> 00:00:12,800
శ్రీరాముడు అయోధ్యకు తిరిగి వచ్చిన రోజు...

With --diarize:

1
00:00:00,000 --> 00:00:05,320
[Speaker 1] నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం.

2
00:00:05,320 --> 00:00:08,100
[Speaker 2] అవును, చాలా మంచి విషయం. ప్రారంభిద్దాం.
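The timestamps above use SRT's HH:MM:SS,mmm convention (a comma, not a period, before the milliseconds). A small sketch of converting a segment's float seconds into that form (not the script's actual function):

```python
def srt_timestamp(seconds):
    """Format a float number of seconds in SRT's HH:MM:SS,mmm style."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(5.32))  # → 00:00:05,320
```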

Structured JSON (.json)

{
  "file": "/Users/you/Downloads/video.mp4",
  "language": "te",
  "task": "transcribe",
  "model": "vasista22/whisper-telugu-large-v2",
  "text": "నమస్కారం. ఈ రోజు మనం...",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 5.32,
      "start_ts": "00:00:00.000",
      "end_ts": "00:00:05.320",
      "text": "నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం."
    }
  ]
}

With --diarize:

{
  "file": "/Users/you/Downloads/interview.mp4",
  "language": "te",
  "task": "transcribe",
  "model": "vasista22/whisper-telugu-large-v2",
  "diarization": true,
  "speakers_found": 2,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 5.32,
      "speaker": "Speaker 1",
      "start_ts": "00:00:00.000",
      "end_ts": "00:00:05.320",
      "text": "నమస్కారం. ఈ రోజు మనం రామాయణం గురించి మాట్లాడుకుందాం."
    }
  ]
}

File Structure

indic-voice-pipeline/
├── README.md                              # This file
├── LICENSE                                # MIT License
├── install.sh                             # One-command installer
├── uninstall.sh                           # Clean uninstaller
├── requirements.txt                       # Python dependencies
├── skills/
│   ├── video-downloader/
│   │   ├── SKILL.md                       # Claude Code skill definition
│   │   └── scripts/
│   │       ├── check_deps.py              # Dependency checker
│   │       └── video_downloader.py        # Download engine (yt-dlp wrapper)
│   └── whisper-transcribe/
│       ├── SKILL.md                       # Claude Code skill definition
│       └── scripts/
│           ├── check_deps.py              # Dependency checker
│           └── whisper_transcribe.py      # Transcription engine (Whisper + Qwen + Diarization)
├── docs/
│   ├── MODELS.md                          # Detailed model documentation
│   └── TROUBLESHOOTING.md                 # Common issues and fixes
└── examples/
    └── pipeline_example.sh                # Example end-to-end pipeline

Troubleshooting

See docs/TROUBLESHOOTING.md for detailed solutions. Quick fixes:

Problem Solution
torch_dtype is deprecated Ignore — cosmetic warning, doesn't affect output
transformers requires torch >= 2.6 Pin transformers: pip install transformers==4.46.3
numpy >= 2.0 incompatible Downgrade: pip install "numpy<2"
suppress_tokens index error Already fixed in the script — update to latest version
IndicWhisper download fails Check internet connection, retry, or download ZIP manually
MPS out of memory Use a smaller model: --model small or --model base
pyannote not installed warning pip install pyannote.audio — only needed for --diarize
pyannote license not accepted / 403 Forbidden Accept both licenses: speaker-diarization-3.1 AND segmentation-3.0
HF_TOKEN not set for diarization export HF_TOKEN="hf_..." or pass --hf-token hf_...
Qwen unsupported language fallback Expected behavior — pipeline auto-switches to Whisper

Credits & Acknowledgments

This project stands on the shoulders of remarkable open-source work:

  • OpenAI Whisper (speech recognition)
  • vasista22 fine-tuned models (IIT Madras Speech Lab, funded by Bhashini / MeitY)
  • AI4Bharat IndicWhisper (IIT Madras)
  • Alibaba Qwen3-ASR
  • pyannote-audio (speaker diarization)
  • yt-dlp and ffmpeg (download and audio extraction)


Contributing

Contributions welcome! Areas where help is needed:

  • Real-time streaming — Support live audio transcription
  • Batch processing — Process entire folders of audio/video files
  • Quality benchmarks — WER comparisons across models, languages, and engines
  • New language models — Add fine-tuned Whisper models for more Indian languages
  • Translation pipeline — Integrate translation with diarized output

License

MIT License. See LICENSE for details.

The fine-tuned models and dependencies have their own licenses:

  • vasista22 models: Check individual model cards on HuggingFace
  • AI4Bharat IndicWhisper: MIT License
  • Qwen3-ASR-1.7B: Apache 2.0 License
  • pyannote-audio: MIT License (model weights require license acceptance on HuggingFace)

Built with care for Indian languages.
By HumanCTO
