A local video processing tool that combines GPU-accelerated audio transcription (using MLX Whisper for Apple Silicon) with visual analysis (using Ollama vision models) and advanced correction features. Process videos entirely on your local machine with no cloud dependencies.
⚡ GPU-Accelerated: Optimized for Apple Silicon (M1/M2/M3) with MLX - process 1-hour videos in ~25 minutes with medium model quality!
- GPU-Accelerated Whisper: MLX-optimized for Apple Silicon (3-4x faster than CPU)
- Speaker Diarization: Automatic "who spoke when" detection with MPS acceleration
- Multiple Output Formats: JSON, TXT, SRT subtitles
- 🎯 Speaker Identification: Train the system to recognize specific voices
- 📚 Custom Vocabulary: Add domain-specific terms for better recognition
- 📝 Dictionary Corrections: Automatic pattern-based error fixes
- 🤖 LLM Corrections: Context-aware corrections via local Ollama
- 🎓 Learning System: Learn corrections from your manual edits
- Frame-by-Frame Analysis: Using Ollama vision models
- Combined Summaries: Integrate audio and visual insights
- Flexible Sampling: Configure FPS and frame limits
- Python 3.10+
- Ollama (for visual analysis and LLM corrections):
  # Install Ollama
  curl -fsSL https://ollama.com/install.sh | sh
  # Pull models
  ollama pull llama3.2-vision  # For visual analysis
  ollama pull llama3.2         # For LLM corrections
- FFmpeg (for audio extraction):
  # macOS
  brew install ffmpeg
  # Ubuntu/Debian
  sudo apt-get install ffmpeg
- HuggingFace Token (for speaker diarization):
  - Create account at https://huggingface.co
  - Get token from https://huggingface.co/settings/tokens
  - Accept pyannote model conditions at https://huggingface.co/pyannote/speaker-diarization-3.1
# Clone repository
git clone https://github.com/alexanderv-sifi/video-transcription-analysis.git
cd video-transcription-analysis
# Install with uv (recommended) or pip
uv pip install -e .
# OR
pip install -e .
# Set up environment
cp .env.example .env
# Edit .env and add your HF_TOKEN

# Simple transcription with speaker diarization
python examples/transcribe_only.py /path/to/video.mp4

Output:
SPEAKER_00: Welcome to the meeting. Let's discuss the AdsWizz platform.
SPEAKER_01: Thanks. I have questions about TLS 1.2 requirements...
Train the system to recognize specific speakers by voice.
# Enroll a speaker with 3-5 audio samples (10-30 seconds each)
python examples/enroll_speaker.py enroll "Alexander" \
sample1.wav sample2.wav sample3.wav
# List enrolled speakers
python examples/enroll_speaker.py list
# View database stats
python examples/enroll_speaker.py stats

# Transcribe with speaker identification
python examples/transcribe_only.py video.mp4 --speaker-db speakers.json

Output:
Alexander: Welcome to the meeting. Let's discuss the AdsWizz platform.
Sarah: Thanks. I have questions about TLS 1.2 requirements...
- Diarization: First detects who spoke when (SPEAKER_00, SPEAKER_01, etc.)
- Identification: Matches each speaker against enrolled voice profiles
- Replacement: Replaces generic labels with actual names
Technology: Uses SpeechBrain ECAPA-TDNN embeddings (1.71% EER on VoxCeleb) for state-of-the-art speaker recognition.
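For illustration, here is a minimal sketch of the matching step, assuming enrolled profiles are stored as averaged ECAPA-TDNN embeddings. The variable names, threshold value, and helper functions are illustrative, not this tool's actual API.

```python
# Sketch: compare a diarized speaker's embedding against enrolled profiles.
# Assumes speechbrain >= 1.0 (older versions import from speechbrain.pretrained).
import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(wav_path: str) -> torch.Tensor:
    signal, _sr = torchaudio.load(wav_path)
    return encoder.encode_batch(signal).squeeze()  # ECAPA-TDNN embedding vector

# Enrolled profiles: name -> (averaged) embedding from 3-5 samples
enrolled = {"Alexander": embed("sample1.wav")}

def identify(segment_wav: str, threshold: float = 0.25) -> str:
    emb = embed(segment_wav)
    scores = {name: torch.nn.functional.cosine_similarity(emb, ref, dim=0).item()
              for name, ref in enrolled.items()}
    best = max(scores, key=scores.get)
    # Keep the generic diarization label if no profile is similar enough
    return best if scores[best] >= threshold else "SPEAKER_00"
```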
Improve recognition of domain-specific terms, company names, and technical jargon.
# Add custom terms
python examples/manage_vocabulary.py add "AdsWizz" --category companies
python examples/manage_vocabulary.py add "SFTP" --category technical_terms
python examples/manage_vocabulary.py add "TLS 1.2" --category technical_terms
# List all vocabulary
python examples/manage_vocabulary.py list
# Preview the prompt that will be built
python examples/manage_vocabulary.py preview

# Transcribe with custom vocabulary
python examples/transcribe_only.py video.mp4 --vocabulary vocabulary.yaml

How It Works: Vocabulary terms are used to build an initial_prompt for Whisper, which significantly improves recognition of specialized terminology.
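As a rough sketch of that mechanism (the vocabulary.yaml layout, prompt wording, and model repo name here are assumptions for illustration, not necessarily what this tool uses):

```python
# Sketch: turn vocabulary terms into an initial_prompt for Whisper decoding.
import mlx_whisper
import yaml

with open("vocabulary.yaml") as f:
    # e.g. {"companies": ["AdsWizz"], "technical_terms": ["SFTP", "TLS 1.2"]}
    vocab = yaml.safe_load(f)

terms = [term for category in vocab.values() for term in category]
initial_prompt = "Glossary: " + ", ".join(terms) + "."

result = mlx_whisper.transcribe(
    "audio.wav",
    path_or_hf_repo="mlx-community/whisper-medium-mlx",
    initial_prompt=initial_prompt,  # biases decoding toward the listed terms
)
print(result["text"])
```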
Apply automatic pattern-based corrections to fix common transcription errors.
# Add correction rules
python examples/manage_corrections.py add "as was" "AdsWizz"
python examples/manage_corrections.py add "Ed wiz" "AdsWizz"
python examples/manage_corrections.py add "TLS 1\.2" "TLS 1.2"
# List all rules
python examples/manage_corrections.py list
# Test corrections
python examples/manage_corrections.py test "We use as was for the Ed wiz platform"
# Output: "We use AdsWizz for the AdsWizz platform"

# Transcribe with dictionary corrections
python examples/transcribe_only.py video.mp4 --corrections corrections.yaml

How It Works: After transcription, text is processed through regex-based pattern matching with word boundaries to avoid partial replacements.
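A minimal sketch of that word-boundary matching, using the rules from the example above (the real tool loads its rules from corrections.yaml):

```python
# Sketch: regex replacements with \b word boundaries to avoid partial matches.
import re

rules = {"as was": "AdsWizz", "Ed wiz": "AdsWizz"}

def apply_corrections(text: str) -> str:
    for pattern, replacement in rules.items():
        # \b keeps "as was" from matching inside longer words
        text = re.sub(rf"\b{pattern}\b", replacement, text, flags=re.IGNORECASE)
    return text

print(apply_corrections("We use as was for the Ed wiz platform"))
# -> "We use AdsWizz for the AdsWizz platform"
```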
Use a local LLM (via Ollama) for context-aware corrections that simple patterns can't handle.
# Make sure Ollama is running with llama3.2
ollama pull llama3.2
# Transcribe with LLM corrections
python examples/transcribe_only.py video.mp4 \
--corrections corrections.yaml \
--enable-llm-corrections \
--domain-context "Ad tech platform discussion about security and APIs"

How It Works:
- Runs after dictionary corrections (tier 3)
- Chunks long transcripts at speaker boundaries
- Uses low temperature (0.1) for consistency
- Preserves speaker labels and structure
- Slow but powerful - adds 2-5 minutes per hour of transcript
Use When: You need context-aware corrections for ambiguous terms, grammar fixes, or domain-specific understanding.
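For a sense of what one chunk's correction pass might look like, here is a sketch using the ollama Python client; the prompt wording and chunking shown are illustrative simplifications of what the tool does.

```python
# Sketch: context-aware correction of one transcript chunk via local Ollama.
import ollama

def llm_correct(chunk: str, domain_context: str) -> str:
    prompt = (
        f"Context: {domain_context}\n"
        "Fix transcription errors in the text below. Keep the speaker labels "
        "and line structure exactly as they are. Return only the corrected text.\n\n"
        f"{chunk}"
    )
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.1},  # low temperature for consistent edits
    )
    return response["message"]["content"]

# Each chunk would be a run of lines ending at a speaker change
print(llm_correct("SPEAKER_00: We use as was for ad delivery.",
                  "Ad tech platform discussion about security and APIs"))
```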
Automatically learn corrections from your manual edits.
# 1. Transcribe a video
python examples/transcribe_only.py meeting.mp4
# 2. Manually edit the transcript (fix errors)
# output/meeting_transcript.txt -> output/meeting_edited.txt
# 3. Learn from your edits
python examples/learn_corrections.py \
output/meeting_transcript.txt \
output/meeting_edited.txt
# The tool will:
# - Analyze differences between original and edited
# - Suggest new vocabulary terms
# - Suggest new dictionary rules
# - Ask you to approve each suggestion

# Automatically accept all suggestions
python examples/learn_corrections.py \
original.txt edited.txt \
--auto-approve

How It Works:
- Uses word-level diffing to identify changes
- Filters noise (whitespace, capitalization)
- Groups similar patterns
- Calculates confidence scores
- Distinguishes vocabulary terms from correction rules
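A simplified sketch of the word-level diffing idea using Python's difflib; the actual filtering, grouping, and confidence scoring in learn_corrections.py are more involved.

```python
# Sketch: find recurring word-level changes between original and edited text.
from collections import Counter
from difflib import SequenceMatcher

def diff_corrections(original: str, edited: str) -> Counter:
    a, b = original.split(), edited.split()
    changes = Counter()
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "replace":
            before, after = " ".join(a[i1:i2]), " ".join(b[j1:j2])
            if before.lower() != after.lower():  # ignore pure capitalization noise
                changes[(before, after)] += 1
    return changes

# Pairs that recur across the transcript become candidate rules or vocabulary terms
found = diff_corrections("we use as was daily, as was is great",
                         "we use AdsWizz daily, AdsWizz is great")
for (wrong, right), count in found.items():
    print(f"{wrong!r} -> {right!r} (seen {count}x)")
```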
Here's a complete end-to-end workflow combining all features:
# 1. SETUP: Enroll your voice
python examples/enroll_speaker.py enroll "Alexander" \
voice1.wav voice2.wav voice3.wav
# 2. SETUP: Add company-specific terms
python examples/manage_vocabulary.py add "AdsWizz" --category companies
python examples/manage_vocabulary.py add "Simplifi" --category companies
# 3. INITIAL TRANSCRIPTION: Basic transcription
python examples/transcribe_only.py meeting.mp4 \
--speaker-db speakers.json \
--vocabulary vocabulary.yaml
# 4. CORRECTION: Manually fix remaining errors
# Edit: output/meeting_transcript.txt
# 5. LEARNING: Learn from your edits
python examples/learn_corrections.py \
output/meeting_transcript.txt \
output/meeting_edited.txt
# 6. FUTURE TRANSCRIPTIONS: Now fully optimized!
python examples/transcribe_only.py future_meeting.mp4 \
--speaker-db speakers.json \
--vocabulary vocabulary.yaml \
--corrections corrections.yaml

The system applies corrections in this order:
1. Vocabulary (initial_prompt) → Guides Whisper during transcription
↓
2. Dictionary Rules → Fast pattern-based fixes
↓
3. LLM Corrections (optional) → Context-aware fixes
↓
4. Manual Learning → Improve 1 & 2 from your edits
Philosophy: Start fast and deterministic, add intelligence only where needed.
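Conceptually, the tiers compose like this (a stubbed sketch with toy functions; see the earlier examples for the real mechanics of each tier):

```python
# Sketch of the tier ordering; each tier is reduced to a toy stand-in.
import re

def build_initial_prompt(terms):                 # tier 1: guides Whisper during decoding
    return "Glossary: " + ", ".join(terms) + "."

def apply_dictionary_rules(text, rules):         # tier 2: fast, deterministic regex fixes
    for pattern, replacement in rules.items():
        text = re.sub(rf"\b{pattern}\b", replacement, text, flags=re.IGNORECASE)
    return text

def llm_correct(text):                           # tier 3: optional, context-aware pass
    return text                                  # (see the Ollama sketch above)

prompt = build_initial_prompt(["AdsWizz", "TLS 1.2"])
raw = "SPEAKER_00: We use as was for delivery."  # stand-in for Whisper output
fixed = llm_correct(apply_dictionary_rules(raw, {"as was": "AdsWizz"}))
print(fixed)                                     # tier 4: manual edits later refine tiers 1 & 2
```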
Main transcription tool with all features.
# All features combined
python examples/transcribe_only.py video.mp4 \
--model medium \
--speaker-db speakers.json \
--vocabulary vocabulary.yaml \
--corrections corrections.yaml \
--enable-llm-corrections \
--domain-context "Technical discussion about ad tech"
# Without diarization
python examples/transcribe_only.py video.mp4 --no-diarization

Manage speaker voice profiles.
# Commands
enroll_speaker.py enroll <name> <audio1> <audio2> ...
enroll_speaker.py list
enroll_speaker.py remove <name>
enroll_speaker.py info <name>
enroll_speaker.py stats

Manage custom vocabulary.
# Commands
manage_vocabulary.py add <term> [--category <cat>]
manage_vocabulary.py remove <term>
manage_vocabulary.py list [--category <cat>]
manage_vocabulary.py clear --category <cat>
manage_vocabulary.py preview

Manage dictionary correction rules.
# Commands
manage_corrections.py add <pattern> <replacement> [--case-sensitive]
manage_corrections.py remove <pattern>
manage_corrections.py list
manage_corrections.py clear
manage_corrections.py test <text>

Learn from manual edits.
# Commands
learn_corrections.py <original.txt> <edited.txt> [--auto-approve] [--dry-run]

| Model | Size | Time (1h video) | WER | Quality | Recommended |
|---|---|---|---|---|---|
| tiny | 39M | ~8 min | 7.5% | Basic | |
| base | 74M | ~12 min | 5.0% | Good | Quick drafts |
| small | 244M | ~15 min | 3.4% | Better | ⭐ Balance |
| medium | 769M | ~25 min | 2.9% | Great | ✅ Default |
| large | 1550M | ~43 min | 3.0% | Same | ❌ Slower |
| turbo | 809M | ~20 min | ~3.0% | Great | Speed |
Note: Times for M1 Max with MLX. WER = Word Error Rate (lower is better).
DO NOT use the tiny model for production or real transcriptions! The tiny model has serious quality issues:
- Hallucination prone: Often produces repetitive output like "then then then then..."
- High error rate: 7.5% WER means ~1 in 13 words is wrong
- Poor technical terms: Struggles with domain-specific vocabulary
- Unreliable output: May produce unusable transcripts
The tiny model is ONLY suitable for:
- Quick system testing
- Verifying your installation works
- Development/debugging
For actual transcriptions, use at minimum the small model (3.4% WER, ~15 min/hour).
--model medium \
--speaker-db speakers.json \
--vocabulary vocabulary.yaml \
--corrections corrections.yaml \
--enable-llm-corrections

--model small \
--corrections corrections.yaml
# Skip LLM corrections (saves 2-5 min/hour)

--model tiny
# ~8 minutes for 1-hour video

output/
├── video_transcript.txt # Plain text with speaker labels
├── video_transcript.srt # SRT subtitles
├── video_transcript.json # Detailed JSON with segments
├── video_analysis.json # Frame-by-frame (if using processor)
├── video_analysis.md # Frame-by-frame markdown
└── video_combined_summary.txt # Combined audio + visual
- Set HF_TOKEN in .env
- Accept conditions at https://huggingface.co/pyannote/speaker-diarization-3.1
- Check logs for authentication errors
- Provide 3-5 diverse audio samples per speaker
- Use 10-30 second samples with clear speech
- Vary recording conditions
- Avoid background noise
- Ensure Ollama is running: ollama list
- Check model is pulled: ollama pull llama3.2
- Verify Ollama URL (default: http://localhost:11434)
- Lower the --min-occurrences threshold
- Lower the --min-confidence threshold
- Ensure edits are consistent (fix the same error the same way)
src/video_transcription/
├── transcriber.py # Core transcription with all integrations
├── speaker_db.py # Speaker enrollment & identification
├── vocabulary.py # Custom vocabulary management
├── corrector.py # Dictionary-based corrections
├── llm_corrector.py # LLM-based corrections
├── learning.py # Learning from manual edits
├── analyzer.py # Visual frame analysis
└── processor.py # Combined audio + visual
examples/
├── transcribe_only.py # Main CLI tool
├── enroll_speaker.py # Speaker management
├── manage_vocabulary.py # Vocabulary management
├── manage_corrections.py # Corrections management
├── learn_corrections.py # Learning from edits
├── process_video.py # Combined processing
└── analyze_only.py # Visual analysis only
# Install with dev dependencies
uv pip install -e ".[dev]"
# Format code
ruff format .
# Lint
ruff check .
# Type check
mypy src/

MIT License
- MLX Whisper - GPU-accelerated transcription
- SpeechBrain - Speaker recognition
- pyannote.audio - Speaker diarization
- Ollama - Local LLM inference
- OpenAI Whisper - Original Whisper models
- FFmpeg - Media processing