A comprehensive Python-based platform for processing meeting recordings, transcripts, and subtitles with advanced speaker recognition capabilities using voice fingerprinting.
- Features
- System Architecture
- Requirements
- Installation
- Quick Start
- Core Components
- Usage Guide
- Technical Details
- API Reference
- Performance Optimization
- Troubleshooting
- Contributing
- License
- Multi-Source Input Support
- Panopto video URLs and IDs
- Audio/video files (MP4, MP3, WAV, M4A, etc.)
- Subtitle files (VTT, SRT, SBV)
- Excel transcripts
- Batch directory processing
- Advanced Speaker Recognition
- Voice fingerprinting using ECAPA-TDNN neural networks
- ChromaDB vector database for speaker embeddings
- Automatic speaker identification across meetings
- Confidence scoring and speaker profile management
- Intelligent Processing Pipeline
- WhisperX for speech-to-text transcription
- Speaker diarization with pyannote
- NLP-based timestamp refinement
- AI-powered meeting summarization (GPT-4)
- Rich Output Formats
- Time-synced transcripts with speaker labels
- Interactive HTML summaries with video links
- Color-coded Excel files with speaker analytics
- Detailed speaker participation metrics
┌─────────────────────────────────────────────────────────────┐
│ Input Layer │
├─────────────────────────────────────────────────────────────┤
│ Panopto URLs │ Media Files │ Subtitles │ Excel │ Directory │
└────────┬──────┴──────┬──────┴─────┬─────┴───┬───┴─────┬────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Input Detector & Router │
└─────────────────────────────────────────────────────────────┘
│
┌────────────────────┴────────────────────┐
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Media Processor │ │ Subtitle Processor │
│ - WhisperX │ │ - Format parsing │
│ - Diarization │ │ - Speaker mapping │
│ - Voice embedding │ │ - Time alignment │
└──────────┬───────────┘ └──────────┬───────────┘
│ │
└──────────────┬──────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ Speaker Recognition Engine │
│ - SpeechBrain ECAPA-TDNN │
│ - ChromaDB Vector Store │
│ - Similarity Matching │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Transcript Processor │
│ - Format conversion (VTT→TXT→XLSX) │
│ - NLP timestamp refinement │
│ - AI summarization (OpenAI GPT-4) │
│ - HTML/Markdown generation │
└─────────────────────────────────────────────────────────────┘
- Python 3.8+
- CUDA-capable GPU (recommended for WhisperX)
- 16GB+ RAM (32GB recommended for large files)
- 10GB+ free disk space for models
- OpenAI API Key: For meeting summarization
- Hugging Face Token: For speaker diarization (accept the gated-model terms for pyannote/speaker-diarization-3.1 on Hugging Face and create an access token)
git clone https://github.com/yourusername/transcript-processing-platform.git
cd transcript-processing-platform
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install main requirements
pip install -r requirements.txt
# Install ctranslate2 separately (specific version required)
pip install ctranslate2==4.5.0
cp .env.template .env
# Edit .env with your API keys and preferences
The following models will be downloaded automatically on first use:
- WhisperX large-v2 model (~3GB)
- Pyannote speaker diarization model (~200MB)
- SpeechBrain ECAPA-TDNN model (~100MB)
# Basic processing with speaker recognition
python cli.py meeting_recording.mp4
# View results
ls output/meeting_recording/
# List identified speakers
python speaker_cli.py list
# Rename generic labels
python speaker_cli.py rename "Speaker_1" "John Doe"
The input detector automatically identifies input types using pattern matching and file-extension analysis.
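For illustration, a minimal detection sketch (the function name, regular expression, and extension sets below are illustrative assumptions, not the platform's actual code):

import re
from pathlib import Path

# Hypothetical detector sketch; the real input detector may differ.
MEDIA_EXTS = {".mp4", ".mp3", ".wav", ".m4a"}
SUBTITLE_EXTS = {".vtt", ".srt", ".sbv"}
PANOPTO_RE = re.compile(r"panopto\.com/Panopto/Pages/Viewer\.aspx\?id=", re.IGNORECASE)

def detect_input_type(source: str) -> str:
    """Classify an input as panopto_url, directory, media, subtitle, or excel."""
    if PANOPTO_RE.search(source):
        return "panopto_url"
    path = Path(source)
    if path.is_dir():
        return "directory"
    ext = path.suffix.lower()
    if ext in MEDIA_EXTS:
        return "media"
    if ext in SUBTITLE_EXTS:
        return "subtitle"
    if ext in {".xlsx", ".xls"}:
        return "excel"
    raise ValueError(f"Unsupported input: {source}")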
The pipeline processor orchestrates the entire workflow, routing each input to the appropriate processors.
- Integrates WhisperX for transcription
- Performs speaker diarization
- Extracts voice embeddings for each speaker segment
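For orientation, here is a condensed sketch of the WhisperX flow the media processor builds on, using the library's public calls rather than the platform's wrappers (exact signatures can vary between WhisperX versions):

import whisperx

device = "cuda"              # or "cpu"
hf_token = "YOUR_HF_TOKEN"   # required for the gated pyannote diarization model

# 1. Transcribe with the large-v2 model
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio("meeting.mp4")
result = model.transcribe(audio, batch_size=16)

# 2. Align words to precise timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize and attach speaker labels to each segment
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)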
- Manages ChromaDB vector database
- Computes speaker embeddings using ECAPA-TDNN
- Performs similarity matching with configurable thresholds
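A minimal sketch of the speaker recognition engine's vector-store side using the public chromadb API (the collection name, metadata fields, and random stand-in embedding are assumptions, not the platform's actual schema):

import chromadb
import numpy as np

# Persistent store on disk; cosine space matches the similarity metric used for voices.
client = chromadb.PersistentClient(path="./speaker_database")
collection = client.get_or_create_collection(
    name="speaker_embeddings",
    metadata={"hnsw:space": "cosine"},
)

# Store a 192-dim voice embedding together with speaker metadata.
embedding = np.random.rand(192).astype("float32")   # stand-in for a real ECAPA-TDNN embedding
collection.add(
    ids=["speaker_a3f2d8"],
    embeddings=[embedding.tolist()],
    metadatas=[{"name": "Alice Johnson", "meetings": 3}],
)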
- Converts between formats (VTT → TXT → XLSX)
- Applies NLP-based timestamp refinement
- Generates AI summaries and HTML outputs
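As a rough illustration of the transcript processor's VTT-to-spreadsheet step, here is a sketch using the third-party webvtt-py and pandas packages (the platform's own converter and column layout may differ):

import webvtt        # pip install webvtt-py
import pandas as pd  # writing .xlsx also requires openpyxl

rows = []
for caption in webvtt.read("meeting.vtt"):
    # Cues are assumed to look like "Speaker Name: spoken text"
    speaker, sep, text = caption.text.partition(":")
    if not sep:                       # no speaker prefix in this cue
        speaker, text = "", caption.text
    rows.append({
        "start": caption.start,
        "end": caption.end,
        "speaker": speaker.strip(),
        "text": text.strip(),
    })

pd.DataFrame(rows).to_excel("meeting.xlsx", index=False)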
# Default: with speaker recognition
python cli.py recording.mp4
# Without speaker recognition (faster)
python cli.py recording.mp4 --no-speaker-recognition
# Custom output directory
python cli.py recording.mp4 --output ./my_transcripts/
# Requires original audio for speaker recognition
python cli.py transcript.srt --audio-path original_recording.mp4
# Without audio (no speaker recognition)
python cli.py transcript.srt
# Note: No speaker recognition available (no audio access)
python cli.py "https://mit.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=VIDEO_ID"
# Process entire directory
python cli.py ./recordings/ --speaker-recognition
python speaker_cli.py list
# By name
python speaker_cli.py rename "Speaker_1" "Alice Johnson"
# By ID prefix
python speaker_cli.py rename "a3f2d8" "Alice Johnson"
# Merge multiple entries into one
python speaker_cli.py merge "John" "J. Doe" "John Doe" --target "John Doe"
# Check speaker in audio segment
python speaker_cli.py verify audio.wav --start 10 --end 30
# Export for backup/sharing
python speaker_cli.py export speakers_backup.json
# Import and merge
python speaker_cli.py import speakers_backup.json --merge
# Import and replace
python speaker_cli.py import speakers_backup.json
# GPU Configuration
WHISPER_DEVICE=cuda # or 'cpu' for CPU-only
WHISPER_COMPUTE_TYPE=float16 # or 'int8' for faster/lower quality
WHISPER_BATCH_SIZE=16 # Reduce for less GPU memory
# Model Selection
WHISPER_MODEL=large-v2 # Options: tiny, base, small, medium, large, large-v2
# Processing Settings
BATCH_SIZE_MINUTES=40 # Chunk size for summarization
GPT_MODEL=gpt-4o # Or gpt-3.5-turbo for cost savings
# Directories
OUTPUT_DIR=./output
SPEAKER_DB_DIR=./speaker_database
# Use shared/team database
python cli.py recording.mp4 --speaker-db-path /shared/team_speakers/
To adjust the speaker-matching similarity threshold, edit speaker_id/fingerprint_manager.py:
def __init__(self, ..., similarity_threshold: float = 0.85):
# 0.85 = 85% similarity required
# Lower values: more lenient matching
# Higher values: stricter matching
The system uses ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network):
- 192-dimensional speaker embeddings
- Trained on VoxCeleb dataset
- Cosine similarity for speaker matching
- ChromaDB for efficient vector search
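A hedged example of extracting and comparing two embeddings with the public SpeechBrain API (file paths are placeholders; the platform wraps this logic inside its fingerprint manager):

import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier  # speechbrain.pretrained in older releases

# Pretrained ECAPA-TDNN speaker encoder (VoxCeleb-trained, 192-dim output)
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path: str) -> torch.Tensor:
    signal, sr = torchaudio.load(path)
    signal = signal.mean(dim=0, keepdim=True)       # force mono
    if sr != 16000:                                  # ECAPA-TDNN expects 16 kHz audio
        signal = torchaudio.functional.resample(signal, sr, 16000)
    return encoder.encode_batch(signal).squeeze()    # shape: (192,)

emb_a = embed("alice_segment.wav")
emb_b = embed("unknown_segment.wav")

# Cosine similarity in [-1, 1]; compared against the configured threshold (default 0.85).
similarity = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0).item()
print(f"similarity = {similarity:.3f}")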
- Audio Segmentation: Extract speaker-specific segments from diarization
- Feature Extraction: Convert audio to 192-dim embedding vectors
- Vector Storage: Store in ChromaDB with metadata
- Similarity Search: Compare new embeddings against database
- Identity Resolution: Apply threshold and confidence scoring
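Steps 4-5 amount to a nearest-neighbour query plus a threshold check, roughly as follows (a sketch against the chromadb query API, reusing the collection layout assumed earlier; the query embedding is a random stand-in):

import chromadb
import numpy as np

client = chromadb.PersistentClient(path="./speaker_database")
collection = client.get_or_create_collection("speaker_embeddings", metadata={"hnsw:space": "cosine"})

new_embedding = np.random.rand(192).astype("float32")   # stand-in for a freshly extracted embedding

# ChromaDB returns cosine distance = 1 - cosine similarity, so convert before thresholding.
matches = collection.query(query_embeddings=[new_embedding.tolist()], n_results=3)

SIMILARITY_THRESHOLD = 0.85   # mirrors the default in fingerprint_manager.py
if matches["ids"][0]:
    best_id = matches["ids"][0][0]
    best_similarity = 1.0 - matches["distances"][0][0]
    if best_similarity >= SIMILARITY_THRESHOLD:
        print(f"Matched {best_id} (confidence {best_similarity:.2f})")
    else:
        print("Below threshold; treating as a new speaker")
else:
    print("Speaker database is empty")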
- Format Normalization: Convert all inputs to standardized VTT
- Timestamp Refinement: Use NLTK for sentence boundary detection
- Speaker Attribution: Map diarization results to transcript segments
- Topic Detection: Identify topic changes using TF-IDF
- AI Summarization: Generate hierarchical summaries with GPT-4
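The timestamp-refinement step, for example, re-cuts cue timings at sentence boundaries; the idea is roughly the sketch below using NLTK (a heuristic illustration, not the platform's actual refinement code):

import nltk
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)   # needed on newer NLTK releases
from nltk.tokenize import sent_tokenize

segment_text = "So that wraps up the budget. Moving on to hiring, we have three open roles."
segment_start, segment_end = 120.0, 128.0   # seconds, taken from the raw VTT cue

sentences = sent_tokenize(segment_text)

# Apportion the cue's time span to sentences by character length (a crude but common heuristic).
total_chars = sum(len(s) for s in sentences)
cursor = segment_start
for sentence in sentences:
    duration = (segment_end - segment_start) * len(sentence) / total_chars
    print(f"{cursor:7.2f}-{cursor + duration:7.2f}  {sentence}")
    cursor += duration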
from processors.pipeline_processor import PipelineProcessor
# Initialize pipeline
pipeline = PipelineProcessor(
output_base_dir=Path("./output"),
enable_speaker_recognition=True
)
# Process media file
result = pipeline.process(
"meeting.mp4",
language="en",
speaker_recognition=True,
auto_add_speakers=True,
skip_refinement=False
)
# Access results
print(result['success']) # Processing status
print(result['output_dir']) # Output location
print(result['speakers']) # Identified speakers
print(result['files']) # Generated files
from speaker_id.fingerprint_manager import SpeakerFingerprintManager
# Initialize manager
manager = SpeakerFingerprintManager(
db_path=Path("./speaker_db"),
similarity_threshold=0.85
)
# Extract embedding from audio
embedding = manager.extract_embedding(
"audio.wav",
start_time=10.0,
end_time=20.0
)
# Identify speaker
result = manager.identify_speaker(
embedding,
return_all_matches=True
)
# Add new speaker
speaker_id = manager.add_speaker(
name="John Doe",
embedding=embedding,
metadata={"department": "Engineering"}
)
# Reduce batch size for limited GPU memory
WHISPER_BATCH_SIZE=8 # Default: 16
# Use INT8 quantization
WHISPER_COMPUTE_TYPE=int8 # Faster, slightly lower quality
- CPU-only: ~5x slower than GPU
- Small model: ~10x faster than large-v2
- Skip diarization: ~2x faster
- Skip refinement: ~1.5x faster
# Process overnight with logging
nohup python cli.py ./recordings/ > processing.log 2>&1 &
# Monitor progress
tail -f processing.log
# Solution 1: Reduce batch size
export WHISPER_BATCH_SIZE=4
# Solution 2: Use CPU
export WHISPER_DEVICE=cpu
# Solution 3: Use smaller model
export WHISPER_MODEL=base
- Ensure audio quality is good (low noise, clear speech)
- Use segments longer than 3 seconds
- Process multiple recordings to build better profiles
- Adjust similarity threshold if needed
# Reinstall with specific CUDA version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install missing dependencies
pip install speechbrain chromadb scipy nltk
- Ensure you have proper permissions
- Check if video is public/unlisted
- Verify URL format is correct
# Enable verbose logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Install in development mode
pip install -e .
# Install dev dependencies
pip install pytest black flake8 mypy
- Follow PEP 8
- Use type hints
- Add docstrings for all public methods
- Run black . before committing
# Run unit tests
pytest tests/
# Run with coverage
pytest --cov=processors tests/
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- WhisperX: Faster Whisper transcription with word-level timestamps
- Pyannote: State-of-the-art speaker diarization
- SpeechBrain: ECAPA-TDNN speaker recognition model
- ChromaDB: Efficient vector database for embeddings
- OpenAI: GPT-4 for intelligent summarization
If you use this platform in your research, please cite:
@software{unified_transcript_platform,
title = {Unified Transcript Processing Platform with Speaker Recognition},
year = {2025},
url = {https://github.com/yourusername/transcript-processing-platform}
}