Faster Whisper XXL GUI

A comprehensive, user-friendly graphical interface for audio and video transcription with advanced speaker diarization capabilities. Built on top of faster-whisper-xxl.exe (https://github.com/Purfview/whisper-standalone-win), this application makes high-quality transcription, translation, and diarization easy and accessible. Faster-Whisper-XXL is a modified version of OpenAI Whisper (see Credits & Acknowledgments below). It is standalone and does not require network access (other than to download language models).

Download the .exe by clicking the link under "Releases".

🎯 Overview

Faster Whisper XXL GUI provides an intuitive interface for transcribing audio and video files with great accuracy. Whether you need to transcribe a single interview or process hundreds of files in batch, this application offers the tools and flexibility to get the job done efficiently.

Key Highlights

  • Easy Setup: Simple installation process with guided setup on first launch
  • Maximum Accuracy: Optimized presets for interview-quality transcriptions
  • Speaker Identification: Advanced diarization to identify who said what
  • Batch Processing: Handle multiple files with individual settings
  • Audio Quality Analysis: Get intelligent suggestions for optimal settings
  • Multiple Output Formats: SRT, VTT, TXT, and JSON formats
  • Post-Processing Tools: Speaker replacement and timestamp removal utilities
  • Non-Commercial Use: This application is for non-commercial use only

✨ Features

Core Transcription Features

  • Multi-Format Support: Process audio and video files in virtually any format (MP3, MP4, WAV, M4A, FLAC, etc.)
  • Language Support: Transcribe in 99+ languages or translate to English
  • Multiple Models: Choose from tiny to large-v3-turbo models based on your accuracy/speed needs
  • Real-Time Progress: Watch transcription progress with live output and status updates
  • Drag & Drop: Simply drag files or folders into the application
  • Batch Processing: Process multiple files with a queue system and per-file settings

Speaker Diarization (Speaker Identification)

Identify different speakers in your audio - perfect for interviews, phone calls, meetings, and multi-speaker content.

  • Multiple Diarization Methods:

    • pyannote_v3.1 (Recommended) - Latest and most accurate
    • pyannote_v3.0 - Stable fallback option
    • reverb_v2 - Best for audio with echo/reverb
    • reverb_v1 - Legacy method for compatibility
  • Speaker Count Optimization: Set exact speaker count for dramatically improved accuracy

  • Speaker Replacement Tool: After transcription, replace generic labels (SPEAKER_00, SPEAKER_01) with actual names

  • GPU Acceleration: Optional GPU support for 5-10x faster processing

  • Min/Max Speaker Range: Set speaker count ranges when exact count is unknown

Audio Processing & Quality

  • Audio Quality Analysis: Analyze your audio files and receive intelligent suggestions for optimal filter settings

    • Noise level detection (low/medium/high)
    • Volume level assessment
    • Quality score (0-100)
    • Personalized filter recommendations
  • Audio Filters (see the FFmpeg sketch after this list):

    • Speech Normalization: Amplify quiet speech to make it more audible
    • Loudness Normalization: Normalize to EBU R128 broadcast standard
    • Low/High Pass Filter: Remove frequencies outside speech range (50Hz-7800Hz)
    • Denoise: Reduce background noise (adjustable intensity 0-97)
    • Tempo Adjustment: Adjust playback speed for fast or slow speech
  • Voice Activity Detection (VAD): Automatically filter out non-speech segments
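
The filter names above correspond to standard FFmpeg audio filters (speechnorm, loudnorm, highpass/lowpass, afftdn, atempo). The sketch below shows one way such a chain could be expressed; it is illustrative only, and the exact parameters the GUI passes to its bundled ffmpeg.exe are not documented here, so treat the values as assumptions.

```python
import subprocess

def preprocess_audio(src: str, dst: str) -> None:
    """Illustrative FFmpeg preprocessing chain (not the GUI's exact command)."""
    filters = ",".join([
        "speechnorm",       # Speech Normalization: amplify quiet speech
        "highpass=f=50",    # drop rumble below 50 Hz
        "lowpass=f=7800",   # drop content above the speech band
        "afftdn=nr=15",     # Denoise: moderate noise reduction (assumed filter)
        "loudnorm",         # Loudness Normalization to EBU R128
        "atempo=1.0",       # Tempo Adjustment: 1.0 leaves speed unchanged
    ])
    subprocess.run(["ffmpeg", "-y", "-i", src, "-af", filters, dst], check=True)

# preprocess_audio("interview.mp3", "interview_clean.wav")
```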

User Experience Features

  • Preset Configurations: Five optimized presets for different use cases

    • Standard: Maximum accuracy for clean interview recordings
    • Turbo: Speed-optimized for quick transcriptions
    • Diarize: Optimized for speaker identification with maximum accuracy
    • Phone Conversation Audio: Optimized for low-quality/noisy phone recordings
    • Custom: Full manual control
  • Comprehensive Help System:

    • Tooltips on hover for quick explanations
    • Detailed help dialogs (click "?" buttons)
    • Context-sensitive help that adapts to your selections
    • Best Practices guide shown on first run
  • Settings Validation: Pre-processing warnings for suboptimal configurations

  • GPU Detection: Automatic detection and notification when GPU is available

  • Model Management: Check which models are downloaded and get download information

  • Command Preview: See the exact command that will be executed before processing
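
For a sense of what Command Preview displays: the GUI ultimately assembles a faster-whisper-xxl.exe command line from your settings. The sketch below is a hedged illustration; the flag names (--model, --language, --output_format, --output_dir, --beam_size) are assumed to mirror the Whisper-style CLI, and the Command Preview in the GUI is always the authoritative source for the real invocation.

```python
import subprocess
from pathlib import Path

# Install location documented under "File Locations" below.
EXE = Path.home() / "AppData" / "Local" / "FasterWhisperGUI" / "faster-whisper-xxl.exe"

def build_command(audio: str, model: str = "large-v2",
                  language: str = "en", out_dir: str = "output") -> list[str]:
    # Flag names are assumptions modeled on the Whisper-style CLI;
    # the GUI's Command Preview shows the exact invocation it will run.
    return [
        str(EXE), audio,
        "--model", model,
        "--language", language,
        "--output_format", "srt",
        "--output_dir", out_dir,
        "--beam_size", "10",
    ]

cmd = build_command("interview.mp3")
print(" ".join(cmd))                 # roughly what Command Preview displays
# subprocess.run(cmd, check=True)    # uncomment to run it outside the GUI
```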

Post-Processing Tools

  • Speaker Replacement Dialog:

    • Review full diarized transcript
    • See all identified speakers in a table
    • Replace generic labels with actual names
    • Preview changes before saving
    • Save updated transcript with custom speaker names
  • Timestamp Removal Tool:

    • Remove timestamps from transcriptions
    • Clean up text for easier reading
    • Preserve speaker labels if diarization was used

Advanced Features

  • Custom Subtitle Formatting:

    • Control maximum line width
    • Set maximum lines per segment
    • Configure comma break percentage
    • Enable sentence mode for better segmentation
  • Word-Level Timestamps: Precise timing for each word

  • Karaoke-Style Subtitles: Highlight words as they're spoken (VTT format)

  • Processing Queue: Manage multiple files with individual settings

  • Recursive Folder Processing: Process entire folder structures

  • File Validation: Automatic verification of input files before processing

📥 Installation

System Requirements

  • Operating System: Windows 10 or later
  • Disk Space: At least 2GB free space (for runtime files and models)
  • RAM: 4GB minimum, 8GB+ recommended
  • GPU: Optional but recommended - NVIDIA GPU with CUDA support for 5-10x faster processing

Installation Steps

  1. Download GUI: Download FasterWhisperXXLGUI.exe from the Releases page

  2. Run GUI: Double-click FasterWhisperXXLGUI.exe to launch the application

  3. Download faster-whisper-xxl (First Run Only):

    • On first launch, a dialog will appear prompting you to download faster-whisper-xxl.exe
    • Click "Download faster-whisper-xxl" to open the download page, or download directly from https://github.com/Purfview/whisper-standalone-win
    • Download the file: Faster-Whisper-XXL_r245.4_windows.7z from the GitHub repository
    • The file will download to your default download location (usually Downloads folder)
  4. Install faster-whisper-xxl:

    • After downloading, click "I've Downloaded It - Install Now" in the GUI dialog
    • The application will help you locate the downloaded .7z file
    • Files will be automatically extracted and installed to %LOCALAPPDATA%\FasterWhisperGUI\
    • This is a one-time setup - the application will remember the installation
  5. Models: Models will be automatically downloaded when needed to %LOCALAPPDATA%\FasterWhisperGUI\_models\

    • Models are not preloaded; each one is downloaded automatically the first time a transcription requires it.

Note: The GUI executable does not include faster-whisper-xxl.exe - it must be downloaded separately from the Purfview GitHub repository as per the developer's requirements.
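
If you ever need to confirm that the one-time setup completed, the documented install locations can be checked directly. A minimal sketch:

```python
import os
from pathlib import Path

install_dir = Path(os.environ["LOCALAPPDATA"]) / "FasterWhisperGUI"

# Files the GUI expects after the one-time install (see "File Locations" below).
for name in ("faster-whisper-xxl.exe", "ffmpeg.exe"):
    path = install_dir / name
    print(f"{name}: {'found' if path.exists() else 'MISSING'} ({path})")

models_dir = install_dir / "_models"
print(f"_models folder present: {models_dir.is_dir()}")
```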

📖 Usage Guide

Quick Start

  1. Launch: Double-click FasterWhisperXXLGUI.exe

  2. Select Files:

    • Click "Select Files/Folder..." button, or
    • Drag and drop files/folders into the application
  3. Choose a Preset (optional but recommended):

    • Select from the dropdown: Standard, Turbo, Diarize, Phone Conversation Audio, or Custom
    • Presets configure optimal settings for different use cases
  4. Configure Settings (if needed):

    • Modify any settings after selecting a preset
    • Use "?" buttons for detailed help on any option
  5. Start Processing: Click "Start Processing" button

  6. Post-Processing (if diarization enabled):

    • Speaker Replacement dialog opens automatically
    • Replace speaker labels with actual names
    • Save the updated transcript

Detailed Workflow

For Single File Transcription

  1. Select File: Click "Select Files/Folder..." and choose your audio/video file
  2. Choose Preset: Select appropriate preset (Standard for most cases)
  3. Optional - Analyze Audio: Click "Analyze Audio Quality" for filter suggestions
  4. Configure Language: Select the language (don't use Auto-detect for best accuracy)
  5. Choose Output Format: Select SRT, VTT, TXT, and/or JSON
  6. Start Processing: Click "Start Processing"
  7. Review Output: Check the output area for progress and results
  8. Open Output: Click "Open Output Folder" when complete

For Speaker Diarization (Interviews, Meetings)

  1. Select File(s): Choose your audio/video file(s)
  2. Choose Preset: Select "Diarize" preset
  3. Set Speaker Count: ⭐ CRITICAL - Set the exact number of speakers if known (dramatically improves accuracy)
  4. Configure Language: Specify the language explicitly
  5. Choose Output Formats: Select TXT, SRT, and/or VTT (speaker labels included)
  6. Start Processing: Click "Start Processing"
  7. Speaker Replacement:
    • Dialog opens automatically after processing
    • Review all identified speakers in the table
    • Replace generic labels (SPEAKER_00, SPEAKER_01) with actual names
    • Preview changes
    • Save updated transcript
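
The Speaker Replacement dialog performs step 7 for you; conceptually it is a plain label substitution over the transcript. A minimal sketch of the same idea, assuming the bracketed speaker labels used in the TXT output format described later in this README:

```python
from pathlib import Path

def replace_speakers(transcript: str, names: dict[str, str]) -> str:
    """Swap generic diarization labels for real names.

    `names` maps labels such as "SPEAKER_00" to display names; labels
    not present in the mapping are left untouched.
    """
    for label, name in names.items():
        transcript = transcript.replace(f"[{label}]", f"[{name}]")
    return transcript

text = Path("interview.txt").read_text(encoding="utf-8")
updated = replace_speakers(text, {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"})
Path("interview_named.txt").write_text(updated, encoding="utf-8")
```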

For Batch Processing (Multiple Files)

  1. Select Multiple Files or Folder: Choose multiple files or a folder containing files
  2. Configure Settings: Set your preferred settings
  3. Start Processing: Click "Start Processing"
  4. Queue Settings Dialog:
    • Choose "Apply same settings to all files" or
    • Choose "Configure different settings for each file"
  5. Edit Individual Settings (if needed): Click "Edit Settings" for any file
  6. Monitor Progress: Processing Queue window shows progress for all files
  7. Review Results: Each file gets its own output files

Presets Explained

⭐ Standard Preset (Recommended for Most Users)

  • Best For: General transcription, clean interview recordings, podcasts, meetings
  • Model: large-v2 (excellent accuracy, reliable, fewer hallucinations than v3)
  • Optimization: Maximum accuracy settings for controlled interview environments
  • Settings:
    • Beam Size: 10 (maximum accuracy)
    • Patience: 5.0 (maximum completeness)
    • Temperature: 0.0 (deterministic, consistent results)
    • VAD: Enabled with pyannote_v3 (most accurate method)
    • Audio Filters: Loudness normalization only (minimal processing for clean audio)
  • When to Use: Most common use case - clean audio, interviews, general transcription needs

⚡ Turbo Preset

  • Best For: Quick transcriptions when speed is priority
  • Model: turbo (fast while maintaining good accuracy)
  • Settings: Balanced for speed and accuracy
    • Beam Size: 5 (default, balanced)
    • Patience: 2.0 (default, faster)
    • VAD: Enabled with silero_v4_fw (faster method)
  • When to Use: When you need fast results and can accept slightly lower accuracy

🎤 Diarize Preset

  • Best For: Interviews, meetings, podcasts with multiple speakers
  • Model: large-v2 (best for speaker identification accuracy)
  • Optimization: Maximum accuracy for speaker identification
  • Settings:
    • Diarization: Enabled with pyannote_v3.1 (latest, most accurate method)
    • Beam Size: 10 (maximum accuracy)
    • Patience: 5.0 (maximum completeness)
    • Audio Filters: Loudness normalization only
  • Output: TXT, SRT, VTT formats (multiple formats for speaker labeling)
  • ⚠️ IMPORTANT: Set the exact number of speakers if known (dramatically improves accuracy)
  • When to Use: When you need to identify who is speaking

📞 Phone Conversation Audio Preset

  • Best For: Phone calls, low-quality recordings, noisy audio
  • Model: large-v2 (robust for degraded audio)
  • Settings: Optimized for challenging audio conditions
    • Beam Size: 8 (higher for noisy conditions)
    • Patience: 4.0 (better for degraded audio)
    • Audio Filters: Aggressive preprocessing
      • Speech normalization (amplifies quiet speech)
      • Loudness normalization
      • Low/high pass filter (removes non-speech frequencies)
      • Denoise: 15 (moderate noise reduction)
  • ⚠️ IMPORTANT: For phone calls, set speaker count to 2 if known
  • When to Use: Phone calls, recordings with background noise, low-quality audio

🔧 Custom Preset

  • Best For: Advanced users who want full control
  • Settings: No pre-configured settings - configure everything manually
  • When to Use: When you want to set all options yourself from scratch
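
For quick reference, the presets above amount to a handful of parameter choices. The mapping below simply restates the documented values as data; it is illustrative and not the GUI's internal representation.

```python
# Documented preset parameters, restated as plain data (illustrative only).
PRESETS = {
    "Standard": {"model": "large-v2", "beam_size": 10, "patience": 5.0,
                 "temperature": 0.0, "vad_method": "pyannote_v3",
                 "filters": ["loudness_normalization"]},
    "Turbo": {"model": "turbo", "beam_size": 5, "patience": 2.0,
              "vad_method": "silero_v4_fw"},
    "Diarize": {"model": "large-v2", "beam_size": 10, "patience": 5.0,
                "diarize_method": "pyannote_v3.1",
                "filters": ["loudness_normalization"],
                "outputs": ["txt", "srt", "vtt"]},
    "Phone Conversation Audio": {"model": "large-v2", "beam_size": 8,
                                 "patience": 4.0,
                                 "filters": ["speech_normalization",
                                             "loudness_normalization",
                                             "low_high_pass", "denoise_15"]},
    "Custom": {},  # everything configured manually
}

print(PRESETS["Diarize"]["diarize_method"])  # pyannote_v3.1
```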

Audio Quality Analysis

The Audio Quality Analysis feature helps you determine the best settings for your specific audio file.

  1. Select File: Choose your audio file
  2. Click "Analyze Audio Quality": Button in the File Selection section
  3. Review Results:
    • Quality Score (0-100)
    • Noise Level (Low/Medium/High)
    • Volume Level (Low/Normal/High)
    • Suggested filter settings
  4. Apply Suggestions: Manually enable suggested filters or use a preset

Example Suggestions:

  • "Clean audio detected - using minimal filters for best accuracy"
  • "Low volume detected - consider enabling Speech Normalization"
  • "Some noise detected - consider enabling Denoise filter if quality is poor"

Output Formats

SRT (SubRip Subtitle)

  • Standard subtitle format, widely compatible with video players
  • Format: Sequential subtitle blocks with timestamps
  • Includes speaker labels when diarization is enabled
  • Best for: Video editing software, YouTube, general video players

VTT (WebVTT)

  • Web Video Text Tracks format
  • Similar to SRT but web-optimized
  • Supports word-level highlighting (karaoke effect)
  • Includes speaker labels when diarization is enabled
  • Best for: Web videos, HTML5 video players

TXT (Plain Text with Timestamps)

  • Plain text format with timestamps
  • Format: [HH:MM:SS.mmm --> HH:MM:SS.mmm] Text
  • Includes speaker labels when diarization is enabled
  • Example: [00:38.120 --> 00:41.240] [SPEAKER_01]: We came over to your house earlier this
  • Best for: Reading, editing, general text processing
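
The built-in Timestamp Removal tool handles this for you; as a reference for the layout above, here is a minimal sketch that strips the bracketed time range while keeping any speaker labels:

```python
import re
from pathlib import Path

# Matches a leading "[MM:SS.mmm --> ...] " prefix (hours optional),
# per the TXT format shown above.
TIMESTAMP = re.compile(r"^\[\d{1,2}:\d{2}(?::\d{2})?\.\d{3} --> [\d:.]+\]\s*")

def strip_timestamps(text: str) -> str:
    return "\n".join(TIMESTAMP.sub("", line) for line in text.splitlines())

txt = Path("interview.txt").read_text(encoding="utf-8")
Path("interview_clean.txt").write_text(strip_timestamps(txt), encoding="utf-8")
# "[00:38.120 --> 00:41.240] [SPEAKER_01]: We came over..." becomes
# "[SPEAKER_01]: We came over..."
```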

JSON (Detailed Metadata)

  • Comprehensive data format with all information
  • Includes: timestamps, word-level data, speaker information (separate from text)
  • Best for: Programmatic processing, detailed analysis
  • Note: Speaker labels are in metadata, not inline with text
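
If you process the JSON programmatically, expect a Whisper-style layout of segments with start/end times and text; the exact schema is not reproduced here, so inspect a generated file first. A hedged sketch:

```python
import json
from pathlib import Path

data = json.loads(Path("interview.json").read_text(encoding="utf-8"))

# Assumes a Whisper-style schema: a top-level "segments" list whose items
# carry "start", "end", and "text". Speaker information lives in metadata
# rather than inline, so check your own output for the per-segment field
# name (shown here as "speaker", which is an assumption).
for seg in data.get("segments", []):
    speaker = seg.get("speaker", "")
    prefix = f"[{speaker}] " if speaker else ""
    print(f"{seg['start']:8.2f} - {seg['end']:8.2f}  {prefix}{seg['text'].strip()}")
```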

🎯 Tips for Maximum Accuracy

General Transcription Accuracy

  1. Specify Language: Don't use "Auto-detect" - specify the language explicitly for best results
  2. Use Larger Models: large-v2 or large-v3-turbo provide better accuracy than smaller models
  3. Enable GPU: If you have an NVIDIA GPU, enable CUDA for faster processing (allows using larger models)
  4. Use Appropriate Preset: Choose the preset that matches your use case
  5. Audio Quality: Use "Analyze Audio Quality" to get suggestions for your specific audio

Speaker Diarization Accuracy

  1. ⭐ Set Exact Speaker Count: The SINGLE MOST IMPORTANT setting - if you know there are exactly 2, 3, 4, etc. speakers, set "Number of Speakers" to that exact number. This dramatically improves accuracy.

  2. Specify Language: Don't use "Auto-detect" - specify the language explicitly

  3. Use Large Model: large-v2 model provides best accuracy for diarization

  4. Use Latest Method: pyannote_v3.1 is the most accurate diarization method

  5. Enable GPU: Diarization is computationally intensive - GPU provides 5-10x speedup

  6. Clean Audio: For clean recordings (iPhone, quiet rooms), use minimal audio filters

Audio Filter Strategy

  • Clean Audio (iPhone recordings, quiet rooms): Use minimal filters (loudness normalization only)
  • Noisy Audio (phone calls, background noise): Use aggressive filters (denoise, frequency filtering)
  • Low Volume: Enable speech normalization
  • Inconsistent Volume: Enable loudness normalization

Model Selection Guide

  • Maximum Accuracy: large-v2 or large-v3-turbo
  • Balanced: large-v2 (recommended for most users)
  • Speed Priority: turbo
  • Resource Constrained: small or tiny (lower accuracy)

⌨️ Keyboard Shortcuts

  • Ctrl+O: Open files
  • Ctrl+S: Start processing
  • Ctrl+C: Cancel processing
  • F1: Show help menu
  • Esc: Close dialogs

🐛 Troubleshooting

Application Won't Start

  • Insufficient Disk Space: Ensure at least 2GB free space
  • Antivirus Blocking: Check if Windows Defender or antivirus is blocking the executable
  • Permission Errors: Try running as Administrator
  • First Run Delay: First launch takes longer as files are extracted - wait for extraction dialog to complete

No Output Generated

  • Check Output Area: Look for error messages in the output text area
  • Invalid Files: Verify your input files are valid audio/video formats
  • Disk Space: Ensure sufficient disk space for output files
  • File Permissions: Check that you have write permissions to the output folder

Slow Processing

  • Enable GPU: If you have NVIDIA GPU, enable CUDA in device settings
  • Use Smaller Model: Try turbo instead of large-v2 for faster processing
  • Reduce Beam Size: Lower beam size in Advanced Options (trades accuracy for speed)
  • Disable Audio Filters: If not needed, disable filters to speed up preprocessing

Poor Transcription Quality

  • Use Larger Model: Try large-v2 or large-v3-turbo for better accuracy
  • Specify Language: Don't use "Auto-detect" - specify language explicitly
  • Enable Audio Filters: Use Audio Quality Analysis for suggestions
  • Check Audio Quality: Poor source audio will result in poor transcriptions
  • For Diarization: Set exact speaker count - this is critical for accuracy

GPU Not Detected

  • Check Drivers: Ensure NVIDIA drivers are installed and up to date
  • Verify CUDA: Run nvidia-smi in command prompt to verify CUDA is available
  • Auto Mode: Application will automatically use CPU if GPU is not available
  • Device Selection: Ensure "Device" is set to "Auto" or "CUDA" in Advanced Options
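
If you want to verify GPU availability outside the application, nvidia-smi (the same command mentioned above) is the quickest check. A small sketch that wraps it:

```python
import shutil
import subprocess

def cuda_gpu_available() -> bool:
    """Return True if nvidia-smi is on PATH and reports at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # driver/tooling not installed
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout

print("CUDA GPU detected" if cuda_gpu_available() else "No CUDA GPU - CPU will be used")
```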

Models Not Downloading

  • Internet Connection: Check your internet connection
  • Disk Space: Ensure sufficient space in %LOCALAPPDATA%\FasterWhisperGUI\_models\
  • Firewall: Check if firewall is blocking the download
  • Automatic Download: Models download automatically when first used - be patient

"faster-whisper-xxl.exe not found" Error

  • First Run: Complete the installation process by downloading and installing faster-whisper-xxl.exe
  • Installation Required: If you see this error, restart the application to trigger the installation dialog
  • Manual Installation: If the installation dialog doesn't appear, ensure faster-whisper-xxl.exe is in %LOCALAPPDATA%\FasterWhisperGUI\
  • Corrupted Files: Delete %LOCALAPPDATA%\FasterWhisperGUI\ and restart (you'll need to reinstall)
  • Permissions: Ensure you have read/write permissions to AppData folder

Post-Processing Dialogs Not Appearing

  • Check Output Files: If output files were generated, dialogs should appear
  • Non-Zero Exit Code: Dialogs appear even if process returns warnings (as long as outputs are generated)
  • Manual Access: Use "Identify & Replace Speakers" or "Remove Timestamps" buttons if dialogs don't auto-open

📁 File Locations

Installed Files (Installed on First Run)

  • Location: C:\Users\[YourUsername]\AppData\Local\FasterWhisperGUI\
  • Contents:
    • faster-whisper-xxl.exe - Core transcription engine (downloaded and installed separately)
    • ffmpeg.exe - Audio/video processing (included with faster-whisper-xxl)
    • ffprobe.exe - Media file analysis (included with faster-whisper-xxl, optional)
    • _xxl_data\ - Runtime dependencies (included with faster-whisper-xxl)

Downloaded Models

  • Location: C:\Users\[YourUsername]\AppData\Local\FasterWhisperGUI\_models\
  • Contents: Whisper models (downloaded automatically when needed)
  • Size: Models can be several GB each
  • Models Available:
    • tiny, small, medium, large-v1, large-v2, large-v3, large-v3-turbo

Configuration

  • Location: C:\Users\[YourUsername]\.faster_whisper_gui_config.json
  • Contents: User preferences (show best practices dialog, etc.)
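
To inspect (or reset by deleting) the stored preferences, the config file can be read like any JSON file. The keys inside are not documented here, so the sketch below only prints whatever it finds:

```python
import json
from pathlib import Path

config_path = Path.home() / ".faster_whisper_gui_config.json"
if config_path.exists():
    prefs = json.loads(config_path.read_text(encoding="utf-8"))
    print(json.dumps(prefs, indent=2))  # whatever preferences are stored
else:
    print("No config yet - it is created after the first run.")
```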

🔧 Advanced Configuration

Custom Subtitle Formatting

When "Standard Preset" is disabled, you can customize subtitle formatting:

  • Max Line Width: Maximum characters per subtitle line (default: 70)
  • Max Line Count: Maximum lines per subtitle segment (default: 3)
  • Max Comma Percentage: Percentage of line width before breaking at commas (default: 90%)
  • Sentence Mode: Split subtitles at sentence boundaries (auto-enabled with diarization)
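
To see how the width and line-count limits interact, here is a simplified sketch of that kind of wrapping; the real formatter also honors the comma-break percentage and sentence mode, which are omitted here:

```python
import textwrap

def wrap_subtitle(text: str, max_width: int = 70, max_lines: int = 3) -> list[str]:
    """Wrap one subtitle segment to the documented defaults.

    Simplified: the real formatter also considers comma breaks and
    sentence boundaries; this only enforces width and line count.
    """
    lines = textwrap.wrap(text, width=max_width)
    if len(lines) > max_lines:
        lines = lines[:max_lines]  # overflow would start a new segment
    return lines

for line in wrap_subtitle("This is a long piece of dialogue that needs to be "
                          "broken into readable subtitle lines for display."):
    print(line)
```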

Advanced Transcription Options

  • Temperature: Controls randomness (0.0 = deterministic, recommended)
  • Beam Size: Number of candidates to consider (1-10, higher = more accurate but slower)
  • Patience: Beam search patience factor (higher = the decoder keeps searching longer before settling on a result)
  • Best Of: Number of candidates when sampling (default: 5)
  • Length Penalty: Token length penalty coefficient (default: 1.0)
  • Repetition Penalty: Penalty for repeating tokens (default: 1.0)

Most users should leave these at default values unless experiencing specific issues.

🙏 Credits & Acknowledgments

This application is built on top of excellent open-source projects:

  • faster-whisper-xxl.exe by Purfview - The core transcription engine that powers this application. Note: faster-whisper-xxl.exe must be downloaded separately from the GitHub repository - it is not bundled with this GUI application. Special thanks to Purfview for creating and maintaining the standalone Windows executable.

  • Reverb by Rev.com - Used for speaker diarization functionality. The Reverb model is used under the Rev Model Non-Production License (see LICENSE_REVERB.txt).

  • Faster Whisper by Guillaume Klein - The optimized Whisper implementation that provides fast and accurate transcription.

  • OpenAI Whisper by OpenAI - The state-of-the-art speech recognition model that makes accurate transcription possible.

  • PyQt6 - The GUI framework that provides the user interface.

  • FFmpeg - Audio and video processing capabilities.

📝 License

Non-Commercial Use Only

This application is for NON-COMMERCIAL USE ONLY.

This GUI application is provided as-is for non-commercial use. Commercial use is not permitted.

License Files

This application includes the following license files:

  • LICENSE_NOTICE.txt - General license notice and attribution information
  • LICENSE_REVERB.txt - License for Reverb diarization model (Rev Model Non-Production License)

Third-Party Licenses

Usage Restrictions

  • This GUI application: Non-commercial use only
  • faster-whisper-xxl.exe: Subject to Purfview's license terms
  • Faster Whisper: MIT License (see link above)
  • OpenAI Whisper: MIT License (see link above)
  • Reverb: Subject to Rev Model Non-Production License (non-production, non-commercial use only)

For commercial use or production environments, please contact the respective license holders for appropriate licensing.

🤝 Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

📧 Support

For issues, questions, or feature requests, please open an issue on GitHub.
