A comprehensive, user-friendly graphical interface for audio and video transcription with advanced speaker diarization capabilities. Built on top of faster-whisper-xxl.exe (https://github.com/Purfview/whisper-standalone-win), this application makes high-quality transcription, translation, and diarization easy and accessible. Faster-whisper-xxl is a modified version of OpenAI Whisper (see Credits and Acknowledgments below). Faster-whisper-xxl is standalone and does not require network access (other than to download language models).
Faster Whisper XXL GUI provides an intuitive interface for transcribing audio and video files with great accuracy. Whether you need to transcribe a single interview or process hundreds of files in batch, this application offers the tools and flexibility to get the job done efficiently.
- Easy Setup: Simple installation process with guided setup on first launch
- Maximum Accuracy: Optimized presets for interview-quality transcriptions
- Speaker Identification: Advanced diarization to identify who said what
- Batch Processing: Handle multiple files with individual settings
- Audio Quality Analysis: Get intelligent suggestions for optimal settings
- Multiple Output Formats: SRT, VTT, TXT, and JSON formats
- Post-Processing Tools: Speaker replacement and timestamp removal utilities
- Non-Commercial Use: This application is for non-commercial use only
- Multi-Format Support: Process audio and video files in virtually any format (MP3, MP4, WAV, M4A, FLAC, etc.)
- Language Support: Transcribe in 99+ languages or translate to English
- Multiple Models: Choose from tiny to large-v3-turbo models based on your accuracy/speed needs
- Real-Time Progress: Watch transcription progress with live output and status updates
- Drag & Drop: Simply drag files or folders into the application
- Batch Processing: Process multiple files with a queue system and per-file settings
Identify different speakers in your audio - perfect for interviews, phone calls, meetings, and multi-speaker content.
- Multiple Diarization Methods:
  - `pyannote_v3.1` (Recommended) - Latest and most accurate
  - `pyannote_v3.0` - Stable fallback option
  - `reverb_v2` - Best for audio with echo/reverb
  - `reverb_v1` - Legacy method for compatibility
- Speaker Count Optimization: Set exact speaker count for dramatically improved accuracy
- Speaker Replacement Tool: After transcription, replace generic labels (SPEAKER_00, SPEAKER_01) with actual names
- GPU Acceleration: Optional GPU support for 5-10x faster processing
- Min/Max Speaker Range: Set speaker count ranges when exact count is unknown
- Audio Quality Analysis: Analyze your audio files and receive intelligent suggestions for optimal filter settings
- Noise level detection (low/medium/high)
- Volume level assessment
- Quality score (0-100)
- Personalized filter recommendations
- Audio Filters:
- Speech Normalization: Amplify quiet speech to make it more audible
- Loudness Normalization: Normalize to EBU R128 broadcast standard
- Low/High Pass Filter: Remove frequencies outside speech range (50Hz-7800Hz)
- Denoise: Reduce background noise (adjustable intensity 0-97)
- Tempo Adjustment: Adjust playback speed for fast or slow speech
- Voice Activity Detection (VAD): Automatically filter out non-speech segments
- Preset Configurations: Five optimized presets for different use cases
- Standard: Maximum accuracy for clean interview recordings
- Turbo: Speed-optimized for quick transcriptions
- Diarize: Optimized for speaker identification with maximum accuracy
- Phone Conversation Audio: Optimized for low-quality/noisy phone recordings
- Custom: Full manual control
- Comprehensive Help System:
- Tooltips on hover for quick explanations
- Detailed help dialogs (click "?" buttons)
- Context-sensitive help that adapts to your selections
- Best Practices guide shown on first run
- Settings Validation: Pre-processing warnings for suboptimal configurations
- GPU Detection: Automatic detection and notification when GPU is available
- Model Management: Check which models are downloaded and get download information
- Command Preview: See the exact command that will be executed before processing (an illustrative example appears after this feature list)
- Speaker Replacement Dialog:
- Review full diarized transcript
- See all identified speakers in a table
- Replace generic labels with actual names
- Preview changes before saving
- Save updated transcript with custom speaker names
- Timestamp Removal Tool:
- Remove timestamps from transcriptions
- Clean up text for easier reading
- Preserve speaker labels if diarization was used
- Custom Subtitle Formatting:
- Control maximum line width
- Set maximum lines per segment
- Configure comma break percentage
- Enable sentence mode for better segmentation
- Word-Level Timestamps: Precise timing for each word
- Karaoke-Style Subtitles: Highlight words as they're spoken (VTT format)
- Processing Queue: Manage multiple files with individual settings
- Recursive Folder Processing: Process entire folder structures
- File Validation: Automatic verification of input files before processing
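The Command Preview feature shows the underlying faster-whisper-xxl call before anything runs, which is handy for learning the CLI or reproducing a run from a script. The sketch below is a rough, hypothetical illustration only - the flag names are assumptions based on the upstream CLI and the real command depends on your settings and faster-whisper-xxl version, so always check the preview window for the actual command:

```
:: Hypothetical example of the kind of command the preview might show
:: (flag names are assumptions; the GUI's generated command may differ)
faster-whisper-xxl.exe "C:\Recordings\interview.mp3" --model large-v2 --language English --output_format srt --output_dir "C:\Recordings"
```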
- Operating System: Windows 10 or later
- Disk Space: At least 2GB free space (for runtime files and models)
- RAM: 4GB minimum, 8GB+ recommended
- GPU: Optional but recommended - NVIDIA GPU with CUDA support for 5-10x faster processing
- Download GUI: Download `FasterWhisperXXLGUI.exe` from the Releases page
- Run GUI: Double-click `FasterWhisperXXLGUI.exe` to launch the application
- Download faster-whisper-xxl (First Run Only):
  - On first launch, a dialog will appear prompting you to download `faster-whisper-xxl.exe`
  - Click "Download faster-whisper-xxl" to open the download page, or download directly from https://github.com/Purfview/whisper-standalone-win
  - Download the file `Faster-Whisper-XXL_r245.4_windows.7z` from the GitHub repository
  - The file will download to your default download location (usually the `Downloads` folder)
- Install faster-whisper-xxl:
  - After downloading, click "I've Downloaded It - Install Now" in the GUI dialog
  - The application will help you locate the downloaded `.7z` file
  - Files will be automatically extracted and installed to `%LOCALAPPDATA%\FasterWhisperGUI\`
  - This is a one-time setup - the application will remember the installation
- Models: Models are downloaded automatically when needed to `%LOCALAPPDATA%\FasterWhisperGUI\_models\`. They are not preloaded; each model is fetched the first time a command that requires it is run.
Note: The GUI executable does not include faster-whisper-xxl.exe - it must be downloaded separately from the Purfview GitHub repository as per the developer's requirements.
- Launch: Double-click `FasterWhisperXXLGUI.exe`
- Select Files:
  - Click "Select Files/Folder..." button, or
  - Drag and drop files/folders into the application
- Choose a Preset (optional but recommended):
  - Select from the dropdown: Standard, Turbo, Diarize, Phone Conversation Audio, or Custom
  - Presets configure optimal settings for different use cases
- Configure Settings (if needed):
  - Modify any settings after selecting a preset
  - Use "?" buttons for detailed help on any option
- Start Processing: Click "Start Processing" button
- Post-Processing (if diarization enabled):
  - Speaker Replacement dialog opens automatically
  - Replace speaker labels with actual names
  - Save the updated transcript
- Select File: Click "Select Files/Folder..." and choose your audio/video file
- Choose Preset: Select appropriate preset (Standard for most cases)
- Optional - Analyze Audio: Click "Analyze Audio Quality" for filter suggestions
- Configure Language: Select the language (don't use Auto-detect for best accuracy)
- Choose Output Format: Select SRT, VTT, TXT, and/or JSON
- Start Processing: Click "Start Processing"
- Review Output: Check the output area for progress and results
- Open Output: Click "Open Output Folder" when complete
- Select File(s): Choose your audio/video file(s)
- Choose Preset: Select "Diarize" preset
- Set Speaker Count: CRITICAL - Set the exact number of speakers if known (dramatically improves accuracy)
- Configure Language: Specify the language explicitly
- Choose Output Formats: Select TXT, SRT, and/or VTT (speaker labels included)
- Start Processing: Click "Start Processing"
- Speaker Replacement:
- Dialog opens automatically after processing
- Review all identified speakers in the table
- Replace generic labels (SPEAKER_00, SPEAKER_01) with actual names
- Preview changes
- Save updated transcript
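For example, a TXT transcript line changes like this when SPEAKER_01 is mapped to a real name (the name here is a placeholder, and the exact label formatting in your output may differ slightly):

```
Before: [00:38.120 --> 00:41.240] [SPEAKER_01]: We came over to your house earlier this
After:  [00:38.120 --> 00:41.240] [Maria]: We came over to your house earlier this
```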
- Select Multiple Files or Folder: Choose multiple files or a folder containing files
- Configure Settings: Set your preferred settings
- Start Processing: Click "Start Processing"
- Queue Settings Dialog:
- Choose "Apply same settings to all files" or
- Choose "Configure different settings for each file"
- Edit Individual Settings (if needed): Click "Edit Settings" for any file
- Monitor Progress: Processing Queue window shows progress for all files
- Review Results: Each file gets its own output files
- Best For: General transcription, clean interview recordings, podcasts, meetings
- Model: `large-v2` (excellent accuracy, reliable, fewer hallucinations than v3)
- Optimization: Maximum accuracy settings for controlled interview environments
- Settings:
- Beam Size: 10 (maximum accuracy)
- Patience: 5.0 (maximum completeness)
- Temperature: 0.0 (deterministic, consistent results)
- VAD: Enabled with `pyannote_v3` (most accurate method)
- Audio Filters: Loudness normalization only (minimal processing for clean audio)
- When to Use: Most common use case - clean audio, interviews, general transcription needs
- Best For: Quick transcriptions when speed is priority
- Model: `turbo` (fast while maintaining good accuracy)
- Settings: Balanced for speed and accuracy
- Beam Size: 5 (default, balanced)
- Patience: 2.0 (default, faster)
- VAD: Enabled with `silero_v4_fw` (faster method)
- When to Use: When you need fast results and can accept slightly lower accuracy
- Best For: Interviews, meetings, podcasts with multiple speakers
- Model: `large-v2` (best for speaker identification accuracy)
- Optimization: Maximum accuracy for speaker identification
- Settings:
- Diarization: Enabled with `pyannote_v3.1` (latest, most accurate method)
- Beam Size: 10 (maximum accuracy)
- Patience: 5.0 (maximum completeness)
- Audio Filters: Loudness normalization only
- Output: TXT, SRT, VTT formats (multiple formats for speaker labeling)
- ⚠️ IMPORTANT: Set the exact number of speakers if known (dramatically improves accuracy)
- When to Use: When you need to identify who is speaking
- Best For: Phone calls, low-quality recordings, noisy audio
- Model: `large-v2` (robust for degraded audio)
- Settings: Optimized for challenging audio conditions
- Beam Size: 8 (higher for noisy conditions)
- Patience: 4.0 (better for degraded audio)
- Audio Filters: Aggressive preprocessing
- Speech normalization (amplifies quiet speech)
- Loudness normalization
- Low/high pass filter (removes non-speech frequencies)
- Denoise: 15 (moderate noise reduction)
- ⚠️ IMPORTANT: For phone calls, set speaker count to 2 if known
- When to Use: Phone calls, recordings with background noise, low-quality audio
- Best For: Advanced users who want full control
- Settings: No pre-configured settings - configure everything manually
- When to Use: When you want to set all options yourself from scratch
The Audio Quality Analysis feature helps you determine the best settings for your specific audio file.
- Select File: Choose your audio file
- Click "Analyze Audio Quality": Button in the File Selection section
- Review Results:
- Quality Score (0-100)
- Noise Level (Low/Medium/High)
- Volume Level (Low/Normal/High)
- Suggested filter settings
- Apply Suggestions: Manually enable suggested filters or use a preset
Example Suggestions:
- "Clean audio detected - using minimal filters for best accuracy"
- "Low volume detected - consider enabling Speech Normalization"
- "Some noise detected - consider enabling Denoise filter if quality is poor"
- Standard subtitle format, widely compatible with video players
- Format: Sequential subtitle blocks with timestamps
- Includes speaker labels when diarization is enabled
- Best for: Video editing software, YouTube, general video players
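For reference, a minimal SRT file consists of numbered blocks like the ones below. The speaker labels are shown as they might appear with diarization enabled; the exact label formatting and the text are illustrative, not the application's literal output:

```
1
00:00:38,120 --> 00:00:41,240
[SPEAKER_01]: We came over to your house earlier this

2
00:00:41,240 --> 00:00:44,580
[SPEAKER_00]: Yeah, I remember that.
```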
- Web Video Text Tracks format
- Similar to SRT but web-optimized
- Supports word-level highlighting (karaoke effect)
- Includes speaker labels when diarization is enabled
- Best for: Web videos, HTML5 video players
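The same cue as WebVTT - note the required `WEBVTT` header and dot-separated milliseconds. When word-level highlighting is enabled, per-word timing tags are added inside the cue text; the snippet below shows only a plain cue and is illustrative rather than exact output:

```
WEBVTT

00:00:38.120 --> 00:00:41.240
[SPEAKER_01]: We came over to your house earlier this
```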
- Plain text format with timestamps
- Format: `[HH:MM:SS.mmm --> HH:MM:SS.mmm] Text`
- Includes speaker labels when diarization is enabled
- Example: `[00:38.120 --> 00:41.240] [SPEAKER_01]: We came over to your house earlier this`
- Best for: Reading, editing, general text processing
- Comprehensive data format with all information
- Includes: timestamps, word-level data, speaker information (separate from text)
- Best for: Programmatic processing, detailed analysis
- Note: Speaker labels are in metadata, not inline with text
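The JSON schema comes from faster-whisper-xxl and can change between releases, so the field names below are illustrative assumptions only. The general shape is a list of segments carrying timestamps, per-word timing, and speaker information kept separate from the text:

```json
{
  "segments": [
    {
      "start": 38.12,
      "end": 41.24,
      "text": "We came over to your house earlier this",
      "speaker": "SPEAKER_01",
      "words": [
        {"word": "We", "start": 38.12, "end": 38.30}
      ]
    }
  ]
}
```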
- Specify Language: Don't use "Auto-detect" - specify the language explicitly for best results
- Use Larger Models: `large-v2` or `large-v3-turbo` provide better accuracy than smaller models
- Enable GPU: If you have an NVIDIA GPU, enable CUDA for faster processing (allows using larger models)
- Use Appropriate Preset: Choose the preset that matches your use case
- Audio Quality: Use "Analyze Audio Quality" to get suggestions for your specific audio
- Set Exact Speaker Count: The SINGLE MOST IMPORTANT setting - if you know there are exactly 2, 3, 4, etc. speakers, set "Number of Speakers" to that exact number. This dramatically improves accuracy.
- Specify Language: Don't use "Auto-detect" - specify the language explicitly
- Use Large Model: The `large-v2` model provides the best accuracy for diarization
- Use Latest Method: `pyannote_v3.1` is the most accurate diarization method
- Enable GPU: Diarization is computationally intensive - GPU provides 5-10x speedup
- Clean Audio: For clean recordings (iPhone, quiet rooms), use minimal audio filters
- Clean Audio (iPhone recordings, quiet rooms): Use minimal filters (loudness normalization only)
- Noisy Audio (phone calls, background noise): Use aggressive filters (denoise, frequency filtering)
- Low Volume: Enable speech normalization
- Inconsistent Volume: Enable loudness normalization
- Maximum Accuracy: `large-v2` or `large-v3-turbo`
- Balanced: `large-v2` (recommended for most users)
- Speed Priority: `turbo`
- Resource Constrained: `small` or `tiny` (lower accuracy)
- `Ctrl+O`: Open files
- `Ctrl+S`: Start processing
- `Ctrl+C`: Cancel processing
- `F1`: Show help menu
- `Esc`: Close dialogs
- Insufficient Disk Space: Ensure at least 2GB free space
- Antivirus Blocking: Check if Windows Defender or antivirus is blocking the executable
- Permission Errors: Try running as Administrator
- First Run Delay: First launch takes longer as files are extracted - wait for extraction dialog to complete
- Check Output Area: Look for error messages in the output text area
- Invalid Files: Verify your input files are valid audio/video formats
- Disk Space: Ensure sufficient disk space for output files
- File Permissions: Check that you have write permissions to the output folder
- Enable GPU: If you have an NVIDIA GPU, enable CUDA in device settings
- Use Smaller Model: Try `turbo` instead of `large-v2` for faster processing
- Reduce Beam Size: Lower beam size in Advanced Options (trades accuracy for speed)
- Disable Audio Filters: If not needed, disable filters to speed up preprocessing
- Use Larger Model: Try `large-v2` or `large-v3-turbo` for better accuracy
- Specify Language: Don't use "Auto-detect" - specify language explicitly
- Enable Audio Filters: Use Audio Quality Analysis for suggestions
- Check Audio Quality: Poor source audio will result in poor transcriptions
- For Diarization: Set exact speaker count - this is critical for accuracy
- Check Drivers: Ensure NVIDIA drivers are installed and up to date
- Verify CUDA: Run `nvidia-smi` in a command prompt to verify CUDA is available (see the example after this list)
- Auto Mode: Application will automatically use CPU if GPU is not available
- Device Selection: Ensure "Device" is set to "Auto" or "CUDA" in Advanced Options
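If the drivers are installed, `nvidia-smi` prints a status table that includes your GPU name and a "CUDA Version" field; if Windows reports that the command is not recognized, the NVIDIA drivers are likely missing or outdated. The exact table layout varies by driver version:

```
:: Run in Command Prompt (cmd.exe) to confirm the GPU and CUDA are visible
nvidia-smi
```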
- Internet Connection: Check your internet connection
- Disk Space: Ensure sufficient space in `%LOCALAPPDATA%\FasterWhisperGUI\_models\`
- Firewall: Check if firewall is blocking the download
- Automatic Download: Models download automatically when first used - be patient
- First Run: Complete the installation process by downloading and installing faster-whisper-xxl.exe
- Installation Required: If you see this error, restart the application to trigger the installation dialog
- Manual Installation: If the installation dialog doesn't appear, ensure faster-whisper-xxl.exe is in `%LOCALAPPDATA%\FasterWhisperGUI\`
- Corrupted Files: Delete `%LOCALAPPDATA%\FasterWhisperGUI\` and restart (you'll need to reinstall)
- Permissions: Ensure you have read/write permissions to the AppData folder
- Check Output Files: If output files were generated, dialogs should appear
- Non-Zero Exit Code: Dialogs appear even if process returns warnings (as long as outputs are generated)
- Manual Access: Use "Identify & Replace Speakers" or "Remove Timestamps" buttons if dialogs don't auto-open
- Location: `C:\Users\[YourUsername]\AppData\Local\FasterWhisperGUI\`
- Contents:
  - `faster-whisper-xxl.exe` - Core transcription engine (downloaded and installed separately)
  - `ffmpeg.exe` - Audio/video processing (included with faster-whisper-xxl)
  - `ffprobe.exe` - Media file analysis (included with faster-whisper-xxl, optional)
  - `_xxl_data\` - Runtime dependencies (included with faster-whisper-xxl)
- Location: `C:\Users\[YourUsername]\AppData\Local\FasterWhisperGUI\_models\`
- Contents: Whisper models (downloaded automatically when needed)
- Size: Models can be several GB each
- Models Available: `tiny`, `small`, `medium`, `large-v1`, `large-v2`, `large-v3`, `large-v3-turbo`
- Location: `C:\Users\[YourUsername]\.faster_whisper_gui_config.json`
- Contents: User preferences (show best practices dialog, etc.)
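As a sketch of what this file might hold - the key name below is hypothetical and the actual keys are defined by the application:

```json
{
  "show_best_practices": false
}
```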
When "Standard Preset" is disabled, you can customize subtitle formatting:
- Max Line Width: Maximum characters per subtitle line (default: 70)
- Max Line Count: Maximum lines per subtitle segment (default: 3)
- Max Comma Percentage: Percentage of line width before breaking at commas (default: 90%)
- Sentence Mode: Split subtitles at sentence boundaries (auto-enabled with diarization)
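For instance, with Max Line Width 70 and Max Line Count 3, a long segment is wrapped into at most three lines of at most 70 characters before a new subtitle block is started. The text below is illustrative only; actual break points also depend on the comma-break and sentence-mode settings:

```
00:01:02,000 --> 00:01:06,500
We came over to your house earlier this afternoon because we wanted to
talk about the schedule for next week and who would be picking up
the equipment from the studio.
```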
- Temperature: Controls randomness (0.0 = deterministic, recommended)
- Beam Size: Number of candidates to consider (1-10, higher = more accurate but slower)
- Patience: How long to wait before finalizing segments (higher = more patient)
- Best Of: Number of candidates when sampling (default: 5)
- Length Penalty: Token length penalty coefficient (default: 1.0)
- Repetition Penalty: Penalty for repeating tokens (default: 1.0)
Most users should leave these at default values unless experiencing specific issues.
This application is built on top of excellent open-source projects:
- faster-whisper-xxl.exe by Purfview - The core transcription engine that powers this application. Note: faster-whisper-xxl.exe must be downloaded separately from the GitHub repository - it is not bundled with this GUI application. Special thanks to Purfview for creating and maintaining the standalone Windows executable.
- Reverb by Rev.com - Used for speaker diarization functionality. The Reverb model is used under the Rev Model Non-Production License (see LICENSE_REVERB.txt).
- Faster Whisper by Guillaume Klein - The optimized Whisper implementation that provides fast and accurate transcription.
- OpenAI Whisper by OpenAI - The state-of-the-art speech recognition model that makes accurate transcription possible.
- PyQt6 - The GUI framework that provides the user interface.
- FFmpeg - Audio and video processing capabilities.
This application is for NON-COMMERCIAL USE ONLY.
This GUI application is provided as-is for non-commercial use. Commercial use is not permitted.
This application includes the following license files:
- LICENSE_NOTICE.txt - General license notice and attribution information
- LICENSE_REVERB.txt - License for Reverb diarization model (Rev Model Non-Production License)
- faster-whisper-xxl.exe by Purfview - Must be downloaded separately. Please refer to the Purfview repository for licensing information.
- Faster Whisper by Guillaume Klein - Licensed under the MIT License. For full license terms, see: https://github.com/guillaumekln/faster-whisper/blob/master/LICENSE
- OpenAI Whisper by OpenAI - Licensed under the MIT License. For full license terms, see: https://github.com/openai/whisper/blob/main/LICENSE
- Reverb by Rev.com - Used for speaker diarization. Licensed under the Rev Model Non-Production License. See LICENSE_REVERB.txt for full terms. This license restricts use to non-production environments and non-commercial purposes.
- This GUI application: Non-commercial use only
- faster-whisper-xxl.exe: Subject to Purfview's license terms
- Faster Whisper: MIT License (see link above)
- OpenAI Whisper: MIT License (see link above)
- Reverb: Subject to Rev Model Non-Production License (non-production, non-commercial use only)
For commercial use or production environments, please contact the respective license holders for appropriate licensing.
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
For issues, questions, or feature requests, please open an issue on GitHub.