This project provides a high-performance pipeline for transcribing YouTube videos, playlists, and channels. It leverages faster-whisper for efficient audio transcription and yt-dlp for robust video/audio downloading. The pipeline is optimized for Apple Silicon (M1/M2/M3) Macs but also supports other platforms. It features batch processing, automatic resource management, error handling, and detailed progress tracking. The core design principle is to provide a fast, reliable, and user-friendly tool for generating accurate transcriptions of YouTube content.
- Batch Processing: Processes multiple YouTube URLs from a CSV file, supporting videos, playlists, and channels.
- Optimized for Apple Silicon:
  - Utilizes `int8` quantization for CPU on Apple Silicon, providing a balance between speed and accuracy.
  - Automatic adjustment of OpenMP threads for optimal CPU utilization.
- Resource-Aware Processing:
  - Dynamically adjusts the number of worker processes based on available system memory.
  - Implements a memory monitoring thread to prevent excessive memory usage.
- Thread-Safe Concurrency: Uses `multiprocessing` and `threading` for concurrent downloads and transcriptions, maximizing throughput.
- Detailed Progress Tracking: Provides real-time progress updates via `tqdm` progress bars for both downloads and transcriptions.
- Fault Tolerance:
  - Implements retry mechanisms for failed downloads.
  - Tracks failed URLs and errors in a separate JSON file (`failed_urls.json`).
  - Handles various exceptions gracefully, logging errors and continuing processing.
- Duplicate Handling: Skips videos for which transcripts already exist, avoiding redundant processing. Checks both a `processed.json` file and the output directory.
- Configuration Flexibility: Allows customization of key parameters through a `config.py` file (e.g., number of workers, Whisper model, download settings).
- Structured Output: Saves transcripts in a well-defined JSON format, including metadata like video ID, channel, title, upload date, and the transcript itself.
- Comprehensive Logging: Uses `loguru` for detailed logging, including error tracking and debugging information. Logs are saved to a file and rotated.
- Python: Requires Python 3.8 or higher.
- FFmpeg: Ensure FFmpeg is installed and available in your system's PATH. FFmpeg is used for audio extraction and processing.
- Homebrew (macOS): Recommended for installing dependencies on macOS.
- Clone the Repository:

  ```bash
  git clone <repository-url>
  cd youtube-transcriptions
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Apple Silicon (M1/M2/M3) Specific Setup (Optional but Recommended):
  - Install OpenMP for optimal performance:

    ```bash
    brew install libomp
    ```

  - This project automatically sets the `OMP_NUM_THREADS` environment variable.
- Create a CSV Input File:

  Create a CSV file (e.g., `urls.csv`) containing the YouTube URLs you want to process. The CSV must have a `url` column. A `type` column is optional (for documentation; the script auto-detects the type).

  ```csv
  url,type
  https://www.youtube.com/watch?v=dQw4w9WgXcQ,video
  https://www.youtube.com/playlist?list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRgv3,playlist
  https://www.youtube.com/c/SomeChannel,channel
  ```
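If you prefer to generate the input file programmatically, a minimal sketch using Python's standard `csv` module (the file name and URLs are just examples):

```python
import csv

# Example rows; the `type` column is optional documentation only --
# the pipeline auto-detects whether a URL is a video, playlist, or channel.
rows = [
    {"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "type": "video"},
    {"url": "https://www.youtube.com/playlist?list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRgv3", "type": "playlist"},
]

with open("urls.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "type"])
    writer.writeheader()   # the pipeline requires the `url` header
    writer.writerows(rows)
```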
- Run the Transcription Pipeline:

  Use the `transcribe.py` script to start the process.

  - Using a CSV file:

    ```bash
    python transcribe.py --input urls.csv
    ```

  - Using a single YouTube URL (overrides CSV):

    ```bash
    python transcribe.py --url <YOUTUBE_URL>
    ```

  - Alternatively, you can set the `CSV_FILE` or `YOUTUBE_URL` variables in `config.py`.
- Output:

  Transcripts are saved as JSON files in the `output` directory. The filename format is `YYYY-MM-DD@Video-Title@video_id.json`. The pipeline also writes:

  - `batch_stats.json`: Contains overall processing statistics.
  - `failed_urls.json`: Lists any URLs that failed to download or transcribe, along with error details.
  - `processed.json`: Keeps track of successfully processed video IDs to prevent reprocessing.
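Because the filename format is fixed, the metadata can be recovered from a path without opening the file. A small sketch (the helper name is illustrative, not part of the project's API):

```python
from pathlib import Path

def parse_transcript_filename(path: str) -> dict:
    """Split the documented `YYYY-MM-DD@Video-Title@video_id.json` pattern.

    Partitioning on the first and last `@` keeps titles that themselves
    contain `@` intact.
    """
    stem = Path(path).stem                     # drop the .json extension
    date, _, rest = stem.partition("@")        # leading upload date
    title, _, video_id = rest.rpartition("@")  # trailing video ID
    return {"upload_date": date, "title": title, "video_id": video_id}

print(parse_transcript_filename("output/2024-01-15@Some-Talk@dQw4w9WgXcQ.json"))
# → {'upload_date': '2024-01-15', 'title': 'Some-Talk', 'video_id': 'dQw4w9WgXcQ'}
```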
The `config.py` file allows you to customize the pipeline's behavior. Here are some key configuration options:
| Parameter | Description | Default Value |
|---|---|---|
| `OUTPUT_DIR` | Directory to save the output transcripts. | `output` |
| `TEMP_DIR` | Directory for temporary files (audio downloads). | `temp` |
| `MAX_CONCURRENT_DOWNLOADS` | Maximum number of concurrent audio downloads. | `8` |
| `DOWNLOAD_RETRY_ATTEMPTS` | Number of times to retry a failed download. | `5` |
| `DOWNLOAD_TIMEOUT` | Download timeout in seconds. | `120` |
| `DOWNLOAD_RATE_LIMIT` | Optional download rate limit (e.g., `"50M"` for 50 MB/s). | `None` (defaults to 10 MB/s) |
| `NUM_WORKERS` | Number of worker processes for transcription. Adjust based on your CPU cores. | `4` |
| `BATCH_SIZE` | (Currently unused) Intended for future batching of transcription tasks. | `4` |
| `CSV_FILE` | Path to the CSV file containing YouTube URLs. Can be overridden by the `--input` command-line argument. | `None` |
| `YOUTUBE_URL` | A single YouTube URL to transcribe. Overrides `CSV_FILE` if set. | A default playlist URL |
| `WHISPER_MODEL` | The Whisper model size to use. Options: `tiny`, `base`, `small`, `medium`, `large-v3`. Smaller models are faster but less accurate. | `small` |
| `WHISPER_DEVICE` | The device to use for Whisper inference. `cpu` is recommended for M1/M2/M3 Macs with int8 quantization. Use `cuda` for NVIDIA GPUs. | `cpu` |
| `WHISPER_COMPUTE_TYPE` | Computation type for Whisper. `int8` is recommended for CPU usage on Apple Silicon for best performance. | `int8` |
| `MAX_MEMORY_PERCENT` | Memory usage threshold (percentage) that triggers a reduction in worker processes. | `85` |
| `MEMORY_THRESHOLD` | Same as `MAX_MEMORY_PERCENT`. | `85` |
| `YTDL_FORMAT` | Format string for yt-dlp. `bestaudio/best` selects the best available audio quality. | `bestaudio/best` |
| `YTDL_OPTS` | Dictionary of yt-dlp options. See the yt-dlp documentation for details. Includes settings for audio extraction, retries, timeouts, and rate limiting. The `outtmpl` is automatically set based on `TEMP_DIR`. | See `config.py` for default settings. |
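To illustrate the shape of these settings, here is a simplified stand-in for the configuration object. The real `config.py` uses pydantic for type-safe validation; this sketch uses a stdlib dataclass so it runs anywhere, with values mirroring the defaults in the table above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Settings:
    """Simplified stand-in for the pydantic model in config.py."""
    OUTPUT_DIR: str = "output"
    TEMP_DIR: str = "temp"
    MAX_CONCURRENT_DOWNLOADS: int = 8
    DOWNLOAD_RETRY_ATTEMPTS: int = 5
    DOWNLOAD_TIMEOUT: int = 120              # seconds
    DOWNLOAD_RATE_LIMIT: Optional[str] = None
    NUM_WORKERS: int = 4
    WHISPER_MODEL: str = "small"
    WHISPER_DEVICE: str = "cpu"              # recommended on Apple Silicon
    WHISPER_COMPUTE_TYPE: str = "int8"
    MAX_MEMORY_PERCENT: int = 85

# Override per machine, e.g. fewer workers on a low-memory laptop:
settings = Settings(NUM_WORKERS=2)
```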
- `transcribe.py`: The main entry point for the application. Handles command-line arguments, sets up logging, and initiates the batch processing.
- `batch_processor.py`: Contains the core logic for processing batches of YouTube URLs. Manages downloading, transcription, and statistics tracking. Implements the multiprocessing and threading strategy.
- `config.py`: Defines configuration parameters and settings for the pipeline. Uses `pydantic` for type-safe configuration management.
- `utils.py`: Provides utility functions for various tasks, such as extracting video IDs, sanitizing filenames, managing processed video lists, and saving transcripts.
- `output/`: Directory where the generated transcripts (JSON files) and logs are stored. Excluded from Git.
- `temp/`: Directory for temporary audio files downloaded by yt-dlp. Excluded from Git.
- Multiprocessing: Uses `multiprocessing.ProcessPoolExecutor` to run multiple transcription tasks concurrently, leveraging multiple CPU cores. The `worker_initializer` function ensures each process has its own Whisper model instance.
- Threading: Employs `threading.ThreadPoolExecutor` for concurrent audio downloads, improving I/O-bound performance.
- Apple Silicon (M1/M2/M3) Optimizations:
  - `WHISPER_DEVICE = "cpu"` and `WHISPER_COMPUTE_TYPE = "int8"`: Forces Whisper to use the CPU with int8 quantization, which is significantly faster than MPS (Metal Performance Shaders) on Apple Silicon for this specific task.
  - `OMP_NUM_THREADS`: Automatically set to half the number of CPU cores, preventing thread oversubscription and improving performance.
  - `PYTORCH_ENABLE_MPS_FALLBACK=1`: Allows PyTorch to fall back to the CPU if MPS is unavailable.
  - `CT2_USE_MKL=0`: Disables MKL, as it can conflict with OpenMP.
- Memory Monitoring: A dedicated thread monitors memory usage and logs warnings if it exceeds a configured threshold. This helps prevent out-of-memory errors.
- Efficient Metadata Extraction: Extracts metadata for all URLs in a single `yt-dlp` call, reducing overhead.
- Pipelined Processing: Downloads and transcriptions are pipelined. As soon as an audio file is downloaded, its transcription is submitted to the process pool, maximizing resource utilization.
- Optimized Transcript Generation: Uses list comprehensions for efficient in-memory transcript generation.
- Rate Limiting: The `DOWNLOAD_RATE_LIMIT` option in `config.py` allows you to control the download speed to avoid overwhelming the network or triggering rate limits from YouTube.
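The pipelining strategy above can be sketched as follows. This is a simplified, thread-only model: the real pipeline hands transcription to a `ProcessPoolExecutor` whose worker initializer loads one Whisper model per process, and `download_audio` / `transcribe` here are stand-ins, not the project's actual functions.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

# Environment tuning the project applies before any model loads
# (settings grounded in the optimizations described above).
os.environ.setdefault("OMP_NUM_THREADS", str(max(1, (os.cpu_count() or 2) // 2)))
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
os.environ.setdefault("CT2_USE_MKL", "0")

def download_audio(url: str) -> str:
    """Stand-in for the yt-dlp download step (returns a local audio path)."""
    return f"temp/{url.rsplit('=', 1)[-1]}.m4a"

def transcribe(audio_path: str) -> str:
    """Stand-in for the faster-whisper transcription step."""
    return f"transcript of {audio_path}"

def run_pipeline(urls, max_downloads=8, num_workers=4):
    # In the real pipeline the second pool is a ProcessPoolExecutor with a
    # per-process model initializer; threads keep this sketch runnable anywhere.
    with ThreadPoolExecutor(max_downloads) as dl_pool, \
         ThreadPoolExecutor(num_workers) as tx_pool:
        dl_futures = [dl_pool.submit(download_audio, u) for u in urls]
        tx_futures = []
        # Pipelining: submit each transcription the moment its download finishes,
        # instead of waiting for the whole download batch.
        for fut in as_completed(dl_futures):
            tx_futures.append(tx_pool.submit(transcribe, fut.result()))
        return [f.result() for f in tx_futures]
```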
- Memory Issues:
  - If you encounter out-of-memory errors, reduce `NUM_WORKERS` in `config.py`. The pipeline automatically reduces workers if memory usage exceeds `MAX_MEMORY_PERCENT`.
  - Monitor memory usage via the built-in memory monitoring thread, which logs warnings.

- Download Failures:
  - Check the `failed_urls.json` file in the `output` directory for details on failed downloads.
  - Ensure `yt-dlp` is up-to-date (`pip install --upgrade yt-dlp`).
  - Network connectivity issues can cause downloads to fail.

- Transcription Errors:
  - Check the `transcription.log` file in the `output` directory for detailed error messages.
  - Ensure the Whisper model is correctly specified in `config.py`.
  - If using a GPU, ensure you have the necessary CUDA drivers and libraries installed.

- M1/M2/M3 Specific Issues:
  - Ensure `libomp` is installed (`brew install libomp`).

- Empty Transcripts:
  - If a transcript is empty, check the logs. It may be that no speech was detected, or the audio quality was too poor.
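The automatic worker reduction mentioned above can be illustrated with a small sketch. The actual mechanism lives in `batch_processor.py`'s monitoring thread, which reads live memory usage; this pure-function version takes the current memory percentage as an argument so it stays dependency-free, and the function name is illustrative:

```python
def adjust_workers(current_workers: int, memory_percent: float,
                   max_memory_percent: float = 85.0) -> int:
    """Halve the worker count while memory pressure exceeds the threshold.

    Sketch of the memory-aware scaling described above; the real pipeline
    feeds this decision from a live memory reading in its monitor thread.
    """
    if memory_percent > max_memory_percent and current_workers > 1:
        return max(1, current_workers // 2)
    return current_workers
```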
- Batching for Transcription: Implement actual batching of audio segments for Whisper processing (currently, `BATCH_SIZE` is unused). This could further improve performance, especially on GPUs.
- Web Interface: Develop a web interface for easier interaction and management of transcription tasks.
- Support for Other Languages: Extend the pipeline to support transcription in languages other than English.
- Speaker Diarization: Integrate speaker diarization to identify different speakers in the audio.
- Improved Error Handling: Implement more granular error handling and reporting, potentially with automatic retries for specific transcription errors.
- Configuration via Environment Variables: Allow overriding configuration settings using environment variables for easier deployment in containerized environments.