felipauskas/youtube-transcriptions

YouTube Transcription Pipeline

Project Overview

This project provides a high-performance pipeline for transcribing YouTube videos, playlists, and channels. It leverages faster-whisper for efficient audio transcription and yt-dlp for robust video/audio downloading. The pipeline is optimized for Apple Silicon (M1/M2/M3) Macs but also supports other platforms. It features batch processing, automatic resource management, error handling, and detailed progress tracking. The core design principle is to provide a fast, reliable, and user-friendly tool for generating accurate transcriptions of YouTube content.

Features

  • Batch Processing: Processes multiple YouTube URLs from a CSV file, supporting videos, playlists, and channels.
  • Optimized for Apple Silicon:
    • Utilizes int8 quantization for CPU on Apple Silicon, providing a balance between speed and accuracy.
    • Automatic adjustment of OpenMP threads for optimal CPU utilization.
  • Resource-Aware Processing:
    • Dynamically adjusts the number of worker processes based on available system memory.
    • Implements a memory monitoring thread to prevent excessive memory usage.
  • Thread-Safe Concurrency: Uses multiprocessing and threading for concurrent downloads and transcriptions, maximizing throughput.
  • Detailed Progress Tracking: Provides real-time progress updates via tqdm progress bars for both downloads and transcriptions.
  • Fault Tolerance:
    • Implements retry mechanisms for failed downloads.
    • Tracks failed URLs and errors in a separate JSON file (failed_urls.json).
    • Handles various exceptions gracefully, logging errors and continuing processing.
  • Duplicate Handling: Skips videos for which transcripts already exist, avoiding redundant processing. Checks both a processed.json file and the output directory.
  • Configuration Flexibility: Allows customization of key parameters through a config.py file (e.g., number of workers, Whisper model, download settings).
  • Structured Output: Saves transcripts in a well-defined JSON format, including metadata like video ID, channel, title, upload date, and the transcript itself.
  • Comprehensive Logging: Uses loguru for detailed logging, including error tracking and debugging information. Logs are saved to a file and rotated.
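The duplicate-handling check can be sketched roughly as follows. The helper name `already_processed` is illustrative, but the two lookups it performs — the `processed.json` registry and a scan of the output directory — are the ones the feature list describes:

```python
import json
from pathlib import Path

def already_processed(video_id: str, output_dir: Path, processed_file: Path) -> bool:
    """Return True if a transcript for video_id already exists.

    Checks the processed.json registry first, then falls back to scanning
    the output directory for a file whose name ends with the video id
    (transcripts are named YYYY-MM-DD@Video-Title@video_id.json).
    """
    if processed_file.exists():
        processed = json.loads(processed_file.read_text())
        if video_id in processed:
            return True
    return any(output_dir.glob(f"*@{video_id}.json"))
```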

Installation & Setup

Prerequisites

  1. Python: Requires Python 3.8 or higher.
  2. FFmpeg: Ensure FFmpeg is installed and available in your system's PATH. FFmpeg is used for audio extraction and processing.
  3. Homebrew (macOS): Recommended for installing dependencies on macOS.

Installation Steps

  1. Clone the Repository:

    git clone <repository-url>
    cd youtube-transcriptions
  2. Install Dependencies:

    pip install -r requirements.txt
  3. Apple Silicon (M1/M2/M3) Specific Setup (Optional but Recommended):

    • Install OpenMP for optimal performance:
      brew install libomp
    • This project automatically sets the OMP_NUM_THREADS environment variable.

Usage Instructions

  1. Create a CSV Input File:

    Create a CSV file (e.g., urls.csv) containing the YouTube URLs you want to process. The CSV must have a url column. A type column is optional (for documentation; the script auto-detects the type).

    url,type
    https://www.youtube.com/watch?v=dQw4w9WgXcQ,video
    https://www.youtube.com/playlist?list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRgv3,playlist
    https://www.youtube.com/c/SomeChannel,channel
  2. Run the Transcription Pipeline:

    Use the transcribe.py script to start the process.

    • Using a CSV file:
      python transcribe.py --input urls.csv
    • Using a single YouTube URL (overrides CSV):
      python transcribe.py --url <YOUTUBE_URL>
    • You can set the CSV_FILE or YOUTUBE_URL variables in config.py.
  3. Output:

    Transcripts are saved as JSON files in the output directory. The filename format is:

    YYYY-MM-DD@Video-Title@video_id.json
    
    • batch_stats.json: Contains overall processing statistics.
    • failed_urls.json: Lists any URLs that failed to download or transcribe, along with error details.
    • processed.json: Keeps track of successfully processed video IDs to prevent reprocessing.
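The filename scheme above can be produced with a small helper along these lines. Both function names are hypothetical stand-ins for the project's actual logic in utils.py (which sanitizes titles before building the name); only the `YYYY-MM-DD@Video-Title@video_id.json` pattern comes from this README:

```python
import re

def sanitize(title: str) -> str:
    """Replace filesystem-unsafe characters in a title with hyphens
    (illustrative; the real sanitizer lives in utils.py)."""
    return re.sub(r"[^\w\-]+", "-", title).strip("-")

def transcript_filename(upload_date: str, title: str, video_id: str) -> str:
    """Build the YYYY-MM-DD@Video-Title@video_id.json output filename."""
    return f"{upload_date}@{sanitize(title)}@{video_id}.json"
```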

Configuration Details (config.py)

The config.py file allows you to customize the pipeline's behavior. Here are some key configuration options:

| Parameter | Description | Default |
| --- | --- | --- |
| OUTPUT_DIR | Directory to save the output transcripts. | output |
| TEMP_DIR | Directory for temporary files (audio downloads). | temp |
| MAX_CONCURRENT_DOWNLOADS | Maximum number of concurrent audio downloads. | 8 |
| DOWNLOAD_RETRY_ATTEMPTS | Number of times to retry a failed download. | 5 |
| DOWNLOAD_TIMEOUT | Download timeout in seconds. | 120 |
| DOWNLOAD_RATE_LIMIT | Optional download rate limit (e.g., "50M" for 50 MB/s). When unset, the yt-dlp options default to 10 MB/s. | None |
| NUM_WORKERS | Number of worker processes for transcription; adjust to your CPU core count. | 4 |
| BATCH_SIZE | Currently unused; reserved for future batching of transcription tasks. | 4 |
| CSV_FILE | Path to the CSV file containing YouTube URLs. Can be overridden by the --input command-line argument. | None |
| YOUTUBE_URL | A single YouTube URL to transcribe. Overrides CSV_FILE if set. | A default playlist URL |
| WHISPER_MODEL | Whisper model size: tiny, base, small, medium, or large-v3. Smaller models are faster but less accurate. | small |
| WHISPER_DEVICE | Device for Whisper inference. cpu is recommended for M1/M2/M3 Macs with int8 quantization; use cuda for NVIDIA GPUs. | cpu |
| WHISPER_COMPUTE_TYPE | Computation type for Whisper. int8 gives the best CPU performance on Apple Silicon. | int8 |
| MAX_MEMORY_PERCENT | Memory-usage threshold (percent) that triggers a reduction in worker processes. | 85 |
| MEMORY_THRESHOLD | Same as MAX_MEMORY_PERCENT. | 85 |
| YTDL_FORMAT | Format string for yt-dlp; bestaudio/best selects the best available audio quality. | bestaudio/best |
| YTDL_OPTS | Dictionary of yt-dlp options (audio extraction, retries, timeouts, rate limiting); outtmpl is set automatically from TEMP_DIR. See the yt-dlp documentation for details. | See config.py |
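A minimal config.py matching the defaults in the table might look like the sketch below. The real file uses pydantic for type-safe settings; this plain-constant version only illustrates the documented values:

```python
# config.py — minimal sketch of the defaults documented above.
# The shipped config.py wraps these in a pydantic settings model.
OUTPUT_DIR = "output"
TEMP_DIR = "temp"
MAX_CONCURRENT_DOWNLOADS = 8
DOWNLOAD_RETRY_ATTEMPTS = 5
DOWNLOAD_TIMEOUT = 120            # seconds
DOWNLOAD_RATE_LIMIT = None        # e.g. "50M"; when unset, yt-dlp opts default to 10 MB/s
NUM_WORKERS = 4                   # worker processes for transcription
WHISPER_MODEL = "small"           # tiny | base | small | medium | large-v3
WHISPER_DEVICE = "cpu"            # "cuda" for NVIDIA GPUs
WHISPER_COMPUTE_TYPE = "int8"     # best CPU performance on Apple Silicon
MAX_MEMORY_PERCENT = 85           # triggers worker reduction above this usage
YTDL_FORMAT = "bestaudio/best"
```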

File Structure & Key Components

  • transcribe.py: The main entry point for the application. Handles command-line arguments, sets up logging, and initiates the batch processing.
  • batch_processor.py: Contains the core logic for processing batches of YouTube URLs. Manages downloading, transcription, and statistics tracking. Implements the multiprocessing and threading strategy.
  • config.py: Defines configuration parameters and settings for the pipeline. Uses pydantic for type-safe configuration management.
  • utils.py: Provides utility functions for various tasks, such as extracting video IDs, sanitizing filenames, managing processed video lists, and saving transcripts.
  • output/: Directory where the generated transcripts (JSON files) and logs are stored. Excluded from Git.
  • temp/: Directory for temporary audio files downloaded by yt-dlp. Excluded from Git.

Performance Optimizations

  • Multiprocessing: Uses multiprocessing.ProcessPoolExecutor to run multiple transcription tasks concurrently, leveraging multiple CPU cores. The worker_initializer function ensures each process has its own Whisper model instance.
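The per-process model pattern can be sketched as follows. In the real batch_processor, `load_model` would return `faster_whisper.WhisperModel(WHISPER_MODEL, device=WHISPER_DEVICE, compute_type=WHISPER_COMPUTE_TYPE)`; a trivial placeholder stands in here so the sketch runs without the library installed:

```python
from concurrent.futures import ProcessPoolExecutor

_model = None  # one model instance per worker process

def load_model():
    """Placeholder for faster_whisper.WhisperModel(...)."""
    return object()

def worker_initializer():
    """Runs once in each worker process, so every worker
    loads its own Whisper model exactly once."""
    global _model
    _model = load_model()

def transcribe(path: str) -> str:
    """In the real pipeline this would call _model.transcribe(path)."""
    assert _model is not None, "worker_initializer did not run"
    return f"transcript of {path}"

# Usage: ProcessPoolExecutor(max_workers=NUM_WORKERS,
#                            initializer=worker_initializer)
```

Loading the model in the initializer rather than per task avoids re-loading the weights for every audio file, which dominates startup cost for Whisper.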
  • Threading: Employs threading.ThreadPoolExecutor for concurrent audio downloads, improving I/O-bound performance.
  • Apple Silicon (M1/M2/M3) Optimizations:
    • WHISPER_DEVICE = "cpu" and WHISPER_COMPUTE_TYPE = "int8": Forces Whisper to use the CPU with int8 quantization, which is significantly faster than MPS (Metal Performance Shaders) on Apple Silicon for this specific task.
    • OMP_NUM_THREADS: Automatically sets the number of OpenMP threads to half the number of CPU cores, preventing thread oversubscription and improving performance.
    • PYTORCH_ENABLE_MPS_FALLBACK=1: Lets PyTorch fall back to the CPU for operations that MPS does not support.
    • CT2_USE_MKL=0: Disables MKL, as it can conflict with OpenMP.
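The environment tweaks above amount to a few lines at process startup. The variable names and values are the ones documented here; the helper itself is an illustrative sketch that must run before any OpenMP/ctranslate2 code is imported:

```python
import os

def configure_apple_silicon_env() -> None:
    """Apply the Apple Silicon environment settings described above."""
    # Half the cores avoids OpenMP thread oversubscription.
    os.environ.setdefault("OMP_NUM_THREADS", str(max(1, (os.cpu_count() or 2) // 2)))
    # Fall back to CPU for operations MPS does not support.
    os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
    # MKL can conflict with OpenMP on Apple Silicon.
    os.environ.setdefault("CT2_USE_MKL", "0")
```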
  • Memory Monitoring: A dedicated thread monitors memory usage and logs warnings if it exceeds a configured threshold. This helps prevent out-of-memory errors.
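A monitor thread of this kind can be sketched as below. The reader function is injected so the sketch stays self-contained; the real pipeline would pass something like `lambda: psutil.virtual_memory().percent`, and the function and parameter names here are illustrative:

```python
import threading
import time

def start_memory_monitor(get_percent, threshold=85, interval=1.0, on_warning=print):
    """Run a daemon thread that samples get_percent() every `interval`
    seconds and calls on_warning when usage crosses `threshold`.
    Returns an Event the caller sets to stop monitoring."""
    stop = threading.Event()

    def monitor():
        while not stop.is_set():
            used = get_percent()
            if used > threshold:
                on_warning(f"memory usage {used:.0f}% exceeds {threshold}%")
            stop.wait(interval)

    threading.Thread(target=monitor, daemon=True).start()
    return stop
```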
  • Efficient Metadata Extraction: Extracts metadata for all URLs in a single yt-dlp call, reducing overhead.
  • Pipelined Processing: Downloads and transcriptions are pipelined. As soon as an audio file is downloaded, its transcription is submitted to the process pool, maximizing resource utilization.
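The pipelining described above can be sketched with stub download/transcribe functions. The real pipeline hands transcriptions to a ProcessPoolExecutor; a second thread pool stands in here so the sketch is self-contained, and the function names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download(url: str) -> str:
    """Stub download: returns the path of the 'downloaded' audio file."""
    return url.rsplit("=", 1)[-1] + ".mp3"

def transcribe(path: str) -> str:
    """Stub transcription standing in for the Whisper worker."""
    return f"transcript of {path}"

def run_pipeline(urls):
    """Submit each transcription as soon as its download finishes, so
    downloading and transcribing overlap rather than run in phases."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as dl_pool, \
         ThreadPoolExecutor(max_workers=2) as tx_pool:
        tx_futures = {}
        for fut in as_completed({dl_pool.submit(download, u): u for u in urls}):
            path = fut.result()
            tx_futures[tx_pool.submit(transcribe, path)] = path
        for fut in as_completed(tx_futures):
            results[tx_futures[fut]] = fut.result()
    return results
```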
  • Optimized Transcript Generation: Uses list comprehensions for efficient in-memory transcript generation.
  • Rate Limiting: The DOWNLOAD_RATE_LIMIT option in config.py allows you to control the download speed to avoid overwhelming the network or triggering rate limits from YouTube.
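Wiring a "50M"-style limit into yt-dlp might look like the sketch below. Note that yt-dlp's Python API expects `ratelimit` as a number in bytes/sec (unlike the CLI flag, which accepts suffixed strings); both helper names are illustrative, and the options shown are a subset of the real YTDL_OPTS:

```python
def parse_rate_limit(limit):
    """Convert a '50M'-style limit to bytes/sec for yt-dlp's `ratelimit`
    option, or pass through None / plain numbers."""
    if limit is None:
        return None
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = limit[-1].upper()
    if suffix in units:
        return int(float(limit[:-1]) * units[suffix])
    return int(limit)

def build_ydl_opts(rate_limit=None):
    """Illustrative subset of YTDL_OPTS; see config.py for the full dict."""
    opts = {"format": "bestaudio/best", "retries": 5, "socket_timeout": 120}
    limit = parse_rate_limit(rate_limit)
    if limit:
        opts["ratelimit"] = limit
    return opts
```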

Common Issues & Troubleshooting

  1. Memory Issues:

    • If you encounter out-of-memory errors, reduce NUM_WORKERS in config.py. The pipeline automatically reduces workers if memory usage exceeds MAX_MEMORY_PERCENT.
    • Monitor memory usage using the built-in memory monitoring thread (logs warnings).
  2. Download Failures:

    • Check the failed_urls.json file in the output directory for details on failed downloads.
    • Ensure yt-dlp is up-to-date (pip install --upgrade yt-dlp).
    • Network connectivity issues can cause downloads to fail.
  3. Transcription Errors:

    • Check the transcription.log file in the output directory for detailed error messages.
    • Ensure the Whisper model is correctly specified in config.py.
    • If using a GPU, ensure you have the necessary CUDA drivers and libraries installed.
  4. M1/M2/M3 Specific Issues:

    • Ensure libomp is installed (brew install libomp).
  5. Empty Transcripts:

    • If a transcript is empty, check the logs. It may be that no speech was detected, or the audio quality was too poor.

Future Enhancements

  • Batching for Transcription: Implement actual batching of audio segments for Whisper processing (currently, BATCH_SIZE is unused). This could further improve performance, especially on GPUs.
  • Web Interface: Develop a web interface for easier interaction and management of transcription tasks.
  • Support for Other Languages: Extend the pipeline to support transcription in languages other than English.
  • Speaker Diarization: Integrate speaker diarization to identify different speakers in the audio.
  • Improved Error Handling: Implement more granular error handling and reporting, potentially with automatic retries for specific transcription errors.
  • Configuration via Environment Variables: Allow overriding configuration settings using environment variables for easier deployment in containerized environments.
