felipauskas/youtube-transcriptions

YouTube Transcription Pipeline

Project Overview

This project provides a high-performance pipeline for transcribing YouTube videos, playlists, and channels. It leverages faster-whisper for efficient audio transcription and yt-dlp for robust video/audio downloading. The pipeline is optimized for Apple Silicon (M1/M2/M3) Macs but also supports other platforms. It features batch processing, automatic resource management, error handling, and detailed progress tracking. The core design principle is to provide a fast, reliable, and user-friendly tool for generating accurate transcriptions of YouTube content.

Features

  • Batch Processing: Processes multiple YouTube URLs from a CSV file, supporting videos, playlists, and channels.
  • Optimized for Apple Silicon:
    • Utilizes int8 quantization for CPU on Apple Silicon, providing a balance between speed and accuracy.
    • Automatic adjustment of OpenMP threads for optimal CPU utilization.
  • Resource-Aware Processing:
    • Dynamically adjusts the number of worker processes based on available system memory.
    • Implements a memory monitoring thread to prevent excessive memory usage.
  • Thread-Safe Concurrency: Uses multiprocessing and threading for concurrent downloads and transcriptions, maximizing throughput.
  • Detailed Progress Tracking: Provides real-time progress updates via tqdm progress bars for both downloads and transcriptions.
  • Fault Tolerance:
    • Implements retry mechanisms for failed downloads.
    • Tracks failed URLs and errors in a separate JSON file (failed_urls.json).
    • Handles various exceptions gracefully, logging errors and continuing processing.
  • Duplicate Handling: Skips videos for which transcripts already exist, avoiding redundant processing. Checks both a processed.json file and the output directory.
  • Configuration Flexibility: Allows customization of key parameters through a config.py file (e.g., number of workers, Whisper model, download settings).
  • Structured Output: Saves transcripts in a well-defined JSON format, including metadata like video ID, channel, title, upload date, and the transcript itself.
  • Comprehensive Logging: Uses loguru for detailed logging, including error tracking and debugging information. Logs are saved to a file and rotated.
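The duplicate-handling check can be sketched roughly as follows. The helper name `already_processed` is illustrative, but the two lookups it performs — the `processed.json` registry and a scan of the output directory — are the ones the feature list describes:

```python
import json
from pathlib import Path

def already_processed(video_id: str, output_dir: Path, processed_file: Path) -> bool:
    """Return True if a transcript for video_id already exists.

    Checks the processed.json registry first, then falls back to scanning
    the output directory for a file whose name ends with the video id
    (transcripts are named YYYY-MM-DD@Video-Title@video_id.json).
    """
    if processed_file.exists():
        processed = json.loads(processed_file.read_text())
        if video_id in processed:
            return True
    return any(output_dir.glob(f"*@{video_id}.json"))
```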

Installation & Setup

Prerequisites

  1. Python: Requires Python 3.8 or higher.
  2. FFmpeg: Ensure FFmpeg is installed and available in your system's PATH. FFmpeg is used for audio extraction and processing.
  3. Homebrew (macOS): Recommended for installing dependencies on macOS.

Installation Steps

  1. Clone the Repository:

    git clone <repository-url>
    cd youtube-transcriptions
  2. Install Dependencies:

    pip install -r requirements.txt
  3. Apple Silicon (M1/M2/M3) Specific Setup (Optional but Recommended):

    • Install OpenMP for optimal performance:
      brew install libomp
    • This project automatically sets the OMP_NUM_THREADS environment variable.

Usage Instructions

  1. Create a CSV Input File:

    Create a CSV file (e.g., urls.csv) containing the YouTube URLs you want to process. The CSV must have a url column. A type column is optional (for documentation; the script auto-detects the type).

    url,type
    https://www.youtube.com/watch?v=dQw4w9WgXcQ,video
    https://www.youtube.com/playlist?list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRgv3,playlist
    https://www.youtube.com/c/SomeChannel,channel
  2. Run the Transcription Pipeline:

    Use the transcribe.py script to start the process.

    • Using a CSV file:
      python transcribe.py --input urls.csv
    • Using a single YouTube URL (overrides CSV):
      python transcribe.py --url <YOUTUBE_URL>
    • You can set the CSV_FILE or YOUTUBE_URL variables in config.py.
  3. Output:

    Transcripts are saved as JSON files in the output directory. The filename format is:

    YYYY-MM-DD@Video-Title@video_id.json
    
    • batch_stats.json: Contains overall processing statistics.
    • failed_urls.json: Lists any URLs that failed to download or transcribe, along with error details.
    • processed.json: Keeps track of successfully processed video IDs to prevent reprocessing.
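The filename scheme above can be produced with a small helper along these lines. Both function names are hypothetical stand-ins for the project's actual logic in utils.py (which sanitizes titles before building the name); only the `YYYY-MM-DD@Video-Title@video_id.json` pattern comes from this README:

```python
import re

def sanitize(title: str) -> str:
    """Replace filesystem-unsafe characters in a title with hyphens
    (illustrative; the real sanitizer lives in utils.py)."""
    return re.sub(r"[^\w\-]+", "-", title).strip("-")

def transcript_filename(upload_date: str, title: str, video_id: str) -> str:
    """Build the YYYY-MM-DD@Video-Title@video_id.json output filename."""
    return f"{upload_date}@{sanitize(title)}@{video_id}.json"
```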

Configuration Details (config.py)

The config.py file allows you to customize the pipeline's behavior. Here are some key configuration options:

| Parameter | Description | Default |
| --- | --- | --- |
| OUTPUT_DIR | Directory to save the output transcripts. | output |
| TEMP_DIR | Directory for temporary files (audio downloads). | temp |
| MAX_CONCURRENT_DOWNLOADS | Maximum number of concurrent audio downloads. | 8 |
| DOWNLOAD_RETRY_ATTEMPTS | Number of times to retry a failed download. | 5 |
| DOWNLOAD_TIMEOUT | Download timeout in seconds. | 120 |
| DOWNLOAD_RATE_LIMIT | Optional download rate limit (e.g., "50M" for 50 MB/s). When unset, the yt-dlp options default to 10 MB/s. | None |
| NUM_WORKERS | Number of worker processes for transcription; adjust to your CPU core count. | 4 |
| BATCH_SIZE | Currently unused; reserved for future batching of transcription tasks. | 4 |
| CSV_FILE | Path to the CSV file containing YouTube URLs. Can be overridden by the --input command-line argument. | None |
| YOUTUBE_URL | A single YouTube URL to transcribe. Overrides CSV_FILE if set. | A default playlist URL |
| WHISPER_MODEL | Whisper model size: tiny, base, small, medium, or large-v3. Smaller models are faster but less accurate. | small |
| WHISPER_DEVICE | Device for Whisper inference. cpu is recommended for M1/M2/M3 Macs with int8 quantization; use cuda for NVIDIA GPUs. | cpu |
| WHISPER_COMPUTE_TYPE | Computation type for Whisper. int8 gives the best CPU performance on Apple Silicon. | int8 |
| MAX_MEMORY_PERCENT | Memory-usage threshold (percent) that triggers a reduction in worker processes. | 85 |
| MEMORY_THRESHOLD | Same as MAX_MEMORY_PERCENT. | 85 |
| YTDL_FORMAT | Format string for yt-dlp; bestaudio/best selects the best available audio quality. | bestaudio/best |
| YTDL_OPTS | Dictionary of yt-dlp options (audio extraction, retries, timeouts, rate limiting); outtmpl is set automatically from TEMP_DIR. See the yt-dlp documentation for details. | See config.py |
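A minimal config.py matching the defaults in the table might look like the sketch below. The real file uses pydantic for type-safe settings; this plain-constant version only illustrates the documented values:

```python
# config.py — minimal sketch of the defaults documented above.
# The shipped config.py wraps these in a pydantic settings model.
OUTPUT_DIR = "output"
TEMP_DIR = "temp"
MAX_CONCURRENT_DOWNLOADS = 8
DOWNLOAD_RETRY_ATTEMPTS = 5
DOWNLOAD_TIMEOUT = 120            # seconds
DOWNLOAD_RATE_LIMIT = None        # e.g. "50M"; when unset, yt-dlp opts default to 10 MB/s
NUM_WORKERS = 4                   # worker processes for transcription
WHISPER_MODEL = "small"           # tiny | base | small | medium | large-v3
WHISPER_DEVICE = "cpu"            # "cuda" for NVIDIA GPUs
WHISPER_COMPUTE_TYPE = "int8"     # best CPU performance on Apple Silicon
MAX_MEMORY_PERCENT = 85           # triggers worker reduction above this usage
YTDL_FORMAT = "bestaudio/best"
```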

File Structure & Key Components

  • transcribe.py: The main entry point for the application. Handles command-line arguments, sets up logging, and initiates the batch processing.
  • batch_processor.py: Contains the core logic for processing batches of YouTube URLs. Manages downloading, transcription, and statistics tracking. Implements the multiprocessing and threading strategy.
  • config.py: Defines configuration parameters and settings for the pipeline. Uses pydantic for type-safe configuration management.
  • utils.py: Provides utility functions for various tasks, such as extracting video IDs, sanitizing filenames, managing processed video lists, and saving transcripts.
  • output/: Directory where the generated transcripts (JSON files) and logs are stored. Excluded from Git.
  • temp/: Directory for temporary audio files downloaded by yt-dlp. Excluded from Git.

Performance Optimizations

  • Multiprocessing: Uses multiprocessing.ProcessPoolExecutor to run multiple transcription tasks concurrently, leveraging multiple CPU cores. The worker_initializer function ensures each process has its own Whisper model instance.
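The per-process model pattern can be sketched as follows. In the real batch_processor, `load_model` would return `faster_whisper.WhisperModel(WHISPER_MODEL, device=WHISPER_DEVICE, compute_type=WHISPER_COMPUTE_TYPE)`; a trivial placeholder stands in here so the sketch runs without the library installed:

```python
from concurrent.futures import ProcessPoolExecutor

_model = None  # one model instance per worker process

def load_model():
    """Placeholder for faster_whisper.WhisperModel(...)."""
    return object()

def worker_initializer():
    """Runs once in each worker process, so every worker
    loads its own Whisper model exactly once."""
    global _model
    _model = load_model()

def transcribe(path: str) -> str:
    """In the real pipeline this would call _model.transcribe(path)."""
    assert _model is not None, "worker_initializer did not run"
    return f"transcript of {path}"

# Usage: ProcessPoolExecutor(max_workers=NUM_WORKERS,
#                            initializer=worker_initializer)
```

Loading the model in the initializer rather than per task avoids re-loading the weights for every audio file, which dominates startup cost for Whisper.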
  • Threading: Employs threading.ThreadPoolExecutor for concurrent audio downloads, improving I/O-bound performance.
  • Apple Silicon (M1/M2/M3) Optimizations:
    • WHISPER_DEVICE = "cpu" and WHISPER_COMPUTE_TYPE = "int8": Forces Whisper to use the CPU with int8 quantization, which is significantly faster than MPS (Metal Performance Shaders) on Apple Silicon for this specific task.
    • OMP_NUM_THREADS: Automatically sets the number of OpenMP threads to half the number of CPU cores, preventing thread oversubscription and improving performance.
    • PYTORCH_ENABLE_MPS_FALLBACK=1: Lets PyTorch fall back to the CPU for operations that MPS does not support.
    • CT2_USE_MKL=0: Disables MKL, as it can conflict with OpenMP.
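The environment tweaks above amount to a few lines at process startup. The variable names and values are the ones documented here; the helper itself is an illustrative sketch that must run before any OpenMP/ctranslate2 code is imported:

```python
import os

def configure_apple_silicon_env() -> None:
    """Apply the Apple Silicon environment settings described above."""
    # Half the cores avoids OpenMP thread oversubscription.
    os.environ.setdefault("OMP_NUM_THREADS", str(max(1, (os.cpu_count() or 2) // 2)))
    # Fall back to CPU for operations MPS does not support.
    os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
    # MKL can conflict with OpenMP on Apple Silicon.
    os.environ.setdefault("CT2_USE_MKL", "0")
```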
  • Memory Monitoring: A dedicated thread monitors memory usage and logs warnings if it exceeds a configured threshold. This helps prevent out-of-memory errors.
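A monitor thread of this kind can be sketched as below. The reader function is injected so the sketch stays self-contained; the real pipeline would pass something like `lambda: psutil.virtual_memory().percent`, and the function and parameter names here are illustrative:

```python
import threading
import time

def start_memory_monitor(get_percent, threshold=85, interval=1.0, on_warning=print):
    """Run a daemon thread that samples get_percent() every `interval`
    seconds and calls on_warning when usage crosses `threshold`.
    Returns an Event the caller sets to stop monitoring."""
    stop = threading.Event()

    def monitor():
        while not stop.is_set():
            used = get_percent()
            if used > threshold:
                on_warning(f"memory usage {used:.0f}% exceeds {threshold}%")
            stop.wait(interval)

    threading.Thread(target=monitor, daemon=True).start()
    return stop
```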
  • Efficient Metadata Extraction: Extracts metadata for all URLs in a single yt-dlp call, reducing overhead.
  • Pipelined Processing: Downloads and transcriptions are pipelined. As soon as an audio file is downloaded, its transcription is submitted to the process pool, maximizing resource utilization.
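The pipelining described above can be sketched with stub download/transcribe functions. The real pipeline hands transcriptions to a ProcessPoolExecutor; a second thread pool stands in here so the sketch is self-contained, and the function names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download(url: str) -> str:
    """Stub download: returns the path of the 'downloaded' audio file."""
    return url.rsplit("=", 1)[-1] + ".mp3"

def transcribe(path: str) -> str:
    """Stub transcription standing in for the Whisper worker."""
    return f"transcript of {path}"

def run_pipeline(urls):
    """Submit each transcription as soon as its download finishes, so
    downloading and transcribing overlap rather than run in phases."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as dl_pool, \
         ThreadPoolExecutor(max_workers=2) as tx_pool:
        tx_futures = {}
        for fut in as_completed({dl_pool.submit(download, u): u for u in urls}):
            path = fut.result()
            tx_futures[tx_pool.submit(transcribe, path)] = path
        for fut in as_completed(tx_futures):
            results[tx_futures[fut]] = fut.result()
    return results
```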
  • Optimized Transcript Generation: Uses list comprehensions for efficient in-memory transcript generation.
  • Rate Limiting: The DOWNLOAD_RATE_LIMIT option in config.py allows you to control the download speed to avoid overwhelming the network or triggering rate limits from YouTube.
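Wiring a "50M"-style limit into yt-dlp might look like the sketch below. Note that yt-dlp's Python API expects `ratelimit` as a number in bytes/sec (unlike the CLI flag, which accepts suffixed strings); both helper names are illustrative, and the options shown are a subset of the real YTDL_OPTS:

```python
def parse_rate_limit(limit):
    """Convert a '50M'-style limit to bytes/sec for yt-dlp's `ratelimit`
    option, or pass through None / plain numbers."""
    if limit is None:
        return None
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = limit[-1].upper()
    if suffix in units:
        return int(float(limit[:-1]) * units[suffix])
    return int(limit)

def build_ydl_opts(rate_limit=None):
    """Illustrative subset of YTDL_OPTS; see config.py for the full dict."""
    opts = {"format": "bestaudio/best", "retries": 5, "socket_timeout": 120}
    limit = parse_rate_limit(rate_limit)
    if limit:
        opts["ratelimit"] = limit
    return opts
```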

Common Issues & Troubleshooting

  1. Memory Issues:

    • If you encounter out-of-memory errors, reduce NUM_WORKERS in config.py. The pipeline automatically reduces workers if memory usage exceeds MAX_MEMORY_PERCENT.
    • Monitor memory usage using the built-in memory monitoring thread (logs warnings).
  2. Download Failures:

    • Check the failed_urls.json file in the output directory for details on failed downloads.
    • Ensure yt-dlp is up-to-date (pip install --upgrade yt-dlp).
    • Network connectivity issues can cause downloads to fail.
  3. Transcription Errors:

    • Check the transcription.log file in the output directory for detailed error messages.
    • Ensure the Whisper model is correctly specified in config.py.
    • If using a GPU, ensure you have the necessary CUDA drivers and libraries installed.
  4. M1/M2/M3 Specific Issues:

    • Ensure libomp is installed (brew install libomp).
  5. Empty Transcripts:

    • If a transcript is empty, check the logs. It may be that no speech was detected, or the audio quality was too poor.

Future Enhancements

  • Batching for Transcription: Implement actual batching of audio segments for Whisper processing (currently, BATCH_SIZE is unused). This could further improve performance, especially on GPUs.
  • Web Interface: Develop a web interface for easier interaction and management of transcription tasks.
  • Support for Other Languages: Extend the pipeline to support transcription in languages other than English.
  • Speaker Diarization: Integrate speaker diarization to identify different speakers in the audio.
  • Improved Error Handling: Implement more granular error handling and reporting, potentially with automatic retries for specific transcription errors.
  • Configuration via Environment Variables: Allow overriding configuration settings using environment variables for easier deployment in containerized environments.
