
Podcast2PDF: Transcribing Podcasts into PDFs with Speaker Diarization


Overview

This project provides an end-to-end workflow for transforming raw audio and video into highly readable, speaker-aware PDF and Google Doc transcripts.

The pipeline automatically handles:

  1. Media Acquisition: Downloads content from podcast RSS feeds or video URLs (using yt-dlp).
  2. Speech Processing: Transcribes and diarizes the audio (identifying who spoke when) using the powerful Whisper model.
  3. Output Formatting: Produces a cleanly formatted, speaker-aware PDF and simultaneously writes the document to a Google Doc for easy review, collaboration, or archiving.
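To make the "speaker-aware" part of step 3 concrete, here is a minimal sketch of the formatting idea: consecutive segments from the same speaker are merged into one block before rendering. The segment structure (`speaker`/`text` dicts) is an assumption for illustration, not podcast2pdf's actual internal schema.

```python
def format_transcript(segments):
    """Merge consecutive same-speaker segments into speaker-labeled blocks.

    Each segment is assumed to be a dict with "speaker" and "text" keys;
    the real pipeline's internal structure may differ.
    """
    blocks = []
    for seg in segments:
        if blocks and blocks[-1][0] == seg["speaker"]:
            # Same speaker as the previous block: append the text.
            blocks[-1] = (seg["speaker"], blocks[-1][1] + " " + seg["text"])
        else:
            blocks.append((seg["speaker"], seg["text"]))
    return "\n\n".join(f"{spk}: {txt}" for spk, txt in blocks)

demo = [
    {"speaker": "SPEAKER_00", "text": "Welcome to the show."},
    {"speaker": "SPEAKER_00", "text": "Today we talk about RSS."},
    {"speaker": "SPEAKER_01", "text": "Thanks for having me."},
]
print(format_transcript(demo))
```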

Prerequisites

1. Install yt-dlp (Video/Audio Extractor)

To reliably extract high-quality audio from video sources like YouTube, this project uses yt-dlp, the current, actively maintained successor to youtube-dl.

# Download yt-dlp
sudo curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp \
  -o /usr/local/bin/yt-dlp \
  && sudo chmod 755 /usr/local/bin/yt-dlp

# Check installation success
yt-dlp --version

2. Set Up Google Docs API Credentials

This project leverages the guanqun-yang/seedwriter library to push the diarized transcript directly to a Google Doc. This requires proper API authentication:

You must have your Google API credential and token files stored in the following location:

~/.local/share/podcast2pdf/credentials.json
~/.local/share/podcast2pdf/token_docs.json
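A small pre-flight check like the following (an illustrative helper, not part of podcast2pdf itself) can confirm both credential files are in place before you run the pipeline:

```python
from pathlib import Path

def missing_credentials(base="~/.local/share/podcast2pdf"):
    """Return the names of any expected Google API credential files that are absent."""
    base_dir = Path(base).expanduser()
    expected = ["credentials.json", "token_docs.json"]
    return [name for name in expected if not (base_dir / name).exists()]

# An empty list means both files were found and you are ready to go.
print(missing_credentials())
```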

3. Set Up the OpenAI API Key

You need to have an OpenAI API key to use the Whisper model for transcription. Set it as an environment variable in ~/.bashrc or ~/.zshrc:

export OPENAI_API_KEY=<OPENAI_API_KEY>

Workflow

A. One-Step Quick Start

For the fastest results, run the entire pipeline with a single command, passing the URL directly:

podcast2pdf <VIDEO_OR_RSS_URL>

B. Step-by-Step Manual Workflow

If you need more control or debugging visibility, follow these steps:

Step 1: Download Media Audio

  • For Podcasts (via RSS):

    1. Find the podcast show link (e.g., in Apple Podcasts).
    2. Use a tool like getrssfeed.com or a library like Python's feedparser to extract the underlying RSS Feed URL.
    3. Download the audio files:
      npx podcast-dl --limit 5 --url <RSS_FEED_URL>
  • For YouTube/Video URLs (via yt-dlp): Download the highest quality audio and convert it to MP3:

    yt-dlp -x --audio-format mp3 --audio-quality 0 <VIDEO_URL>

    This will save a high-quality .mp3 file.
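For the RSS route above, note that a podcast feed is plain XML: each episode's audio URL lives in the `url` attribute of an `<item>`'s `<enclosure>` tag. If you prefer not to add a dependency like feedparser, the standard library is enough; the sample feed below is made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Podcast</title>
  <item>
    <title>Episode 1</title>
    <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="123"/>
  </item>
  <item>
    <title>Episode 2</title>
    <enclosure url="https://example.com/ep2.mp3" type="audio/mpeg" length="456"/>
  </item>
</channel></rss>"""

def enclosure_urls(rss_text):
    """Return the audio URL of every episode enclosure in an RSS document."""
    root = ET.fromstring(rss_text)
    return [enc.get("url") for enc in root.iter("enclosure")]

print(enclosure_urls(SAMPLE_RSS))
```

Each returned URL can then be fetched directly or handed to a downloader.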

Step 2: Transcribe and Diarize Audio

The final stage uses OpenAI's models for state-of-the-art transcription and speaker diarization.

  1. Environment Setup: Create and activate a dedicated environment for the project:
    conda create --name Podcast2PDF python=3.12
    conda activate Podcast2PDF
    uv pip install -e .
  2. Run Transcription:
    # Example for an MP3 file
    podcast2pdf mp3_audio.mp3 transcript.pdf --verbose
    # Example for an M4A file
    podcast2pdf m4a_audio.m4a transcript.pdf --verbose
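Under the hood, transcription and diarization produce two different views of the audio: timestamped text segments and speaker turns (time ranges labeled by speaker). Combining them means assigning each text segment to the speaker whose turn overlaps it most. A hedged sketch of that alignment, with field names assumed for illustration rather than taken from podcast2pdf's actual schema:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
        )
        labeled.append({**seg, "speaker": best["speaker"]})
    return labeled

segments = [{"start": 0.0, "end": 4.0, "text": "Hello."},
            {"start": 4.0, "end": 9.0, "text": "Hi there."}]
turns = [{"start": 0.0, "end": 4.5, "speaker": "SPEAKER_00"},
         {"start": 4.5, "end": 9.0, "speaker": "SPEAKER_01"}]
print(label_segments(segments, turns))
```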

Note on Pricing: Transcription costs are based on the token count of the resulting text, not the audio file size or duration. Consult the OpenAI documentation for the latest pricing.
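For a quick back-of-envelope estimate under that token-based pricing model, a helper like the following works; the rate is a parameter here precisely because the real per-token price changes and must be taken from OpenAI's pricing page, not from this sketch.

```python
def estimate_cost(token_count, usd_per_million_tokens):
    """Rough cost of a transcript given a per-million-token rate (hypothetical)."""
    return token_count / 1_000_000 * usd_per_million_tokens

# e.g. a 10,000-token transcript at a placeholder $10 per 1M tokens:
print(f"${estimate_cost(10_000, 10.0):.2f}")  # → $0.10
```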

Alternative: Consider fully managed services like AssemblyAI, which offer a comparable cost (approx. $0.27 per hour) for building a potentially no-code pipeline.

Technical Background

Why RSS Matters for Podcasts

Unlike music streaming, where platforms like Spotify and Apple Music host and serve all media from their own servers, podcast distribution is built on a decentralized, open ecosystem powered by RSS. Although RSS is no longer widely used for blogs and news delivery, it remains the backbone of podcasting—allowing creators to publish once and automatically reach listeners across many different podcast apps.

  • Creators upload episodes to independent hosting services (e.g., Libsyn, Megaphone, Podbean), which generate and manage each show’s RSS feed.
  • Podcast apps subscribe to the RSS feed, detect new episodes, and display them to listeners.

Landscape of Audio Transcription Models

Among major AI labs, OpenAI maintains a notable advantage in speech-to-text technology. While most competitors focus primarily on text-based LLMs, OpenAI has invested deeply in audio understanding:

  • Whisper has become the de-facto industry standard for open-source transcription: robust across accents, noisy environments, and rapid speech.
  • The Whisper large-v3 API pushes accuracy further still, with strong real-world performance across roughly 100 languages.

For other companies:

  • AssemblyAI and Deepgram offer dedicated, fully managed speech-to-text platforms built around their own proprietary models.
  • Google, Amazon, Meta, and Microsoft all ship speech-to-text offerings, though with comparatively less adoption for transcription.
  • Anthropic, Cohere, and Mistral do not, as of this writing, offer dedicated transcription models.
