This project provides an end-to-end workflow for transforming raw audio and video into highly readable, speaker-aware PDF and Google Doc transcripts.
The pipeline automatically handles:
- Media Acquisition: Downloads content from podcast RSS feeds or video URLs (using `yt-dlp`).
- Speech Processing: Transcribes and diarizes the audio (identifying who spoke when) using the powerful Whisper model.
- Output Formatting: Produces a cleanly formatted, speaker-aware PDF and simultaneously writes the document to a Google Doc for easy review, collaboration, or archiving.
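To illustrate the speaker-aware formatting stage, here is a toy sketch that merges consecutive diarized segments from the same speaker into readable paragraphs. The segment schema (`speaker`/`text` keys) is an assumption for illustration, not the project's actual data model:

```python
def format_transcript(segments: list[dict]) -> str:
    """Group consecutive segments by speaker into 'Speaker: text' paragraphs."""
    paragraphs: list[tuple[str, list[str]]] = []
    for seg in segments:
        if paragraphs and paragraphs[-1][0] == seg["speaker"]:
            # Same speaker as the previous segment: extend their paragraph.
            paragraphs[-1][1].append(seg["text"].strip())
        else:
            paragraphs.append((seg["speaker"], [seg["text"].strip()]))
    return "\n\n".join(f"{speaker}: {' '.join(texts)}" for speaker, texts in paragraphs)

segments = [
    {"speaker": "SPEAKER_00", "text": "Welcome to the show."},
    {"speaker": "SPEAKER_00", "text": "Today we have a guest."},
    {"speaker": "SPEAKER_01", "text": "Thanks for having me."},
]
print(format_transcript(segments))
```

Grouping by consecutive speaker (rather than emitting one line per segment) is what makes the final PDF read like a dialogue instead of a raw timestamp dump.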
To reliably extract high-quality audio from video sources like YouTube, this project uses yt-dlp, the current, actively maintained successor to youtube-dl.
```sh
# Download yt-dlp
sudo curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp \
  -o /usr/local/bin/yt-dlp \
  && sudo chmod 755 /usr/local/bin/yt-dlp

# Check installation success
yt-dlp --version
```

This project leverages the `guanqun-yang/seedwriter` library to push the diarized transcript directly to a Google Doc. This requires proper API authentication:
You must have your Google API credential and token files stored in the following locations:

```
~/.local/share/podcast2pdf/credentials.json
~/.local/share/podcast2pdf/token_docs.json
```
You need an OpenAI API key to use the Whisper model for transcription. Set it as an environment variable in `~/.bashrc` or `~/.zshrc`:

```sh
export OPENAI_API_KEY=<OPENAI_API_KEY>
```

For the fastest results, run the entire pipeline with a single command, passing the URL directly:
```sh
podcast2pdf <VIDEO_OR_RSS_URL>
```

If you need more control or debugging visibility, follow these steps:
- For Podcasts (via RSS):
  - Find the podcast show link (e.g., in Apple Podcasts).
  - Use a tool like getrssfeed.com or a library like Python's `feedparser` to extract the underlying RSS feed URL.
  - Download the audio files:

    ```sh
    npx podcast-dl --limit 5 --url <RSS_FEED_URL>
    ```
- For YouTube/Video URLs (via `yt-dlp`): Download the highest quality audio and convert it to MP3:

  ```sh
  yt-dlp -x --audio-format mp3 --audio-quality 0 <VIDEO_URL>
  ```

  This will save a high-quality `.mp3` file.
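The `feedparser` step mentioned above can also be done with just the standard library; a minimal sketch that pulls the enclosure (audio) URLs out of an RSS feed's XML:

```python
import xml.etree.ElementTree as ET

def extract_audio_urls(rss_xml: str, limit: int = 5) -> list[str]:
    """Return the enclosure (audio) URLs of up to `limit` items in an RSS feed."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        # Podcast audio lives in each item's <enclosure url="..."> attribute.
        enclosure = item.find("enclosure")
        if enclosure is not None and enclosure.get("url"):
            urls.append(enclosure.get("url"))
        if len(urls) >= limit:
            break
    return urls

sample = """<rss><channel>
  <item><title>Ep 1</title><enclosure url="https://example.com/ep1.mp3" type="audio/mpeg"/></item>
  <item><title>Ep 2</title><enclosure url="https://example.com/ep2.mp3" type="audio/mpeg"/></item>
</channel></rss>"""
print(extract_audio_urls(sample))  # ['https://example.com/ep1.mp3', 'https://example.com/ep2.mp3']
```

For real-world feeds, `feedparser` is more forgiving of malformed XML, but this shows what it is extracting under the hood.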
The final stage uses OpenAI's models for state-of-the-art transcription and speaker diarization.
- Environment Setup: Create and activate a dedicated environment for the project:
  ```sh
  conda create --name Podcast2PDF python==3.12
  conda activate Podcast2PDF
  uv pip install -e .
  ```

- Run Transcription:

  ```sh
  # Example for an MP3 file
  podcast2pdf mp3_audio.mp3 transcript.pdf --verbose

  # Example for an M4A file
  podcast2pdf m4a_audio.m4a transcript.pdf --verbose
  ```
Note on Pricing: Transcription costs are based on the token count of the resulting text, not the audio file size or duration. Consult the OpenAI documentation for the latest pricing.
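To put the token-based pricing in perspective, a back-of-the-envelope estimate (the per-token rate below is a made-up placeholder; check OpenAI's pricing page for real numbers):

```python
def estimate_cost(transcript_tokens: int, usd_per_million_tokens: float) -> float:
    # Cost scales with the token count of the transcript text, not audio duration.
    return transcript_tokens / 1_000_000 * usd_per_million_tokens

# A one-hour episode often yields roughly 10-15k tokens of text; at a
# hypothetical $10 per million output tokens:
print(round(estimate_cost(12_000, 10.0), 2))  # 0.12
```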
Alternative: Consider fully managed services like AssemblyAI, which offer a comparable cost (approx. $0.27 per hour) for building a potentially no-code pipeline.
Unlike music streaming, where platforms like Spotify and Apple Music host and serve all media from their own servers, podcast distribution is built on a decentralized, open ecosystem powered by RSS. Although RSS is no longer widely used for blogs and news delivery, it remains the backbone of podcasting—allowing creators to publish once and automatically reach listeners across many different podcast apps.
- Creators upload episodes to independent hosting services (e.g., Libsyn, Megaphone, Podbean), which generate and manage each show’s RSS feed.
- Podcast apps subscribe to the RSS feed, detect new episodes, and display them to listeners.
Among major AI labs, OpenAI maintains a notable advantage in speech-to-text technology. While most competitors focus primarily on text-based LLMs, OpenAI has invested deeply in audio understanding:
- Whisper has become the de facto industry standard for open-source transcription: robust across accents, noisy environments, and rapid speech.
- Whisper v3 API pushes accuracy even further, supporting 100+ languages with strong real-world performance.
For other companies:
- AssemblyAI and Deepgram are speech-focused vendors that offer their own managed transcription models and APIs.
- Google, Amazon, Meta, and Microsoft each offer speech-to-text models, but with less developer adoption.
- Anthropic, Cohere, and Mistral do not have transcription models at all.
