A Python-based Retrieval-Augmented Generation (RAG) system for intelligent search and question-answering over podcast libraries. The system transcribes audio using Whisper, extracts metadata with AI, and enables semantic search with automatic citations.
- Automatic transcription of MP3 files using OpenAI Whisper
- AI-powered metadata extraction (titles, hosts, guests, summaries, keywords)
- Vector embeddings and semantic search using Gemini File Search
- Natural language queries with source citations
- Scheduled batch processing
- Dry-run mode and comprehensive logging
This project uses `uv` for fast, reliable dependency management. You can also use traditional `pip` if preferred.
- Python 3.11+ (recommended)
- Install ffmpeg:
  - Linux: `sudo apt-get install ffmpeg`
  - macOS (using Homebrew): `brew install ffmpeg`
  - Windows: download and install ffmpeg from https://ffmpeg.org/download.html
- Install `uv` (recommended):

  ```shell
  curl -LsSf https://astral.sh/uv/install.sh | sh
  # Or on macOS: brew install uv
  ```
```shell
# Clone and enter the repository
git clone https://github.com/allenhutchison/podcast-rag
cd podcast-rag

# Install all dependencies (creates .venv automatically)
uv sync

# Optional: Activate the virtual environment
# (not required if using 'uv run' commands)
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

Configuration is managed via environment variables. Set them using:
- A `.env` file in the project root
- A secrets manager (Doppler, 1Password, Vault)
- Shell environment variables
Required variables:
```
GEMINI_API_KEY=your_gemini_api_key_here
PODCAST_DOWNLOAD_DIRECTORY=/path/to/your/podcasts
```

Test the installation:

```shell
uv run python -m src.cli podcast status
```

See docs/configuration.md for the full configuration reference.
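A minimal way to validate the required variables at startup is to fail fast with a clear message. This is an illustrative sketch; the project's own configuration module may handle this differently:

```python
import os

REQUIRED = ("GEMINI_API_KEY", "PODCAST_DOWNLOAD_DIRECTORY")

def load_config(env=os.environ) -> dict:
    """Return the required settings, raising early if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {name: env[name] for name in REQUIRED}
```

Failing at startup (rather than mid-pipeline) makes misconfiguration obvious before any transcription work begins.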
Pre-built Docker images are available on Docker Hub for easy deployment:
- `allenhutchison/podcast-rag`: encoding backend (transcription + metadata extraction)
- `allenhutchison/podcast-rag-web`: web interface for querying podcasts
Run both the encoding backend and web service together using docker-compose:
1. Create a `.env` file with the required environment variables (see docs/configuration.md):

   ```
   GEMINI_API_KEY=your_key_here
   PODCAST_DOWNLOAD_DIRECTORY=/data/podcasts
   ```

2. Edit `docker-compose.yml` to set your podcast directory path in the volume mounts.

3. Start services:

   ```shell
   # Pull latest images
   docker-compose pull

   # Start both services in background
   docker-compose up -d

   # View logs
   docker-compose logs -f

   # Stop services
   docker-compose down
   ```

4. Access the web interface:
   - Open http://localhost:8080 in your browser
   - Start querying your podcast library!
What's running:
- `podcast-rag`: processes new podcasts every hour
- `podcast-rag-web`: serves the web UI for real-time queries with streaming responses
Shared resources:
- Podcast directory: Source audio files (read-only)
- Database: Episode metadata and File Search references
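Put together, a `docker-compose.yml` for this setup might look roughly like the following. This is a sketch: the host path, volume name, and port are placeholders, and the compose file shipped in the repository is authoritative:

```yaml
services:
  podcast-rag:
    image: allenhutchison/podcast-rag
    env_file: .env
    volumes:
      - /path/to/your/podcasts:/data/podcasts:ro  # source audio, read-only
      - rag-db:/data/db                           # shared episode database

  podcast-rag-web:
    image: allenhutchison/podcast-rag-web
    env_file: .env
    ports:
      - "8080:8080"
    volumes:
      - rag-db:/data/db

volumes:
  rag-db:
```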
Deploy the web interface to Google Cloud Run for public access:
1. Build and push the web image (or use the pre-built image from Docker Hub):

   ```shell
   gcloud builds submit --tag gcr.io/YOUR_PROJECT/podcast-rag-web
   ```

2. Deploy to Cloud Run:

   ```shell
   gcloud run deploy podcast-rag-web \
     --image gcr.io/YOUR_PROJECT/podcast-rag-web \
     --platform managed \
     --region us-central1 \
     --allow-unauthenticated \
     --set-env-vars GEMINI_API_KEY=your_key,GEMINI_FILE_SEARCH_STORE_NAME=podcast-transcripts
   ```

3. Access your deployment:
   - Cloud Run provides a public URL
   - The web service connects to your Gemini File Search store
Note: Cloud Run deployment uses `Dockerfile.web`, which excludes ffmpeg (~100MB) for faster startup. The homelab encoding backend must run separately to process podcasts.
Images are automatically built via GitHub Actions when you create a release:
```shell
# Create and push a tag
git tag v1.0.0
git push origin v1.0.0

# Or create a release via the GitHub UI
# This triggers builds for both images
```

Manual build:

```shell
# Build encoding backend
docker build -t podcast-rag -f Dockerfile .

# Build web service
docker build -t podcast-rag-web -f Dockerfile.web .
```

| Image | Base | Size | Contains | Use Case |
|---|---|---|---|---|
| `podcast-rag` | python:3.12-slim | ~1.5GB | ffmpeg, whisper, all dependencies | Homelab encoding backend |
| `podcast-rag-web` | python:3.12-slim | ~500MB | Web server, no ffmpeg | Cloud Run or homelab web UI |
Both images:
- Use multi-stage builds for smaller size
- Run as non-root user (UID 1000)
- Include health checks
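Those three properties map onto a Dockerfile pattern roughly like the sketch below. The stage names, file paths, and health-check command are illustrative assumptions; the actual `Dockerfile` and `Dockerfile.web` in the repository are authoritative:

```dockerfile
# Stage 1: install dependencies in an isolated builder layer
FROM python:3.12-slim AS builder
WORKDIR /app
COPY pyproject.toml ./
RUN pip install uv && uv sync

# Stage 2: copy only the runtime environment (keeps the final image small)
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY src/ src/

# Run as a non-root user (UID 1000)
RUN useradd --uid 1000 app
USER app

# Lightweight liveness probe (placeholder command)
HEALTHCHECK CMD ["python", "-c", "import sys; sys.exit(0)"]

CMD ["/app/.venv/bin/python", "-m", "src.cli", "podcast", "pipeline"]
```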
All commands can be run with `uv run`, or directly if you've activated the virtual environment.
```shell
# Run the pipeline (continuous processing optimized for GPU)
uv run poe pipeline

# Or using the CLI directly
uv run python -m src.cli podcast pipeline
```

Check processing status:

```shell
uv run python -m src.cli podcast status
```

Query your library:

```shell
uv run poe query --query "your question here"

# Or directly
uv run python -m src.rag --query "your question here"
```

Manage the File Search store:

```shell
# List all documents in the store
python scripts/file_search_utils.py --action list

# Find duplicate files
python scripts/file_search_utils.py --action find-duplicates

# Delete duplicates (keeps oldest by default)
python scripts/file_search_utils.py --action delete-duplicates

# Delete all files (with confirmation)
python scripts/file_search_utils.py --action delete-all
```

The tool uses Python's built-in logging to track progress and errors. By default, logs are displayed in the console at INFO level.
```shell
# Available levels: DEBUG, INFO, WARNING, ERROR
python src/scheduler.py --log-level DEBUG
python -m src.rag --query "your question" --log-level ERROR
```

When processing fails for any episode, the pipeline logs detailed error information showing:
- Episodes that failed to transcribe
- Episodes that failed metadata extraction
- Episodes that failed indexing
Use `python -m src.cli podcast status` to view the current processing status and any failures.
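One way to accumulate per-stage failures like those listed above is a small tracker that records each failed episode and produces an end-of-run summary. This is an illustrative sketch, not the pipeline's actual code:

```python
import logging
from collections import defaultdict

logger = logging.getLogger("pipeline")

class FailureTracker:
    """Collect failed episodes per stage and summarize them at the end."""

    STAGES = ("transcription", "metadata", "indexing")

    def __init__(self):
        self.failures = defaultdict(list)

    def record(self, stage: str, episode: str, error: Exception) -> None:
        """Log the failure immediately and remember it for the summary."""
        self.failures[stage].append(episode)
        logger.error("%s failed for %s: %s", stage, episode, error)

    def summary(self) -> str:
        """One line per stage that had failures, in pipeline order."""
        lines = [f"{stage}: {len(self.failures[stage])} failed"
                 for stage in self.STAGES if self.failures[stage]]
        return "; ".join(lines) or "all episodes processed"
```

Recording failures per stage (rather than aborting the whole run) lets a scheduled batch job finish the remaining episodes and retry only the failed ones later.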
Run tests using the poe task runner:
```shell
# Run all tests
uv run poe test

# Run tests with coverage
uv run poe cov
```

Contributions are welcome! Please submit a pull request with any improvements or bug fixes. Ensure all tests pass before submitting your PR.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.