A Python-based Retrieval-Augmented Generation (RAG) system for intelligent search and question-answering over podcast libraries. The system transcribes audio using Whisper, extracts metadata with AI, and enables semantic search with automatic citations.
- Automatic transcription of MP3 files using OpenAI Whisper
- AI-powered metadata extraction (titles, hosts, guests, summaries, keywords)
- Vector embeddings and semantic search using Gemini File Search
- Natural language queries with source citations
- Scheduled batch processing
- Dry-run mode and comprehensive logging
This project uses `uv` for fast, reliable dependency management. You can also use traditional `pip` if preferred.
- Python 3.11+ (recommended)
- Install ffmpeg:
  - Linux: `sudo apt-get install ffmpeg`
  - macOS (using Homebrew): `brew install ffmpeg`
  - Windows: download and install ffmpeg from https://ffmpeg.org/download.html
- Install `uv` (recommended):

  ```shell
  curl -LsSf https://astral.sh/uv/install.sh | sh
  # Or on macOS: brew install uv
  ```
```shell
# Clone and enter the repository
git clone https://github.com/allenhutchison/podcast-rag
cd podcast-rag

# Install all dependencies (creates .venv automatically)
uv sync

# Optional: Activate the virtual environment
# (not required if using 'uv run' commands)
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

Configuration is managed via environment variables. Set them using:
- A `.env` file in the project root
- A secrets manager (Doppler, 1Password, Vault)
- Shell environment variables
Required variables:
```
GEMINI_API_KEY=your_gemini_api_key_here
PODCAST_DOWNLOAD_DIRECTORY=/path/to/your/podcasts
```

Test the installation:

```shell
uv run python -m src.cli podcast status
```

See docs/configuration.md for the full configuration reference.
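A minimal way to validate the required variables at startup is to fail fast with a clear message. This is an illustrative sketch; the project's own configuration module may handle this differently:

```python
import os

REQUIRED = ("GEMINI_API_KEY", "PODCAST_DOWNLOAD_DIRECTORY")

def load_config(env=os.environ) -> dict:
    """Return the required settings, raising early if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {name: env[name] for name in REQUIRED}
```

Failing at startup (rather than mid-pipeline) makes misconfiguration obvious before any transcription work begins.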
Pre-built Docker images are available on Docker Hub for easy deployment:
- `allenhutchison/podcast-rag`: encoding backend (transcription + metadata extraction)
- `allenhutchison/podcast-rag-web`: web interface for querying podcasts
Run both the encoding backend and web service together using docker-compose:
1. Create a `.env` file with the required environment variables (see docs/configuration.md):

   ```
   GEMINI_API_KEY=your_key_here
   PODCAST_DOWNLOAD_DIRECTORY=/data/podcasts
   ```

2. Edit `docker-compose.yml` to set your podcast directory path in the volume mounts.

3. Start services:

   ```shell
   # Pull latest images
   docker-compose pull

   # Start both services in background
   docker-compose up -d

   # View logs
   docker-compose logs -f

   # Stop services
   docker-compose down
   ```

4. Access the web interface:
   - Open http://localhost:8080 in your browser
   - Start querying your podcast library!
What's running:
- `podcast-rag`: processes new podcasts every hour
- `podcast-rag-web`: serves the web UI for real-time queries with streaming responses
Shared resources:
- Podcast directory: Source audio files (read-only)
- Database: Episode metadata and File Search references
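Put together, a `docker-compose.yml` for this setup might look roughly like the following. This is a sketch: the host path, volume name, and port are placeholders, and the compose file shipped in the repository is authoritative:

```yaml
services:
  podcast-rag:
    image: allenhutchison/podcast-rag
    env_file: .env
    volumes:
      - /path/to/your/podcasts:/data/podcasts:ro  # source audio, read-only
      - rag-db:/data/db                           # shared episode database

  podcast-rag-web:
    image: allenhutchison/podcast-rag-web
    env_file: .env
    ports:
      - "8080:8080"
    volumes:
      - rag-db:/data/db

volumes:
  rag-db:
```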
Deploy the web interface to Google Cloud Run for public access:
1. Build and push the web image (or use the pre-built image from Docker Hub):

   ```shell
   gcloud builds submit --tag gcr.io/YOUR_PROJECT/podcast-rag-web
   ```

2. Deploy to Cloud Run:

   ```shell
   gcloud run deploy podcast-rag-web \
     --image gcr.io/YOUR_PROJECT/podcast-rag-web \
     --platform managed \
     --region us-central1 \
     --allow-unauthenticated \
     --set-env-vars GEMINI_API_KEY=your_key,GEMINI_FILE_SEARCH_STORE_NAME=podcast-transcripts
   ```

3. Access your deployment:
   - Cloud Run provides a public URL
   - The web service connects to your Gemini File Search store
Note: Cloud Run deployment uses `Dockerfile.web`, which excludes ffmpeg (~100MB) for faster startup. The homelab encoding backend must run separately to process podcasts.
Images are automatically built via GitHub Actions when you create a release:
```shell
# Create and push a tag
git tag v1.0.0
git push origin v1.0.0

# Or create a release via the GitHub UI
# This triggers builds for both images
```

Manual build:

```shell
# Build encoding backend
docker build -t podcast-rag -f Dockerfile .

# Build web service
docker build -t podcast-rag-web -f Dockerfile.web .
```

| Image | Base | Size | Contains | Use Case |
|---|---|---|---|---|
| `podcast-rag` | python:3.12-slim | ~1.5GB | ffmpeg, whisper, all dependencies | Homelab encoding backend |
| `podcast-rag-web` | python:3.12-slim | ~500MB | Web server, no ffmpeg | Cloud Run or homelab web UI |
Both images:
- Use multi-stage builds for smaller size
- Run as non-root user (UID 1000)
- Include health checks
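Those three properties map onto a Dockerfile pattern roughly like the sketch below. The stage names, file paths, and health-check command are illustrative assumptions; the actual `Dockerfile` and `Dockerfile.web` in the repository are authoritative:

```dockerfile
# Stage 1: install dependencies in an isolated builder layer
FROM python:3.12-slim AS builder
WORKDIR /app
COPY pyproject.toml ./
RUN pip install uv && uv sync

# Stage 2: copy only the runtime environment (keeps the final image small)
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY src/ src/

# Run as a non-root user (UID 1000)
RUN useradd --uid 1000 app
USER app

# Lightweight liveness probe (placeholder command)
HEALTHCHECK CMD ["python", "-c", "import sys; sys.exit(0)"]

CMD ["/app/.venv/bin/python", "-m", "src.cli", "podcast", "pipeline"]
```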
All commands can be run with `uv run`, or directly if you've activated the virtual environment.
```shell
# Run the pipeline (continuous processing optimized for GPU)
uv run poe pipeline

# Or using the CLI directly
uv run python -m src.cli podcast pipeline
```

Check processing status:

```shell
uv run python -m src.cli podcast status
```

Query your library:

```shell
uv run poe query --query "your question here"

# Or directly
uv run python -m src.rag --query "your question here"
```

Manage the File Search store:

```shell
# List all documents in the store
python scripts/file_search_utils.py --action list

# Find duplicate files
python scripts/file_search_utils.py --action find-duplicates

# Delete duplicates (keeps oldest by default)
python scripts/file_search_utils.py --action delete-duplicates

# Delete all files (with confirmation)
python scripts/file_search_utils.py --action delete-all
```

The tool uses Python's built-in logging to track progress and errors. By default, logs are displayed in the console at INFO level.
```shell
# Available levels: DEBUG, INFO, WARNING, ERROR
python src/scheduler.py --log-level DEBUG
python -m src.rag --query "your question" --log-level ERROR
```

When processing fails for any episode, the pipeline logs detailed error information showing:
- Episodes that failed to transcribe
- Episodes that failed metadata extraction
- Episodes that failed indexing
Use `python -m src.cli podcast status` to view the current processing status and any failures.
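One way to accumulate per-stage failures like those listed above is a small tracker that records each failed episode and produces an end-of-run summary. This is an illustrative sketch, not the pipeline's actual code:

```python
import logging
from collections import defaultdict

logger = logging.getLogger("pipeline")

class FailureTracker:
    """Collect failed episodes per stage and summarize them at the end."""

    STAGES = ("transcription", "metadata", "indexing")

    def __init__(self):
        self.failures = defaultdict(list)

    def record(self, stage: str, episode: str, error: Exception) -> None:
        """Log the failure immediately and remember it for the summary."""
        self.failures[stage].append(episode)
        logger.error("%s failed for %s: %s", stage, episode, error)

    def summary(self) -> str:
        """One line per stage that had failures, in pipeline order."""
        lines = [f"{stage}: {len(self.failures[stage])} failed"
                 for stage in self.STAGES if self.failures[stage]]
        return "; ".join(lines) or "all episodes processed"
```

Recording failures per stage (rather than aborting the whole run) lets a scheduled batch job finish the remaining episodes and retry only the failed ones later.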
Run tests using the poe task runner:
```shell
# Run all tests
uv run poe test

# Run tests with coverage
uv run poe cov
```

Contributions are welcome! Please submit a pull request with any improvements or bug fixes. Ensure all tests pass before submitting your PR.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.