
podcastai

Turn any URL / YouTube video / PDF into a visual podcast video — AI dialogue, stock imagery, background music, and a HyperFrames-rendered MP4. All for roughly $0.01 in LLM tokens (gen-podcast's Gemini usage).

How it works

source URL/PDF
     │
     ▼
gen-podcast (Gemini → multi-role dialogue → Edge TTS → mp3 + vtt)
     │
     ▼
Pexels / Pixabay (one image per dialogue segment, from English keywords)
     │
     ▼
Pixabay Music (royalty-free BGM; ducked under narration)
     │
     ▼
HyperFrames (HTML/CSS/GSAP composition → deterministic headless-Chrome render)
     │
     ▼
final.mp4
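
For the ducking step specifically, here is a minimal sketch of what the mixer could do with ffmpeg's sidechaincompress filter: the narration drives the compressor so the BGM dips whenever someone speaks. Filenames are placeholders, and the real tools/audio/audio_mixer.py may implement this differently.

import subprocess

# Illustrative ducking pass (not the repo's actual mixer). Input 0 is the
# narration, input 1 the BGM: sidechaincompress uses the narration as the
# sidechain signal to compress the BGM, then amix layers the narration on
# top of the ducked BGM.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "narration.mp3",   # placeholder: gen-podcast's mp3 output
    "-i", "bgm.mp3",         # placeholder: Pixabay Music track
    "-filter_complex",
    "[1:a][0:a]sidechaincompress=threshold=0.05:ratio=8:attack=50:release=300[ducked];"
    "[0:a][ducked]amix=inputs=2:duration=first[mix]",
    "-map", "[mix]", "mixed.mp3",
], check=True)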

Architecture

podcastai is instruction-driven: the AI agent reads a pipeline manifest + stage director skills and drives the production state machine stage by stage. Python exists for tools and persistence only — no orchestration logic lives in code. This matches the OpenMontage "agent-first" contract.

Read AGENT_GUIDE.md to see exactly what the agent does.
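
To make the split concrete, below is a minimal sketch of the persistence side, assuming per-project progress is checkpointed as JSON. The stage names and checkpoint format are invented for illustration; the real contract lives in AGENT_GUIDE.md and lib/.

import json
from pathlib import Path

# Illustrative only: podcastai's real stage list is defined in
# pipeline_defs/podcast-visualizer.yaml, not hard-coded like this.
STAGES = ["source", "dialogue+tts", "images", "music", "compose", "render"]

def next_stage(project_dir: Path) -> str | None:
    """Return the first stage not yet marked done in the project checkpoint."""
    ckpt = project_dir / "checkpoint.json"
    done = set(json.loads(ckpt.read_text())["done"]) if ckpt.exists() else set()
    return next((s for s in STAGES if s not in done), None)

def mark_done(project_dir: Path, stage: str) -> None:
    """Persist stage completion; the agent, not Python, decides what runs next."""
    ckpt = project_dir / "checkpoint.json"
    done = set(json.loads(ckpt.read_text())["done"]) if ckpt.exists() else set()
    done.add(stage)
    ckpt.write_text(json.dumps({"done": sorted(done)}))

print(next_stage(Path("projects/demo")))  # -> "source" on a fresh project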

Quick start

1. Install

git clone <this-repo>
cd podcastai
make setup          # creates venv, installs Python deps, warms HyperFrames cache

Also install the Node/ffmpeg prerequisites:

  • Node.js ≥ 22 (https://nodejs.org)
  • ffmpeg on PATH (brew install ffmpeg / apt install ffmpeg)

2. Configure

Copy .env.example to .env and set:

GOOGLE_API_KEY=...       # Gemini — gen-podcast's LLM
PEXELS_API_KEY=...       # (or) PIXABAY_API_KEY — at least one

Note: pixabay_music needs no API key. GOOGLE_API_KEY is required because gen-podcast uses Gemini for dialogue generation.
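
For a quick sanity check before running anything, a few lines of Python can confirm the keys are present. This helper is illustrative (not part of the repo) and assumes a flat KEY=value .env in the repo root:

from pathlib import Path

# Parse the .env written above (naive KEY=value parsing, comment lines skipped).
env = {}
for line in Path(".env").read_text().splitlines():
    if line.strip() and not line.lstrip().startswith("#") and "=" in line:
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()

assert env.get("GOOGLE_API_KEY"), "GOOGLE_API_KEY is required (Gemini dialogue)"
assert env.get("PEXELS_API_KEY") or env.get("PIXABAY_API_KEY"), \
    "set at least one stock-image key"
print("env looks OK")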

3. Verify

make preflight           # list configured tools
make hyperframes-doctor  # verify Node/ffmpeg/npx + hyperframes npm package

4. Run end-to-end demo

make demo
# or with custom input:
.venv/bin/python render_demo.py \
    --url "https://en.wikipedia.org/wiki/Podcast" \
    --language zh \
    --playbook flat-motion-graphics \
    --project-name podcast-intro-zh

Output lands at projects/<project-name>/renders/final.mp4.
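
The same driver also scripts cleanly for batch runs. The sketch below assumes only the render_demo.py flags shown above; everything else about its interface is a guess:

import subprocess

# One project per source URL; output lands in projects/<name>/renders/final.mp4.
SOURCES = {
    "podcast-intro-zh": "https://en.wikipedia.org/wiki/Podcast",
    # add more name -> URL pairs here
}

for name, url in SOURCES.items():
    subprocess.run([
        ".venv/bin/python", "render_demo.py",
        "--url", url,
        "--language", "zh",
        "--playbook", "flat-motion-graphics",
        "--project-name", name,
    ], check=True)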

Using as an AI-agent-driven project

Open the project in Claude Code / Cursor / Codex. Say something like:

"Turn https://arxiv.org/abs/2401.02669 into a Chinese podcast video with professional visuals, 8 minutes or less."

The agent will:

  1. Read AGENT_GUIDE.md.
  2. Run preflight, report available tools.
  3. Propose a plan (voices, playbook, duration target) and wait for approval.
  4. Execute the 7 stages, checkpointing at creative stages.
  5. Deliver projects/<name>/renders/final.mp4.

Project layout

podcastai/
├── AGENT_GUIDE.md                    # Read-this-first agent contract
├── PROJECT_CONTEXT.md                # Architecture deep-dive
├── pipeline_defs/
│   └── podcast-visualizer.yaml       # The one pipeline
├── skills/
│   ├── core/                         # Layer 2 — hyperframes, podcast-audio, stock-media
│   ├── meta/                         # reviewer, checkpoint-protocol, onboarding
│   └── pipelines/podcast-visualizer/ # 8 stage director skills
├── tools/
│   ├── podcast/podcast_gen.py        # Wraps gen-podcast CLI
│   ├── graphics/                     # pexels / pixabay / image_selector
│   ├── audio/                        # pixabay_music / audio_mixer
│   ├── video/hyperframes_compose.py  # HyperFrames scaffold + lint + validate + render
│   └── subtitle/subtitle_gen.py      # VTT aggregation
├── styles/                           # clean-professional / flat-motion-graphics playbooks
├── lib/                              # checkpoint, pipeline_loader, style bridge
├── schemas/                          # JSON schemas for all artifacts
├── projects/                         # (gitignored) run workspaces
├── render_demo.py                    # End-to-end URL → MP4 driver
└── Makefile                          # setup, preflight, demo, hyperframes-doctor

Dependencies

Role                 Tool                                       Required?
Dialogue generation  Gemini (via gen-podcast)                   yes (GOOGLE_API_KEY)
TTS                  Edge TTS (via gen-podcast, no key)         implicit
Images               Pexels or Pixabay                          one of PEXELS_API_KEY / PIXABAY_API_KEY
Music                Pixabay Music scraper                      no key
Render               HyperFrames npm + Node.js ≥ 22 + ffmpeg    yes (local)
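
If you want to check the local render prerequisites by hand rather than via make hyperframes-doctor, a rough (illustrative-only) equivalent:

import shutil
import subprocess

# Binaries the render path needs on PATH, per the table above.
for binary in ("node", "npx", "ffmpeg"):
    print(f"{binary}: {shutil.which(binary) or 'MISSING'}")

# HyperFrames requires Node.js >= 22.
if shutil.which("node"):
    out = subprocess.run(["node", "--version"], capture_output=True, text=True).stdout
    major = int(out.strip().lstrip("v").split(".")[0])
    print("node version OK" if major >= 22 else f"node too old: {out.strip()}")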

Relationship to OpenMontage

podcastai is a focused subset of the OpenMontage architecture — same instruction-driven contract, same tool-contract base class, same checkpoint/reviewer meta skills, same HyperFrames integration. Scoped down to one pipeline, one render runtime, stock media only.

If you want:

  • avatar/lip-sync presenters → use OpenMontage's avatar-spokesperson
  • Remotion React scenes → use OpenMontage's animated-explainer
  • word-level burned captions → use OpenMontage's remotion_caption_burn

podcastai handles the "podcast → visual companion video" case in an opinionated, cheap way.

License

TBD (likely Apache-2.0 to match gen-podcast + OpenMontage).

About

Use an agent (Claude Code / Codex / Cursor) to control the gen-podcast + HyperFrames pipeline workflow.
