
[Feature Proposal] /bmad-podcast Skill - Autonomous Markdown-to-Kokoro Audio Pipeline #1731

@noel4nopun

Description


Hey everyone! Following up on the Discord conversation about a local NotebookLM-style connector.

I've been experimenting in my workspace with intercepting markdown specs and routing them into a high-speed local TTS engine, effectively creating an autonomous "War Room" podcast briefing. It works incredibly well and aligns perfectly with the idea for a /bmad-podcast <filename> skill.

Here is a breakdown of the dual-layer architecture I'm using, along with the core Python script, so you can see how it might plug natively into the BMAD framework.

🎯 The Architecture

Layer 1: The Scriptwriter (Gemini 2.5 Flash)
We take the raw markdown file (e.g., an implementation plan or tech spec) and pass it to Gemini with a strict system prompt. The prompt forces the LLM to write a 2-person dialogue ([S1] and [S2]) evaluating the architecture, explicitly forbidding standard "podcast fluff" and focusing on a constructive, critical briefing.

Layer 2: The TTS Synthesis (Kokoro-82M)
We pipe the resulting script straight into Kokoro-82M, an insanely fast open-weights TTS model running locally. Because Kokoro is so lightweight, a local GPU renders the audio chunks almost instantly. A simple regex router sends [S1] blocks to a professional male voice and [S2] to a female voice, then concatenates them into a single .wav.
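The speaker router really is that simple: `re.split` with a capturing group keeps the `[S1]`/`[S2]` delimiters in the result instead of discarding them, so each tag can be re-attached to the text that follows it. A quick illustration (the script text here is made up, not real model output):

```python
import re

# Illustrative LLM output; real scripts come from the Gemini call below
script = "[S1] Let's walk the caching design. [S2] The invalidation story worries me."

# The capturing group in the pattern keeps the [S1]/[S2] delimiters in the output
parts = [p.strip() for p in re.split(r'(\[S[12]\])', script) if p.strip()]
print(parts)
# → ['[S1]', "Let's walk the caching design.", '[S2]', 'The invalidation story worries me.']
```

Each tag element then tells the synthesis loop which Kokoro voice to use for the block that follows.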

πŸ’» The Core Implementation

Here is the core logical loop from the standalone kokoro_narrator.py script that handles the orchestration. It could easily be adapted into a standard BMAD Skill class.

import soundfile as sf
import numpy as np
from kokoro import KPipeline
import re
from google import genai
from google.genai import types

# 1. The Prompt Logic (Enforcing Speaker Tags)
SYSTEM_PROMPT = """You are an elite engineering duo. Translate this dry spec into an engaging, analytical strategic breakdown.
CRITICAL RULES:
1. FORMAT: strictly alternate between [S1] (Puck, lead architect) and [S2] (River, strategist).
2. NEVER say "S1" or "S2" in spoken dialogue. Tags exist ONLY at the very beginning of a block.
3. Keep it critical and analytical. Avoid "welcome to the podcast" fluff."""

# 2. Text-to-Script Generation
def rewrite_to_podcast(api_key: str, text: str) -> list:
    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=text,
        config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT, temperature=0.8)
    )

    # Split using a capturing group so the [S1]/[S2] delimiters are kept
    raw_script = response.text
    parts = re.split(r'(\[S[12]\])', raw_script)
    blocks = []
    current_speaker = None

    for part in parts:
        part = part.strip()
        if not part:
            continue
        if part in ('[S1]', '[S2]'):
            current_speaker = part
        elif current_speaker:
            blocks.append(f"{current_speaker} {part}")
        else:
            blocks.append(part)
    return blocks

# 3. Text-to-Speech Synthesis via Kokoro
def generate_speech(blocks: list, output_file: str):
    pipeline = KPipeline(lang_code='a')  # 'a' = American English
    audio_chunks = []
    sample_rate = 24000  # Kokoro outputs 24 kHz audio

    for block in blocks:
        if not block:
            continue

        # Route voices based on the generated LLM tags
        current_voice = 'af_river' if block.startswith('[S2]') else 'am_puck'

        # Strip speaker tags and markdown emphasis so they aren't read aloud
        spoken_text = block.replace('[S1]', '').replace('[S2]', '').replace('*', '').strip()
        if not spoken_text:
            continue

        generator = pipeline(spoken_text, voice=current_voice, speed=1, split_pattern=r'\n+')
        for _, _, audio in generator:
            audio_chunks.append(audio)

    if audio_chunks:
        final_audio = np.concatenate(audio_chunks)
        sf.write(output_file, final_audio, sample_rate)
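For running the standalone script before any BMAD integration exists, a thin argparse entry point ties the two stages together. The flag names below are my own invention, not part of any BMAD or Kokoro convention:

```python
import argparse

def build_cli() -> argparse.ArgumentParser:
    # Hypothetical CLI surface for kokoro_narrator.py; flag names are illustrative
    parser = argparse.ArgumentParser(prog='kokoro_narrator')
    parser.add_argument('markdown_file', help='Path to the spec/plan to narrate')
    parser.add_argument('--api-key', required=True, help='Gemini API key')
    parser.add_argument('-o', '--output', default='briefing.wav', help='Output .wav path')
    return parser

# Typical wiring (requires the functions above, plus a local GPU for fast synthesis):
#   args = build_cli().parse_args()
#   text = open(args.markdown_file, encoding='utf-8').read()
#   generate_speech(rewrite_to_podcast(args.api_key, text), args.output)
```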

πŸš€ Converting to a BMAD Skill

To turn this into a native /bmad-podcast <filename> command:

  1. The skill would extract the target_file argument from the user's prompt.
  2. Read the markdown contents.
  3. Execute the 2-stage inference pass (Script Generation -> TTS Generation).
  4. Save the .wav output to .bmad/outputs/ or a staging folder.
  5. Return the file path to the user in the CLI/chat.
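The five steps above can be sketched as a single orchestration function. Since I'm guessing at the eventual Skill interface, this is deliberately framework-agnostic: the two stage functions are injected as callables, and `run_bmad_podcast` plus the `.bmad/outputs` default are my assumptions, not existing BMAD API:

```python
from pathlib import Path
from typing import Callable

def run_bmad_podcast(target_file: str,
                     rewrite_fn: Callable,   # e.g. rewrite_to_podcast above
                     synth_fn: Callable,     # e.g. generate_speech above
                     api_key: str,
                     out_dir: str = '.bmad/outputs') -> str:
    """Read the markdown, run the 2-stage pass, save the .wav, return its path."""
    text = Path(target_file).read_text(encoding='utf-8')          # step 2
    blocks = rewrite_fn(api_key, text)                            # step 3 (scripting)
    out_path = Path(out_dir) / (Path(target_file).stem + '.wav')  # step 4
    out_path.parent.mkdir(parents=True, exist_ok=True)
    synth_fn(blocks, str(out_path))                               # step 3 (TTS)
    return str(out_path)                                          # step 5
```

Injecting the callables also makes the skill trivially testable with stubs, with no GPU or API key needed in CI.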

Hopefully this concept helps accelerate a native NotebookLM-style feature! Let me know if you have any questions about the Kokoro integration or the speaker-routing prompts.
