
[Feature Proposal] /bmad-podcast Skill - Autonomous Markdown-to-Kokoro Audio Pipeline #1731

@noel4nopun

Description


Hey everyone! Following up on the Discord conversation about a local NotebookLM-style connector.

I've been experimenting in my workspace with intercepting markdown specs and routing them into a high-speed local TTS engine, effectively creating an autonomous "War Room" podcast briefing. It works incredibly well and aligns perfectly with the idea for a /bmad-podcast <filename> skill.

Here is a breakdown of the dual-layer architecture I'm using, along with the core Python script, so you can see how it might plug natively into the BMAD framework.

🎯 The Architecture

Layer 1: The Scriptwriter (Gemini 2.5 Flash)
We take the raw markdown file (e.g., an implementation plan or tech spec) and pass it to Gemini with a strict system prompt. The prompt forces the LLM to write a 2-person dialogue ([S1] and [S2]) evaluating the architecture, explicitly forbidding standard "podcast fluff" and focusing on a constructive, critical briefing.

Layer 2: The TTS Synthesis (Kokoro-82M)
We pipe the resulting script straight into Kokoro-82M, an insanely fast open-weights TTS model running locally. Because Kokoro is so lightweight, a local GPU renders the audio chunks almost instantly. A simple regex router sends [S1] blocks to a professional male voice and [S2] to a female voice, then concatenates them into a single .wav.
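The speaker router really is that simple: `re.split` with a capturing group keeps the `[S1]`/`[S2]` delimiters in the result instead of discarding them, so each tag can be re-attached to the text that follows it. A quick illustration (the script text here is made up, not real model output):

```python
import re

# Illustrative LLM output; real scripts come from the Gemini call below
script = "[S1] Let's walk the caching design. [S2] The invalidation story worries me."

# The capturing group in the pattern keeps the [S1]/[S2] delimiters in the output
parts = [p.strip() for p in re.split(r'(\[S[12]\])', script) if p.strip()]
print(parts)
# → ['[S1]', "Let's walk the caching design.", '[S2]', 'The invalidation story worries me.']
```

Each tag element then tells the synthesis loop which Kokoro voice to use for the block that follows.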

πŸ’» The Core Implementation

Here is the core logical loop from the standalone kokoro_narrator.py script that handles the orchestration. It could easily be adapted into a standard BMAD Skill class.

import soundfile as sf
import numpy as np
from kokoro import KPipeline
import re
from google import genai
from google.genai import types

# 1. The Prompt Logic (Enforcing Speaker Tags)
SYSTEM_PROMPT = """You are an elite engineering duo. Translate this dry spec into an engaging, analytical strategic breakdown.
CRITICAL RULES:
1. FORMAT: strictly alternate between [S1] (Puck, lead architect) and [S2] (River, strategist).
2. NEVER say "S1" or "S2" in spoken dialogue. Tags exist ONLY at the very beginning of a block.
3. Keep it critical and analytical. Avoid "welcome to the podcast" fluff."""

# 2. Text-to-Script Generation
def rewrite_to_podcast(api_key: str, text: str) -> list:
    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=text,
        config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT, temperature=0.8)
    )

    # Split using a capturing group so the [S1]/[S2] delimiters are kept
    raw_script = response.text
    parts = re.split(r'(\[S[12]\])', raw_script)
    blocks = []
    current_speaker = None

    for part in parts:
        part = part.strip()
        if not part:
            continue
        if part in ('[S1]', '[S2]'):
            current_speaker = part
        elif current_speaker:
            blocks.append(f"{current_speaker} {part}")
        else:
            blocks.append(part)
    return blocks

# 3. Text-to-Speech Synthesis via Kokoro
def generate_speech(blocks: list, output_file: str):
    pipeline = KPipeline(lang_code='a')  # 'a' = American English
    audio_chunks = []
    sample_rate = 24000  # Kokoro outputs 24 kHz audio

    for block in blocks:
        if not block:
            continue

        # Route voices based on the generated LLM tags
        current_voice = 'af_river' if block.startswith('[S2]') else 'am_puck'

        # Strip speaker tags and markdown emphasis so they aren't read aloud
        spoken_text = block.replace('[S1]', '').replace('[S2]', '').replace('*', '').strip()
        if not spoken_text:
            continue

        generator = pipeline(spoken_text, voice=current_voice, speed=1, split_pattern=r'\n+')
        for _, _, audio in generator:
            audio_chunks.append(audio)

    if audio_chunks:
        final_audio = np.concatenate(audio_chunks)
        sf.write(output_file, final_audio, sample_rate)
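For running the standalone script before any BMAD integration exists, a thin argparse entry point ties the two stages together. The flag names below are my own invention, not part of any BMAD or Kokoro convention:

```python
import argparse

def build_cli() -> argparse.ArgumentParser:
    # Hypothetical CLI surface for kokoro_narrator.py; flag names are illustrative
    parser = argparse.ArgumentParser(prog='kokoro_narrator')
    parser.add_argument('markdown_file', help='Path to the spec/plan to narrate')
    parser.add_argument('--api-key', required=True, help='Gemini API key')
    parser.add_argument('-o', '--output', default='briefing.wav', help='Output .wav path')
    return parser

# Typical wiring (requires the functions above, plus a local GPU for fast synthesis):
#   args = build_cli().parse_args()
#   text = open(args.markdown_file, encoding='utf-8').read()
#   generate_speech(rewrite_to_podcast(args.api_key, text), args.output)
```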

πŸš€ Converting to a BMAD Skill

To turn this into a native /bmad-podcast <filename> command:

  1. The skill would extract the target_file argument from the user's prompt.
  2. Read the markdown contents.
  3. Execute the 2-stage inference pass (Script Generation -> TTS Generation).
  4. Save the .wav output to .bmad/outputs/ or a staging folder.
  5. Return the file path to the user in the CLI/chat.
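The five steps above can be sketched as a single orchestration function. Since I'm guessing at the eventual Skill interface, this is deliberately framework-agnostic: the two stage functions are injected as callables, and `run_bmad_podcast` plus the `.bmad/outputs` default are my assumptions, not existing BMAD API:

```python
from pathlib import Path
from typing import Callable

def run_bmad_podcast(target_file: str,
                     rewrite_fn: Callable,   # e.g. rewrite_to_podcast above
                     synth_fn: Callable,     # e.g. generate_speech above
                     api_key: str,
                     out_dir: str = '.bmad/outputs') -> str:
    """Read the markdown, run the 2-stage pass, save the .wav, return its path."""
    text = Path(target_file).read_text(encoding='utf-8')          # step 2
    blocks = rewrite_fn(api_key, text)                            # step 3 (scripting)
    out_path = Path(out_dir) / (Path(target_file).stem + '.wav')  # step 4
    out_path.parent.mkdir(parents=True, exist_ok=True)
    synth_fn(blocks, str(out_path))                               # step 3 (TTS)
    return str(out_path)                                          # step 5
```

Injecting the callables also makes the skill trivially testable with stubs, with no GPU or API key needed in CI.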

Hopefully this concept helps accelerate a native NotebookLM-style feature! Let me know if you have any questions about the Kokoro integration or the speaker-routing prompts.
