Hey everyone! Following up on the Discord conversation about a local NotebookLM-style connector.
I've been experimenting in my workspace with intercepting markdown specs and routing them into a high-speed local TTS engine, effectively creating an autonomous "War Room" podcast briefing. It works incredibly well and aligns perfectly with the idea for a /bmad-podcast <filename> skill.
Here is a breakdown of the dual-layer architecture I'm using, along with the core Python script, so you can see how it might plug natively into the BMAD framework.
🎯 The Architecture
Layer 1: The Scriptwriter (Gemini 2.5 Flash)
We take the raw markdown file (e.g., an implementation plan or tech spec) and pass it to Gemini with a strict system prompt. The prompt forces the LLM to write a 2-person dialogue ([S1] and [S2]) evaluating the architecture, explicitly forbidding standard "podcast fluff" and focusing on a constructive, critical briefing.
Layer 2: The TTS Synthesis (Kokoro-82M)
We pipe the resulting script straight into Kokoro-82M, an insanely fast open-weights TTS model running locally. Because Kokoro is so lightweight, a local GPU renders the audio chunks almost instantly. A simple regex router sends [S1] blocks to a professional male voice and [S2] to a female voice, then concatenates them into a single .wav.
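To make the routing concrete, here is a minimal, standalone sketch of the speaker-tag splitting described above (the sample dialogue is illustrative, not real model output). Splitting with a capture group keeps the `[S1]`/`[S2]` tags in the result so each text chunk can be paired with its voice:

```python
import re

# Illustrative script in the format Layer 1 is prompted to produce
script = "[S1] The caching layer worries me. [S2] Agreed, the TTL is too aggressive."

# The capture group in re.split() keeps the delimiters in the output list
parts = [p.strip() for p in re.split(r'(\[S[12]\])', script) if p.strip()]
# parts now alternates tag, text, tag, text...
print(parts)
# → ['[S1]', 'The caching layer worries me.', '[S2]', 'Agreed, the TTL is too aggressive.']
```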
💻 The Core Implementation
Here is the core logical loop from the standalone kokoro_narrator.py script that handles the orchestration. It could easily be adapted into a standard BMAD Skill class.
```python
import re

import numpy as np
import soundfile as sf
from google import genai
from google.genai import types
from kokoro import KPipeline

# 1. The Prompt Logic (Enforcing Speaker Tags)
SYSTEM_PROMPT = """You are an elite engineering duo. Translate this dry spec into an engaging, analytical strategic breakdown.
CRITICAL RULES:
1. FORMAT: strictly alternate between [S1] (Puck, lead architect) and [S2] (River, strategist).
2. NEVER say "S1" or "S2" in spoken dialogue. Tags exist ONLY at the very beginning of a block.
3. Keep it critical and analytical. Avoid "welcome to the podcast" fluff."""

# 2. Text-to-Script Generation
def rewrite_to_podcast(api_key: str, text: str) -> list:
    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=text,
        config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT, temperature=0.8),
    )
    # Split using regex to keep the delimiters (S1/S2) attached to their text
    raw_script = response.text
    parts = re.split(r'(\[S[12]\])', raw_script)
    blocks = []
    current_speaker = None
    for part in parts:
        part = part.strip()
        if not part:
            continue
        if part in ('[S1]', '[S2]'):
            current_speaker = part
        elif current_speaker:
            blocks.append(f"{current_speaker} {part}")
        else:
            blocks.append(part)
    return blocks

# 3. Text-to-Speech Synthesis via Kokoro
def generate_speech(blocks: list, output_file: str):
    pipeline = KPipeline(lang_code='a')  # 'a' = American English
    audio_chunks = []
    sample_rate = 24000  # Kokoro renders 24 kHz audio
    for block in blocks:
        if not block:
            continue
        # Route voices based on the generated LLM tags
        current_voice = 'af_river' if block.startswith('[S2]') else 'am_puck'
        # Clean speaker tags so they aren't read aloud by the TTS engine
        spoken_text = block.replace('[S1]', '').replace('[S2]', '').replace('*', '').strip()
        if not spoken_text:
            continue
        generator = pipeline(spoken_text, voice=current_voice, speed=1, split_pattern=r'\n+')
        for _, _, audio in generator:
            audio_chunks.append(audio)
    if audio_chunks:
        final_audio = np.concatenate(audio_chunks)
        sf.write(output_file, final_audio, sample_rate)
```

🚀 Converting to a BMAD Skill
To turn this into a native /bmad-podcast <filename> command:
- The skill would extract the `target_file` argument from the user's prompt.
- Read the markdown contents.
- Execute the 2-stage inference pass (Script Generation -> TTS Generation).
- Save the `.wav` output to `.bmad/outputs/` or a staging folder.
- Return the file path to the user in the CLI/chat.
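As a rough sketch of the first and fourth steps, here is one way the command parsing and output-path staging could look. The helper names (`parse_podcast_command`, `output_path_for`) are hypothetical, not existing BMAD APIs:

```python
import re
from pathlib import Path

def parse_podcast_command(prompt: str):
    """Extract the target_file argument from a /bmad-podcast invocation (hypothetical helper)."""
    match = re.match(r'/bmad-podcast\s+(\S+)', prompt.strip())
    return match.group(1) if match else None

def output_path_for(target_file: str, out_dir: str = '.bmad/outputs') -> Path:
    """Mirror the spec's filename into the staging folder as a .wav (hypothetical convention)."""
    return Path(out_dir) / (Path(target_file).stem + '.wav')

print(parse_podcast_command('/bmad-podcast docs/tech-spec.md'))  # → docs/tech-spec.md
print(output_path_for('docs/tech-spec.md'))                      # → .bmad/outputs/tech-spec.wav
```

The skill body would then just chain `rewrite_to_podcast` and `generate_speech` between these two helpers and return the resulting path to the chat.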
Hopefully this concept helps accelerate a native NotebookLM-style feature! Let me know if you have any questions about the Kokoro integration or the speaker-routing prompts.