TLDR: Turn Any Video Into a Document
This repository contains files to build and enhance an AWS Kiro CLI custom agent that takes video URLs from S3, ScreenPal, YouTube, or Twitch and produces a directory with audio transcription, visual analysis, and a unified markdown document.
This agent follows the three-layer Kiro CLI architecture:
Governance and Standards - Loaded directly into agent reasoning context:
- Code style conventions (named functions, JSDoc requirements)
- JavaScript safety standards (type safety, defensive programming)
- MCP health standards (server reliability, timeout management)
- Video processing standards (quality thresholds, format validation)
Domain-Specific Reference - Queried via /knowledge search:
- Video processing workflows and examples
- Platform-specific API documentation
- Troubleshooting guides and best practices
- MCP server configuration patterns
Real-Time Documentation - Triggered via use context7:
- Latest official documentation from source
- Fresh API references and examples
- Current installation and setup procedures
- Live troubleshooting and error resolution
Why This Matters:
- Steering docs ensure consistent, safe video processing
- Knowledge base provides searchable reference materials
- Context7 delivers fresh documentation for fast-moving domains
- Clean separation prevents context pollution while ensuring accuracy
Estimated costs: Development took about 90 Kiro credits. Budget roughly 30 credits to deploy (redeployment hasn't been tested). Creating a document from a short video takes about 3 credits; longer videos haven't been tested. This fits within the 50-credit monthly free tier, especially with claude-haiku-4.5 (0.4x credit multiplier as of January 2026).
Development time: ~6 hours for planning and building; ~5 minutes to process a short video, capture screenshots, and generate a unified report.
MCP Servers:
- video-transcriber: Audio transcription with Whisper
- vision-server: Frame analysis with Moondream2 VLM
- ffmpeg-mcp: Frame extraction with scene detection
This repository contains agent development documentation that can confuse the screenpal-video-transcriber agent: when run from this directory, the agent may conclude it is supposed to be creating an agent rather than acting as one. Run the agent from a different directory to avoid this.
For production use, consider moving development documentation to a separate repository.
The screenpal-video-transcriber agent provides a complete unified workflow for any supported video platform:
- Platform Detection: Auto-detects ScreenPal, YouTube, Twitch, or S3 from URL patterns
- Audio Transcription: Extract and transcribe speech using OpenAI Whisper
- Visual Analysis: Extract video frames at scene changes and analyze with Moondream2 VLM
- Unified Document: Automatically correlate audio and visual data by timestamp with platform metadata
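The platform-detection step above can be sketched as simple URL-pattern matching. This is an illustrative assumption about how detection could work, not the agent's actual implementation; the patterns and function name are hypothetical:

```python
import re

# Ordered (pattern, platform) pairs; first match wins.
# These regexes are illustrative, not the agent's exact rules.
PLATFORM_PATTERNS = [
    (re.compile(r"(go\.)?screenpal\.com"), "screenpal"),
    (re.compile(r"(www\.)?(youtube\.com|youtu\.be)"), "youtube"),
    (re.compile(r"(www\.)?twitch\.tv/videos/"), "twitch"),
    (re.compile(r"\.s3[\w.-]*\.amazonaws\.com"), "s3"),
]

def detect_platform(url: str) -> str:
    """Return the platform name for a video URL, or 'unknown'."""
    for pattern, platform in PLATFORM_PATTERNS:
        if pattern.search(url):
            return platform
    return "unknown"
```

Keeping detection table-driven like this makes adding a new platform a one-line change.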
Output: Three integrated files created in ~/Downloads/video-transcripts-{timestamp}/:
- `{video-id}-UNIFIED.json` - Structured data combining audio segments with visual frames
- `{video-id}-UNIFIED.md` - Human-readable synchronized walkthrough
- `{video-id}-frames/` - Extracted PNG frames for reference
The agent automatically handles all three steps in one request:
- Extracts audio and transcribes to text
- Extracts video frames and analyzes visuals
- Creates unified document correlating audio + visual by timestamp
No manual tool selection needed - just provide a video URL from any supported platform.
The agent orchestrates a unified three-stage pipeline:
Stage 1: Audio Extraction & Transcription
- yt-dlp extracts audio stream from video URL (supports ScreenPal, YouTube, Twitch, S3)
- OpenAI Whisper transcribes to timestamped text segments
Stage 2: Visual Analysis
- FFmpeg extracts frames at scene changes (threshold: 0.4)
- Moondream2 VLM analyzes each frame for UI elements and content
- Generates detailed visual descriptions with timestamps
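Stage 2's scene-change extraction boils down to an FFmpeg `select` filter keyed on the scene score. A hedged sketch of how such a command line could be assembled (the actual ffmpeg-mcp server may construct it differently; the function name is hypothetical):

```python
def build_frame_extraction_cmd(video: str, out_dir: str,
                               scene_threshold: float = 0.4) -> list[str]:
    """Build an ffmpeg command that writes one PNG per detected scene change."""
    # select='gt(scene,T)' keeps only frames whose scene-change score exceeds T;
    # -vsync vfr emits frames at their original timestamps instead of duplicating.
    return [
        "ffmpeg", "-i", video,
        "-vf", f"select='gt(scene,{scene_threshold})'",
        "-vsync", "vfr",
        f"{out_dir}/frame-%04d.png",
    ]
```

Raising the threshold toward 1.0 yields fewer, more distinct frames; lowering it captures subtler changes at the cost of more near-duplicates.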
Stage 3: Unified Document Creation
- Correlates audio segments with visual frames by timestamp
- Creates synchronized JSON and Markdown documents
- Stores all outputs in `~/Downloads/video-transcripts-{timestamp}/`
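Stage 3's correlation step can be sketched as pairing each extracted frame with the transcript segment whose time span contains it. The field names below are assumptions modeled on Whisper-style segments; the agent's actual unified JSON may carry more metadata:

```python
def correlate(segments, frames):
    """Attach each frame to the audio segment covering its timestamp.

    segments: list of {"start": sec, "end": sec, "text": str}  (Whisper-style)
    frames:   list of {"timestamp": sec, "description": str}   (VLM output)
    """
    unified = []
    for seg in segments:
        # A frame belongs to a segment when its timestamp falls in [start, end).
        matched = [f for f in frames if seg["start"] <= f["timestamp"] < seg["end"]]
        unified.append({**seg, "frames": matched})
    return unified
```

The resulting list of segment-plus-frames records is the natural skeleton for both the UNIFIED.json and the synchronized markdown walkthrough.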
- video-transcriber-mcp: Audio extraction and Whisper transcription
- ffmpeg-mcp: Frame extraction with scene detection
- moondream-mcp: Visual analysis using Ollama + Moondream2
- yt-dlp: Media stream extraction
- Ollama: Local VLM runtime (native macOS or Docker)
knowledge/
├── transcription-tools/ # Core agent documentation and workflows
├── workflow-automation/ # MCP server setup and configuration
├── screenpal-api/ # ScreenPal platform integration
└── best-practices/ # Video processing best practices
This project uses Perl scripts for system automation and configuration management. Perl provides robust text processing, system integration, and cross-platform compatibility for our video processing workflows.
The main setup script automates the complete installation and configuration process:
```bash
# Make executable and run
chmod +x setup.pl
./setup.pl
```

What it does:
- Verifies Kiro CLI installation
- Installs dependencies (yt-dlp, OpenAI Whisper, uv package manager)
- Clones and builds MCP servers from source
- Configures Ollama with Moondream model
- Creates MCP configuration files
- Sets up agent profiles
- Performs comprehensive verification
Note: This script was functional as of January 2026 but is not actively maintained. If you encounter issues, refer to the manual setup instructions in the documentation.
- Text Processing: Excellent for configuration file manipulation and JSON handling
- System Integration: Native support for shell commands and file operations
- Cross-Platform: Works consistently across macOS, Linux, and Windows
- Mature Ecosystem: Stable libraries for JSON, file handling, and HTTP operations
- Error Handling: Robust error checking and reporting capabilities
The agent includes execute_bash tool for shell command execution, file cleanup, and directory operations alongside the existing video processing and vision analysis capabilities.
- Kiro CLI installed
- Node.js 16+ and npm
- Either native Ollama or Docker
- 4GB+ RAM, 5GB+ disk space
Option 1: GitHub CLI (Recommended for local development)

```bash
# Authenticate with GitHub CLI
gh auth login

# Set up secure token access
./scripts/setup-github-token.sh

# Use secure authentication
./scripts/docker-auth-secure.sh
```

Option 2: AWS Parameter Store (Recommended for production)

```bash
# Store the token securely in AWS Parameter Store
./scripts/store-github-token-aws.sh ghp_your_token_here

# Use secure authentication
./scripts/docker-auth-secure.sh
```

For S3 videos: AWS credentials in the environment (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN if needed)
Note: The setup script automatically installs yt-dlp and OpenAI Whisper dependencies.
```bash
# Run the automated setup script
chmod +x setup.pl
./setup.pl
```

The setup script will:
- Verify Kiro CLI installation
- Install yt-dlp for video extraction
- Install OpenAI Whisper for transcription
- Build MCP servers from source
- Set up Ollama with the Moondream model
- Configure global MCP settings (`~/.kiro/settings/mcp.json`)
- Create the agent profile (`~/.kiro/agents/screenpal-video-transcriber.json`)
- Verify all components
From the project directory:
```bash
cd /path/to/kiro-cli-custom-agent-screenpal-video-transcription
kiro-cli chat --agent screenpal-video-transcriber
```

Note: The agent is automatically discovered when you're in the project directory. No global linking required.
> Please transcribe this ScreenPal video: https://go.screenpal.com/[video-id]
> Please transcribe this YouTube video: https://youtube.com/watch?v=[video-id]
> Please transcribe this Twitch video: https://twitch.tv/videos/[video-id]
> Please transcribe this S3 video: https://bucket.s3.amazonaws.com/video.mp4
The agent will:
- Validate the URL
- Extract audio using yt-dlp
- Transcribe with Whisper
- Extract key frames
- Analyze visual content with Moondream
- Generate comprehensive transcript with visual descriptions
- Create unified document correlating audio + visual by timestamp
Output files created in ~/Downloads/video-transcripts-{timestamp}/:
- `{video-id}-UNIFIED.json` - Structured synchronized data
- `{video-id}-UNIFIED.md` - Human-readable walkthrough
- `{video-id}-frames/` - Extracted PNG frames
The agent uses a two-level MCP configuration system:
Defines all available MCP servers for all agents:
```json
{
  "mcpServers": {
    "video-transcriber": {
      "command": "sh",
      "args": ["-c", "node /tmp/video-transcriber-mcp/dist/index.js 2>/dev/null"],
      "env": {
        "WHISPER_MODEL": "base",
        "YOUTUBE_FORMAT": "bestaudio",
        "WHISPER_DEVICE": "cpu"
      },
      "disabled": false
    },
    "vision-server": {
      "command": "sh",
      "args": ["-c", "node /tmp/moondream-mcp/build/index.js 2>/dev/null"],
      "env": {
        "OLLAMA_BASE_URL": "http://localhost:11434"
      },
      "disabled": false
    },
    "ffmpeg-mcp": {
      "command": "uvx",
      "args": ["video-creator"],
      "env": {
        "SCENE_THRESHOLD": "0.4"
      },
      "disabled": false
    }
  }
}
```

Specialized configuration for this agent:
```json
{
  "name": "screenpal-video-transcriber",
  "description": "Specialized agent for processing ScreenPal videos...",
  "includeMcpJson": true,
  "tools": [
    "fs_read", "fs_write", "knowledge", "execute_bash",
    "@video-transcriber/transcribe_video",
    "@ffmpeg-mcp/extract_frames_from_video",
    "@ffmpeg-mcp/get_video_info",
    "@vision-server/analyze_image",
    "@vision-server/detect_objects",
    "@vision-server/generate_caption"
  ],
  "model": "claude-sonnet-4"
}
```

Key Features:
- `includeMcpJson: true` - Inherits all servers from the global config
- Complete toolchain for video processing and visual analysis
- No server duplication - All MCP servers come from global config
- URL Validation: Automatic platform detection from URL patterns (ScreenPal, YouTube, Twitch, S3)
- Audio Transcription: High-quality speech-to-text with timestamps
- Frame Extraction: Scene-change detection with FFmpeg for key moments
- Detailed Visual Analysis: Complete UI element descriptions including:
- Exact text and button labels
- Window titles and menu items
- Form fields and data displayed
- Visual layout and positioning
- Interactive elements and controls
- Timestamp Correlation: Synchronized audio-visual walkthrough
- Unified Output: Single document combining audio + visual
- Local Storage: Organized output in `~/Downloads/video-transcripts-{timestamp}/`
- Privacy Focused: No data leaves your local environment
"dummy" tool error: MCP server communication failure
- Root cause: MCP servers not properly registering tools with Kiro CLI
- Solution: Restart the agent session: `kiro-cli chat --agent screenpal-video-transcriber`
Tool not found: Missing dependencies or configuration issues
- Solution: Run the setup script: `./setup.pl`
- Check: Verify yt-dlp and Whisper are installed
Ollama not responding: Vision analysis unavailable
- Solution: Start Ollama (`ollama serve`) or check the Docker container
- Verify: `curl -s http://localhost:11434/api/tags`
Frame Extraction Issues: Problems with scene detection or frame quality
- Cause: Incorrect scene threshold or video format issues
- Solution: Adjust scene_threshold parameter or verify video accessibility
- ARCHITECTURE.md - System design and component overview
- MCP-CONFIGURATION.md - Detailed MCP setup and configuration
- TROUBLESHOOTING.md - Common issues and solutions
- Local Processing: All transcription and analysis happens locally
- No Cloud APIs: No external service dependencies
- Secure URLs: Only processes validated platform domains (ScreenPal, YouTube, Twitch, S3)
- Controlled Access: Agent permissions include video processing and visual analysis tasks