Read time: 5 minutes
Goal: Understand the mental model before diving into code
You ask an LLM to do multi-step work:
- "Analyze this codebase"
- "Find performance bottlenecks"
- "Suggest optimizations"
- "Verify your suggestions"
- "Write a report"
What usually happens:
- Agent writes a wall of text
- You can't verify if it actually did the work
- No structured output (just narrative)
- If something's wrong, you can't tell where it failed
- Hard to reuse or automate
What you wish happened:
- Agent produces evidence of each step
- You can verify work was done
- Outputs are machine-readable
- Failures are clear and debuggable
- System is repeatable and composable
Think of this like a CI/CD pipeline for LLM work:
┌─────────────────────────────────────────────────────┐
│ ORCHESTRATOR (Main Skill) │
│ - Creates session directory (isolated run folder) │
│ - Spawns phases in order │
│ - Validates outputs before continuing │
│ - Aggregates final results │
└─────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ PHASE 1: Initial Analysis │
│ - Reads instructions │
│ - Does work │
│ - Writes report file │
│ - Returns JSON │
└───────────────────────────────────────┘
│
▼ (orchestrator validates)
┌───────────────────────────────────────┐
│ PHASE 2: Deep Analysis │
│ - Reads Phase 1 report │
│ - Does more work │
│ - Writes report file │
│ - Returns JSON │
└───────────────────────────────────────┘
│
▼ (orchestrator validates)
┌───────────────────────────────────────┐
│ PHASE 3: Risk Assessment │
│ - Reads Phase 1-2 reports │
│ - Calculates risks │
│ - Writes report file │
│ - Returns JSON │
└───────────────────────────────────────┘
│
▼ (orchestrator validates)
┌───────────────────────────────────────┐
│ PHASE 4: VERIFICATION (KEY PHASE) │
│ - Reads Phase 2-3 conclusions │
│ - RUNS REAL SCRIPT │
│ - Compares script vs conclusions │
│ - Writes verification report │
│ - Returns JSON │
└───────────────────────────────────────┘
│
▼ (orchestrator validates)
┌───────────────────────────────────────┐
│ PHASE 5: Final Recommendations │
│ - Synthesizes all prior work │
│ - Prioritizes actions │
│ - Writes final report │
│ - Returns JSON │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ ORCHESTRATOR OUTPUT │
│ { │
│ "status": "complete", │
│ "session_dir": "...", │
│ "phase_reports": {...}, │
│ "final_summary": {...} │
│ } │
└───────────────────────────────────────┘
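The flow in the diagram can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the reference implementation: `run_workflow`, the `run_phase` and `validate` callables, and the `"failed"` result shape (`failed_phase`, `error`) are hypothetical names chosen for illustration. How a phase agent is actually spawned is left to the caller.

```python
from typing import Callable

# Hypothetical signature: takes (phase_name, session_dir) and returns the
# raw JSON text produced by the phase agent.
PhaseRunner = Callable[[str, str], str]
# Hypothetical signature: takes the raw JSON text and returns the parsed
# result, raising ValueError if the output is unacceptable.
Validator = Callable[[str], dict]


def run_workflow(phases: list[str], session_dir: str,
                 run_phase: PhaseRunner, validate: Validator) -> dict:
    """Run phases in order, stopping at the first one that fails validation."""
    phase_reports: dict[str, str] = {}
    result: dict = {}
    for name in phases:
        raw = run_phase(name, session_dir)
        try:
            result = validate(raw)
        except ValueError as exc:
            # Fail fast: later phases depend on this phase's report.
            return {
                "status": "failed",
                "session_dir": session_dir,
                "failed_phase": name,
                "error": str(exc),
                "phase_reports": phase_reports,
            }
        phase_reports[name] = result["report_path"]
    return {
        "status": "complete",
        "session_dir": session_dir,
        "phase_reports": phase_reports,
        # Assumption: the last phase's summary serves as the final summary.
        "final_summary": result.get("phase_summary", {}),
    }
```

Passing the runner and validator in as callables keeps the loop itself trivial to test with fakes, which fits the pattern's emphasis on verifiable steps.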
Every workflow run gets its own isolated folder:
reports/runs/2025-01-15_143022/
├── 01-initial-analysis.md
├── 02-deep-analysis.md
├── 03-risk-assessment.md
├── 04-verification.md
└── 05-recommendations.md
Why: Evidence. You can inspect what was actually done.
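Creating the run folder might look like the sketch below. The function name `create_session_dir` and the optional `now` parameter are assumptions for illustration; only the timestamp format is taken from the example path above.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def create_session_dir(base: Path, now: Optional[datetime] = None) -> Path:
    """Create and return a fresh run folder such as base/2025-01-15_143022."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H%M%S")
    session_dir = base / stamp
    # exist_ok=False: refuse to reuse a folder, so runs never mix their evidence.
    session_dir.mkdir(parents=True, exist_ok=False)
    return session_dir
```

Usage: `create_session_dir(Path("reports/runs"))` returns the new folder, which the orchestrator can then hand to each phase.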
Every phase MUST return this format:
{
"status": "complete",
"report_path": "/absolute/path/to/report.md",
"phase_summary": {
"key1": "value1",
"key2": "value2"
}
}
Why: Machine-readable. The orchestrator can validate programmatically.
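On the phase side, producing this contract can be as simple as the sketch below. `finish_phase` is a hypothetical helper, not part of the reference implementation; it assumes the phase already has its report text and summary in hand.

```python
import json
from pathlib import Path


def finish_phase(report_path: Path, report_text: str, phase_summary: dict) -> str:
    """Write the phase report, then return the JSON contract as a string."""
    report_path.write_text(report_text, encoding="utf-8")
    result = {
        "status": "complete",
        # resolve() yields the absolute path the contract asks for.
        "report_path": str(report_path.resolve()),
        "phase_summary": phase_summary,
    }
    return json.dumps(result, indent=2)
```

Writing the file before building the JSON means a returned `report_path` always points at something that exists.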
After each phase, orchestrator checks:
- ✅ JSON is valid (not malformed)
- ✅ `status` is "complete"
- ✅ `report_path` file exists on disk
- ✅ Required summary keys are present
Why: Fail-fast. If Phase 2 fails, don't waste time on Phase 3-5.
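The four checks above could be implemented roughly as follows. This is a sketch, assuming the phase returns its JSON as a string and that each phase declares which summary keys it requires; `validate_phase_output` and `required_keys` are hypothetical names.

```python
import json
from pathlib import Path


def validate_phase_output(raw: str, required_keys: set[str]) -> dict:
    """Return the parsed phase result, or raise ValueError on the first failed check."""
    # Check 1: JSON is valid.
    try:
        result = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(result, dict):
        raise ValueError("phase output must be a JSON object")
    # Check 2: status is "complete".
    if result.get("status") != "complete":
        raise ValueError(f"status is {result.get('status')!r}, expected 'complete'")
    # Check 3: the report file exists on disk.
    report_path = result.get("report_path")
    if not isinstance(report_path, str) or not Path(report_path).is_file():
        raise ValueError(f"report file not found: {report_path!r}")
    # Check 4: required summary keys are present.
    summary = result.get("phase_summary")
    if not isinstance(summary, dict):
        raise ValueError("phase_summary must be a JSON object")
    missing = required_keys - set(summary)
    if missing:
        raise ValueError(f"phase_summary is missing keys: {sorted(missing)}")
    return result
```

Raising on the first failure gives the orchestrator a single, specific error message to report before it stops the run.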
The verification phase is what makes the results trustworthy:
- Reads conclusions from Phases 2-3
- Runs an ACTUAL SCRIPT (not LLM analysis)
- Compares script output vs manual conclusions
- Reports: confirmed, revised, unexpected findings
Why: Empirical validation. A script's output is measured, not generated from the model's own reasoning.
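The comparison step can be sketched as a function that sorts findings into the three reported buckets. Everything here is an assumption made for illustration: the name `compare_findings`, the representation of conclusions and script output as flat dictionaries, the use of exact equality (real metrics may need tolerances), and the extra `unverified` list for conclusions the script did not cover.

```python
def compare_findings(conclusions: dict, script_output: dict) -> dict:
    """Classify each script finding as confirmed, revised, or unexpected."""
    confirmed, revised, unexpected = {}, {}, {}
    for key, measured in script_output.items():
        if key not in conclusions:
            # The script found something the earlier phases never mentioned.
            unexpected[key] = measured
        elif conclusions[key] == measured:
            confirmed[key] = measured
        else:
            # Keep both values so the report shows what changed.
            revised[key] = {"claimed": conclusions[key], "measured": measured}
    # Conclusions the script could not check are listed, not silently dropped.
    unverified = sorted(set(conclusions) - set(script_output))
    return {
        "confirmed": confirmed,
        "revised": revised,
        "unexpected": unexpected,
        "unverified": unverified,
    }
```

The counts of each bucket are what would feed a line such as "Script confirmed N conclusions. Revised M." in the verification report.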
Each phase has a corresponding instruction file:
references/
├── 01-phase-1.md # "Do exactly this for Phase 1"
├── 02-phase-2.md # "Do exactly this for Phase 2"
├── 03-phase-3.md
├── 04-verify-with-script.md # "Run this script, compare results"
└── 05-phase-5.md
Why: Repeatable behavior. Every run follows the same instructions, so outputs stay consistent and comparable across runs.
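One way to look up the instruction file for a phase is by its two-digit prefix, as in the listing above. This is a sketch; `load_reference` is a hypothetical helper, and it assumes exactly one file per phase number.

```python
from pathlib import Path


def load_reference(references_dir: Path, phase_number: int) -> str:
    """Return the instruction text for a phase, matched by its two-digit prefix."""
    matches = sorted(references_dir.glob(f"{phase_number:02d}-*.md"))
    if len(matches) != 1:
        # Zero or several matches would make the phase's instructions ambiguous.
        raise FileNotFoundError(
            f"expected exactly one reference doc for phase {phase_number}, "
            f"found {len(matches)}"
        )
    return matches[0].read_text(encoding="utf-8")
```

Matching on the prefix rather than the full filename lets a phase be renamed (for example `04-verify-with-script.md`) without touching the lookup.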
Think of it as a scientific experiment protocol:
- Hypothesis Phase (1-3): Analyze, form conclusions
- Verification Phase (4): Run experiment to test hypothesis
- Conclusion Phase (5): Synthesize validated findings
Or as a code review pipeline:
- Analysis: What's the code doing?
- Linting: Run automated checks
- Testing: Run test suite
- Verification: Compare manual vs automated findings
- Recommendation: What should we change?
Or as a forensic investigation:
- Scene Analysis: What happened?
- Evidence Collection: Gather data
- Risk Assessment: What's the impact?
- Lab Testing: Run forensic tests
- Report: Official findings
Traditional LLM workflow:
User: "Analyze this and give recommendations"
Agent: [writes 5000 words of text]
User: "Uh... is this correct?"
Agent: "Yes! Trust me!"
User: "How do I verify?"
Agent: "...you could manually check everything I said?"
Test harness workflow:
User: "Run schema optimization workflow"
Orchestrator: "Creating session directory..."
Orchestrator: "Phase 1 complete. Report: ./01-analysis.md"
Orchestrator: "Phase 2 complete. Report: ./02-utilization.md"
Orchestrator: "Phase 3 complete. Report: ./03-impact.md"
Orchestrator: "Phase 4 running verification script..."
Orchestrator: "Script confirmed 21/23 conclusions. Revised 2."
Orchestrator: "Phase 5 complete. Final JSON: {...}"
User: "I can inspect all 5 reports. Phase 4 ran a real script. Trust verified."
By the end of this lab, you'll be able to:
- Understand the test harness pattern and why it's powerful
- Navigate the reference implementation (schema-optimization)
- Modify phases and reference docs for your needs
- Build your own multi-phase workflow from scratch
- Debug when phases fail or return invalid outputs
- Deploy production-ready workflows using this pattern
| Term | Meaning |
|---|---|
| Orchestrator | Main skill that creates session dir, spawns phases, validates outputs |
| Phase Agent | Subagent that executes one step of the workflow |
| Session Directory | Isolated run folder with timestamp (e.g., runs/2025-01-15_143022/) |
| Reference Doc | Step-by-step instructions for a phase (e.g., references/01-phase-1.md) |
| JSON Contract | Required output format: { status, report_path, phase_summary } |
| Validation Gate | Orchestrator checks after each phase: JSON valid? File exists? Keys present? |
| Verification Phase | Phase that runs a real script to empirically validate prior conclusions |
Option 1: Dive into code
Open schema-optimization/SKILL.md and start reading the orchestrator logic.
Option 2: Learn the pattern deeply
Read GUIDE-01-PATTERN-EXPLAINED.md for architectural breakdown.
Option 3: Hands-on practice
Jump to exercises/exercise-1-run-workflow.md and execute the reference implementation.
Recommended for most people: Read GUIDE-01 next, then do Exercise 1.
Key Insight: This pattern turns "LLM wrote some text" into "LLM executed a validated procedure with evidence and structured outputs."
That's the difference between a chatbot and a production system.
Next: GUIDE-01-PATTERN-EXPLAINED.md