GUIDE 01: Pattern Explained - Deep Dive

Read time: 15 minutes Goal: Understand every component and why it exists

The Architecture (Component Breakdown)

schema-optimization/
├── SKILL.md                    # Orchestrator (the conductor)
├── agents/                     # Phase subagents (the workers)
│   ├── phase_1.md              # "What I do, what I return"
│   ├── phase_2.md
│   ├── phase_3.md
│   ├── phase_4.md
│   └── phase_5.md
├── references/                 # Instructions (the playbook)
│   ├── 01-phase-1.md           # "Step-by-step: do this"
│   ├── 02-phase-2.md
│   ├── 03-phase-3.md
│   ├── 04-verify-with-script.md
│   └── 05-phase-5.md
├── scripts/                    # Deterministic tools (the lab equipment)
│   └── analyze_field_utilization.sh
└── reports/                    # Evidence (the paper trail)
    ├── runs/                   # Session directories
    │   └── 2025-01-15_143022/  # One workflow run
    │       ├── 01-*.md         # Phase outputs
    │       ├── 02-*.md
    │       ├── 03-*.md
    │       ├── 04-*.md
    │       └── 05-*.md
    └── _samples/               # Examples of good outputs

1. The Orchestrator (SKILL.md)

Role: Conductor of the workflow Responsibilities:

Create session directory with timestamp
Spawn phases in sequence
Pass context between phases
Validate outputs before continuing
Aggregate final results

Key Sections:

A. Input Schema

inputs:
  skill_dir: "/absolute/path/to/this/skill"
  input_folder: "/path/to/data"
  extraction_type: "bigquery_json"
  session_dir_base: "reports/runs"  # optional

Why: Orchestrator needs to know where things are and what to process.

B. Session Directory Creation

TIMESTAMP=$(date +%Y-%m-%d_%H%M%S)
SESSION_DIR="${session_dir_base}/${TIMESTAMP}"
mkdir -p "${SESSION_DIR}"

Why: Isolation. Each run is independent. Can compare runs. Evidence persists.

C. Phase Invocation Pattern

For Phase 1:
- Call agent with inputs: {skill_dir, session_dir, reference_path, input_folder, extraction_type}
- Expect JSON: {status, report_path, schema_summary}
- Validate: JSON valid? status="complete"? report_path exists? schema_summary has keys?
- If valid: store phase1_report_path for Phase 2
- If invalid: STOP and return error

Why: Fail-fast. Don't waste time on downstream phases if upstream fails.

D. Phase Chaining

Phase 2 inputs = Phase 1 inputs + {phase1_report_path}
Phase 3 inputs = Phase 1 inputs + {phase1_report_path, phase2_report_path}
Phase 4 inputs = Phase 1 inputs + {phase2_report_path, phase3_report_path, script_path}
Phase 5 inputs = Phase 1 inputs + {phase1_report_path, phase2_report_path, phase3_report_path, phase4_report_path}

Why: Context accumulation. Later phases build on validated earlier work.

E. Validation Rules

After each phase:
1. Parse returned JSON (fail if invalid)
2. Check status="complete" (fail if error or missing)
3. Verify report_path file exists on disk (fail if not)
4. Validate phase-specific summary keys (fail if missing)

Why: Machine-checkable gates. No human judgment needed.

F. Final Output

{
  "status": "complete",
  "session_dir": "/absolute/path",
  "timestamp": "2025-01-15_143022",
  "phase_reports": {
    "phase1": "...",
    "phase2": "...",
    ...
  },
  "final_summary": {
    "total_tables": 42,
    "total_fields": 367,
    "unused_fields": 23,
    "optimization_opportunities": 21,
    "estimated_savings_pct": 15,
    "verification_status": "confirmed"
  }
}

Why: Machine-readable summary of entire workflow. Can be logged, graphed, alerted on.

2. Phase Agents (agents/*.md)

Role: Workers that execute one step Pattern: All phase agents follow same contract

Standard Phase Agent Structure

# Phase N Agent: [Name]

## Inputs (JSON)
{
  "skill_dir": "...",
  "session_dir": "...",
  "reference_path": "...",
  "phaseX_report_path": "...",  # from prior phases
  ...
}

## Task Instructions
1. Read reference document at reference_path
2. Read prior phase reports (if applicable)
3. Execute analysis/work
4. Write report to: <session_dir>/NN-[name].md
5. Return JSON only (no terminal text)

## Output Format (JSON Only)
{
  "status": "complete",
  "report_path": "/absolute/path",
  "phase_summary": {
    "key1": "value1",
    ...
  }
}

## Validation Requirements
- status must be "complete"
- report_path must exist
- phase_summary must have required keys

## Error Handling
If work fails, return:
{
  "status": "error",
  "error_message": "...",
  "report_path": null,
  "phase_summary": null
}

Why this structure:

Inputs section: Clear contract of what's provided
Task instructions: High-level overview (details in reference doc)
Output format: Exact JSON schema expected
Validation requirements: What orchestrator will check
Error handling: Graceful failure mode

Phase-Specific Variations

Phase 1: Initial Analysis

No prior phase inputs
Establishes baseline metrics
Summary: {total_tables, total_fields, key_findings}

Phase 2-3: Analysis Phases

Read Phase 1 report for context
Build on established baseline
Summary: Domain-specific metrics

Phase 4: Verification (CRITICAL)

Reads Phase 2-3 conclusions
RUNS REAL SCRIPT (analyze_field_utilization.sh)
Compares script output vs manual analysis
Summary: {files_analyzed, conclusions_confirmed, conclusions_revised, unexpected_findings, revised_action_items}

Phase 5: Synthesis

Reads ALL prior phase reports
Aggregates validated findings
Summary: {priority_actions, implementation_plan, success_metrics}

3. Reference Docs (references/*.md)

Role: The playbook - step-by-step instructions Why separate from agent contract: Agent knows WHAT to return, reference knows HOW to do work

Anatomy of a Reference Doc

# Phase N: [Name]

## Context
[What this phase accomplishes in the bigger picture]

## Inputs Available to You
- skill_dir: ...
- session_dir: ...
- input_folder: ...
- phaseX_report_path: ...

## Step-by-Step Procedure

### Step 1: [Action]
[Detailed instructions]
[Example commands if applicable]

### Step 2: [Action]
[Detailed instructions]

...

## Report Format

Write to: {session_dir}/NN-[name].md

Required sections:
- ## Summary
- ## Methodology
- ## Findings
- ## Recommendations

## Output Requirements

Return JSON:
{
  "status": "complete",
  "report_path": "<absolute path>",
  "phase_summary": {
    "key1": "...",
    "key2": "..."
  }
}

## Quality Checklist
- [ ] All data sources documented
- [ ] Metrics include methodology
- [ ] Findings are evidence-based
- [ ] Recommendations are actionable

Key difference from agent contract:

Agent contract: "Return JSON with these keys"
Reference doc: "Calculate metrics using this methodology"

Example: Phase 4 Reference (The Critical One)

# Phase 4: Verify Field Utilization with Script

## Context
Phases 2-3 made conclusions about unused fields through manual analysis.
This phase runs a deterministic script to empirically validate those conclusions.

## Key Conclusions from Phase 2/3 to Verify
Read phase2_report_path and phase3_report_path.
Extract:
- Fields marked as "unused" (>90% null)
- Fields marked as "low utilization" (70-90% null)
- Estimated storage savings

## Step-by-Step Procedure

### Step 1: Extract Conclusions
From Phase 2/3 reports, create a list:
- table.field: conclusion (e.g., "users.legacy_id: unused, 100% null")

### Step 2: Execute Verification Script
Run: scripts/analyze_field_utilization.sh <input_folder> <output_folder>

This script:
- Scans all JSON/CSV files in input_folder
- Calculates ACTUAL null percentages
- Produces: <output_folder>/field_utilization_report.json

### Step 3: Parse Script Output
Read: <output_folder>/field_utilization_report.json
Extract:
- files_analyzed
- field_usage_breakdown

### Step 4: Compare Script vs Manual
For each conclusion from Phase 2/3:
- If script confirms (within 5% tolerance): add to "conclusions_confirmed"
- If script differs (>5% difference): add to "conclusions_revised"
- If script finds new issues: add to "unexpected_findings"

### Step 5: Update Action Items
Based on comparison:
- Keep safe actions (script confirmed)
- Revise risky actions (script showed different data)
- Add new actions (script found unexpected issues)

### Step 6: Write Report
Save to: {session_dir}/04-field-utilization-verification.md

Required sections:
- ## Original Conclusions (from Phase 2/3)
- ## Script Execution Results
- ## Comparison Analysis (table)
- ## Revised Action Items

### Step 7: Return JSON
{
  "status": "complete",
  "report_path": "<absolute>",
  "verification_summary": {
    "files_analyzed": N,
    "conclusions_confirmed": ["table.field: ..."],
    "conclusions_revised": ["table.field: was X%, actually Y%"],
    "unexpected_findings": ["table.field: script found issue"],
    "revised_action_items": ["Safe: Remove field X", "Risky: Monitor field Y"]
  }
}

Why this is powerful:

Forces empirical validation (script doesn't lie)
Documents discrepancies (revised conclusions)
Updates recommendations based on evidence
Produces audit trail (can see what changed and why)

4. Scripts (scripts/*.sh)

Role: Deterministic computation (the lab equipment) Why: Remove LLM variability for objective calculations

Characteristics of Good Verification Scripts

1. Deterministic

Same inputs = same outputs (always)
No randomness, no API calls, no LLM usage

2. Fast

Runs in seconds, not minutes
Optimized for local execution

3. Self-Contained

No external dependencies (or minimal)
Validates its own inputs

4. Structured Output

Produces JSON (machine-readable)
Clear schema

5. Error Handling

Validates inputs before processing
Returns meaningful error messages

Example: analyze_field_utilization.sh

#!/bin/bash
# Deterministic field utilization analyzer
# No LLM usage - pure computation

set -euo pipefail  # Fail on errors

# ============================================
# Input Validation
# ============================================
if [ $# -ne 2 ]; then
  echo "Usage: $0 <input_folder> <output_folder>"
  exit 1
fi

INPUT_FOLDER="$1"
OUTPUT_FOLDER="$2"

if [ ! -d "$INPUT_FOLDER" ]; then
  echo "Error: Input folder does not exist: $INPUT_FOLDER"
  exit 1
fi

mkdir -p "$OUTPUT_FOLDER"

# ============================================
# Analysis
# ============================================
FILES_ANALYZED=0
UNUSED_FIELDS=()
LOW_UTIL_FIELDS=()
HIGH_UTIL_FIELDS=()

for file in "$INPUT_FOLDER"/*.json "$INPUT_FOLDER"/*.csv; do
  if [ -f "$file" ]; then
    FILES_ANALYZED=$((FILES_ANALYZED + 1))

    # Parse file and calculate null percentages
    # (Implementation: jq for JSON, awk for CSV)
    # For each field:
    #   - Count total rows
    #   - Count null rows
    #   - Calculate null_pct = (null_rows / total_rows) * 100
    #   - Categorize: >90% = unused, 70-90% = low, <70% = high

    # Placeholder (replace with real parsing):
    # field_null_pct=$(...)
    # if [ $field_null_pct -gt 90 ]; then
    #   UNUSED_FIELDS+=("$file:$field:$field_null_pct")
    # elif [ $field_null_pct -gt 70 ]; then
    #   LOW_UTIL_FIELDS+=("$file:$field:$field_null_pct")
    # else
    #   HIGH_UTIL_FIELDS+=("$file:$field:$field_null_pct")
    # fi
  fi
done

# ============================================
# Output JSON
# ============================================
cat > "$OUTPUT_FOLDER/field_utilization_report.json" <<EOF
{
  "files_analyzed": $FILES_ANALYZED,
  "timestamp": "$(date -Iseconds)",
  "field_usage_breakdown": {
    "unused_fields": $(printf '%s\n' "${UNUSED_FIELDS[@]}" | jq -R . | jq -s .),
    "low_utilization_fields": $(printf '%s\n' "${LOW_UTIL_FIELDS[@]}" | jq -R . | jq -s .),
    "high_utilization_fields": $(printf '%s\n' "${HIGH_UTIL_FIELDS[@]}" | jq -R . | jq -s .)
  },
  "summary": {
    "total_fields": $((${#UNUSED_FIELDS[@]} + ${#LOW_UTIL_FIELDS[@]} + ${#HIGH_UTIL_FIELDS[@]})),
    "unused_count": ${#UNUSED_FIELDS[@]},
    "low_util_count": ${#LOW_UTIL_FIELDS[@]},
    "high_util_count": ${#HIGH_UTIL_FIELDS[@]}
  }
}
EOF

echo "Analysis complete: $OUTPUT_FOLDER/field_utilization_report.json"
exit 0

Why this works:

Validates inputs before processing
Pure computation (no LLM guessing)
Structured JSON output
Can be tested independently
Fast and deterministic

5. Session Directories (reports/runs/)

Role: Evidence repository Pattern: One directory per workflow execution

Structure

reports/runs/
├── 2025-01-15_143022/          # Timestamp = session ID
│   ├── 01-initial-schema-analysis.md
│   ├── 02-field-utilization-analysis.md
│   ├── 03-impact-assessment.md
│   ├── 04-field-utilization-verification.md
│   └── 05-final-recommendations.md
├── 2025-01-15_151435/          # Another run (different data or config)
│   ├── 01-*.md
│   └── ...
└── 2025-01-16_092314/          # Next day's run
    ├── 01-*.md
    └── ...

Benefits:

Audit Trail: Can review exactly what was analyzed
Debugging: If Phase 4 failed, read Phase 1-3 reports to understand context
Comparison: Compare runs across time
Evidence: Proves work was actually done
Rollback: If new run is bad, refer to previous run

Report Format Standards

Each report should include:

# Phase N: [Name]

**Session:** 2025-01-15_143022
**Generated:** 2025-01-15 14:30:45 CST
**Input:** /path/to/data

---

## Executive Summary
[3-5 bullet points of key findings]

---

## Methodology
[How the analysis was performed]

---

## Findings

### [Category 1]
[Detailed findings with evidence]

### [Category 2]
[Detailed findings with evidence]

---

## Data

| Metric | Value |
|--------|-------|
| Total Tables | 42 |
| Total Fields | 367 |
| ... | ... |

---

## Recommendations
- [Actionable recommendation 1]
- [Actionable recommendation 2]

---

## Next Steps
[What Phase N+1 should focus on based on these findings]

---

*Generated by Phase N Agent*
*Report Path: 0N-[name].md*

Why this structure:

Metadata: Know when/where this came from
Executive Summary: TL;DR for stakeholders
Methodology: Reproducibility and trust
Findings: Evidence-based details
Data: Structured metrics (can be graphed)
Recommendations: Actionable outputs
Next Steps: Guide for next phase

6. Sample Outputs (reports/_samples/)

Role: Reference examples Why: Show what good outputs look like

What to Include

Complete session directory with all 5 phase reports
Realistic data (not "lorem ipsum")
Actual verification (Phase 4 shows script output)
Proper formatting (demonstrates report standards)

Learning Value

New users can:

See what finished workflow produces
Understand report quality standards
Learn Phase 4 verification pattern
Copy structure for their own reports

The Critical Phase 4 Pattern (Deep Dive)

Why Phase 4 is the "money shot":

Most LLM workflows are all narrative:

Agent analyzes → writes text
Agent concludes → writes text
Agent recommends → writes text

Problem: No empirical validation. Just LLM's opinion.

Phase 4 solution:

Phases 2-3 make conclusions (LLM analysis)
Phase 4 runs SCRIPT (deterministic computation)
Phase 4 compares: LLM conclusions vs script facts
Phase 4 revises: "Was 95% null, actually 87% null"
Phase 4 updates: "Can't safely remove, need monitoring"

This is the pattern that turns "chatbot output" into "validated engineering work"

How to Adapt for Your Use Case

Code review workflow:

Phases 1-3: LLM analyzes code, suggests improvements
Phase 4: Run linter/formatter/tests, compare LLM suggestions vs actual tool output
Phase 5: Recommendations based on validated issues only

Security audit workflow:

Phases 1-3: LLM analyzes for vulnerabilities
Phase 4: Run SAST/DAST tools, compare LLM findings vs scanner results
Phase 5: Prioritize confirmed vulnerabilities only

Research synthesis workflow:

Phases 1-3: LLM summarizes papers, identifies themes
Phase 4: Run citation analysis script, verify LLM's claimed connections
Phase 5: Synthesis based on validated citation network

Pattern: Always include one phase that runs deterministic tools to validate LLM conclusions.

Design Principles Summary

Evidence-Based
- Every phase writes a file
- Can't claim work without artifact
Machine-Checkable
- JSON outputs can be validated programmatically
- No human judgment needed for gates
Fail-Fast
- Orchestrator stops on first validation failure
- Don't waste time on downstream work
Composable
- Phases are independent
- Can reuse in other workflows
Debuggable
- Clear failure points
- Structured outputs
- Session directories preserve state
Repeatable
- Deterministic scripts
- Same inputs = same outputs
Auditable
- Timestamp-based sessions
- Complete paper trail
- Can review past runs

Next Steps

To see it in action: Read exercises/exercise-1-run-workflow.md and execute the reference implementation.

To build your own: Read GUIDE-02-BUILDING-YOUR-OWN.md for step-by-step adaptation guide.

To debug issues: Read GUIDE-03-DEBUGGING-TIPS.md for common pitfalls and solutions.

Next: GUIDE-02-BUILDING-YOUR-OWN.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GUIDE 01: Pattern Explained - Deep Dive

The Architecture (Component Breakdown)

1. The Orchestrator (SKILL.md)

A. Input Schema

B. Session Directory Creation

C. Phase Invocation Pattern

D. Phase Chaining

E. Validation Rules

F. Final Output

2. Phase Agents (agents/*.md)

Standard Phase Agent Structure

Phase-Specific Variations

3. Reference Docs (references/*.md)

Anatomy of a Reference Doc

Example: Phase 4 Reference (The Critical One)

4. Scripts (scripts/*.sh)

Characteristics of Good Verification Scripts

Example: analyze_field_utilization.sh

5. Session Directories (reports/runs/)

Structure

Report Format Standards

6. Sample Outputs (reports/_samples/)

What to Include

Learning Value

The Critical Phase 4 Pattern (Deep Dive)

How to Adapt for Your Use Case

Design Principles Summary

Next Steps

FilesExpand file tree

GUIDE-01-PATTERN-EXPLAINED.md

Latest commit

History

GUIDE-01-PATTERN-EXPLAINED.md

File metadata and controls

GUIDE 01: Pattern Explained - Deep Dive

The Architecture (Component Breakdown)

1. The Orchestrator (SKILL.md)

A. Input Schema

B. Session Directory Creation

C. Phase Invocation Pattern

D. Phase Chaining

E. Validation Rules

F. Final Output

2. Phase Agents (agents/*.md)

Standard Phase Agent Structure

Phase-Specific Variations

3. Reference Docs (references/*.md)

Anatomy of a Reference Doc

Example: Phase 4 Reference (The Critical One)

4. Scripts (scripts/*.sh)

Characteristics of Good Verification Scripts

Example: analyze_field_utilization.sh

5. Session Directories (reports/runs/)

Structure

Report Format Standards

6. Sample Outputs (reports/_samples/)

What to Include

Learning Value

The Critical Phase 4 Pattern (Deep Dive)

How to Adapt for Your Use Case

Design Principles Summary

Next Steps