
# Exercise 1: Run the Reference Workflow

**Goal:** Execute the schema-optimization workflow end to end and understand how all of its components work together.

**Time:** 20 minutes

**Prerequisites:**

- Read GUIDE-00-START-HERE.md
- Basic familiarity with bash and JSON

## What You'll Learn

- How to invoke the orchestrator
- How phases chain together
- How validation gates work
- How verification scripts run
- How to inspect session outputs

## Step 1: Prepare Test Data

Create a mock BigQuery schema export:

```bash
cd workspace/lab/schema-optimization/

# Create test input directory
mkdir -p test-data/

# Create sample schema files
cat > test-data/users.json <<'EOF'
{
  "table": "users",
  "schema": [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "email", "type": "STRING", "mode": "REQUIRED"},
    {"name": "created_at", "type": "TIMESTAMP", "mode": "REQUIRED"},
    {"name": "legacy_id", "type": "STRING", "mode": "NULLABLE"},
    {"name": "old_email", "type": "STRING", "mode": "NULLABLE"},
    {"name": "deprecated_field", "type": "STRING", "mode": "NULLABLE"},
    {"name": "notes", "type": "STRING", "mode": "NULLABLE"}
  ]
}
EOF

cat > test-data/orders.json <<'EOF'
{
  "table": "orders",
  "schema": [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "user_id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "total", "type": "FLOAT", "mode": "REQUIRED"},
    {"name": "created_at", "type": "TIMESTAMP", "mode": "REQUIRED"},
    {"name": "temp_flag", "type": "BOOLEAN", "mode": "NULLABLE"},
    {"name": "internal_notes", "type": "STRING", "mode": "NULLABLE"}
  ]
}
EOF

echo "✅ Test data created in test-data/"
```
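
Before moving on, it is worth confirming that each file parses as JSON. A hedged sketch: it assumes only the files above and a `python3` on PATH (used here so the check works even where jq is not installed; `jq .` does the same job):

```bash
# Sanity check: every schema export in test-data/ must parse as JSON.
mkdir -p test-data/
# For a self-contained demo, guarantee at least one file exists.
[ -e test-data/users.json ] || printf '{"table":"users","schema":[]}\n' > test-data/users.json

for f in test-data/*.json; do
  if python3 -m json.tool "$f" >/dev/null 2>&1; then
    echo "OK: $f"
  else
    echo "INVALID: $f" >&2
  fi
done
```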

## Step 2: Test Verification Script Standalone

Before running the full workflow, verify that the script works on its own:

```bash
# Run script
./scripts/analyze_field_utilization.sh test-data/ test-output/

# Check output
jq . test-output/field_utilization_report.json

# Verify structure
jq '.results.field_usage_breakdown.unused_fields' test-output/field_utilization_report.json
```

**Expected:**

- Script completes without errors
- JSON output is valid
- Lists unused fields (legacy_id, old_email, deprecated_field, temp_flag)
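
The expected list can be asserted mechanically. This sketch is hedged: it only greps the report for the four field names, and it fabricates a mock report when a real one is not supplied, so the snippet runs standalone. In a real run, point `REPORT` at test-output/field_utilization_report.json.

```bash
# Assert that the four known-unused fields appear in the report.
# REPORT defaults to a mock file so this sketch is self-contained.
REPORT="${REPORT:-mock_report.json}"
[ -e "$REPORT" ] || cat > "$REPORT" <<'EOF'
{"results": {"field_usage_breakdown": {"unused_fields":
  ["legacy_id", "old_email", "deprecated_field", "temp_flag"]}}}
EOF

status=0
for field in legacy_id old_email deprecated_field temp_flag; do
  if grep -q "\"$field\"" "$REPORT"; then
    echo "found: $field"
  else
    echo "MISSING: $field" >&2
    status=1
  fi
done
[ "$status" -eq 0 ] && echo "all expected unused fields present"
```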

## Step 3: Invoke Orchestrator (Manual Simulation)

Since we're in a learning lab, we'll simulate the orchestrator by running the phases manually.

### Phase 1: Initial Schema Analysis

Create the input file for Phase 1:

```bash
cat > phase1-input.json <<EOF
{
  "skill_dir": "$(pwd)",
  "session_dir": "$(pwd)/reports/runs/manual-test-$(date +%Y-%m-%d_%H%M%S)",
  "reference_path": "$(pwd)/references/01-phase-1.md",
  "input_folder": "$(pwd)/test-data",
  "extraction_type": "bigquery_json"
}
EOF

# Create session directory
SESSION_DIR=$(jq -r '.session_dir' phase1-input.json)
mkdir -p "$SESSION_DIR"

echo "Session directory: $SESSION_DIR"
```

In a real run, you would now invoke the Phase 1 agent with this input.

For this exercise, let's manually create what Phase 1 would produce:

````bash
cat > "$SESSION_DIR/01-initial-schema-analysis.md" <<'EOF'
# Phase 1: Initial Schema Analysis

**Session:** manual-test-2025-01-15_143022
**Generated:** 2025-01-15 14:30:22 CST
**Input:** test-data/
**Extraction Type:** bigquery_json

---

## Executive Summary

- Analyzed 2 tables with 13 total fields
- Identified standard naming patterns (id, created_at)
- Found 4 fields with legacy/deprecated naming
- 61% of fields are nullable

---

## Schema Overview

### Tables
- Total tables: 2
- Table size range: 6-7 fields per table
- Average fields per table: 6.5

### Fields
- Total fields across all tables: 13
- NULLABLE fields: 8 (61%)
- REQUIRED fields: 5 (39%)

### Data Types
| Type      | Count | Percentage |
| --------- | ----- | ---------- |
| STRING    | 7     | 54%        |
| INTEGER   | 3     | 23%        |
| FLOAT     | 1     | 8%         |
| TIMESTAMP | 2     | 15%        |

---

## Key Findings (Machine-Readable)
```json
{
  "total_tables": 2,
  "total_fields": 13,
  "nullable_pct": 61.5,
  "issues_count": 4
}
```
---

*Generated by Phase 1 Agent*
EOF
````

Phase 1 would return this JSON:

```bash
cat > "$SESSION_DIR/phase1-output.json" <<EOF
{
  "status": "complete",
  "report_path": "$SESSION_DIR/01-initial-schema-analysis.md",
  "schema_summary": {
    "total_tables": 2,
    "total_fields": 13,
    "key_findings": [
      "2 tables analyzed",
      "13 total fields",
      "4 legacy/deprecated fields detected"
    ]
  }
}
EOF

echo "✅ Phase 1 simulated"
```

### Validate Phase 1 Output

```bash
# Check JSON is valid
jq . "$SESSION_DIR/phase1-output.json"

# Check report exists
ls -lh "$SESSION_DIR/01-initial-schema-analysis.md"

# Verify required keys
jq '.status, .report_path, .schema_summary' "$SESSION_DIR/phase1-output.json"

```

**Expected:** All checks pass ✅
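
These three checks are exactly what an orchestrator's validation gate does, so they can be bundled into one function. A sketch under stated assumptions: the function name is illustrative, the required-key list comes from the caller, and python3 stands in for jq so the sketch has no extra dependency:

```bash
# Illustrative validation gate: the JSON parses, the report it points
# at exists, and every required top-level key is present.
validate_phase_output() {
  local out_json="$1"; shift
  python3 -m json.tool "$out_json" >/dev/null 2>&1 \
    || { echo "FAIL: invalid JSON: $out_json"; return 1; }
  local report
  report=$(python3 -c 'import json,sys; print(json.load(open(sys.argv[1])).get("report_path",""))' "$out_json")
  [ -n "$report" ] && [ -f "$report" ] \
    || { echo "FAIL: report missing: $report"; return 1; }
  local key
  for key in "$@"; do
    python3 -c 'import json,sys; sys.exit(0 if sys.argv[2] in json.load(open(sys.argv[1])) else 1)' \
      "$out_json" "$key" || { echo "FAIL: missing key: $key"; return 1; }
  done
  echo "PASS: $out_json"
}

# Demo against a minimal, hypothetical output file:
printf '{"status":"complete","report_path":"demo-report.md","schema_summary":{}}\n' > demo-output.json
touch demo-report.md
validate_phase_output demo-output.json status report_path schema_summary
# prints: PASS: demo-output.json
```

In the real workflow the call would look like `validate_phase_output "$SESSION_DIR/phase1-output.json" status report_path schema_summary`.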

### Phase 4: Verification (The Critical Phase)

Now let's simulate Phase 4, which runs the actual script:

```bash
# Phase 4 input includes script path
cat > phase4-input.json <<EOF
{
  "skill_dir": "$(pwd)",
  "session_dir": "$SESSION_DIR",
  "reference_path": "$(pwd)/references/04-verify-with-script.md",
  "phase2_report_path": "$SESSION_DIR/02-field-utilization-analysis.md",
  "phase3_report_path": "$SESSION_DIR/03-impact-assessment.md",
  "input_folder": "$(pwd)/test-data",
  "script_path": "$(pwd)/scripts/analyze_field_utilization.sh",
  "output_folder_path": "$SESSION_DIR/script-output"
}
EOF

# Run the verification script
SCRIPT_PATH=$(jq -r '.script_path' phase4-input.json)
INPUT_FOLDER=$(jq -r '.input_folder' phase4-input.json)
OUTPUT_FOLDER=$(jq -r '.output_folder_path' phase4-input.json)

mkdir -p "$OUTPUT_FOLDER"

echo "Running verification script..."
"$SCRIPT_PATH" "$INPUT_FOLDER" "$OUTPUT_FOLDER"

echo "✅ Script completed"

# Inspect script output
jq . "$OUTPUT_FOLDER/field_utilization_report.json"
```

**Expected:**

- Script runs successfully
- Finds unused fields: legacy_id, old_email, deprecated_field, temp_flag
- JSON output is valid

## Step 4: Inspect Session Directory

Look at what was created:

```bash
tree "$SESSION_DIR"

# Should show:
# reports/runs/manual-test-2025-01-15_143022/
# ├── 01-initial-schema-analysis.md
# ├── phase1-output.json
# └── script-output/
#     └── field_utilization_report.json
```

Read the reports:

```bash
# Phase 1 report
cat "$SESSION_DIR/01-initial-schema-analysis.md"

# Script results
jq '.results.field_usage_breakdown' "$SESSION_DIR/script-output/field_utilization_report.json"
```

## Step 5: Understand the Workflow Pattern

Review what just happened:

1. Orchestrator creates session directory → `reports/runs/<timestamp>/`
2. Phase 1 runs → writes `01-initial-schema-analysis.md`, returns JSON
3. Orchestrator validates Phase 1 → checks JSON, verifies the file exists
4. Phase 4 runs → executes `analyze_field_utilization.sh`, writes verification report
5. Orchestrator validates Phase 4 → checks JSON, verifies script output
6. All artifacts preserved → the session directory can be inspected later

**Key insight:** Every phase produces evidence (files) plus structured output (JSON).
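
The pattern above can be compressed into a minimal orchestrator loop. This is a sketch, not the reference implementation: `run_phase` is a stub standing in for a real agent, the JSON shape is hypothetical, and python3 replaces jq so the sketch runs anywhere.

```bash
# Minimal orchestrator sketch: run each phase, validate, stop on failure.
SESSION_DIR="reports/runs/sketch-$(date +%Y-%m-%d_%H%M%S)"
mkdir -p "$SESSION_DIR"

# Stub agent: writes a report plus a JSON output pointing at it.
run_phase() {
  echo "# Phase $1 report" > "$SESSION_DIR/phase$1-report.md"
  printf '{"status":"complete","report_path":"%s"}\n' \
    "$SESSION_DIR/phase$1-report.md" > "$SESSION_DIR/phase$1-output.json"
}

for phase in 1 2 3; do
  run_phase "$phase"
  out="$SESSION_DIR/phase$phase-output.json"
  # Gate 1: structured output must be valid JSON.
  python3 -m json.tool "$out" >/dev/null || { echo "phase $phase: bad JSON"; break; }
  # Gate 2: the evidence file the JSON points at must exist.
  report=$(python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))["report_path"])' "$out")
  [ -f "$report" ] || { echo "phase $phase: report missing"; break; }
  echo "phase $phase validated"
done
```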


## Step 6: Experiment

Try breaking things to see validation in action.

### Experiment A: Missing Report File

```bash
# Simulate Phase 1 returning JSON but not writing file
rm "$SESSION_DIR/01-initial-schema-analysis.md"

# Try to validate
if [ -f "$(jq -r '.report_path' "$SESSION_DIR/phase1-output.json")" ]; then
  echo "✅ Validation passed"
else
  echo "❌ Validation failed: Report file missing"
fi
```

### Experiment B: Invalid JSON

```bash
# Simulate Phase 1 returning malformed JSON
echo "This is not JSON" > "$SESSION_DIR/phase1-bad-output.txt"

# Try to parse (suppress jq's output; we only care about the exit code)
if jq . "$SESSION_DIR/phase1-bad-output.txt" >/dev/null 2>&1; then
  echo "✅ Valid JSON"
else
  echo "❌ Invalid JSON"
fi
```

### Experiment C: Missing Required Keys

```bash
# Simulate Phase 1 missing schema_summary
cat > "$SESSION_DIR/phase1-incomplete.json" <<EOF
{
  "status": "complete",
  "report_path": "$SESSION_DIR/01-initial-schema-analysis.md"
}
EOF

# Try to extract required key
if jq -e '.schema_summary' "$SESSION_DIR/phase1-incomplete.json" >/dev/null 2>&1; then
  echo "✅ Has schema_summary"
else
  echo "❌ Missing schema_summary"
fi
```
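
Experiment C tests one key at a time; an orchestrator would sweep all of them and report every miss at once. A sketch (the key list and file name are illustrative, and python3 replaces the `jq -e` test so the snippet is self-contained):

```bash
# Check all required top-level keys in one pass, collecting misses.
printf '{"status":"complete","report_path":"x.md"}\n' > incomplete.json  # demo input

missing=""
for key in status report_path schema_summary; do
  python3 -c 'import json,sys; sys.exit(0 if sys.argv[2] in json.load(open(sys.argv[1])) else 1)' \
    incomplete.json "$key" || missing="$missing $key"
done

if [ -n "$missing" ]; then
  echo "❌ missing keys:$missing"
else
  echo "✅ all required keys present"
fi
# → ❌ missing keys: schema_summary
```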

## Step 7: Cleanup

```bash
# Remove test artifacts
rm -rf test-data/ test-output/ phase*-input.json

# Keep session directory for inspection
echo "Session preserved at: $SESSION_DIR"
```

## What You Learned

✅ How to prepare test data
✅ How to run verification script standalone
✅ How phases produce reports + JSON
✅ How orchestrator validates outputs
✅ How session directories preserve evidence
✅ How to debug validation failures


## Next Steps

- Exercise 2: Modify the verification script to add new checks
- Exercise 3: Add a Phase 6 to the workflow
- Exercise 4: Build your own 3-phase workflow from scratch


Completed Exercise 1: Run the Reference Workflow