Add harness methodology: sprint contracts, calibrated evaluation, stripping principle by aweider · Pull Request #998 · danielmiessler/Personal_AI_Infrastructure

aweider · 2026-03-26T04:10:52Z

Summary

Integrates three patterns from Anthropic's harness design research into the Algorithm (v3.7.0) and PRD format:

Sprint Contracts — Before BUILD, builder and evaluator agree on what "done" means for each ISC criterion. Prevents the self-evaluation bias problem where the agent building is also the sole judge. Advanced+ effort only — zero overhead for Standard/Extended.
Calibrated Evaluation — Four-dimension scoring rubric (Correctness 30%, Completeness 25%, Craft 25%, User Experience 20%) with 1-10 scales and a threshold gate (below 6 on any dimension = targeted redo). Supplements binary pass/fail to catch the gap between "technically works" and "actually good."
Stripping Principle — LEARN phase reflection asking which scaffolding steps were load-bearing vs. added latency. As models improve, assumptions encoded in the Algorithm should be stress-tested and removed when no longer needed.
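The calibrated-evaluation gate described above can be sketched as follows. This is a minimal illustration, not the code added by this PR: the dimension weights and the below-6 threshold come from the summary, while the function name `evaluate`, the `Evaluation` type, and the idea of combining the percentages into a single weighted total are assumptions made for the example.

```python
from dataclasses import dataclass

# Weights and threshold as stated in the PR summary.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.25,
    "craft": 0.25,
    "user_experience": 0.20,
}
REDO_THRESHOLD = 6  # any dimension scoring below this triggers a targeted redo


@dataclass
class Evaluation:
    weighted_score: float       # assumption: percentages combine into one total
    redo_dimensions: list[str]  # dimensions that fell below the threshold

    @property
    def passed(self) -> bool:
        return not self.redo_dimensions


def evaluate(scores: dict[str, int]) -> Evaluation:
    """Score a build on the four dimensions (hypothetical helper)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected scores for exactly: {sorted(WEIGHTS)}")
    for name, value in scores.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be on the 1-10 scale, got {value}")
    weighted = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    redo = [d for d in WEIGHTS if scores[d] < REDO_THRESHOLD]
    return Evaluation(round(weighted, 2), redo)
```

In this sketch a build with a high weighted total still fails the gate if a single dimension scores below 6, which matches the "on any dimension" wording of the rubric.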

Changes

| File | What Changed |
| --- | --- |
| `Algorithm/v3.7.0.md` | Sprint contract step in PLAN, calibrated scoring in VERIFY, stripping reflection in LEARN (+42 lines) |
| `PRDFORMAT.md` | Two new optional sections: Sprint Contract and Evaluation Scores (+23 lines) |

Design Decisions

Gated to Advanced+ — Sprint contracts only activate for substantial builds. Quick tasks stay fast.
Post-test scoring — Calibrated evaluation happens AFTER pass/fail testing, never instead of it.
Self-pruning — The stripping principle means these additions will recommend their own removal if they stop catching real issues.
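The first two decisions above (tier gating and post-test ordering) might look like the sketch below. The tier names Standard, Extended, and Advanced appear in this PR; their numeric ordering, the `Effort` enum, and the `verify` signature are assumptions for illustration. Tiers above Advanced and the additional "QA agent or user-facing output" condition are omitted.

```python
from enum import IntEnum
from typing import Callable


class Effort(IntEnum):
    # Names from the PR text; the numeric ordering is an assumption.
    STANDARD = 1
    EXTENDED = 2
    ADVANCED = 3


def sprint_contract_enabled(effort: Effort) -> bool:
    """Sprint contracts only activate at Advanced or higher."""
    return effort >= Effort.ADVANCED


def verify(
    effort: Effort,
    run_tests: Callable[[], bool],
    score: Callable[[], dict[str, int]],
) -> dict:
    """Run pass/fail tests first; calibrated scoring only supplements them."""
    result = {"tests_passed": run_tests(), "scores": None}
    # Scoring runs after testing and never replaces it. Lower tiers skip it
    # entirely, so they pay no extra cost.
    if effort >= Effort.ADVANCED:
        result["scores"] = score()
    return result
```

Whether scoring should be skipped when the pass/fail tests fail is not stated in the PR, so the sketch leaves that choice open and always scores after testing at Advanced.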

Context

The Anthropic article's key insight: separating the agent doing the work from the agent judging it is a strong lever against self-evaluation bias. Their GAN-inspired generator-evaluator loop with sprint contracts and calibrated scoring produced dramatically better output than solo agents, at the cost of a more structured workflow. These additions bring that pattern into PAI's existing Algorithm phases without adding new agents or systems.
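As a rough illustration of the builder/evaluator separation, the sketch below runs a build loop against a contract agreed up front. It is not taken from the Anthropic article or from this PR's diff: `ContractItem`, `run_sprint`, and the `max_rounds` cap are hypothetical names and choices, and the builder and checks are plain callables standing in for agents.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ContractItem:
    criterion: str                # the ISC criterion being built
    approach: str                 # agreed implementation approach
    check: Callable[[str], bool]  # agreed verification method


def run_sprint(
    contract: list[ContractItem],
    build: Callable[[list[str]], str],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Build, judge against the contract, and redo only what failed."""
    feedback: list[str] = []
    for round_no in range(1, max_rounds + 1):
        artifact = build(feedback)
        # The judgment comes from the contract's checks, not from the builder.
        feedback = [c.criterion for c in contract if not c.check(artifact)]
        if not feedback:
            return artifact, round_no
    raise RuntimeError(f"unmet criteria after {max_rounds} rounds: {feedback}")
```

The builder receives only the list of failed criteria, so each round is a targeted redo and the definition of "done" cannot drift once the contract is fixed.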

Test Plan

Run an Advanced+ Algorithm task — verify sprint contract step appears in PLAN
Run a Standard task — verify zero additional output or latency
Check VERIFY phase produces 4-dimension scoring table for user-facing builds
Check LEARN phase includes stripping test reflection question

🤖 Generated with Claude Code

…ipping principle

Integrates three patterns from Anthropic's harness design research for long-running application development into the Algorithm and PRD format:

1. Sprint Contracts (PLAN phase, Advanced+ only): Before BUILD, builder and evaluator agree on what "done" means for each ISC criterion: implementation approach and verification method. Prevents misalignment between building and judging.
2. Calibrated Evaluation (VERIFY phase): Four-dimension scoring rubric (Correctness, Completeness, Craft, User Experience) with 1-10 scales, anchor descriptions, and a threshold gate (below 6 = targeted redo). Supplements binary pass/fail testing to catch the gap between "technically works" and "actually good."
3. Stripping Principle (LEARN phase): Reflection question asking which scaffolding steps were load-bearing vs. added latency. Every component encodes an assumption about what the model can't do alone; as models improve, assumptions should be stress-tested.

Zero overhead for Standard/Extended tiers. Sprint contracts gate to Advanced+. Calibrated scoring activates only for Advanced+ with QA agent or user-facing output.

Reference: https://www.anthropic.com/engineering/harness-design-long-running-apps

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aweider wants to merge 1 commit into danielmiessler:main from aweider:harness-methodology
