
Add harness methodology: sprint contracts, calibrated evaluation, stripping principle #998

Open
aweider wants to merge 1 commit into danielmiessler:main from aweider:harness-methodology


aweider commented Mar 26, 2026

Summary

Integrates three patterns from Anthropic's harness design research into the Algorithm (v3.7.0) and PRD format:

  • Sprint Contracts — Before BUILD, builder and evaluator agree on what "done" means for each ISC criterion. Prevents the self-evaluation bias problem where the agent building is also the sole judge. Advanced+ effort only — zero overhead for Standard/Extended.
  • Calibrated Evaluation — Four-dimension scoring rubric (Correctness 30%, Completeness 25%, Craft 25%, User Experience 20%) with 1-10 scales and a threshold gate (below 6 on any dimension = targeted redo). Supplements binary pass/fail to catch the gap between "technically works" and "actually good."
  • Stripping Principle — LEARN phase reflection asking which scaffolding steps were load-bearing vs. added latency. As models improve, assumptions encoded in the Algorithm should be stress-tested and removed when no longer needed.
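For reviewers, here is a minimal sketch of how the Calibrated Evaluation gate could be computed. The dimension weights, the 1-10 scale, and the below-6 threshold come from this description. The function name `evaluate`, the dictionary keys, and the idea of combining the percentages into a weighted average are assumptions made for illustration and are not taken from Algorithm/v3.7.0.md.

```python
# Weights and threshold as stated in the PR description.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.25,
    "craft": 0.25,
    "user_experience": 0.20,
}
REDO_THRESHOLD = 6  # any dimension scoring below this triggers a targeted redo


def evaluate(scores: dict[str, int]) -> dict:
    """Return the weighted total and the dimensions that need a targeted redo.

    Hypothetical helper: the PR does not say how the weights are aggregated.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected scores for exactly: {sorted(WEIGHTS)}")
    for name, value in scores.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be on the 1-10 scale, got {value}")

    # Assumption: the percentages combine as a weighted average.
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

    # The gate is per dimension: a high total does not rescue a low dimension.
    redo = [name for name in WEIGHTS if scores[name] < REDO_THRESHOLD]
    return {"weighted": round(weighted, 2), "redo": redo, "passed": not redo}
```

Under these assumptions, a build scoring 9 on Correctness but 5 on Craft would be flagged for a targeted redo of Craft, regardless of its weighted total.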

Changes

| File | What Changed |
| --- | --- |
| Algorithm/v3.7.0.md | Sprint contract step in PLAN, calibrated scoring in VERIFY, stripping reflection in LEARN (+42 lines) |
| PRDFORMAT.md | Two new optional sections: Sprint Contract and Evaluation Scores (+23 lines) |
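As a rough illustration of what a Sprint Contract might carry, the sketch below models one entry per ISC criterion with the two things the commit message says are agreed before BUILD: an implementation approach and a verification method. The class names, field names, and the `ready_for_build` check are hypothetical and do not reflect the actual PRDFORMAT.md section layout.

```python
from dataclasses import dataclass, field


@dataclass
class ContractEntry:
    """Agreed definition of done for one ISC criterion (hypothetical shape)."""

    criterion: str
    implementation_approach: str
    verification_method: str
    builder_agreed: bool = False
    evaluator_agreed: bool = False

    @property
    def agreed(self) -> bool:
        # Both sides must sign off; the builder is not the sole judge.
        return self.builder_agreed and self.evaluator_agreed


@dataclass
class SprintContract:
    entries: list[ContractEntry] = field(default_factory=list)

    def unresolved(self) -> list[str]:
        """Criteria that still lack agreement or a concrete plan."""
        return [
            e.criterion
            for e in self.entries
            if not (
                e.agreed
                and e.implementation_approach.strip()
                and e.verification_method.strip()
            )
        ]

    def ready_for_build(self) -> bool:
        # Assumption: an empty contract does not count as agreement.
        return bool(self.entries) and not self.unresolved()
```

The design choice worth noting is that agreement is tracked per criterion, so a single disputed criterion blocks BUILD without discarding the rest of the contract.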

Design Decisions

  • Gated to Advanced+ — Sprint contracts only activate for substantial builds. Quick tasks stay fast.
  • Post-test scoring — Calibrated evaluation happens AFTER pass/fail testing, never instead of it.
  • Self-pruning — The stripping principle means these additions will recommend their own removal if they stop catching real issues.
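The gating rules can be sketched as a small function. This assumes the effort tiers are ordered Standard < Extended < Advanced and that "Advanced+" means Advanced or any higher tier; tiers above Advanced are not named in this PR, so they are left out. The condition for calibrated scoring follows one reading of the commit message (Advanced+ and either a QA agent or user-facing output). `Effort` and `active_steps` are hypothetical names, and the stripping reflection is omitted because the PR does not state how it is gated.

```python
from enum import IntEnum


class Effort(IntEnum):
    # Assumed ordering; only these three tier names appear in the PR.
    STANDARD = 1
    EXTENDED = 2
    ADVANCED = 3


def active_steps(
    effort: Effort, has_qa_agent: bool = False, user_facing: bool = False
) -> set[str]:
    """Return which of the new steps would run for a given task (a sketch)."""
    steps: set[str] = set()
    if effort >= Effort.ADVANCED:
        # Sprint contracts gate to Advanced+.
        steps.add("sprint_contract")
        # Calibrated scoring additionally needs a QA agent or user-facing output.
        if has_qa_agent or user_facing:
            steps.add("calibrated_evaluation")
    return steps
```

With this shape, Standard and Extended tasks return an empty set, which matches the "zero overhead" claim for those tiers.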

Context

The Anthropic article's key insight: separating the agent doing the work from the agent judging it is a strong lever against self-evaluation bias. Their GAN-inspired generator-evaluator loop, with sprint contracts and calibrated scoring, produced dramatically better output than solo agents, at the cost of a more structured workflow. These additions bring that pattern into PAI's existing Algorithm phases without adding new agents or systems.

Test Plan

  • Run an Advanced+ Algorithm task — verify sprint contract step appears in PLAN
  • Run a Standard task — verify zero additional output or latency
  • Check VERIFY phase produces 4-dimension scoring table for user-facing builds
  • Check LEARN phase includes stripping test reflection question

🤖 Generated with Claude Code

…ipping principle

Integrates three patterns from Anthropic's harness design research for long-running
application development into the Algorithm and PRD format:

1. Sprint Contracts (PLAN phase, Advanced+ only): Before BUILD, builder and evaluator
   agree on what "done" means for each ISC criterion — implementation approach and
   verification method. Prevents misalignment between building and judging.

2. Calibrated Evaluation (VERIFY phase): Four-dimension scoring rubric (Correctness,
   Completeness, Craft, User Experience) with 1-10 scales, anchor descriptions, and
   a threshold gate (below 6 = targeted redo). Supplements binary pass/fail testing
   to catch the gap between "technically works" and "actually good."

3. Stripping Principle (LEARN phase): Reflection question asking which scaffolding
   steps were load-bearing vs. added latency. Every component encodes an assumption
   about what the model can't do alone — as models improve, assumptions should be
   stress-tested.

Zero overhead for Standard/Extended tiers. Sprint contracts gate to Advanced+.
Calibrated scoring activates only for Advanced+ with QA agent or user-facing output.

Reference: https://www.anthropic.com/engineering/harness-design-long-running-apps

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>