Commit 1473117

Simplify prompt testing + add Opus correctness reviewer (#11)
## Summary

Three changes in one PR:

### 1. Simplify the testing framework (-5,691 lines, +301 lines)

Removes the complex evaluation system:

- Claude-as-judge abstract scoring (5 numeric dimensions)
- Flask web review UI with localStorage persistence
- Human review JSONL collection system
- Prompt advisor / improvement suggestion engine
- 11 historical prompt YAML files
- Flask and Jinja2 dependencies

Keeps:

- All 21 curated test cases with real assembly
- CE API client and enrichment
- Core runner + CLI (both dramatically simplified)

### 2. Add focused Opus correctness reviewer (+258 lines)

Instead of abstract scores, uses Claude Opus to check for **specific factual errors**:

- Instruction semantics (is `lea` correctly described?)
- Complexity claims (O(2^n) vs O(n)?)
- Optimisation level characterisation
- Register usage and calling conventions

Each issue is classified as **error** (would mislead a student) or **warning** (imprecise but not wrong).

```bash
# Run + review in one step
uv run prompt-test run --review

# Review existing results
uv run prompt-test review results/file.json
```

### 3. Bump max_tokens from 1024 to 1536

Two test cases (`edge_long_asm_001` and `loop_experienced_assembly`) were hitting the 1024 output token limit and getting truncated mid-explanation. 1536 gives enough headroom for complex assembly without encouraging verbosity. Cost impact is minimal since `max_tokens` is a cap: most responses use 400-600 tokens.

### Why this approach

The old abstract scoring (accuracy: 0.67, relevance: 0.82...) didn't catch real errors like the O(2^n)→O(n) fibonacci mistake. The new reviewer asks specific questions about correctness and reports concrete issues.

### Testing

- 91 unit tests pass
- Pre-commit clean
- Full 21-case run with Opus review: 20/21 passed correctness check

*(I'm Molty, an AI assistant acting on behalf of @mattgodbolt)*
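To make the error/warning classification from the commit message concrete, here is a minimal Python sketch of how reviewer findings might be represented and turned into a pass/fail verdict. The names `Severity`, `Issue`, and `passes_correctness` are hypothetical and do not come from the repository, and the rule that a case passes when it has no **error**-level issues is an assumption, not something the commit states.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    ERROR = "error"      # would mislead a student
    WARNING = "warning"  # imprecise but not wrong


@dataclass(frozen=True)
class Issue:
    """One finding from the reviewer (hypothetical shape)."""
    severity: Severity
    category: str  # e.g. "instruction_semantics", "complexity"
    detail: str


def passes_correctness(issues: list[Issue]) -> bool:
    # Assumed rule: warnings are tolerated, any error fails the case.
    return not any(issue.severity is Severity.ERROR for issue in issues)


# Illustrative findings, invented for this example.
issues = [
    Issue(Severity.WARNING, "register_usage",
          "Describes a register's role without naming the calling convention."),
    Issue(Severity.ERROR, "complexity",
          "Confuses O(2^n) with O(n) for the fibonacci example."),
]

print(passes_correctness(issues))       # False
print(passes_correctness(issues[:1]))   # True
```

Keeping severity as an enum rather than a free-form string means a misspelled level fails loudly at construction time instead of silently passing the check.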
2 parents bcd12d4 + 508c2a2 commit 1473117

32 files changed: +549 -5686 lines changed

app/prompt.yaml

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ name: Production Sonnet 4.6
 description: Sonnet 4.6 with deduplicated, tighter prompts
 model:
   name: claude-sonnet-4-6
-  max_tokens: 1024
+  max_tokens: 1536
   temperature: 0.0
 audience_levels:
   beginner:
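One way to confirm that a `max_tokens` bump is needed is to look for responses that stopped because they hit the cap. The Anthropic Messages API reports this as `stop_reason: "max_tokens"`. The sketch below works on plain dicts so it runs without network access. The helper names and the `results` layout are hypothetical, and the token counts are invented for illustration; only the two truncated case IDs come from the commit message.

```python
def was_truncated(response: dict) -> bool:
    """True when generation stopped because it reached the max_tokens cap."""
    return response.get("stop_reason") == "max_tokens"


def truncated_cases(results: dict[str, dict]) -> list[str]:
    """Return the IDs of test cases whose responses were cut off, sorted."""
    return sorted(case_id for case_id, resp in results.items() if was_truncated(resp))


# Invented sample data: response metadata keyed by test case ID.
results = {
    "edge_long_asm_001": {"stop_reason": "max_tokens", "output_tokens": 1024},
    "loop_experienced_assembly": {"stop_reason": "max_tokens", "output_tokens": 1024},
    "hypothetical_short_case": {"stop_reason": "end_turn", "output_tokens": 512},
}

print(truncated_cases(results))
# ['edge_long_asm_001', 'loop_experienced_assembly']
```

Because `max_tokens` is only an upper bound, raising it changes nothing for responses that already end with `end_turn`; only the cases flagged here would produce longer output.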

prompt_testing/AUDIENCE_GUIDE.md

Lines changed: 0 additions & 97 deletions
This file was deleted.
