Commit 1473117
Simplify prompt testing + add Opus correctness reviewer (#11)
## Summary
Three changes in one PR:
### 1. Simplify the testing framework (-5,691 lines, +301 lines)
Removes the complex evaluation system:
- Claude-as-judge abstract scoring (5 numeric dimensions)
- Flask web review UI with localStorage persistence
- Human review JSONL collection system
- Prompt advisor / improvement suggestion engine
- 11 historical prompt YAML files
- Flask and Jinja2 dependencies
Keeps:
- All 21 curated test cases with real assembly
- CE API client and enrichment
- Core runner + CLI (both dramatically simplified)
### 2. Add focused Opus correctness reviewer (+258 lines)
Instead of abstract scores, uses Claude Opus to check for **specific
factual errors**:
- Instruction semantics (is `lea` correctly described?)
- Complexity claims (O(2^n) vs O(n)?)
- Optimisation level characterisation
- Register usage and calling conventions
Each issue is classified as **error** (would mislead a student) or
**warning** (imprecise but not wrong).
```bash
# Run + review in one step
uv run prompt-test run --review
# Review existing results
uv run prompt-test review results/file.json
```
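As a rough illustration of the error/warning classification described above, a reviewer result could be modelled as below. This is only a sketch: `Severity`, `ReviewIssue`, and `passes_correctness_check` are hypothetical names, not taken from this PR, and the pass rule (no error-level issues) is an assumption about how a case might be judged.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    ERROR = "error"      # would mislead a student
    WARNING = "warning"  # imprecise but not wrong


@dataclass(frozen=True)
class ReviewIssue:
    """One concrete problem reported by the reviewer (hypothetical shape)."""
    severity: Severity
    description: str


def passes_correctness_check(issues: list[ReviewIssue]) -> bool:
    """Assumed rule: a case passes when it has no error-level issues."""
    return not any(issue.severity is Severity.ERROR for issue in issues)


# Illustrative issues only; the descriptions are made up for this sketch.
only_warning = [ReviewIssue(Severity.WARNING, "imprecise description of lea")]
with_error = only_warning + [
    ReviewIssue(Severity.ERROR, "wrong complexity claim for recursive fibonacci")
]

print(passes_correctness_check(only_warning))  # True
print(passes_correctness_check(with_error))    # False
```

Under this assumed rule, warnings are reported but do not fail a test case, which matches the intent of separating "would mislead a student" from "imprecise but not wrong".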
### 3. Bump max_tokens from 1024 to 1536
Two test cases (`edge_long_asm_001` and `loop_experienced_assembly`)
were hitting the 1024 output token limit and getting truncated
mid-explanation. 1536 gives enough headroom for complex assembly without
encouraging verbosity. Cost impact is minimal since `max_tokens` is a
cap — most responses use 400-600 tokens.
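One way to spot this kind of truncation is to look at the stop reason on the model response. The sketch below assumes the response is available as a dict in the Anthropic Messages API shape, where `stop_reason` is `"max_tokens"` when the cap was hit; the helper name `is_truncated` and the sample dicts are hypothetical and not part of this PR.

```python
MAX_TOKENS = 1536  # the new cap from this PR (previously 1024)


def is_truncated(response: dict) -> bool:
    """Return True if generation stopped because the max_tokens cap was hit."""
    return response.get("stop_reason") == "max_tokens"


# Illustrative response dicts, not real results.
finished = {"stop_reason": "end_turn", "usage": {"output_tokens": 512}}
cut_off = {"stop_reason": "max_tokens", "usage": {"output_tokens": MAX_TOKENS}}

print(is_truncated(finished))  # False
print(is_truncated(cut_off))   # True
```

A check like this could flag any test case that still hits the cap after the bump, rather than relying on someone noticing an explanation that ends mid-sentence.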
### Why this approach
The old abstract scoring (accuracy: 0.67, relevance: 0.82...) didn't
catch real errors like the O(2^n)→O(n) fibonacci mistake. The new
reviewer asks specific questions about correctness and reports concrete
issues.
### Testing
- 91 unit tests pass
- Pre-commit clean
- Full 21-case run with Opus review: 20/21 passed correctness check
*(I'm Molty, an AI assistant acting on behalf of @mattgodbolt)*

### File tree
32 files changed (+549, -5686 lines)
- app
- prompt_testing
- evaluation
- prompts
- templates