Commit 5a32b9e
Merge pull request #1 from compiler-explorer/claude/prompt_testing
Add comprehensive prompt testing framework
This PR introduces a complete testing and evaluation framework for the Claude explain
service, enabling systematic A/B testing and data-driven prompt optimization.
## Key Features
### Test Infrastructure
- CLI with commands: `run`, `list`, `review`, `analyze`, and `improve`
- YAML-based test cases covering basic optimizations, complex transformations, and edge cases
- YAML-based prompt templates with system/user/assistant sections and variable substitution
- Automatic test result storage and analysis
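A test case in this format might look like the following sketch; the field names here are illustrative guesses, not the actual schema:

```yaml
# Hypothetical test case; field names are illustrative, not the real schema.
id: loop-summation-basic
description: GCC vectorizes a simple summation loop at -O3
language: c++
compiler: g++ 13.2
arch: x86_64
source: |
  int sum(const int* a, int n) {
      int s = 0;
      for (int i = 0; i < n; ++i) s += a[i];
      return s;
  }
expected_topics:
  - loop vectorization
  - accumulator register
```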
### Evaluation Methods
- **Automatic scoring**: Fast regex-based evaluation across multiple dimensions
- Topic coverage detection, technical accuracy checks, clarity analysis
- Weighted scoring: accuracy (25%), technical accuracy (25%), clarity (20%),
completeness (15%), length (10%), consistency (5%)
- **Claude-based AI scoring**: Deep evaluation using advanced models
- Nuanced assessment across 5 dimensions with detailed feedback
- Identifies missing topics, incorrect claims, and improvement areas
- **Hybrid scoring**: Configurable sampling combining both methods
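The weighted combination above can be sketched as follows; the weights come from the PR text, while the function name and the 0-to-1 score scale are assumptions:

```python
# Sketch of the weighted automatic scoring described above. Weights mirror
# the PR text; the function name and 0-1 score scale are assumptions.
WEIGHTS = {
    "accuracy": 0.25,
    "technical_accuracy": 0.25,
    "clarity": 0.20,
    "completeness": 0.15,
    "length": 0.10,
    "consistency": 0.05,
}

def weighted_score(scores: dict) -> float:
    """Combine per-dimension scores (each in [0, 1]) into one overall score."""
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)

# Perfect scores on every dimension combine to an overall score of 1.0.
print(round(weighted_score({dim: 1.0 for dim in WEIGHTS}), 6))  # 1.0
```

Missing dimensions default to 0.0, so a partial score dict degrades the total rather than raising an error.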
### Prompt Improvement
- AI-powered prompt advisor analyzes test results and suggests specific improvements
- Constitutional AI approach: fast model generates, advanced model reviews
- Automated experimental prompt generation based on feedback
- Data-driven optimization workflow with measurable impact
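The generate-then-review loop can be sketched like this; the `call_model` helper and the model labels are hypothetical placeholders, not the actual service API:

```python
# Hedged sketch of the constitutional generate/review workflow described
# above. call_model and the "fast"/"advanced" labels are hypothetical.
def improve_prompt(current_prompt: str, test_feedback: str, call_model) -> str:
    """Fast model drafts a revision; advanced model reviews and finalizes."""
    draft = call_model(
        "fast",
        f"Revise this prompt using the feedback.\n\n"
        f"Prompt:\n{current_prompt}\n\nFeedback:\n{test_feedback}",
    )
    reviewed = call_model(
        "advanced",
        f"Review the revised prompt below for accuracy and clarity; "
        f"return a corrected final version.\n\n{draft}",
    )
    return reviewed
```

Injecting `call_model` keeps the workflow testable with a stub in place of a real model client.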
### Architecture
- Fail-fast error propagation (no silent failures)
- Same core logic as production service for accurate testing
- Extensible design for custom evaluation criteria
- Support for human review workflows
## Implementation Details
- Converted test cases from JSON to YAML for better maintainability
- Added flexible template variable system (arch, language, compiler, etc.)
- Integrated with project's existing infrastructure (uv, pre-commit hooks)
- Comprehensive documentation with examples and best practices
- TODO roadmap for future enhancements (CE API integration, HTML templating)
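The template variable system mentioned above could be as simple as stdlib `string.Template` substitution; the variable names (`arch`, `language`, `compiler`) come from the PR text, but the template string itself is hypothetical:

```python
# Illustrative sketch of template variable substitution; the variable names
# come from the PR text, the template string is hypothetical.
from string import Template

prompt_template = Template(
    "Explain the $arch assembly produced by $compiler for this $language code."
)

rendered = prompt_template.substitute(
    arch="x86_64", language="C++", compiler="gcc 13.2"
)
print(rendered)
# Explain the x86_64 assembly produced by gcc 13.2 for this C++ code.
```

`Template.substitute` raises `KeyError` on a missing variable, which matches the fail-fast philosophy described under Architecture.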
## Usage Examples
```bash
# Run tests with AI scoring
uv run prompt-test run --prompt current --scorer claude
# Get improvement suggestions
uv run prompt-test improve --prompt current
# Compare prompt versions
uv run prompt-test run --prompt v1 --compare v2
```
This framework enables continuous improvement of explanation quality through
objective metrics and systematic testing.
Co-authored-by: Claude <noreply@anthropic.com>
21 files changed (+3001, −1 lines), adding the `prompt_testing/` package with `evaluation/`, `prompts/`, and `test_cases/` subdirectories.