
Commit 5a32b9e

mattgodbolt and claude authored
Merge pull request #1 from compiler-explorer/claude/prompt_testing
Add comprehensive prompt testing framework

This PR introduces a complete testing and evaluation framework for the Claude explain service, enabling systematic A/B testing and data-driven prompt optimization.

## Key Features

### Test Infrastructure

- CLI with commands: run, list, review, analyze, and improve
- YAML-based test cases covering basic optimizations, complex transformations, and edge cases
- YAML-based prompt templates with system/user/assistant sections and variable substitution
- Automatic test result storage and analysis

### Evaluation Methods

- **Automatic scoring**: fast regex-based evaluation across multiple dimensions
  - Topic coverage detection, technical accuracy checks, clarity analysis
  - Weighted scoring: accuracy (25%), technical accuracy (25%), clarity (20%), completeness (15%), length (10%), consistency (5%)
- **Claude-based AI scoring**: deep evaluation using advanced models
  - Nuanced assessment across 5 dimensions with detailed feedback
  - Identifies missing topics, incorrect claims, and areas for improvement
- **Hybrid scoring**: configurable sampling combining both methods

### Prompt Improvement

- AI-powered prompt advisor analyzes test results and suggests specific improvements
- Constitutional AI approach: a fast model generates, an advanced model reviews
- Automated experimental prompt generation based on feedback
- Data-driven optimization workflow with measurable impact

### Architecture

- Fail-fast error propagation (no silent failures)
- Same core logic as the production service for accurate testing
- Extensible design for custom evaluation criteria
- Support for human review workflows

## Implementation Details

- Converted test cases from JSON to YAML for better maintainability
- Added a flexible template variable system (arch, language, compiler, etc.)
- Integrated with the project's existing infrastructure (uv, pre-commit hooks)
- Comprehensive documentation with examples and best practices
- TODO roadmap for future enhancements (CE API integration, HTML templating)

## Usage Examples

```bash
# Run tests with AI scoring
uv run prompt-test run --prompt current --scorer claude

# Get improvement suggestions
uv run prompt-test improve --prompt current

# Compare prompt versions
uv run prompt-test run --prompt v1 --compare v2
```

This framework enables continuous improvement of explanation quality through objective metrics and systematic testing.

Co-authored-by: Claude <noreply@anthropic.com>
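A YAML test case for this framework might look roughly like the sketch below. This is purely illustrative: the field names are hypothetical, not the framework's actual schema; only the template variables (arch, language, compiler) are named in the description above.

```yaml
# Hypothetical test-case sketch -- field names are illustrative only.
# The arch/language/compiler template variables are mentioned in the
# PR description; everything else is an assumption.
name: basic-loop-optimization
description: Compiler unrolls and vectorizes a simple fixed-count loop
arch: x86_64
language: c++
compiler: gcc -O3
expected_topics:
  - loop unrolling
  - vectorization
```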
2 parents d232cc7 + 4ed6011 commit 5a32b9e
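The weighted automatic scoring described above could be combined along these lines. The weights mirror the percentages in the PR description, but the function name, dimension keys, and input shape are illustrative assumptions, not the framework's actual API:

```python
# Sketch of the weighted automatic scoring from the PR description.
# The weights match the stated percentages; function and field names
# are hypothetical, not the framework's real API.

WEIGHTS: dict[str, float] = {
    "accuracy": 0.25,
    "technical_accuracy": 0.25,
    "clarity": 0.20,
    "completeness": 0.15,
    "length": 0.10,
    "consistency": 0.05,
}


def combined_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each in [0, 1]) into one weighted total."""
    return sum(WEIGHTS[name] * dimension_scores.get(name, 0.0) for name in WEIGHTS)


example = {
    "accuracy": 0.9,
    "technical_accuracy": 0.8,
    "clarity": 1.0,
    "completeness": 0.7,
    "length": 1.0,
    "consistency": 0.5,
}
print(combined_score(example))
```

Missing dimensions default to zero here, consistent with the fail-fast philosophy of penalizing rather than silently ignoring an absent score.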

21 files changed: +3001 −1 lines changed

.gitignore

Lines changed: 7 additions & 0 deletions
```diff
@@ -14,3 +14,10 @@ node_modules
 /.venv
 .pytest_cache
 /.env
+
+# Prompt testing outputs
+/prompt_testing/results/
+*.jsonl
+
+# Package build artifacts
+*.egg-info/
```

CLAUDE.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -61,6 +61,7 @@ The service processes compiler output through a pipeline: input validation → s
 
 - Prefer using modern Python 3.13+ type syntax. Good: `a: list[str] | None`. Bad: `a: Optional[List[str]]`
 - Use ruff for linting and formatting with line length of 120 characters
+- Prefer pathlib.Path over old-fashioned io like naked `open` and `glob` calls. Always supply an encoding
 
 ## Development Workflow Notes
```
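The pathlib guideline added above might look like this in practice. This is an illustrative before/after in the style the guideline prescribes, not code from the repository; `read_results` and the directory name are hypothetical:

```python
from pathlib import Path

# Discouraged by the guideline above: naked open/glob, no encoding
#   files = glob.glob("prompt_testing/results/*.jsonl")
#   text = open(files[0]).read()

# Preferred: pathlib with an explicit encoding and modern type syntax.
# read_results is a hypothetical helper for illustration only.
def read_results(results_dir: Path) -> list[str]:
    """Read every JSONL results file under results_dir, sorted by name."""
    return [p.read_text(encoding="utf-8") for p in sorted(results_dir.glob("*.jsonl"))]
```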
