Add comprehensive evaluation infrastructure for spec and plan templates #2

Merged
kanfil merged 14 commits into main from adding_eval
Jan 19, 2026

Conversation

@kfinkels
Collaborator

Summary

  • Implemented comprehensive evaluation infrastructure using PromptFoo to test spec and plan template quality
  • Added custom annotation tool with FastHTML-based UI for manual spec evaluation
  • Integrated automated error analysis for iterative prompt refinement
  • Achieved 90% evaluation pass rate through systematic prompt improvements
  • Added GitHub Actions workflow for continuous evaluation on pull requests
  • Created dataset of 17 real specs and 2 real plans for testing
  • Made LLM model configurable via environment variables
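The last bullet, model selection via environment variables, can be sketched roughly as follows. The variable names and default model ID below are assumptions for illustration, not necessarily the PR's actual configuration:

```python
import os

# Hypothetical variable names and defaults -- the PR makes the model
# configurable via environment variables, but the exact names used
# in the repository are assumptions here.
DEFAULT_PROVIDER = "anthropic"
DEFAULT_MODEL = "claude-sonnet-4-5"

def resolve_model() -> str:
    """Build a provider:model string from environment variables,
    falling back to defaults when they are unset."""
    provider = os.environ.get("EVAL_LLM_PROVIDER", DEFAULT_PROVIDER)
    model = os.environ.get("EVAL_LLM_MODEL", DEFAULT_MODEL)
    return f"{provider}:{model}"
```

With no variables set this yields `anthropic:claude-sonnet-4-5`; exporting `EVAL_LLM_PROVIDER=openai` and `EVAL_LLM_MODEL=gpt-4o` would switch the whole evaluation run without code changes.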

Key Features

  • PromptFoo Integration: Automated evaluation framework with custom graders for assessing spec/plan quality
  • Annotation Tool: Interactive web UI for reviewing and annotating evaluation results
  • Error Analysis: Automated analysis of failures with actionable improvement recommendations
  • GitHub Actions: CI/CD integration for running evaluations on PRs
  • Comprehensive Documentation: Setup guides, workflows, and quick reference materials
  • Model Flexibility: Configurable LLM provider and model selection
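For the "custom graders" mentioned above, PromptFoo supports Python assertion files exposing a `get_assert(output, context)` function. The sketch below follows that shape, but the required-section list and scoring scheme are assumptions for illustration, not the PR's actual graders:

```python
# Hypothetical custom grader in the shape PromptFoo loads via
# `type: python` / `value: file://grader.py`. The section names
# checked here are illustrative assumptions.
REQUIRED_SECTIONS = ["## Summary", "## Requirements", "## Acceptance Criteria"]

def get_assert(output: str, context: dict) -> dict:
    """Grade a generated spec: pass when every required section is present,
    with a partial score proportional to how many sections were found."""
    missing = [s for s in REQUIRED_SECTIONS if s not in output]
    return {
        "pass": not missing,
        "score": 1.0 - len(missing) / len(REQUIRED_SECTIONS),
        "reason": "ok" if not missing else f"missing sections: {missing}",
    }
```

Returning a dict with `pass`/`score`/`reason` lets the framework surface both a binary verdict and a graded score per test case.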

Test plan

  • Run evaluation locally using evals/scripts/run-promptfoo-eval.sh
  • Verify GitHub Actions workflow executes successfully on PR
  • Test annotation tool with evals/scripts/run-annotation-tool.sh
  • Review automated error analysis output
  • Confirm evaluation scores meet minimum thresholds (60% for spec, 75% for plan)
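The final threshold check in the test plan can be sketched as a pass-rate comparison. The 60% (spec) and 75% (plan) minimums come from this PR; the function shape and result format are assumptions:

```python
# Sketch of the pass-rate threshold check from the test plan.
# Thresholds (60% spec, 75% plan) are stated in the PR; how results
# are fed in is an assumption for illustration.
THRESHOLDS = {"spec": 0.60, "plan": 0.75}

def meets_threshold(kind: str, passed: int, total: int) -> bool:
    """Return True when the pass rate for a template kind meets its minimum."""
    if total == 0:
        return False
    return passed / total >= THRESHOLDS[kind]

# Example: 16 of 17 specs passing clears the 60% spec threshold.
print(meets_threshold("spec", 16, 17))  # True
```

A CI job would compute `passed`/`total` from the PromptFoo results file and fail the workflow when this returns False.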

Keren Finkelstein added 14 commits January 8, 2026 10:36
… of spec and plan template outputs using PromptFoo with Claude Sonnet 4.5.
  - Create run-auto-error-analysis.sh script for automated spec evaluation
  - Add run-automated-error-analysis.py with Claude-powered categorization
  - Evaluate specs with binary pass/fail and failure categorization
  - Generate detailed CSV reports and summary files
  - Update .gitignore to exclude analysis results
  - Document automated and manual error analysis workflows in README
  - Mark Week 1 (Error Analysis Foundation) as completed in workplan

  Provides two error analysis options:
  1. Automated (Claude API) - fast, batch evaluation
  2. Manual (Jupyter) - deep investigation and exploration
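The automated option's "failure categorization" plus "CSV reports" step can be sketched as a small aggregation pass. The record fields and category names below are assumptions, not the script's actual schema:

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical per-spec results with a binary verdict and a failure
# category (as assigned by the Claude-powered categorization step).
results = [
    {"spec": "spec-01", "verdict": "pass", "category": ""},
    {"spec": "spec-02", "verdict": "fail", "category": "missing-acceptance-criteria"},
    {"spec": "spec-03", "verdict": "fail", "category": "missing-acceptance-criteria"},
]

def summarize(rows) -> str:
    """Tally failure categories across all failed specs and render
    the summary as CSV text (category, count)."""
    counts = Counter(r["category"] for r in rows if r["verdict"] == "fail")
    buf = StringIO()
    writer = csv.writer(buf)
    writer.writerow(["failure_category", "count"])
    for category, count in counts.most_common():
        writer.writerow([category, count])
    return buf.getvalue()

print(summarize(results))
```

Sorting by `most_common()` puts the highest-impact failure category first, which is what drives the "actionable improvement recommendations".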
Implement keyboard-driven web interface for reviewing generated specs, providing a 10x faster review workflow. Includes auto-save, progress tracking, and JSON export. Update documentation with a complete annotation tool guide and usage instructions.
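The auto-save and JSON export steps of the annotation tool can be sketched with the standard library alone. The record fields and file layout are assumptions; the real tool is FastHTML-based and its schema may differ:

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path

# Hypothetical annotation record -- field names are assumptions
# for illustration, not the tool's actual schema.
@dataclass
class Annotation:
    spec_id: str
    verdict: str          # "pass" or "fail"
    notes: str = ""

def save_annotations(annotations, path: Path) -> None:
    """Write all annotations to a JSON file (the auto-save/export step)."""
    path.write_text(json.dumps([asdict(a) for a in annotations], indent=2))

out = Path(tempfile.mkdtemp()) / "annotations.json"
save_annotations([Annotation("spec-01", "pass", "clear requirements")], out)
```

Rewriting the whole file on every keystroke-driven verdict keeps the on-disk state consistent with the UI at all times, at the cost of a small write per annotation.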
@kfinkels kfinkels requested a review from kanfil January 19, 2026 08:19
@kanfil kanfil merged commit 2851781 into main Jan 19, 2026
1 check passed