Add comprehensive evaluation infrastructure for spec and plan templates #2

Merged
kanfil merged 14 commits into main from adding_eval
Jan 19, 2026

Conversation

@kfinkels
Collaborator

Summary

  • Implemented comprehensive evaluation infrastructure using PromptFoo to test spec and plan template quality
  • Added custom annotation tool with FastHTML-based UI for manual spec evaluation
  • Integrated automated error analysis for iterative prompt refinement
  • Achieved 90% evaluation pass rate through systematic prompt improvements
  • Added GitHub Actions workflow for continuous evaluation on pull requests
  • Created dataset of 17 real specs and 2 real plans for testing
  • Made LLM model configurable via environment variables
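The last bullet, model selection via environment variables, can be sketched roughly as follows. The variable names and default model ID below are assumptions for illustration, not necessarily the PR's actual configuration:

```python
import os

# Hypothetical variable names and defaults -- the PR makes the model
# configurable via environment variables, but the exact names used
# in the repository are assumptions here.
DEFAULT_PROVIDER = "anthropic"
DEFAULT_MODEL = "claude-sonnet-4-5"

def resolve_model() -> str:
    """Build a provider:model string from environment variables,
    falling back to defaults when they are unset."""
    provider = os.environ.get("EVAL_LLM_PROVIDER", DEFAULT_PROVIDER)
    model = os.environ.get("EVAL_LLM_MODEL", DEFAULT_MODEL)
    return f"{provider}:{model}"
```

With no variables set this yields `anthropic:claude-sonnet-4-5`; exporting `EVAL_LLM_PROVIDER=openai` and `EVAL_LLM_MODEL=gpt-4o` would switch the whole evaluation run without code changes.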

Key Features

  • PromptFoo Integration: Automated evaluation framework with custom graders for assessing spec/plan quality
  • Annotation Tool: Interactive web UI for reviewing and annotating evaluation results
  • Error Analysis: Automated analysis of failures with actionable improvement recommendations
  • GitHub Actions: CI/CD integration for running evaluations on PRs
  • Comprehensive Documentation: Setup guides, workflows, and quick reference materials
  • Model Flexibility: Configurable LLM provider and model selection
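For the "custom graders" mentioned above, PromptFoo supports Python assertion files exposing a `get_assert(output, context)` function. The sketch below follows that shape, but the required-section list and scoring scheme are assumptions for illustration, not the PR's actual graders:

```python
# Hypothetical custom grader in the shape PromptFoo loads via
# `type: python` / `value: file://grader.py`. The section names
# checked here are illustrative assumptions.
REQUIRED_SECTIONS = ["## Summary", "## Requirements", "## Acceptance Criteria"]

def get_assert(output: str, context: dict) -> dict:
    """Grade a generated spec: pass when every required section is present,
    with a partial score proportional to how many sections were found."""
    missing = [s for s in REQUIRED_SECTIONS if s not in output]
    return {
        "pass": not missing,
        "score": 1.0 - len(missing) / len(REQUIRED_SECTIONS),
        "reason": "ok" if not missing else f"missing sections: {missing}",
    }
```

Returning a dict with `pass`/`score`/`reason` lets the framework surface both a binary verdict and a graded score per test case.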

Test plan

  • Run evaluation locally using evals/scripts/run-promptfoo-eval.sh
  • Verify GitHub Actions workflow executes successfully on PR
  • Test annotation tool with evals/scripts/run-annotation-tool.sh
  • Review automated error analysis output
  • Confirm evaluation scores meet minimum thresholds (60% for spec, 75% for plan)
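The final threshold check in the test plan can be sketched as a pass-rate comparison. The 60% (spec) and 75% (plan) minimums come from this PR; the function shape and result format are assumptions:

```python
# Sketch of the pass-rate threshold check from the test plan.
# Thresholds (60% spec, 75% plan) are stated in the PR; how results
# are fed in is an assumption for illustration.
THRESHOLDS = {"spec": 0.60, "plan": 0.75}

def meets_threshold(kind: str, passed: int, total: int) -> bool:
    """Return True when the pass rate for a template kind meets its minimum."""
    if total == 0:
        return False
    return passed / total >= THRESHOLDS[kind]

# Example: 16 of 17 specs passing clears the 60% spec threshold.
print(meets_threshold("spec", 16, 17))  # True
```

A CI job would compute `passed`/`total` from the PromptFoo results file and fail the workflow when this returns False.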

Keren Finkelstein added 14 commits January 8, 2026 10:36
… of spec and plan template outputs using PromptFoo with Claude Sonnet 4.5.
  - Create run-auto-error-analysis.sh script for automated spec evaluation
  - Add run-automated-error-analysis.py with Claude-powered categorization
  - Evaluate specs with binary pass/fail and failure categorization
  - Generate detailed CSV reports and summary files
  - Update .gitignore to exclude analysis results
  - Document automated and manual error analysis workflows in README
  - Mark Week 1 (Error Analysis Foundation) as completed in workplan

  Provides two error analysis options:
  1. Automated (Claude API) - fast, batch evaluation
  2. Manual (Jupyter) - deep investigation and exploration
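The automated option's "failure categorization" plus "CSV reports" step can be sketched as a small aggregation pass. The record fields and category names below are assumptions, not the script's actual schema:

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical per-spec results with a binary verdict and a failure
# category (as assigned by the Claude-powered categorization step).
results = [
    {"spec": "spec-01", "verdict": "pass", "category": ""},
    {"spec": "spec-02", "verdict": "fail", "category": "missing-acceptance-criteria"},
    {"spec": "spec-03", "verdict": "fail", "category": "missing-acceptance-criteria"},
]

def summarize(rows) -> str:
    """Tally failure categories across all failed specs and render
    the summary as CSV text (category, count)."""
    counts = Counter(r["category"] for r in rows if r["verdict"] == "fail")
    buf = StringIO()
    writer = csv.writer(buf)
    writer.writerow(["failure_category", "count"])
    for category, count in counts.most_common():
        writer.writerow([category, count])
    return buf.getvalue()

print(summarize(results))
```

Sorting by `most_common()` puts the highest-impact failure category first, which is what drives the "actionable improvement recommendations".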
Implement keyboard-driven web interface for reviewing generated specs, providing a 10x faster review workflow. Includes auto-save, progress tracking, and JSON export. Update documentation with a complete annotation tool guide and usage instructions.
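The auto-save and JSON export steps of the annotation tool can be sketched with the standard library alone. The record fields and file layout are assumptions; the real tool is FastHTML-based and its schema may differ:

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path

# Hypothetical annotation record -- field names are assumptions
# for illustration, not the tool's actual schema.
@dataclass
class Annotation:
    spec_id: str
    verdict: str          # "pass" or "fail"
    notes: str = ""

def save_annotations(annotations, path: Path) -> None:
    """Write all annotations to a JSON file (the auto-save/export step)."""
    path.write_text(json.dumps([asdict(a) for a in annotations], indent=2))

out = Path(tempfile.mkdtemp()) / "annotations.json"
save_annotations([Annotation("spec-01", "pass", "clear requirements")], out)
```

Rewriting the whole file on every keystroke-driven verdict keeps the on-disk state consistent with the UI at all times, at the cost of a small write per annotation.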
@kfinkels kfinkels requested a review from kanfil January 19, 2026 08:19
@kanfil kanfil merged commit 2851781 into main Jan 19, 2026
1 check passed