Commit d06bab4
feat: Terminal-Bench eval harness (MVP Phase 1) (ambient-code#178)
* feat: container support (ambient-code#171)
* docs: fix container Quick Start to use writable output volumes
Users were unable to access reports because examples used ephemeral
container /tmp directory. Updated all examples to show proper pattern:
- Mount writable host directory for output
- Use mounted path for --output-dir
- Reports now accessible on host filesystem
Changes:
- CONTAINER.md: Updated Quick Start, Usage, CI/CD examples
- README.md: Updated Container (Recommended) section
- Added troubleshooting section for ephemeral filesystem issue
- Removed confusing "Save Output Files" section (integrated into examples)
Fixes issue where `podman run --rm -v /repo:/repo:ro agentready assess /repo --output-dir /tmp`
writes reports inside container's ephemeral /tmp, destroyed on exit.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: update bundler to v2.5.23 for Dependabot compatibility
Dependabot only supports bundler v2.* but Gemfile.lock specified v1.17.2.
Updated BUNDLED WITH section to use bundler 2.5.23.
Fixes Dependabot error:
"Dependabot detected the following bundler requirement for your project: '1'.
Currently, the following bundler versions are supported in Dependabot: v2.*."
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
* feat: implement Terminal-Bench eval harness (Phase 1A-1D)
Add comprehensive evaluation harness for systematic A/B testing of
AgentReady assessors against Terminal-Bench (tbench.ai) performance.
**New Components:**
Models (eval_harness.py):
- TbenchResult: Individual benchmark run data
- BaselineMetrics: Statistical baseline (mean, std_dev, median)
- AssessorImpact: Delta score with p-value and Cohen's d effect size
- EvalSummary: Aggregated results with tier-level statistics
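As an illustration of the baseline model listed above, here is a minimal sketch of how `BaselineMetrics` could carry its three statistics. The field names follow the commit message; the `from_scores` constructor and the zero-spread fallback for a single run are assumptions, not the harness's actual code.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineMetrics:
    """Summary statistics over baseline benchmark scores (illustrative shape)."""

    mean: float
    std_dev: float
    median: float

    @classmethod
    def from_scores(cls, scores: list[float]) -> "BaselineMetrics":
        # Hypothetical helper: build the metrics from raw per-run scores.
        if not scores:
            raise ValueError("at least one score is required")
        # Sample standard deviation needs two or more runs; fall back to 0.0.
        spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
        return cls(
            mean=statistics.mean(scores),
            std_dev=spread,
            median=statistics.median(scores),
        )
```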
Services (eval_harness/):
- TbenchRunner: Mocked tbench with deterministic scoring (seeded by commit hash)
- BaselineEstablisher: Baseline metrics calculation
- AssessorTester: Core A/B testing (clone → assess → fix → measure)
- ResultsAggregator: Multi-assessor aggregation and ranking
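One way the "deterministic scoring (seeded by commit hash)" idea could work is sketched below. The function name, the `iteration` parameter, and the 40-90 score range are all hypothetical; only the seeding-by-commit-hash approach comes from the description above.

```python
import hashlib
import random


def mock_tbench_score(commit_hash: str, iteration: int = 0) -> float:
    """Return a reproducible pseudo-score for a commit (hypothetical model)."""
    # Derive the seed from a cryptographic digest so it is stable across
    # processes; the built-in hash() is salted per interpreter run.
    digest = hashlib.sha256(f"{commit_hash}:{iteration}".encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    # Assumed score range, chosen only for illustration.
    return round(rng.uniform(40.0, 90.0), 2)
```

Calling the function twice with the same commit hash and iteration yields the same score, which is what makes the mocked workflow reproducible.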
CLI Commands (eval-harness):
- baseline: Establish baseline performance (N iterations)
- show-baseline: Display previous baseline results
- test-assessor: Test single assessor impact
- run-tier: Test all assessors in tier sequentially
- summarize: Display aggregated results with rankings
**Features:**
- Deterministic mocking for reproducible workflow validation
- Statistical rigor: scipy t-tests, Cohen's d effect size
- Significance testing: p < 0.05 AND |d| > 0.2
- Integration with existing FixerService for remediation
- Comprehensive JSON output for dashboard generation
- 45 unit tests (100% passing)
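The significance rule above (p < 0.05 AND |d| > 0.2) can be sketched as follows. This assumes Cohen's d is computed with the pooled sample standard deviation, which is the common definition but is not confirmed by this commit. The p-value is taken as an input here; per the description it comes from a scipy t-test in the harness. Function names are illustrative.

```python
import math
import statistics


def cohens_d(baseline: list[float], treated: list[float]) -> float:
    """Cohen's d with pooled sample standard deviation (needs >= 2 values each)."""
    n1, n2 = len(baseline), len(treated)
    v1 = statistics.variance(baseline)
    v2 = statistics.variance(treated)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    diff = statistics.mean(treated) - statistics.mean(baseline)
    if pooled == 0:
        # No spread at all: the effect is either absent or unbounded.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / pooled


def is_significant(p_value: float, d: float,
                   alpha: float = 0.05, min_effect: float = 0.2) -> bool:
    """Both criteria must hold: statistical and practical significance."""
    return p_value < alpha and abs(d) > min_effect
```

Requiring both conditions guards against reporting a tiny but statistically detectable delta as a meaningful improvement.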
**Workflow:**
1. agentready eval-harness baseline --iterations 5
2. agentready eval-harness run-tier --tier 1
3. agentready eval-harness summarize --verbose
**Results Structure:**
.agentready/eval_harness/
├── baseline/
│ ├── summary.json (statistics)
│ └── run_00[1-N].json (individual runs)
├── assessors/
│ ├── <assessor_id>/
│ │ ├── impact.json (delta, p-value, effect size)
│ │ └── run_00[1-N].json (post-remediation runs)
│ └── ...
└── summary.json (aggregated with tier impacts + rankings)
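A rough sketch of writing the baseline part of this layout is shown below. It assumes `run_00[1-N].json` means three-digit zero-padded run indices; the JSON keys (`score`, `iterations`) and both function names are hypothetical placeholders rather than the harness's real schema.

```python
import json
from pathlib import Path


def run_file(directory: Path, index: int) -> Path:
    """Path of the index-th run file, e.g. run_001.json (assumed padding)."""
    return directory / f"run_{index:03d}.json"


def write_baseline(root: Path, scores: list[float]) -> Path:
    """Write one file per run plus summary.json under the baseline directory."""
    baseline_dir = root / ".agentready" / "eval_harness" / "baseline"
    baseline_dir.mkdir(parents=True, exist_ok=True)
    for i, score in enumerate(scores, start=1):
        # Hypothetical per-run payload.
        run_file(baseline_dir, i).write_text(json.dumps({"score": score}))
    summary = baseline_dir / "summary.json"
    # Hypothetical summary payload.
    summary.write_text(json.dumps({"iterations": len(scores)}))
    return summary
```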
**Next Phase:** Dashboard generation (Jekyll + Chart.js)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* feat: add GitHub Pages dashboard (Phase 1E)
Complete Terminal-Bench evaluation dashboard with Chart.js visualizations.
**New Components:**
DashboardGenerator Service:
- Generates Jekyll-compatible JSON data files
- 5 output files: summary, ranked_assessors, tier_impacts, baseline, stats
- Auto-discovers repository root for docs/ placement
Dashboard Page (docs/tbench.md):
- Chart.js bar chart for tier impacts
- Overview cards: assessors tested, significance rate, baseline
- Top 5 performers table
- Complete results table with sortable columns
- Methodology section (collapsible)
- XSS-safe DOM manipulation (no innerHTML)
CLI Command (dashboard):
- Generates all dashboard data files
- Verbose mode shows file sizes
- Auto-updates on run-tier execution
Navigation:
- Added 'Terminal-Bench' to docs/_config.yml
**Security:**
- Fixed XSS vulnerability using safe DOM methods
- All user data rendered via textContent/createElement
**Testing:**
- Dashboard command tested successfully
- Generates 5 data files (6.5 KB total)
- All 45 unit tests passing
**Configuration:**
- Updated markdown-link-check to ignore internal docs links
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* feat: add eval harness documentation and tests (Phase 1F)
Complete comprehensive documentation and testing suite for Terminal-Bench eval harness.
**New Documentation:**
docs/tbench/methodology.md:
- Comprehensive methodology explanation (A/B testing workflow)
- Statistical methods (t-tests, Cohen's d, effect sizes)
- Interpreting results (delta scores, significance, tier impact)
- Limitations and validity criteria
- Examples and FAQ
CLAUDE.md:
- New "Terminal-Bench Eval Harness" section
- Architecture overview and component descriptions
- Running evaluations (complete command examples)
- File structure and organization
- Statistical methods and interpretation
- Current status (Phase 1 complete, Phase 2 planned)
- Testing instructions and troubleshooting
**New Tests:**
tests/unit/test_eval_harness_cli.py:
- 6 CLI tests validating command structure
- Help message validation for all subcommands
- Baseline, test-assessor, run-tier, summarize, dashboard commands
tests/integration/test_eval_harness_e2e.py:
- 5 end-to-end integration tests
- Baseline establishment workflow
- File structure verification
- Mocked tbench determinism validation
- Complete workflow testing (baseline → files)
**Test Results:**
- 56 total tests passing (45 existing + 6 CLI + 5 integration)
- CLI: 100% help command coverage
- Integration: End-to-end workflow validated
- Models: 90-95% coverage (unchanged)
- Services: 85-90% coverage (unchanged)
**Phase 1F Status:**
✅ CLI tests (6 tests, 100% pass rate)
✅ Integration tests (5 tests, 100% pass rate)
✅ Comprehensive methodology documentation
✅ CLAUDE.md updated with complete guide
✅ All tests passing (56/56)
**Remaining Phase 1F Tasks** (deferred):
- Update README.md with eval harness quick start (lower priority)
- Create docs/eval-harness-guide.md (can reuse methodology.md content)
**Next Phase:**
Phase 2 - Real Terminal-Bench integration with Harbor framework
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix: resolve Pydantic config validation errors and macOS path issues
Fix config loading test failures by properly handling Pydantic ValidationError:
- Add dict type validation before Pydantic parsing
- Configure Pydantic to reject unknown config keys (extra="forbid")
- Convert Pydantic ValidationError to ValueError with test-expected messages
- Map error types: extra_forbidden, dict_type, float_parsing, list_type, string_type, value_error
Fix macOS symlink handling for sensitive directories:
- Add /private/etc and /var/log to sensitive dirs (macOS resolves /etc → /private/etc)
- Exclude /var/folders (pytest temp directories) from sensitive dir check
- Update both main.py (user warnings) and security.py (config validation)
Fix large repository warning abort handling:
- Re-raise click.Abort exceptions to properly exit when user declines
- Prevents try/except block from swallowing user's "no" response
All 41 tests in test_main.py now pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix: remove unused LanguageDetector mocks from align tests
- Removed all @patch("agentready.cli.align.LanguageDetector") decorators
- Removed mock_detector parameters from all test methods
- Updated all tests to mock FixerService.generate_fix_plan() with proper FixPlan structure
- Added user input ("y\n") to tests with fixes that need confirmation
- All 25 align command tests now pass (12 tests fixed)
The align command doesn't use LanguageDetector, so the mocks were causing
decorator failures. Tests also needed proper mocking of the FixPlan object
with fixes, projected_score, and points_gained attributes.
* docs: update test fix progress - 48 failures remaining
* docs: add Phase 2 test fix plan for remaining 48 failures
* test: improve CLI extract-skills test fixtures with known attribute IDs
Updated temp_repo fixture to use attribute IDs that PatternExtractor
recognizes (claude_md_file, type_annotations) instead of generic test
attributes. This allows PatternExtractor to extract skills from the
test assessment.
Identified root cause for remaining 6 CLI failures: output directory
is created relative to cwd, not repo_path. Fix planned for next session.
Test status: Still 19 failed (investigation in progress)
Related: ambient-code#178
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* chore: add docs/_site to gitignore (Jekyll generated files)
* fix: revert sensitive directory list to match upstream/main
---------
Co-authored-by: Claude <noreply@anthropic.com>
1 parent 990fa2d commit d06bab4
File tree
31 files changed, +8106 -73 lines changed
- docs
- _data/tbench
- tbench
- mockups
- src/agentready
- cli
- models
- services/eval_harness
- tests
- integration
- unit