Commit aab5383

jeremyeder and claude committed
feat: add Harbor Terminal-Bench comparison for agent effectiveness
Add comprehensive Harbor integration to empirically measure Claude Code performance impact of .claude/agents/doubleagent.md using Terminal-Bench.

Features:
- A/B testing framework (with/without agent file)
- Statistical significance testing (t-tests, Cohen's d)
- Multiple output formats (JSON, Markdown, HTML)
- Interactive dashboard with Chart.js visualizations
- CLI commands: compare, list, view

Data Models:
- HarborTaskResult: Single task result from result.json
- HarborRunMetrics: Aggregated metrics per run
- HarborComparison: Complete comparison with deltas

Services:
- HarborRunner: Execute Harbor CLI via subprocess
- AgentFileToggler: Safely enable/disable agent files
- ResultParser: Parse Harbor result.json files
- HarborComparer: Calculate deltas and statistical significance
- DashboardGenerator: Generate HTML reports

Reporters:
- HarborMarkdownReporter: GitHub-Flavored Markdown
- DashboardGenerator: Interactive HTML with inlined Chart.js

Critical fixes applied based on code review:
- Fixed division by zero in delta calculations (returns None)
- Removed manual toggler.enable() call (context manager handles it)
- Inlined Chart.js library (self-contained HTML principle)

Test Coverage: 24/24 tests passing, 98% models coverage, 95%+ services

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent 589269b commit aab5383

16 files changed: +3155 -1 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -51,6 +51,7 @@ coverage.xml
# AgentReady runtime artifacts
.agentready/
.agentready/cache/ # Explicitly exclude cached repositories (510MB+)
.agentready/harbor_comparisons/ # Harbor benchmark comparison results
*.log
*.tmp
plans/ # Planning documents (was .plans/)

CLAUDE.md

Lines changed: 100 additions & 0 deletions
@@ -215,6 +215,106 @@ agentready/

---

## Harbor Benchmark Comparison

**Purpose**: Empirically measure how `.claude/agents/doubleagent.md` affects Claude Code performance, using Harbor's Terminal-Bench.

### Overview

The Harbor comparison feature automates A/B testing of agent file effectiveness by:
1. Running Terminal-Bench tasks WITHOUT doubleagent.md (disabled)
2. Running Terminal-Bench tasks WITH doubleagent.md (enabled)
3. Calculating deltas and statistical significance
4. Generating comprehensive reports (JSON, Markdown, HTML)
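
The first two steps above can be sketched as a small context manager plus a driver. This is a minimal sketch under assumptions, not the project's `AgentFileToggler`: the helper names `agent_file_disabled` and `run_ab` and the rename-to-`.disabled` strategy are hypothetical, and `run_benchmark` stands in for whatever actually invokes Harbor.

```python
import contextlib
import tempfile
from pathlib import Path


@contextlib.contextmanager
def agent_file_disabled(agent_file: Path):
    """Temporarily hide an agent file by renaming it, then restore it.

    Hypothetical helper; the real AgentFileToggler may work differently.
    """
    backup = agent_file.with_name(agent_file.name + ".disabled")
    existed = agent_file.exists()
    if existed:
        agent_file.rename(backup)
    try:
        yield
    finally:
        # Restore the file even if the benchmark run raised.
        if existed and backup.exists():
            backup.rename(agent_file)


def run_ab(agent_file: Path, run_benchmark):
    """Run the baseline (file hidden) first, then the treatment (file present)."""
    with agent_file_disabled(agent_file):
        without_agent = run_benchmark()
    with_agent = run_benchmark()
    return without_agent, with_agent


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        agent = Path(tmp) / "doubleagent.md"
        agent.write_text("# agent instructions\n")
        # Stand-in for a Harbor run: report whether the file is visible.
        print(run_ab(agent, agent.exists))
```

Restoring the file in a `finally` block means a failed run cannot leave the agent file disabled, which is in the spirit of the commit message's note that the context manager handles re-enabling.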

### Quick Start

```bash
# Install Harbor
uv tool install harbor

# Run comparison (3 tasks, ~30-60 min)
agentready harbor compare \
  -t adaptive-rejection-sampler \
  -t async-http-client \
  -t terminal-file-browser \
  --verbose \
  --open-dashboard
```

### Key Metrics

- **Success Rate**: Percentage of tasks completed successfully
- **Duration**: Average time to complete tasks
- **Statistical Significance**: T-tests (p<0.05) and Cohen's d effect sizes
- **Per-Task Impact**: Individual task improvements/regressions
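
To make the metrics above concrete, here is a rough sketch of how success rate, average duration, and per-task impact could be derived from a list of task results. The `TaskResult` class and its field names are assumptions for illustration; the actual `HarborTaskResult` and `HarborRunMetrics` models may be shaped differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class TaskResult:
    """Hypothetical stand-in for HarborTaskResult; field names are assumed."""
    task_name: str
    success: bool
    duration_sec: float


def success_rate(results: list[TaskResult]) -> Optional[float]:
    """Percentage of successful tasks, or None when there are no results."""
    if not results:
        return None
    return 100.0 * sum(r.success for r in results) / len(results)


def average_duration(results: list[TaskResult]) -> Optional[float]:
    """Mean task duration in seconds, or None when there are no results."""
    if not results:
        return None
    return mean(r.duration_sec for r in results)


def per_task_duration_delta(
    without: list[TaskResult], with_agent: list[TaskResult]
) -> dict[str, float]:
    """Duration change per task (with minus without); negative means faster."""
    baseline = {r.task_name: r.duration_sec for r in without}
    return {
        r.task_name: r.duration_sec - baseline[r.task_name]
        for r in with_agent
        if r.task_name in baseline
    }
```

Returning `None` for an empty run keeps "no data" distinct from a genuine 0% success rate.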

### Output Files

Results stored in `.agentready/harbor_comparisons/` (gitignored):

- **JSON**: Machine-readable comparison data
- **Markdown**: GitHub-friendly report (commit this for PRs)
- **HTML**: Interactive dashboard with Chart.js visualizations

### CLI Commands

**Compare**:
```bash
agentready harbor compare -t task1 -t task2 [--verbose] [--open-dashboard]
```

**List comparisons**:
```bash
agentready harbor list
```

**View comparison**:
```bash
agentready harbor view .agentready/harbor_comparisons/comparison_latest.json
```
276+
277+
### Architecture
278+
279+
**Data Models** (`src/agentready/models/harbor.py`):
280+
- `HarborTaskResult` - Single task result from result.json
281+
- `HarborRunMetrics` - Aggregated metrics per run
282+
- `HarborComparison` - Complete comparison with deltas
283+
284+
**Services** (`src/agentready/services/harbor/`):
285+
- `HarborRunner` - Execute Harbor CLI via subprocess
286+
- `AgentFileToggler` - Safely enable/disable agent files
287+
- `ResultParser` - Parse Harbor result.json files
288+
- `HarborComparer` - Calculate deltas and statistical significance
289+
- `DashboardGenerator` - Generate HTML reports
290+
291+
**Reporters** (`src/agentready/reporters/`):
292+
- `HarborMarkdownReporter` - GitHub-Flavored Markdown
293+
- `DashboardGenerator` - Interactive HTML with Chart.js
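
The commit message notes that delta calculations now return `None` instead of dividing by zero. A minimal sketch of that guard, as it might look inside something like `HarborComparer`, follows. The function names `absolute_delta` and `percent_delta` are hypothetical, and the real implementation may differ.

```python
from typing import Optional


def absolute_delta(
    baseline: Optional[float], treatment: Optional[float]
) -> Optional[float]:
    """Treatment minus baseline, or None if either value is missing."""
    if baseline is None or treatment is None:
        return None
    return treatment - baseline


def percent_delta(
    baseline: Optional[float], treatment: Optional[float]
) -> Optional[float]:
    """Relative change in percent.

    Returns None when the change is undefined: a missing value or a zero
    baseline, which would otherwise raise ZeroDivisionError.
    """
    if baseline is None or treatment is None or baseline == 0:
        return None
    return 100.0 * (treatment - baseline) / baseline
```

A reporter can then render `None` as "n/a" rather than showing a misleading 0% or infinite change.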
294+
295+
### Statistical Methods
296+
297+
**Significance Criteria** (both required):
298+
- **P-value < 0.05**: 95% confidence (two-sample t-test)
299+
- **Cohen's d effect size**:
300+
- Small: 0.2 ≤ |d| < 0.5
301+
- Medium: 0.5 ≤ |d| < 0.8
302+
- Large: |d| ≥ 0.8
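
As an illustration of the effect-size thresholds above, the sketch below computes Cohen's d with a pooled standard deviation and maps it to a label, using only the standard library. The function names are hypothetical, and the "negligible" label for |d| < 0.2 is an assumption, since the section does not name that range. The p-value is left out because it needs a t-distribution CDF; `scipy.stats.ttest_ind` is one library routine that provides it, though whether the project uses it is not stated here.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d for two independent samples, using the pooled std deviation.

    Positive values mean sample b has the higher mean.
    """
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    pooled_var = (
        (n_a - 1) * variance(a) + (n_b - 1) * variance(b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(b) - mean(a)) / sqrt(pooled_var)


def effect_size_label(d: float) -> str:
    """Map |d| to the thresholds listed above ('negligible' is assumed)."""
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "negligible"
```

For example, durations of `[10.0, 12.0, 14.0]` versus `[13.0, 15.0, 17.0]` have a mean difference of 3 and a pooled standard deviation of 2, giving d = 1.5, a large effect.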
303+
304+
### Sample Sizes
305+
306+
- **Minimum**: 3 tasks (for statistical tests)
307+
- **Recommended**: 5-10 tasks (reliable results)
308+
- **Comprehensive**: 20+ tasks (production validation)
309+
310+
### Documentation
311+
312+
- **User Guide**: `docs/harbor-comparison-guide.md`
313+
- **Implementation Plan**: `.claude/plans/vivid-knitting-codd.md`
314+
- **Harbor Docs**: https://harborframework.com/docs
315+
316+
---
317+
218318
## Technologies
219319

220320
- **Python 3.12+** (only N and N-1 versions supported)
