Proposal: Add benchmarking via TerminalBench to spec-kit #159

adam-paterson · 2025-09-11T00:04:59Z

Summary
Add first-class benchmarking to spec-kit by integrating TerminalBench. This enables contributors and maintainers to measure performance of key commands and workflows over time, locally and in CI, and to spot regressions early.

Motivation

Catch performance regressions before release with repeatable, comparable runs.
Provide baseline targets that guide optimization and review.
Make performance visible as a quality signal alongside tests and lint.

Goals (initial)

A spec-kit bench subcommand that runs a curated suite.
Zero-config defaults; optional bench.config.(yml|json) for advanced control.
Machine- and human-readable outputs (JSON plus markdown summaries).
CI-friendly execution with artifacts and simple trend comparison.

Non‑Goals (initial)

Micro-benchmarking individual syscalls.
Cross-host normalization beyond what TerminalBench already supports.

Proposed Approach

Wrap TerminalBench as a subcommand exposed by spec-kit.
Ship a small "core" suite covering common flows (startup, generate, validate).
Emit JSON results plus a concise markdown summary and deltas versus baseline.
Provide an opt‑in "stress" suite to observe variance and soak behavior.
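
To make the "JSON results plus a concise markdown summary and deltas versus baseline" step concrete, here is a minimal sketch in Python. It does not call TerminalBench, whose interface is not described in this proposal. The function name `summarize`, the field names, and the table layout are all hypothetical and only illustrate one possible shape for the outputs.

```python
import json


def summarize(results, baseline):
    """Return (json_text, markdown_text) for per-case mean durations in seconds.

    `results` and `baseline` map a case name to its mean duration.
    Both the input shape and the output fields are assumptions for illustration.
    """
    rows = []
    for name, mean_s in sorted(results.items()):
        base = baseline.get(name)
        # A delta is only computed when the baseline has a non-zero value for this case.
        delta_pct = (mean_s - base) / base * 100.0 if base else None
        rows.append(
            {"name": name, "mean_s": mean_s, "baseline_s": base, "delta_pct": delta_pct}
        )

    lines = [
        "| case | mean (s) | baseline (s) | delta |",
        "| --- | --- | --- | --- |",
    ]
    for row in rows:
        base = "n/a" if row["baseline_s"] is None else f"{row['baseline_s']:.3f}"
        delta = "n/a" if row["delta_pct"] is None else f"{row['delta_pct']:+.1f}%"
        lines.append(f"| {row['name']} | {row['mean_s']:.3f} | {base} | {delta} |")

    return json.dumps({"cases": rows}, indent=2), "\n".join(lines)
```

Cases that are missing from the baseline are reported as "n/a" rather than treated as regressions, so a newly added case does not fail a comparison by itself.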

CLI Sketch

spec-kit bench → run default suite
spec-kit bench --suite core → named suite
spec-kit bench --filter <pattern> → subset
spec-kit bench --output results/ --format json,md → outputs
spec-kit bench --compare results/baseline.json → local diff
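
The flags above could be parsed roughly as follows. This is only a sketch using Python's standard `argparse`; the defaults for `--output` and `--format`, the set of allowed formats, and the helper names `build_parser` and `parse_formats` are assumptions, not part of the proposal.

```python
import argparse

ALLOWED_FORMATS = {"json", "md"}  # assumed from "--format json,md"


def parse_formats(text):
    """Split a comma-separated format list and reject unknown entries."""
    formats = [part.strip() for part in text.split(",") if part.strip()]
    unknown = [f for f in formats if f not in ALLOWED_FORMATS]
    if unknown or not formats:
        raise argparse.ArgumentTypeError(f"unsupported format list: {text!r}")
    return formats


def build_parser():
    parser = argparse.ArgumentParser(prog="spec-kit")
    sub = parser.add_subparsers(dest="command", required=True)
    bench = sub.add_parser("bench", help="run a benchmark suite")
    # No --suite means "use the default suite".
    bench.add_argument("--suite", default=None, help="named suite to run")
    bench.add_argument("--filter", dest="pattern", default=None,
                       help="run only cases matching this pattern")
    # Defaults below are assumptions; the proposal does not specify them.
    bench.add_argument("--output", default=".", help="directory for result files")
    bench.add_argument("--format", dest="formats", type=parse_formats,
                       default="json", help="comma-separated: json,md")
    bench.add_argument("--compare", default=None,
                       help="baseline JSON file to diff against")
    return parser
```

For example, `build_parser().parse_args(["bench", "--suite", "core", "--format", "json,md"])` yields `suite="core"` and `formats=["json", "md"]`.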

Example Config (optional)

bench:
  defaultSuite: core
  suites:
    core:
      - name: startup_cold
        cmd: "spec-kit --help"
        reps: 20
      - name: generate_small
        cmd: "spec-kit gen --template basic"
        reps: 10
    stress:
      - name: validate_large
        cmd: "spec-kit validate ./large-spec"
        duration: "2m"
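
A loader for this config might look like the sketch below. It reads the JSON variant only, so it needs nothing beyond the standard library. Several things here are assumptions rather than part of the proposal: the zero-config default mirrors the first "core" case above, each case must set exactly one of `reps` or `duration`, and durations use a simple number-plus-unit form such as "2m".

```python
import json
import re

# Assumed zero-config default: a single case copied from the example "core" suite.
DEFAULT_CONFIG = {
    "bench": {
        "defaultSuite": "core",
        "suites": {
            "core": [{"name": "startup_cold", "cmd": "spec-kit --help", "reps": 20}],
        },
    }
}

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert strings like '30s' or '2m' to seconds (assumed grammar)."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if not match:
        raise ValueError(f"invalid duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def load_config(text=None):
    """Parse JSON config text, or fall back to the defaults when none is given."""
    config = json.loads(text) if text else DEFAULT_CONFIG
    bench = config["bench"]
    suites = bench["suites"]
    if bench["defaultSuite"] not in suites:
        raise ValueError("defaultSuite does not name a defined suite")
    for suite_name, cases in suites.items():
        for case in cases:
            # Assumed rule: a case is bounded by a rep count or by a duration, not both.
            if ("reps" in case) == ("duration" in case):
                raise ValueError(
                    f"{suite_name}/{case.get('name')}: set exactly one of reps or duration"
                )
            if "duration" in case:
                parse_duration(case["duration"])  # validate early
    return bench
```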

Outputs & Signals

bench.json: raw metrics (durations, variance, environment).
bench.md: human-readable summary with deltas vs. baseline.
Exit codes: 0 OK; 2 soft regression (warn threshold); 3 hard regression (fail).
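
One way the exit codes could be derived from a comparison is sketched below. The 5% warn and 15% fail thresholds are placeholder assumptions, as is the rule that the single worst slowdown across cases decides the result; the proposal does not fix either.

```python
EXIT_OK = 0
EXIT_SOFT_REGRESSION = 2
EXIT_HARD_REGRESSION = 3


def exit_code(current, baseline, warn_pct=5.0, fail_pct=15.0):
    """Map per-case mean durations to the proposed exit codes.

    `current` and `baseline` map case names to mean durations in seconds.
    Threshold values are illustrative assumptions.
    """
    worst_slowdown = 0.0
    for name, base in baseline.items():
        # Skip cases that were not run or have an unusable baseline value.
        if name not in current or base <= 0:
            continue
        slowdown = (current[name] - base) / base * 100.0
        worst_slowdown = max(worst_slowdown, slowdown)

    if worst_slowdown >= fail_pct:
        return EXIT_HARD_REGRESSION
    if worst_slowdown >= warn_pct:
        return EXIT_SOFT_REGRESSION
    return EXIT_OK
```

Speedups never change the exit code in this sketch, since only positive slowdowns are tracked.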

CI Integration

Run: spec-kit bench --format json,md --output artifacts/.
Upload artifacts/bench.*; optionally comment a summary on PRs.
A nightly run on main publishes or updates a moving baseline (bench-baseline.json).
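
The proposal does not define how the "moving" baseline is computed. One possible reading, sketched here as an assumption, is a rolling window: keep the last few nightly means per case and publish their median as the baseline. The function name `update_baseline`, the 7-run window, and the data shapes are hypothetical.

```python
import statistics


def update_baseline(history, nightly, window=7):
    """Fold one nightly run into a rolling history and derive a baseline.

    `history` maps case name -> list of recent means (oldest first).
    `nightly` maps case name -> mean duration from tonight's run.
    Returns (new_history, baseline), where baseline is the per-case median.
    """
    new_history = {}
    for name in set(history) | set(nightly):
        samples = list(history.get(name, []))
        if name in nightly:
            samples.append(nightly[name])
        # Keep only the most recent `window` samples.
        new_history[name] = samples[-window:]

    baseline = {
        name: statistics.median(samples)
        for name, samples in new_history.items()
        if samples
    }
    return new_history, baseline
```

A median over a short window moves with real trends but is less sensitive to a single noisy night than overwriting the baseline on every run.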

Alternatives Considered

Homegrown timing harness → more maintenance, less comparability.
Hyperfine-only → great single-command timing, lacks suites and trend glue.

Risks & Mitigations

Cross-host noise → rely on warmup runs and repetitions; prefer pinned CI environments and fixed runners.
Suite creep → keep "core" small; move extras to an "extended" suite.
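
The warmups/reps mitigation can be illustrated with a small timing loop. This sketch uses only the Python standard library and is not how TerminalBench or Hyperfine measure commands; the function name `time_command`, the default counts, and the result fields are assumptions.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, reps=10, warmups=2):
    """Run `argv` warmups + reps times and summarize only the measured reps."""
    durations = []
    for i in range(warmups + reps):
        start = time.perf_counter()
        subprocess.run(
            argv, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        elapsed = time.perf_counter() - start
        # Warmup runs prime caches and are discarded from the statistics.
        if i >= warmups:
            durations.append(elapsed)
    return {
        "reps": reps,
        "median_s": statistics.median(durations),
        # stdev needs at least two samples.
        "stdev_s": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "durations_s": durations,
    }


if __name__ == "__main__":
    # Example: time a no-op Python process instead of a real spec-kit command.
    result = time_command([sys.executable, "-c", "pass"], reps=5, warmups=1)
    print(f"median {result['median_s']:.4f}s, stdev {result['stdev_s']:.4f}s")
```

Reporting the median alongside the standard deviation gives a quick read on whether a difference between two runs is larger than the run-to-run noise.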

Rollout Plan

Land spec-kit bench with a minimal core suite.
Add CI job and publish artifacts.
Introduce baseline + thresholds for regression gating.
Expand suites based on real-world usage and feedback.

Future Work

GEPA template generation: add perf-focused templates and sample specs to standardize scenarios.
Optional trend dashboard (badge or GitHub Pages) sourced from JSON artifacts.

Request for Feedback

jaycangel · 2025-09-11T06:17:48Z

This would be amazing!! It would be good to understand how much practical improvement spec-kit gives.

adam-paterson Sep 11, 2025
Author

Absolutely! I think this flow has the potential to rank near the top of the leaderboard alongside some of the other greats!
