# Acceptance

Command `run_bench.sh` with:

- [ ] Per-category metrics: accuracy, response time, token counts (prompt/completion/total)
- [ ] Per-model metrics: success rate, error distribution, latency distribution
- [ ] Export to CSV/JSON for analysis
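
As a rough illustration of what the checklist asks for, the sketch below aggregates per-request results into per-category and per-model summaries and serializes them to CSV and JSON. Everything in it is an assumption made for illustration: the record fields (`model`, `category`, `correct`, `error`, `latency_s`, `prompt_tokens`, `completion_tokens`), the sample data, and the function names are hypothetical and do not describe what `run_bench.sh` currently emits. Latency distribution is reduced here to min/median/max; percentiles or a histogram may be a better fit.

```python
import csv
import io
import json
import statistics
from collections import Counter, defaultdict

# Hypothetical per-request records; the field names are assumptions,
# not the actual output format of run_bench.sh.
RECORDS = [
    {"model": "model-a", "category": "math", "correct": True, "error": None,
     "latency_s": 0.5, "prompt_tokens": 100, "completion_tokens": 20},
    {"model": "model-a", "category": "math", "correct": False, "error": None,
     "latency_s": 1.5, "prompt_tokens": 110, "completion_tokens": 30},
    {"model": "model-a", "category": "code", "correct": True, "error": None,
     "latency_s": 1.0, "prompt_tokens": 200, "completion_tokens": 50},
    {"model": "model-b", "category": "code", "correct": False, "error": "timeout",
     "latency_s": 3.0, "prompt_tokens": 200, "completion_tokens": 0},
]


def group_by(records, key):
    """Group records into lists keyed by the given field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return groups


def per_category(records):
    """Accuracy, mean response time, and token counts for each category."""
    rows = []
    for category, rs in sorted(group_by(records, "category").items()):
        prompt = sum(r["prompt_tokens"] for r in rs)
        completion = sum(r["completion_tokens"] for r in rs)
        rows.append({
            "category": category,
            "accuracy": sum(r["correct"] for r in rs) / len(rs),
            "mean_response_time_s": statistics.mean(r["latency_s"] for r in rs),
            "prompt_tokens": prompt,
            "completion_tokens": completion,
            "total_tokens": prompt + completion,
        })
    return rows


def per_model(records):
    """Success rate, error counts, and a latency summary for each model."""
    rows = []
    for model, rs in sorted(group_by(records, "model").items()):
        latencies = [r["latency_s"] for r in rs]
        errors = Counter(r["error"] for r in rs if r["error"] is not None)
        rows.append({
            "model": model,
            "success_rate": sum(r["error"] is None for r in rs) / len(rs),
            "errors": dict(errors),
            "latency_min_s": min(latencies),
            "latency_median_s": statistics.median(latencies),
            "latency_max_s": max(latencies),
        })
    return rows


def to_csv(rows):
    """Serialize a list of flat dicts to CSV text; nested dicts become JSON strings."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    for row in rows:
        writer.writerow({k: json.dumps(v) if isinstance(v, dict) else v
                         for k, v in row.items()})
    return buf.getvalue()


def to_json(rows):
    """Serialize the same rows to indented JSON text."""
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    print(to_csv(per_category(RECORDS)))
    print(to_json(per_model(RECORDS)))
```

Keeping aggregation separate from serialization means the same rows can feed both export formats, so the CSV and JSON outputs cannot drift apart.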