Commit 96f67e7

Add skills telemetry and v1.8.4 docs
1 parent 87f629e commit 96f67e7

6 files changed: +274 -33 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -101,6 +101,7 @@ make build # Build the CLI
 ./sanity eval --agent gemini --dry-run # Preview without running
 ./sanity eval --agent droid --reasoning high # Set reasoning effort
 ./sanity eval --agent gemini --use-mcp-tools # Enable MCP tools
+./sanity eval --agent opencode --use-skills # Enable Agent Skills mode
 ./sanity eval --agent opencode --disable-mcp # Disable MCP tools / currently only supported for opencode
 ./sanity eval --agent opencode --keep-workspaces # Keep workspaces for debugging
 ./sanity eval --agent gemini --no-sandbox # Disable bubblewrap sandbox

docs/ROAD-TO-V2-Overhaul.md

Lines changed: 14 additions & 0 deletions
@@ -141,6 +141,20 @@ Two focused improvements to eval ergonomics:
 - **MCP tool guidance integrated into task prompt sections.** Instead of appending a separate `MCP TOOLS:` block, MCP guidance is now woven into the existing `ENVIRONMENT`, `YOUR TASK`, `IMPORTANT`, and `RULES` sections when `--use-mcp-tools` is enabled. This produces a more natural prompt structure and makes MCP tool usage a first-class instruction rather than an afterthought. Agent-specific `mcp_prompt` text is no longer appended separately.
 - **Workspace cleanup preserves eval artifacts.** Previously, `--keep-workspaces=false` (the default) removed the entire workspace directory after validation, which also deleted `agent.log`, `validation.log`, and integrity artifacts since the workspace dir doubles as the task output dir. Cleanup now selectively removes only source files while preserving harness-produced outputs (`agent.log`, `validation.log`, `integrity.json`, `integrity-files/`, `integrity-diff/`).
 
+## What changed in v1.8.4 — Agent Skills prompting and telemetry
+
+This release focuses on making `--use-skills` behavior measurable and auditable.
+
+- **Stronger skills prompt contract.** The eval prompt now explicitly instructs agents to load at least one relevant skill (when available) and make the first skill call before writing code.
+- **Per-task skills telemetry.** `results[]` now records `skills_used` and `skills_usage_signals` so skill adoption can be analyzed per task.
+- **Run-level skills metrics.** `summary.json`, `submission.json`, and `report.md` now include:
+  - `skills_usage_rate`
+  - `total_skills_usage_signals`
+  - `tasks_with_skills_usage`
+- **OpenCode-compatible skill signal detection.** Telemetry parsing now detects common OpenCode skill markers like `Skill "..."` activations and explicit `firecrawl ...` command usage, in addition to file-path based skill artifact signals.
+
+Practical impact: A/B runs with and without `--use-skills` can now show an observable usage delta instead of only a mode toggle.
+
 ## Compatibility and comparing old runs
 
 `1.7.x` is intentionally not identical to `v1.6.1` behavior. If you are comparing against historical leaderboard-era runs, use legacy mode:
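
Reviewer sketch for the "OpenCode-compatible skill signal detection" bullet in the hunk above: the commit view does not show the harness's parser, so the following is only a rough illustration of how the two text markers it names (`Skill "..."` activations and `firecrawl ...` commands) might be counted in an agent log. The regular expressions, the `count_skill_signals` function, and the sample log are all hypothetical, and the file-path based signals mentioned in the notes are not covered.

```python
import re

# Hypothetical patterns: the real harness parser is not shown in this commit view.
# Matches an activation marker such as: Skill "firecrawl"
SKILL_ACTIVATION = re.compile(r'\bSkill\s+"[^"]+"')
# Matches a line that starts with a firecrawl command, optionally after a "$ " prompt.
FIRECRAWL_COMMAND = re.compile(r'^\s*(?:\$\s*)?firecrawl\s+\S+', re.MULTILINE)


def count_skill_signals(log_text: str) -> int:
    """Count skill-usage markers in an agent log (illustrative only)."""
    activations = len(SKILL_ACTIVATION.findall(log_text))
    commands = len(FIRECRAWL_COMMAND.findall(log_text))
    return activations + commands


# Made-up log excerpt, used only to exercise the function.
sample_log = (
    'Skill "firecrawl"\n'
    "$ firecrawl scrape https://example.com\n"
    "wrote main.go\n"
)

signals = count_skill_signals(sample_log)
# Mirrors the per-task field names from the notes above.
print({"skills_used": signals > 0, "skills_usage_signals": signals})
```

Under these assumptions, a task whose log contains no marker would report `skills_used: false` and `skills_usage_signals: 0`.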

docs/SCORING.md

Lines changed: 16 additions & 3 deletions
@@ -158,6 +158,7 @@ eval-results/<timestamp>-<agent>/
 "model": "gemini-3-flash-preview",
 "reasoning": "high",
 "use_mcp_tools": false,
+"use_skills": true,
 "disable_mcp": false,
 "timestamp": "2026-01-07T052902",
 "harness_version": "abc123",
@@ -174,6 +175,9 @@ eval-results/<timestamp>-<agent>/
 "weighted_pass_rate": 45.5,
 "weighted_score": 15.12,
 "max_possible_score": 33.29,
+"skills_usage_rate": 38.5,
+"total_skills_usage_signals": 5,
+"tasks_with_skills_usage": 10,
 
 "by_language": {
 "go": { "passed": 3, "failed": 3, "total": 6, "pass_rate": 50.0 }
@@ -194,16 +198,21 @@ eval-results/<timestamp>-<agent>/
 "weight": 1.0,
 "score": 1.0,
 "duration_ms": 45000,
-"attempts": 1
+"attempts": 1,
+"skills_used": true,
+"skills_usage_signals": 1
 }
 ]
 }
 ```
 
 Notes:
-- `timeout`, `parallel`, `use_mcp_tools`, `disable_mcp`, `sandbox`, `legacy`,
+- `timeout`, `parallel`, `use_mcp_tools`, `use_skills`, `disable_mcp`, `sandbox`, `legacy`,
 `quota_affected_tasks`, and `total_quota_retries` are always emitted.
 - Per-task `results[]` include explicit retry/infra metadata fields.
+- Skills telemetry fields are emitted at both run and per-task levels:
+`skills_usage_rate`, `total_skills_usage_signals`, `tasks_with_skills_usage`,
+`skills_used`, and `skills_usage_signals`.
 
 ### attestation.json Schema
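
Reviewer sketch for the `summary.json` telemetry fields added in the hunk above: the diff names the run-level and per-task fields but does not define how one is derived from the other. The following assumes, without confirmation from the source, that `skills_usage_rate` is the percentage of tasks with `skills_used` set, rounded to one decimal, and that `total_skills_usage_signals` is the sum of per-task `skills_usage_signals`. The `summarize_skills` function and the sample task list are hypothetical.

```python
def summarize_skills(results: list[dict]) -> dict:
    """Roll per-task skills telemetry up into run-level metrics.

    Assumed definitions (not confirmed by the diff):
    - tasks_with_skills_usage: number of tasks where skills_used is true
    - total_skills_usage_signals: sum of per-task skills_usage_signals
    - skills_usage_rate: tasks_with_skills_usage / total tasks, as a percentage
    """
    total_tasks = len(results)
    tasks_with_usage = sum(1 for r in results if r.get("skills_used"))
    total_signals = sum(int(r.get("skills_usage_signals", 0)) for r in results)
    rate = round(100.0 * tasks_with_usage / total_tasks, 1) if total_tasks else 0.0
    return {
        "skills_usage_rate": rate,
        "total_skills_usage_signals": total_signals,
        "tasks_with_skills_usage": tasks_with_usage,
    }


# Made-up per-task entries using the field names from the schema above.
example_results = [
    {"attempts": 1, "skills_used": True, "skills_usage_signals": 1},
    {"attempts": 1, "skills_used": True, "skills_usage_signals": 3},
    {"attempts": 2, "skills_used": False, "skills_usage_signals": 0},
    {"attempts": 1, "skills_used": False, "skills_usage_signals": 0},
]

print(summarize_skills(example_results))
```

An empty `results[]` yields a rate of `0.0` here rather than a division error; whether the harness behaves the same way is not stated in the diff.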

@@ -243,6 +252,7 @@ Optimized for leaderboard submissions:
 "model": "gemini-3-flash-preview",
 "reasoning": "high",
 "use_mcp_tools": false,
+"use_skills": true,
 "disable_mcp": false,
 "timestamp": "2026-01-07T052902",
 
@@ -256,6 +266,9 @@ Optimized for leaderboard submissions:
 "max_possible_score": 33.29,
 
 "integrity_violations": 0,
+"skills_usage_rate": 38.5,
+"total_skills_usage_signals": 5,
+"tasks_with_skills_usage": 10,
 
 "by_language": {
 "go": { "passed": 3, "failed": 3, "total": 6, "pass_rate": 50.0 }
@@ -271,7 +284,7 @@ Optimized for leaderboard submissions:
 Notes:
 - `submission.json` includes run metadata and audit counters:
 `timeout`, `parallel`, `quota_affected_tasks`, and `total_quota_retries`.
-- Configuration booleans (`use_mcp_tools`, `disable_mcp`, `sandbox`, `legacy`)
+- Configuration booleans (`use_mcp_tools`, `use_skills`, `disable_mcp`, `sandbox`, `legacy`)
 are always emitted as explicit booleans.
 
 ### report.md Format
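
Reviewer sketch for the A/B comparison the v1.8.4 notes describe: with the run-level fields now present in `summary.json`, two runs (one with `--use-skills`, one without) could be compared roughly as follows. The `load_summary` and `skills_delta` helpers are hypothetical, the location of `summary.json` directly inside the run directory is an assumption based on the `eval-results/<timestamp>-<agent>/` layout in the hunk headers, and the baseline values are invented for illustration. The treatment values reuse the sample numbers from the schema in this diff.

```python
import json
from pathlib import Path

# Run-level skills fields named in this commit.
SKILLS_FIELDS = (
    "skills_usage_rate",
    "total_skills_usage_signals",
    "tasks_with_skills_usage",
)


def load_summary(run_dir: str) -> dict:
    """Read summary.json from a run directory (assumed location)."""
    return json.loads(Path(run_dir, "summary.json").read_text(encoding="utf-8"))


def skills_delta(baseline: dict, treatment: dict) -> dict:
    """Return treatment minus baseline for each run-level skills field."""
    return {
        field: treatment.get(field, 0) - baseline.get(field, 0)
        for field in SKILLS_FIELDS
    }


# Invented baseline (run without --use-skills).
baseline = {
    "use_skills": False,
    "skills_usage_rate": 0.0,
    "total_skills_usage_signals": 0,
    "tasks_with_skills_usage": 0,
}
# Sample values copied from the schema example in this diff.
treatment = {
    "use_skills": True,
    "skills_usage_rate": 38.5,
    "total_skills_usage_signals": 5,
    "tasks_with_skills_usage": 10,
}

print(skills_delta(baseline, treatment))
```

In practice the two dictionaries would come from `load_summary` on the two run directories; inline values are used here so the sketch runs without any eval output on disk.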
