docs/ROAD-TO-V2-Overhaul.md: 14 additions & 0 deletions
@@ -141,6 +141,20 @@ Two focused improvements to eval ergonomics:
- **MCP tool guidance integrated into task prompt sections.** Instead of appending a separate `MCP TOOLS:` block, MCP guidance is now woven into the existing `ENVIRONMENT`, `YOUR TASK`, `IMPORTANT`, and `RULES` sections when `--use-mcp-tools` is enabled. This produces a more natural prompt structure and makes MCP tool usage a first-class instruction rather than an afterthought. Agent-specific `mcp_prompt` text is no longer appended separately.
- **Workspace cleanup preserves eval artifacts.** Previously, `--keep-workspaces=false` (the default) removed the entire workspace directory after validation, which also deleted `agent.log`, `validation.log`, and integrity artifacts since the workspace dir doubles as the task output dir. Cleanup now selectively removes only source files while preserving harness-produced outputs (`agent.log`, `validation.log`, `integrity.json`, `integrity-files/`, `integrity-diff/`).
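The selective cleanup described above can be pictured with a minimal Python sketch. It is an illustration only, not the harness's actual code: the function name `clean_workspace` and the rule of matching on top-level entry names are assumptions, while the preserved names come from the list above.

```python
import shutil
from pathlib import Path

# Harness-produced outputs named in the changelog entry above.
# Assumption: entries are matched by their top-level name in the workspace.
PRESERVED = {
    "agent.log",
    "validation.log",
    "integrity.json",
    "integrity-files",
    "integrity-diff",
}


def clean_workspace(workspace: Path, preserved: set[str] = PRESERVED) -> list[str]:
    """Remove every top-level entry except the preserved ones.

    Returns the names that were removed, in sorted order.
    """
    removed = []
    for entry in sorted(workspace.iterdir()):
        if entry.name in preserved:
            continue  # keep harness outputs in place
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # source directories are removed recursively
        else:
            entry.unlink()  # plain files and symlinks
        removed.append(entry.name)
    return removed
```

Keeping an allowlist of outputs, rather than a denylist of source files, means anything the agent creates is removed by default and only known harness artifacts survive.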
## What changed in v1.8.4 — Agent Skills prompting and telemetry
This release focuses on making `--use-skills` behavior measurable and auditable.
- **Stronger skills prompt contract.** The eval prompt now explicitly instructs agents to load at least one relevant skill (when available) and make the first skill call before writing code.
- **Per-task skills telemetry.** `results[]` now records `skills_used` and `skills_usage_signals` so skill adoption can be analyzed per task.
- **Run-level skills metrics.** `summary.json`, `submission.json`, and `report.md` now include:
  - `skills_usage_rate`
  - `total_skills_usage_signals`
  - `tasks_with_skills_usage`
- **OpenCode-compatible skill signal detection.** Telemetry parsing now detects common OpenCode skill markers like `Skill "..."` activations and explicit `firecrawl ...` command usage, in addition to file-path based skill artifact signals.
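As a rough illustration of the marker detection in the last bullet, the sketch below counts `Skill "..."` activations and `firecrawl ...` command lines in agent log text. The regular expressions, the function name `count_skill_signals`, and the assumption that commands appear at the start of a line (optionally after a `$` prompt) are all hypothetical; the real parser may use different rules and also handles file-path signals, which are omitted here.

```python
import re

# Assumed marker shapes, inferred only from the changelog wording above.
SKILL_ACTIVATION = re.compile(r'Skill "([^"]+)"')
# A line that starts (optionally after a "$" prompt) with the firecrawl command.
FIRECRAWL_COMMAND = re.compile(r"^\s*(?:\$\s*)?firecrawl\s+\S+", re.MULTILINE)


def count_skill_signals(log_text: str) -> dict[str, int]:
    """Count skill-usage markers found in raw agent log text."""
    activations = SKILL_ACTIVATION.findall(log_text)
    commands = FIRECRAWL_COMMAND.findall(log_text)
    return {
        "skill_activations": len(activations),
        "firecrawl_commands": len(commands),
        "total": len(activations) + len(commands),
    }
```

Anchoring the command pattern to the start of a line is a deliberate choice in this sketch: it avoids counting prose that merely mentions `firecrawl` as a usage signal.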
Practical impact: A/B runs with and without `--use-skills` can now show an observable usage delta instead of only a mode toggle.
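To make the usage delta concrete, here is a small sketch that derives the three run-level metrics from per-task results and compares two runs. It assumes that `skills_used` is a boolean, that `skills_usage_signals` is an integer count, and that `skills_usage_rate` is the fraction of tasks with `skills_used` set; none of these definitions are confirmed by the changelog, and the helper names are hypothetical.

```python
def skills_metrics(results: list[dict]) -> dict:
    """Aggregate per-task telemetry into run-level metrics.

    Assumed shapes: `skills_used` is a bool, `skills_usage_signals` is an int.
    """
    tasks_with = sum(1 for r in results if r.get("skills_used"))
    total_signals = sum(int(r.get("skills_usage_signals", 0)) for r in results)
    rate = tasks_with / len(results) if results else 0.0
    return {
        "skills_usage_rate": rate,
        "total_skills_usage_signals": total_signals,
        "tasks_with_skills_usage": tasks_with,
    }


def usage_delta(with_skills: list[dict], without_skills: list[dict]) -> float:
    """Difference in usage rate between a --use-skills run and a baseline run."""
    return (
        skills_metrics(with_skills)["skills_usage_rate"]
        - skills_metrics(without_skills)["skills_usage_rate"]
    )
```

A delta near zero between the two runs would suggest the flag changed the mode without changing agent behavior, which is the case this telemetry is meant to expose.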
## Compatibility and comparing old runs
`1.7.x` is intentionally not identical to `v1.6.1` behavior. If you are comparing against historical leaderboard-era runs, use legacy mode: