lemon07r
diff --git a/README.md b/README.md
Lines changed: 1 addition & 0 deletions
diff --git a/docs/ROAD-TO-V2-Overhaul.md b/docs/ROAD-TO-V2-Overhaul.md
Lines changed: 14 additions & 0 deletions
diff --git a/docs/SCORING.md b/docs/SCORING.md
Lines changed: 16 additions & 3 deletions
@@ -101,6 +101,7 @@ make build    # Build the CLI
 ./sanity eval --agent gemini --dry-run                # Preview without running
 ./sanity eval --agent droid --reasoning high          # Set reasoning effort
 ./sanity eval --agent gemini --use-mcp-tools          # Enable MCP tools
+./sanity eval --agent opencode --use-skills           # Enable Agent Skills mode
 ./sanity eval --agent opencode --disable-mcp          # Disable MCP tools / currently only supported for opencode
 ./sanity eval --agent opencode --keep-workspaces      # Keep workspaces for debugging
 ./sanity eval --agent gemini --no-sandbox             # Disable bubblewrap sandbox
 
@@ -141,6 +141,20 @@ Two focused improvements to eval ergonomics:
 - **MCP tool guidance integrated into task prompt sections.** Instead of appending a separate `MCP TOOLS:` block, MCP guidance is now woven into the existing `ENVIRONMENT`, `YOUR TASK`, `IMPORTANT`, and `RULES` sections when `--use-mcp-tools` is enabled. This produces a more natural prompt structure and makes MCP tool usage a first-class instruction rather than an afterthought. Agent-specific `mcp_prompt` text is no longer appended separately.
 - **Workspace cleanup preserves eval artifacts.** Previously, `--keep-workspaces=false` (the default) removed the entire workspace directory after validation, which also deleted `agent.log`, `validation.log`, and integrity artifacts since the workspace dir doubles as the task output dir. Cleanup now selectively removes only source files while preserving harness-produced outputs (`agent.log`, `validation.log`, `integrity.json`, `integrity-files/`, `integrity-diff/`).
 
+## What changed in v1.8.4 — Agent Skills prompting and telemetry
+
+This release focuses on making `--use-skills` behavior measurable and auditable.
+
+- **Stronger skills prompt contract.** The eval prompt now explicitly instructs agents to load at least one relevant skill (when available) and make the first skill call before writing code.
+- **Per-task skills telemetry.** `results[]` now records `skills_used` and `skills_usage_signals` so skill adoption can be analyzed per task.
+- **Run-level skills metrics.** `summary.json`, `submission.json`, and `report.md` now include:
+  - `skills_usage_rate`
+  - `total_skills_usage_signals`
+  - `tasks_with_skills_usage`
+- **OpenCode-compatible skill signal detection.** Telemetry parsing now detects common OpenCode skill markers like `Skill "..."` activations and explicit `firecrawl ...` command usage, in addition to file-path based skill artifact signals.
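
The signal detection described in the last bullet could look roughly like the sketch below. This is not the harness's code: the regular expressions, the `skills/<name>/SKILL.md` path layout, and the helper names `count_skill_signals` and `summarize_task` are all assumptions made for illustration. Only the output keys `skills_used` and `skills_usage_signals` come from the changelog.

```python
import re

# Hypothetical patterns: the harness's real detection rules are not shown in this diff.
SKILL_ACTIVATION = re.compile(r'Skill\s+"([^"]+)"')  # e.g. Skill "web-search"
FIRECRAWL_COMMAND = re.compile(r'^\s*(?:\$\s*)?firecrawl\s+\S+', re.MULTILINE)
SKILL_FILE_PATH = re.compile(r'\S*skills/[^\s/]+/SKILL\.md')  # assumed artifact layout


def count_skill_signals(log_text: str) -> int:
    """Count every match of the three assumed signal patterns in an agent log."""
    patterns = (SKILL_ACTIVATION, FIRECRAWL_COMMAND, SKILL_FILE_PATH)
    return sum(len(pattern.findall(log_text)) for pattern in patterns)


def summarize_task(log_text: str) -> dict:
    """Build the two per-task telemetry fields named in the changelog."""
    signals = count_skill_signals(log_text)
    return {"skills_used": signals > 0, "skills_usage_signals": signals}
```

Anchoring the `firecrawl` pattern to the start of a line is a deliberate choice in this sketch: it counts command invocations but ignores prose that merely mentions the tool mid-sentence.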
+
+Practical impact: A/B runs with and without `--use-skills` can now show an observable usage delta instead of only a mode toggle.
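
One way to read that usage delta from two runs is sketched below. It assumes each run's `summary.json` has already been parsed into a dictionary and that a run without skills telemetry should count as zero. The function name `skills_delta` is hypothetical; the three field names are the ones listed above.

```python
SKILLS_FIELDS = (
    "skills_usage_rate",
    "total_skills_usage_signals",
    "tasks_with_skills_usage",
)


def skills_delta(with_skills: dict, without_skills: dict) -> dict:
    """Return (with - without) for each run-level skills metric.

    Missing fields are treated as 0 so that older runs, which predate the
    telemetry, can still serve as a baseline (an assumption of this sketch).
    """
    return {
        field: with_skills.get(field, 0) - without_skills.get(field, 0)
        for field in SKILLS_FIELDS
    }
```

For example, passing the parsed `summary.json` of a `--use-skills` run as the first argument and a baseline run as the second yields the per-metric difference between the two modes.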
+
 ## Compatibility and comparing old runs
 
 `1.7.x` is intentionally not identical to `v1.6.1` behavior. If you are comparing against historical leaderboard-era runs, use legacy mode:
 
@@ -158,6 +158,7 @@ eval-results/<timestamp>-<agent>/
   "model": "gemini-3-flash-preview",
   "reasoning": "high",
   "use_mcp_tools": false,
+  "use_skills": true,
   "disable_mcp": false,
   "timestamp": "2026-01-07T052902",
   "harness_version": "abc123",
@@ -174,6 +175,9 @@ eval-results/<timestamp>-<agent>/
   "weighted_pass_rate": 45.5,
   "weighted_score": 15.12,
   "max_possible_score": 33.29,
+  "skills_usage_rate": 38.5,
+  "total_skills_usage_signals": 5,
+  "tasks_with_skills_usage": 10,
 
   "by_language": {
     "go": { "passed": 3, "failed": 3, "total": 6, "pass_rate": 50.0 }
@@ -194,16 +198,21 @@ eval-results/<timestamp>-<agent>/
       "weight": 1.0,
       "score": 1.0,
       "duration_ms": 45000,
-      "attempts": 1
+      "attempts": 1,
+      "skills_used": true,
+      "skills_usage_signals": 1
     }
   ]
 }
 ```
 
 Notes:
-- `timeout`, `parallel`, `use_mcp_tools`, `disable_mcp`, `sandbox`, `legacy`,
+- `timeout`, `parallel`, `use_mcp_tools`, `use_skills`, `disable_mcp`, `sandbox`, `legacy`,
   `quota_affected_tasks`, and `total_quota_retries` are always emitted.
 - Per-task `results[]` include explicit retry/infra metadata fields.
+- Skills telemetry fields are emitted at both run and per-task levels:
+  `skills_usage_rate`, `total_skills_usage_signals`, `tasks_with_skills_usage`,
+  `skills_used`, and `skills_usage_signals`.
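
The relationship between the per-task and run-level fields can be sketched as follows. The formula is an assumption: the docs do not define `skills_usage_rate`, so this sketch treats it as the percentage of tasks with `skills_used` set, rounded to one decimal place. The function name `aggregate_skills_metrics` is hypothetical.

```python
def aggregate_skills_metrics(results: list) -> dict:
    """Derive run-level skills fields from per-task results[] entries.

    Assumes skills_usage_rate is a percentage of all tasks; the harness may
    define it differently.
    """
    total = len(results)
    tasks_with_usage = sum(1 for r in results if r.get("skills_used"))
    total_signals = sum(r.get("skills_usage_signals", 0) for r in results)
    rate = round(100.0 * tasks_with_usage / total, 1) if total else 0.0
    return {
        "skills_usage_rate": rate,
        "total_skills_usage_signals": total_signals,
        "tasks_with_skills_usage": tasks_with_usage,
    }
```

Under this reading, a run in which 10 of 26 tasks used a skill reports a `skills_usage_rate` of 38.5.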
 
 ### attestation.json Schema
 
@@ -243,6 +252,7 @@ Optimized for leaderboard submissions:
   "model": "gemini-3-flash-preview",
   "reasoning": "high",
   "use_mcp_tools": false,
+  "use_skills": true,
   "disable_mcp": false,
   "timestamp": "2026-01-07T052902",
 
@@ -256,6 +266,9 @@ Optimized for leaderboard submissions:
   "max_possible_score": 33.29,
 
   "integrity_violations": 0,
+  "skills_usage_rate": 38.5,
+  "total_skills_usage_signals": 5,
+  "tasks_with_skills_usage": 10,
 
   "by_language": {
     "go": { "passed": 3, "failed": 3, "total": 6, "pass_rate": 50.0 }
@@ -271,7 +284,7 @@ Optimized for leaderboard submissions:
 Notes:
 - `submission.json` includes run metadata and audit counters:
   `timeout`, `parallel`, `quota_affected_tasks`, and `total_quota_retries`.
-- Configuration booleans (`use_mcp_tools`, `disable_mcp`, `sandbox`, `legacy`)
+- Configuration booleans (`use_mcp_tools`, `use_skills`, `disable_mcp`, `sandbox`, `legacy`)
   are always emitted as explicit booleans.
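
A consumer of `submission.json` could check the "always emitted as explicit booleans" guarantee with a small sketch like the one below. It assumes the file has already been parsed into a dictionary; the helper name `non_boolean_config_fields` is hypothetical, while the field names are taken from the note above.

```python
CONFIG_BOOLEANS = ("use_mcp_tools", "use_skills", "disable_mcp", "sandbox", "legacy")


def non_boolean_config_fields(submission: dict) -> list:
    """List config fields that are missing or not a real JSON boolean."""
    return [
        name
        for name in CONFIG_BOOLEANS
        if not isinstance(submission.get(name), bool)
    ]
```

An empty list means every configuration flag is present and typed as a boolean. A field that is missing, `null`, or encoded as a string such as `"true"` is reported by name.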
 
 ### report.md Format