Commit e6ac581

feat: add comprehensive README for Scenario Runner with usage instructions and YAML format details
1 parent df5142a commit e6ac581

1 file changed

+197
-0
lines changed

tools/scenario/README.md

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
# Scenario Runner

Automated testing and quality tracking for azd-copilot. Define repeatable scenarios as YAML, replay them through `azd copilot`, score the results, detect regressions, and track improvements over time.

## Quick Start

```bash
cd tools/scenario

# 1. Run an existing scenario
mage scenario:run ../../scenarios/dog-breed-lookup.yaml

# 2. Analyze the session that was produced
mage scenario:analyze <session-id> ../../scenarios/dog-breed-lookup.yaml

# 3. View results
mage scenario:history dog-breed-lookup
mage scenario:dashboard
```

## Concepts

| Concept | Description |
|---------|-------------|
| **Scenario** | A YAML file defining prompts to send, scoring limits, regression patterns, and optional Playwright verification steps. |
| **Run** | One execution of a scenario. It produces a session ID, metrics, and a composite score. |
| **Scoring** | Each metric (duration, turns, azd up attempts, Bicep edits, skill invocations, regressions) contributes to a weighted 0–100% composite score. |
| **Improvement Loop** | Automated cycle: run scenario → analyze → send failures to copilot for fixes → rebuild extension → repeat. |

## Scenario YAML Format

```yaml
name: my-scenario
description: "What this scenario tests"
timeout: 35m                          # overall timeout for the entire run

prompts:
  - text: "build a todo app"          # first prompt sent to azd copilot
    success_criteria:                 # optional: what should exist after this prompt
      files_exist:
        - azure.yaml
        - infra/main.bicep
      deployed: true
      endpoint_responds: true

  - text: "add a database"            # subsequent prompts resume the session

scoring:
  max_duration_minutes: 25            # score degrades proportionally above limit
  max_turns: 40                       # agent turn count
  max_azd_up_attempts: 4              # how many times `azd up` was called
  max_bicep_edits: 5                  # edits to main.bicep
  must_delegate: true                 # requires task() delegation to sub-agents
  must_invoke_skills:                 # skills that must be invoked
    - avm-bicep-rules
    - azure-functions
  regressions:                        # regex patterns to watch for in assistant output
    - name: "ACR auth spiral"
      pattern: "ACR.*auth|can't pull|registry.*credential"
      max_occurrences: 2

verification:                         # optional Playwright steps run against the deployed app
  - name: "homepage loads"
    action: navigate
    url: "{{endpoint}}"               # {{endpoint}} is replaced with the discovered URL
    status_code: 200

  - name: "page has content"
    action: check
    selector: "body"
    value: "expected text"
```
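
To make the `regressions` entry concrete, here is a minimal Go sketch of how one pattern could be counted against assistant output and compared with `max_occurrences`. The `RegressionCheck` type and `countOccurrences` function are hypothetical names, not taken from the tool's source, and counting every non-overlapping match across all messages is an assumption about how occurrences are tallied.

```go
package main

import (
	"fmt"
	"regexp"
)

// RegressionCheck mirrors one entry of the `regressions` list (hypothetical type).
type RegressionCheck struct {
	Name           string
	Pattern        string
	MaxOccurrences int
}

// countOccurrences returns the number of non-overlapping matches of the
// check's pattern across all assistant messages. Counting per match rather
// than per message is an assumption.
func countOccurrences(check RegressionCheck, messages []string) (int, error) {
	re, err := regexp.Compile(check.Pattern)
	if err != nil {
		return 0, fmt.Errorf("regression %q: %w", check.Name, err)
	}
	total := 0
	for _, msg := range messages {
		total += len(re.FindAllStringIndex(msg, -1))
	}
	return total, nil
}

func main() {
	check := RegressionCheck{
		Name:           "ACR auth spiral",
		Pattern:        "ACR.*auth|can't pull|registry.*credential",
		MaxOccurrences: 2,
	}
	messages := []string{
		"ACR returned an auth error",
		"still can't pull the image",
		"deployment succeeded",
	}
	n, err := countOccurrences(check, messages)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %d occurrence(s), within limit: %t\n", check.Name, n, n <= check.MaxOccurrences)
}
```

Note that Go's `regexp` package uses RE2 syntax, so patterns relying on backreferences or lookarounds would fail to compile under this sketch.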

### Verification Actions

| Action | Fields | Description |
|--------|--------|-------------|
| `navigate` | `url`, `status_code`, `value` | Navigate to URL, check status code and optional body text |
| `click` | `selector` | Click an element |
| `type` | `selector`, `value` | Type text into an input |
| `wait` | `selector` | Wait for an element to appear |
| `check` | `selector`, `value` | Assert element is visible and optionally contains text |
| `check_not_empty` | `selector` | Assert at least one matching element exists |
| `screenshot` | (none) | Take a full-page screenshot |

The `{{endpoint}}` placeholder is replaced with the deployed app URL, discovered from `azd env get-values`.
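
The substitution can be sketched in Go as follows. `azd env get-values` prints dotenv-style `KEY="value"` lines; which key the runner treats as the endpoint is not stated here, so `SERVICE_WEB_ENDPOINT_URL` below is a purely illustrative assumption, as are the function names.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseEnvValues parses dotenv-style KEY="value" lines, the format printed
// by `azd env get-values`. Lines without "=" are skipped.
func parseEnvValues(output string) map[string]string {
	values := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		key, val, ok := strings.Cut(line, "=")
		if !ok || key == "" {
			continue
		}
		values[key] = strings.Trim(val, `"`)
	}
	return values
}

// expandEndpoint replaces every {{endpoint}} placeholder in s.
func expandEndpoint(s, endpoint string) string {
	return strings.ReplaceAll(s, "{{endpoint}}", endpoint)
}

func main() {
	// Sample output; the key name is hypothetical.
	output := "AZURE_ENV_NAME=\"dev\"\nSERVICE_WEB_ENDPOINT_URL=\"https://example.test\"\n"
	env := parseEnvValues(output)
	fmt.Println(expandEndpoint("{{endpoint}}/health", env["SERVICE_WEB_ENDPOINT_URL"]))
}
```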

## Mage Commands

All commands are in the `scenario` namespace. Run them from `tools/scenario/`:

| Command | Description |
|---------|-------------|
| `mage scenario:extract <session-id>` | Generate a scenario YAML from an existing copilot session log |
| `mage scenario:run <scenario.yaml>` | Execute a scenario (launches `azd copilot` for each prompt) |
| `mage scenario:analyze <session-id> <scenario.yaml>` | Score a session against a scenario and save to DB |
| `mage scenario:history [scenario-name]` | Show recent runs (all scenarios if name omitted) |
| `mage scenario:dashboard` | Generate and serve an interactive HTML dashboard |
| `mage scenario:loop <scenario.yaml>` | Run the full improvement loop (up to 3 iterations by default, stops early on pass) |
| `mage scenario:export` | Export results from SQLite to `results.json` (for git) |
| `mage scenario:import` | Import results from `results.json` into SQLite |

## How It Works

### Running a Scenario

`scenario:run` launches `azd copilot --yolo -p "<prompt>"` for each prompt in a fresh temp directory. Subsequent prompts use `--resume` to continue the same session. The runner watches for:

- **task_complete** events in `~/.copilot/session-state/<id>/events.jsonl`: signal that the prompt is done
- **Stuck loops**: kills the process if the same short line repeats 5+ times
- **Idle timeout**: kills after 3 minutes of no output
- **Per-prompt timeout**: kills after 15 minutes per prompt
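
The stuck-loop check could look roughly like the Go sketch below. It assumes that "repeats" means consecutive identical lines and that "short" means at most 80 bytes; neither detail is specified above, and `stuckDetector` is a hypothetical name.

```go
package main

import "fmt"

// stuckDetector flags output that repeats the same short line many times in a row.
type stuckDetector struct {
	maxLen    int    // lines longer than this are ignored (assumed meaning of "short")
	threshold int    // consecutive repeats that count as stuck
	last      string // most recent short line seen
	count     int    // how many times in a row `last` has appeared
}

// observe records one output line and reports whether the same short line
// has now appeared `threshold` times consecutively.
func (d *stuckDetector) observe(line string) bool {
	if len(line) == 0 || len(line) > d.maxLen {
		d.last, d.count = "", 0
		return false
	}
	if line == d.last {
		d.count++
	} else {
		d.last, d.count = line, 1
	}
	return d.count >= d.threshold
}

func main() {
	d := &stuckDetector{maxLen: 80, threshold: 5}
	for i := 1; i <= 6; i++ {
		if d.observe("Retrying deployment...") {
			fmt.Printf("stuck after %d identical lines; the runner would kill the process here\n", i)
			break
		}
	}
}
```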

### Analyzing & Scoring

`scenario:analyze` reads the session's `events.jsonl` and computes:

| Metric | Weight | Scoring |
|--------|--------|---------|
| Duration | 25 pts | Full points at or below the limit, proportional reduction above it (limit/actual) |
| Turns | 20 pts | Same proportional scoring |
| azd up attempts | 20 pts | Same proportional scoring |
| Bicep edits | 10 pts | Same proportional scoring |
| Delegation | 10 pts | Binary: whether the agent used `task()` |
| Skills (per skill) | 5 pts | Binary: whether each required skill was invoked |
| Regressions (per check) | 5 pts | Binary: whether pattern occurrences stayed within the limit |

A run **passes** only if all metrics are within their limits. The composite score (0–100%) uses proportional credit, so a failing run still earns partial points: a run that takes twice the turn limit gets 50% of the turn points.
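
A minimal Go sketch of this scoring scheme follows. The proportional rule (limit/actual) comes from the table; normalizing the earned points by the total points available to reach a 0–100% figure is an assumption, and all names are hypothetical.

```go
package main

import "fmt"

// component is one scored metric: its weight in points and the fraction earned (0 to 1).
type component struct {
	name   string
	weight float64
	credit float64
}

// proportionalCredit gives full credit at or below the limit and
// limit/actual above it.
func proportionalCredit(limit, actual float64) float64 {
	if actual <= limit {
		return 1
	}
	return limit / actual
}

// binaryCredit gives all or nothing.
func binaryCredit(ok bool) float64 {
	if ok {
		return 1
	}
	return 0
}

// compositeScore returns earned points as a percentage of available points.
// The normalization step is an assumption, not confirmed by the README.
func compositeScore(parts []component) float64 {
	var earned, possible float64
	for _, p := range parts {
		earned += p.weight * p.credit
		possible += p.weight
	}
	if possible == 0 {
		return 0
	}
	return 100 * earned / possible
}

func main() {
	parts := []component{
		{"duration", 25, proportionalCredit(25, 30)}, // 30 min against a 25 min limit
		{"turns", 20, proportionalCredit(40, 80)},    // twice the turn limit: half credit
		{"azd up attempts", 20, proportionalCredit(4, 2)},
		{"bicep edits", 10, proportionalCredit(5, 5)},
		{"delegation", 10, binaryCredit(true)},
		{"skill: avm-bicep-rules", 5, binaryCredit(true)},
		{"regression: ACR auth spiral", 5, binaryCredit(false)},
	}
	fmt.Printf("composite score: %.1f%%\n", compositeScore(parts))
}
```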

### Extracting Scenarios from Sessions

`scenario:extract` creates a scenario YAML from a real copilot session by:

1. Extracting user messages as prompts
2. Computing scoring baselines from actual metrics (with roughly 30–50% headroom)
3. Adding default regression patterns (ACR auth, zone redundancy, npm lockfile)
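
Step 2 can be illustrated with a small Go sketch. The exact headroom applied per metric is not given beyond the 30–50% range, so the 40% used here and the round-up behavior are assumptions, and `withHeadroom` is a hypothetical helper.

```go
package main

import "fmt"

// withHeadroom returns the observed value plus the given percentage of
// headroom, rounded up to a whole unit. Integer math avoids float rounding.
func withHeadroom(actual, percent int) int {
	return actual + (actual*percent+99)/100
}

func main() {
	// Metrics observed in a hypothetical session.
	observedTurns, observedAzdUp := 28, 3
	fmt.Println("max_turns:", withHeadroom(observedTurns, 40))
	fmt.Println("max_azd_up_attempts:", withHeadroom(observedAzdUp, 40))
}
```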

### Improvement Loop

`scenario:loop` automates the full cycle:

1. **Run** the scenario → get a session ID
2. **Analyze** the session → compute score, save to DB
3. If failed: **Generate a fix prompt** describing what went wrong and send it to `azd copilot` targeting `cli/src/internal/assets/`
4. **Rebuild** the extension (`go build && go test && mage build`)
5. **Repeat** (default: 3 iterations, stops early on pass)
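
The control flow of these five steps can be sketched in Go with each step passed in as a function. This is an outline under assumptions, not the actual `RunLoop` signature: the `steps` and `result` types are hypothetical, and fixing and rebuilding after every failed iteration (including the last) simply follows the list literally.

```go
package main

import (
	"errors"
	"fmt"
)

// result is a simplified analysis outcome (hypothetical type).
type result struct {
	Score  float64
	Passed bool
}

// steps bundles the four operations of one iteration.
type steps struct {
	run     func() (sessionID string, err error)
	analyze func(sessionID string) (result, error)
	fix     func(r result) error
	rebuild func() error
}

var errStillFailing = errors.New("scenario still failing after the last iteration")

// runLoop repeats run → analyze → fix → rebuild until a run passes or
// maxIterations is reached.
func runLoop(s steps, maxIterations int) (result, error) {
	var last result
	for i := 1; i <= maxIterations; i++ {
		id, err := s.run()
		if err != nil {
			return last, fmt.Errorf("iteration %d: run: %w", i, err)
		}
		if last, err = s.analyze(id); err != nil {
			return last, fmt.Errorf("iteration %d: analyze: %w", i, err)
		}
		if last.Passed {
			return last, nil // stop early on pass
		}
		if err := s.fix(last); err != nil {
			return last, fmt.Errorf("iteration %d: fix: %w", i, err)
		}
		if err := s.rebuild(); err != nil {
			return last, fmt.Errorf("iteration %d: rebuild: %w", i, err)
		}
	}
	return last, errStillFailing
}

func main() {
	attempt := 0
	s := steps{
		run:     func() (string, error) { attempt++; return fmt.Sprintf("session-%d", attempt), nil },
		analyze: func(string) (result, error) { return result{Score: 60 + 20*float64(attempt), Passed: attempt >= 2}, nil },
		fix:     func(result) error { return nil },
		rebuild: func() error { return nil },
	}
	res, err := runLoop(s, 3)
	fmt.Println(res, err)
}
```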

## Results Storage

| File | Format | Purpose |
|------|--------|---------|
| `scenarios/results.db` | SQLite | Primary storage: gitignored, used by dashboard |
| `scenarios/results.json` | JSON | Portable export: committed to git for sharing |

Use `scenario:export` to save DB → JSON and `scenario:import` to load JSON → DB (skips duplicates by session ID).
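
The duplicate-skipping behavior of the import can be illustrated in memory with the Go sketch below. The `Run` struct and `mergeRuns` function are hypothetical; the real import works against SQLite and the actual `results.json` layout is not shown here.

```go
package main

import "fmt"

// Run is a simplified stand-in for one stored result row (hypothetical fields).
type Run struct {
	SessionID string
	Scenario  string
	Score     float64
}

// mergeRuns appends imported runs to existing ones, skipping any run whose
// session ID is already present. It returns the merged list and the skip count.
func mergeRuns(existing, imported []Run) (merged []Run, skipped int) {
	seen := make(map[string]bool, len(existing))
	merged = append(merged, existing...)
	for _, r := range existing {
		seen[r.SessionID] = true
	}
	for _, r := range imported {
		if seen[r.SessionID] {
			skipped++
			continue
		}
		seen[r.SessionID] = true
		merged = append(merged, r)
	}
	return merged, skipped
}

func main() {
	inDB := []Run{{SessionID: "a1", Scenario: "dog-breed-lookup", Score: 82}}
	fromJSON := []Run{
		{SessionID: "a1", Scenario: "dog-breed-lookup", Score: 82},
		{SessionID: "b2", Scenario: "functions-todo-api", Score: 91},
	}
	merged, skipped := mergeRuns(inDB, fromJSON)
	fmt.Printf("%d runs after import, %d duplicate(s) skipped\n", len(merged), skipped)
}
```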

### Database Schema

- **`runs`**: one row per scenario execution (score, metrics, pass/fail, git commit)
- **`run_skills`**: which required skills were invoked per run
- **`run_regressions`**: regression pattern match counts per run
- **`run_verification`**: Playwright verification step results per run

## Dashboard

The dashboard is a self-contained HTML file that loads `results.db` client-side via [sql.js](https://github.com/sql-js/sql.js) (WASM SQLite). It shows:

- Summary cards (total runs, scenarios, latest score)
- Per-scenario tabs with charts (score, duration, turns, azd up over time)
- Run history table with expandable skill/regression details
- Comparison view (first run vs latest)

The dashboard reads the DB live, so no regeneration is needed after new runs. It is served at `http://localhost:8086` via `mage scenario:dashboard`.

## File Layout

```
tools/scenario/
├── scenario.go       # YAML types, Load/Save, session event parsing
├── runner.go         # RunScenario: prompt execution, stuck detection, event watching
├── analyze.go        # Extract, Analyze, scoring, FormatReport
├── loop.go           # RunLoop: the improvement cycle
├── verify.go         # Playwright verification step execution
├── db.go             # SQLite schema, InsertRun, ListRuns
├── db_export.go      # ExportJSON / ImportJSON
├── dashboard.go      # HTML dashboard generation
├── magefile.go       # Mage commands (extract, run, analyze, history, dashboard, loop)
├── scenario_test.go  # Unit tests
└── go.mod

scenarios/
├── dog-breed-lookup.yaml     # Example: SWA + Container App
├── functions-todo-api.yaml   # Example: Azure Functions + Cosmos DB
├── results.db                # SQLite results (gitignored)
├── results.json              # JSON export (committed)
└── dashboard.html            # Generated dashboard (gitignored)
```
