Commit bc86c5d

Added Skill to generate benchmarks
1 parent 6e10917 commit bc86c5d

1 file changed

+204
-0
lines changed

skills/offload-benchmark/SKILL.md

Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
1+
---
2+
name: offload-benchmark
3+
description: "Run local vs Offload benchmarks for mng and sculptor, then update the Benchmarks section of the Offload README."
4+
---
5+
6+
# Offload Benchmark
7+
8+
Run local and Offload test suites for **mng** and **sculptor**, collect timing data, and update the `## Benchmarks` section of the Offload README with fresh numbers.
9+
10+
## Prerequisites
11+
12+
1. Offload must be installed: `cargo install offload@0.5.0` (or built from source in the offload repo).
13+
2. Modal must be authenticated: `modal token new` (if credentials are expired).
14+
3. Both repos must exist locally:
15+
- **mng**: `~/imbue/mng` (or wherever the monorepo is checked out)
16+
- **sculptor**: `~/imbue/sculptor`
17+
4. Run `just install` from the offload repo root to ensure the binary is up to date.
18+
19+
Verify all four before proceeding. If any prerequisite is missing, stop and tell the user.
20+
21+
## Step 1: Read Current Test Commands
22+
23+
**Do not hardcode commands.** Read the justfile in each repo at invocation time, since commands change:
24+
25+
- **mng**: Read `~/imbue/mng/justfile` and locate:
26+
- `test-integration` -- the local baseline (unit + integration tests via pytest with xdist)
27+
- `test-offload` -- the Offload run on Modal
28+
- **sculptor**: Read `~/imbue/sculptor/justfile` and locate:
29+
- `test-integration` -- the local baseline (Playwright integration tests via pytest with xdist)
30+
- `test-integration-offload` -- the Offload run on Modal
31+
32+
Record the exact commands from each justfile. If any target is missing or has changed significantly, stop and tell the user.
33+
34+
Also read each justfile to determine:
35+
- The default xdist worker count for the local baseline (look for `-n auto --maxprocesses N` or `-n N`)
36+
- The higher xdist worker count to use for a second local run (use double the default, capped at 8)
37+
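The worker-count lookup above can be sketched in Python. This is a minimal illustration, not part of either justfile: `default_workers` and `high_workers` are hypothetical helper names, and the regexes assume the count appears literally as `--maxprocesses N` or `-n N` (a recipe that interpolates a variable would need manual reading instead).

```python
import re
from typing import Optional

# Patterns named in Step 1: `-n auto --maxprocesses N` or a plain `-n N`.
_MAXPROC = re.compile(r"--maxprocesses[ =](\d+)")
_DASH_N = re.compile(r"(?<![\w-])-n[ =]?(\d+)")


def default_workers(recipe_text: str) -> Optional[int]:
    """Return the literal xdist worker count in a recipe, or None if absent."""
    match = _MAXPROC.search(recipe_text) or _DASH_N.search(recipe_text)
    return int(match.group(1)) if match else None


def high_workers(default: int, cap: int = 8) -> int:
    """Double the default worker count, capped at `cap` (8 per Step 1)."""
    return min(default * 2, cap)
```

If `default_workers` returns `None`, the count is not hardcoded in the recipe and the justfile should be read by hand.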
38+
## Step 2: Run Benchmarks
39+
40+
**CRITICAL RULES:**
41+
- NEVER run any two benchmark runs concurrently. Each run must have exclusive machine resources. Run them strictly sequentially.
42+
- Close all other heavy processes (browsers, IDEs, Docker containers) before starting, or at minimum warn the user to do so.
43+
- Use `time` (a bash keyword) to measure wall-clock time for each run. Prefix each command with `time` (or wrap a compound command in `time ( ... )`) and record the `real` value.
44+
- If any tests fail in a run, flag the run as **INVALID** but still record the timing. Report the failure to the user at the end.
45+
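For a scripted alternative to typing `time` by hand, the rules above could be followed with a sketch like the one below. `RunResult` and `timed_run` are hypothetical names, and the sketch assumes that a non-zero exit status from the `just` recipe means at least one test failed.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    label: str
    seconds: float
    valid: bool  # False when the command exited non-zero (assumed: tests failed)


def timed_run(label: str, argv: List[str], cwd: str = ".") -> RunResult:
    """Run one command to completion and measure its wall-clock time."""
    start = time.perf_counter()
    # subprocess.run blocks until the process exits, so calling timed_run
    # repeatedly keeps the runs strictly sequential.
    completed = subprocess.run(argv, cwd=cwd)
    elapsed = time.perf_counter() - start
    # An invalid run still keeps its timing, as the rules require.
    return RunResult(label, round(elapsed, 1), completed.returncode == 0)
```

A call such as `timed_run("mng baseline", ["just", "test-integration"], cwd=...)` would then cover one run, with the recipe name taken from the justfile at invocation time.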
46+
### 2a: mng benchmarks
47+
48+
Run these three benchmarks sequentially from the mng repo root (`~/imbue/mng`):
49+
50+
1. **Local baseline** (default xdist workers):
51+
```
52+
time just test-integration
53+
```
54+
Record the wall-clock time.
55+
56+
2. **Local high-xdist** (override to higher worker count):
57+
Read the default `-n` value from the justfile. Run with double that value (capped at 8). For example, if the default is `-n 4`:
58+
```
59+
time PYTEST_NUMPROCESSES=8 just test-integration
60+
```
61+
If the justfile uses `PYTEST_NUMPROCESSES` or a similar env var for xdist workers, use that. Otherwise, pass the appropriate flag or env var. Read the justfile carefully to determine the correct override mechanism.
62+
63+
3. **Offload (warm cache)**: Run the offload command TWICE. Discard the first run's timing (it warms the image cache). Record only the second run's timing.
64+
```
65+
just test-offload # warm-up run (discard timing)
66+
time just test-offload # benchmark run (record timing)
67+
```
68+
69+
Record the number of tests collected (visible in pytest output) and the xdist `-n` values used.
70+
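Step 3 needs each timing in seconds, while bash prints `real` as minutes plus seconds. A small conversion sketch follows; `real_seconds` is a hypothetical helper, and it assumes bash's default `time` output format (for example `real	1m23.456s`), not a customized `TIMEFORMAT`.

```python
import re

# Matches bash's default `real` line, e.g. "real\t1m23.456s".
_REAL = re.compile(r"^real\s+(?:(\d+)m)?([\d.]+)s\s*$", re.MULTILINE)


def real_seconds(time_output: str) -> float:
    """Convert the `real` line of bash `time` output to seconds."""
    match = _REAL.search(time_output)
    if match is None:
        raise ValueError("no 'real' line found in time output")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))
```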
71+
### 2b: sculptor benchmarks
72+
73+
Run these three benchmarks sequentially from the sculptor repo root (`~/imbue/sculptor`):
74+
75+
1. **Local baseline** (default xdist workers):
76+
```
77+
time just test-integration
78+
```
79+
Record the wall-clock time.
80+
81+
2. **Local high-xdist** (override to higher worker count):
82+
Read the default `--maxprocesses` value from the justfile (currently 3). Run with double that value (capped at 8), as determined in Step 1. For example, with a default of 3:
83+
```
84+
time XDIST_WORKERS=6 just test-integration
85+
```
86+
Use the override mechanism from the justfile (`XDIST_WORKERS` env var).
87+
88+
3. **Offload (warm cache)**: Run the offload command TWICE. Discard the first run's timing. Record only the second.
89+
```
90+
just test-integration-offload # warm-up run (discard timing)
91+
time just test-integration-offload # benchmark run (record timing)
92+
```
93+
94+
Record the number of tests collected and the xdist `-n` values used.
95+
96+
## Step 3: Compute Metrics
97+
98+
For each project, compute:
99+
100+
| Metric | Formula |
101+
|--------|---------|
102+
| Time (s) | Wall-clock seconds from `time`, rounded to 1 decimal |
103+
| Time (%) | `(run_time / baseline_time) * 100`, rounded to 1 decimal |
104+
| Speedup | `baseline_time / run_time`, rounded to 2 decimals |
105+
| Bar width (px) | `round((run_time / baseline_time) * 150)` -- baseline is always 150px |
106+
107+
The baseline is always the first run (default xdist) for each project.
108+
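The formulas in the table can be checked with a short sketch. `Metrics` and `compute_metrics` are hypothetical names; the rounding follows the table, with the caveat that Python's `round` rounds exact halves to the nearest even number.

```python
from dataclasses import dataclass

BASELINE_BAR_PX = 150  # the baseline bar is always 150px wide


@dataclass
class Metrics:
    time_s: float
    time_pct: float
    speedup: float
    bar_px: int


def compute_metrics(run_time: float, baseline_time: float) -> Metrics:
    """Apply the Step 3 formulas to one run against its project baseline."""
    ratio = run_time / baseline_time
    return Metrics(
        time_s=round(run_time, 1),
        time_pct=round(ratio * 100, 1),
        speedup=round(baseline_time / run_time, 2),
        bar_px=round(ratio * BASELINE_BAR_PX),
    )
```

For instance, a 40.0 s run against a 120.0 s baseline gives 33.3%, a 3.0x speedup, and a 50px bar.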
109+
## Step 4: Generate the Updated Benchmarks Section
110+
111+
Replace the entire `## Benchmarks` section in the Offload README (from `## Benchmarks` up to but not including the next level-2 `## ` heading; the `###` subsections are part of the section) with the template below, filled in with measured values.
112+
113+
Use the correct labels for each project's test suite:
114+
- **sculptor**: "Integration Tests (Playwright)"
115+
- **mng**: "Unit + Integration Tests"
116+
117+
The bar chart images reference `docs/bar-local.svg` (gray, `#444`) and `docs/bar-offload.svg` (green, `#22a355`). These are 1x1 SVG rectangles scaled via the `width` attribute.
118+
119+
### Template
120+
121+
```markdown
122+
## Benchmarks
123+
124+
Speedups measured on Imbue projects using Offload with the Modal provider. All local baselines were run on a {MACHINE_DESCRIPTION}.
125+
126+
### Sculptor Integration Tests (Playwright)
127+
128+
| Run Kind | Time (s) | Time (%) | Speedup |
129+
|----------|----------|----------|---------|
130+
| pytest with xdist, n={SCULPTOR_BASELINE_N} (baseline) | <img src="docs/bar-local.svg" width="150" height="4"> {SCULPTOR_BASELINE_TIME} | 100.0% | 1.00x |
131+
| pytest with xdist, n={SCULPTOR_HIGH_N} | <img src="docs/bar-local.svg" width="{SCULPTOR_HIGH_BAR_WIDTH}" height="4"> {SCULPTOR_HIGH_TIME} | {SCULPTOR_HIGH_PCT}% | {SCULPTOR_HIGH_SPEEDUP}x |
132+
| Offload (Modal, max {SCULPTOR_MAX_PARALLEL}) | <img src="docs/bar-offload.svg" width="{SCULPTOR_OFFLOAD_BAR_WIDTH}" height="4"> {SCULPTOR_OFFLOAD_TIME} | {SCULPTOR_OFFLOAD_PCT}% | **{SCULPTOR_OFFLOAD_SPEEDUP}x** |
133+
134+
<details>
135+
<summary><strong>Notes</strong></summary>
136+
137+
{SCULPTOR_TEST_COUNT} Playwright integration tests (browser-based, each launching a full Sculptor instance).
138+
Individual tests are heavyweight (Chromium + backend server per worker), so the default xdist cap is n={SCULPTOR_BASELINE_N}.
139+
Offload bypasses xdist entirely, fanning out across up to {SCULPTOR_MAX_PARALLEL} isolated Modal sandboxes -- each running a single test against its own Sculptor instance. The high per-test cost makes Offload's per-sandbox overhead negligible, yielding a {SCULPTOR_OFFLOAD_SPEEDUP}x speedup.
140+
141+
</details>
142+
143+
### Mng Unit + Integration Tests
144+
145+
| Run Kind | Time (s) | Time (%) | Speedup |
146+
|----------|----------|----------|---------|
147+
| pytest with xdist, n={MNG_BASELINE_N} (baseline) | <img src="docs/bar-local.svg" width="150" height="4"> {MNG_BASELINE_TIME} | 100.0% | 1.00x |
148+
| pytest with xdist, n={MNG_HIGH_N} | <img src="docs/bar-local.svg" width="{MNG_HIGH_BAR_WIDTH}" height="4"> {MNG_HIGH_TIME} | {MNG_HIGH_PCT}% | {MNG_HIGH_SPEEDUP}x |
149+
| Offload (Modal, max {MNG_MAX_PARALLEL}) | <img src="docs/bar-offload.svg" width="{MNG_OFFLOAD_BAR_WIDTH}" height="4"> {MNG_OFFLOAD_TIME} | {MNG_OFFLOAD_PCT}% | **{MNG_OFFLOAD_SPEEDUP}x** |
150+
151+
<details>
152+
<summary><strong>Notes</strong></summary>
153+
154+
{MNG_TEST_COUNT} tests collected (unit + integration, excluding acceptance and release).
155+
Individual tests are lightweight and fast-running, so the default xdist cap is n={MNG_BASELINE_N}.
156+
Offload bypasses xdist entirely, fanning out across up to {MNG_MAX_PARALLEL} isolated Modal sandboxes. The low per-test cost makes Offload's per-sandbox overhead proportionally larger, yielding a more modest {MNG_OFFLOAD_SPEEDUP}x speedup vs Sculptor's {SCULPTOR_OFFLOAD_SPEEDUP}x.
157+
158+
</details>
159+
```
160+
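Filling the template can be done mechanically. The sketch below is one possible approach, assuming every placeholder has the form `{UPPER_SNAKE_CASE}`; `fill_template` is a hypothetical helper that refuses to emit a section with unfilled placeholders.

```python
import re
from typing import Mapping

# Placeholders look like {MNG_HIGH_PCT}: upper-case letters, digits, underscores.
_PLACEHOLDER = re.compile(r"\{([A-Z][A-Z0-9_]*)\}")


def fill_template(template: str, values: Mapping[str, object]) -> str:
    """Substitute `{NAME}` placeholders; raise if any placeholder has no value."""
    missing = sorted({n for n in _PLACEHOLDER.findall(template) if n not in values})
    if missing:
        raise KeyError("no value for placeholders: " + ", ".join(missing))
    return _PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)
```

Raising on a missing value, instead of leaving the placeholder in place, keeps a half-filled table from reaching the README.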
161+
### Placeholder Reference
162+
163+
| Placeholder | Source |
164+
|-------------|--------|
165+
| `{MACHINE_DESCRIPTION}` | Ask the user, or read from the existing README intro paragraph |
166+
| `{*_BASELINE_N}` | Default xdist `-n` value from the justfile |
167+
| `{*_HIGH_N}` | The higher xdist count used in run 2 |
168+
| `{*_BASELINE_TIME}` | Wall-clock seconds from run 1 |
169+
| `{*_HIGH_TIME}` | Wall-clock seconds from run 2 |
170+
| `{*_OFFLOAD_TIME}` | Wall-clock seconds from run 3 (second invocation only) |
171+
| `{*_HIGH_PCT}` | `(high_time / baseline_time) * 100` |
172+
| `{*_OFFLOAD_PCT}` | `(offload_time / baseline_time) * 100` |
173+
| `{*_HIGH_SPEEDUP}` | `baseline_time / high_time` |
174+
| `{*_OFFLOAD_SPEEDUP}` | `baseline_time / offload_time` |
175+
| `{*_HIGH_BAR_WIDTH}` | `round((high_time / baseline_time) * 150)` |
176+
| `{*_OFFLOAD_BAR_WIDTH}` | `round((offload_time / baseline_time) * 150)` |
177+
| `{*_MAX_PARALLEL}` | Read from the relevant `offload*.toml` config (`max_parallel` field) |
178+
| `{*_TEST_COUNT}` | Number of tests collected, from pytest output |
179+
180+
## Step 5: Update the README
181+
182+
1. Read `offload/README.md`.
183+
2. Find the `## Benchmarks` section (starts at `## Benchmarks`, ends just before the next `## ` heading).
184+
3. Replace that entire section with the generated content from Step 4.
185+
4. Write the file back.
186+
5. Verify the replacement by reading the file again and confirming the new numbers appear.
187+
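The section replacement in steps 2 and 3 can be sketched as a single regex substitution. `replace_benchmarks` is a hypothetical helper; it assumes the README contains exactly one `## Benchmarks` heading and that no line inside the section (for example, inside a fenced code block) starts with `## `.

```python
import re

# From the `## Benchmarks` heading up to (not including) the next level-2
# heading, or to end of file if Benchmarks is the last section.
_SECTION = re.compile(
    r"^## Benchmarks[ \t]*\n.*?(?=^## |\Z)", re.MULTILINE | re.DOTALL
)


def replace_benchmarks(readme: str, new_section: str) -> str:
    """Swap the Benchmarks section for `new_section`, leaving the rest intact."""
    if not new_section.endswith("\n"):
        new_section += "\n"
    # A callable replacement avoids backslash interpretation in new_section.
    updated, count = _SECTION.subn(lambda _m: new_section + "\n", readme, count=1)
    if count != 1:
        raise ValueError("README has no '## Benchmarks' section")
    return updated
```

Because the lookahead requires `## ` followed by a space, `###` subsections stay inside the replaced span.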
188+
## Step 6: Report Results
189+
190+
Summarize to the user:
191+
192+
1. A table of all six runs with their wall-clock times and pass/fail status.
193+
2. Any runs flagged as INVALID (test failures).
194+
3. The computed speedups.
195+
4. Confirmation that the README was updated (or not, if any run was INVALID -- in that case, ask the user whether to update anyway).
196+
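The run table and the update decision can be sketched as follows. `summarize` is a hypothetical helper; each run is given as a `(label, seconds, passed)` tuple, and the second return value says whether the README can be updated without asking the user first.

```python
from typing import List, Tuple

Run = Tuple[str, float, bool]  # (label, wall-clock seconds, all tests passed)


def summarize(runs: List[Run]) -> Tuple[str, bool]:
    """Return a markdown table of the runs and whether every run is valid."""
    lines = ["| Run | Time (s) | Status |", "|-----|----------|--------|"]
    for label, seconds, passed in runs:
        status = "pass" if passed else "INVALID"
        lines.append(f"| {label} | {seconds:.1f} | {status} |")
    return "\n".join(lines), all(passed for _, _, passed in runs)
```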
197+
## Rules
198+
199+
- **Never run benchmarks concurrently.** Sequential execution only.
200+
- **Run Offload twice, take the second time.** The first run warms the Modal image cache.
201+
- **If any tests fail, report results but mark as INVALID.** Ask the user before updating the README with invalid data.
202+
- **Maintain parallel phrasing in Notes sections.** Both projects' Notes blocks should follow the same structure: test count and type, why the default xdist cap is what it is, how Offload bypasses xdist, and why the speedup is high or low.
203+
- **Read justfiles at invocation time.** Commands may have changed since this skill was written.
204+
- **Preserve the rest of the README.** Only replace the `## Benchmarks` section.
