| 1 | +--- |
| 2 | +name: offload-benchmark |
| 3 | +description: "Run local vs Offload benchmarks for mng and sculptor, then update the Benchmarks section of the Offload README." |
| 4 | +--- |
| 5 | + |
| 6 | +# Offload Benchmark |
| 7 | + |
| 8 | +Run local and Offload test suites for **mng** and **sculptor**, collect timing data, and update the `## Benchmarks` section of the Offload README with fresh numbers. |
| 9 | + |
| 10 | +## Prerequisites |
| 11 | + |
| 12 | +1. Offload must be installed: `cargo install offload@0.5.0` (or built from source in the offload repo). |
| 13 | +2. Modal must be authenticated: `modal token new` (if credentials are expired). |
| 14 | +3. Both repos must exist locally: |
| 15 | + - **mng**: `~/imbue/mng` (or wherever the monorepo is checked out) |
| 16 | + - **sculptor**: `~/imbue/sculptor` |
| 17 | +4. Run `just install` from the offload repo root to ensure the binary is up to date. |
| 18 | + |
| 19 | +Verify all four before proceeding. If any prerequisite is missing, stop and tell the user. |
| 20 | + |
| 21 | +## Step 1: Read Current Test Commands |
| 22 | + |
| 23 | +**Do not hardcode commands.** Read the justfile in each repo at invocation time, since commands change: |
| 24 | + |
| 25 | +- **mng**: Read `~/imbue/mng/justfile` and locate: |
| 26 | + - `test-integration` -- the local baseline (unit + integration tests via pytest with xdist) |
| 27 | + - `test-offload` -- the Offload run on Modal |
| 28 | +- **sculptor**: Read `~/imbue/sculptor/justfile` and locate: |
| 29 | + - `test-integration` -- the local baseline (Playwright integration tests via pytest with xdist) |
| 30 | + - `test-integration-offload` -- the Offload run on Modal |
| 31 | + |
| 32 | +Record the exact commands from each justfile. If any target is missing or has changed significantly, stop and tell the user. |
| 33 | + |
| 34 | +Also read each justfile to determine: |
| 35 | +- The default xdist worker count for the local baseline (look for `-n auto --maxprocesses N` or `-n N`) |
| 36 | +- The higher xdist worker count to use for a second local run (use double the default, capped at 8) |
| 37 | + |
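The worker-count lookup described above can be sketched as follows. This is a minimal illustration, not part of the skill: `default_workers` and `high_workers` are hypothetical helper names, and the sketch assumes the recipe text contains a literal `--maxprocesses N` or `-n N` (a justfile that passes the count through a variable would need manual inspection).

```python
import re

# Cap for the second local run, taken from the rule "double the default, capped at 8".
MAX_HIGH_WORKERS = 8


def default_workers(recipe: str) -> int:
    """Extract the default xdist worker count from a justfile recipe line.

    Prefers an explicit `--maxprocesses N` (used with `-n auto`),
    then falls back to a literal `-n N`.
    """
    match = re.search(r"--maxprocesses[ =](\d+)", recipe)
    if match is None:
        # `-n` must start a token, so flags like `--min-n 3` are not picked up.
        match = re.search(r"(?<!\S)-n\s+(\d+)\b", recipe)
    if match is None:
        raise ValueError("no literal xdist worker count found in recipe")
    return int(match.group(1))


def high_workers(default: int) -> int:
    """Worker count for the second local run: double the default, capped at 8."""
    return min(default * 2, MAX_HIGH_WORKERS)
```

If the lookup raises, that is the signal to stop and ask the user rather than guess a value.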
| 38 | +## Step 2: Run Benchmarks |
| 39 | + |
| 40 | +**CRITICAL RULES:** |
| 41 | +- NEVER run any two benchmark runs concurrently. Each run must have exclusive machine resources. Run them strictly sequentially. |
| 42 | +- Close all other heavy processes (browsers, IDEs, Docker containers) before starting, or at minimum warn the user to do so. |
| 43 | +- Use `time` (the bash keyword) to measure wall-clock time for each run. Prefix each command with `time` (or wrap a compound command in `time ( ... )`) and record the `real` value. |
| 44 | +- If any tests fail in a run, flag the run as **INVALID** but still record the timing. Report the failure to the user at the end. |
| 45 | + |
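To turn the `real` line printed by `time` into seconds, a small parser along these lines may help. It is a sketch that assumes bash's default output format (`real<TAB>1m23.456s`); a customized `TIMEFORMAT` or a different shell would need a different pattern. `parse_real_seconds` is a hypothetical helper name.

```python
import re

# Matches bash's default `time` line, e.g. "real\t1m23.456s".
_REAL_LINE = re.compile(r"^real\s+(\d+)m(\d+(?:\.\d+)?)s\s*$", re.MULTILINE)


def parse_real_seconds(time_output: str) -> float:
    """Return the wall-clock seconds from the `real` line of bash `time` output."""
    match = _REAL_LINE.search(time_output)
    if match is None:
        raise ValueError("no 'real' line found in time output")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)
```

Only the `real` line matters for these benchmarks; `user` and `sys` measure CPU time and are ignored.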
| 46 | +### 2a: mng benchmarks |
| 47 | + |
| 48 | +Run these three commands sequentially from the mng repo root (`~/imbue/mng`): |
| 49 | + |
| 50 | +1. **Local baseline** (default xdist workers): |
| 51 | + ``` |
| 52 | + time just test-integration |
| 53 | + ``` |
| 54 | + Record the wall-clock time. |
| 55 | + |
| 56 | +2. **Local high-xdist** (override to higher worker count): |
| 57 | + Read the default `-n` value from the justfile. Run with double that value (capped at 8). For example, if the default is `-n 4`: |
| 58 | + ``` |
| 59 | + time PYTEST_NUMPROCESSES=8 just test-integration |
| 60 | + ``` |
| 61 | + If the justfile uses `PYTEST_NUMPROCESSES` or a similar env var for xdist workers, use that. Otherwise, pass the appropriate flag or env var. Read the justfile carefully to determine the correct override mechanism. |
| 62 | + |
| 63 | +3. **Offload (warm cache)**: Run the offload command TWICE. Discard the first run's timing (it warms the image cache). Record only the second run's timing. |
| 64 | + ``` |
| 65 | + just test-offload # warm-up run (discard timing) |
| 66 | + time just test-offload # benchmark run (record timing) |
| 67 | + ``` |
| 68 | + |
| 69 | +Record the number of tests collected (visible in pytest output) and the xdist `-n` values used. |
| 70 | + |
| 71 | +### 2b: sculptor benchmarks |
| 72 | + |
| 73 | +Run these three commands sequentially from the sculptor repo root (`~/imbue/sculptor`): |
| 74 | + |
| 75 | +1. **Local baseline** (default xdist workers): |
| 76 | + ``` |
| 77 | + time just test-integration |
| 78 | + ``` |
| 79 | + Record the wall-clock time. |
| 80 | + |
| 81 | +2. **Local high-xdist** (override to higher worker count): |
| 82 | + Read the default `--maxprocesses` value from the justfile (currently 3). Run with double that value (capped at 8). For example, if the default is 3: |
| 83 | + ``` |
| 84 | + time XDIST_WORKERS=6 just test-integration |
| 85 | + ``` |
| 86 | + Use the override mechanism from the justfile (`XDIST_WORKERS` env var). |
| 87 | + |
| 88 | +3. **Offload (warm cache)**: Run the offload command TWICE. Discard the first run's timing. Record only the second. |
| 89 | + ``` |
| 90 | + just test-integration-offload # warm-up run (discard timing) |
| 91 | + time just test-integration-offload # benchmark run (record timing) |
| 92 | + ``` |
| 93 | + |
| 94 | +Record the number of tests collected and the xdist `-n` values used. |
| 95 | + |
| 96 | +## Step 3: Compute Metrics |
| 97 | + |
| 98 | +For each project, compute: |
| 99 | + |
| 100 | +| Metric | Formula | |
| 101 | +|--------|---------| |
| 102 | +| Time (s) | Wall-clock seconds from `time`, rounded to 1 decimal | |
| 103 | +| Time (%) | `(run_time / baseline_time) * 100`, rounded to 1 decimal | |
| 104 | +| Speedup | `baseline_time / run_time`, rounded to 2 decimals | |
| 105 | +| Bar width (px) | `round((run_time / baseline_time) * 150)` -- baseline is always 150px | |
| 106 | + |
| 107 | +The baseline is always the first run (default xdist) for each project. |
| 108 | + |
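The formulas in the table can be written out as a short function. This is a sketch: `RunMetrics` and `compute_metrics` are hypothetical names, and the inputs are wall-clock seconds for one run and for that project's baseline.

```python
from dataclasses import dataclass

# The baseline bar is always 150px wide; other bars scale relative to it.
BASELINE_BAR_PX = 150


@dataclass(frozen=True)
class RunMetrics:
    time_s: float       # wall-clock seconds, 1 decimal
    time_pct: float     # percentage of baseline time, 1 decimal
    speedup: float      # baseline_time / run_time, 2 decimals
    bar_width_px: int   # bar width relative to the 150px baseline


def compute_metrics(run_time: float, baseline_time: float) -> RunMetrics:
    """Apply the Step 3 formulas to one run."""
    if run_time <= 0 or baseline_time <= 0:
        raise ValueError("times must be positive")
    ratio = run_time / baseline_time
    return RunMetrics(
        time_s=round(run_time, 1),
        time_pct=round(ratio * 100, 1),
        speedup=round(baseline_time / run_time, 2),
        # Note: Python's round() sends exact .5 ties to the nearest even integer.
        bar_width_px=round(ratio * BASELINE_BAR_PX),
    )
```

Passing the baseline time as both arguments yields the fixed baseline row: 100.0%, 1.0x, 150px.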
| 109 | +## Step 4: Generate the Updated Benchmarks Section |
| 110 | + |
| 111 | +Replace the entire `## Benchmarks` section in the Offload README (from `## Benchmarks` up to but not including the next level-2 `## ` heading; `###` subheadings belong to the section) with the template below, filled in with measured values. |
| 112 | + |
| 113 | +Use the correct labels for each project's test suite: |
| 114 | +- **sculptor**: "Integration Tests (Playwright)" |
| 115 | +- **mng**: "Unit + Integration Tests" |
| 116 | + |
| 117 | +The bar chart images reference `docs/bar-local.svg` (gray, `#444`) and `docs/bar-offload.svg` (green, `#22a355`). These are 1x1 SVG rectangles scaled via the `width` attribute. |
| 118 | + |
| 119 | +### Template |
| 120 | + |
| 121 | +```markdown |
| 122 | +## Benchmarks |
| 123 | + |
| 124 | +Speedups measured on Imbue projects using Offload with the Modal provider. All local baselines were run on a {MACHINE_DESCRIPTION}. |
| 125 | + |
| 126 | +### Sculptor Integration Tests (Playwright) |
| 127 | + |
| 128 | +| Run Kind | Time (s) | Time (%) | Speedup | |
| 129 | +|----------|----------|----------|---------| |
| 130 | +| pytest with xdist, n={SCULPTOR_BASELINE_N} (baseline) | <img src="docs/bar-local.svg" width="150" height="4"> {SCULPTOR_BASELINE_TIME} | 100.0% | 1.00x | |
| 131 | +| pytest with xdist, n={SCULPTOR_HIGH_N} | <img src="docs/bar-local.svg" width="{SCULPTOR_HIGH_BAR_WIDTH}" height="4"> {SCULPTOR_HIGH_TIME} | {SCULPTOR_HIGH_PCT}% | {SCULPTOR_HIGH_SPEEDUP}x | |
| 132 | +| Offload (Modal, max {SCULPTOR_MAX_PARALLEL}) | <img src="docs/bar-offload.svg" width="{SCULPTOR_OFFLOAD_BAR_WIDTH}" height="4"> {SCULPTOR_OFFLOAD_TIME} | {SCULPTOR_OFFLOAD_PCT}% | **{SCULPTOR_OFFLOAD_SPEEDUP}x** | |
| 133 | + |
| 134 | +<details> |
| 135 | +<summary><strong>Notes</strong></summary> |
| 136 | + |
| 137 | +{SCULPTOR_TEST_COUNT} Playwright integration tests (browser-based, each launching a full Sculptor instance). |
| 138 | +Individual tests are heavyweight (Chromium + backend server per worker), so the default xdist cap is n={SCULPTOR_BASELINE_N}. |
| 139 | +Offload bypasses xdist entirely, fanning out across up to {SCULPTOR_MAX_PARALLEL} isolated Modal sandboxes -- each running a single test against its own Sculptor instance. The high per-test cost makes Offload's per-sandbox overhead negligible, yielding a {SCULPTOR_OFFLOAD_SPEEDUP}x speedup. |
| 140 | + |
| 141 | +</details> |
| 142 | + |
| 143 | +### Mng Unit + Integration Tests |
| 144 | + |
| 145 | +| Run Kind | Time (s) | Time (%) | Speedup | |
| 146 | +|----------|----------|----------|---------| |
| 147 | +| pytest with xdist, n={MNG_BASELINE_N} (baseline) | <img src="docs/bar-local.svg" width="150" height="4"> {MNG_BASELINE_TIME} | 100.0% | 1.00x | |
| 148 | +| pytest with xdist, n={MNG_HIGH_N} | <img src="docs/bar-local.svg" width="{MNG_HIGH_BAR_WIDTH}" height="4"> {MNG_HIGH_TIME} | {MNG_HIGH_PCT}% | {MNG_HIGH_SPEEDUP}x | |
| 149 | +| Offload (Modal, max {MNG_MAX_PARALLEL}) | <img src="docs/bar-offload.svg" width="{MNG_OFFLOAD_BAR_WIDTH}" height="4"> {MNG_OFFLOAD_TIME} | {MNG_OFFLOAD_PCT}% | **{MNG_OFFLOAD_SPEEDUP}x** | |
| 150 | + |
| 151 | +<details> |
| 152 | +<summary><strong>Notes</strong></summary> |
| 153 | + |
| 154 | +{MNG_TEST_COUNT} tests collected (unit + integration, excluding acceptance and release). |
| 155 | +Individual tests are lightweight and fast-running, so the default xdist cap is n={MNG_BASELINE_N}. |
| 156 | +Offload bypasses xdist entirely, fanning out across up to {MNG_MAX_PARALLEL} isolated Modal sandboxes. The low per-test cost makes Offload's per-sandbox overhead proportionally larger, yielding a more modest {MNG_OFFLOAD_SPEEDUP}x speedup vs Sculptor's {SCULPTOR_OFFLOAD_SPEEDUP}x. |
| 157 | + |
| 158 | +</details> |
| 159 | +``` |
| 160 | + |
| 161 | +### Placeholder Reference |
| 162 | + |
| 163 | +| Placeholder | Source | |
| 164 | +|-------------|--------| |
| 165 | +| `{MACHINE_DESCRIPTION}` | Ask the user, or read from the existing README intro paragraph | |
| 166 | +| `{*_BASELINE_N}` | Default xdist `-n` value from the justfile | |
| 167 | +| `{*_HIGH_N}` | The higher xdist count used in run 2 | |
| 168 | +| `{*_BASELINE_TIME}` | Wall-clock seconds from run 1 | |
| 169 | +| `{*_HIGH_TIME}` | Wall-clock seconds from run 2 | |
| 170 | +| `{*_OFFLOAD_TIME}` | Wall-clock seconds from run 3 (second invocation only) | |
| 171 | +| `{*_HIGH_PCT}` | `(high_time / baseline_time) * 100` | |
| 172 | +| `{*_OFFLOAD_PCT}` | `(offload_time / baseline_time) * 100` | |
| 173 | +| `{*_HIGH_SPEEDUP}` | `baseline_time / high_time` | |
| 174 | +| `{*_OFFLOAD_SPEEDUP}` | `baseline_time / offload_time` | |
| 175 | +| `{*_HIGH_BAR_WIDTH}` | `round((high_time / baseline_time) * 150)` | |
| 176 | +| `{*_OFFLOAD_BAR_WIDTH}` | `round((offload_time / baseline_time) * 150)` | |
| 177 | +| `{*_MAX_PARALLEL}` | Read from the relevant `offload*.toml` config (`max_parallel` field) | |
| 178 | +| `{*_TEST_COUNT}` | Number of tests collected, from pytest output | |
| 179 | + |
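Filling the template is a plain substitution over the `{NAME}` placeholders. The sketch below uses a hypothetical `fill_template` helper and fails loudly when a placeholder has no value, so a half-filled section never reaches the README. It assumes every value has already been formatted as a string.

```python
import re

# Placeholders are upper-case names in braces, e.g. {MNG_OFFLOAD_SPEEDUP}.
_PLACEHOLDER = re.compile(r"\{([A-Z][A-Z0-9_]*)\}")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace every {NAME} placeholder, raising if any value is missing."""
    missing = sorted({n for n in _PLACEHOLDER.findall(template) if n not in values})
    if missing:
        raise KeyError(f"no value for placeholders: {', '.join(missing)}")
    return _PLACEHOLDER.sub(lambda m: values[m.group(1)], template)
```

A regex is used here instead of `str.format` so that any other braces in the README text pass through untouched.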
| 180 | +## Step 5: Update the README |
| 181 | + |
| 182 | +1. Read `offload/README.md`. |
| 183 | +2. Find the `## Benchmarks` section (starts at `## Benchmarks`, ends just before the next `## ` heading). |
| 184 | +3. Replace that entire section with the generated content from Step 4. |
| 185 | +4. Write the file back. |
| 186 | +5. Verify the replacement by reading the file again and confirming the new numbers appear. |
| 187 | + |
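The section replacement in the steps above could be done with a single regular expression. This is a sketch under stated assumptions: `replace_benchmarks_section` is a hypothetical helper, and it assumes no line inside the old section starts with `## ` other than the heading itself (for example, inside a fenced code block).

```python
import re

# From the "## Benchmarks" heading up to the next level-2 heading or end of file.
# "^## " needs a space after the two hashes, so "###" subheadings stay in the section.
_SECTION = re.compile(
    r"^## Benchmarks[ \t]*\n.*?(?=^## |\Z)",
    re.MULTILINE | re.DOTALL,
)


def replace_benchmarks_section(readme: str, new_section: str) -> str:
    """Swap the Benchmarks section for new_section, leaving the rest untouched."""
    if _SECTION.search(readme) is None:
        raise ValueError("no '## Benchmarks' section found")
    body = new_section.rstrip("\n") + "\n\n"
    # A function replacement avoids backslash-escape handling in the new text.
    return _SECTION.sub(lambda _match: body, readme, count=1)
```

Re-reading the file afterwards and checking that the new numbers appear remains the final verification step.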
| 188 | +## Step 6: Report Results |
| 189 | + |
| 190 | +Summarize to the user: |
| 191 | + |
| 192 | +1. A table of all six runs with their wall-clock times and pass/fail status. |
| 193 | +2. Any runs flagged as INVALID (test failures). |
| 194 | +3. The computed speedups. |
| 195 | +4. Confirmation that the README was updated (or not, if any run was INVALID -- in that case, ask the user whether to update anyway). |
| 196 | + |
| 197 | +## Rules |
| 198 | + |
| 199 | +- **Never run benchmarks concurrently.** Sequential execution only. |
| 200 | +- **Run Offload twice, take the second time.** The first run warms the Modal image cache. |
| 201 | +- **If any tests fail, report results but mark as INVALID.** Ask the user before updating the README with invalid data. |
| 202 | +- **Maintain parallel phrasing in Notes sections.** Both projects' Notes blocks should follow the same structure: test count and type, why the default xdist cap is what it is, how Offload bypasses xdist, and why the speedup is high or low. |
| 203 | +- **Read justfiles at invocation time.** Commands may have changed since this skill was written. |
| 204 | +- **Preserve the rest of the README.** Only replace the `## Benchmarks` section. |