|
2 | 2 | {"id":"code-10","title":"Rename project: Update src/*.rs files to the new 'offload' name","description":"Update all references in Rust source files from the previous project name to 'offload'. This includes:\n- src/main.rs: CLI name, default config file, use statements\n- src/lib.rs: crate docs and examples\n- src/config.rs: doc comments\n- src/config/schema.rs: struct names (rename the old config struct to OffloadConfig), doc comments\n- src/bundled.rs: cache directory paths\n- src/report.rs: doc examples\n- src/report/junit.rs: testsuite default name\n- src/provider.rs, src/provider/*.rs: doc examples\n- src/framework.rs, src/framework/*.rs: doc examples\n- src/orchestrator.rs, src/orchestrator/*.rs: doc examples and sandbox ID prefix\n- src/connector.rs: doc examples\n\nThe old config struct should be renamed to OffloadConfig, and the corresponding field in Config renamed to 'offload'.","status":"done","priority":1,"issue_type":"task","created_at":"2026-01-29T18:24:57.591601347Z","created_by":"Danver Braganza","updated_at":"2026-01-29T18:42:11.144979343Z"} |
3 | 3 | {"id":"code-100","title":"Pipeline sandbox creation with early batch execution","description":"## Directive\n\nIMPORTANT: Before doing ANY git or VCS operations, you MUST activate the jujutsu skill by running /jujutsu. This is a jujutsu-managed repository. Using raw git commands will corrupt data.\n\nWhen this bead is complete, mark the final revision with a branch: danver/pipeline-sandbox-creation\n\n## Problem\n\nThe orchestrator creates all sandboxes upfront before beginning any test execution. With 200 sandboxes, this takes 8.3 seconds. During this time, zero tests are running. Since the LPT scheduler sorts batches heaviest-first, the longest batches (55-67s acceptance tests) are delayed by the full sandbox creation time.\n\n## Evidence from Trace Analysis\n\nFrom Run 3 (200 sandboxes, 525 batches):\n- Sandbox pool creation: 8.34s\n- This is now the LARGEST remaining setup phase (after overlapped discovery was implemented)\n- The first batch does not start until 14.9s into the run\n- The longest batch (67.6s) starts at ~15s and finishes at ~83s\n- If it had started at ~7s instead, it would finish at ~75s, saving ~8s off the critical path\n\nThe sandbox creation happens in `src/orchestrator/pool.rs` lines 40-60:\n\n```rust\npub async fn populate\u003cP\u003e(\n \u0026mut self, count: usize, provider: \u0026P, config: \u0026SandboxConfig,\n) -\u003e Result\u003c(), ProviderError\u003e\nwhere P: SandboxProvider\u003cSandbox = S\u003e,\n{\n let futures: Vec\u003c_\u003e = (0..count)\n .map(|i| { ... provider.create_sandbox(\u0026cfg).await })\n .collect();\n let sandboxes = futures::future::try_join_all(futures).await?;\n self.sandboxes.extend(sandboxes);\n Ok(())\n}\n```\n\nAnd in `src/orchestrator.rs`, `run_with_tests` takes an already-populated `SandboxPool\u003cS\u003e` (line 165). The pool is fully populated before `run_with_tests` is called (in `src/main.rs`).\n\n## Required Changes\n\n### 1. 
Investigate the current flow in main.rs\n\nRead `src/main.rs` to understand where `SandboxPool::populate` is called relative to `orchestrator.run_with_tests`. Map the full sequence:\n1. Where is the pool created?\n2. Where is it populated?\n3. Where is it passed to the orchestrator?\n4. What happens between populate and run_with_tests?\n\n### 2. Design the pipelining approach\n\nThe goal is to start executing batches on early-ready sandboxes while remaining sandboxes are still being created. Two approaches:\n\n**Option A: Streaming sandbox creation into the queue**\n- Modify `run_with_tests` to accept sandboxes via a channel instead of a pre-populated pool\n- Launch sandbox creation in the background, sending each sandbox to the channel as it becomes ready\n- Workers pull from the channel as sandboxes arrive\n\n**Option B: Two-phase pool population**\n- Create a small initial pool (e.g., 10-20 sandboxes) synchronously\n- Start execution with the initial pool\n- Continue creating remaining sandboxes in the background, adding them to the pool as they become ready\n- Workers check for new sandboxes periodically\n\n**Option C: Overlap pool creation with scheduling**\n- Run `populate` concurrently with duration loading + scheduling (which takes ~0.3s)\n- This is a minimal change but only saves ~0.3s\n\nChoose whichever option best fits the existing architecture. Option A is most elegant but requires the most refactoring. Option B is a good middle ground. Option C is low-effort but low-impact.\n\n### 3. Implement the chosen approach\n\nImplement with proper error handling. If a sandbox fails to create mid-execution, existing running sandboxes should continue. The orchestrator should work with fewer sandboxes than requested (just slower).\n\n### 4. 
Add tests\n\n- Test that execution starts before all sandboxes are ready\n- Test that the system works correctly with fewer sandboxes than requested (partial creation failure)\n- Test that all batches complete even if sandbox creation is slow\n\n### 5. Preserve trace events\n\nThe current trace emits `sandbox_pool_create` as a single span. With pipelining, this span should either:\n- Cover the full creation time (from first to last sandbox)\n- Or be replaced with per-sandbox creation events\n\n## Expected Impact\n\n- Saves 5-8s off the critical path by overlapping sandbox creation with early batch execution\n- The heaviest batches (acceptance tests at 55-67s) start sooner, reducing total wall clock\n- Improvement is proportional to sandbox count (more sandboxes = more savings)\n\n## Files to Modify\n- src/main.rs (understand current flow, possibly restructure)\n- src/orchestrator.rs (modify run_with_tests to accept streaming sandboxes)\n- src/orchestrator/pool.rs (possibly add streaming creation method)","status":"open","priority":2,"issue_type":"task","created_at":"2026-03-05T22:07:16.863726-08:00","created_by":"danver","updated_at":"2026-03-05T22:07:16.863726-08:00"} |
4 | 4 | {"id":"code-101","title":"Investigate and instrument slow acceptance tests for optimization","description":"## Directive\n\nIMPORTANT: Before doing ANY git or VCS operations, you MUST activate the jujutsu skill by running /jujutsu. This is a jujutsu-managed repository. Using raw git commands will corrupt data.\n\nWhen this bead is complete, mark the final revision with a branch: danver/investigate-slow-acceptance-tests\n\n## Problem\n\nFive acceptance tests individually take 47-68 seconds each, forming the hard floor on execution time. No amount of parallelism or scheduling improvement can reduce the wall clock below the duration of the slowest single test. These tests are the dominant bottleneck in Run 3.\n\n## Evidence from Trace Analysis\n\nFrom Run 3 (26,210 tests, 200 sandboxes):\n\n| Batch | Tests | Duration | Sandbox |\n|-------|-------|----------|---------|\n| batch_3 | 1 | 67.6s | 17 |\n| batch_1 | 1 | 58.1s | 2 |\n| batch_0 | 1 | 48.0s | 0 |\n| batch_4 | 1 | 48.0s | 32 |\n| batch_2 | 1 | 47.6s | 3 |\n\nThese 5 tests consume 269 sandbox-seconds. The longest (67.6s) is 72% of the entire 93.8s execution window. Even a 2x improvement on just the slowest test would save ~34s off the critical path.\n\n## Context\n\nThese tests are in the `mng` repository (the repo that uses offload to run its tests), not in the offload repository itself. The offload tool runs whatever tests it discovers -- it does not control their content. However, we can instrument offload to help identify what makes these tests slow.\n\n## Required Changes\n\n### 1. Add per-test timing to batch output\n\nCurrently, offload knows the total batch duration but not individual test durations within a batch. 
For single-test batches this is fine, but for multi-test batches the per-test breakdown is invisible.\n\nIn `src/orchestrator/runner.rs`, after downloading the JUnit XML results, parse the `time` attribute from each `\u003ctestcase\u003e` element and log the top-N slowest tests:\n\n```rust\n// After downloading junit.xml, log slowest tests\nlet mut test_times: Vec\u003c(\u0026str, f64)\u003e = Vec::new();\n// Parse \u003ctestcase name=\"...\" time=\"...\"\u003e elements\n// Sort by time descending\n// Log top 5 slowest\nfor (name, time) in test_times.iter().take(5) {\n info!(\"[SLOW TEST] {}: {:.1}s\", name, time);\n}\n```\n\n### 2. Add a `--slow-test-threshold` CLI flag\n\nAdd a `--slow-test-threshold` flag (default: 30s) that causes offload to emit a warning for any test exceeding the threshold:\n\n```\nWARNING: Test 'test_full_acceptance_flow' took 67.6s (threshold: 30s)\n```\n\nThis makes slow tests visible in CI output without requiring trace analysis.\n\n### 3. Add slow test data to the Perfetto trace\n\nIn the trace output, add per-test duration events. Currently the trace has batch-level events (`exec_batch`, `download_results`). Add individual test events within the exec thread:\n\n```rust\n// For each testcase in the junit XML:\ntracer.complete_event(\n test_name,\n \"test\",\n sandbox_pid,\n TID_EXEC,\n test_start_us,\n test_duration_us,\n);\n```\n\nThis requires parsing the JUnit XML for individual test times and mapping them back to the trace timeline. The start time can be approximated (batch_start + cumulative_previous_test_times).\n\n### 4. Add a summary section to the run output\n\nAfter the existing summary (passed/failed/flaky counts), add a \"Slowest Tests\" section:\n\n```\nSlowest tests:\n 1. test_full_acceptance_flow 67.6s\n 2. test_end_to_end_pipeline 58.1s\n 3. test_modal_integration 48.0s\n ...\n```\n\nUse the JUnit XML `time` attributes as the source of truth.\n\n### 5. 
Write tests\n\n- Test that the slow test warning is emitted when a test exceeds the threshold\n- Test that the slow test summary is correctly sorted and limited to top N\n- Test that per-test trace events are emitted correctly\n\n## Expected Impact\n\n- No direct wall-clock improvement (this is instrumentation)\n- Enables the mng team to identify and profile the specific slow tests\n- The slow test warnings in CI output will create visibility and pressure to fix them\n- Per-test trace events enable deeper analysis in Perfetto UI\n\n## Files to Modify\n- src/orchestrator/runner.rs (add per-test timing extraction from JUnit XML)\n- src/main.rs (add --slow-test-threshold flag)\n- src/report.rs or src/report/junit.rs (add slow test summary to output)\n- src/trace.rs (possibly add per-test trace events)","status":"open","priority":2,"issue_type":"task","created_at":"2026-03-05T22:07:16.863726-08:00","created_by":"danver","updated_at":"2026-03-05T22:07:16.863726-08:00"} |
5 | | -{"id":"code-102","title":"Strip pytest framework config to bare minimum","description":"Remove python, extra_args, markers fields from PytestFrameworkConfig. Make command required (not Option). Keep test_id_format as internal constant. Update all related code: schema.rs, pytest.rs, main.rs init template, example TOML configs, tests, README.","status":"done","priority":1,"issue_type":"task","created_at":"2026-03-10T11:56:28.4364-07:00","created_by":"Jacob Kirmayer","updated_at":"2026-03-10T12:03:22.088725-07:00"} |
6 | | -{"id":"code-103","title":"Add CostEstimate struct to provider.rs with cpu_seconds and estimated_cost_usd fields. Include Display impl. The struct should be Clone, Debug, Default.","status":"done","priority":0,"issue_type":"task","created_at":"2026-03-11T11:24:05.719648-07:00","created_by":"danver","updated_at":"2026-03-11T11:27:36.278936-07:00"} |
7 | | -{"id":"code-104","title":"Track sandbox creation time in DefaultSandbox by adding a created_at: Instant field, set in DefaultSandbox::new. Update DefaultProvider::create_sandbox to pass Instant::now().","status":"done","priority":0,"issue_type":"task","created_at":"2026-03-11T11:24:10.902597-07:00","created_by":"danver","updated_at":"2026-03-11T11:30:36.149893-07:00"} |
8 | | -{"id":"code-105","title":"Add cost_estimate() -\u003e CostEstimate method to Sandbox trait. Implement in DefaultSandbox using elapsed time from created_at and Modal pricing ($0.00003942/core/sec). LocalSandbox returns CostEstimate::default().","status":"done","priority":0,"issue_type":"task","created_at":"2026-03-11T11:24:15.537292-07:00","created_by":"danver","updated_at":"2026-03-11T11:33:51.975659-07:00"} |
9 | | -{"id":"code-106","title":"Add estimated_cost: CostEstimate field to RunResult. Aggregate costs from sandboxes during cleanup in orchestrator.rs. Update print_summary to accept optional show_cost bool and display cost when true.","status":"done","priority":0,"issue_type":"task","created_at":"2026-03-11T11:24:20.133117-07:00","created_by":"danver","updated_at":"2026-03-11T11:37:48.727278-07:00"} |
10 | | -{"id":"code-107","title":"Add --show-estimated-cost flag to Commands::Run in main.rs. Help text: 'Show estimated sandbox cost after run. Note: This is calculated client-side using simple formulas and may not reflect actual billing, discounts, or pricing adjustments.'","status":"done","priority":0,"issue_type":"task","created_at":"2026-03-11T11:24:27.889734-07:00","created_by":"danver","updated_at":"2026-03-11T11:40:21.818977-07:00"} |
11 | | -{"id":"code-108","title":"Wire --show-estimated-cost through run_tests -\u003e dispatch_framework -\u003e run_all_tests -\u003e orchestrator. Pass show_cost to print_summary. Only display cost line when flag is set and cost \u003e 0.","status":"done","priority":0,"issue_type":"task","created_at":"2026-03-11T11:24:32.467886-07:00","created_by":"danver","updated_at":"2026-03-11T11:44:20.743176-07:00"} |
12 | | -{"id":"code-109","title":"Add cpu_cores field to ModalProviderConfig (default 0.125) and DefaultProviderConfig (default 1.0). Pass cpu_cores to DefaultSandbox for cost calculation. Update cost_estimate() to multiply by cpu_cores. The cpu_cores should also be injectable into command templates via {cpu_cores} placeholder.","description":"Add cpu_cores field to ModalProviderConfig (default 0.125) and DefaultProviderConfig (default 1.0). Plumb cpu_cores through ModalProvider to modal_sandbox.py create via --cpu flag. Pass cpu_cores to DefaultSandbox for cost calculation. Update cost_estimate() to multiply by cpu_cores. Inject {cpu_cores} into DefaultProvider command templates. Trim wordy doc comments to one-line summaries.","status":"done","priority":0,"issue_type":"task","created_at":"2026-03-11T12:14:24.398544-07:00","created_by":"danver","updated_at":"2026-03-11T12:53:37.569493-07:00"} |
| 5 | +{"id":"code-102","title":"Add vitest duplicate test name check to onboarding skill","description":"Update SKILL.md to detect vitest framework and check for duplicate space-separated test IDs during onboarding. If duplicates are found, the agent must stop and ask the user if they want the agent to deduplicate them by renaming tests more verbosely. Convey that this is a blocking requirement for using Offload.","status":"done","priority":0,"issue_type":"task","owner":"jacob.kirmayer@imbue.com","created_at":"2026-03-16T10:16:55.275063-07:00","created_by":"Jacob Kirmayer","updated_at":"2026-03-16T10:22:43.276909-07:00"} |
| 6 | +{"id":"code-103","title":"Add offload collect verification step to onboarding skill","description":"Update SKILL.md Step 10 (Run Offload Locally and Verify) to instruct agents to use 'offload collect' first to verify discovery works before running full 'offload run'. The agent should iterate on offload collect until discovery succeeds before attempting execution.","status":"done","priority":0,"issue_type":"task","owner":"jacob.kirmayer@imbue.com","created_at":"2026-03-16T10:54:33.111537-07:00","created_by":"Jacob Kirmayer","updated_at":"2026-03-16T10:56:07.309012-07:00"} |
13 | 7 | {"id":"code-11","title":"Rename project: Rename the old *.toml config files to the offload-*.toml naming","description":"Rename all configuration files from the previous project prefix to the 'offload' prefix:\n- offload.toml\n- offload-local.toml\n- offload-modal.toml\n- offload-cargo-local.toml\n- offload-cargo-modal.toml\n- offload-computronium-modal.toml\n- offload-sculptor-modal.toml\n\nAlso rename the top-level configuration section in these files to [offload].","status":"done","priority":1,"issue_type":"task","created_at":"2026-01-29T18:25:03.560121502Z","created_by":"Danver Braganza","updated_at":"2026-01-29T18:45:18.15783543Z"} |
14 | 8 | {"id":"code-12","title":"Rename project: Update README.md to the new 'offload' name","description":"Update README.md to replace all references to the previous project name with 'offload'. This includes:\n- Project title\n- Feature descriptions\n- Installation commands\n- CLI examples (offload init, offload run, etc.)\n- Configuration file references (now offload.toml)\n- Example configuration sections (now [offload])\n- All documentation text","status":"done","priority":1,"issue_type":"task","created_at":"2026-01-29T18:25:08.706866046Z","created_by":"Danver Braganza","updated_at":"2026-01-29T18:50:11.476117046Z"} |
15 | 9 | {"id":"code-13","title":"Rename project: Update scripts/modal_sandbox.py to the new 'offload' name","description":"Update scripts/modal_sandbox.py to replace all references to the previous project name with 'offload'. This includes:\n- Module docstring\n- CLI help text\n- Modal App names (now offload-sandbox, offload-rust-sandbox, etc.)\n- Function docstrings\n- Comments","status":"done","priority":1,"issue_type":"task","created_at":"2026-01-29T18:25:14.017333924Z","created_by":"Danver Braganza","updated_at":"2026-01-29T18:52:06.241321461Z"} |