Commit 1baaf8b

Merge pull request #13 from KadenMc/queue-observability-and-lifecycle
0.2.1 — queue layer: live streaming, queue stop, recommit dedupe + side-friction
2 parents 51c2015 + 46a1332 commit 1baaf8b

22 files changed

Lines changed: 2902 additions & 97 deletions

.gitignore

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -209,3 +209,10 @@ cython_debug/
209209
marimo/_static/
210210
marimo/_lsp/
211211
__marimo__/
212+
213+
# Overnight 0.2.1 build — uncommitted decision log per user directive
214+
/DECISIONS.md
215+
216+
# Cygwin/MSYS bash crash artifacts (Git-for-Windows occasionally drops these)
217+
bash.exe.stackdump
218+
*.stackdump

CHANGELOG.md

Lines changed: 286 additions & 0 deletions
@@ -9,6 +9,292 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
99

1010
Nothing yet.
1111

12+
## [0.2.1] — 2026-04-27
13+
14+
### Release summary
15+
16+
0.2.1 is a queue-layer bugfix release driven by early-adopter findings
17+
from the electricrag F.1 inference session of 2026-04-26 → 2026-04-27.
18+
The 0.2.0 queue layer worked end-to-end but had three discrete design
19+
gaps that surfaced under real interactive cluster use; this release
20+
closes all three plus three smaller pieces of side friction the same
21+
session uncovered.
22+
23+
### Fixed (gap 1) — `aexp run-queued` streams subprocess output live
24+
25+
The 0.2.0 implementation invoked the runner via
26+
`subprocess.run(..., capture_output=True)`, which buffers stdout and
27+
stderr in memory until the process exits, then dumps them all at once. For
28+
interactive consumers (notebook runners, terminal `aexp run-queued`
29+
calls) a 15-25 minute training run appeared totally silent until the
30+
end. During the electricrag F.1 session this caused multiple
31+
panic-kills of healthy jobs because the user couldn't tell whether
32+
the work was alive or hung.
33+
34+
`run_queued` now uses `subprocess.Popen` with line-by-line streaming
35+
(`bufsize=1`, stderr merged into stdout for interleave-correct
36+
ordering), and writes each line to the parent's stdout immediately
37+
with a flush. A bounded `deque(maxlen=200)` ring buffer captures the
38+
last ~16 KB of merged output for the failure-tail path; the rendered
39+
`last_error.stderr_tail` is still capped at ~2 KB for
40+
log-storage parity with 0.2.0.
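A minimal sketch of the streaming pattern described above, not the actual `run_queued` code. The function name `run_streaming` and its parameters are assumptions made for illustration; only `Popen`, `bufsize=1`, the merged stderr, and the `deque(maxlen=200)` ring buffer come from the changelog text.

```python
import subprocess
import sys
from collections import deque


def run_streaming(cmd, tail_lines=200, tail_bytes=2048):
    """Run cmd, echo each output line as it arrives, return (returncode, tail).

    Hypothetical helper: the name and signature are illustrative only.
    """
    tail = deque(maxlen=tail_lines)  # ring buffer: only the newest lines survive
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout to keep ordering
        text=True,
        bufsize=1,  # line-buffered on the parent's side of the pipe
    )
    for line in proc.stdout:
        sys.stdout.write(line)  # forward immediately to the caller's stdout
        sys.stdout.flush()
        tail.append(line)
    returncode = proc.wait()
    # Cap the rendered tail by bytes, keeping the end of the output.
    rendered = "".join(tail).encode("utf-8", errors="replace")[-tail_bytes:]
    return returncode, rendered.decode("utf-8", errors="replace")
```

The caller sees output as it is produced, while the returned tail stays bounded however long the job runs. Note that the child may still buffer its own stdout when writing to a pipe; that is outside what this sketch controls.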
41+
42+
This fix obsoletes the in-place cluster patch that was applied during
43+
the 2026-04-26 session (`capture_output=True` line removed). The
44+
upstream version preserves both halves of the contract: live output
45+
to the caller AND a forensics-tail in `job.doc`.
46+
47+
### Added (gap 2) — `aexp queue stop <jobid>` interrupts a running job
48+
49+
0.2.0 had no verb to interrupt a running queued job. The only
50+
recourse was hand-rolled `ps aux | grep ... → kill -9 <pid>`,
51+
followed by `mark_status(job, 'failed')` via the Python API.
52+
Dangerous: SIGKILL on a recycled pid can nuke arbitrary cluster
53+
processes; multiple PIDs in the spawn tree (`aexp run-queued` parent
54+
+ wrapper + inner training process) had to be killed individually.
55+
56+
`run_queued` now spawns the subprocess in its own session/process
57+
group (POSIX `os.setsid` / Windows `CREATE_NEW_PROCESS_GROUP`) and
58+
records `pid`, `pgid`, hostname, and a process-start-time fingerprint
59+
in `job.doc["queue"]["proc"]` for the duration of the run. The
60+
record is cleared on every exit path so a downstream `queue stop`
61+
can't be tricked into killing a recycled pid.
62+
63+
`stop_queued()` (CLI: `aexp queue stop <jobid>`) reads the proc
64+
record, refuses if the recorded host differs from this machine,
65+
checks the start-time fingerprint to detect pid recycling, sends
66+
SIGTERM to the process group, polls during a configurable grace
67+
window (default 5s, override with `--grace-s`), and escalates to
68+
SIGKILL if the runner ignores SIGTERM. `--force` skips SIGTERM
69+
entirely.
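The escalation sequence can be sketched as follows for POSIX only. This is an illustration under assumptions: `spawn_in_own_group` and `stop_process_group` are hypothetical names, the sketch holds the `Popen` handle directly instead of reading a recorded `pid`/`pgid` from `job.doc`, and it omits the hostname and start-time fingerprint checks.

```python
import os
import signal
import subprocess
import time


def spawn_in_own_group(cmd):
    """Start cmd as leader of a new session, so its whole tree shares one pgid."""
    return subprocess.Popen(cmd, start_new_session=True)  # setsid() in the child


def stop_process_group(proc, grace_s=5.0, force=False, poll_s=0.05):
    """SIGTERM the group, wait up to grace_s, then SIGKILL.

    Returns the name of the signal that ended the run. Hypothetical helper.
    """
    pgid = proc.pid  # a session leader's pgid equals its pid
    if not force:
        os.killpg(pgid, signal.SIGTERM)
        deadline = time.monotonic() + grace_s
        while time.monotonic() < deadline:
            if proc.poll() is not None:
                return "SIGTERM"
            time.sleep(poll_s)
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group vanished between the last poll and the kill
    proc.wait()
    return "SIGKILL"
```

Signalling the group rather than a single pid is what lets one call reach the wrapper and the inner training process together.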
70+
71+
A new `"stopped"` terminal status (added to `RunStatus`) distinguishes
72+
operator-stops from `"failed"` (runtime crash) and `"abandoned"`
73+
(never executed / pre-execution give-up). The validator's
74+
`VALID_STATUSES` constant is updated to recognize the new status.
75+
76+
### Added (gap 3) — `add_to_queue` dedupes recommit-only diffs
77+
78+
0.2.0's `add_to_queue` silently created a new signac job whenever
79+
the sp differed, including when the only diff was the auto-injected
80+
`code_commit` from a working-tree commit between two queueings.
81+
Common footgun: queue, fix a docstring, queue again — now you have
82+
2N functionally identical pending jobs.
83+
84+
`add_to_queue` (and `add_many_to_queue` via Cartesian product) now
85+
scans existing pending entries for the same `(experiment_id, tag)`
86+
and compares sps modulo `code_commit` and `code_dirty`. Matches
87+
return the existing job and emit a `DuplicatePendingJobWarning`
88+
(new) instead of creating a duplicate. Pass
89+
`allow_dup_on_recommit=True` (CLI: `--allow-dup-on-recommit`) when
90+
the recommit *is* the point of the new entries.
91+
92+
Tag-scoped: different tags = different operational queues = no
93+
dedupe. Terminal-status entries (complete / failed / abandoned /
94+
stopped) are not deduped against — re-running a finished experiment
95+
is intentional, not a footgun.
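The comparison rule can be sketched in plain Python. The entry layout (dicts with `status`, `experiment_id`, `tag`, `sp`) and the names `comparable_sp` and `find_pending_duplicate` are assumptions for illustration, not the signac-backed implementation.

```python
# Keys that change on every recommit and must not defeat the dedupe check.
VOLATILE_KEYS = frozenset({"code_commit", "code_dirty"})


def comparable_sp(sp):
    """Return the statepoint with the auto-injected volatile keys removed."""
    return {k: v for k, v in sp.items() if k not in VOLATILE_KEYS}


def find_pending_duplicate(entries, experiment_id, tag, sp):
    """Return the first pending entry in the same (experiment_id, tag) scope
    whose sp matches modulo VOLATILE_KEYS, or None. Hypothetical helper."""
    wanted = comparable_sp(sp)
    for entry in entries:
        if entry["status"] != "pending":
            continue  # terminal entries are never deduped against
        if (entry["experiment_id"], entry["tag"]) != (experiment_id, tag):
            continue  # different tag means a different operational queue
        if comparable_sp(entry["sp"]) == wanted:
            return entry
    return None
```

A caller would return the matched entry and emit the warning, or skip the lookup entirely when `allow_dup_on_recommit=True`.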
96+
97+
### Added (side-friction) — `{sp_json_shell}` placeholder
98+
99+
The 0.2.0 `{sp_json}` placeholder emits raw JSON without shell
100+
escaping. Templates that wrap it in shell quotes
101+
(`runner_command: "python foo.py '{sp_json}'"`) break for any sp
102+
value containing the same quote character — apostrophes in
103+
sp.notes were the actual electricrag failure mode.
104+
105+
New `{sp_json_shell}` placeholder applies `shlex.quote` to the
106+
JSON payload. Drop it in the template *unquoted* (the shell quoting
107+
is part of what `shlex.quote` produces). POSIX-safe; Windows cmd.exe
108+
caveat is documented (cluster is Linux, where it matters).
109+
110+
The original `{sp_json}` is preserved unchanged for backward
111+
compatibility; the docstring now warns about the apostrophe trap and
112+
points consumers at `{sp_json_shell}` for any shell-quoted context.
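A small sketch of how the two placeholders differ. `render_runner_command` is a hypothetical stand-in for the template renderer; only the placeholder names and the use of `shlex.quote` come from the text above.

```python
import json
import re
import shlex


def render_runner_command(template, sp):
    """Substitute {sp_json} (raw) and {sp_json_shell} (shell-quoted).

    Hypothetical renderer, written only to illustrate the quoting behavior.
    """
    payload = json.dumps(sp, sort_keys=True)
    values = {
        "{sp_json}": payload,  # raw JSON, unsafe inside shell quotes
        "{sp_json_shell}": shlex.quote(payload),  # one POSIX shell word
    }
    # Single pass, so substituted text is never re-scanned for placeholders.
    return re.sub(r"\{sp_json(?:_shell)?\}", lambda m: values[m.group(0)], template)
```

With `sp = {"notes": "it's fine"}`, the template `python foo.py {sp_json_shell}` yields a command whose third shell word parses back to the original JSON, while `python foo.py '{sp_json}'` ends its quoted region at the apostrophe.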
113+
114+
### Added (side-friction) — heartbeat in `run_lifecycle`
115+
116+
0.2.0's signac job document had a `status='running'` flag set once
117+
at start of `run_lifecycle` and updated only on terminal transition.
118+
Consumers using doc mtime as a liveness signal got false-stale
119+
readings while jobs were working hard (no doc writes during inference
120+
loops). The electricrag F.1 session lost real time to this.
121+
122+
`run_lifecycle` now starts a daemon heartbeat thread that touches
123+
`doc["heartbeat_at"]` (ISO-8601 UTC) every `heartbeat_s` seconds
124+
(default 30s; override per-call via the kwarg, globally via
125+
`AEXP_HEARTBEAT_S` env var, or set to 0 to disable). External
126+
liveness probes can compare `heartbeat_at` to wall-clock to
127+
distinguish "still working" (heartbeat advancing) from "wedged"
128+
(heartbeat stuck > N intervals ago).
129+
130+
The heartbeat runs on a daemon thread, so it never keeps the interpreter
131+
alive after the main path exits; write exceptions inside the thread are swallowed
132+
silently so a heartbeat-thread crash can't mask the real failure on
133+
the main path.
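A sketch of the heartbeat and of an external liveness probe, using a plain dict in place of the signac job document. `start_heartbeat`, `is_wedged`, and the `max_missed` threshold are illustrative assumptions; the `heartbeat_at` key, the ISO-8601 UTC format, the 30s default, and the "0 disables" rule come from the text above.

```python
import threading
from datetime import datetime, timezone


def start_heartbeat(doc, heartbeat_s=30.0):
    """Start a daemon thread that refreshes doc["heartbeat_at"].

    Returns a threading.Event that stops the thread when set, or None when
    heartbeat_s is 0 (disabled). Hypothetical helper.
    """
    if heartbeat_s <= 0:
        return None
    stop = threading.Event()

    def beat():
        while True:
            try:
                doc["heartbeat_at"] = datetime.now(timezone.utc).isoformat()
            except Exception:
                pass  # a failed write must never surface from this thread
            if stop.wait(heartbeat_s):  # returns True once stop is set
                return

    threading.Thread(target=beat, daemon=True, name="heartbeat").start()
    return stop


def is_wedged(doc, heartbeat_s=30.0, max_missed=3, now=None):
    """True when the last heartbeat is older than max_missed intervals."""
    now = now or datetime.now(timezone.utc)
    last = datetime.fromisoformat(doc["heartbeat_at"])
    return (now - last).total_seconds() > max_missed * heartbeat_s
```

A probe built this way should use a threshold of several intervals, since a single late write under I/O load is not evidence of a hang.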
134+
135+
### Added (side-friction) — `code_diff_summary` capture for dirty trees
136+
137+
When `code_dirty=True`, the bare `code_commit` SHA isn't a precise
138+
reproducer — there are uncommitted changes layered on top. 0.2.1
139+
captures a structured `queue.code_diff_summary` blob on dirty queue
140+
adds:
141+
142+
- `diff_stat`: `git diff --stat HEAD` output (one line per changed
143+
file plus totals row).
144+
- `modified_count`: number of modified/staged files.
145+
- `untracked_count`: number of untracked files (forensics for the
146+
"did I forget to `git add`?" case).
147+
148+
Best-effort: capture is wrapped in try/except so a queue add never
149+
fails because git is unavailable.
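One way the best-effort capture could look. `capture_code_diff_summary` is a hypothetical name, and deriving both counts from `git status --porcelain` is an assumption about how "modified/staged" and "untracked" are counted; the three field names and the `git diff --stat HEAD` command come from the list above.

```python
import subprocess


def capture_code_diff_summary(repo_root="."):
    """Best-effort summary of uncommitted changes.

    Returns None instead of raising when git is missing, the directory is
    not a repository, or a command fails. Hypothetical helper.
    """
    def git(*args):
        return subprocess.run(
            ["git", *args], cwd=repo_root, check=True,
            capture_output=True, text=True, timeout=10,
        ).stdout

    try:
        diff_stat = git("diff", "--stat", "HEAD")
        status = git("status", "--porcelain").splitlines()
    except (OSError, subprocess.SubprocessError):
        return None  # a queue add must never fail on this path
    untracked = sum(1 for line in status if line.startswith("??"))
    return {
        "diff_stat": diff_stat.strip(),
        "modified_count": len(status) - untracked,  # everything except "??"
        "untracked_count": untracked,
    }
```

Catching `OSError` covers a missing `git` binary or working directory, and `subprocess.SubprocessError` covers non-zero exits and timeouts.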
150+
151+
### Fixed — `aexp queue stop` actually kills the process tree on Windows
152+
153+
The 0.2.1-rc Windows path for `stop_queued` was broken in **four**
154+
layered ways, all caught during manual smoke testing between two
155+
PowerShell windows. Each fix below was needed; together they make
156+
cross-shell `aexp queue stop --force` work end-to-end.
157+
158+
1. **`CTRL_BREAK_EVENT` doesn't deliver across consoles.** Per Win32
159+
docs the signal is only delivered to processes that share a console
160+
with the sender; `aexp queue stop` invoked from a different shell
161+
than the one running `run-queued` runs in a different console, so
162+
the call succeeded but the signal was silently dropped.
163+
2. **The SIGKILL escalation also fell back to `CTRL_BREAK_EVENT`.**
164+
`signal.SIGKILL` doesn't exist on Windows, so the escalation path
165+
resolved to `signal.SIGTERM`, which the dispatch handled by sending
166+
`CTRL_BREAK_EVENT` again — same broken signal, same silent no-op.
167+
3. **`_proc_alive(pid, 0)` reported alive processes as dead.** Python's
168+
Windows `os.kill(pid, 0)` does not special-case `sig=0` as a
169+
liveness probe (the way POSIX does); it tries to dispatch through
170+
`TerminateProcess(handle, 0)`, which the kernel rejects with
171+
`ERROR_INVALID_PARAMETER` (WinError 87). Catching the
172+
`OSError` and returning `False` meant `stop_queued` thought every
173+
pid was dead before it ever tried to signal it, short-circuiting
174+
to "pid already exited; status only" without invoking taskkill.
175+
4. **signac doc-store rename races between processes.** Once `taskkill`
176+
actually fired, the runner process and the stop process both raced
177+
to write terminal-status fields to the same JSON file. signac's
178+
atomic-rename on Windows isn't atomic against concurrent
179+
rename/read from another process — whichever side lost the race
180+
raised `PermissionError [WinError 5]` (rename-over a locked target)
181+
or `[Errno 13]` (open-for-read while another writer holds the file).
182+
The losing process surfaced a Python traceback to the user even
183+
though the kill itself worked.
184+
185+
Net effect: `stop_queued` returned "stopped" (status flipped on disk)
186+
but the actual subprocess and its child python.exe both kept running
187+
to completion.
188+
189+
The Windows escalation path now invokes
190+
`taskkill /PID <pid> /F /T`:
191+
192+
- `/F` invokes `TerminateProcess` — works cross-console.
193+
- `/T` walks the process tree, killing the inner `python.exe` along
194+
with the `cmd.exe` shell wrapper that `subprocess.Popen(shell=True)`
195+
spawns. Without `/T`, killing only `cmd.exe` orphans the inner
196+
process and the user sees no behavior change.
197+
198+
The SIGTERM grace path still attempts `CTRL_BREAK_EVENT` (it works in
199+
the same-console case — unit tests, single-shell scripts) but the
200+
escalation no longer relies on it.
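The force path reduces to building one `taskkill` argv. The sketch below uses hypothetical names (`taskkill_argv`, `force_kill_tree`) and takes the runner as a parameter so the argv can be checked without Windows, in the spirit of the monkeypatch-based regression test described later in this entry.

```python
import subprocess


def taskkill_argv(pid):
    """argv for a forced, tree-wide kill: /F terminates, /T walks the tree."""
    return ["taskkill", "/PID", str(pid), "/F", "/T"]


def force_kill_tree(pid, run=subprocess.run):
    """Invoke taskkill for pid and report whether it exited with code 0.

    `run` is injectable for testing; on a real Windows host the default
    subprocess.run executes taskkill. Hypothetical helper.
    """
    result = run(taskkill_argv(pid), capture_output=True, text=True)
    return result.returncode == 0
```

Passing the argv as a list avoids `shell=True`, so no extra `cmd.exe` wrapper is created by the kill call itself.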
201+
202+
Fixes:
203+
204+
- `_send_stop_signal(pid, pgid, *, force: bool)` replaces the previous
205+
`_send_signal_safely(pid, pgid, sig)` shape. Encoding *intent*
206+
(force vs graceful) in the parameter rather than dispatching on a
207+
signal value removes the `signal.SIGKILL`-doesn't-exist-on-Windows
208+
ambiguity and guarantees the force path takes the `taskkill /F /T`
209+
branch. POSIX behavior unchanged.
210+
- `_proc_alive` on Windows now uses `OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION)`
211+
+ `GetExitCodeProcess` (via ctypes), checking against `STILL_ACTIVE`
212+
(259). This is the proper Win32 liveness pattern; it doesn't rely
213+
on the ambiguous `os.kill(pid, 0)` semantics. POSIX path unchanged.
214+
- New `aexp.utils.atomic.doc_op_with_retry` helper retries any
215+
signac-doc operation (read or write) on `PermissionError` with mild
216+
exponential backoff (10 attempts, 50ms → 500ms cap). Applied
217+
throughout the run/stop terminal-status writers in `run_lifecycle`,
218+
`mark_status`, `_finalize_stopped`, `_clear_running_proc`, and
219+
`run_queued`'s last_error capture. Resolves the cross-process rename
220+
race transparently; on POSIX it's a no-op (no contention).
221+
- `run_lifecycle`'s exception/clean-exit branches respect terminal
222+
statuses already on disk (`"stopped"`, `"abandoned"`) and don't
223+
overwrite — preserves operator-stop records over the runner's
224+
losing-the-race "failed" status.
225+
- `run_queued`'s failure-tail capture skips the `last_error` write
226+
when `cause="operator_stop"` is already on disk.
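A sketch of the retry shape behind `doc_op_with_retry`. The function name `retry_doc_op`, its signature, and the doubling schedule are assumptions; the changelog only states 10 attempts, a 50ms start, a 500ms cap, and retry on `PermissionError`.

```python
import time


def retry_doc_op(op, attempts=10, base_delay_s=0.05, max_delay_s=0.5, sleep=time.sleep):
    """Call op() and return its result, retrying on PermissionError.

    The delay doubles after each failure up to max_delay_s. The final
    failure is re-raised unchanged. Hypothetical helper.
    """
    delay = base_delay_s
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except PermissionError:
            if attempt == attempts:
                raise  # out of attempts: surface the real error
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
```

A call site would wrap a single document operation, for example `retry_doc_op(lambda: doc.update(status="stopped"))` with `doc` being any mapping. Only `PermissionError` is retried, so unrelated failures still surface on the first attempt.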
227+
228+
Tests:
229+
230+
- The previously POSIX-only `test_stop_queued_kills_running_subprocess_via_sigterm`
231+
and `test_stop_queued_force_skips_sigterm` now run on Windows too,
232+
validating the `taskkill` path AND the doc-store retry path under
233+
in-process thread contention.
234+
- New Windows-specific `test_stop_queued_force_invokes_taskkill_on_windows`
235+
monkeypatches `subprocess.run` to record the argv and asserts
236+
`taskkill /F /T` was actually invoked — a regression guard against
237+
the dispatch dead-code class of bug.
238+
239+
### Added (defensive) — `aexp install` refuses the aexp source tree
240+
241+
`aexp install` (and the underlying `install_limina`) now detects when
242+
`repo_root` is — or is a descendant of — the agentic-experiments source
243+
tree itself, and refuses with a clear error before any filesystem
244+
writes. Detection: walk up from the target directory looking for a
245+
`pyproject.toml` whose `[project].name` is `"agentic-experiments"`.
246+
247+
The mechanism that motivated this defense: invoking `aexp install`
248+
through `poetry -C <aexp-repo> run aexp install` from a separate
249+
scratch directory. Poetry's `-C` flag swaps the subprocess cwd to the
250+
project, so the install ended up materializing a consumer-side scaffold
251+
(`kb/`, `templates/`, `.claude/`, `.runs/`, etc.) inside the package's
252+
own source tree instead of the user's intended target. The guard
253+
catches this class of mistake at install time so the dev repo stays
254+
clean.
255+
256+
Pass `--allow-self-install` (CLI) / `allow_self_install=True` (Python
257+
API) to override when dogfooding the consumer scaffold against the dev
258+
repo is genuinely intended. New `InstallRefused(RuntimeError)` exception
259+
re-exported from `aexp` so programmatic callers can branch on it.
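The detection walk can be sketched as below. `check_install_target` and `_project_name` are hypothetical names and the regex fallback for interpreters without `tomllib` is an assumption added for portability; the walk-up rule, the `"agentic-experiments"` project name, `InstallRefused(RuntimeError)`, and `allow_self_install` come from the text above.

```python
import re
from pathlib import Path

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    tomllib = None

GUARDED_NAME = "agentic-experiments"


class InstallRefused(RuntimeError):
    """Raised when the install target lies inside the package's source tree."""


def _project_name(pyproject):
    """Read [project].name from a pyproject.toml, or None if unreadable."""
    text = pyproject.read_text(encoding="utf-8")
    if tomllib is not None:
        try:
            return tomllib.loads(text).get("project", {}).get("name")
        except tomllib.TOMLDecodeError:
            return None
    # Crude fallback for older interpreters: first name= after [project].
    match = re.search(r'(?ms)^\[project\]\s*$.*?^name\s*=\s*"([^"]+)"', text)
    return match.group(1) if match else None


def check_install_target(repo_root, allow_self_install=False):
    """Raise InstallRefused if repo_root is the guarded tree or below it."""
    target = Path(repo_root).resolve()
    for directory in (target, *target.parents):
        pyproject = directory / "pyproject.toml"
        if pyproject.is_file() and _project_name(pyproject) == GUARDED_NAME:
            if allow_self_install:
                return None
            raise InstallRefused(
                f"{target} is inside the {GUARDED_NAME} source tree at {directory}"
            )
    return None
```

Running the check before any filesystem write is what keeps a refused install from leaving partial scaffolding behind.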
260+
261+
### Behavior changes worth noting
262+
263+
- `RunStatus` literal extended with `"stopped"`. Consumers that
264+
enumerate `RunStatus` values exhaustively in match statements will
265+
see a new lint warning until they handle it; semantically
266+
`"stopped"` is a terminal state alongside `"complete"`,
267+
`"failed"`, `"abandoned"`.
268+
- The new `proc` field under `job.doc["queue"]` is *transient* — it
269+
exists only between Popen-spawn and process-wait-return. Don't
270+
depend on it for post-hoc analysis.
271+
- `run_lifecycle` writes `doc["heartbeat_at"]` continually during
272+
runs. This is small per-write (~80 bytes ISO timestamp) but does
273+
bump signac doc-store I/O. Set `heartbeat_s=0` for short-lived
274+
in-process runs that don't need it.
275+
276+
### Test coverage
277+
278+
Queue tests grow from 58 → 79 (Linux: 80, Windows: 76). New
279+
coverage:
280+
281+
- Live-stream proof: parent stdout sees runner output before
282+
subprocess exit (regression guard for capture_output buffer-then-
283+
dump).
284+
- Stderr tail capture preserved through streaming refactor.
285+
- Proc info recorded during run / cleared after.
286+
- `stop_queued` no-live-proc / wrong-host / pid-recycle / SIGTERM /
287+
`--force` paths.
288+
- Recommit dedupe: returns existing job + emits warning; respects
289+
`--allow-dup-on-recommit`; doesn't fire against terminal entries;
290+
scoped per tag; per-combo in sweeps.
291+
- `{sp_json_shell}` apostrophe-safety.
292+
- `code_diff_summary` written on dirty queue / skipped on clean.
293+
- `run_lifecycle` heartbeat write / disable / env-var override.
294+
295+
`tests/test_validate.py::test_valid_statuses_constant_matches_run_status_literal`
296+
updated for the new `"stopped"` literal.
297+
12298
## [0.2.0] — 2026-04-25
13299

14300
### Release summary

README.md

Lines changed: 3 additions & 3 deletions
@@ -122,9 +122,9 @@ The design bet: agents already know how to run experiments. What they need is a
122122

123123
| | |
124124
|---|---|
125-
| **MCP server** | FastMCP with 21 tools covering artifact creation (H/E/F/T), run lifecycle, batch queries, queue management, tracker binding, and validation. Runs via `uvx --from agentic-experiments[mcp] aexp-mcp-server` — no absolute paths, no per-machine config, `.mcp.json` committable to git. |
126-
| **Slash commands** | Artifact creation: `/aexp-new-hypothesis`, `/aexp-new-experiment`, `/aexp-new-run`. Threads (forward-looking research concerns broader than a hypothesis): `/aexp-new-thread`, `/aexp-list-threads`, `/aexp-show-thread`, `/aexp-close-thread`. Finding creation (pick by what the finding cites): `/aexp-finding-from-run`, `/aexp-finding-from-batch`, `/aexp-finding-placeholder`. Read / inspect: `/aexp-show-run`, `/aexp-show-batch`, `/aexp-list-runs`, `/aexp-status`, `/aexp-validate`. Queue: `/aexp-queue-add`, `/aexp-queue-list`, `/aexp-queue-materialize`. 18 total. |
127-
| **CLI** | 18 verbs covering install, artifact creation (H/E/F/T + thread lifecycle), run lifecycle, batch queries, tracker binding, validation, offline sync, and the `queue` subcommand group (add/list/remove/clear/materialize) + `run-queued`. See `aexp --help` for the full list. Python API is a one-line `from aexp import ...`. |
125+
| **MCP server** | FastMCP with 22 tools covering artifact creation (H/E/F/T), run lifecycle, batch queries, queue management (incl. `queue_stop` for live-job interruption), tracker binding, and validation. Runs via `uvx --from agentic-experiments[mcp] aexp-mcp-server` — no absolute paths, no per-machine config, `.mcp.json` committable to git. |
126+
| **Slash commands** | Artifact creation: `/aexp-new-hypothesis`, `/aexp-new-experiment`, `/aexp-new-run`. Threads (forward-looking research concerns broader than a hypothesis): `/aexp-new-thread`, `/aexp-list-threads`, `/aexp-show-thread`, `/aexp-close-thread`. Finding creation (pick by what the finding cites): `/aexp-finding-from-run`, `/aexp-finding-from-batch`, `/aexp-finding-placeholder`. Read / inspect: `/aexp-show-run`, `/aexp-show-batch`, `/aexp-list-runs`, `/aexp-status`, `/aexp-validate`. Queue: `/aexp-queue-add`, `/aexp-queue-list`, `/aexp-queue-materialize`, `/aexp-queue-stop`. 19 total. |
127+
| **CLI** | 18 verbs covering install, artifact creation (H/E/F/T + thread lifecycle), run lifecycle, batch queries, tracker binding, validation, offline sync, and the `queue` subcommand group (add/list/remove/stop/clear/materialize/run) + `run-queued`. See `aexp --help` for the full list. Python API is a one-line `from aexp import ...`. |
128128
| **Typed JSON contracts** | Pydantic models (`RunLink`, `BatchSelector`, `Issue`, …) back the schema; MCP tools and CLI return the same shapes. |
129129

130130
---

docs/cli.md

Lines changed: 2 additions & 0 deletions
@@ -66,10 +66,12 @@ aexp install-slash-commands [--target .claude/commands]
6666
# Queue subcommand group — pending-run registration + in-script execution + materialization
6767
aexp queue add --experiment E### [--sp K=V,...] [--sweep "K=V|V,K=a..b"]
6868
[--tag T] [--hypothesis H###] [--no-resolve] [--no-commit]
69+
[--allow-dup-on-recommit]
6970
aexp queue list [--experiment E###] [--tag T] [--include-terminal]
7071
aexp queue run [--experiment E###] [--tag T] [--index N]
7172
[--continue-on-failure] [--force] [--dry-run]
7273
aexp queue remove <job_id>
74+
aexp queue stop <job_id> [--grace-s 5] [--force]
7375
aexp queue clear [--experiment E###] [--tag T] [--yes]
7476
aexp queue materialize [--runner shell|slurm|manual] [--output PATH]
7577
[--tag T] [--experiment E###]

docs/mcp.md

Lines changed: 1 addition & 0 deletions
@@ -98,6 +98,7 @@ teammates get the MCP server on clone.
9898
| `bind_tracker` | Attach a noop or wandb tracker to a run | `job_id` |
9999
| `validate` | Compose KB + run-link + finding-citation checks ||
100100
| `sync_offline` | `wandb sync` every offline run in the store ||
101+
| `queue_stop` | Interrupt a running queued job; transitions to `"stopped"` | `job_id` |
101102

102103
All return JSON-serializable dicts. Errors surface either as
103104
`{"error": ..., "code": ...}` in the return value or as MCP error
