@@ -9,6 +9,292 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
9
10 Nothing yet.
11
12+ ## [0.2.1] — 2026-04-27
13+
14+ ### Release summary
15+
16+ 0.2.1 is a queue-layer bugfix release driven by early-adopter findings
17+ from the electricrag F.1 inference session of 2026-04-26 → 2026-04-27.
18+ The 0.2.0 queue layer worked end-to-end but had three discrete design
19+ gaps that surfaced under real interactive cluster use; this release
20+ closes all three plus three smaller pieces of side friction the same
21+ session uncovered.
22+
23+ ### Fixed (gap 1) — ` aexp run-queued ` streams subprocess output live
24+
25+ The 0.2.0 implementation invoked the runner via
26+ `subprocess.run(..., capture_output=True)`, which buffers stdout and
27+ stderr in memory until the process exits and then dumps them all at
28+ once. For interactive consumers (notebook runners, terminal
29+ `aexp run-queued` calls), a 15-25 minute training run appeared
30+ completely silent until the end. During the electricrag F.1 session
31+ this caused multiple panic-kills of healthy jobs because the user
32+ couldn't tell whether the work was alive or hung.
33+
34+ ` run_queued ` now uses ` subprocess.Popen ` with line-by-line streaming
35+ (` bufsize=1 ` , stderr merged into stdout for interleave-correct
36+ ordering), and writes each line to the parent's stdout immediately
37+ with a flush. A bounded `deque(maxlen=200)` ring buffer captures the
38+ last ~16 KB of merged output for the failure-tail path; the rendered
39+ `last_error.stderr_tail` is still capped at ~2 KB for log-storage
40+ parity with 0.2.0.
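
The streaming pattern described above can be sketched roughly as follows. This is a minimal illustration, not aexp's actual `run_queued`: the helper name `stream_and_tail` and its parameters are hypothetical, and only standard-library calls are used.

```python
import subprocess
import sys
from collections import deque


def stream_and_tail(argv, tail_lines=200, tail_bytes=2048):
    """Run argv, echo each output line as it arrives, return (returncode, tail).

    Hypothetical helper for illustration; not aexp's run_queued.
    """
    tail = deque(maxlen=tail_lines)  # ring buffer: oldest lines fall off
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so ordering is preserved
        text=True,
        bufsize=1,  # line-buffered on the parent's side of the pipe
    )
    with proc.stdout:
        for line in proc.stdout:
            sys.stdout.write(line)
            sys.stdout.flush()  # show the line now, not at process exit
            tail.append(line)
    returncode = proc.wait()
    # Cap the rendered tail by bytes, keeping the most recent output.
    rendered = "".join(tail).encode("utf-8")[-tail_bytes:]
    return returncode, rendered.decode("utf-8", errors="replace")
```

Note that the child still controls its own buffering: a Python runner writing to a pipe typically needs `python -u` or explicit flushes before its lines reach the parent promptly.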
41+
42+ This fix obsoletes the in-place cluster patch that was applied during
43+ the 2026-04-26 session (` capture_output=True ` line removed). The
44+ upstream version preserves both halves of the contract: live output
45+ to the caller AND a forensics-tail in ` job.doc ` .
46+
47+ ### Added (gap 2) — ` aexp queue stop <jobid> ` interrupts a running job
48+
49+ 0.2.0 had no verb to interrupt a running queued job. The only
50+ recourse was hand-rolled ` ps aux | grep ... → kill -9 <pid> ` ,
51+ followed by ` mark_status(job, 'failed') ` via the Python API.
52+ Dangerous: SIGKILL on a recycled pid can nuke arbitrary cluster
53+ processes; multiple PIDs in the spawn tree (` aexp run-queued ` parent
54+ + wrapper + inner training process) had to be killed individually.
55+
56+ ` run_queued ` now spawns the subprocess in its own session/process
57+ group (POSIX ` os.setsid ` / Windows ` CREATE_NEW_PROCESS_GROUP ` ) and
58+ records ` pid ` , ` pgid ` , hostname, and a process-start-time fingerprint
59+ in ` job.doc["queue"]["proc"] ` for the duration of the run. The
60+ record is cleared on every exit path so a downstream ` queue stop `
61+ can't be tricked into killing a recycled pid.
62+
63+ ` stop_queued() ` (CLI: ` aexp queue stop <jobid> ` ) reads the proc
64+ record, refuses if the recorded host differs from this machine,
65+ checks the start-time fingerprint to detect pid recycling, sends
66+ SIGTERM to the process group, polls during a configurable grace
67+ window (default 5s, override with ` --grace-s ` ), and escalates to
68+ SIGKILL if the runner ignores SIGTERM. ` --force ` skips SIGTERM
69+ entirely.
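
The SIGTERM, grace window, SIGKILL sequence can be sketched as below. This is a POSIX-only illustration under simplifying assumptions: it works from an in-process `Popen` handle rather than a recorded pid, it omits the host and start-time fingerprint checks, and the name `stop_process_group` is hypothetical.

```python
import os
import signal
import subprocess
import sys
import time


def stop_process_group(proc, grace_s=5.0, force=False, poll_s=0.05):
    """Stop proc's whole process group; return the name of the last signal sent.

    POSIX-only sketch. Assumes proc was started with start_new_session=True,
    so it leads its own process group.
    """
    pgid = os.getpgid(proc.pid)
    if not force:
        os.killpg(pgid, signal.SIGTERM)
        deadline = time.monotonic() + grace_s
        while time.monotonic() < deadline:
            if proc.poll() is not None:  # exited within the grace window
                return "SIGTERM"
            time.sleep(poll_s)
    try:
        os.killpg(pgid, signal.SIGKILL)  # escalate, or go straight here with force
    except ProcessLookupError:
        pass  # the group vanished between the last poll and the kill
    proc.wait()
    return "SIGKILL"
```

Signalling the group rather than a single pid is what lets one call reach the wrapper and the inner training process together.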
70+
71+ A new ` "stopped" ` terminal status (added to ` RunStatus ` ) distinguishes
72+ operator-stops from ` "failed" ` (runtime crash) and ` "abandoned" `
73+ (never executed / pre-execution give-up). Validator's
74+ ` VALID_STATUSES ` constant updated to recognize the new status.
75+
76+ ### Added (gap 3) — ` add_to_queue ` dedupes recommit-only diffs
77+
78+ 0.2.0's ` add_to_queue ` silently created a new signac job whenever
79+ the sp differed, including when the only diff was the auto-injected
80+ ` code_commit ` from a working-tree commit between two queueings.
81+ Common footgun: queue, fix a docstring, queue again — now you have
82+ 2N functionally identical pending jobs.
83+
84+ ` add_to_queue ` (and ` add_many_to_queue ` via Cartesian product) now
85+ scans existing pending entries for the same ` (experiment_id, tag) `
86+ and compares sps modulo ` code_commit ` and ` code_dirty ` . Matches
87+ return the existing job and emit a ` DuplicatePendingJobWarning `
88+ (new) instead of creating a duplicate. Pass
89+ ` allow_dup_on_recommit=True ` (CLI: ` --allow-dup-on-recommit ` ) when
90+ the recommit *is* the point of the new entries.
91+
92+ Tag-scoped: different tags = different operational queues = no
93+ dedupe. Terminal-status entries (complete / failed / abandoned /
94+ stopped) are not deduped against — re-running a finished experiment
95+ is intentional, not a footgun.
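
A rough sketch of the comparison rule, assuming a flat list of entry dicts rather than signac job objects. The entry shape and the function names are hypothetical; only the two ignored keys, the `(experiment_id, tag)` scoping, and the pending-only rule come from the description above.

```python
VOLATILE_KEYS = frozenset({"code_commit", "code_dirty"})


def comparable(sp):
    """Drop the auto-injected keys that change on every recommit."""
    return {k: v for k, v in sp.items() if k not in VOLATILE_KEYS}


def find_pending_duplicate(new_sp, experiment_id, tag, entries):
    """Return the first pending entry equal to new_sp modulo volatile keys.

    `entries` is an iterable of dicts with keys experiment_id, tag, status, sp
    (an assumed shape for illustration).
    """
    for entry in entries:
        if entry["status"] != "pending":
            continue  # terminal entries are never deduped against
        if (entry["experiment_id"], entry["tag"]) != (experiment_id, tag):
            continue  # different tag or experiment: a different queue
        if comparable(entry["sp"]) == comparable(new_sp):
            return entry
    return None
```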
96+
97+ ### Added (side-friction) — ` {sp_json_shell} ` placeholder
98+
99+ The 0.2.0 ` {sp_json} ` placeholder emits raw JSON without shell
100+ escaping. Templates that wrap it in shell quotes
101+ (` runner_command: "python foo.py '{sp_json}'" ` ) break for any sp
102+ value containing the same quote character — apostrophes in
103+ sp.notes were the actual electricrag failure mode.
104+
105+ New ` {sp_json_shell} ` placeholder applies ` shlex.quote ` to the
106+ JSON payload. Drop it into the template *unquoted* (the shell quoting
107+ is part of what `shlex.quote` produces). POSIX-safe; the Windows
108+ `cmd.exe` caveat is documented (the cluster runs Linux, where it matters).
109+
110+ The original ` {sp_json} ` is preserved unchanged for backward
111+ compatibility; the docstring now warns about the apostrophe trap and
112+ points consumers at ` {sp_json_shell} ` for any shell-quoted context.
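
To see why the quoted placeholder survives apostrophes, here is a small sketch. The `render_sp_json_shell` function and the plain string substitution are assumptions for illustration; only `json.dumps` plus `shlex.quote` reflects the behavior described above.

```python
import json
import shlex


def render_sp_json_shell(template, sp):
    """Substitute {sp_json_shell} with a shell-quoted JSON payload.

    Hypothetical renderer; the real template engine may differ.
    """
    payload = shlex.quote(json.dumps(sp, sort_keys=True))
    return template.replace("{sp_json_shell}", payload)


# The placeholder is left unquoted in the template; shlex.quote adds the quotes.
command = render_sp_json_shell(
    "python foo.py {sp_json_shell}", {"notes": "it's fine"}
)
# Splitting the way a POSIX shell would yields the JSON as a single argument.
recovered = json.loads(shlex.split(command)[2])
```

After this runs, `recovered` equals the original state point, apostrophe included.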
113+
114+ ### Added (side-friction) — heartbeat in ` run_lifecycle `
115+
116+ 0.2.0's signac job document had a ` status='running' ` flag set once
117+ at start of ` run_lifecycle ` and updated only on terminal transition.
118+ Consumers using doc mtime as a liveness signal got false-stale
119+ readings while jobs were working hard (no doc writes during inference
120+ loops). The electricrag F.1 session lost real time to this.
121+
122+ ` run_lifecycle ` now starts a daemon heartbeat thread that touches
123+ ` doc["heartbeat_at"] ` (ISO-8601 UTC) every ` heartbeat_s ` seconds
124+ (default 30s; override per-call via the kwarg, globally via
125+ ` AEXP_HEARTBEAT_S ` env var, or set to 0 to disable). External
126+ liveness probes can compare ` heartbeat_at ` to wall-clock to
127+ distinguish "still working" (heartbeat advancing) from "wedged"
128+ (heartbeat stuck > N intervals ago).
129+
130+ The heartbeat runs on a daemon thread, so it never keeps the process
131+ alive once the main thread exits; write exceptions inside the thread
132+ are swallowed silently so a heartbeat-thread crash can't mask the real
133+ failure on the main path.
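
A minimal sketch of such a heartbeat and of an external staleness probe, assuming `doc` is any mutable mapping (the real target is a signac job document). The names `start_heartbeat` and `is_stale`, and the `max_missed` threshold, are hypothetical.

```python
import threading
import time
from datetime import datetime, timezone


def start_heartbeat(doc, interval_s):
    """Start a daemon thread that stamps doc['heartbeat_at'].

    Returns a threading.Event that stops the thread when set, or None when
    interval_s <= 0 (heartbeat disabled).
    """
    if interval_s <= 0:
        return None
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval_s):
            try:
                doc["heartbeat_at"] = datetime.now(timezone.utc).isoformat()
            except Exception:
                pass  # never let a failed write take down the run

    threading.Thread(target=beat, daemon=True, name="heartbeat").start()
    return stop


def is_stale(heartbeat_at, interval_s, max_missed=3, now=None):
    """True when the last heartbeat is older than max_missed intervals."""
    now = now or datetime.now(timezone.utc)
    age_s = (now - datetime.fromisoformat(heartbeat_at)).total_seconds()
    return age_s > interval_s * max_missed
```

A probe would read `heartbeat_at` from the job document and call `is_stale` with the same interval the runner was configured with.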
134+
135+ ### Added (side-friction) — ` code_diff_summary ` capture for dirty trees
136+
137+ When ` code_dirty=True ` , the bare ` code_commit ` SHA isn't a precise
138+ reproducer — there are uncommitted changes layered on top. 0.2.1
139+ captures a structured ` queue.code_diff_summary ` blob on dirty queue
140+ adds:
141+
142+ - ` diff_stat ` : ` git diff --stat HEAD ` output (one line per changed
143+ file plus totals row).
144+ - ` modified_count ` : number of modified/staged files.
145+ - ` untracked_count ` : number of untracked files (forensics for the
146+ "did I forget to ` git add ` ?" case).
147+
148+ Best-effort: capture is wrapped in try/except so a queue add never
149+ fails because git is unavailable.
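
One way to collect these three fields on a best-effort basis is sketched below. The function name and the use of `git status --porcelain` for the counts are assumptions; the changelog only specifies `git diff --stat HEAD` for `diff_stat`.

```python
import subprocess


def capture_diff_summary(repo_dir):
    """Return a diff summary dict, or None if git cannot be used.

    Hypothetical helper; counts are derived from `git status --porcelain`.
    """

    def git(*args):
        return subprocess.run(
            ["git", *args],
            cwd=repo_dir,
            check=True,
            capture_output=True,
            text=True,
            timeout=10,
        ).stdout

    try:
        diff_stat = git("diff", "--stat", "HEAD")
        status = git("status", "--porcelain").splitlines()
    except (OSError, subprocess.SubprocessError):
        return None  # git missing, not a repo, or timed out: skip quietly
    untracked = [line for line in status if line.startswith("??")]
    return {
        "diff_stat": diff_stat.strip(),
        "modified_count": len(status) - len(untracked),
        "untracked_count": len(untracked),
    }
```

Catching `OSError` covers a missing `git` binary or directory, while `SubprocessError` covers non-zero exits and timeouts.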
150+
151+ ### Fixed — ` aexp queue stop ` actually kills the process tree on Windows
152+
153+ The 0.2.1-rc Windows path for `stop_queued` was broken in **four**
154+ layered ways, all caught during manual smoke testing between two
155+ PowerShell windows. Each fix below was needed; together they make
156+ cross-shell ` aexp queue stop --force ` work end-to-end.
157+
158+ 1. **`CTRL_BREAK_EVENT` doesn't deliver across consoles.** Per Win32
159+ docs the signal is only delivered to processes that share a console
160+ with the sender; ` aexp queue stop ` invoked from a different shell
161+ than the one running ` run-queued ` runs in a different console, so
162+ the call succeeded but the signal was silently dropped.
163+ 2. **The SIGKILL escalation also fell back to `CTRL_BREAK_EVENT`.**
164+ ` signal.SIGKILL ` doesn't exist on Windows, so the escalation path
165+ resolved to ` signal.SIGTERM ` , which the dispatch handled by sending
166+ ` CTRL_BREAK_EVENT ` again — same broken signal, same silent no-op.
167+ 3. **`_proc_alive(pid, 0)` reported alive processes as dead.** Python's
168+ Windows ` os.kill(pid, 0) ` does not special-case ` sig=0 ` as a
169+ liveness probe (the way POSIX does); it tries to dispatch through
170+ ` TerminateProcess(handle, 0) ` , which the kernel rejects with
171+ ` ERROR_INVALID_PARAMETER ` (WinError 87). Catching the
172+ ` OSError ` and returning ` False ` meant ` stop_queued ` thought every
173+ pid was dead before it ever tried to signal it, short-circuiting
174+ to "pid already exited; status only" without invoking taskkill.
175+ 4. **signac doc-store rename races between processes.** Once `taskkill`
176+ actually fired, the runner process and the stop process both raced
177+ to write terminal-status fields to the same JSON file. signac's
178+ atomic-rename on Windows isn't atomic against concurrent
179+ rename/read from another process — whichever side lost the race
180+ raised ` PermissionError [WinError 5] ` (rename-over a locked target)
181+ or ` [Errno 13] ` (open-for-read while another writer holds the file).
182+ The losing process surfaced a Python traceback to the user even
183+ though the kill itself worked.
184+
185+ Net effect in the rc: `stop_queued` returned `"stopped"` (status
186+ flipped on disk) while the actual subprocess and its child
187+ `python.exe` both kept running to completion.
188+
189+ The Windows escalation path now invokes
190+ ` taskkill /PID <pid> /F /T ` :
191+
192+ - ` /F ` invokes ` TerminateProcess ` — works cross-console.
193+ - ` /T ` walks the process tree, killing the inner ` python.exe ` along
194+ with the ` cmd.exe ` shell wrapper that ` subprocess.Popen(shell=True) `
195+ spawns. Without ` /T ` , killing only ` cmd.exe ` orphans the inner
196+ process and the user sees no behavior change.
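
For reference, the forced tree kill amounts to something like the following. The function names are hypothetical, and the platform guard is added here only so the sketch fails loudly off Windows.

```python
import subprocess
import sys


def force_kill_argv(pid):
    """Build the taskkill argv for a forced, tree-wide kill."""
    return ["taskkill", "/PID", str(pid), "/F", "/T"]


def force_kill_tree(pid):
    """Windows-only: terminate pid and its descendants; True on success."""
    if sys.platform != "win32":
        raise NotImplementedError("taskkill exists only on Windows")
    done = subprocess.run(force_kill_argv(pid), capture_output=True, text=True)
    return done.returncode == 0
```

Passing the arguments as a list avoids a second shell layer, which matters here because the shell wrapper is exactly what `/T` is meant to clean up.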
197+
198+ The SIGTERM grace path still attempts ` CTRL_BREAK_EVENT ` (it works in
199+ the same-console case — unit tests, single-shell scripts) but the
200+ escalation no longer relies on it.
201+
202+ Fixes:
203+
204+ - ` _send_stop_signal(pid, pgid, *, force: bool) ` replaces the previous
205+ `_send_signal_safely(pid, pgid, sig)` shape. Encoding *intent*
206+ (force vs graceful) in the parameter rather than dispatching on a
207+ signal value removes the ` signal.SIGKILL ` -doesn't-exist-on-Windows
208+ ambiguity and guarantees the force path takes the ` taskkill /F /T `
209+ branch. POSIX behavior unchanged.
210+ - ` _proc_alive ` on Windows now uses ` OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION) `
211+ + ` GetExitCodeProcess ` (via ctypes), checking against ` STILL_ACTIVE `
212+ (259). This is the proper Win32 liveness pattern; it doesn't rely
213+ on the ambiguous ` os.kill(pid, 0) ` semantics. POSIX path unchanged.
214+ - New ` aexp.utils.atomic.doc_op_with_retry ` helper retries any
215+ signac-doc operation (read or write) on ` PermissionError ` with mild
216+ exponential backoff (10 attempts, 50ms → 500ms cap). Applied
217+ throughout the run/stop terminal-status writers in ` run_lifecycle ` ,
218+ ` mark_status ` , ` _finalize_stopped ` , ` _clear_running_proc ` , and
219+ ` run_queued ` 's last_error capture. Resolves the cross-process rename
220+ race transparently; on POSIX it's a no-op (no contention).
221+ - ` run_lifecycle ` 's exception/clean-exit branches respect terminal
222+ statuses already on disk (` "stopped" ` , ` "abandoned" ` ) and don't
223+ overwrite — preserves operator-stop records over the runner's
224+ losing-the-race "failed" status.
225+ - ` run_queued ` 's failure-tail capture skips the ` last_error ` write
226+ when ` cause="operator_stop" ` is already on disk.
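
The retry behavior described for `doc_op_with_retry` (10 attempts, 50ms doubling to a 500ms cap) might look roughly like this. The signature of the real helper is not shown in this changelog, so the sketch uses a different, hypothetical name and an injectable `sleep` for testing.

```python
import time


def retry_on_permission_error(
    op, attempts=10, base_delay_s=0.05, max_delay_s=0.5, sleep=time.sleep
):
    """Call op(); on PermissionError retry with capped exponential backoff.

    Illustrative only; not the actual aexp.utils.atomic implementation.
    """
    delay = base_delay_s
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except PermissionError:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
```

A caller would wrap a single document read or write in a zero-argument callable, for example `retry_on_permission_error(lambda: doc.update(status="stopped"))`.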
227+
228+ Tests:
229+
230+ - The previously POSIX-only ` test_stop_queued_kills_running_subprocess_via_sigterm `
231+ and ` test_stop_queued_force_skips_sigterm ` now run on Windows too,
232+ validating the ` taskkill ` path AND the doc-store retry path under
233+ in-process thread contention.
234+ - New Windows-specific ` test_stop_queued_force_invokes_taskkill_on_windows `
235+ monkeypatches ` subprocess.run ` to record the argv and asserts
236+ ` taskkill /F /T ` was actually invoked — a regression guard against
237+ the dispatch dead-code class of bug.
238+
239+ ### Added (defensive) — ` aexp install ` refuses the aexp source tree
240+
241+ ` aexp install ` (and the underlying ` install_limina ` ) now detects when
242+ ` repo_root ` is — or is a descendant of — the agentic-experiments source
243+ tree itself, and refuses with a clear error before any filesystem
244+ writes. Detection: walk up from the target directory looking for a
245+ ` pyproject.toml ` whose ` [project].name ` is ` "agentic-experiments" ` .
246+
247+ The mechanism that motivated this defense: invoking ` aexp install `
248+ through ` poetry -C <aexp-repo> run aexp install ` from a separate
249+ scratch directory. Poetry's ` -C ` flag swaps the subprocess cwd to the
250+ project, so the install ended up materializing a consumer-side scaffold
251+ (` kb/ ` , ` templates/ ` , ` .claude/ ` , ` .runs/ ` , etc.) inside the package's
252+ own source tree instead of the user's intended target. The guard
253+ catches this class of mistake at install time so the dev repo stays
254+ clean.
255+
256+ Pass ` --allow-self-install ` (CLI) / ` allow_self_install=True ` (Python
257+ API) to override when dogfooding the consumer scaffold against the dev
258+ repo is genuinely intended. New ` InstallRefused(RuntimeError) ` exception
259+ re-exported from ` aexp ` so programmatic callers can branch on it.
260+
261+ ### Behavior changes worth noting
262+
263+ - ` RunStatus ` literal extended with ` "stopped" ` . Consumers that
264+ enumerate ` RunStatus ` values exhaustively in match statements will
265+ see a new lint warning until they handle it; semantically
266+ ` "stopped" ` is a terminal state alongside ` "complete" ` ,
267+ ` "failed" ` , ` "abandoned" ` .
268+ - The new `proc` field under `job.doc["queue"]` is *transient*: it
269+ exists only between Popen-spawn and process-wait-return. Don't
270+ depend on it for post-hoc analysis.
271+ - ` run_lifecycle ` writes ` doc["heartbeat_at"] ` continually during
272+ runs. This is small per write (~80 bytes ISO timestamp) but does
273+ bump signac doc-store I/O. Set ` heartbeat_s=0 ` for short-lived
274+ in-process runs that don't need it.
275+
276+ ### Test coverage
277+
278+ Queue tests grow from 58 → 79 (Linux: 80, Windows: 76). New
279+ coverage:
280+
281+ - Live-stream proof: parent stdout sees runner output before
282+ subprocess exit (regression guard for capture_output buffer-then-
283+ dump).
284+ - Stderr tail capture preserved through streaming refactor.
285+ - Proc info recorded during run / cleared after.
286+ - ` stop_queued ` no-live-proc / wrong-host / pid-recycle / SIGTERM /
287+ ` --force ` paths.
288+ - Recommit dedupe: returns existing job + emits warning; respects
289+ ` --allow-dup-on-recommit ` ; doesn't fire against terminal entries;
290+ scoped per tag; per-combo in sweeps.
291+ - ` {sp_json_shell} ` apostrophe-safety.
292+ - ` code_diff_summary ` written on dirty queue / skipped on clean.
293+ - ` run_lifecycle ` heartbeat write / disable / env-var override.
294+
295+ ` tests/test_validate.py::test_valid_statuses_constant_matches_run_status_literal `
296+ updated for the new ` "stopped" ` literal.
297+
298 ## [0.2.0] — 2026-04-25
299
300 ### Release summary