openclaw: dedup skillify spawn per-session + stale-lock recovery (#100 + #110) by kaghni · Pull Request #172 · activeloopai/hivemind

kaghni · 2026-05-15T19:50:35Z

Fixes #100 and #110.

Why

Two spawn-lifecycle bugs in openclaw/src/index.ts:

#100 — Wasted re-spawns: agent_end fires on every turn. The on-disk lock at ~/.deeplake/state/skillify/<projectKey>.worker.lock prevents overlapping workers, but as soon as a worker exits and releases its lock, the NEXT agent_end re-acquires it and spawns a fresh worker. The fresh worker does one watermark-check SQL roundtrip, sees nothing new to mine, and exits — but each spawn costs ~50ms Node cold-start + ~200ms DB I/O. A 50-turn session ends up doing 2-5 spawns instead of 1.

#110 — Stale locks halt mining permanently: tryAcquireOpenclawSkillifyLock does O_CREAT | O_EXCL | O_WRONLY and treats any pre-existing lock as "live worker, skip." There's no staleness check. If a worker dies abnormally (host kill, OOM, segfault) before its finally releases the lock, the lock persists forever and every subsequent agent_end silently no-ops mining for that project_key permanently. Hit live during the 2026-05-07 PR #98 E2E — a manual rm <lockfile> was needed to recover.

What changed

Per-runtime dedup (#100)

New module-level const skillifySpawnedFor = new Set<string>(). Tracks which session IDs have already triggered a spawn in this gateway runtime.
agent_end handler now wraps the spawnOpenclawSkillifyWorker(...) call in if (!skillifySpawnedFor.has(sid)) { skillifySpawnedFor.add(sid); … }.
The on-disk lock stays authoritative across processes (e.g. multiple gateway restarts). The new in-memory Set only suppresses within-runtime redundancy.

Stale-lock recovery (#110)

Lock file now writes String(Date.now()) on acquire (was an empty file).
On O_EXCL failure, reads the existing lock body, parses it as a ms timestamp. If Date.now() - ts > 10 minutes OR the body is unparseable (NaN), the lock is treated as stale → unlinked → retry acquire.
Mirrors the staleness logic in src/skillify/state.ts:tryAcquireWorkerLock for the non-openclaw agents.
Migration: empty pre-existing lock files (from earlier code) parse as NaN and are treated as immediately stale on the first patched run — no manual cleanup needed.
10-minute max age is generous vs typical worker runtime (<30s + buffer). Pathological hangs longer than that release the spawn slot to the next agent_end, instead of leaking mining for the rest of the gateway's lifetime.

Tests

npm run typecheck — clean
npm test — 2380/2380 passing (one bundle-scan regex distance bumped 500→1500 to accommodate the new dedup comment block between Auto-captured and the spawn site; same assertion intent)

Test plan after merge

Long-running openclaw session (50+ turns). grep -c "Auto-captured" /tmp/openclaw/openclaw-*.log should be many; ls ~/.deeplake/state/skillify/*.worker.lock should show at most one mtime-bump per session (one spawn, not 2-5).
Kill a worker mid-mine (kill -9 $WORKER_PID). Wait 11 minutes. Next agent_end should successfully re-acquire the lock (stale-recovery path).

Summary by CodeRabbit

Bug Fixes
- Improved reliability of background worker spawning in extended agent sessions by preventing redundant spawn attempts
- Enhanced detection and cleanup of stale worker states
- Added error handling to gracefully manage worker startup failures
Tests
- Updated test validations for worker spawning behavior

github-actions · 2026-05-15T19:51:48Z

Coverage Report

No src/*.ts files changed in this PR.

_{Generated for commit 5e8ef00.}

coderabbitai · 2026-05-15T19:53:09Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4a57e427-77a7-45d8-b0ff-6f7bcbbb0141

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

The PR reduces redundant skillify worker spawns in OpenClaw sessions from 2-5 per long session to exactly 1. It adds per-runtime session-ID deduplication via an in-memory Set<string>, and upgrades the per-projectKey lock file to detect stale locks using acquisition timestamps, unlinking stale ones to permit re-acquisition when needed.

Changes

Skillify spawn deduplication and stale lock handling

Layer / File(s)	Summary
Stale lock detection and cleanup `openclaw/src/index.ts`	Node fs imports expanded to include `unlinkSync`. Function `tryAcquireOpenclawSkillifyLock()` now writes acquisition timestamps into `${projectKey}.worker.lock`, treats missing/empty locks and locks older than `LOCK_MAX_AGE_MS` as stale, unlinks them, and avoids spawn when the lock is fresh or lock-file operations fail.
Per-session spawn deduplication and error handling `openclaw/src/index.ts`	Initialization of `skillifySpawnedFor: Set<string>()` to track spawned session IDs per runtime. The `agent_end` handler now checks this set and spawns the worker at most once per session_id within a single runtime, with try/catch logging to prevent spawn errors from interrupting capture.
Test regex tolerance update `tests/claude-code/skillify-session-start-injection.test.ts`	The spawn-site assertion regex that verifies `agent_end ... Auto-captured ... spawnOpenclawSkillifyWorker` now permits a larger gap (`{0,1500}` instead of `{0,500}`), and related inline comments were updated.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

#100: OpenClaw skilify worker re-spawns on every agent_end — This PR directly implements the proposed fix (option B) using an in-memory Set<string> to deduplicate spawns by session_id, reducing spawns from 2-5 per long session to 1, and adds timestamp-based stale lock detection to avoid deadlock from stuck lock files.

Poem

🐰 Spawning workers without the care,

Left locks scattered here and there.

With timestamps and sets held tight,

One worker spawns, done just right! 🔒✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	Title directly identifies both fixes: per-session dedup (`#100`) and stale-lock recovery (`#110`), matching the changeset's primary objectives.
Description check	✅ Passed	Description fully details both bug fixes, implementation changes, testing status, and post-merge validation plan; aligns with the provided template structure.
Linked Issues check	✅ Passed	Code changes implement the dedup Set pattern for `#100` and timestamp-based stale-lock recovery for `#110` as specified in objectives.
Out of Scope Changes check	✅ Passed	All changes in index.ts and the test regex adjustment directly support the two linked issue objectives; no unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/openclaw-tier2-spawn-lifecycle

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tests/claude-code/skillify-session-start-injection.test.ts (1)

240-243: ⚡ Quick win

Assert the new session-dedup guard explicitly.

This widened regex still only proves there's a spawn call near agent_end; it won't fail if skillifySpawnedFor.has/add disappear entirely. Since this PR's contract is “one spawn per session per runtime,” add direct assertions for the Set declaration and the has(sid) guard instead of only increasing the distance budget.

Example hardening

     expect(src).toMatch(/HIVEMIND_SKILLIFY_WORKER:\s*"1"/);
+    expect(src).toMatch(/const skillifySpawnedFor = new Set<string>\(\)/);
+    expect(src).toMatch(/if \(!skillifySpawnedFor\.has\(sid\)\)/);
+    expect(src).toMatch(/skillifySpawnedFor\.add\(sid\)/);
     expect(src).toMatch(/agent_end[\s\S]{0,3500}Auto-captured[\s\S]{0,1500}spawnOpenclawSkillifyWorker/);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/claude-code/skillify-session-start-injection.test.ts` around lines 240
- 243, The test currently only checks proximity between agent_end and
spawnOpenclawSkillifyWorker; add explicit assertions that the session-dedup
guard exists by asserting the Set declaration and its usage: assert the code
contains a declaration like "skillifySpawnedFor = new Set" (or "const
skillifySpawnedFor = new Set") and an explicit guard check
"skillifySpawnedFor.has(sid)" (e.g., `if (!skillifySpawnedFor.has(sid))`) as
well as the corresponding "skillifySpawnedFor.add(sid)" occurring before the
call to spawnOpenclawSkillifyWorker, so the test will fail if the Set or its
has/add logic are removed.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@openclaw/src/index.ts`:
- Around line 1208-1223: The current code marks a session as deduped by adding
sid to skillifySpawnedFor before spawnOpenclawSkillifyWorker can indicate
success, which blocks retries on non-exception failure paths; change
spawnOpenclawSkillifyWorker to return a boolean (true on worker actually
launched, false on non-exception early exits), update the caller to await the
result (if async) and only call skillifySpawnedFor.add(sid) when the function
returns true, while preserving the existing try/catch to log thrown errors
(logger.error) and still not add sid on failure.
- Around line 381-400: The race occurs because acquire() creates an empty lock
file before writing the timestamp, so a reader can see an empty string and treat
it as stale; update the staleness check in the read path (the code that reads
lockPath and interprets ts) to not immediately unlink on Number.isNaN(ts): call
fsStat(lockPath) and use the file's mtime (or birthtime) to compute fileAge, and
only treat NaN/empty contents as stale if fileAge > min(LOCK_MAX_AGE_MS,
GRACE_MS) (pick a short GRACE_MS like a few seconds) or if the fileAge >
LOCK_MAX_AGE_MS; alternatively, avoid the window entirely by replacing acquire()
with an atomic create+write (e.g., write with 'wx'/'O_EXCL|O_WRONLY' so
timestamp is written on creation), but if you keep the current acquire(),
implement the mtime check before unlinking to prevent unlinking freshly-created
empty files.

---

Nitpick comments:
In `@tests/claude-code/skillify-session-start-injection.test.ts`:
- Around line 240-243: The test currently only checks proximity between
agent_end and spawnOpenclawSkillifyWorker; add explicit assertions that the
session-dedup guard exists by asserting the Set declaration and its usage:
assert the code contains a declaration like "skillifySpawnedFor = new Set" (or
"const skillifySpawnedFor = new Set") and an explicit guard check
"skillifySpawnedFor.has(sid)" (e.g., `if (!skillifySpawnedFor.has(sid))`) as
well as the corresponding "skillifySpawnedFor.add(sid)" occurring before the
call to spawnOpenclawSkillifyWorker, so the test will fail if the Set or its
has/add logic are removed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: aba56b3c-5c10-409f-8979-6dcf914a6d47

📥 Commits

Reviewing files that changed from the base of the PR and between bdffdf0 and b164368.

📒 Files selected for processing (2)

openclaw/src/index.ts
tests/claude-code/skillify-session-start-injection.test.ts

coderabbitai · 2026-05-15T19:56:43Z

+    const acquire = (): boolean => {
+      const fd = fsOpen(lockPath, fsConstants.O_CREAT | fsConstants.O_EXCL | fsConstants.O_WRONLY);
+      try {
+        fsWriteFile(fd, String(Date.now()));
+      } finally {
+        fsClose(fd);
+      }
+      return true;
+    };
+    try {
+      return acquire();
+    } catch {
+      // O_EXCL failed → lock file already exists. Check staleness.
+      try {
+        const body = fsReadFile(lockPath, "utf-8");
+        const ts = Number.parseInt(body.trim(), 10);
+        const age = Date.now() - (Number.isFinite(ts) ? ts : 0);
+        if (!Number.isFinite(ts) || age > LOCK_MAX_AGE_MS) {
+          try { fsUnlink(lockPath); } catch { /* race; recheck below */ }
+          try { return acquire(); } catch { return false; }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid treating a freshly-created empty lock file as stale.

Line 382 creates the lock file before Line 384 writes the timestamp. A racing process that hits the read path during that window will see "", parse NaN, unlink the file, and acquire the same lock. That reintroduces overlapping workers across processes.

A simple fix is to only treat NaN / empty contents as stale when the file itself is older than the stale threshold (or at least older than a short grace window), instead of immediately unlinking on parse failure.

Suggested direction

-import { +import { existsSync as fsExists, mkdirSync as fsMkdir, openSync as fsOpen, closeSync as fsClose, writeFileSync as fsWriteFile, constants as fsConstants, - readFileSync as fsReadFile, renameSync as fsRename, unlinkSync as fsUnlink, + readFileSync as fsReadFile, renameSync as fsRename, statSync as fsStat, unlinkSync as fsUnlink, } from "node:fs"; ... try { const body = fsReadFile(lockPath, "utf-8"); const ts = Number.parseInt(body.trim(), 10); - const age = Date.now() - (Number.isFinite(ts) ? ts : 0); - if (!Number.isFinite(ts) || age > LOCK_MAX_AGE_MS) { + const age = Number.isFinite(ts) + ? Date.now() - ts + : Date.now() - fsStat(lockPath).mtimeMs; + if (age > LOCK_MAX_AGE_MS) { try { fsUnlink(lockPath); } catch { /* race; recheck below */ } try { return acquire(); } catch { return false; } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const acquire = (): boolean => {

const fd = fsOpen(lockPath, fsConstants.O_CREAT | fsConstants.O_EXCL | fsConstants.O_WRONLY);

try {

fsWriteFile(fd, String(Date.now()));

} finally {

fsClose(fd);

}

return true;

};

try {

return acquire();

} catch {

// O_EXCL failed → lock file already exists. Check staleness.

try {

const body = fsReadFile(lockPath, "utf-8");

const ts = Number.parseInt(body.trim(), 10);

const age = Date.now() - (Number.isFinite(ts) ? ts : 0);

if (!Number.isFinite(ts) || age > LOCK_MAX_AGE_MS) {

try { fsUnlink(lockPath); } catch { /* race; recheck below */ }

try { return acquire(); } catch { return false; }

const acquire = (): boolean => {

const fd = fsOpen(lockPath, fsConstants.O_CREAT | fsConstants.O_EXCL | fsConstants.O_WRONLY);

try {

fsWriteFile(fd, String(Date.now()));

} finally {

fsClose(fd);

}

return true;

};

try {

return acquire();

} catch {

// O_EXCL failed → lock file already exists. Check staleness.

try {

const body = fsReadFile(lockPath, "utf-8");

const ts = Number.parseInt(body.trim(), 10);

const age = Number.isFinite(ts)

? Date.now() - ts

: Date.now() - fsStat(lockPath).mtimeMs;

if (age > LOCK_MAX_AGE_MS) {

try { fsUnlink(lockPath); } catch { /* race; recheck below */ }

try { return acquire(); } catch { return false; }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openclaw/src/index.ts` around lines 381 - 400, The race occurs because acquire() creates an empty lock file before writing the timestamp, so a reader can see an empty string and treat it as stale; update the staleness check in the read path (the code that reads lockPath and interprets ts) to not immediately unlink on Number.isNaN(ts): call fsStat(lockPath) and use the file's mtime (or birthtime) to compute fileAge, and only treat NaN/empty contents as stale if fileAge > min(LOCK_MAX_AGE_MS, GRACE_MS) (pick a short GRACE_MS like a few seconds) or if the fileAge > LOCK_MAX_AGE_MS; alternatively, avoid the window entirely by replacing acquire() with an atomic create+write (e.g., write with 'wx'/'O_EXCL|O_WRONLY' so timestamp is written on creation), but if you keep the current acquire(), implement the mtime check before unlinking to prevent unlinking freshly-created empty files.

coderabbitai · 2026-05-15T19:56:43Z

+          if (!skillifySpawnedFor.has(sid)) {
+            skillifySpawnedFor.add(sid);
+            try {
+              spawnOpenclawSkillifyWorker({
+                apiUrl: cfg.apiUrl,
+                token: cfg.token,
+                orgId: cfg.orgId,
+                workspaceId: cfg.workspaceId,
+                userName: cfg.userName,
+                channel: ev.channel || "openclaw",
+                sessionId: sid,
+                loggerWarn: (msg) => logger.error(`Skillify spawn: ${msg}`),
+              });
+            } catch (e: any) {
+              logger.error(`Skillify spawn threw: ${e?.message ?? e}`);
+            }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Record the session only after a worker was actually launched.

Lines 1208-1209 mark the session as deduped before spawnOpenclawSkillifyWorker(...) can tell you whether anything was spawned. The helper has several non-exception failure paths (no delegate CLI, lock not acquired, config write failure, missing worker file), so one transient failure on the first turn suppresses every later retry for that session in this runtime.

Have spawnOpenclawSkillifyWorker return a boolean and add sid to skillifySpawnedFor only on success.

Suggested fix

-function spawnOpenclawSkillifyWorker(a: OpenclawSpawnArgs): void { +function spawnOpenclawSkillifyWorker(a: OpenclawSpawnArgs): boolean { if (!fsExists(OPENCLAW_SKILLIFY_WORKER_PATH)) { a.loggerWarn?.(`skillify worker missing at ${OPENCLAW_SKILLIFY_WORKER_PATH} — reinstall openclaw plugin`); - return; + return false; } const gateAgent = detectOpenclawGateAgent(); if (!gateAgent) { a.loggerWarn?.(`skillify spawn: no delegate gate CLI found on PATH (need one of: claude, codex, cursor-agent, hermes, pi). Mining skipped.`); - return; + return false; } const { key: projectKey, project } = deriveOpenclawProjectKey(a.channel); if (!tryAcquireOpenclawSkillifyLock(projectKey)) { - return; + return false; } ... - try { fsWriteFile(configPath, JSON.stringify(config), { mode: 0o600 }); } - catch (e: any) { a.loggerWarn?.(`skillify spawn: config write failed: ${e?.message ?? e}`); return; } + try { fsWriteFile(configPath, JSON.stringify(config), { mode: 0o600 }); } + catch (e: any) { a.loggerWarn?.(`skillify spawn: config write failed: ${e?.message ?? e}`); return false; } try { realSpawn(process.execPath, [OPENCLAW_SKILLIFY_WORKER_PATH, configPath], { detached: true, stdio: "ignore", env: { ...process.env, HIVEMIND_SKILLIFY_WORKER: "1", HIVEMIND_CAPTURE: "false" }, }).unref(); + return true; } catch (e: any) { a.loggerWarn?.(`skillify spawn: spawn failed: ${e?.message ?? e}`); + return false; } } ... if (!skillifySpawnedFor.has(sid)) { - skillifySpawnedFor.add(sid); try { - spawnOpenclawSkillifyWorker({ + if (spawnOpenclawSkillifyWorker({ apiUrl: cfg.apiUrl, token: cfg.token, orgId: cfg.orgId, workspaceId: cfg.workspaceId, userName: cfg.userName, channel: ev.channel || "openclaw", sessionId: sid, loggerWarn: (msg) => logger.error(`Skillify spawn: ${msg}`), - }); + })) { + skillifySpawnedFor.add(sid); + } } catch (e: any) { logger.error(`Skillify spawn threw: ${e?.message ?? e}`); } }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openclaw/src/index.ts` around lines 1208 - 1223, The current code marks a session as deduped by adding sid to skillifySpawnedFor before spawnOpenclawSkillifyWorker can indicate success, which blocks retries on non-exception failure paths; change spawnOpenclawSkillifyWorker to return a boolean (true on worker actually launched, false on non-exception early exits), update the caller to await the result (if async) and only call skillifySpawnedFor.add(sid) when the function returns true, while preserving the existing try/catch to log thrown errors (logger.error) and still not add sid on failure.

Both findings are real correctness bugs; both fixed. 1. Lock-creation race window (lines 376-407) The acquire path was: fd = O_CREAT|O_EXCL|O_WRONLY → fsWriteFile(fd, timestamp) → fsClose The file becomes visible after O_CREAT but BEFORE the timestamp write. A racing caller hitting the read path during that window saw an empty body, parsed NaN, treated it as 'immediately stale', unlinked the live lock, and acquired the same lock — overlapping workers across processes. Fix: when the parsed body is NaN/empty, fall back to the file's mtime to compute effective age. If the file is fresh (mtime within LOCK_MAX_AGE_MS), the competitor is mid-write and we yield. If it's older than the threshold, the previous holder genuinely crashed without writing the timestamp and we can recycle. Used statSync. 2. skillifySpawnedFor.add(sid) ran BEFORE spawn could fail spawnOpenclawSkillifyWorker has multiple non-exception failure paths (worker file missing, no delegate gate CLI, lock held by a fresh worker, mkdir/config write failure, spawn() throw). The previous code called unconditionally before invoking spawn — so one transient failure on the first turn suppressed every subsequent retry for that session in this runtime. Fix: changed spawnOpenclawSkillifyWorker's return type to boolean. Every failure path returns false; only the actual realSpawn().unref() returns true. Caller does so the session is recorded only on successful spawn. Also added explicit test assertions for the Set declaration + has/add guard so a future refactor that drops the dedup primitives fails the bundle-scan test (CodeRabbit nitpick on the test). Tests: 2380/2380 passing. Typecheck clean.

kaghni · 2026-05-18T06:05:20Z

Thanks @coderabbitai — both findings fixed on commit 358f759.

Fixed:

#	Finding	Action
1	Lock-creation race: O_CREAT	O_EXCL succeeds before timestamp is written; racing reader sees empty body → NaN → unlinks live lock → overlapping workers
2	`skillifySpawnedFor.add(sid)` ran BEFORE spawn could fail — one transient failure suppressed every retry for the session within the runtime	Changed `spawnOpenclawSkillifyWorker` return type from `void` to `boolean`. All non-exception failure paths (missing worker file, no delegate CLI, lock held by fresh worker, mkdir/config write failure, spawn throw) return `false`. Only successful `realSpawn(...).unref()` returns `true`. Caller does `if (spawn(...)) skillifySpawnedFor.add(sid)` — session recorded only on successful spawn.
3 (nit)	Test only checked proximity, not the actual dedup-Set primitives	Added explicit assertions in `tests/claude-code/skillify-session-start-injection.test.ts`: `const skillifySpawnedFor = new Set<string>()`, `if (!skillifySpawnedFor.has(sid))`, `skillifySpawnedFor.add(sid)`. A future refactor that drops the Set or its has/add usage will now fail this test.

2380/2380 tests pass. Typecheck clean. No bundle drift (openclaw/dist is gitignored).

coderabbitai · 2026-05-18T06:05:44Z

Tip

For best results, initiate chat on the files or code changes.

@kaghni All three fixes look correct and well-reasoned — thanks for the thorough write-up.

#1 (race): The mtime fallback is the right approach here. Using two independent signals (body timestamp + filesystem mtime) means the "stale" verdict requires both to agree, closing the window where a mid-write empty body was incorrectly treated as a legacy stale lock.
#2 (premature add): The boolean return type is a clean contract — every non-exception failure path returns false and the Set is only updated on a confirmed realSpawn(...).unref(). This correctly restores retry-on-transient-failure behaviour within a session.
#3 (nit): The explicit has/add assertions are exactly the right level of structural coverage — proximity checks alone couldn't catch a rename or removal of the Set.

Everything looks solid. 🐇✨

Both findings are real correctness bugs; both fixed. 1. Lock-creation race window (lines 376-407) The acquire path was: fd = O_CREAT|O_EXCL|O_WRONLY → fsWriteFile(fd, timestamp) → fsClose The file becomes visible after O_CREAT but BEFORE the timestamp write. A racing caller hitting the read path during that window saw an empty body, parsed NaN, treated it as 'immediately stale', unlinked the live lock, and acquired the same lock — overlapping workers across processes. Fix: when the parsed body is NaN/empty, fall back to the file's mtime to compute effective age. If the file is fresh (mtime within LOCK_MAX_AGE_MS), the competitor is mid-write and we yield. If it's older than the threshold, the previous holder genuinely crashed without writing the timestamp and we can recycle. Used statSync. 2. skillifySpawnedFor.add(sid) ran BEFORE spawn could fail spawnOpenclawSkillifyWorker has multiple non-exception failure paths (worker file missing, no delegate gate CLI, lock held by a fresh worker, mkdir/config write failure, spawn() throw). The previous code called unconditionally before invoking spawn — so one transient failure on the first turn suppressed every subsequent retry for that session in this runtime. Fix: changed spawnOpenclawSkillifyWorker's return type to boolean. Every failure path returns false; only the actual realSpawn().unref() returns true. Caller does so the session is recorded only on successful spawn. Also added explicit test assertions for the Set declaration + has/add guard so a future refactor that drops the dedup primitives fails the bundle-scan test (CodeRabbit nitpick on the test). Tests: 2380/2380 passing. Typecheck clean.

efenocchi · 2026-05-18T06:37:27Z

+ * whether spawn succeeded, suppressing retries forever within the
+ * runtime.
+ */
+function spawnOpenclawSkillifyWorker(a: OpenclawSpawnArgs): boolean {


why do we have the specific function?

@efenocchi yes — openclaw has its own spawn function instead of reusing the shared src/skillify/spawn-skillify-worker.ts. The rationale lives in the comment block right above the openclaw spawn helper (openclaw/src/index.ts:300-310):

OpenClaw can't import the shared skillify TS modules — its bundle is stubbed for child_process and code-splits the gateway. Inline the spawn shape here, keyed off the bundled sibling skillify-worker.js.

Two specific constraints driving the duplication:

node:child_process is stubbed in the openclaw main bundle by the stub-unused-child-process esbuild plugin (esbuild.config.mjs:280-301). That stub replaces spawn/execSync/execFileSync with no-op exports to drop CC-specific dead code (e.g. browser-opening for SSO). The shared spawn helper would inherit the stub and silently fail. Openclaw works around it by using createRequire to bypass esbuild and pull the real spawn at runtime (openclaw/src/index.ts:78-80).

openclaw/src/index.ts uses code splitting (chunks/), and the shared spawn helper transitively imports modules the openclaw chunk graph deliberately avoids. Inlining the spawn keeps the openclaw bundle's chunk topology under control.

It IS code duplication, deliberately. The two functions stay in sync via the (slightly-pun-aware) header comment mirror of src/skillify/spawn-skillify-worker.ts plus the shared test (tests/claude-code/skillify-session-start-injection.test.ts asserts both surfaces have the same wiring).

If we ever drop the child_process stub on the openclaw main bundle, we can DRY this — but the stub itself is also load-bearing for the ClawHub static scan (PR #170), so it'd be a meaningful architectural shift, not a quick refactor.

#110) Two related spawn-lifecycle bugs in `openclaw/src/index.ts`: mine a session once. The existing on-disk lock prevents *overlapping* workers, but as soon as a worker exits and releases the lock, the NEXT `agent_end` (same session, same project_key) re-acquires it and spawns a fresh worker. The fresh worker does one watermark-check SQL round-trip and exits, but each spawn still costs Node cold-start + DB I/O. A 50-turn session ends up doing 2-5 spawns instead of 1. `O_CREAT | O_EXCL | O_WRONLY` to create the lock file and treats any pre-existing lock as "live worker, skip." There's no staleness check. If a worker dies abnormally (host kill, OOM, segfault) before its `finally`-block releases the lock, the empty lock file persists forever and EVERY subsequent `agent_end` silently no-ops mining permanently for that project_key. Hit live during the 2026-05-07 PR #98 E2E — a manual `rm <lockfile>` was needed to recover. Fixes: openclaw/src/index.ts: spawnOpenclawSkillifyWorker call site * Wrapped in `if (!skillifySpawnedFor.has(sid))` guard. New module-level `Set<string>` of session IDs that have already triggered a spawn in this runtime. ~5 lines, O(active-sessions) memory. Matches the spawn cadence the non-openclaw agents already get via their `tryAcquireWorkerLock` patterns in `src/skillify/state.ts`. The on-disk lock stays authoritative across processes; this Set only suppresses within-runtime redundancy. openclaw/src/index.ts: tryAcquireOpenclawSkillifyLock * Lock file now writes `String(Date.now())` on acquire (was an empty file before). * On O_EXCL failure, reads the existing lock and parses its ms timestamp. If `Date.now() - ts > 10 minutes` OR the body is unparseable, the lock is treated as stale → unlinked → retry acquire. Mirrors `src/skillify/state.ts:tryAcquireWorkerLock`. * Empty pre-existing lock files (left by older code) parse as NaN and are treated as immediately stale — clean migration on first patched run, no manual cleanup needed. * 10-minute max age is generous vs typical worker runtime (<30s + safety buffer); pathological hangs longer than that unlock the spawn slot for the next agent_end without leaking mining for the rest of the gateway's lifetime. Test harness: tests/claude-code/skillify-session-start-injection.test.ts * One regex distance bumped 500→1500 chars to accommodate the new dedup comment block between `Auto-captured` and the spawn site. Same intent — confirms agent_end calls spawn after the capture loop, just with the doc block landed. Tests: 2380/2380 passing. Fixes #100 Fixes #110

Both findings are real correctness bugs; both fixed. 1. Lock-creation race window (lines 376-407) The acquire path was: fd = O_CREAT|O_EXCL|O_WRONLY → fsWriteFile(fd, timestamp) → fsClose The file becomes visible after O_CREAT but BEFORE the timestamp write. A racing caller hitting the read path during that window saw an empty body, parsed NaN, treated it as 'immediately stale', unlinked the live lock, and acquired the same lock — overlapping workers across processes. Fix: when the parsed body is NaN/empty, fall back to the file's mtime to compute effective age. If the file is fresh (mtime within LOCK_MAX_AGE_MS), the competitor is mid-write and we yield. If it's older than the threshold, the previous holder genuinely crashed without writing the timestamp and we can recycle. Used statSync. 2. skillifySpawnedFor.add(sid) ran BEFORE spawn could fail spawnOpenclawSkillifyWorker has multiple non-exception failure paths (worker file missing, no delegate gate CLI, lock held by a fresh worker, mkdir/config write failure, spawn() throw). The previous code called unconditionally before invoking spawn — so one transient failure on the first turn suppressed every subsequent retry for that session in this runtime. Fix: changed spawnOpenclawSkillifyWorker's return type to boolean. Every failure path returns false; only the actual realSpawn().unref() returns true. Caller does so the session is recorded only on successful spawn. Also added explicit test assertions for the Set declaration + has/add guard so a future refactor that drops the dedup primitives fails the bundle-scan test (CodeRabbit nitpick on the test). Tests: 2380/2380 passing. Typecheck clean.

`release.yml:250` runs `npm run audit:openclaw -- --criticals-only` as a hard gate before `npm publish` and `clawhub package publish`. That correctly stops a flagged bundle from shipping, but it runs post-merge — a critical regression would already be in main by then. Mirror the same step in ci.yml so a PR that re-trips the ClawHub static-scan rules (env-harvesting, dangerous-exec, etc.) fails its required check before merge. Placed right after the existing build step so openclaw/dist/ is populated. `--criticals-only` matches release.yml's flag: warns remain advisory (the skillify-worker's irreducible readFileSync + fetch warn is the canonical example).

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

kaghni force-pushed the fix/openclaw-tier2-spawn-lifecycle branch from 358f759 to 5020f4b Compare May 18, 2026 06:25

efenocchi reviewed May 18, 2026

View reviewed changes

efenocchi approved these changes May 18, 2026

View reviewed changes

kaghni added 2 commits May 18, 2026 17:43

kaghni force-pushed the fix/openclaw-tier2-spawn-lifecycle branch from 5020f4b to 7fd7d3a Compare May 18, 2026 17:47

kaghni merged commit 9383fea into main May 18, 2026
4 checks passed

Conversation

kaghni commented May 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What changed

Per-runtime dedup (#100)

Stale-lock recovery (#110)

Tests

Test plan after merge

Summary by CodeRabbit

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coverage Report

Uh oh!

coderabbitai Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

kaghni commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026

Uh oh!

efenocchi May 18, 2026

Choose a reason for hiding this comment

Uh oh!

kaghni May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kaghni commented May 15, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot commented May 15, 2026 •

edited

Loading

coderabbitai Bot commented May 15, 2026 •

edited

Loading