Skip to content

openclaw: dedup skillify spawn per-session + stale-lock recovery (#100 + #110)#172

Merged
kaghni merged 3 commits into
mainfrom
fix/openclaw-tier2-spawn-lifecycle
May 18, 2026
Merged

openclaw: dedup skillify spawn per-session + stale-lock recovery (#100 + #110)#172
kaghni merged 3 commits into
mainfrom
fix/openclaw-tier2-spawn-lifecycle

Conversation

@kaghni
Copy link
Copy Markdown
Collaborator

@kaghni kaghni commented May 15, 2026

Fixes #100 and #110.

Why

Two spawn-lifecycle bugs in openclaw/src/index.ts:

#100 — Wasted re-spawns: agent_end fires on every turn. The on-disk lock at ~/.deeplake/state/skillify/<projectKey>.worker.lock prevents overlapping workers, but as soon as a worker exits and releases its lock, the NEXT agent_end re-acquires it and spawns a fresh worker. The fresh worker does one watermark-check SQL roundtrip, sees nothing new to mine, and exits — but each spawn costs ~50ms Node cold-start + ~200ms DB I/O. A 50-turn session ends up doing 2-5 spawns instead of 1.

#110 — Stale locks halt mining permanently: tryAcquireOpenclawSkillifyLock does O_CREAT | O_EXCL | O_WRONLY and treats any pre-existing lock as "live worker, skip." There's no staleness check. If a worker dies abnormally (host kill, OOM, segfault) before its finally releases the lock, the lock persists forever and every subsequent agent_end silently no-ops mining for that project_key permanently. Hit live during the 2026-05-07 PR #98 E2E — a manual rm <lockfile> was needed to recover.

What changed

Per-runtime dedup (#100)

  • New module-level const skillifySpawnedFor = new Set<string>(). Tracks which session IDs have already triggered a spawn in this gateway runtime.
  • agent_end handler now wraps the spawnOpenclawSkillifyWorker(...) call in if (!skillifySpawnedFor.has(sid)) { skillifySpawnedFor.add(sid); … }.
  • The on-disk lock stays authoritative across processes (e.g. multiple gateway restarts). The new in-memory Set only suppresses within-runtime redundancy.

Stale-lock recovery (#110)

  • Lock file now writes String(Date.now()) on acquire (was an empty file).
  • On O_EXCL failure, reads the existing lock body, parses it as a ms timestamp. If Date.now() - ts > 10 minutes OR the body is unparseable (NaN), the lock is treated as stale → unlinked → retry acquire.
  • Mirrors the staleness logic in src/skillify/state.ts:tryAcquireWorkerLock for the non-openclaw agents.
  • Migration: empty pre-existing lock files (from earlier code) parse as NaN and are treated as immediately stale on the first patched run — no manual cleanup needed.
  • 10-minute max age is generous vs typical worker runtime (<30s + buffer). Pathological hangs longer than that release the spawn slot to the next agent_end, instead of leaking mining for the rest of the gateway's lifetime.

Tests

  • npm run typecheck — clean
  • npm test2380/2380 passing (one bundle-scan regex distance bumped 500→1500 to accommodate the new dedup comment block between Auto-captured and the spawn site; same assertion intent)

Test plan after merge

  • Long-running openclaw session (50+ turns). grep -c "Auto-captured" /tmp/openclaw/openclaw-*.log should be many; ls ~/.deeplake/state/skillify/*.worker.lock should show at most one mtime-bump per session (one spawn, not 2-5).
  • Kill a worker mid-mine (kill -9 $WORKER_PID). Wait 11 minutes. Next agent_end should successfully re-acquire the lock (stale-recovery path).

Summary by CodeRabbit

  • Bug Fixes

    • Improved reliability of background worker spawning in extended agent sessions by preventing redundant spawn attempts
    • Enhanced detection and cleanup of stale worker states
    • Added error handling to gracefully manage worker startup failures
  • Tests

    • Updated test validations for worker spawning behavior

Review Change Stack

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

Coverage Report

No src/*.ts files changed in this PR.

Generated for commit 5e8ef00.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4a57e427-77a7-45d8-b0ff-6f7bcbbb0141

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR reduces redundant skillify worker spawns in OpenClaw sessions from 2-5 per long session to exactly 1. It adds per-runtime session-ID deduplication via an in-memory Set<string>, and upgrades the per-projectKey lock file to detect stale locks using acquisition timestamps, unlinking stale ones to permit re-acquisition when needed.

Changes

Skillify spawn deduplication and stale lock handling

Layer / File(s) Summary
Stale lock detection and cleanup
openclaw/src/index.ts
Node fs imports expanded to include unlinkSync. Function tryAcquireOpenclawSkillifyLock() now writes acquisition timestamps into ${projectKey}.worker.lock, treats missing/empty locks and locks older than LOCK_MAX_AGE_MS as stale, unlinks them, and avoids spawn when the lock is fresh or lock-file operations fail.
Per-session spawn deduplication and error handling
openclaw/src/index.ts
Initialization of skillifySpawnedFor: Set<string>() to track spawned session IDs per runtime. The agent_end handler now checks this set and spawns the worker at most once per session_id within a single runtime, with try/catch logging to prevent spawn errors from interrupting capture.
Test regex tolerance update
tests/claude-code/skillify-session-start-injection.test.ts
The spawn-site assertion regex that verifies agent_end ... Auto-captured ... spawnOpenclawSkillifyWorker now permits a larger gap ({0,1500} instead of {0,500}), and related inline comments were updated.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

  • #100: OpenClaw skilify worker re-spawns on every agent_end — This PR directly implements the proposed fix (option B) using an in-memory Set<string> to deduplicate spawns by session_id, reducing spawns from 2-5 per long session to 1, and adds timestamp-based stale lock detection to avoid deadlock from stuck lock files.

Poem

🐰 Spawning workers without the care,

Left locks scattered here and there.

With timestamps and sets held tight,

One worker spawns, done just right! 🔒✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title directly identifies both fixes: per-session dedup (#100) and stale-lock recovery (#110), matching the changeset's primary objectives.
Description check ✅ Passed Description fully details both bug fixes, implementation changes, testing status, and post-merge validation plan; aligns with the provided template structure.
Linked Issues check ✅ Passed Code changes implement the dedup Set pattern for #100 and timestamp-based stale-lock recovery for #110 as specified in objectives.
Out of Scope Changes check ✅ Passed All changes in index.ts and the test regex adjustment directly support the two linked issue objectives; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/openclaw-tier2-spawn-lifecycle

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/claude-code/skillify-session-start-injection.test.ts (1)

240-243: ⚡ Quick win

Assert the new session-dedup guard explicitly.

This widened regex still only proves there's a spawn call near agent_end; it won't fail if skillifySpawnedFor.has/add disappear entirely. Since this PR's contract is “one spawn per session per runtime,” add direct assertions for the Set declaration and the has(sid) guard instead of only increasing the distance budget.

Example hardening
     expect(src).toMatch(/HIVEMIND_SKILLIFY_WORKER:\s*"1"/);
+    expect(src).toMatch(/const skillifySpawnedFor = new Set<string>\(\)/);
+    expect(src).toMatch(/if \(!skillifySpawnedFor\.has\(sid\)\)/);
+    expect(src).toMatch(/skillifySpawnedFor\.add\(sid\)/);
     expect(src).toMatch(/agent_end[\s\S]{0,3500}Auto-captured[\s\S]{0,1500}spawnOpenclawSkillifyWorker/);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/claude-code/skillify-session-start-injection.test.ts` around lines 240
- 243, The test currently only checks proximity between agent_end and
spawnOpenclawSkillifyWorker; add explicit assertions that the session-dedup
guard exists by asserting the Set declaration and its usage: assert the code
contains a declaration like "skillifySpawnedFor = new Set" (or "const
skillifySpawnedFor = new Set") and an explicit guard check
"skillifySpawnedFor.has(sid)" (e.g., `if (!skillifySpawnedFor.has(sid))`) as
well as the corresponding "skillifySpawnedFor.add(sid)" occurring before the
call to spawnOpenclawSkillifyWorker, so the test will fail if the Set or its
has/add logic are removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@openclaw/src/index.ts`:
- Around line 1208-1223: The current code marks a session as deduped by adding
sid to skillifySpawnedFor before spawnOpenclawSkillifyWorker can indicate
success, which blocks retries on non-exception failure paths; change
spawnOpenclawSkillifyWorker to return a boolean (true on worker actually
launched, false on non-exception early exits), update the caller to await the
result (if async) and only call skillifySpawnedFor.add(sid) when the function
returns true, while preserving the existing try/catch to log thrown errors
(logger.error) and still not add sid on failure.
- Around line 381-400: The race occurs because acquire() creates an empty lock
file before writing the timestamp, so a reader can see an empty string and treat
it as stale; update the staleness check in the read path (the code that reads
lockPath and interprets ts) to not immediately unlink on Number.isNaN(ts): call
fsStat(lockPath) and use the file's mtime (or birthtime) to compute fileAge, and
only treat NaN/empty contents as stale if fileAge > min(LOCK_MAX_AGE_MS,
GRACE_MS) (pick a short GRACE_MS like a few seconds) or if the fileAge >
LOCK_MAX_AGE_MS; alternatively, avoid the window entirely by replacing acquire()
with an atomic create+write (e.g., write with 'wx'/'O_EXCL|O_WRONLY' so
timestamp is written on creation), but if you keep the current acquire(),
implement the mtime check before unlinking to prevent unlinking freshly-created
empty files.

---

Nitpick comments:
In `@tests/claude-code/skillify-session-start-injection.test.ts`:
- Around line 240-243: The test currently only checks proximity between
agent_end and spawnOpenclawSkillifyWorker; add explicit assertions that the
session-dedup guard exists by asserting the Set declaration and its usage:
assert the code contains a declaration like "skillifySpawnedFor = new Set" (or
"const skillifySpawnedFor = new Set") and an explicit guard check
"skillifySpawnedFor.has(sid)" (e.g., `if (!skillifySpawnedFor.has(sid))`) as
well as the corresponding "skillifySpawnedFor.add(sid)" occurring before the
call to spawnOpenclawSkillifyWorker, so the test will fail if the Set or its
has/add logic are removed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: aba56b3c-5c10-409f-8979-6dcf914a6d47

📥 Commits

Reviewing files that changed from the base of the PR and between bdffdf0 and b164368.

📒 Files selected for processing (2)
  • openclaw/src/index.ts
  • tests/claude-code/skillify-session-start-injection.test.ts

Comment thread openclaw/src/index.ts
Comment on lines +381 to +400
const acquire = (): boolean => {
const fd = fsOpen(lockPath, fsConstants.O_CREAT | fsConstants.O_EXCL | fsConstants.O_WRONLY);
try {
fsWriteFile(fd, String(Date.now()));
} finally {
fsClose(fd);
}
return true;
};
try {
return acquire();
} catch {
// O_EXCL failed → lock file already exists. Check staleness.
try {
const body = fsReadFile(lockPath, "utf-8");
const ts = Number.parseInt(body.trim(), 10);
const age = Date.now() - (Number.isFinite(ts) ? ts : 0);
if (!Number.isFinite(ts) || age > LOCK_MAX_AGE_MS) {
try { fsUnlink(lockPath); } catch { /* race; recheck below */ }
try { return acquire(); } catch { return false; }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid treating a freshly-created empty lock file as stale.

Line 382 creates the lock file before Line 384 writes the timestamp. A racing process that hits the read path during that window will see "", parse NaN, unlink the file, and acquire the same lock. That reintroduces overlapping workers across processes.

A simple fix is to only treat NaN / empty contents as stale when the file itself is older than the stale threshold (or at least older than a short grace window), instead of immediately unlinking on parse failure.

Suggested direction
-import {
+import {
   existsSync as fsExists, mkdirSync as fsMkdir, openSync as fsOpen,
   closeSync as fsClose, writeFileSync as fsWriteFile, constants as fsConstants,
-  readFileSync as fsReadFile, renameSync as fsRename, unlinkSync as fsUnlink,
+  readFileSync as fsReadFile, renameSync as fsRename, statSync as fsStat, unlinkSync as fsUnlink,
 } from "node:fs";
...
       try {
         const body = fsReadFile(lockPath, "utf-8");
         const ts = Number.parseInt(body.trim(), 10);
-        const age = Date.now() - (Number.isFinite(ts) ? ts : 0);
-        if (!Number.isFinite(ts) || age > LOCK_MAX_AGE_MS) {
+        const age = Number.isFinite(ts)
+          ? Date.now() - ts
+          : Date.now() - fsStat(lockPath).mtimeMs;
+        if (age > LOCK_MAX_AGE_MS) {
           try { fsUnlink(lockPath); } catch { /* race; recheck below */ }
           try { return acquire(); } catch { return false; }
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const acquire = (): boolean => {
const fd = fsOpen(lockPath, fsConstants.O_CREAT | fsConstants.O_EXCL | fsConstants.O_WRONLY);
try {
fsWriteFile(fd, String(Date.now()));
} finally {
fsClose(fd);
}
return true;
};
try {
return acquire();
} catch {
// O_EXCL failed → lock file already exists. Check staleness.
try {
const body = fsReadFile(lockPath, "utf-8");
const ts = Number.parseInt(body.trim(), 10);
const age = Date.now() - (Number.isFinite(ts) ? ts : 0);
if (!Number.isFinite(ts) || age > LOCK_MAX_AGE_MS) {
try { fsUnlink(lockPath); } catch { /* race; recheck below */ }
try { return acquire(); } catch { return false; }
const acquire = (): boolean => {
const fd = fsOpen(lockPath, fsConstants.O_CREAT | fsConstants.O_EXCL | fsConstants.O_WRONLY);
try {
fsWriteFile(fd, String(Date.now()));
} finally {
fsClose(fd);
}
return true;
};
try {
return acquire();
} catch {
// O_EXCL failed → lock file already exists. Check staleness.
try {
const body = fsReadFile(lockPath, "utf-8");
const ts = Number.parseInt(body.trim(), 10);
const age = Number.isFinite(ts)
? Date.now() - ts
: Date.now() - fsStat(lockPath).mtimeMs;
if (age > LOCK_MAX_AGE_MS) {
try { fsUnlink(lockPath); } catch { /* race; recheck below */ }
try { return acquire(); } catch { return false; }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openclaw/src/index.ts` around lines 381 - 400, The race occurs because
acquire() creates an empty lock file before writing the timestamp, so a reader
can see an empty string and treat it as stale; update the staleness check in the
read path (the code that reads lockPath and interprets ts) to not immediately
unlink on Number.isNaN(ts): call fsStat(lockPath) and use the file's mtime (or
birthtime) to compute fileAge, and only treat NaN/empty contents as stale if
fileAge > min(LOCK_MAX_AGE_MS, GRACE_MS) (pick a short GRACE_MS like a few
seconds) or if the fileAge > LOCK_MAX_AGE_MS; alternatively, avoid the window
entirely by replacing acquire() with an atomic create+write (e.g., write with
'wx'/'O_EXCL|O_WRONLY' so timestamp is written on creation), but if you keep the
current acquire(), implement the mtime check before unlinking to prevent
unlinking freshly-created empty files.

Comment thread openclaw/src/index.ts
Comment on lines +1208 to +1223
if (!skillifySpawnedFor.has(sid)) {
skillifySpawnedFor.add(sid);
try {
spawnOpenclawSkillifyWorker({
apiUrl: cfg.apiUrl,
token: cfg.token,
orgId: cfg.orgId,
workspaceId: cfg.workspaceId,
userName: cfg.userName,
channel: ev.channel || "openclaw",
sessionId: sid,
loggerWarn: (msg) => logger.error(`Skillify spawn: ${msg}`),
});
} catch (e: any) {
logger.error(`Skillify spawn threw: ${e?.message ?? e}`);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Record the session only after a worker was actually launched.

Lines 1208-1209 mark the session as deduped before spawnOpenclawSkillifyWorker(...) can tell you whether anything was spawned. The helper has several non-exception failure paths (no delegate CLI, lock not acquired, config write failure, missing worker file), so one transient failure on the first turn suppresses every later retry for that session in this runtime.

Have spawnOpenclawSkillifyWorker return a boolean and add sid to skillifySpawnedFor only on success.

Suggested fix
-function spawnOpenclawSkillifyWorker(a: OpenclawSpawnArgs): void {
+function spawnOpenclawSkillifyWorker(a: OpenclawSpawnArgs): boolean {
   if (!fsExists(OPENCLAW_SKILLIFY_WORKER_PATH)) {
     a.loggerWarn?.(`skillify worker missing at ${OPENCLAW_SKILLIFY_WORKER_PATH} — reinstall openclaw plugin`);
-    return;
+    return false;
   }
   const gateAgent = detectOpenclawGateAgent();
   if (!gateAgent) {
     a.loggerWarn?.(`skillify spawn: no delegate gate CLI found on PATH (need one of: claude, codex, cursor-agent, hermes, pi). Mining skipped.`);
-    return;
+    return false;
   }
   const { key: projectKey, project } = deriveOpenclawProjectKey(a.channel);
   if (!tryAcquireOpenclawSkillifyLock(projectKey)) {
-    return;
+    return false;
   }
   ...
-  try { fsWriteFile(configPath, JSON.stringify(config), { mode: 0o600 }); }
-  catch (e: any) { a.loggerWarn?.(`skillify spawn: config write failed: ${e?.message ?? e}`); return; }
+  try { fsWriteFile(configPath, JSON.stringify(config), { mode: 0o600 }); }
+  catch (e: any) { a.loggerWarn?.(`skillify spawn: config write failed: ${e?.message ?? e}`); return false; }

   try {
     realSpawn(process.execPath, [OPENCLAW_SKILLIFY_WORKER_PATH, configPath], {
       detached: true,
       stdio: "ignore",
       env: { ...process.env, HIVEMIND_SKILLIFY_WORKER: "1", HIVEMIND_CAPTURE: "false" },
     }).unref();
+    return true;
   } catch (e: any) {
     a.loggerWarn?.(`skillify spawn: spawn failed: ${e?.message ?? e}`);
+    return false;
   }
 }
...
           if (!skillifySpawnedFor.has(sid)) {
-            skillifySpawnedFor.add(sid);
             try {
-              spawnOpenclawSkillifyWorker({
+              if (spawnOpenclawSkillifyWorker({
                 apiUrl: cfg.apiUrl,
                 token: cfg.token,
                 orgId: cfg.orgId,
                 workspaceId: cfg.workspaceId,
                 userName: cfg.userName,
                 channel: ev.channel || "openclaw",
                 sessionId: sid,
                 loggerWarn: (msg) => logger.error(`Skillify spawn: ${msg}`),
-              });
+              })) {
+                skillifySpawnedFor.add(sid);
+              }
             } catch (e: any) {
               logger.error(`Skillify spawn threw: ${e?.message ?? e}`);
             }
           }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openclaw/src/index.ts` around lines 1208 - 1223, The current code marks a
session as deduped by adding sid to skillifySpawnedFor before
spawnOpenclawSkillifyWorker can indicate success, which blocks retries on
non-exception failure paths; change spawnOpenclawSkillifyWorker to return a
boolean (true on worker actually launched, false on non-exception early exits),
update the caller to await the result (if async) and only call
skillifySpawnedFor.add(sid) when the function returns true, while preserving the
existing try/catch to log thrown errors (logger.error) and still not add sid on
failure.

kaghni added a commit that referenced this pull request May 18, 2026
Both findings are real correctness bugs; both fixed.

1. Lock-creation race window (lines 376-407)

The acquire path was:
  fd = O_CREAT|O_EXCL|O_WRONLY → fsWriteFile(fd, timestamp) → fsClose
The file becomes visible after O_CREAT but BEFORE the timestamp
write. A racing caller hitting the read path during that window saw
an empty body, parsed NaN, treated it as 'immediately stale',
unlinked the live lock, and acquired the same lock — overlapping
workers across processes.

Fix: when the parsed body is NaN/empty, fall back to the file's
mtime to compute effective age. If the file is fresh (mtime within
LOCK_MAX_AGE_MS), the competitor is mid-write and we yield. If it's
older than the threshold, the previous holder genuinely crashed
without writing the timestamp and we can recycle. Used statSync.

2. skillifySpawnedFor.add(sid) ran BEFORE spawn could fail

spawnOpenclawSkillifyWorker has multiple non-exception failure paths
(worker file missing, no delegate gate CLI, lock held by a fresh
worker, mkdir/config write failure, spawn() throw). The previous
code called  unconditionally before invoking spawn — so
one transient failure on the first turn suppressed every subsequent
retry for that session in this runtime.

Fix: changed spawnOpenclawSkillifyWorker's return type to boolean.
Every failure path returns false; only the actual realSpawn().unref()
returns true. Caller does
so the session is recorded only on successful spawn.

Also added explicit test assertions for the Set declaration + has/add
guard so a future refactor that drops the dedup primitives fails the
bundle-scan test (CodeRabbit nitpick on the test).

Tests: 2380/2380 passing. Typecheck clean.
@kaghni
Copy link
Copy Markdown
Collaborator Author

kaghni commented May 18, 2026

Thanks @coderabbitai — both findings fixed on commit 358f759.

Fixed:

# Finding Action
1 Lock-creation race: O_CREAT O_EXCL succeeds before timestamp is written; racing reader sees empty body → NaN → unlinks live lock → overlapping workers
2 skillifySpawnedFor.add(sid) ran BEFORE spawn could fail — one transient failure suppressed every retry for the session within the runtime Changed spawnOpenclawSkillifyWorker return type from void to boolean. All non-exception failure paths (missing worker file, no delegate CLI, lock held by fresh worker, mkdir/config write failure, spawn throw) return false. Only successful realSpawn(...).unref() returns true. Caller does if (spawn(...)) skillifySpawnedFor.add(sid) — session recorded only on successful spawn.
3 (nit) Test only checked proximity, not the actual dedup-Set primitives Added explicit assertions in tests/claude-code/skillify-session-start-injection.test.ts: const skillifySpawnedFor = new Set<string>(), if (!skillifySpawnedFor.has(sid)), skillifySpawnedFor.add(sid). A future refactor that drops the Set or its has/add usage will now fail this test.

2380/2380 tests pass. Typecheck clean. No bundle drift (openclaw/dist is gitignored).

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Tip

For best results, initiate chat on the files or code changes.

@kaghni All three fixes look correct and well-reasoned — thanks for the thorough write-up.

  • #1 (race): The mtime fallback is the right approach here. Using two independent signals (body timestamp + filesystem mtime) means the "stale" verdict requires both to agree, closing the window where a mid-write empty body was incorrectly treated as a legacy stale lock.
  • #2 (premature add): The boolean return type is a clean contract — every non-exception failure path returns false and the Set is only updated on a confirmed realSpawn(...).unref(). This correctly restores retry-on-transient-failure behaviour within a session.
  • #3 (nit): The explicit has/add assertions are exactly the right level of structural coverage — proximity checks alone couldn't catch a rename or removal of the Set.

Everything looks solid. 🐇✨

@kaghni kaghni force-pushed the fix/openclaw-tier2-spawn-lifecycle branch from 358f759 to 5020f4b Compare May 18, 2026 06:25
kaghni added a commit that referenced this pull request May 18, 2026
Both findings are real correctness bugs; both fixed.

1. Lock-creation race window (lines 376-407)

The acquire path was:
  fd = O_CREAT|O_EXCL|O_WRONLY → fsWriteFile(fd, timestamp) → fsClose
The file becomes visible after O_CREAT but BEFORE the timestamp
write. A racing caller hitting the read path during that window saw
an empty body, parsed NaN, treated it as 'immediately stale',
unlinked the live lock, and acquired the same lock — overlapping
workers across processes.

Fix: when the parsed body is NaN/empty, fall back to the file's
mtime to compute effective age. If the file is fresh (mtime within
LOCK_MAX_AGE_MS), the competitor is mid-write and we yield. If it's
older than the threshold, the previous holder genuinely crashed
without writing the timestamp and we can recycle. Used statSync.

2. skillifySpawnedFor.add(sid) ran BEFORE spawn could fail

spawnOpenclawSkillifyWorker has multiple non-exception failure paths
(worker file missing, no delegate gate CLI, lock held by a fresh
worker, mkdir/config write failure, spawn() throw). The previous
code called  unconditionally before invoking spawn — so
one transient failure on the first turn suppressed every subsequent
retry for that session in this runtime.

Fix: changed spawnOpenclawSkillifyWorker's return type to boolean.
Every failure path returns false; only the actual realSpawn().unref()
returns true. Caller does
so the session is recorded only on successful spawn.

Also added explicit test assertions for the Set declaration + has/add
guard so a future refactor that drops the dedup primitives fails the
bundle-scan test (CodeRabbit nitpick on the test).

Tests: 2380/2380 passing. Typecheck clean.
Comment thread openclaw/src/index.ts
* whether spawn succeeded, suppressing retries forever within the
* runtime.
*/
function spawnOpenclawSkillifyWorker(a: OpenclawSpawnArgs): boolean {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have the specific function?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efenocchi yes — openclaw has its own spawn function instead of reusing the shared src/skillify/spawn-skillify-worker.ts. The rationale lives in the comment block right above the openclaw spawn helper (openclaw/src/index.ts:300-310):

OpenClaw can't import the shared skillify TS modules — its bundle is stubbed for child_process and code-splits the gateway. Inline the spawn shape here, keyed off the bundled sibling skillify-worker.js.

Two specific constraints driving the duplication:

  1. node:child_process is stubbed in the openclaw main bundle by the stub-unused-child-process esbuild plugin (esbuild.config.mjs:280-301). That stub replaces spawn/execSync/execFileSync with no-op exports to drop CC-specific dead code (e.g. browser-opening for SSO). The shared spawn helper would inherit the stub and silently fail. Openclaw works around it by using createRequire to bypass esbuild and pull the real spawn at runtime (openclaw/src/index.ts:78-80).

  2. openclaw/src/index.ts uses code splitting (chunks/), and the shared spawn helper transitively imports modules the openclaw chunk graph deliberately avoids. Inlining the spawn keeps the openclaw bundle's chunk topology under control.

It IS code duplication, deliberately. The two functions stay in sync via the (slightly-pun-aware) header comment mirror of src/skillify/spawn-skillify-worker.ts plus the shared test (tests/claude-code/skillify-session-start-injection.test.ts asserts both surfaces have the same wiring).

If we ever drop the child_process stub on the openclaw main bundle, we can DRY this — but the stub itself is also load-bearing for the ClawHub static scan (PR #170), so it'd be a meaningful architectural shift, not a quick refactor.

kaghni added 2 commits May 18, 2026 17:43
#110)

Two related spawn-lifecycle bugs in `openclaw/src/index.ts`:

mine a session once. The existing on-disk lock prevents *overlapping*
workers, but as soon as a worker exits and releases the lock, the
NEXT `agent_end` (same session, same project_key) re-acquires it and
spawns a fresh worker. The fresh worker does one watermark-check SQL
round-trip and exits, but each spawn still costs Node cold-start +
DB I/O. A 50-turn session ends up doing 2-5 spawns instead of 1.

`O_CREAT | O_EXCL | O_WRONLY` to create the lock file and treats any
pre-existing lock as "live worker, skip." There's no staleness check.
If a worker dies abnormally (host kill, OOM, segfault) before its
`finally`-block releases the lock, the empty lock file persists
forever and EVERY subsequent `agent_end` silently no-ops mining
permanently for that project_key. Hit live during the 2026-05-07
PR #98 E2E — a manual `rm <lockfile>` was needed to recover.

Fixes:

  openclaw/src/index.ts: spawnOpenclawSkillifyWorker call site
   * Wrapped in `if (!skillifySpawnedFor.has(sid))` guard. New
     module-level `Set<string>` of session IDs that have already
     triggered a spawn in this runtime. ~5 lines, O(active-sessions)
     memory. Matches the spawn cadence the non-openclaw agents
     already get via their `tryAcquireWorkerLock` patterns in
     `src/skillify/state.ts`. The on-disk lock stays authoritative
     across processes; this Set only suppresses within-runtime
     redundancy.

  openclaw/src/index.ts: tryAcquireOpenclawSkillifyLock
   * Lock file now writes `String(Date.now())` on acquire (was an
     empty file before).
   * On O_EXCL failure, reads the existing lock and parses its ms
     timestamp. If `Date.now() - ts > 10 minutes` OR the body is
     unparseable, the lock is treated as stale → unlinked → retry
     acquire. Mirrors `src/skillify/state.ts:tryAcquireWorkerLock`.
   * Empty pre-existing lock files (left by older code) parse as
     NaN and are treated as immediately stale — clean migration on
     first patched run, no manual cleanup needed.
   * 10-minute max age is generous vs typical worker runtime
     (<30s + safety buffer); pathological hangs longer than that
     unlock the spawn slot for the next agent_end without leaking
     mining for the rest of the gateway's lifetime.

Test harness:
  tests/claude-code/skillify-session-start-injection.test.ts
   * One regex distance bumped 500→1500 chars to accommodate the
     new dedup comment block between `Auto-captured` and the
     spawn site. Same intent — confirms agent_end calls spawn
     after the capture loop, just with the doc block landed.

Tests: 2380/2380 passing.

Fixes #100
Fixes #110
Both findings are real correctness bugs; both fixed.

1. Lock-creation race window (lines 376-407)

The acquire path was:
  fd = O_CREAT|O_EXCL|O_WRONLY → fsWriteFile(fd, timestamp) → fsClose
The file becomes visible after O_CREAT but BEFORE the timestamp
write. A racing caller hitting the read path during that window saw
an empty body, parsed NaN, treated it as 'immediately stale',
unlinked the live lock, and acquired the same lock — overlapping
workers across processes.

Fix: when the parsed body is NaN/empty, fall back to the file's
mtime to compute effective age. If the file is fresh (mtime within
LOCK_MAX_AGE_MS), the competitor is mid-write and we yield. If it's
older than the threshold, the previous holder genuinely crashed
without writing the timestamp and we can recycle. Used statSync.

2. skillifySpawnedFor.add(sid) ran BEFORE spawn could fail

spawnOpenclawSkillifyWorker has multiple non-exception failure paths
(worker file missing, no delegate gate CLI, lock held by a fresh
worker, mkdir/config write failure, spawn() throw). The previous
code called  unconditionally before invoking spawn — so
one transient failure on the first turn suppressed every subsequent
retry for that session in this runtime.

Fix: changed spawnOpenclawSkillifyWorker's return type to boolean.
Every failure path returns false; only the actual realSpawn().unref()
returns true. Caller does
so the session is recorded only on successful spawn.

Also added explicit test assertions for the Set declaration + has/add
guard so a future refactor that drops the dedup primitives fails the
bundle-scan test (CodeRabbit nitpick on the test).

Tests: 2380/2380 passing. Typecheck clean.
@kaghni kaghni force-pushed the fix/openclaw-tier2-spawn-lifecycle branch from 5020f4b to 7fd7d3a Compare May 18, 2026 17:47
`release.yml:250` runs `npm run audit:openclaw -- --criticals-only`
as a hard gate before `npm publish` and `clawhub package publish`.
That correctly stops a flagged bundle from shipping, but it runs
post-merge — a critical regression would already be in main by
then. Mirror the same step in ci.yml so a PR that re-trips the
ClawHub static-scan rules (env-harvesting, dangerous-exec, etc.)
fails its required check before merge.

Placed right after the existing build step so openclaw/dist/ is
populated. `--criticals-only` matches release.yml's flag: warns
remain advisory (the skillify-worker's irreducible readFileSync +
fetch warn is the canonical example).
@kaghni kaghni merged commit 9383fea into main May 18, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OpenClaw skilify worker re-spawns on every agent_end (cumulative ~2-5 wasted spawns per long session)

2 participants