openclaw: dedup skillify spawn per-session + stale-lock recovery (#100 + #110)#172
Conversation
Coverage ReportNo Generated for commit 5e8ef00. |
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe PR reduces redundant skillify worker spawns in OpenClaw sessions from 2-5 per long session to exactly 1. It adds per-runtime session-ID deduplication via an in-memory ChangesSkillify spawn deduplication and stale lock handling
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/claude-code/skillify-session-start-injection.test.ts (1)
240-243: ⚡ Quick winAssert the new session-dedup guard explicitly.
This widened regex still only proves there's a spawn call near
agent_end; it won't fail ifskillifySpawnedFor.has/adddisappear entirely. Since this PR's contract is “one spawn per session per runtime,” add direct assertions for the Set declaration and thehas(sid)guard instead of only increasing the distance budget.Example hardening
expect(src).toMatch(/HIVEMIND_SKILLIFY_WORKER:\s*"1"/); + expect(src).toMatch(/const skillifySpawnedFor = new Set<string>\(\)/); + expect(src).toMatch(/if \(!skillifySpawnedFor\.has\(sid\)\)/); + expect(src).toMatch(/skillifySpawnedFor\.add\(sid\)/); expect(src).toMatch(/agent_end[\s\S]{0,3500}Auto-captured[\s\S]{0,1500}spawnOpenclawSkillifyWorker/);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/claude-code/skillify-session-start-injection.test.ts` around lines 240 - 243, The test currently only checks proximity between agent_end and spawnOpenclawSkillifyWorker; add explicit assertions that the session-dedup guard exists by asserting the Set declaration and its usage: assert the code contains a declaration like "skillifySpawnedFor = new Set" (or "const skillifySpawnedFor = new Set") and an explicit guard check "skillifySpawnedFor.has(sid)" (e.g., `if (!skillifySpawnedFor.has(sid))`) as well as the corresponding "skillifySpawnedFor.add(sid)" occurring before the call to spawnOpenclawSkillifyWorker, so the test will fail if the Set or its has/add logic are removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@openclaw/src/index.ts`:
- Around line 1208-1223: The current code marks a session as deduped by adding
sid to skillifySpawnedFor before spawnOpenclawSkillifyWorker can indicate
success, which blocks retries on non-exception failure paths; change
spawnOpenclawSkillifyWorker to return a boolean (true on worker actually
launched, false on non-exception early exits), update the caller to await the
result (if async) and only call skillifySpawnedFor.add(sid) when the function
returns true, while preserving the existing try/catch to log thrown errors
(logger.error) and still not add sid on failure.
- Around line 381-400: The race occurs because acquire() creates an empty lock
file before writing the timestamp, so a reader can see an empty string and treat
it as stale; update the staleness check in the read path (the code that reads
lockPath and interprets ts) to not immediately unlink on Number.isNaN(ts): call
fsStat(lockPath) and use the file's mtime (or birthtime) to compute fileAge, and
only treat NaN/empty contents as stale if fileAge > min(LOCK_MAX_AGE_MS,
GRACE_MS) (pick a short GRACE_MS like a few seconds) or if the fileAge >
LOCK_MAX_AGE_MS; alternatively, avoid the window entirely by replacing acquire()
with an atomic create+write (e.g., write with 'wx'/'O_EXCL|O_WRONLY' so
timestamp is written on creation), but if you keep the current acquire(),
implement the mtime check before unlinking to prevent unlinking freshly-created
empty files.
---
Nitpick comments:
In `@tests/claude-code/skillify-session-start-injection.test.ts`:
- Around line 240-243: The test currently only checks proximity between
agent_end and spawnOpenclawSkillifyWorker; add explicit assertions that the
session-dedup guard exists by asserting the Set declaration and its usage:
assert the code contains a declaration like "skillifySpawnedFor = new Set" (or
"const skillifySpawnedFor = new Set") and an explicit guard check
"skillifySpawnedFor.has(sid)" (e.g., `if (!skillifySpawnedFor.has(sid))`) as
well as the corresponding "skillifySpawnedFor.add(sid)" occurring before the
call to spawnOpenclawSkillifyWorker, so the test will fail if the Set or its
has/add logic are removed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: aba56b3c-5c10-409f-8979-6dcf914a6d47
📒 Files selected for processing (2)
openclaw/src/index.tstests/claude-code/skillify-session-start-injection.test.ts
| const acquire = (): boolean => { | ||
| const fd = fsOpen(lockPath, fsConstants.O_CREAT | fsConstants.O_EXCL | fsConstants.O_WRONLY); | ||
| try { | ||
| fsWriteFile(fd, String(Date.now())); | ||
| } finally { | ||
| fsClose(fd); | ||
| } | ||
| return true; | ||
| }; | ||
| try { | ||
| return acquire(); | ||
| } catch { | ||
| // O_EXCL failed → lock file already exists. Check staleness. | ||
| try { | ||
| const body = fsReadFile(lockPath, "utf-8"); | ||
| const ts = Number.parseInt(body.trim(), 10); | ||
| const age = Date.now() - (Number.isFinite(ts) ? ts : 0); | ||
| if (!Number.isFinite(ts) || age > LOCK_MAX_AGE_MS) { | ||
| try { fsUnlink(lockPath); } catch { /* race; recheck below */ } | ||
| try { return acquire(); } catch { return false; } |
There was a problem hiding this comment.
Avoid treating a freshly-created empty lock file as stale.
Line 382 creates the lock file before Line 384 writes the timestamp. A racing process that hits the read path during that window will see "", parse NaN, unlink the file, and acquire the same lock. That reintroduces overlapping workers across processes.
A simple fix is to only treat NaN / empty contents as stale when the file itself is older than the stale threshold (or at least older than a short grace window), instead of immediately unlinking on parse failure.
Suggested direction
-import {
+import {
existsSync as fsExists, mkdirSync as fsMkdir, openSync as fsOpen,
closeSync as fsClose, writeFileSync as fsWriteFile, constants as fsConstants,
- readFileSync as fsReadFile, renameSync as fsRename, unlinkSync as fsUnlink,
+ readFileSync as fsReadFile, renameSync as fsRename, statSync as fsStat, unlinkSync as fsUnlink,
} from "node:fs";
...
try {
const body = fsReadFile(lockPath, "utf-8");
const ts = Number.parseInt(body.trim(), 10);
- const age = Date.now() - (Number.isFinite(ts) ? ts : 0);
- if (!Number.isFinite(ts) || age > LOCK_MAX_AGE_MS) {
+ const age = Number.isFinite(ts)
+ ? Date.now() - ts
+ : Date.now() - fsStat(lockPath).mtimeMs;
+ if (age > LOCK_MAX_AGE_MS) {
try { fsUnlink(lockPath); } catch { /* race; recheck below */ }
try { return acquire(); } catch { return false; }
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const acquire = (): boolean => { | |
| const fd = fsOpen(lockPath, fsConstants.O_CREAT | fsConstants.O_EXCL | fsConstants.O_WRONLY); | |
| try { | |
| fsWriteFile(fd, String(Date.now())); | |
| } finally { | |
| fsClose(fd); | |
| } | |
| return true; | |
| }; | |
| try { | |
| return acquire(); | |
| } catch { | |
| // O_EXCL failed → lock file already exists. Check staleness. | |
| try { | |
| const body = fsReadFile(lockPath, "utf-8"); | |
| const ts = Number.parseInt(body.trim(), 10); | |
| const age = Date.now() - (Number.isFinite(ts) ? ts : 0); | |
| if (!Number.isFinite(ts) || age > LOCK_MAX_AGE_MS) { | |
| try { fsUnlink(lockPath); } catch { /* race; recheck below */ } | |
| try { return acquire(); } catch { return false; } | |
| const acquire = (): boolean => { | |
| const fd = fsOpen(lockPath, fsConstants.O_CREAT | fsConstants.O_EXCL | fsConstants.O_WRONLY); | |
| try { | |
| fsWriteFile(fd, String(Date.now())); | |
| } finally { | |
| fsClose(fd); | |
| } | |
| return true; | |
| }; | |
| try { | |
| return acquire(); | |
| } catch { | |
| // O_EXCL failed → lock file already exists. Check staleness. | |
| try { | |
| const body = fsReadFile(lockPath, "utf-8"); | |
| const ts = Number.parseInt(body.trim(), 10); | |
| const age = Number.isFinite(ts) | |
| ? Date.now() - ts | |
| : Date.now() - fsStat(lockPath).mtimeMs; | |
| if (age > LOCK_MAX_AGE_MS) { | |
| try { fsUnlink(lockPath); } catch { /* race; recheck below */ } | |
| try { return acquire(); } catch { return false; } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@openclaw/src/index.ts` around lines 381 - 400, The race occurs because
acquire() creates an empty lock file before writing the timestamp, so a reader
can see an empty string and treat it as stale; update the staleness check in the
read path (the code that reads lockPath and interprets ts) to not immediately
unlink on Number.isNaN(ts): call fsStat(lockPath) and use the file's mtime (or
birthtime) to compute fileAge, and only treat NaN/empty contents as stale if
fileAge > min(LOCK_MAX_AGE_MS, GRACE_MS) (pick a short GRACE_MS like a few
seconds) or if the fileAge > LOCK_MAX_AGE_MS; alternatively, avoid the window
entirely by replacing acquire() with an atomic create+write (e.g., write with
'wx'/'O_EXCL|O_WRONLY' so timestamp is written on creation), but if you keep the
current acquire(), implement the mtime check before unlinking to prevent
unlinking freshly-created empty files.
| if (!skillifySpawnedFor.has(sid)) { | ||
| skillifySpawnedFor.add(sid); | ||
| try { | ||
| spawnOpenclawSkillifyWorker({ | ||
| apiUrl: cfg.apiUrl, | ||
| token: cfg.token, | ||
| orgId: cfg.orgId, | ||
| workspaceId: cfg.workspaceId, | ||
| userName: cfg.userName, | ||
| channel: ev.channel || "openclaw", | ||
| sessionId: sid, | ||
| loggerWarn: (msg) => logger.error(`Skillify spawn: ${msg}`), | ||
| }); | ||
| } catch (e: any) { | ||
| logger.error(`Skillify spawn threw: ${e?.message ?? e}`); | ||
| } |
There was a problem hiding this comment.
Record the session only after a worker was actually launched.
Lines 1208-1209 mark the session as deduped before spawnOpenclawSkillifyWorker(...) can tell you whether anything was spawned. The helper has several non-exception failure paths (no delegate CLI, lock not acquired, config write failure, missing worker file), so one transient failure on the first turn suppresses every later retry for that session in this runtime.
Have spawnOpenclawSkillifyWorker return a boolean and add sid to skillifySpawnedFor only on success.
Suggested fix
-function spawnOpenclawSkillifyWorker(a: OpenclawSpawnArgs): void {
+function spawnOpenclawSkillifyWorker(a: OpenclawSpawnArgs): boolean {
if (!fsExists(OPENCLAW_SKILLIFY_WORKER_PATH)) {
a.loggerWarn?.(`skillify worker missing at ${OPENCLAW_SKILLIFY_WORKER_PATH} — reinstall openclaw plugin`);
- return;
+ return false;
}
const gateAgent = detectOpenclawGateAgent();
if (!gateAgent) {
a.loggerWarn?.(`skillify spawn: no delegate gate CLI found on PATH (need one of: claude, codex, cursor-agent, hermes, pi). Mining skipped.`);
- return;
+ return false;
}
const { key: projectKey, project } = deriveOpenclawProjectKey(a.channel);
if (!tryAcquireOpenclawSkillifyLock(projectKey)) {
- return;
+ return false;
}
...
- try { fsWriteFile(configPath, JSON.stringify(config), { mode: 0o600 }); }
- catch (e: any) { a.loggerWarn?.(`skillify spawn: config write failed: ${e?.message ?? e}`); return; }
+ try { fsWriteFile(configPath, JSON.stringify(config), { mode: 0o600 }); }
+ catch (e: any) { a.loggerWarn?.(`skillify spawn: config write failed: ${e?.message ?? e}`); return false; }
try {
realSpawn(process.execPath, [OPENCLAW_SKILLIFY_WORKER_PATH, configPath], {
detached: true,
stdio: "ignore",
env: { ...process.env, HIVEMIND_SKILLIFY_WORKER: "1", HIVEMIND_CAPTURE: "false" },
}).unref();
+ return true;
} catch (e: any) {
a.loggerWarn?.(`skillify spawn: spawn failed: ${e?.message ?? e}`);
+ return false;
}
}
...
if (!skillifySpawnedFor.has(sid)) {
- skillifySpawnedFor.add(sid);
try {
- spawnOpenclawSkillifyWorker({
+ if (spawnOpenclawSkillifyWorker({
apiUrl: cfg.apiUrl,
token: cfg.token,
orgId: cfg.orgId,
workspaceId: cfg.workspaceId,
userName: cfg.userName,
channel: ev.channel || "openclaw",
sessionId: sid,
loggerWarn: (msg) => logger.error(`Skillify spawn: ${msg}`),
- });
+ })) {
+ skillifySpawnedFor.add(sid);
+ }
} catch (e: any) {
logger.error(`Skillify spawn threw: ${e?.message ?? e}`);
}
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@openclaw/src/index.ts` around lines 1208 - 1223, The current code marks a
session as deduped by adding sid to skillifySpawnedFor before
spawnOpenclawSkillifyWorker can indicate success, which blocks retries on
non-exception failure paths; change spawnOpenclawSkillifyWorker to return a
boolean (true on worker actually launched, false on non-exception early exits),
update the caller to await the result (if async) and only call
skillifySpawnedFor.add(sid) when the function returns true, while preserving the
existing try/catch to log thrown errors (logger.error) and still not add sid on
failure.
Both findings are real correctness bugs; both fixed. 1. Lock-creation race window (lines 376-407) The acquire path was: fd = O_CREAT|O_EXCL|O_WRONLY → fsWriteFile(fd, timestamp) → fsClose The file becomes visible after O_CREAT but BEFORE the timestamp write. A racing caller hitting the read path during that window saw an empty body, parsed NaN, treated it as 'immediately stale', unlinked the live lock, and acquired the same lock — overlapping workers across processes. Fix: when the parsed body is NaN/empty, fall back to the file's mtime to compute effective age. If the file is fresh (mtime within LOCK_MAX_AGE_MS), the competitor is mid-write and we yield. If it's older than the threshold, the previous holder genuinely crashed without writing the timestamp and we can recycle. Used statSync. 2. skillifySpawnedFor.add(sid) ran BEFORE spawn could fail spawnOpenclawSkillifyWorker has multiple non-exception failure paths (worker file missing, no delegate gate CLI, lock held by a fresh worker, mkdir/config write failure, spawn() throw). The previous code called unconditionally before invoking spawn — so one transient failure on the first turn suppressed every subsequent retry for that session in this runtime. Fix: changed spawnOpenclawSkillifyWorker's return type to boolean. Every failure path returns false; only the actual realSpawn().unref() returns true. Caller does so the session is recorded only on successful spawn. Also added explicit test assertions for the Set declaration + has/add guard so a future refactor that drops the dedup primitives fails the bundle-scan test (CodeRabbit nitpick on the test). Tests: 2380/2380 passing. Typecheck clean.
|
Thanks @coderabbitai — both findings fixed on commit 358f759. Fixed:
2380/2380 tests pass. Typecheck clean. No bundle drift (openclaw/dist is gitignored). |
|
Tip For best results, initiate chat on the files or code changes.
Everything looks solid. 🐇✨ |
358f759 to
5020f4b
Compare
Both findings are real correctness bugs; both fixed. 1. Lock-creation race window (lines 376-407) The acquire path was: fd = O_CREAT|O_EXCL|O_WRONLY → fsWriteFile(fd, timestamp) → fsClose The file becomes visible after O_CREAT but BEFORE the timestamp write. A racing caller hitting the read path during that window saw an empty body, parsed NaN, treated it as 'immediately stale', unlinked the live lock, and acquired the same lock — overlapping workers across processes. Fix: when the parsed body is NaN/empty, fall back to the file's mtime to compute effective age. If the file is fresh (mtime within LOCK_MAX_AGE_MS), the competitor is mid-write and we yield. If it's older than the threshold, the previous holder genuinely crashed without writing the timestamp and we can recycle. Used statSync. 2. skillifySpawnedFor.add(sid) ran BEFORE spawn could fail spawnOpenclawSkillifyWorker has multiple non-exception failure paths (worker file missing, no delegate gate CLI, lock held by a fresh worker, mkdir/config write failure, spawn() throw). The previous code called unconditionally before invoking spawn — so one transient failure on the first turn suppressed every subsequent retry for that session in this runtime. Fix: changed spawnOpenclawSkillifyWorker's return type to boolean. Every failure path returns false; only the actual realSpawn().unref() returns true. Caller does so the session is recorded only on successful spawn. Also added explicit test assertions for the Set declaration + has/add guard so a future refactor that drops the dedup primitives fails the bundle-scan test (CodeRabbit nitpick on the test). Tests: 2380/2380 passing. Typecheck clean.
| * whether spawn succeeded, suppressing retries forever within the | ||
| * runtime. | ||
| */ | ||
| function spawnOpenclawSkillifyWorker(a: OpenclawSpawnArgs): boolean { |
There was a problem hiding this comment.
why do we have the specific function?
There was a problem hiding this comment.
@efenocchi yes — openclaw has its own spawn function instead of reusing the shared src/skillify/spawn-skillify-worker.ts. The rationale lives in the comment block right above the openclaw spawn helper (openclaw/src/index.ts:300-310):
OpenClaw can't import the shared skillify TS modules — its bundle is stubbed for child_process and code-splits the gateway. Inline the spawn shape here, keyed off the bundled sibling
skillify-worker.js.
Two specific constraints driving the duplication:
-
node:child_processis stubbed in the openclaw main bundle by thestub-unused-child-processesbuild plugin (esbuild.config.mjs:280-301). That stub replacesspawn/execSync/execFileSyncwith no-op exports to drop CC-specific dead code (e.g. browser-opening for SSO). The shared spawn helper would inherit the stub and silently fail. Openclaw works around it by usingcreateRequireto bypass esbuild and pull the realspawnat runtime (openclaw/src/index.ts:78-80). -
openclaw/src/index.tsuses code splitting (chunks/), and the shared spawn helper transitively imports modules the openclaw chunk graph deliberately avoids. Inlining the spawn keeps the openclaw bundle's chunk topology under control.
It IS code duplication, deliberately. The two functions stay in sync via the (slightly-pun-aware) header comment mirror of src/skillify/spawn-skillify-worker.ts plus the shared test (tests/claude-code/skillify-session-start-injection.test.ts asserts both surfaces have the same wiring).
If we ever drop the child_process stub on the openclaw main bundle, we can DRY this — but the stub itself is also load-bearing for the ClawHub static scan (PR #170), so it'd be a meaningful architectural shift, not a quick refactor.
#110) Two related spawn-lifecycle bugs in `openclaw/src/index.ts`: mine a session once. The existing on-disk lock prevents *overlapping* workers, but as soon as a worker exits and releases the lock, the NEXT `agent_end` (same session, same project_key) re-acquires it and spawns a fresh worker. The fresh worker does one watermark-check SQL round-trip and exits, but each spawn still costs Node cold-start + DB I/O. A 50-turn session ends up doing 2-5 spawns instead of 1. `O_CREAT | O_EXCL | O_WRONLY` to create the lock file and treats any pre-existing lock as "live worker, skip." There's no staleness check. If a worker dies abnormally (host kill, OOM, segfault) before its `finally`-block releases the lock, the empty lock file persists forever and EVERY subsequent `agent_end` silently no-ops mining permanently for that project_key. Hit live during the 2026-05-07 PR #98 E2E — a manual `rm <lockfile>` was needed to recover. Fixes: openclaw/src/index.ts: spawnOpenclawSkillifyWorker call site * Wrapped in `if (!skillifySpawnedFor.has(sid))` guard. New module-level `Set<string>` of session IDs that have already triggered a spawn in this runtime. ~5 lines, O(active-sessions) memory. Matches the spawn cadence the non-openclaw agents already get via their `tryAcquireWorkerLock` patterns in `src/skillify/state.ts`. The on-disk lock stays authoritative across processes; this Set only suppresses within-runtime redundancy. openclaw/src/index.ts: tryAcquireOpenclawSkillifyLock * Lock file now writes `String(Date.now())` on acquire (was an empty file before). * On O_EXCL failure, reads the existing lock and parses its ms timestamp. If `Date.now() - ts > 10 minutes` OR the body is unparseable, the lock is treated as stale → unlinked → retry acquire. Mirrors `src/skillify/state.ts:tryAcquireWorkerLock`. * Empty pre-existing lock files (left by older code) parse as NaN and are treated as immediately stale — clean migration on first patched run, no manual cleanup needed. * 10-minute max age is generous vs typical worker runtime (<30s + safety buffer); pathological hangs longer than that unlock the spawn slot for the next agent_end without leaking mining for the rest of the gateway's lifetime. Test harness: tests/claude-code/skillify-session-start-injection.test.ts * One regex distance bumped 500→1500 chars to accommodate the new dedup comment block between `Auto-captured` and the spawn site. Same intent — confirms agent_end calls spawn after the capture loop, just with the doc block landed. Tests: 2380/2380 passing. Fixes #100 Fixes #110
Both findings are real correctness bugs; both fixed. 1. Lock-creation race window (lines 376-407) The acquire path was: fd = O_CREAT|O_EXCL|O_WRONLY → fsWriteFile(fd, timestamp) → fsClose The file becomes visible after O_CREAT but BEFORE the timestamp write. A racing caller hitting the read path during that window saw an empty body, parsed NaN, treated it as 'immediately stale', unlinked the live lock, and acquired the same lock — overlapping workers across processes. Fix: when the parsed body is NaN/empty, fall back to the file's mtime to compute effective age. If the file is fresh (mtime within LOCK_MAX_AGE_MS), the competitor is mid-write and we yield. If it's older than the threshold, the previous holder genuinely crashed without writing the timestamp and we can recycle. Used statSync. 2. skillifySpawnedFor.add(sid) ran BEFORE spawn could fail spawnOpenclawSkillifyWorker has multiple non-exception failure paths (worker file missing, no delegate gate CLI, lock held by a fresh worker, mkdir/config write failure, spawn() throw). The previous code called unconditionally before invoking spawn — so one transient failure on the first turn suppressed every subsequent retry for that session in this runtime. Fix: changed spawnOpenclawSkillifyWorker's return type to boolean. Every failure path returns false; only the actual realSpawn().unref() returns true. Caller does so the session is recorded only on successful spawn. Also added explicit test assertions for the Set declaration + has/add guard so a future refactor that drops the dedup primitives fails the bundle-scan test (CodeRabbit nitpick on the test). Tests: 2380/2380 passing. Typecheck clean.
5020f4b to
7fd7d3a
Compare
`release.yml:250` runs `npm run audit:openclaw -- --criticals-only` as a hard gate before `npm publish` and `clawhub package publish`. That correctly stops a flagged bundle from shipping, but it runs post-merge — a critical regression would already be in main by then. Mirror the same step in ci.yml so a PR that re-trips the ClawHub static-scan rules (env-harvesting, dangerous-exec, etc.) fails its required check before merge. Placed right after the existing build step so openclaw/dist/ is populated. `--criticals-only` matches release.yml's flag: warns remain advisory (the skillify-worker's irreducible readFileSync + fetch warn is the canonical example).
Fixes #100 and #110.
Why
Two spawn-lifecycle bugs in
openclaw/src/index.ts:#100 — Wasted re-spawns:
agent_endfires on every turn. The on-disk lock at~/.deeplake/state/skillify/<projectKey>.worker.lockprevents overlapping workers, but as soon as a worker exits and releases its lock, the NEXTagent_endre-acquires it and spawns a fresh worker. The fresh worker does one watermark-check SQL roundtrip, sees nothing new to mine, and exits — but each spawn costs ~50ms Node cold-start + ~200ms DB I/O. A 50-turn session ends up doing 2-5 spawns instead of 1.#110 — Stale locks halt mining permanently:
tryAcquireOpenclawSkillifyLockdoesO_CREAT | O_EXCL | O_WRONLYand treats any pre-existing lock as "live worker, skip." There's no staleness check. If a worker dies abnormally (host kill, OOM, segfault) before itsfinallyreleases the lock, the lock persists forever and every subsequentagent_endsilently no-ops mining for thatproject_keypermanently. Hit live during the 2026-05-07 PR #98 E2E — a manualrm <lockfile>was needed to recover.What changed
Per-runtime dedup (#100)
const skillifySpawnedFor = new Set<string>(). Tracks which session IDs have already triggered a spawn in this gateway runtime.agent_endhandler now wraps thespawnOpenclawSkillifyWorker(...)call inif (!skillifySpawnedFor.has(sid)) { skillifySpawnedFor.add(sid); … }.Stale-lock recovery (#110)
String(Date.now())on acquire (was an empty file).O_EXCLfailure, reads the existing lock body, parses it as a ms timestamp. IfDate.now() - ts > 10 minutesOR the body is unparseable (NaN), the lock is treated as stale → unlinked → retry acquire.src/skillify/state.ts:tryAcquireWorkerLockfor the non-openclaw agents.NaNand are treated as immediately stale on the first patched run — no manual cleanup needed.agent_end, instead of leaking mining for the rest of the gateway's lifetime.Tests
npm run typecheck— cleannpm test— 2380/2380 passing (one bundle-scan regex distance bumped 500→1500 to accommodate the new dedup comment block betweenAuto-capturedand the spawn site; same assertion intent)Test plan after merge
grep -c "Auto-captured" /tmp/openclaw/openclaw-*.logshould be many;ls ~/.deeplake/state/skillify/*.worker.lockshould show at most one mtime-bump per session (one spawn, not 2-5).kill -9 $WORKER_PID). Wait 11 minutes. Nextagent_endshould successfully re-acquire the lock (stale-recovery path).Summary by CodeRabbit
Bug Fixes
Tests