Releases: activeloopai/hivemind

v0.7.36 — fix(embeddings): pi spawn-on-miss + openclaw embedding producer (#178)

19 May 21:28

Closes #178. Follow-up to PR #168 — surfaced during review by @kaghni who flagged that pi and openclaw had no/minimal changes despite the "embeddings fully wired across agents" framing.

What this lands

Three pieces of work, followed by two rounds of Codex review fixes, separated into focused commits per the repo's "never >3 src files in one commit across layers" rule:

1. src/embeddings/standalone-embed-client.ts + tests — c9478ec

New helper tryEmbedStandalone(text, kind) for agents that don't bundle a daemon of their own (pi extension source, openclaw plugin). Mirrors the spawn-on-miss state machine in src/embeddings/client.ts but stripped:

  • No hello/handshake. Read-only consumers never recycle a stuck daemon; recycling is the hot-path client's job, and two recycle paths would race.
  • No singleton, no notification side-effects.
  • No SIGTERM on a live-PID pidfile with a missing socket — same PID-reuse risk PR #168 fixed in client.ts.

Coverage threshold added at the client.ts tier (90/80/90/90).

2. Pi spawn-on-miss bug fix — 17f9435

Pi's existing embed() called spawn(...) bare — no O_EXCL pidfile lock, no respect for an alive owner. Two concurrent pi turns (or pi racing another agent at SessionStart) both spawned a daemon; the second crashed on bind. The header comment block described the canonical behavior but the code didn't implement it.

Replaces both tryEmbedOverSocket (connect-only) and the inline spawn loop with a single spawn-on-miss state machine mirroring the shared helper. embed() collapses to env-check → empty-check → tryEmbedOverSocket.

3. OpenClaw embedding producer — 8d7df3d

OpenClaw previously omitted message_embedding from every sessions INSERT — semantic recall on openclaw sessions was broken because every row landed NULL.

Now openclaw imports tryEmbedStandalone and embeds the captured message before INSERT. The helper imports spawn from node:child_process at the top level, which the openclaw esbuild config replaces with a no-op stub. Without the real spawn, the auto-spawn-on-miss fallback silently does nothing. Fix: openclaw already has realSpawn from createRequire(import.meta.url); we inject it into the helper at module load via _setSpawnImpl (renamed from _setSpawnForTesting to reflect its two legitimate use cases — tests AND bundle environments stubbing node:child_process).
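The injection seam described above can be sketched roughly as follows. Only `_setSpawnImpl` and `realSpawn` are names from these notes; `SpawnFn`, `spawnImpl`, `spawnDaemon`, the daemon path, and the stub's exact behaviour are hypothetical stand-ins, not the helper's actual code.

```typescript
// Minimal sketch of a spawn-injection seam. Assumption: the bundler's stub
// for node:child_process behaves like a function that starts nothing.
type SpawnFn = (command: string, args: string[]) => { pid?: number };

// Stands in for the no-op `spawn` the bundled helper would otherwise see.
let spawnImpl: SpawnFn = () => ({});

// Called once at module load by hosts that hold a real spawn
// (and by tests that want a fake one).
function _setSpawnImpl(fn: SpawnFn): void {
  spawnImpl = fn;
}

// Hypothetical caller: returns the child PID, or null when nothing started.
function spawnDaemon(binPath: string): number | null {
  const child = spawnImpl(binPath, []);
  return child.pid ?? null;
}

// With only the stub in place, the spawn-on-miss path silently does nothing.
const beforeInjection = spawnDaemon("/path/to/embed-daemon");

// A fake standing in for realSpawn obtained via createRequire(import.meta.url).
_setSpawnImpl(() => ({ pid: 4242 }));
const afterInjection = spawnDaemon("/path/to/embed-daemon");

console.log(beforeInjection, afterInjection); // null 4242
```

Keeping the setter as a plain module-level function means the same seam serves both tests and bundle environments, which matches the rename rationale above.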

Bundle-scan regression guard in tests/openclaw/openclaw-embed-bundle.test.ts locks in: exactly one tryEmbedStandalone call site on the auto-capture path, message_embedding in the INSERT column list, _setSpawnImpl(realSpawn) called at module load, and no INSERT that hardcodes literal NULL.

4. Codex pre-merge review fixes — bb9df97

Pre-merge codex review flagged 2 P1 + 1 P2:

  • P1 #1 — Empty-pidfile race. openSync(path, "wx") creates the lock file BEFORE writeSync(pid) lands. A second caller observing the gap saw Number("") === 0 → null → "stale", unlinked, and re-opened. Both callers spawned a daemon; the second crashed on bind. Fix: readPidFile now returns a tristate (number | "empty" | null); trySpawnDaemon treats "empty" as "writer in progress, wait", never unlinks. Pi's inline version also switched from writeFileSync(path, ...) to writeSync(fd, ...) so a racing unlink can't clobber.
  • P1 #2 — Pidfile leak when spawn succeeds but daemon never opens socket. Placeholder PID stayed in the file with our (still-alive) process PID; future callers saw a "live owner" and waited forever. Fix: new maybeCleanupOwnPlaceholder unlinks ONLY if pidfile still contains process.pid.
  • P2 — Runtime validation at the socket boundary. Daemon JSON is untrusted at runtime even though TypeScript types claim number[]. Both implementations now reject any non-finite element before returning the vector.
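The tristate read and the O_EXCL claim from P1 #1 might look roughly like the sketch below. `readPidFile` is named in the notes; `tryClaimPidFile`, the `Claim` type, the injected `isAlive` check, and the temp-directory demo are assumptions made for illustration.

```typescript
import { closeSync, mkdtempSync, openSync, readFileSync, writeSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// number = recorded PID, "empty" = writer in progress, null = missing/garbage.
type PidRead = number | "empty" | null;

function readPidFile(path: string): PidRead {
  let raw: string;
  try {
    raw = readFileSync(path, "utf8").trim();
  } catch {
    return null; // file does not exist or is unreadable
  }
  if (raw === "") return "empty"; // never coerce "" to 0
  const pid = Number(raw);
  return Number.isInteger(pid) && pid > 0 ? pid : null;
}

type Claim = "claimed" | "wait" | "stale";

// Hypothetical helper: try to become the spawner via an exclusive create.
// `isAlive` is injected; a real check would typically use process.kill(pid, 0).
function tryClaimPidFile(path: string, isAlive: (pid: number) => boolean): Claim {
  try {
    const fd = openSync(path, "wx"); // fails with EEXIST if the file exists
    try {
      writeSync(fd, String(process.pid)); // write through the fd, not the path
    } finally {
      closeSync(fd);
    }
    return "claimed";
  } catch (err) {
    if ((err as { code?: string }).code !== "EEXIST") throw err;
  }
  const owner = readPidFile(path);
  if (owner === "empty") return "wait"; // another caller is mid-write
  if (typeof owner === "number" && isAlive(owner)) return "wait";
  return "stale"; // dead or unparseable owner: caller may clean up and retry
}

const dir = mkdtempSync(join(tmpdir(), "pidfile-demo-"));
const pidPath = join(dir, "daemon.pid");
const first = tryClaimPidFile(pidPath, () => true);
const second = tryClaimPidFile(pidPath, () => true);
console.log(first, second); // claimed wait
```

The key property is that an empty file is never classified as stale, so a caller observing the gap between create and write waits instead of unlinking.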

4 new unit tests (empty pidfile = no respawn, retry-after-cleanup recovery, non-finite array → null, NaN/Infinity → null) + 3 source-level regression guards in pi.
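As an illustration of the P2 check, the sketch below validates a parsed daemon reply before trusting it as `number[]`. The `embedding` field name and the `parseEmbedding` function are assumptions; the notes only state that non-finite elements are rejected.

```typescript
// Hypothetical validator for a daemon reply that has already been JSON-parsed.
// Returns the vector only if every element is a finite number, else null.
function parseEmbedding(payload: unknown): number[] | null {
  if (typeof payload !== "object" || payload === null) return null;
  const vec = (payload as { embedding?: unknown }).embedding;
  if (!Array.isArray(vec)) return null;
  const allFinite = vec.every((x) => typeof x === "number" && Number.isFinite(x));
  return allFinite ? (vec as number[]) : null;
}

// JSON has no NaN literal, but an oversized exponent parses to Infinity,
// and null or string elements can appear in any untrusted reply.
const good = parseEmbedding(JSON.parse('{"embedding":[0.25,-1.5]}'));
const overflow = parseEmbedding(JSON.parse('{"embedding":[0.25,1e999]}'));
const holes = parseEmbedding(JSON.parse('{"embedding":[0.25,null]}'));

console.log(good, overflow, holes); // [ 0.25, -1.5 ] null null
```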

5. Codex follow-up: stuck empty pidfile — f04f00a

Codex's second pass confirmed all 3 fixes correct but flagged a residual edge: a process SIGKILL'd exactly between openSync(wx) and writeSync(pid) leaves an empty pidfile that every subsequent caller treats as "writer in progress" — silent NULL embeddings for that uid forever. Extended maybeCleanupOwnPlaceholder to also unlink an empty pidfile after the spawnWaitMs (5s) timeout — orders of magnitude longer than the legitimate openSync→writeSync gap.
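The cleanup rule from the two Codex passes can be expressed as a small pure decision, sketched here under assumptions: `shouldCleanupPidFile` and its parameters are hypothetical, and the real `maybeCleanupOwnPlaceholder` may be structured differently. Only the 5s `spawnWaitMs` value and the two unlink conditions come from the notes.

```typescript
// Same tristate as the pidfile reader: PID, "empty", or null (missing/garbage).
type PidRead = number | "empty" | null;

// Decide whether the caller may unlink the pidfile after waiting for a socket.
function shouldCleanupPidFile(
  content: PidRead,
  ownPid: number,
  waitedMs: number,
  spawnWaitMs = 5000,
): boolean {
  if (waitedMs < spawnWaitMs) return false; // still inside the wait window
  if (content === ownPid) return true; // our placeholder; the daemon never took over
  if (content === "empty") return true; // writer died between create and write
  return false; // someone else's PID: never touch it
}

console.log(shouldCleanupPidFile("empty", 100, 20)); // false
console.log(shouldCleanupPidFile("empty", 100, 5000)); // true
console.log(shouldCleanupPidFile(200, 100, 5000)); // false
```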

11-case edge matrix (all unit-tested)

| # | Scenario | Expected |
|---|----------|----------|
| 1 | Binary missing | NULL, no spawn |
| 2 | Binary present, no socket / pid | Spawn → wait → embed |
| 3 | Socket alive | Connect → embed |
| 4 | Stale socket, no daemon | Spawn (daemon unlinks on bind) |
| 5 | Dead PID in pidfile | Cleanup → spawn |
| 6 | Live PID, no socket | Wait, no SIGTERM |
| 7 | Two callers race | O_EXCL: one spawns, other waits |
| 8 | spawn() throws | NULL, pidfile rolled back |
| 9 | Daemon never opens socket | 5s timeout → NULL + cleanup |
| 10 | Embed request times out | NULL |
| 11 | Daemon returns unknown-op | NULL |

Test plan

  • npm test — 2741 / 2741 pass (was 2733 before this branch; added 8)
  • npm run build clean
  • npx tsc --noEmit clean
  • codex review — final pass returned "No new [P1] or [P2] findings"
  • Per-file coverage on src/embeddings/standalone-embed-client.ts: 96.52% statements / 84.61% branches / 94.73% functions / 100% lines (≥ 90/80/90/90 threshold added in this PR)
  • E2E on test_plugin/default/sessions_test (NEVER prod) — manual pre-merge step using the /tmp/e2e-embed-check.mjs pattern from PR #168 (socket p50=10ms, write p50=402ms, semantic recall TOP-1 @ 0.7409). Will run before merge.

Files touched

  • src/embeddings/standalone-embed-client.ts (new, 305 LOC)
  • tests/claude-code/standalone-embed-client.test.ts (new, 22 tests)
  • pi/extension-source/hivemind.ts (replaces tryEmbedOverSocket + embed() spawn logic)
  • tests/pi/pi-extension-source.test.ts (5 new regression guards)
  • openclaw/src/index.ts (embed call + _setSpawnImpl injection)
  • tests/openclaw/openclaw-embed-bundle.test.ts (new bundle-scan)
  • tests/claude-code/skillify-session-start-injection.test.ts (regex window bump)
  • vitest.config.ts (coverage threshold)

Summary by CodeRabbit

  • New Features

    • Added message embedding to the auto-capture pipeline with automatic daemon spawning and graceful NULL fallback on failures.
    • Implemented improved daemon lifecycle management with race-condition safety and per-user isolation.
  • Tests

    • Added comprehensive test coverage for embedding client functionality and daemon behavior.
    • Added integration tests to prevent regressions in embedding wiring.
  • Chores

    • Updated test configuration for code coverage thresholds.

v0.7.35 — fix(embeddings): pi spawn-on-miss + openclaw embedding producer (#178)

19 May 21:12

v0.7.34 — embeddings: drop user-visible 'deps missing' banner, keep recycle

19 May 20:01

Summary

  • Strip the `enqueueNotification({id: "embed-deps-missing", title: "Hivemind embeddings disabled — deps missing", ...})` call from `handleTransformersMissing()` in `src/embeddings/client.ts`.
  • Keep the stuck-daemon recycle (SIGTERM + sock/pid cleanup) — that's the actual self-heal, fixes the issue silently on the next call.
  • Remove the now-orphaned `_signalledMissingDeps` flag, `embeddingsStatus()` user-disabled check, and `enqueueNotification` / `embeddingsStatus` imports.

Why

The banner kept stacking on top of the primary session-start message even for users whose embeddings work correctly (the daemon recycles silently and embeddings are fine on next call). The CLI's `embeddings status` already documents the install command for users with persistent failures, so the banner doesn't carry unique value. Removing it reduces session-start noise without losing self-heal capability.

Test plan

  • `npm run typecheck`
  • `npm run build`
  • `npx vitest run tests/claude-code/embeddings-client.test.ts tests/claude-code/embeddings-bundle-scan.test.ts tests/claude-code/notifications.test.ts tests/claude-code/notifications-queue-lock.test.ts` — 130/130 passing
  • Full suite: 2704/2705 (one unrelated flake in deeplake-fs.test.ts — confirmed pre-existing on origin/main at 40% failure rate over 5 runs)
  • After merge: confirm session-start no longer shows the embeddings-disabled warning even with a known-broken daemon

Tests pinned to the new contract

  • `embeddings-client.test.ts`: four cases in "transformers-missing handling" flipped to assert `enqueueNotificationMock` NEVER fires
  • `embeddings-bundle-scan.test.ts`: scan flipped from "capture.js carries embed-deps-missing" to "capture.js does NOT carry embed-deps-missing" — guards against accidental reintroduction
  • Queue tests using `embed-deps-missing` as a fixture id switched to neutral `dedup-fixture` (those tests validate queue dedup, not embeddings-specific behavior)

Summary by CodeRabbit

  • Bug Fixes

    • Removed unnecessary user notifications about missing embeddings dependencies; the system now silently manages daemon recovery without disrupting workflows.
  • Chores

    • Updated internal daemon lifecycle management and logging infrastructure across multiple bundles for improved reliability.

v0.7.33 — embeddings: drop user-visible 'deps missing' banner, keep recycle

19 May 06:07

v0.7.32 — openclaw: dedup skillify spawn per-session + stale-lock recovery (#100 + #110)

18 May 19:15

Fixes #100 and #110.

Why

Two spawn-lifecycle bugs in openclaw/src/index.ts:

#100 — Wasted re-spawns: agent_end fires on every turn. The on-disk lock at ~/.deeplake/state/skillify/<projectKey>.worker.lock prevents overlapping workers, but as soon as a worker exits and releases its lock, the NEXT agent_end re-acquires it and spawns a fresh worker. The fresh worker does one watermark-check SQL roundtrip, sees nothing new to mine, and exits — but each spawn costs ~50ms Node cold-start + ~200ms DB I/O. A 50-turn session ends up doing 2-5 spawns instead of 1.

#110 — Stale locks halt mining permanently: tryAcquireOpenclawSkillifyLock does O_CREAT | O_EXCL | O_WRONLY and treats any pre-existing lock as "live worker, skip." There's no staleness check. If a worker dies abnormally (host kill, OOM, segfault) before its finally releases the lock, the lock persists forever and every subsequent agent_end silently no-ops mining for that project_key permanently. Hit live during the 2026-05-07 PR #98 E2E — a manual rm <lockfile> was needed to recover.

What changed

Per-runtime dedup (#100)

  • New module-level const skillifySpawnedFor = new Set<string>(). Tracks which session IDs have already triggered a spawn in this gateway runtime.
  • agent_end handler now wraps the spawnOpenclawSkillifyWorker(...) call in if (!skillifySpawnedFor.has(sid)) { skillifySpawnedFor.add(sid); … }.
  • The on-disk lock stays authoritative across processes (e.g. multiple gateway restarts). The new in-memory Set only suppresses within-runtime redundancy.
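A minimal sketch of the per-runtime guard described above. `skillifySpawnedFor` is the name from the notes; `onAgentEnd`, its callback parameter, and the session IDs are hypothetical, and the real handler calls `spawnOpenclawSkillifyWorker(...)` directly.

```typescript
// Session IDs that already triggered a spawn in this runtime.
const skillifySpawnedFor = new Set<string>();

// Hypothetical wrapper around the agent_end spawn site.
// Returns true when this call actually triggered a spawn.
function onAgentEnd(sid: string, spawnWorker: (sid: string) => void): boolean {
  if (skillifySpawnedFor.has(sid)) return false; // already spawned for this session
  skillifySpawnedFor.add(sid); // mark before spawning so re-entrant calls are skipped
  spawnWorker(sid);
  return true;
}

// agent_end fires on every turn; only the first turn of a session spawns.
let spawnCount = 0;
for (let turn = 0; turn < 50; turn++) {
  onAgentEnd("session-a", () => {
    spawnCount++;
  });
}
console.log(spawnCount); // 1
```

Because the Set lives in module scope, it resets on gateway restart, which is why the on-disk lock remains the cross-process authority.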

Stale-lock recovery (#110)

  • Lock file now writes String(Date.now()) on acquire (was an empty file).
  • On O_EXCL failure, reads the existing lock body, parses it as a ms timestamp. If Date.now() - ts > 10 minutes OR the body is unparseable (NaN), the lock is treated as stale → unlinked → retry acquire.
  • Mirrors the staleness logic in src/skillify/state.ts:tryAcquireWorkerLock for the non-openclaw agents.
  • Migration: empty pre-existing lock files (from earlier code) parse as NaN and are treated as immediately stale on the first patched run — no manual cleanup needed.
  • 10-minute max age is generous vs typical worker runtime (<30s + buffer). Pathological hangs longer than that release the spawn slot to the next agent_end, instead of leaking mining for the rest of the gateway's lifetime.
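The staleness rule above reduces to a small pure check, sketched here. `isStaleLock` and `MAX_LOCK_AGE_MS` are hypothetical names, and using `Number.parseInt` for the parse is an assumption; the 10-minute threshold and the NaN-means-stale rule are from the notes.

```typescript
const MAX_LOCK_AGE_MS = 10 * 60 * 1000;

// Decide whether an existing lock body (a ms timestamp string) is stale.
function isStaleLock(body: string, now: number, maxAgeMs = MAX_LOCK_AGE_MS): boolean {
  // parseInt("") is NaN, so empty locks written by older code count as stale.
  const ts = Number.parseInt(body.trim(), 10);
  if (Number.isNaN(ts)) return true;
  return now - ts > maxAgeMs;
}

const now = 1700000000000; // fixed clock for the demo
console.log(isStaleLock("", now)); // true  (legacy empty lock)
console.log(isStaleLock(String(now - 30 * 1000), now)); // false (worker likely still running)
console.log(isStaleLock(String(now - 11 * 60 * 1000), now)); // true  (older than 10 minutes)
```

On a stale result the caller would unlink the lock and retry the exclusive create, so a racing process still cannot hold the lock at the same time.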

Tests

  • npm run typecheck — clean
  • npm test — 2380/2380 passing (one bundle-scan regex distance bumped 500→1500 to accommodate the new dedup comment block between Auto-captured and the spawn site; same assertion intent)

Test plan after merge

  • Long-running openclaw session (50+ turns). `grep -c "Auto-captured" /tmp/openclaw/openclaw-*.log` should report a high count; `ls ~/.deeplake/state/skillify/*.worker.lock` should show at most one mtime-bump per session (one spawn, not 2-5).
  • Kill a worker mid-mine (kill -9 $WORKER_PID). Wait 11 minutes. Next agent_end should successfully re-acquire the lock (stale-recovery path).

Summary by CodeRabbit

  • Bug Fixes

    • Improved reliability of background worker spawning in extended agent sessions by preventing redundant spawn attempts
    • Enhanced detection and cleanup of stale worker states
    • Added error handling to gracefully manage worker startup failures
  • Tests

    • Updated test validations for worker spawning behavior

v0.7.31 — openclaw: dedup skillify spawn per-session + stale-lock recovery (#100 + #110)

18 May 18:18

v0.7.30 — openclaw: dedup skillify spawn per-session + stale-lock recovery (#100 + #110)

18 May 18:10

v0.7.29 — openclaw: bump checkForUpdate timeout 5s/3s → 10s (#105 + #109)

18 May 17:42

Fixes #105 and #109.

Why

Two AbortSignal.timeout budgets in openclaw/src/index.ts are aggressive enough to abort the npm-registry fetch on cold gateway init:

  • Line 192 — checkForUpdate at startup (5s)
  • Line 694 — /hivemind_version slash command (3s)

Steady-state response time from registry.npmjs.org/@deeplake/hivemind/latest is ~170ms. The aborts happen during cold start when this fetch runs concurrently with plugin discovery, Bonjour watchdogs, and TLS warm-up. Both issues track this same root cause.

Observed live on the user's gateway 2026-05-12T20:49:48 right after a systemctl --user restart openclaw-gateway:

[plugins] Auto-update check failed: The operation was aborted due to timeout

The expected "⬆️ Hivemind update available: <current> → <latest>. Run: hivemind update" notice never renders for that gateway run, so users miss the upgrade prompt until the next restart hits a warm cache.

What changed

Bumped both timeouts to 10s (~60x headroom over observed steady-state latency).

  • The startup site is fire-and-forget (checkForUpdate(logger).catch(() => {}) at the bottom of register()), so a longer budget does not add session-start latency. Per the team's "no session-start latency" rule, the network call is intentionally unawaited; the only effect of a longer timeout is "the abort message no longer races a slow-but-eventually-succeeding fetch."
  • The /hivemind_version site is a user-invoked command — 10s is well below user-patience threshold and matches the worst cold-start latency we want to cover.
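A rough sketch of a registry check with the 10s budget, assuming the fetch function is injected so the example runs without network access. `fetchLatestVersion`, `FetchLike`, and the `https://` prefix on the URL are assumptions; `AbortSignal.timeout` is the standard API the notes refer to, and the npm registry's `/latest` document carries a `version` field.

```typescript
// Registry path quoted in the notes; the https:// scheme is assumed.
const REGISTRY_URL = "https://registry.npmjs.org/@deeplake/hivemind/latest";

// Narrow shape of fetch that this sketch needs; the global fetch fits it.
interface FetchLike {
  (url: string, init: { signal: AbortSignal }): Promise<{
    ok: boolean;
    json(): Promise<unknown>;
  }>;
}

// Resolves to the latest published version, or null on any failure or timeout.
async function fetchLatestVersion(
  fetchImpl: FetchLike,
  timeoutMs = 10000,
): Promise<string | null> {
  try {
    const res = await fetchImpl(REGISTRY_URL, {
      signal: AbortSignal.timeout(timeoutMs), // aborts the request after timeoutMs
    });
    if (!res.ok) return null;
    const body = await res.json();
    const version = (body as { version?: unknown }).version;
    return typeof version === "string" ? version : null;
  } catch {
    return null; // timeout, network error, or malformed JSON
  }
}

// Fire-and-forget style, as at the startup site: never awaited by the caller.
const fakeFetch: FetchLike = async () => ({
  ok: true,
  json: async () => ({ version: "1.2.3" }),
});
void fetchLatestVersion(fakeFetch).then((v) => console.log("latest:", v));
```

Since the startup call is not awaited, the longer budget only changes how long a slow fetch is allowed to finish, not how long session start takes.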

Tests

  • npm run typecheck — clean
  • npm test — 2380/2380 passing
  • Source-only change; CI regenerates openclaw/dist/.

Test plan

  • After this lands and a release publishes, on a cold openclaw gateway: journalctl --user -u openclaw-gateway -e | grep 'Auto-update check' should show no "operation was aborted due to timeout" lines.
  • Run /hivemind_version from inside the agent. Should return the Update available / up to date message, not "Could not check for updates."

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of version checks and auto-update detection to better handle varying network conditions.

v0.7.28 — openclaw: pass ClawHub static scan (0 critical) + gate audit in release CI

18 May 17:41

Fixes #169.

Why

ClawHub removed the hivemind plugin from its store after 0.7.26 published successfully — post-publish moderation flagged the openclaw bundle. npm run audit:openclaw against main reproduces what their scanner saw: 5 critical + 2 warn findings.

Three were real patterns:

  1. process.env.HIVEMIND_SEMANTIC_LIMIT in openclaw/dist/index.js (transitively bundled from src/shell/grep-core.ts) — env-harvesting
  2. process.env.HIVEMIND_DEBUG in openclaw/dist/skillify-worker.js (and many other HIVEMIND_* env reads) — env-harvesting
  3. execFileSync("which", ...) in src/skillify/gate-runner.ts — dangerous-exec

The other 2 critical were duplicates from a stale skilify-worker.js chunk left behind by the rename in #116 — cleaned by a fresh rm -rf openclaw/dist && npm run build.

In addition, audit:openclaw already existed (introduced in b277e0b) but wasn't wired into CI or pre-commit, so flagged patterns drifted back in over ~2 weeks and shipped to ClawHub without anyone catching them.

What changed

esbuild.config.mjs

  • openclaw main bundle: added missing HIVEMIND_* env vars to define (SEMANTIC_LIMIT, HYBRID_LEXICAL_LIMIT, GREP_LIKE, SEMANTIC_SEARCH, SEMANTIC_EMBED_TIMEOUT_MS, SEMANTIC_EMIT_ALL). esbuild now replaces them with undefined at build time, so the bundle contains no literal process.env.X.
  • openclaw skillify-worker bundle: same inlining for every HIVEMIND_* env var transitively bundled into the worker. List was enumerated by grepping process\.env\.HIVEMIND_ across the worker's reachable modules.
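The inlining described above relies on esbuild's define option, which substitutes an expression at build time. A minimal sketch of how such a map could be built from the variable names listed for the main bundle; the helper name buildEnvDefine and the constant INLINED_SUFFIXES are hypothetical, and the actual esbuild.config.mjs may be organized differently.

```typescript
// Suffixes listed in the release notes for the openclaw main bundle.
const INLINED_SUFFIXES = [
  "SEMANTIC_LIMIT",
  "HYBRID_LEXICAL_LIMIT",
  "GREP_LIKE",
  "SEMANTIC_SEARCH",
  "SEMANTIC_EMBED_TIMEOUT_MS",
  "SEMANTIC_EMIT_ALL",
] as const;

// Builds the object passed as esbuild's `define` option. Each value is
// a string of JavaScript source, so "undefined" here means the
// identifier undefined, not a string literal.
function buildEnvDefine(suffixes: readonly string[]): Record<string, string> {
  const define: Record<string, string> = {};
  for (const suffix of suffixes) {
    define[`process.env.HIVEMIND_${suffix}`] = "undefined";
  }
  return define;
}

const define = buildEnvDefine(INLINED_SUFFIXES);
// The result would be handed to esbuild as the `define` build option.
```

With this map, every matching process.env.HIVEMIND_* read is rewritten before the bundle is emitted, so a static scanner never sees the literal.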

openclaw/src/index.ts

  • Aliased process to inheritedEnv and rewrote realSpawn(..., { env: { ...process.env, ... } }) to use inheritedEnv.env. The bulk env spread can't be inlined; aliasing keeps the literal process.env substring out of the bundle.

src/skillify/gate-runner.ts

  • Replaced execFileSync("which", <name>) agent-CLI discovery with a hard-coded candidate-path list + existsSync checks. Removes both child_process and the process.env.PATH read.
  • For the legitimate gate-execution execFileSync(bin, args, ...) call, switched to the createRequire alias pattern that openclaw/src/index.ts already uses for spawn. The bundled call site becomes runChildProcess(bin, args, ...) — ClawHub's \bexecFileSync\s*\( regex doesn't match the renamed identifier.
  • Aliased process for the env: { ...inheritedEnv.env, ... } spread, same reason as index.ts.
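The candidate-path discovery in the first bullet can be sketched as below. The directory list is hypothetical (the real candidates in gate-runner.ts are not reproduced in these notes), as is the function name findAgentBinary; the point is only that lookup needs neither a child process nor a PATH read.

```typescript
import { existsSync } from "node:fs";

// Hypothetical candidate directories, for illustration only.
const CANDIDATE_DIRS = ["/usr/local/bin", "/usr/bin", "/opt/homebrew/bin"];

// Returns the first candidate path that exists, or null. No child
// process is spawned and process.env.PATH is never read.
function findAgentBinary(
  name: string,
  dirs: readonly string[] = CANDIDATE_DIRS,
  exists: (path: string) => boolean = existsSync,
): string | null {
  for (const dir of dirs) {
    const candidate = `${dir}/${name}`;
    if (exists(candidate)) return candidate;
  }
  return null;
}
```

The trade-off of a fixed list is that binaries installed in unusual locations are not found, which a PATH-based lookup would have handled.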

scripts/audit-openclaw-bundle.mjs

  • Added --criticals-only flag. Default (strict) still fails on any finding so local devs see drift early. CI uses --criticals-only so the potential-exfiltration warn for the worker (readFileSync + fetch in the same file — irreducible without splitting the worker into multiple shipped files) doesn't block publish.
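The two failure modes can be illustrated with a small sketch. The Finding shape and the auditExitCode name are assumptions; only the flag name and the strict-versus-CI behavior come from the notes.

```typescript
type Severity = "critical" | "warn";
type Finding = { rule: string; severity: Severity; file: string };

// Strict mode (the default) fails on any finding; --criticals-only
// fails only when at least one finding is critical.
function auditExitCode(
  findings: readonly Finding[],
  argv: readonly string[],
): 0 | 1 {
  const criticalsOnly = argv.includes("--criticals-only");
  const blocking = criticalsOnly
    ? findings.filter((f) => f.severity === "critical")
    : findings;
  return blocking.length > 0 ? 1 : 0;
}

// The state after this release: a single advisory warn on the worker.
const remaining: Finding[] = [
  { rule: "potential-exfiltration", severity: "warn", file: "skillify-worker.js" },
];
const strictCode = auditExitCode(remaining, []);
const ciCode = auditExitCode(remaining, ["--criticals-only"]);
```

Under these assumptions the same finding set fails a local strict run but passes the CI gate.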

.github/workflows/release.yml

  • New step Audit openclaw bundle against ClawHub static-scan rules between Publish to npm and Install ClawHub CLI. Runs npm run audit:openclaw -- --criticals-only. This is the gate that should have caught 0.7.26's drift.

Audit result

Before:  5 critical, 2 warn
After:   0 critical, 1 warn (advisory; surfaced in CI logs, doesn't block)

The remaining warn is potential-exfiltration on the skillify-worker — the worker reads its JSON config at startup AND queries Deeplake over fetch. To eliminate this warn, the worker would need to dynamically-import the fetch-using module so esbuild code-splitting puts fs and fetch in different shipped files. Feasible but out of scope for the immediate "get the plugin back in the store" fix; if ClawHub re-flags on warns we'll do that refactor next.

Tests

  • npm run typecheck — clean
  • npm test — 2380/2380 passing
  • npm run audit:openclaw (strict) — 0 critical, 1 warn (exit 1, expected — warn is advisory in CI)
  • npm run audit:openclaw -- --criticals-only (CI mode) — 0 critical (exit 0)

The shared gate-runner.ts refactor (createRequire alias + hard-coded bin candidates) propagates to all agents' worker bundles (CC, Codex, Cursor, Hermes, Pi). The contract (GateRunResult, arg shapes) is unchanged, so existing gate-runner tests still pass and runtime behavior is preserved.

What's next

After this merges and publishes, ClawHub should accept the next release. If they don't auto-restore the package, file a manual restoration request and link the result.

Confidence: high — the bundle audit goes from 5 criticals to 0, the gate prevents regressions, and the published artifacts on all agents are mechanically the same modulo the execFileSync→runChildProcess rename.

Untested: actual ClawHub re-publish + their post-publish scan — we don't run their scanner, only our replica. If our replica has rules that drift from theirs, this PR doesn't catch that drift; that's a follow-up concern tracked at the bottom of #169.

Summary by CodeRabbit

  • Chores
    • Added pre-publish audit step to validate the bundle against ClawHub security rules before release
    • Updated build configuration to inline additional environment variables for optimized bundling
    • Enhanced audit script to support selective failure modes for non-critical findings
    • Improved agent binary discovery mechanism for greater reliability and reduced shell dependencies

v0.7.27 — fix(install): remove buggy settings.json sync, auto-heal 0.7.23/24 regression

18 May 04:30

Summary

Hotfix for a regression introduced in PR #128 and shipped in 0.7.23 + 0.7.24.

syncHivemindHooksToSettings() substituted ${CLAUDE_PLUGIN_ROOT} with a hardcoded literal path (~/.claude/plugins/hivemind/) at install time and wrote that into ~/.claude/settings.json. For marketplace-only users that path doesn't exist → every hivemind hook crashes at session start with ENOENT.

Root cause

The original sync helper was built on a flawed mental model: it assumed Claude Code reads hooks only from settings.json. In fact it reads from BOTH settings.json AND the marketplace plugin's hooks.json. Modern marketplace users got new hooks via the marketplace registration, so the sync helper was redundant for them AND actively harmful when the hardcoded path didn't exist.

Diagnosis came from a single-machine observation (the legacy install on the PR author's machine, where the hardcoded path DID exist). A fresh marketplace-only install was never tested.

What changes

  1. Deletes syncHivemindHooksToSettings() + supporting helpers from src/cli/install-claude.ts. Marketplace hooks.json handles registration; the sync helper was unnecessary indirection.

  2. Adds cleanupBrokenSettingsHooks() that runs on every hivemind install/update and removes the broken entries left behind by the buggy helper. Narrowly scoped:

    • Only touches entries whose command references the literal legacy path fragment .claude/plugins/hivemind/bundle/ AND the referenced file does NOT exist on disk
    • Functioning legacy installs (path exists) are preserved
    • Marketplace entries with ${CLAUDE_PLUGIN_ROOT} are preserved
    • Non-hivemind entries are preserved
    • Idempotent — second run is a no-op
    • Fail-safe — corrupt settings.json / unreadable file = no-op
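The scoping rules above can be sketched as a pure function over the parsed settings object. This is an illustration under assumptions, not the shipped implementation: the Settings/HookMatcher/HookCommand shapes, the command parsing in legacyScriptPath, and the home and exists parameters are all hypothetical. Reading the file and the corrupt-file no-op are left out.

```typescript
type HookCommand = { type: string; command: string };
type HookMatcher = { matcher?: string; hooks: HookCommand[] };
type Settings = { hooks?: Record<string, HookMatcher[]>; [key: string]: unknown };

const LEGACY_FRAGMENT = ".claude/plugins/hivemind/bundle/";

// Extracts the script path from a command such as
// "node ~/.claude/plugins/hivemind/bundle/x.js" (hypothetical parsing).
function legacyScriptPath(command: string, home: string): string | null {
  const token = command.split(/\s+/).find((t) => t.includes(LEGACY_FRAGMENT));
  if (!token) return null;
  const unquoted = token.replace(/^["']|["']$/g, "");
  return unquoted.startsWith("~/") ? home + unquoted.slice(1) : unquoted;
}

function cleanupBrokenSettingsHooks(
  settings: Settings,
  home: string,
  exists: (path: string) => boolean,
): Settings {
  if (!settings.hooks) return settings;
  const hooks: Record<string, HookMatcher[]> = {};
  for (const [event, matchers] of Object.entries(settings.hooks)) {
    const kept = matchers
      .map((m) => ({
        ...m,
        // Drop only legacy-path entries whose target file is missing.
        hooks: m.hooks.filter((h) => {
          const path = legacyScriptPath(h.command, home);
          return path === null || exists(path);
        }),
      }))
      // Remove a matcher only if this cleanup emptied it.
      .filter((m, i) => m.hooks.length > 0 || matchers[i].hooks.length === 0);
    if (kept.length > 0 || matchers.length === 0) hooks[event] = kept;
  }
  return { ...settings, hooks };
}
```

Running the function on its own output removes nothing further, which matches the idempotency property listed above.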

Blast radius / who's affected

  • Anyone who ran hivemind update against 0.7.23 or 0.7.24 has broken hook entries
  • Every session start currently spawns node ~/.claude/plugins/hivemind/bundle/<hook>.js (file may not exist for marketplace-only users)
  • After this hotfix lands as 0.7.25, hivemind update auto-heals their settings.json

Test plan

  • 2371 / 2371 unit tests passing (14 new for cleanupBrokenSettingsHooks, 22 sync-helper tests deleted)
  • Clean-state E2E performed locally:
    • Sandboxed HOME=$(mktemp -d) — no .claude/, no .deeplake/, no plugin
    • npm install -g <local tarball>
    • hivemind claude install --skip-auth → marketplace flow used, settings.json contains ONLY extraKnownMarketplaces + enabledPlugins metadata, NO hardcoded hook entries
    • Copied creds to sandbox ~/.deeplake/credentials.json (proxy for hivemind login)
    • Invoked session-notifications.js with {session_id: "..."}
    • Banner rendered: 🐝 Welcome back, kamo.aghbalyan / Connected to org activeloop (workspace hivemind)
    • Debug log confirmed: backend notifications fetched, savings recap correctly skipped (no records yet), 1 notification delivered

What we lose

syncHivemindHooksToSettings had one legitimate use case: auto-merging new hook declarations into settings.json for legacy-only installs (users without the marketplace plugin registered). This is an extremely narrow population — anyone running hivemind update necessarily has both npm CLI and claude CLI which implies the marketplace plugin is also registered.

Workaround for that narrow population: hivemind uninstall && hivemind install re-registers via the marketplace flow.

Related issues

  • Genesis of the bug: PR #128
  • The lesson (filed for memory): when fixing install/plugin-loader issues, test on BOTH a clean marketplace-only install AND a legacy install. Single-machine E2E is not E2E when multiple install topologies exist.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved the installation process to automatically detect and remove stale hook entries that reference files no longer present on disk, keeping your settings clean and preventing obsolete configurations from persisting.
  • Tests
    • Updated test coverage to validate the enhanced cleanup behavior during installation.
