
feat(e2e): tier-1 cross-agent matrix harness #122

Open
kaghni wants to merge 5 commits into main from feat/e2e-agent-matrix


Conversation

@kaghni
Collaborator

@kaghni kaghni commented May 11, 2026

Summary

Adds a tier-1 cross-agent E2E harness. It drives the five headless agent CLIs (claude-code, codex, cursor-agent, hermes, pi) through real prompts against a dedicated Deeplake test workspace and asserts on real side effects to catch what source + bundle byte-checks can't: hook-loader runtime failures, per-agent install drift, and cross-agent inconsistency in the memory mount.

This PR is the harness only — fix-agnostic by design. Any feature branch can validate cross-agent behavior by triggering this workflow against itself after merge here.

Why now

The recurring class of bugs source tests miss is "wires correctly, fails at runtime under one agent's loader". Manual cross-agent passes are the only safety net today and they take multiple hours per release. This automates that pass: 4 cases × 5 agents = 20 matrix points per run, ~10 min wall-clock, ~$1.50 in provider API costs.

Architecture (high level)

tests/e2e/
  runner.ts           orchestrator + CLI flags (--case --agent --keep-sandbox --list)
  sandbox.ts          mkdtemp HOME + write creds + buildSessionId
  assertions.ts       4 assertion types + post-run row cleanup
  cost.ts             per-agent cost parsing + per-run summary.json
  types.ts            typed AgentDriver / E2ECase / Assertion contracts
  matrix.ts           cross-product (case × agent) + skip-list
  agents/             5 drivers (~50-80 lines each)
  cases/              4 behavioral cases (~40 lines each)
  README.md           how to run + how to add a case
.github/workflows/e2e.yml   workflow_dispatch ONLY — manual trigger

Total: 16 TS files, ~1470 lines + workflow + README. Existing test suite unchanged (2179 tests still passing).
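
To make the matrix shape concrete, here is a minimal sketch of the case × agent cross-product with a per-case skip-list. The type and field names (`CaseSketch`, `MatrixPoint`, `skipped`) are illustrative assumptions; the real contracts live in tests/e2e/types.ts and tests/e2e/matrix.ts and may differ.

```typescript
// Hypothetical, trimmed-down shapes; not the PR's actual types.
type AgentId = "claude-code" | "codex" | "cursor-agent" | "hermes" | "pi";

interface CaseSketch {
  id: string;
  skipFor?: AgentId[]; // agents this case should not run against
}

interface MatrixPoint {
  caseId: string;
  agent: AgentId;
  skipped: boolean;
}

const AGENTS: AgentId[] = ["claude-code", "codex", "cursor-agent", "hermes", "pi"];

const CASES: CaseSketch[] = [
  { id: "01-capture-smoke" },
  { id: "02-cat-index-md" },
  { id: "03-grep-memory-summaries" },
  { id: "04-session-start-inject" },
];

// Cross-product in fixed order: every case is paired with every agent,
// and skip-listed pairs stay in the matrix but are flagged as skipped.
function buildMatrix(cases: CaseSketch[], agents: AgentId[]): MatrixPoint[] {
  const points: MatrixPoint[] = [];
  for (const c of cases) {
    for (const agent of agents) {
      points.push({
        caseId: c.id,
        agent,
        skipped: c.skipFor?.includes(agent) ?? false,
      });
    }
  }
  return points;
}

const matrix = buildMatrix(CASES, AGENTS);
console.log(matrix.length); // 20
```

Keeping skipped pairs in the output (rather than filtering them) lets a `--list` style dry run show the full grid, including what will not be spawned.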

Decisions made (documented in the plan)

  1. Test workspace — dedicated hivemind-e2e workspace inside activeloop org. CI reads HIVEMIND_E2E_CREDS_JSON (full credentials.json blob); runner writes it to ${tmpHome}/.deeplake/credentials.json per case.
  2. Provider keys — standard env-var convention forwarded into spawn (ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY). CI secrets are namespaced (HIVEMIND_E2E_*); workflow does the translation.
  3. Cadence — manual workflow_dispatch only. No schedule, no PR trigger. Reasons: cost (~$1.50/run × many PRs/day), flake-class (upstream agent CLIs change flag shapes), wall time (~10 min vs 23s current npm test). Promote later in a separate PR.
  4. Isolation — tmp HOME via mkdtempSync + process.env.HOME override. Docker-per-case deferred — promote only if v1 develops $HOME bleed-through flakes.
  5. Cost tracking — best-effort regex parse per agent (claude/codex/pi print final usage; cursor/hermes vary). Summary JSON uploaded as workflow artifact.
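
The workspace and isolation decisions above can be sketched roughly as follows. This is not the code in tests/e2e/sandbox.ts; `createSandbox`, the directory prefix, and the placeholder credentials are assumptions made for illustration. The sketch passes HOME through a child environment object rather than mutating the parent process.

```typescript
import { existsSync, mkdirSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface Sandbox {
  home: string;
  destroy: () => void;
}

// Create a throwaway HOME and seed it with the credentials blob.
function createSandbox(credsJson: string): Sandbox {
  const home = mkdtempSync(join(tmpdir(), "hivemind-e2e-"));
  const credsDir = join(home, ".deeplake");
  mkdirSync(credsDir, { recursive: true, mode: 0o700 });
  // 0o600: the file holds a token, so keep it owner-readable only.
  writeFileSync(join(credsDir, "credentials.json"), credsJson, { mode: 0o600 });
  return {
    home,
    // force: true makes a second destroy() call a no-op instead of an error.
    destroy: () => rmSync(home, { recursive: true, force: true }),
  };
}

const sandbox = createSandbox(JSON.stringify({ token: "placeholder" }));
// Environment a driver could hand to spawn(); HOME-relative paths such as
// ~/.codex/ then resolve under the sandbox directory.
const childEnv = { ...process.env, HOME: sandbox.home };
const credsExisted = existsSync(join(sandbox.home, ".deeplake", "credentials.json"));
sandbox.destroy();
sandbox.destroy(); // idempotent
```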

Prior art steered the design

  • Princeton HAL — cost as first-class output field, per-case isolation, max-concurrent throttle. Adopted these.
  • Promptfoo — assertion vocabulary (stdout-contains, metadata-trace). Adopted vocabulary, rejected SDK boundary (we need real CLI spawn to exercise the hook loader).
  • SWE-bench mini-agent — keep drivers thin so the agent is the variable, not the harness.

Hivemind's matrix shape (plugin behavior × agent runtime) is novel — no prior framework tests one plugin across 5+ agent CLIs. The infra ends up simpler than HAL's docker-per-task setup because our cases assert on side effects, not task completion.

How to run

npm run e2e                              # full matrix
npm run e2e -- --list                    # print matrix, no spawns
npm run e2e -- --case 02-cat-index-md
npm run e2e -- --agent claude-code
npm run e2e -- --case 01-capture-smoke --agent claude-code  # fastest dev loop

Or trigger .github/workflows/e2e.yml from the Actions tab with optional case_filter / agent_filter inputs.
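
The extra `--` in the commands above is npm's separator: everything after it is forwarded to the script instead of being parsed by npm. A minimal sketch of how the four documented flags could be parsed is shown below; `RunnerFlags` and `parseFlags` are hypothetical names, and the real parser in tests/e2e/runner.ts may behave differently (for example around `--help`).

```typescript
interface RunnerFlags {
  caseFilter?: string;
  agentFilter?: string;
  keepSandbox: boolean;
  list: boolean;
}

// Parse the flags shown in "How to run"; unknown flags are rejected loudly.
function parseFlags(argv: string[]): RunnerFlags {
  const flags: RunnerFlags = { keepSandbox: false, list: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--case":
      case "--agent": {
        const value = argv[++i];
        if (value === undefined || value.startsWith("--")) {
          throw new Error(`${arg} requires a value`);
        }
        if (arg === "--case") flags.caseFilter = value;
        else flags.agentFilter = value;
        break;
      }
      case "--keep-sandbox":
        flags.keepSandbox = true;
        break;
      case "--list":
        flags.list = true;
        break;
      default:
        throw new Error(`unknown flag: ${arg}`);
    }
  }
  return flags;
}

const flags = parseFlags(["--case", "01-capture-smoke", "--agent", "claude-code"]);
console.log(flags.caseFilter, flags.agentFilter); // 01-capture-smoke claude-code
```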

What's deferred

  • Tier 2 (Cursor IDE GUI inside Snap, OpenClaw gateway). README documents what each would need (long-lived test VM + Xvfb / tmux-driven session). Tier 2 lives in tests/e2e-tier2/ when built.
  • Live verification of the harness against a real workspace. The --list dry-run + typecheck + existing-tests-still-pass demonstrate the harness loads and the matrix shape works. A live run requires the hivemind-e2e workspace and HIVEMIND_E2E_CREDS_JSON secret to be provisioned in the activeloop org — see the README setup section.
  • Record/replay cassettes (llm-test-harness-style) for replaying cached runs cheaply. Not v1.

Setup before first real run

  1. Create hivemind-e2e workspace under activeloop Deeplake org. Generate a token with read/write on sessions + memory tables there.
  2. Save the resulting credentials.json blob as the HIVEMIND_E2E_CREDS_JSON GH secret. Mirror into provider-key secrets (HIVEMIND_E2E_ANTHROPIC_API_KEY etc.).
  3. Locally: export HIVEMIND_E2E_CREDS_JSON="$(cat /path/to/test-creds.json)" and the provider keys, then run npm run e2e -- --case 01-capture-smoke --agent claude-code to smoke-test the loop.
  4. Once green, promote tier-2 work or use this matrix as the release-readiness gate.

Confidence: 75% — harness scaffolding compiles, dry-runs cleanly, matrix expands to 20 points, existing tests unaffected. Untested: any live agent-CLI spawn against a real workspace (gated on the test workspace + secrets being provisioned, scoped out of this PR per the manual-only cadence decision).

Untested: live spawn of any agent driver; install subprocess output for codex/cursor/hermes/pi installers under tmp HOME (relies on the existing installer code paths which have their own unit tests); cost-line regex match against current versions of each CLI's stdout format; the hook-log-contains substring matches against current hook log lines.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced end-to-end testing infrastructure supporting cross-agent test matrix execution
  • Tests

    • Added comprehensive E2E test suite with automated setup, execution, and teardown
    • Implemented assertion framework for output validation, database queries, and log verification
  • Documentation

    • Added end-to-end testing documentation with environment setup and execution guidelines
  • Chores

    • Configured automated testing workflow and npm scripts for testing lifecycle management

Review Change Stack

Drives the five headless agent CLIs (claude-code, codex, cursor-agent,
hermes, pi) through real prompts against a dedicated Deeplake test
workspace, asserting on real side effects (DB rows, hook log lines,
captured stdout, inject text). Replaces the multi-hour manual cross-
agent test pass each release; surfaces plugin bugs source + bundle
byte-checks can't reach (hook-loader runtime failures, per-agent install
drift, cross-agent inconsistency).

Architecture:
  tests/e2e/runner.ts       orchestrator + CLI flag parsing
  tests/e2e/sandbox.ts      mkdtemp HOME + write creds + per-agent install
  tests/e2e/assertions.ts   typed assertion runners + cleanup helper
  tests/e2e/cost.ts         per-agent cost parsing + summary writer
  tests/e2e/types.ts        AgentDriver / E2ECase / Assertion interfaces
  tests/e2e/matrix.ts       cross-product (case x agent) + skip-list
  tests/e2e/agents/*.ts     one ~50-80 line driver per agent CLI
  tests/e2e/cases/*.ts      four behavioral cases (capture-smoke,
                            cat-index-md, grep-memory-summaries,
                            session-start-inject)
  tests/e2e/README.md       how to run + how to add a case
  .github/workflows/e2e.yml manual-trigger workflow (workflow_dispatch only)

Cadence: manual only. No schedule, no PR trigger. Expected use: dev
finishes a feature, manually triggers the workflow against their branch,
reviews the cost+results artifact, opens PR with the run URL. The
unit/source/bundle tests in `npm test` keep gating merges.

Isolation: tmp HOME via mkdtempSync + process.env.HOME override per case.
With HOME overridden, every per-agent install path
(~/.codex/, ~/.cursor/, ~/.hermes/, ~/.pi/, ~/.deeplake/credentials.json)
resolves under the tmp dir, so cases cannot pollute each other through
HOME-relative files. Docker-per-case is deferred; promote it only if v1
develops bleed-through flakes.

Credentials: dedicated hivemind-e2e workspace under the activeloop org;
CI secret HIVEMIND_E2E_CREDS_JSON contains the full credentials.json
blob; runner writes it to <tmpHome>/.deeplake/credentials.json per case.
Provider keys use the standard env var convention (ANTHROPIC_API_KEY,
OPENAI_API_KEY, GOOGLE_API_KEY) and missing keys cause a clean skip
rather than a fail.

Cleanup: each case picks a fresh e2e-<runId>-<case>-<agent> session_id
seed; driver discovers the agent's actual session_id from the hook log
post-run; cleanup DELETEs sessions+memory rows by ILIKE on path. Best-
effort cleanup (a failure is warned but doesn't fail the case).

Cost: each driver parses an agent-specific cost line from stdout where
available (claude/codex/pi print final usage). The runner writes
tests/e2e/results/<runId>/summary.json with per-point cost + duration.
CI uploads it as a workflow artifact.

Prior art steered the design: HAL (cost-as-first-class field, per-case
isolation, max-concurrent throttle), Promptfoo (assertion vocabulary),
SWE-bench mini-agent (thin uniform drivers). Hivemind's matrix shape is
(plugin behavior x agent runtime), not (agent capability x task), so
the infra ends up simpler than HAL's docker-per-task setup.

Tier 2 (Cursor IDE GUI inside Snap, OpenClaw gateway) is scoped out;
README documents what each would need.

Files: 16 new TypeScript files (~1470 lines), one new workflow,
package.json + README.md additions. Existing test suite unchanged
(111 files / 2179 tests still passing).
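
As an illustration of the session-id lifecycle described in the commit message, the sketch below builds the seed in the `e2e-<runId>-<case>-<agent>` shape and falls back to it when no id can be discovered after the run. The function bodies are assumptions; only the id format and the fall-back-to-seed behavior come from this PR.

```typescript
// Seed format from the PR: e2e-<runId>-<case>-<agent>.
function buildSessionId(runId: string, caseId: string, agent: string): string {
  return `e2e-${runId}-${caseId}-${agent}`;
}

// Hypothetical helper: prefer the id the agent actually used (as discovered
// from the hook log), otherwise fall back to the seed.
function resolveSessionId(discovered: string | null, seed: string): string {
  return discovered ?? seed;
}

const seed = buildSessionId("2026-05-11T23-57-59-738546", "01-capture-smoke", "claude-code");
console.log(seed);
// e2e-2026-05-11T23-57-59-738546-01-capture-smoke-claude-code

console.log(resolveSessionId(null, seed) === seed); // true
```

Because the seed embeds the run, case, and agent, rows written under it can be matched by a substring pattern on path during cleanup without touching other runs' data.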
@coderabbitai

coderabbitai Bot commented May 11, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 0defa5b0-1f71-48b6-b5a7-0a4a2f3b6c12

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

A Tier-1 cross-agent E2E testing harness is added to validate plugin behavior against five headless agent CLIs (Claude Code, Codex, Cursor, Hermes, Pi) using real Deeplake workspace side effects, with four test cases, cost tracking, assertion evaluation, and automated session cleanup.

Changes

E2E Harness Implementation

Layer / File(s) Summary
Type Definitions & Contracts
tests/e2e/types.ts
Comprehensive type definitions: AgentId enum (5 agents), AgentDriver interface (install/run/cleanup), RunOpts/RunResult I/O shapes, E2ECase model with optional setup/skipFor, assertion union (stdout-contains, stdout-matches, select-from-db, hook-log-contains), TestCredentials, CaseContext, and MatrixResult reporting.
Sandbox Management
tests/e2e/sandbox.ts
Temporary HOME directory provisioning with seeded Deeplake credentials, restrictive file permissions, idempotent cleanup, and deterministic session ID generation via buildSessionId().
CLI Installation Infrastructure
tests/e2e/agents/install-via-cli.ts
Shared subprocess runner for hivemind <agent> install via npx tsx, stdout/stderr capture, timeout enforcement with SIGKILL, and error-propagating installOrThrow() wrapper.
Agent Drivers (5 implementations)
tests/e2e/agents/claude-code.ts, codex.ts, cursor-agent.ts, hermes.ts, pi.ts
Five AgentDriver implementations: each calls installOrThrow() for setup, constructs isolated process environment (HOME, HIVEMIND_DEBUG, API keys), invokes CLI via shared runProcess() helper with timeout/session ID, captures stdout/stderr, and extracts session ID from hook-debug.log or output. Shared runProcess() in claude-code.ts spawns processes, enforces wall-clock timeout, and parses cost via agent-specific patterns.
Cost Extraction & Reporting
tests/e2e/cost.ts
Agent-specific regex patterns for cost extraction from stdout (parseCostCents()), RunSummary interface (counts, totals, MatrixResult[]), writeSummary() to persist results JSON, and formatCents() for USD formatting.
Matrix Construction
tests/e2e/matrix.ts
Explicit ALL_DRIVERS and ALL_CASES arrays in fixed order, MatrixPoint interface for (case, agent) pairs, and buildMatrix() that cross-products cases/drivers while respecting each case's skipFor list.
Assertion Execution & Cleanup
tests/e2e/assertions.ts
AssertionRunner dispatch for four assertion types (stdout substring/regex, SQL SELECT with custom expect, hook-log substring), labeled failure reporting, and cleanupSessionRows() for best-effort SQL DELETE from sessions/memory tables with error aggregation.
Test Cases (4 scenarios)
tests/e2e/cases/01-capture-smoke.ts, 02-cat-index-md.ts, 03-grep-memory-summaries.ts, 04-session-start-inject.ts
Four cases: capture-smoke verifies hook log + DB row creation; cat-index-md checks virtual mount interception and index headers; grep-memory-summaries seeds a row and validates SQL fast-path; session-start-inject confirms injected context appears in agent output.
Main Test Runner
tests/e2e/runner.ts
CLI parser (--case, --agent, --keep-sandbox, --list, --help), credential loading (HIVEMIND_E2E_CREDS_JSON), provider-key-based skipping, per-matrix-point execution (sandbox create → install → setup → run → assert → cleanup), failure aggregation, and exit codes (0 pass, 1 failure, 2 harness error). Writes cost summary and optional per-case logs.

Documentation & Configuration

Layer / File(s) Summary
E2E Harness Documentation
tests/e2e/README.md
Comprehensive guide to tier-1 E2E matrix: running all cases or filtering by case/agent, required/optional environment variables, case/driver architecture, session ID lifecycle, cleanup semantics, provider-key-based skipping, and guidance on tier-2 separation.
CI/CD Workflow
.github/workflows/e2e.yml
Manual GitHub Actions trigger (workflow_dispatch), fork gate on activeloopai/hivemind, Node.js 22 setup, dependency/bundle build, CLI installation (claude, codex, pi, cursor-agent), conditional E2E run with case_filter and agent_filter dispatch inputs, and test summary artifact upload (30-day retention).
Script Registration & README
package.json, README.md
New npm run e2e script entry calling tsx tests/e2e/runner.ts, and README Development section documenting the cross-agent E2E tier-1 matrix with usage examples and link to tests/e2e/README.md.
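
The assertion layer summarized above can be pictured as a discriminated union plus a dispatcher. The sketch below covers only the two stdout-based types; select-from-db and hook-log-contains are omitted because they need a database client and a log file. Field names other than `type` and `label` are assumptions, not the definitions in tests/e2e/types.ts.

```typescript
// Hypothetical subset of the assertion union.
type Assertion =
  | { type: "stdout-contains"; label: string; value: string }
  | { type: "stdout-matches"; label: string; pattern: RegExp };

// Returns null on success, or a labeled failure message.
function runAssertion(a: Assertion, stdout: string): string | null {
  switch (a.type) {
    case "stdout-contains":
      return stdout.includes(a.value)
        ? null
        : `[${a.label}] stdout does not contain "${a.value}"`;
    case "stdout-matches":
      return a.pattern.test(stdout)
        ? null
        : `[${a.label}] stdout does not match ${a.pattern}`;
    default: {
      // Compile-time exhaustiveness check: adding a new union member
      // without a case above makes this assignment fail to type-check.
      const unreachable: never = a;
      throw new Error(`unknown assertion: ${JSON.stringify(unreachable)}`);
    }
  }
}

const failures = [
  runAssertion({ type: "stdout-contains", label: "greets", value: "hello" }, "hello world"),
  runAssertion({ type: "stdout-matches", label: "has digits", pattern: /\d+/ }, "hello world"),
].filter((f): f is string => f !== null);

console.log(failures); // one failure, labeled "has digits"
```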

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • activeloopai/hivemind#96: Adds session-notifications SessionStart hook and related claude-code output changes that directly affect session ID extraction and hook log parsing in the new E2E drivers and test assertions.

Suggested reviewers

  • efenocchi

🐰 A test harness for five agents runs, cross-product dreams take flight,
Sessions captured, costs extracted, assertions burning bright,
Sandbox homes and cleanup paths, deterministic IDs in place,
Tier-one validation dancing, keeping plugin bugs in check!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.71% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and concisely summarizes the main addition: a tier-1 cross-agent E2E matrix harness for testing multiple agents.
Description check | ✅ Passed | The description provides comprehensive coverage: summary, architecture overview, decisions made, setup instructions, and deferred work. All required sections are present and well-documented.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@claude

claude Bot commented May 11, 2026

Claude encountered an error. View job


I'll analyze this and get back to you.

@coderabbitai coderabbitai Bot requested a review from efenocchi May 11, 2026 23:54
@github-actions
Contributor

github-actions Bot commented May 11, 2026

Coverage Report

No src/*.ts files changed in this PR.

Generated for commit 473d539.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (3)
tests/e2e/agents/install-via-cli.ts (1)

58-65: ⚡ Quick win

Prefer close + single-settle guard for subprocess completion.

Using exit can race with final stdio flush. Switching to close and guarding settlement makes captured diagnostics more reliable.

Suggested refactor
 return new Promise((resolveP) => {
+  let settled = false;
+  const settle = (r: InstallResult) => {
+    if (settled) return;
+    settled = true;
+    clearTimeout(killTimer);
+    resolveP(r);
+  };
+
   const child = spawn(
     "npx",
     ["--yes", "tsx", cliEntry, agentArg, "install"],
@@
-  const killTimer = setTimeout(() => child.kill("SIGKILL"), timeoutMs);
-  child.on("exit", (code) => {
-    clearTimeout(killTimer);
-    resolveP({ exitCode: code ?? -1, stdout, stderr });
-  });
+  const killTimer = setTimeout(() => child.kill("SIGKILL"), timeoutMs);
+  child.on("close", (code) => {
+    settle({ exitCode: code ?? -1, stdout, stderr });
+  });
   child.on("error", (err) => {
-    clearTimeout(killTimer);
-    resolveP({ exitCode: -1, stdout, stderr: `${stderr}\nspawn error: ${err.message}` });
+    settle({ exitCode: -1, stdout, stderr: `${stderr}\nspawn error: ${err.message}` });
   });
 });
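
The single-settle guard in the suggestion can be isolated into a small helper, sketched here without spawning a real process. `settleOnce` and `InstallResultSketch` are hypothetical names; the point is only that whichever of "error" or "close" arrives first wins and later calls are ignored.

```typescript
interface InstallResultSketch {
  exitCode: number;
  stdout: string;
  stderr: string;
}

// Wrap a resolver so it fires at most once; onSettle is a hook for
// one-time teardown such as clearTimeout(killTimer).
function settleOnce<T>(
  resolve: (value: T) => void,
  onSettle?: () => void,
): (value: T) => void {
  let settled = false;
  return (value: T) => {
    if (settled) return;
    settled = true;
    onSettle?.();
    resolve(value);
  };
}

const received: InstallResultSketch[] = [];
let teardownCalls = 0;
const settle = settleOnce<InstallResultSketch>(
  (r) => { received.push(r); },
  () => { teardownCalls += 1; },
);

// Simulate an "error" event followed by a later "close" event.
settle({ exitCode: -1, stdout: "", stderr: "spawn error: ENOENT" });
settle({ exitCode: -1, stdout: "", stderr: "" });

console.log(received.length, teardownCalls); // 1 1
```

In a real driver, `settle` would be called from both `child.on("close", ...)` and `child.on("error", ...)`, so the promise resolves exactly once regardless of which events fire.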
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/agents/install-via-cli.ts` around lines 58 - 65, The handler
currently listens to child.on("exit", ...) and child.on("error", ...) which can
race with stdio flush; change to child.on("close", ...) and add a single-settle
guard (e.g., a boolean settled) so resolveP is only called once; in both the
"close" and "error" handlers clearTimeout(killTimer), set settled = true before
calling resolveP, and ensure you still return exitCode (code ?? -1) and include
combined stdout/stderr, appending the spawn error message to stderr in the
"error" path.
tests/e2e/assertions.ts (1)

155-170: ⚡ Quick win

LIKE wildcards in cleanup queries are unescaped but practically safe given controlled inputs.

Lines 155 and 169 use ILIKE '${sidLike.replace(/'/g, "''")}' without escaping % and _ metacharacters. However, the practical risk is minimal: sessionIds are internally generated in the fixed format e2e-${runId}-${caseId}-${agent} (e.g., e2e-2026-05-11T23-57-59-738546-01-capture-smoke-claude-code) and never contain these characters.

For defensive robustness, consider escaping LIKE metacharacters anyway:

Suggested fix
-  const sidLike = `%${sessionId}%`;
+  const escapeLike = (v: string) =>
+    v
+      .replace(/\\/g, "\\\\")
+      .replace(/%/g, "\\%")
+      .replace(/_/g, "\\_")
+      .replace(/'/g, "''");
+  const sidLike = `%${escapeLike(sessionId)}%`;
@@
-      `DELETE FROM "${ctx.creds.sessionsTable}" WHERE path ILIKE '${sidLike.replace(/'/g, "''")}'`,
+      `DELETE FROM "${ctx.creds.sessionsTable}" WHERE path ILIKE '${sidLike}' ESCAPE '\\'`,
@@
-      `DELETE FROM "${ctx.creds.memoryTable}" WHERE path ILIKE '${sidLike.replace(/'/g, "''")}'`,
+      `DELETE FROM "${ctx.creds.memoryTable}" WHERE path ILIKE '${sidLike}' ESCAPE '\\'`,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/assertions.ts` around lines 155 - 170, The ILIKE patterns built for
sessionsApi.query and memoryApi.query use sidLike without escaping SQL LIKE
metacharacters (% and _), so update the code that creates sidLike (used in the
DELETE statements passed to sessionsApi.query and memoryApi.query) to escape %
and _ (e.g., replace '%' and '_' with escaped variants) and include an explicit
ESCAPE clause or use a parameterized query to ensure the escaped pattern is
respected; reference the sidLike variable and the calls to sessionsApi.query and
memoryApi.query when making the change.
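
A worked example may help show what the escaping buys. The helper below mirrors the `escapeLike` from the suggested fix; `likeContains` is a hypothetical wrapper, and whether the target SQL backend honors a backslash `ESCAPE` clause is an assumption that would need checking against the real workspace.

```typescript
// Escape LIKE metacharacters, then SQL single quotes. Backslashes go
// first so the escapes added afterwards are not doubled.
function escapeLike(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/%/g, "\\%")
    .replace(/_/g, "\\_")
    .replace(/'/g, "''");
}

// Build a "contains" pattern: only the outer % signs act as wildcards.
function likeContains(value: string): string {
  return `%${escapeLike(value)}%`;
}

console.log(likeContains("e2e_run%1")); // %e2e\_run\%1%
console.log(likeContains("it's"));      // %it''s%
```

Without the escaping, the `_` in the first input would match any single character and the inner `%` any run of characters, so a DELETE could reach rows from other sessions.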
tests/e2e/runner.ts (1)

212-214: ⚡ Quick win

Run driver cleanup before tearing down sandbox.home.

When keepSandbox is false, sandbox.destroy() can remove the same HOME path you pass into a.cleanup(). Any cleanup that needs files under the sandbox will silently become a no-op on the default path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/runner.ts` around lines 212 - 214, The cleanup caller currently
destroys the sandbox before invoking action-specific cleanup, which can remove
the HOME path passed to a.cleanup(sandbox.home); change the order so that if
a.cleanup exists you await it (inside the existing try/catch/“best-effort”
block) before calling sandbox.destroy(), but only do this reorder when
keepSandbox is false (leave behavior unchanged when keepSandbox is true); keep
the error swallowing behavior and the call signature a.cleanup(sandbox.home)
intact.
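
The ordering this comment asks for can be sketched as follows: run the driver's cleanup first, swallow its errors as warnings, and destroy the sandbox last. `teardown`, `DriverSketch`, and `SandboxSketch` are hypothetical stand-ins for the code in tests/e2e/runner.ts.

```typescript
interface DriverSketch {
  cleanup?: (home: string) => Promise<void>;
}

interface SandboxSketch {
  home: string;
  destroy: () => void;
}

// Driver cleanup runs while HOME still exists; the sandbox goes last.
// Returns any warnings instead of throwing (best-effort semantics).
async function teardown(
  driver: DriverSketch,
  sandbox: SandboxSketch,
  keepSandbox: boolean,
): Promise<string[]> {
  const warnings: string[] = [];
  try {
    if (driver.cleanup) await driver.cleanup(sandbox.home);
  } catch (err) {
    warnings.push(`cleanup warning: ${err instanceof Error ? err.message : String(err)}`);
  } finally {
    if (!keepSandbox) sandbox.destroy();
  }
  return warnings;
}

// Demo with in-memory stand-ins; nothing touches the filesystem.
const events: string[] = [];
const demoDriver: DriverSketch = {
  cleanup: async (home) => { events.push(`driver.cleanup(${home})`); },
};
const demoSandbox: SandboxSketch = {
  home: "/tmp/hivemind-e2e-demo",
  destroy: () => { events.push("sandbox.destroy()"); },
};

void teardown(demoDriver, demoSandbox, false).then(() => {
  console.log(events.join(" -> "));
});
```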
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/e2e.yml:
- Around line 59-64: Pin the CLI installs and remove the insecure curl|bash by
specifying explicit versions for the npm installs (replace "npm install -g
`@anthropic-ai/claude-code` `@openai/codex`" and "npm install -g `@piapp/cli` || true"
with locked version specifiers like `@version`) and replace the cursor installer
pipeline ("curl -fsSL https://cursor.com/install-cli.sh | bash -s -- --print")
with a verified download-and-verify flow: download the release artifact to a
temp file, validate its SHA256 (or signature) against a checked-in or CI-managed
fingerprint, then execute the verified binary/installer; ensure CI fails if
checksum verification fails and avoid swallowing errors with "|| true".

In `@tests/e2e/agents/claude-code.ts`:
- Around line 89-104: Replace the child.on("exit", ...) handler with
child.on("close", ...) so you only resolve once stdout/stderr streams are fully
drained; inside the new "close" callback use a simple boolean guard (e.g., let
resolved = false; if (resolved) return; resolved = true;) to prevent duplicate
resolution, then compute durationMs, sessionId via extractSessionId(stdout,
stderr, home) (falling back to seedSessionId), inferAgentFromBin(bin),
parseCostCents(agent, stdout), and call resolve({...}) exactly once with stdout,
stderr, exitCode (use code ?? -1), sessionId, costCents, and durationMs.

In `@tests/e2e/cases/01-capture-smoke.ts`:
- Around line 33-35: The test's SQL builder uses raw ILIKE with run.sessionId
which can contain SQL LIKE wildcards (%) or (_) and thus over-match; replace the
current string interpolation in the sql: ({ ctx, run }) => ... block with a call
to the shared sqlLike() helper from src/utils/sql.ts to escape the session id
and produce a pattern like ILIKE sqlLike(run.sessionId) ESCAPE '\\' (or
otherwise use sqlLike to produce the escaped '%...%' pattern), ensuring you
reference the existing sql property in this test and the run.sessionId value
when applying the fix.

In `@tests/e2e/cases/02-cat-index-md.ts`:
- Around line 35-37: The current regex (/Last
Updated|Created|Project|Description/) is too permissive; replace it with a
stricter pattern that requires the index header tokens together in order (for
example match the full header line like /Last
Updated\s+Created\s+Project\s+Description/ or use positive lookaheads to assert
all four tokens are present) in the test case where the regex is defined (the
"type: 'stdout-matches'" assertion labeled "agent saw the virtual index's table
headers") so the assertion only passes when the actual header line appears.

In `@tests/e2e/cases/03-grep-memory-summaries.ts`:
- Around line 38-50: The INSERT builds a SQL string with unescaped
interpolations (path, filename derived from ctx.sessionId, and ctx.agent) passed
to memoryApi.query, which can break if values contain single quotes; fix by
using a parameterized query or escaping those values before concatenation:
convert the query to use placeholders and pass [path, `${ctx.sessionId}.md`,
message, 'e2e', Buffer.byteLength(message, "utf-8"), 'e2e', 'grep-sentinel',
ctx.agent] as parameters to memoryApi.query, or at minimum replace single quotes
in path, filename and ctx.agent (e.g. .replace(/'/g, "''")) before embedding
them; keep the table identifier ctx.creds.memoryTable as-is but ensure proper
quoting when using parameters.

In `@tests/e2e/cases/04-session-start-inject.ts`:
- Around line 12-15: The test docstring promises anchoring on the "THREE tiers"
phrase but the assertions never check for it; update the test in
tests/e2e/cases/04-session-start-inject.ts to assert that the agent's response
(the variable holding the reply/response used for the existing "index.md" and
"summaries" checks) contains the substring "THREE tiers", and add the identical
assertion to the related cases covering lines 25-41 so all three anchors ("THREE
tiers", "index.md", "summaries") are validated.

In `@tests/e2e/runner.ts`:
- Around line 152-154: The early-return for point.skipped currently returns
failure: null and passed: true which makes skips count as passed; update the
returned result object for the skipped branch (the block referencing
point.skipped and returning { case: c.id, agent: a.id, ... }) to mark the test
as skipped—e.g. set passed: false and set a clear skip indicator in the failure
or status field (such as failure: { skipped: true } or status: "skipped" and
include any skip reason) so the reporting logic can treat it as skipped instead
of passed.
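
To illustrate the stricter pattern requested for tests/e2e/cases/02-cat-index-md.ts, the sketch below contrasts the alternation with a lookahead-based regex that requires all four tokens on one line. The sample header line is invented for the demonstration; the real index.md header layout is not shown in this PR.

```typescript
// Alternation: passes if ANY one of the four tokens appears anywhere.
const permissive = /Last Updated|Created|Project|Description/;

// Lookaheads: each (?=.*token) must succeed from the start of the same
// line. With the m flag, ^ and $ anchor per line, and . never crosses a
// newline, so all four tokens must share one line (in any order).
const strict =
  /^(?=.*Last Updated)(?=.*Created)(?=.*Project)(?=.*Description).*$/m;

const unrelated = "Created a new file for you.";
// Hypothetical header line, for illustration only.
const headerLine = "| Session | Created | Last Updated | Project | Description |";
const multiLine = `some agent chatter\n${headerLine}\nmore output`;

console.log(permissive.test(unrelated)); // true  (false positive)
console.log(strict.test(unrelated));     // false
console.log(strict.test(multiLine));     // true
```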


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7fa78723-d157-4317-a189-c517320f4d8f

📥 Commits

Reviewing files that changed from the base of the PR and between e6f4a02 and 9d0e181.

📒 Files selected for processing (20)
  • .github/workflows/e2e.yml
  • README.md
  • package.json
  • tests/e2e/README.md
  • tests/e2e/agents/claude-code.ts
  • tests/e2e/agents/codex.ts
  • tests/e2e/agents/cursor-agent.ts
  • tests/e2e/agents/hermes.ts
  • tests/e2e/agents/install-via-cli.ts
  • tests/e2e/agents/pi.ts
  • tests/e2e/assertions.ts
  • tests/e2e/cases/01-capture-smoke.ts
  • tests/e2e/cases/02-cat-index-md.ts
  • tests/e2e/cases/03-grep-memory-summaries.ts
  • tests/e2e/cases/04-session-start-inject.ts
  • tests/e2e/cost.ts
  • tests/e2e/matrix.ts
  • tests/e2e/runner.ts
  • tests/e2e/sandbox.ts
  • tests/e2e/types.ts

Comment thread .github/workflows/e2e.yml
Comment on lines +59 to +64
npm install -g @anthropic-ai/claude-code @openai/codex
# Pi ships via npm too.
npm install -g @piapp/cli || true
# cursor-agent and hermes — install via curl when available;
# if not, their points fail loudly rather than silently skip.
curl -fsSL https://cursor.com/install-cli.sh | bash -s -- --print 2>/dev/null || echo "cursor-agent install skipped"

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

find . -name "e2e.yml" -o -name "e2e.yaml" | head -5

Repository: activeloopai/hivemind

Length of output: 93


🏁 Script executed:

cat -n .github/workflows/e2e.yml

Repository: activeloopai/hivemind

Length of output: 4198


Pin and verify the agent installers.

This step pulls unpinned CLI versions, making runs non-reproducible across days or re-runs. More significantly, the curl-piped installer at line 64 executes a mutable remote script from cursor.com without checksum verification—a supply-chain risk. Pin CLI versions and replace the curl installer with a verified binary or checksum-validated script.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e.yml around lines 59 - 64, Pin the CLI installs and
remove the insecure curl|bash by specifying explicit versions for the npm
installs (replace "npm install -g `@anthropic-ai/claude-code` `@openai/codex`" and
"npm install -g `@piapp/cli` || true" with locked version specifiers like
`@version`) and replace the cursor installer pipeline ("curl -fsSL
https://cursor.com/install-cli.sh | bash -s -- --print") with a verified
download-and-verify flow: download the release artifact to a temp file, validate
its SHA256 (or signature) against a checked-in or CI-managed fingerprint, then
execute the verified binary/installer; ensure CI fails if checksum verification
fails and avoid swallowing errors with "|| true".
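
The download-and-verify flow described here reduces to comparing a SHA-256 digest against a pinned value before executing anything. The sketch below shows only that comparison, using Node's built-in crypto module; `verifyChecksum` is a hypothetical helper, and the pinned digest would have to be maintained by the repository (this PR does not say whether the upstream installer publishes one).

```typescript
import { createHash } from "node:crypto";

function sha256Hex(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Throw (and so fail the CI step) when the downloaded bytes do not
// match the pinned digest; never fall through to executing them.
function verifyChecksum(data: Uint8Array | string, expectedHex: string): void {
  const actual = sha256Hex(data);
  if (actual !== expectedHex.toLowerCase()) {
    throw new Error(`checksum mismatch: expected ${expectedHex}, got ${actual}`);
  }
}

// Demo with a well-known test vector: SHA-256("abc").
const pinned = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad";
verifyChecksum("abc", pinned); // passes silently

let rejected = false;
try {
  verifyChecksum("abc (tampered)", pinned);
} catch {
  rejected = true;
}
console.log(rejected); // true
```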

Comment on lines +89 to +104
child.on("exit", (code) => {
clearTimeout(killTimer);
const durationMs = Date.now() - startedAt;
const home = env.HOME ?? "";
const sessionId = extractSessionId(stdout, stderr, home) ?? seedSessionId;
const agent = inferAgentFromBin(bin);
const costCents = parseCostCents(agent, stdout);
resolve({
stdout,
stderr,
exitCode: code ?? -1,
sessionId,
costCents,
durationMs,
});
});

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n tests/e2e/agents/claude-code.ts | head -120

Repository: activeloopai/hivemind

Length of output: 5156


🏁 Script executed:

# Check the broader context around the issue
sed -n '71,120p' tests/e2e/agents/claude-code.ts

Repository: activeloopai/hivemind

Length of output: 1579


Resolve the process on close, not exit.

exit can fire while the child's stdout/stderr streams are still open, so buffered output may not have been delivered yet. Because this code depends on fully accumulated stdout and stderr for extractSessionId (line 93) and parseCostCents (line 95), resolving on exit creates a race in which trailing output is lost, causing flaky results at the pass/fail boundary.

Switch to the close event and add a guard flag to prevent duplicate resolution:

Suggested fix
+    let exitCode = -1;
+    let settled = false;
     child.on("exit", (code) => {
+      exitCode = code ?? -1;
+    });
+    child.on("close", () => {
+      if (settled) return;
+      settled = true;
       clearTimeout(killTimer);
       const durationMs = Date.now() - startedAt;
       const home = env.HOME ?? "";
       const sessionId = extractSessionId(stdout, stderr, home) ?? seedSessionId;
       const agent = inferAgentFromBin(bin);
       const costCents = parseCostCents(agent, stdout);
       resolve({
         stdout,
         stderr,
-        exitCode: code ?? -1,
+        exitCode,
         sessionId,
         costCents,
         durationMs,
       });
     });
     child.on("error", (err) => {
+      if (settled) return;
+      settled = true;
       clearTimeout(killTimer);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/agents/claude-code.ts` around lines 89 - 104, Replace the
child.on("exit", ...) handler with child.on("close", ...) so you only resolve
once stdout/stderr streams are fully drained; inside the new "close" callback
use a simple boolean guard (e.g., let resolved = false; if (resolved) return;
resolved = true;) to prevent duplicate resolution, then compute durationMs,
sessionId via extractSessionId(stdout, stderr, home) (falling back to
seedSessionId), inferAgentFromBin(bin), parseCostCents(agent, stdout), and call
resolve({...}) exactly once with stdout, stderr, exitCode (use code ?? -1),
sessionId, costCents, and durationMs.
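
For reference, a standalone sketch of the resolve-on-close pattern, independent of the driver code. `runToClose` and `RunResult` are illustrative names, not part of the harness; the sketch relies on the `close` listener receiving the exit code itself, so it does not need a separate `exit` listener.

```typescript
import { spawn } from "node:child_process";

interface RunResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

// Resolve on "close" (the stdio streams have ended), not on "exit".
function runToClose(bin: string, args: string[]): Promise<RunResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(bin, args);
    let stdout = "";
    let stderr = "";
    let settled = false; // guards against resolving after an "error"
    child.stdout.setEncoding("utf8");
    child.stderr.setEncoding("utf8");
    child.stdout.on("data", (chunk: string) => {
      stdout += chunk;
    });
    child.stderr.on("data", (chunk: string) => {
      stderr += chunk;
    });
    child.on("error", (err) => {
      if (settled) return;
      settled = true;
      reject(err);
    });
    child.on("close", (code) => {
      if (settled) return;
      settled = true;
      resolve({ stdout, stderr, exitCode: code ?? -1 });
    });
  });
}
```

Calling `runToClose(process.execPath, ["-e", "console.log('hi')"])` should yield the full output only once both pipes have closed.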

Comment on lines +33 to +35
sql: ({ ctx, run }) =>
`SELECT count(*) AS n FROM "${ctx.creds.sessionsTable}" ` +
`WHERE path ILIKE '%${run.sessionId.replace(/'/g, "''")}%'`,

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

find . -name "01-capture-smoke.ts" -type f

Repository: activeloopai/hivemind

Length of output: 103


🏁 Script executed:

sed -n '30,40p' ./tests/e2e/cases/01-capture-smoke.ts

Repository: activeloopai/hivemind

Length of output: 725


🏁 Script executed:

rg -A 5 -B 5 "sessionId" ./tests/e2e/cases/01-capture-smoke.ts | head -50

Repository: activeloopai/hivemind

Length of output: 837


🏁 Script executed:

rg "sessionId\s*=" --type ts -A 3 | head -80

Repository: activeloopai/hivemind

Length of output: 5360


🏁 Script executed:

rg "resolveSessionId|extractSessionId" --type ts -A 5 | head -100

Repository: activeloopai/hivemind

Length of output: 5129


🏁 Script executed:

rg "ILIKE|LIKE" --type ts -B 3 -A 1

Repository: activeloopai/hivemind

Length of output: 47750


🏁 Script executed:

cat -n src/utils/sql.ts | head -50

Repository: activeloopai/hivemind

Length of output: 1333


🏁 Script executed:

cat -n tests/e2e/assertions.ts | grep -A 10 "sidLike"

Repository: activeloopai/hivemind

Length of output: 1095


🏁 Script executed:

head -20 tests/e2e/cases/01-capture-smoke.ts

Repository: activeloopai/hivemind

Length of output: 879


🏁 Script executed:

rg "Deeplake" tests/e2e/sandbox.ts -A 2 -B 2 | head -20

Repository: activeloopai/hivemind

Length of output: 47


🏁 Script executed:

rg "postgres|postgresql|ESCAPE" tests/e2e/ -l

Repository: activeloopai/hivemind

Length of output: 47


Escape LIKE wildcards in the session-id assertion query.

Line 35 can over-match when run.sessionId contains % or _, causing false-positive assertion passes. The codebase already uses sqlLike() from src/utils/sql.ts with ESCAPE '\\' for this purpose (see grep-core.ts, virtual-table-query.ts, mcp-server.ts).

Suggested fix
       // is run.sessionId, captured by the driver from the hook log.
       sql: ({ ctx, run }) =>
+        {
+          const sid = sqlLike(run.sessionId);
+          return (
         `SELECT count(*) AS n FROM "${ctx.creds.sessionsTable}" ` +
-        `WHERE path ILIKE '%${run.sessionId.replace(/'/g, "''")}%'`,
+        `WHERE path ILIKE '%${sid}%' ESCAPE '\\'`
+          );
+        },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/cases/01-capture-smoke.ts` around lines 33 - 35, The test's SQL
builder uses raw ILIKE with run.sessionId which can contain SQL LIKE wildcards
(%) or (_) and thus over-match; replace the current string interpolation in the
sql: ({ ctx, run }) => ... block with a call to the shared sqlLike() helper from
src/utils/sql.ts to escape the session id and produce a pattern like ILIKE
sqlLike(run.sessionId) ESCAPE '\\' (or otherwise use sqlLike to produce the
escaped '%...%' pattern), ensuring you reference the existing sql property in
this test and the run.sessionId value when applying the fix.
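
To illustrate the over-match, here is a sketch of what a LIKE-escaping helper might do. `escapeLikeLiteral` is a hypothetical stand-in: the real `sqlLike()` in src/utils/sql.ts is not shown in this thread and may differ.

```typescript
// Hypothetical stand-in for the repository's sqlLike() helper.
function escapeLikeLiteral(value: string): string {
  return value
    .replace(/\\/g, "\\\\") // escape the escape character first
    .replace(/[%_]/g, (m) => `\\${m}`) // neutralize LIKE wildcards
    .replace(/'/g, "''"); // double single quotes for the SQL string literal
}

// Builds the WHERE fragment; the template literal emits a single backslash
// as the ESCAPE character.
function sessionPathFilter(sessionId: string): string {
  return `path ILIKE '%${escapeLikeLiteral(sessionId)}%' ESCAPE '\\'`;
}
```

Without the escaping, a session id such as `run_1` would also match paths containing `runA1`, because `_` matches any single character.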

Comment on lines +35 to +37
type: "stdout-matches",
regex: /Last Updated|Created|Project|Description/,
label: "agent saw the virtual index's table headers",

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make the index-header assertion stricter to avoid false passes.

Line 36 passes if any single token appears. That can green-light unrelated stdout and weaken this case’s signal.

Suggested fix
     {
       type: "stdout-matches",
-      regex: /Last Updated|Created|Project|Description/,
-      label: "agent saw the virtual index's table headers",
+      regex: /(?:Last Updated|Created)/,
+      label: "agent saw a timestamp column in the virtual index",
+    },
+    {
+      type: "stdout-contains",
+      substring: "Project",
+      label: "agent saw Project column",
+    },
+    {
+      type: "stdout-contains",
+      substring: "Description",
+      label: "agent saw Description column",
     },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/cases/02-cat-index-md.ts` around lines 35 - 37, The current regex
(/Last Updated|Created|Project|Description/) is too permissive; replace it with
a stricter pattern that requires the index header tokens together in order (for
example match the full header line like /Last
Updated\s+Created\s+Project\s+Description/ or use positive lookaheads to assert
all four tokens are present) in the test case where the regex is defined (the
"type: 'stdout-matches'" assertion labeled "agent saw the virtual index's table
headers") so the assertion only passes when the actual header line appears.
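
The lookahead variant mentioned above could look roughly like this. It is a sketch that assumes the four header names are exactly as written in the current regex and makes no assumption about column order; `sawIndexHeaders` is an illustrative name.

```typescript
// Matches only when all four header tokens occur somewhere in stdout,
// in any order. [\s\S]* is used so the match can span multiple lines.
const ALL_INDEX_HEADERS =
  /^(?=[\s\S]*Last Updated)(?=[\s\S]*Created)(?=[\s\S]*Project)(?=[\s\S]*Description)/;

function sawIndexHeaders(stdout: string): boolean {
  return ALL_INDEX_HEADERS.test(stdout);
}
```

A single `stdout-matches` assertion with this regex keeps the case at one assertion while still rejecting output that mentions only one of the tokens.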

Comment on lines +38 to +50
const path = `/summaries/e2e/${ctx.sessionId}.md`;
const message = JSON.stringify({
type: "summary",
session_id: ctx.sessionId,
content: `## E2E grep sentinel\n\nMarker: ${SENTINEL}\n`,
}).replace(/'/g, "''");
await memoryApi.query(
`INSERT INTO "${ctx.creds.memoryTable}" ` +
`(id, path, filename, message, author, size_bytes, project, description, agent, creation_date, last_update_date) ` +
`VALUES (gen_random_uuid(), '${path}', '${ctx.sessionId}.md', '${message}'::jsonb, ` +
`'e2e', ${Buffer.byteLength(message, "utf-8")}, 'e2e', 'grep-sentinel', '${ctx.agent}', ` +
`CURRENT_TIMESTAMP, CURRENT_TIMESTAMP)`,
);

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n tests/e2e/cases/03-grep-memory-summaries.ts | head -60

Repository: activeloopai/hivemind

Length of output: 2894


Escape all interpolated SQL string values in the INSERT statement.

Lines 47–48 interpolate path, filename, and ctx.agent directly without escaping. If these inputs contain single quotes, the query syntax will break. The message variable is already escaped, but the other string values must be escaped consistently.

Suggested fix
+const sqlQuote = (v: string) => v.replace(/'/g, "''");
+
 const path = `/summaries/e2e/${ctx.sessionId}.md`;
-const message = JSON.stringify({
+const messageJson = JSON.stringify({
   type: "summary",
   session_id: ctx.sessionId,
   content: `## E2E grep sentinel\n\nMarker: ${SENTINEL}\n`,
-}).replace(/'/g, "''");
+});
+const message = sqlQuote(messageJson);
+const filename = sqlQuote(`${ctx.sessionId}.md`);
+const pathSql = sqlQuote(path);
+const agentSql = sqlQuote(ctx.agent);

 await memoryApi.query(
   `INSERT INTO "${ctx.creds.memoryTable}" ` +
   `(id, path, filename, message, author, size_bytes, project, description, agent, creation_date, last_update_date) ` +
-  `VALUES (gen_random_uuid(), '${path}', '${ctx.sessionId}.md', '${message}'::jsonb, ` +
-  `'e2e', ${Buffer.byteLength(message, "utf-8")}, 'e2e', 'grep-sentinel', '${ctx.agent}', ` +
+  `VALUES (gen_random_uuid(), '${pathSql}', '${filename}', '${message}'::jsonb, ` +
+  `'e2e', ${Buffer.byteLength(messageJson, "utf-8")}, 'e2e', 'grep-sentinel', '${agentSql}', ` +
   `CURRENT_TIMESTAMP, CURRENT_TIMESTAMP)`,
 );
🧰 Tools
🪛 OpenGrep (1.20.0)

[ERROR] 44-50: SQL query built via string concatenation or template literal passed to query()/execute(). Use parameterized queries instead.

(coderabbit.sql-injection.raw-query-concat-js)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/cases/03-grep-memory-summaries.ts` around lines 38 - 50, The INSERT
builds a SQL string with unescaped interpolations (path, filename derived from
ctx.sessionId, and ctx.agent) passed to memoryApi.query, which can break if
values contain single quotes; fix by using a parameterized query or escaping
those values before concatenation: convert the query to use placeholders and
pass [path, `${ctx.sessionId}.md`, message, 'e2e', Buffer.byteLength(message,
"utf-8"), 'e2e', 'grep-sentinel', ctx.agent] as parameters to memoryApi.query,
or at minimum replace single quotes in path, filename and ctx.agent (e.g.
.replace(/'/g, "''")) before embedding them; keep the table identifier
ctx.creds.memoryTable as-is but ensure proper quoting when using parameters.
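
One detail in the suggested fix is easy to miss: `size_bytes` is computed from the unescaped JSON rather than the SQL-escaped string. A small sketch of why that matters, using a hypothetical `jsonbLiteral` helper that is not part of the harness:

```typescript
import { Buffer } from "node:buffer";

const sqlQuote = (v: string): string => v.replace(/'/g, "''");

// Returns the SQL literal plus the byte size of the JSON text itself.
function jsonbLiteral(payload: unknown): { literal: string; sizeBytes: number } {
  const json = JSON.stringify(payload);
  return {
    literal: `'${sqlQuote(json)}'::jsonb`,
    // Measure the unescaped JSON: the doubled quotes exist only in the SQL text.
    sizeBytes: Buffer.byteLength(json, "utf-8"),
  };
}
```

For a payload containing an apostrophe, measuring the escaped string would overcount by one byte per quote; the current sentinel content has none, so the difference only shows up once the seeded content changes.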

Comment on lines +12 to +15
* Anchoring on three independently-stable strings: "THREE tiers",
* "index.md", "summaries". If any of them is missing from the agent's
* reply, either the inject didn't fire or the runtime stripped it.
*/

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing the “three tiers” anchor weakens this case’s signal.

The docstring says this case anchors on the “THREE tiers” framing, but the assertions never validate it. Adding that check tightens intent and reduces false positives.

Suggested patch
   assertions: [
+    {
+      type: "stdout-matches",
+      regex: /\b(?:three|3)\s+tiers?\b/i,
+      label: "agent recalls three-tier framing",
+    },
     {
       type: "stdout-matches",
       regex: /index\.md/i,
       label: "agent recalls index.md tier",
     },

Also applies to: 25-41

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/cases/04-session-start-inject.ts` around lines 12 - 15, The test
docstring promises anchoring on the "THREE tiers" phrase but the assertions
never check for it; update the test in
tests/e2e/cases/04-session-start-inject.ts to assert that the agent's response
(the variable holding the reply/response used for the existing "index.md" and
"summaries" checks) contains the substring "THREE tiers", and add the identical
assertion to the related cases covering lines 25-41 so all three anchors ("THREE
tiers", "index.md", "summaries") are validated.
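
As a quick illustration of how the three anchors combine, assuming a `/summaries/i` check alongside the `index.md` one shown in the patch (the exact third regex is not visible in this thread):

```typescript
const THREE_TIERS = /\b(?:three|3)\s+tiers?\b/i;
const INJECT_ANCHORS: RegExp[] = [THREE_TIERS, /index\.md/i, /summaries/i];

// True only when every anchor appears in the agent's reply.
function recallsAllAnchors(reply: string): boolean {
  return INJECT_ANCHORS.every((re) => re.test(reply));
}
```

A reply that mentions index.md and summaries but never the three-tier framing would pass today and fail with the extra anchor.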

Comment thread tests/e2e/runner.ts
Comment on lines +152 to +154
if (point.skipped) {
return { case: c.id, agent: a.id, passed: true, failure: null, costCents: null, durationMs: 0, sessionId: "" };
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve matrix-defined skips as skips in the result.

This branch returns failure: null, so skipFor combinations are printed as ok and counted under passed instead of skipped. That makes the summary falsely green even though nothing ran.

Suggested fix
   if (point.skipped) {
-    return { case: c.id, agent: a.id, passed: true, failure: null, costCents: null, durationMs: 0, sessionId: "" };
+    return {
+      case: c.id,
+      agent: a.id,
+      passed: true,
+      failure: `[skip] ${point.skipReason ?? "matrix skip"}`,
+      costCents: null,
+      durationMs: 0,
+      sessionId: "",
+    };
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/runner.ts` around lines 152 - 154, The early-return for
point.skipped currently returns failure: null and passed: true which makes skips
count as passed; update the returned result object for the skipped branch (the
block referencing point.skipped and returning { case: c.id, agent: a.id, ... })
to mark the test as skipped—e.g. set passed: false and set a clear skip
indicator in the failure or status field (such as failure: { skipped: true } or
status: "skipped" and include any skip reason) so the reporting logic can treat
it as skipped instead of passed.
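
One way to make skips unambiguous is an explicit tri-state status instead of overloading `passed` and `failure`. This is a sketch with hypothetical type names (`PointStatus`, `PointResult`), not the harness's actual result shape:

```typescript
type PointStatus = "passed" | "failed" | "skipped";

interface PointResult {
  case: string;
  agent: string;
  status: PointStatus;
  reason: string | null; // failure message or skip rationale
}

// Counts each status separately so skipped points never inflate "passed".
function summarize(results: PointResult[]): Record<PointStatus, number> {
  const counts: Record<PointStatus, number> = { passed: 0, failed: 0, skipped: 0 };
  for (const r of results) counts[r.status] += 1;
  return counts;
}
```

With this shape the summary line can print all three counts, and the exit code can depend on `failed` alone.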

kaghni added 4 commits May 12, 2026 01:43
Adds `tests/e2e/creds-bootstrap.ts` with two resolution modes:

1. CI: `HIVEMIND_E2E_CREDS_JSON` env var contains a full credentials.json
   blob — used unchanged, no API lookup.

2. Local: read the operator's real `~/.deeplake/credentials.json` (token
   + orgId stay) and resolve a fresh workspaceId by NAME from the
   workspace named `hivemind_e2e_test` (override with
   HIVEMIND_E2E_WORKSPACE_NAME). The real creds file is read-only here —
   no `saveCredentials()` call, no `hivemind workspace <id>` invocation
   — so a harness crash mid-run cannot leave the operator on the wrong
   workspace.

This replaces the previous design where local devs had to maintain a
separate HIVEMIND_E2E_CREDS_JSON blob. Now `npm run e2e` "just works"
for anyone with a working `hivemind login` and access to the
hivemind_e2e_test workspace. CI still uses the explicit blob mode
because there's no logged-in operator on the runner.

Both modes share the table-suffix logic (HIVEMIND_E2E_TABLE_SUFFIX) so
concurrent dev runs don't collide on row paths.

Updates README + plan to document the two modes. Renames the canonical
test workspace from `hivemind-e2e` to `hivemind_e2e_test` to match the
intended convention.

Untested still: live spawn against the real workspace; the workspace
name lookup against listWorkspaces() (the helper itself is well-tested
in the existing CLI suite, but the harness-side glue isn't).
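
The two resolution modes can be sketched as a pure function. This is an illustration only: `resolveE2ECreds` and the `Workspace` shape are hypothetical, and the operator creds and workspace list are passed in rather than read from disk or the API.

```typescript
interface Creds {
  token: string;
  orgId: string;
  workspaceId: string;
}

interface Workspace {
  id: string;
  name: string;
}

// All inputs are injected so the sketch never touches the real creds file.
function resolveE2ECreds(
  env: Record<string, string | undefined>,
  operatorCreds: Creds,
  workspaces: Workspace[],
): Creds {
  const blob = env.HIVEMIND_E2E_CREDS_JSON;
  if (blob) return JSON.parse(blob) as Creds; // CI mode: used unchanged

  const name = env.HIVEMIND_E2E_WORKSPACE_NAME ?? "hivemind_e2e_test";
  const ws = workspaces.find((w) => w.name === name);
  if (!ws) throw new Error(`workspace "${name}" not found`);
  // Local mode: return a copy; the operator's creds are never mutated or saved.
  return { ...operatorCreds, workspaceId: ws.id };
}
```

Returning a copy is what keeps the read-only guarantee: nothing in this path writes back to `~/.deeplake/credentials.json`.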

Two small fixes that came up in the "things that may bite" list:

1. install-via-cli.ts used `npx --yes tsx src/cli/index.ts <agent> install`
   to install hivemind into the tmp HOME. That worked on a local machine
   with npm's offline cache populated, but on a fresh runner (or a CI box
   that hasn't seen tsx before) `npx --yes` would silently fetch tsx from
   the network mid-test, occasionally fail, and leave a confusing "exit
   1, no stderr" failure on whichever per-agent point fired first.

   Now spawn `process.execPath bundle/cli.js <agent> install`. That:
     - removes the tsx runtime dependency (the harness only needs tsx
       at its own invocation seam, via `npm run e2e`),
     - exercises the actual artifact users get on `npm install -g`,
       catching bundling regressions (esbuild dropping a helper,
       wrong flag default) at the e2e layer too,
     - uses process.execPath instead of "node" so the spawn picks up
       the correct node binary in nvm-managed setups.

   Added a pre-flight check: if bundle/cli.js is missing the harness
   exits with a clear "run npm run build before npm run e2e" message
   instead of a cryptic "Cannot find module" stderr.

2. README's HIVEMIND_E2E_TABLE_SUFFIX guidance was misleading. It
   claimed concurrent runs would collide on row paths without the
   suffix; in fact every session_id embeds a unique runId timestamp
   (see sandbox.ts:buildSessionId), so concurrent runs are naturally
   isolated. Rewrote the guidance: the suffix is only useful when the
   e2e workspace deliberately maintains per-dev tables.
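
Roughly, the install spawn and pre-flight check from item 1 amount to the following. `buildInstallCommand` is a hypothetical name, and the `exists` parameter is injected only to keep the sketch testable.

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

interface InstallCommand {
  bin: string;
  args: string[];
}

// Builds the command for `<node> bundle/cli.js <agent> install`, failing early
// with a readable message when the bundle has not been built.
function buildInstallCommand(
  repoRoot: string,
  agent: string,
  exists: (p: string) => boolean = existsSync,
): InstallCommand {
  const cli = join(repoRoot, "bundle", "cli.js");
  if (!exists(cli)) {
    throw new Error("bundle/cli.js is missing: run `npm run build` before `npm run e2e`");
  }
  // process.execPath is the node binary running the harness, which may differ
  // from whatever "node" resolves to on PATH in nvm-managed setups.
  return { bin: process.execPath, args: [cli, agent, "install"] };
}
```

The returned `bin` and `args` can be handed straight to `spawn`, with `HOME` pointed at the sandbox directory.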

Three changes that collapse the engineer-facing UX to one command and
make the matrix's role in release discipline explicit.

1. Auto-build pre-flight in tests/e2e/runner.ts.

   Drivers other than claude-code spawn `node bundle/cli.js <agent>
   install`. A missing bundle/cli.js used to fail per-point with a
   confusing "no such file" stderr; now the runner detects it before
   any spawns, runs `npm run build` once, and continues. Honors
   HIVEMIND_E2E_SKIP_BUILD=1 for inner-loop iteration on the harness
   itself when the bundle is current.

   Result: `npm run e2e` from a fresh checkout works without a
   separate `npm run build` step. Steady state is one command.

2. tests/e2e/README.md collapses to that single command.

   Lead with "Steady state: one command — `npm run e2e`". Drops the
   pre-merge `e2e:setup` shortcut + the "running against another
   branch" section — both are transient pre-merge crutches that
   stop making sense once the harness lands on main. Adds a
   "coverage today + growth target" section: 4 seed cases is smoke;
   target ≥1 case per behavioral surface, ≥2 for high-risk.
   Documents the CI-promotion criteria (stable week of manual runs,
   per-surface coverage, flake budget < 5%) explicitly so the flip
   from workflow_dispatch to PR-gating is a measurable decision,
   not a vibes call.

3. RELEASE_CHECKLIST.md sections 2, 3, and 10 updated.

   Section 2 previously pointed at /tmp/skilify-pull-e2e.mjs as the
   canonical e2e pattern ("lives outside the repo by design — the
   e2e matrix is per-feature scratch"). That's no longer true:
   tests/e2e/ replaces the scratch approach for the five hook-driven
   agents. Section 3's per-agent matrix bullet now points at the
   in-repo case + select-from-db assertion type. Section 10's final
   sign-off step rewords "Per-agent matrix script" to "npm run e2e"
   with the coverage-growth + PR-gating-promotion clause inline.

Brings the matrix to its designed scope: every agent hivemind ships
to, every behavioral surface RELEASE_CHECKLIST.md mandates that an
e2e harness can deterministically assert. No more tier-1/tier-2
split; openclaw lives in the same matrix as the five CLI agents,
driven through a different shape.

Drivers (6 total, was 5)

  - openclaw (new): loads the installed plugin module from
    ~/.openclaw/extensions/hivemind/dist/index.js into the test
    process with a fake pluginApi that captures registered event
    handlers + tools. fires synthetic agent_end events (for capture
    cases) or invokes registered MCP tools directly (for the openclaw
    tool case). all plugin code paths run end-to-end against the real
    Deeplake API; gateway-side concerns (event parsing, multi-agent
    ordering, lifecycle) are explicitly out of scope and documented
    in README's "OpenClaw driver caveats".
  - extended AgentDriver interface with providerKey: ProviderKey to
    distinguish drivers that need a model API key vs ones that don't
    (openclaw fires hooks programmatically with no LLM in the loop).
    runner's isReady() now reads providerKey instead of a hard-coded
    switch.
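
The capture-and-fire idea behind the openclaw driver can be sketched as below. The method names `on` and `registerTool` are assumptions for illustration; the real OpenClaw plugin API surface is not shown in this PR description.

```typescript
type Handler = (payload: unknown) => unknown;

// Hypothetical plugin surface, for illustration only.
interface FakePluginApi {
  on(event: string, handler: Handler): void;
  registerTool(name: string, handler: Handler): void;
}

function createFakePluginApi() {
  const handlers = new Map<string, Handler[]>();
  const tools = new Map<string, Handler>();
  const api: FakePluginApi = {
    on: (event, handler) => {
      handlers.set(event, [...(handlers.get(event) ?? []), handler]);
    },
    registerTool: (name, handler) => {
      tools.set(name, handler);
    },
  };
  // Fire a synthetic event through every captured handler, in registration order.
  async function fire(event: string, payload: unknown): Promise<void> {
    for (const h of handlers.get(event) ?? []) await h(payload);
  }
  return { api, tools, fire };
}
```

A driver built this way would pass `api` to the plugin's entry point, then call `fire("agent_end", ...)` for capture cases or look up a registered tool in `tools` for the tool case.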

Cases (8 total, was 4)

  01 capture-smoke              all 6   one turn -> one row
  02 cat-index-md               5 CLI   skip openclaw (no bash)
  03 grep-memory-summaries      5 CLI   skip openclaw (no bash)
  04 session-start-inject       5 CLI   skip openclaw (SKILL.md path)
  05 sql-injection-probe        all 6   memory table survives
                                        ' DROP TABLE memory --
  06 missing-table-self-heal    all 6   DROP sessions, capture
                                        recreates + lands the row
  07 unicode-roundtrip          all 6   emoji + RTL + smart quotes
                                        + backslash survive JSONB
                                        roundtrip byte-for-byte
  08 openclaw-tools             openclaw only   hivemind_search
                                                returns seeded
                                                sentinel via tool
                                                registration

Total: 48 matrix points (40 live, 8 explicitly skipped with rationale
comments in each case file). Cases 05/06/07 are direct mappings of
the RELEASE_CHECKLIST.md sections that were previously gap-only:

  - 05 covers section 5 (Security: SQL identifiers + strings)
  - 06 covers section 6 (Backend quirks: lazy CREATE TABLE)
  - 07 covers section 2 (Real e2e: unicode + quotes + backslash
    edge content)
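
The point totals follow directly from the case × agent cross-product and the per-case skip lists in the table above. A sketch that reproduces the arithmetic; the `CaseDef` and `MatrixPoint` shapes are illustrative and not copied from matrix.ts:

```typescript
const AGENTS = ["claude-code", "codex", "cursor-agent", "hermes", "pi", "openclaw"] as const;
type Agent = (typeof AGENTS)[number];

interface CaseDef {
  id: string;
  skipFor: Agent[]; // agents this case deliberately does not run on
}

interface MatrixPoint {
  caseId: string;
  agent: Agent;
  skipped: boolean;
}

const CLI_AGENTS: Agent[] = AGENTS.filter((a) => a !== "openclaw");

const CASES: CaseDef[] = [
  { id: "01-capture-smoke", skipFor: [] },
  { id: "02-cat-index-md", skipFor: ["openclaw"] },
  { id: "03-grep-memory-summaries", skipFor: ["openclaw"] },
  { id: "04-session-start-inject", skipFor: ["openclaw"] },
  { id: "05-sql-injection-probe", skipFor: [] },
  { id: "06-missing-table-self-heal", skipFor: [] },
  { id: "07-unicode-roundtrip", skipFor: [] },
  { id: "08-openclaw-tools", skipFor: CLI_AGENTS },
];

function buildMatrix(cases: CaseDef[], agents: readonly Agent[]): MatrixPoint[] {
  const points: MatrixPoint[] = [];
  for (const c of cases) {
    for (const agent of agents) {
      points.push({ caseId: c.id, agent, skipped: c.skipFor.includes(agent) });
    }
  }
  return points;
}

const points = buildMatrix(CASES, AGENTS);
const skippedCount = points.filter((p) => p.skipped).length;
console.log(`${points.length} points, ${points.length - skippedCount} live, ${skippedCount} skipped`);
// prints "48 points, 40 live, 8 skipped"
```

The 8 skips break down as 3 (cases 02 to 04 skipping openclaw) plus 5 (case 08 skipping the five CLI agents).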

README + RELEASE_CHECKLIST.md updated

  - tests/e2e/README.md: agent-shapes table explaining the CLI-vs-
    openclaw driver distinction; case-coverage table mapping each
    case to the checklist section it satisfies; "What the matrix
    does NOT cover" section listing the checklist items that aren't
    e2e-deterministic by nature (UPDATE coalescing, async hook
    completion timing, per-agent dispatch model selection -- all
    handled at source-test layer).
  - RELEASE_CHECKLIST.md: tier-1/tier-2 wording removed throughout;
    sections 3 and 10 now reference all six agents explicitly.

Untested: live spawn against the real workspace; the workspace name
lookup against listWorkspaces(); SQL DROP TABLE behavior on the
specific Deeplake deployment for case 06; openclaw plugin module
load via cache-busted dynamic import in repeated cases of the same
runner invocation.