
fix(onboard): poll forward list instead of waiting on openshell CLI exit #4072

Merged
ericksoa merged 14 commits into main from fix/forward-start-detach-poll-4064
May 23, 2026

Conversation

@laitingsheng
Contributor

@laitingsheng laitingsheng commented May 22, 2026

Summary

ensureDashboardForward invoked openshell forward start --background via spawnSync with a 30-second timeout. On hosts using the Docker compatibility gateway (host glibc older than the openshell-gateway requirement), the daemonised forward inherits the parent CLI's stdio, so spawnSync waited on those file descriptors well past 30s, producing spawnSync ... openshell ETIMEDOUT and tripping the rollback path even though the dashboard was healthy inside the sandbox.

Switch from spawnSync-with-timeout to a detached spawn plus a poll of openshell forward list. The CLI's exit code is no longer the success signal — the appearance of a matching (port, sandboxName) entry in the live forward list is.
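
The readiness check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `forward-start.ts`: `waitForForward`, `ForwardEntry`, and the injected `fetchList`, `parse`, `sleep`, and `now` parameters are hypothetical names, and the real output format of `openshell forward list` is not assumed here (the caller supplies the parser).

```typescript
// Hypothetical shape of one parsed row from `openshell forward list`.
interface ForwardEntry {
  port: number;
  sandboxName: string;
}

// Everything the loop touches is injected so it can be tested without a CLI.
interface PollDeps {
  fetchList: () => string; // returns raw list output
  parse: (raw: string) => ForwardEntry[]; // caller-supplied parser (format not assumed)
  sleep: (ms: number) => void; // blocking pause between polls
  now: () => number; // clock in milliseconds
}

type PollOutcome = { ok: true } | { ok: false; reason: "timeout" };

// Poll until a matching (port, sandboxName) entry appears or the deadline passes.
// The spawned CLI's exit code plays no part in the decision.
function waitForForward(
  target: ForwardEntry,
  deps: PollDeps,
  deadlineMs: number,
  intervalMs: number,
): PollOutcome {
  const deadline = deps.now() + deadlineMs;
  for (;;) {
    const entries = deps.parse(deps.fetchList());
    const found = entries.some(
      (e) => e.port === target.port && e.sandboxName === target.sandboxName,
    );
    if (found) return { ok: true };
    if (deps.now() >= deadline) return { ok: false, reason: "timeout" };
    deps.sleep(intervalMs);
  }
}
```

Injecting the clock and the sleep keeps the loop deterministic in unit tests, which is in the spirit of the mocked `fetchList` and `sleep` the PR's test suite is described as using.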

Related Issue

Fixes #4064

Changes

  • src/lib/onboard/forward-start.ts — new helpers replace runBackgroundForwardStart*:
    • runDetachedForwardStartWithDiagnostics / runDetachedForwardStartWithPortReleaseRetries write child stdio to a temp file pair, close the host fds, and poll openshell forward list until a match appears or the deadline expires.
    • buildDetachedForwardStartSpawn(argv) preflights argv[0] with fs.accessSync(_, X_OK), returns an immediate spawn-error when the spawn produces no pid, and registers a no-op error listener so a belated async failure cannot crash onboard.
    • buildForwardStartProgressLogger(port) prints a periodic "still waiting" line during the poll.
    • Default deadline is 180s; on timeout the diagnostic includes a last forward list: … tail. On timeout / port-conflict outcomes the helper sends a best-effort SIGTERM to the detached child to reduce the risk of an orphan registering a forward after rollback.
  • src/lib/onboard/forward-cleanup.ts — new bestEffortForwardStopForSandbox(run, fetch, port, sandbox):
    • Consults openshell forward list (timeout OPENSHELL_PROBE_TIMEOUT_MS, throws on failure) and returns stopped / owned-other / no-entry / list-failed.
    • Uses the sandbox-scoped forward stop <port> <sandbox> form; skips entirely when forward list fails so a port-only fallback cannot kill an unrelated sandbox's forward.
  • src/lib/onboard.ts — wire ensureDashboardForward to the new helpers and pass timeout: OPENSHELL_PROBE_TIMEOUT_MS to the forward-list poll. Entrypoint budget against main is net-negative.
  • Tests — new src/lib/onboard/forward-start.test.ts and src/lib/onboard/forward-cleanup.test.ts suites cover the spawn/poll/SIGTERM paths and the four ownership outcomes (including the list-failed branch); existing test/onboard*.test.ts spawn mocks are updated to capture full argv and emit a matching forward list entry for the polled port.
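
The four ownership outcomes listed for `bestEffortForwardStopForSandbox` can be illustrated with a small sketch. It is not the PR's implementation: `stopForwardIfOwned`, `ForwardRow`, and the injected `runStop` and `listForwards` callbacks are hypothetical, and the list is assumed to arrive already parsed. Only the outcome names and the sandbox-scoped `forward stop <port> <sandbox>` form come from the description above.

```typescript
type StopOutcome = "stopped" | "owned-other" | "no-entry" | "list-failed";

// Hypothetical parsed row; the real list format is not assumed here.
interface ForwardRow {
  port: number;
  sandboxName: string;
}

function stopForwardIfOwned(
  runStop: (args: string[]) => void,
  listForwards: () => ForwardRow[], // expected to throw when the list probe fails
  port: number,
  sandboxName: string,
): StopOutcome {
  let rows: ForwardRow[];
  try {
    rows = listForwards();
  } catch {
    // Ownership is unknown, so do nothing rather than fall back to a port-only stop.
    return "list-failed";
  }
  const onPort = rows.filter((r) => r.port === port);
  if (onPort.length === 0) return "no-entry";
  if (!onPort.some((r) => r.sandboxName === sandboxName)) return "owned-other";
  // Sandbox-scoped form, so another sandbox's forward on the same port is untouched.
  runStop(["forward", "stop", String(port), sandboxName]);
  return "stopped";
}
```

Returning a status string instead of throwing lets the caller log why nothing was stopped without treating it as a failure.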

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

  • Bug Fixes

    • Switched to a detached forward-start flow with polling-based readiness, safer cleanup that only stops forwards owned by the target sandbox, targeted retries for port conflicts, progress updates while waiting, clearer timeout/conflict diagnostics, and graceful termination on failures.
    • Dashboard now falls back to the dashboard port by default and accepts a helper to construct openshell commands.
  • Tests

    • Expanded coverage for forward-starts, timeouts, retries, diagnostics and cleanup; updated mocks to more robustly record spawned commands and simulate realistic forward-list outputs.

Review Change Stack

…xit (#4064)

`ensureDashboardForward` invoked `openshell forward start --background`
via `spawnSync` with a 30-second timeout. On hosts that fall back to
the Docker compatibility gateway (host glibc older than the
openshell-gateway requirement), the daemonised forward inherits the
parent CLI's stdio, so spawnSync waits on those file descriptors long
after `--background` returned. The wait routinely exceeded the
30-second timeout, producing
`spawnSync ... openshell ETIMEDOUT` and tripping the rollback path
even though the dashboard was healthy inside the sandbox (port 18789
listening, `/health` returning 200).

Replace the spawnSync-with-timeout path with a detached spawn plus a
poll of `openshell forward list`. The CLI's exit code is no longer the
success signal — the appearance of a matching `(port, sandboxName)`
entry in the live forward list is. EADDRINUSE-style diagnostics from
the parent process are still surfaced before the deadline so the
existing port-conflict retry path keeps working.
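
A rough sketch of how a port-conflict retry path of this kind might look is shown below. The names `looksLikePortConflict` and `startWithPortReleaseRetries`, the callback signatures, and the text pattern used to recognise a conflict are all assumptions made for illustration; the PR's own `looksLikeForwardPortConflict` may match different text.

```typescript
type StartOutcome = { ok: true } | { ok: false; diagnostic: string };

// Assumption: a conflict can be recognised from the captured diagnostic text.
function looksLikePortConflict(diagnostic: string): boolean {
  return /EADDRINUSE|address already in use/i.test(diagnostic);
}

// Retry only when the failure looks like a port conflict; any other failure
// is returned to the caller immediately.
function startWithPortReleaseRetries(
  startOnce: () => StartOutcome,
  releasePort: () => void, // e.g. a sandbox-scoped stop followed by a short pause
  maxRetries: number,
): StartOutcome {
  let outcome = startOnce();
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    if (outcome.ok || !looksLikePortConflict(outcome.diagnostic)) break;
    releasePort();
    outcome = startOnce();
  }
  return outcome;
}
```

With `maxRetries` set to 2, a start that hits a conflict on every attempt runs three times in total and the last diagnostic is returned.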

Adds focused unit tests under `src/lib/onboard/forward-start.test.ts`
for the new helpers and updates the broader onboard tests that
exercise `createSandbox` so their `childProcess.spawn` mocks capture
the full argv and their `forward list` mocks include a matching
entry for the polled port.

Fixes #4064

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 74d1e2cd-1744-48ba-b610-175dc65b7dde

📥 Commits

Reviewing files that changed from the base of the PR and between 707d647 and 6ce79ad.

📒 Files selected for processing (2)
  • src/lib/onboard/dashboard.ts
  • test/onboard-dashboard.test.ts

📝 Walkthrough

Walkthrough

Switches dashboard forward creation from a background runner to a detached spawn + polling flow; adds detached spawn builder, diagnostic capture, readiness polling via openshell forward list, port-conflict-aware retries, sandbox-scoped forward-stop probing, onboarding wiring updates, and tests/fixture adjustments.

Changes

Detached Forward-Start Implementation and Integration

| Layer | File(s) | Summary |
| --- | --- | --- |
| Detached forward types & core implementation | src/lib/onboard/forward-start.ts | Adds ForwardListFetcher, DetachedForwardSpawnRunner, DetachedForwardStartOutcome/Options, blocking sleep/poll helpers, buildDetachedForwardStartSpawn, progress logger, diagnostic temp-file handling, readiness polling loop, SIGTERM best-effort, and retry wrapper. |
| Forward-start tests and helpers | src/lib/onboard/forward-start.test.ts | Adds forwardListWith test helper and scenarios for success, timeout, transient fetch errors, spawn errors, onProgress, SIGTERM behavior, retry-on-port-conflict, maxRetries, and looksLikeForwardPortConflict unit tests. |
| Forward cleanup impl (sandbox-scoped) and tests | src/lib/onboard/forward-cleanup.ts, src/lib/onboard/forward-cleanup.test.ts | Adds ForwardListRunner type and bestEffortForwardStopForSandbox that probes forward list, resolves ownership with getOccupiedPorts, conditionally issues sandbox-scoped forward stop, and returns explicit status strings; tests cover ownership, no-entry, non-live status, and list failures. |
| Onboarding dashboard wiring | src/lib/onboard/dashboard.ts, src/lib/onboard.ts | Rewires ensureDashboardForward to build detached spawn argv via openshellArgv, use buildDetachedForwardStartSpawn + runDetachedForwardStartWithPortReleaseRetries, probe readiness with runCaptureOpenshell(["forward","list"]) using OPENSHELL_PROBE_TIMEOUT_MS, perform sandbox-scoped stop between retries, and replace CONTROL_UI_PORT fallback with DASHBOARD_PORT. |
| Test harness & fixture updates (spawn & forward-list) | test/onboard.test.ts, test/onboard-messaging.test.ts, test/onboard-custom-dockerfile.test.ts, test/shellquote-sandbox.test.ts, src/lib/onboard/forward-cleanup.test.ts, src/lib/onboard/forward-start.test.ts | Updates many test mocks: add child.unref() no-op, set child.pid where appropriate, flatten/serialize spawn args into recorded commands, sometimes record returned child for assertions, and return running forward-list lines in affected fixtures. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3919: Follow-up changes evolving onboarding dashboard helper wiring and forward-start orchestration.
  • NVIDIA/NemoClaw#3997: Related refactor around forward-stop helpers and ownership probing.

Suggested labels

NemoClaw CLI, OpenShell, Networking, v0.0.50

Suggested reviewers

  • ericksoa
  • cv
  • jyaunches

Poem

🐰 I spawned a child and tucked it out of sight, unref set soft and light,
I polled the forward list from dusk until the dawn,
When ports would bicker, I retried and read the temp logs on the lawn,
I SIGTERMed gently when conflicts would persist,
Then cleaned the burrow logs and hopped away with a hopeful twist.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: replacing spawnSync-with-timeout waiting on openshell CLI exit with a polling mechanism that checks the forward list. |
| Linked Issues check | ✅ Passed | All primary objectives from #4064 are addressed: detached spawn replaces spawnSync to prevent timeout, polling forward list replaces CLI-exit detection, port-conflict detection preserved, diagnostics improved with stdout/stderr and last-forward-list tail. |
| Out of Scope Changes check | ✅ Passed | All changes are directly aligned with issue #4064 objectives: detached forward-start flow, forward-list polling, port-conflict handling, diagnostic improvements, and supporting test updates. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@laitingsheng added the fix and Sandbox labels May 22, 2026
@github-actions
Contributor

github-actions Bot commented May 22, 2026

E2E Advisor Recommendation

Required E2E: ubuntu-repo-cloud-openclaw, dashboard-remote-bind-e2e
Optional E2E: wsl-repo-cloud-openclaw, ubuntu-repo-cloud-hermes

Dispatch hint: dashboard-remote-bind-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • ubuntu-repo-cloud-openclaw (medium): Required baseline scenario: this PR changes core onboard sandbox creation and dashboard forward startup. A clean Ubuntu repo-current cloud OpenClaw onboard must still create a sandbox, register the dashboard forward, and pass smoke/baseline checks.
  • dashboard-remote-bind-e2e (medium): Required dashboard regression: changed ensureDashboardForward/forward start/stop code is exercised by connect and dashboard forwarding. This existing E2E verifies the dashboard forward can be restarted and observed through OpenShell forward list.

Optional E2E

  • wsl-repo-cloud-openclaw (high): Optional platform confidence: forward/list/stop behavior and dashboard host access can differ under WSL networking. Useful because the changed code touches OpenShell forward handling, but not merge-blocking unless WSL is a targeted platform for this PR.
  • ubuntu-repo-cloud-hermes (medium): Optional agent variant: verifies the same onboard/dashboard forward path for the Hermes agent, whose configured forward port can differ from OpenClaw defaults.

New E2E recommendations

  • dashboard forward ownership isolation (high): Existing E2E does not appear to prove that onboarding/connecting one sandbox will not stop a live dashboard forward owned by a different sandbox on the same host port during a TOCTOU race.
    • Suggested test: Add a multi-sandbox dashboard-forward ownership E2E that creates two sandboxes/forwards, attempts cleanup/reconnect for one sandbox, and asserts the other sandbox's forward remains alive.
  • detached OpenShell forward startup (medium): The new detached forward-start implementation treats forward-list registration as the success signal and kills the detached child on timeout/conflict. Unit tests cover this with mocks, but no live E2E appears to force a slow or conflicting forward registration path.
    • Suggested test: Add a dashboard forward port-conflict/slow-registration E2E that holds the dashboard port, runs onboard/connect, verifies retry/diagnostic behavior, and confirms no orphan forward process remains.

Dispatch hint

  • Workflow: regression-e2e.yaml
  • jobs input: dashboard-remote-bind-e2e

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 8647-8682: The new detached forward-launching block (building
forwardStartArgv and the inline launcher passed into
runDetachedForwardStartWithPortReleaseRetries) should be extracted into a new
module (e.g., src/lib/onboard/forward-start.ts) so ensureDashboardForward
remains a thin call site; move the logic that constructs forwardStartArgv, the
detached spawn wrapper (the function that does spawn(..., { stdio: ["ignore",
stdout, stderr], detached: true }) and child.unref()), the retry/backoff wiring
that calls runCaptureOpenshell(["forward","list"], ...) and the cleanup callback
(sleep + bestEffortForwardStop) into exported helper(s) like
startDetachedForwardWithRetries or runDetachedForwardStart, then update
ensureDashboardForward to call that helper (passing actualTarget, sandboxName,
actualPort, runOpenshell, etc.) and return the existing fwdOutcome handling
unchanged.

In `@src/lib/onboard/forward-start.ts`:
- Around line 153-158: The current try/catch around fetchForwardList() swallows
errors so the timeout diagnostic loses the last failure; change the logic in
forward-start.ts where list is set (the try/catch using fetchForwardList()) and
the similar block later to capture and store the last thrown Error (e.g.,
lastFetchError) whenever fetchForwardList() throws, and then include that
error.message (or the Error object) in the timeout/“did not appear” diagnostic
so the timeout path reports the actual fetch failure instead of a generic
message. Ensure both occurrences (the initial fetch and the later polling block)
follow the same pattern and propagate the preserved error into the final
diagnostic.
- Around line 65-75: blockingSleepMs currently spawns a child with an unref'd
timer so the child can exit immediately; change the spawn to a blocking child
(remove .unref() and ensure the timer keeps the process alive, e.g., run node -e
"setTimeout(()=>{}, ms)" or otherwise use a synchronous sleep primitive) so
spawnSync truly waits for ms, and keep the injectable spawnSync pattern in
blockingSleepMs; also stop swallowing fetchForwardList() errors — in the polling
loop capture the thrown error (do not use catch { list = "" }) and include that
error message/stack in the returned diagnostic (or attach it to the
stderr/stdout context) so failures to fetch the list are reflected alongside the
child spawn output when reporting timeouts.
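
The review mentions "a synchronous sleep primitive" as an alternative to spawning a sleep child. One such primitive available in Node.js is `Atomics.wait` on a `SharedArrayBuffer`. The sketch below shows that alternative only; it is not necessarily what the PR adopted, and it drops the injectable `spawnSync` pattern the comment asks to keep.

```typescript
// Block the current thread for roughly `ms` milliseconds without spawning a child.
// Atomics.wait is permitted on the Node.js main thread; browsers reject it there.
function blockingSleepMs(ms: number): void {
  if (ms <= 0) return;
  const cell = new Int32Array(new SharedArrayBuffer(4));
  // The cell holds 0 and nothing ever notifies it, so this returns on timeout.
  Atomics.wait(cell, 0, 0, ms);
}
```

Like any blocking sleep, this stalls the event loop for the full duration, so timers and queued event handlers cannot run until it returns.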

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ae78cde6-68e6-47af-86f3-5c211a1c1459

📥 Commits

Reviewing files that changed from the base of the PR and between 617be54 and b072c10.

📒 Files selected for processing (7)
  • src/lib/onboard.ts
  • src/lib/onboard/forward-start.test.ts
  • src/lib/onboard/forward-start.ts
  • test/onboard-custom-dockerfile.test.ts
  • test/onboard-messaging.test.ts
  • test/onboard.test.ts
  • test/shellquote-sandbox.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard/forward-start.ts
Comment thread src/lib/onboard/forward-start.ts Outdated
@github-actions
Contributor

github-actions Bot commented May 22, 2026

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Runtime validation is still needed for detached OpenShell forwarding (src/lib/onboard/forward-start.test.ts:23): The PR adds substantial unit coverage for detached spawn, forward-list polling, timeout diagnostics, transient list failures, SIGTERM cleanup, sandbox-scoped forward-stop behavior, and port-conflict retry behavior. However, this is a host/OpenShell sandbox lifecycle path, and the reviewed tests still mock spawn, forward-list output, sleeps, and process signaling rather than exercising a real OpenShell gateway/forward daemonization flow.
    • Recommendation: Add or identify targeted runtime/integration validation for `openshell forward start --background`, Docker compatibility gateway behavior, inherited stdio detachment, forward-list polling, timeout diagnostics, SIGTERM/retry cleanup, sandbox-scoped forward-stop ownership, and rollback behavior. Keep this separate from external E2E status reporting; the key evidence needed is behavioral validation of the changed host/sandbox lifecycle path.
    • Evidence: `src/lib/onboard/forward-start.test.ts` uses mocked `spawn`, `fetchList`, `sleep`, and synthetic `openshell forward list` rows; `src/lib/onboard/forward-cleanup.test.ts` similarly mocks `run`/`fetch`. Trusted test-depth context remains `runtime_validation_recommended` for `src/lib/onboard.ts`, `src/lib/onboard/dashboard.ts`, `src/lib/onboard/forward-cleanup.ts`, and `src/lib/onboard/forward-start.ts`. Linked issue comment 4521101866 reported an earlier branch revision still failed with `forward did not appear in list within 60000ms`; current code raises the default deadline to 180s and improves diagnostics, but no non-mocked runtime evidence is present in the reviewed diff.

🌱 Nice ideas

  • None.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26295520377
Target ref: b072c108f125261359ad253707ff0ea1fc6344ac
Workflow ref: main
Requested jobs: double-onboard-e2e,cloud-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-e2e ✅ success
double-onboard-e2e ✅ success

… list errors

PR #4072 review fixes.

- `forward-start.ts`: export `buildDetachedForwardStartSpawn(argv)` so
  the detached-spawn wiring lives next to the consumer. `onboard.ts`'s
  call site shrinks from a 27-line lambda to a single helper call,
  putting `src/lib/onboard.ts` back inside the entrypoint budget.
- `forward-start.ts:blockingSleepMs`: drop `setTimeout(...).unref()`.
  The unref'd timer let the sleep child's event loop drain immediately
  so spawnSync returned right away and the caller spun the poll loop
  instead of pausing between forward-list fetches.
- `forward-start.ts:runDetachedForwardStartWithDiagnostics`: record the
  last `fetchForwardList()` error and append it to the diagnostic on
  timeout. Previously a persistent gateway/list failure was swallowed
  and the user only saw a generic "forward did not appear" message.
- `test/onboard.test.ts`: align the last remaining spawn mock with the
  full-argv capture pattern used elsewhere.
- `src/lib/onboard/forward-start.test.ts`: new regression test for the
  diagnostic suffix when `fetchForwardList()` keeps throwing.
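
The diagnostic change in this commit can be pictured with the sketch below. The helper names (`fetchOnce`, `buildTimeoutDiagnostic`), the `TimeoutContext` shape, the five-line tail, and the exact wording of the suffix lines are assumptions; only the "forward did not appear in list within …ms" phrase and the idea of appending the last list output and the last fetch error come from this thread.

```typescript
interface TimeoutContext {
  deadlineMs: number;
  lastList: string; // most recent successful list output, "" if none
  lastFetchError?: Error; // most recent list failure, if any
}

// Fetch once, remembering a failure instead of coercing it to "".
// In this sketch the recorded error is kept even if a later fetch succeeds.
function fetchOnce(fetchList: () => string, ctx: TimeoutContext): string | undefined {
  try {
    const raw = fetchList();
    ctx.lastList = raw;
    return raw;
  } catch (err) {
    ctx.lastFetchError = err instanceof Error ? err : new Error(String(err));
    return undefined;
  }
}

// Compose the timeout message from whatever evidence the poll loop collected.
function buildTimeoutDiagnostic(ctx: TimeoutContext): string {
  const parts = [`forward did not appear in list within ${ctx.deadlineMs}ms`];
  const tail = ctx.lastList.trim().split("\n").slice(-5).join("\n");
  if (tail) parts.push(`last forward list: ${tail}`);
  if (ctx.lastFetchError) {
    parts.push(`last forward list error: ${ctx.lastFetchError.message}`);
  }
  return parts.join("\n");
}
```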

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lib/onboard.ts (1)

8647-8652: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Trim this callsite back under the onboard-entrypoint budget.

The helper extraction is good, but this file still fails CI at +6 net lines. These new inline explanation/build-up lines are enough to keep onboard-entrypoint-budget red, so please collapse or move them out of src/lib/onboard.ts.

Suggested minimal trim
-  // Detached spawn + forward-list poll (`#4064`) so the openshell CLI's
-  // inherited stdio cannot trip spawnSync's timeout on Docker-compat hosts.
   const fwdOutcome = runDetachedForwardStartWithPortReleaseRetries(
-    buildDetachedForwardStartSpawn(
-      openshellArgv(["forward", "start", "--background", actualTarget, sandboxName]),
-    ),
+    buildDetachedForwardStartSpawn(openshellArgv(["forward", "start", "--background", actualTarget, sandboxName])),
     () => runCaptureOpenshell(["forward", "list"], { ignoreError: true }),
     { port: actualPort, sandboxName },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 8647 - 8652, The extra inline
explanatory/build-up lines around the detached-forward call cause the
onboard-entrypoint budget to exceed; collapse this callsite by removing or
relocating the commentary and intermediate staging lines in src/lib/onboard.ts
and replace them with a single concise invocation of
runDetachedForwardStartWithPortReleaseRetries(buildDetachedForwardStartSpawn(openshellArgv(["forward","start","--background",
actualTarget, sandboxName]))) (or move the expanded explanation to a separate
helper file or doc), keeping only the minimal assignment to fwdOutcome and any
essential error handling; ensure the unique identifiers
runDetachedForwardStartWithPortReleaseRetries, buildDetachedForwardStartSpawn,
openshellArgv, and fwdOutcome are preserved so callers remain unchanged.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

8647-8659: Please run the onboarding E2E slice for this path.

This changes dashboard-forward creation/detection inside src/lib/onboard.ts, so I'd want the relevant onboarding regressions exercised before merge: cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, channels-stop-start-e2e, and openshell-gateway-upgrade-e2e.

As per coding guidelines, "src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow." and the listed E2E test recommendation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 8647 - 8659, The change affects
dashboard-forward creation/detection in src/lib/onboard.ts around
runDetachedForwardStartWithPortReleaseRetries / buildDetachedForwardStartSpawn /
openshellArgv / runCaptureOpenshell / bestEffortForwardStop (and uses
runOpenshell, actualPort, sandboxName), so run the onboarding E2E slice
exercising those flows: execute cloud-e2e, sandbox-operations-e2e,
rebuild-openclaw-e2e, channels-stop-start-e2e, and openshell-gateway-upgrade-e2e
(locally or in CI) targeting the modified path; if any test fails, reproduce
with logs for the forward start/list/stop sequence, trace through the above
functions to fix detection/creation issues, and re-run the listed E2E tests
until they pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/forward-start.ts`:
- Around line 94-104: The current buildDetachedForwardStartSpawn only catches
synchronous spawnChild errors; fix it by attaching a one-time "error" listener
to the ChildProcess returned from spawnChild inside
buildDetachedForwardStartSpawn so async launch failures (ENOENT/EACCES) are
captured and propagated into the DetachedForwardSpawnRunner outcome used by
runDetachedForwardStartWithDiagnostics; specifically, after calling
spawnChild(...) in buildDetachedForwardStartSpawn, call child.once("error", err
=> set spawnResult.error = err instanceof Error ? err : new Error(String(err)))
and ensure the function waits/returns a result that reflects that error (or
otherwise resolves the runner) before calling child.unref() or returning { pid:
child.pid } so runDetachedForwardStartWithDiagnostics can see reason:
"spawn-error" when pid is undefined.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 8647-8652: The extra inline explanatory/build-up lines around the
detached-forward call cause the onboard-entrypoint budget to exceed; collapse
this callsite by removing or relocating the commentary and intermediate staging
lines in src/lib/onboard.ts and replace them with a single concise invocation of
runDetachedForwardStartWithPortReleaseRetries(buildDetachedForwardStartSpawn(openshellArgv(["forward","start","--background",
actualTarget, sandboxName]))) (or move the expanded explanation to a separate
helper file or doc), keeping only the minimal assignment to fwdOutcome and any
essential error handling; ensure the unique identifiers
runDetachedForwardStartWithPortReleaseRetries, buildDetachedForwardStartSpawn,
openshellArgv, and fwdOutcome are preserved so callers remain unchanged.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 8647-8659: The change affects dashboard-forward creation/detection
in src/lib/onboard.ts around runDetachedForwardStartWithPortReleaseRetries /
buildDetachedForwardStartSpawn / openshellArgv / runCaptureOpenshell /
bestEffortForwardStop (and uses runOpenshell, actualPort, sandboxName), so run
the onboarding E2E slice exercising those flows: execute cloud-e2e,
sandbox-operations-e2e, rebuild-openclaw-e2e, channels-stop-start-e2e, and
openshell-gateway-upgrade-e2e (locally or in CI) targeting the modified path; if
any test fails, reproduce with logs for the forward start/list/stop sequence,
trace through the above functions to fix detection/creation issues, and re-run
the listed E2E tests until they pass.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d8bc0303-3a73-45e7-8649-9e385d8f5c1e

📥 Commits

Reviewing files that changed from the base of the PR and between b072c10 and 45f5671.

📒 Files selected for processing (4)
  • src/lib/onboard.ts
  • src/lib/onboard/forward-start.test.ts
  • src/lib/onboard/forward-start.ts
  • test/onboard.test.ts

Comment thread src/lib/onboard/forward-start.ts
@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26298405268
Target ref: 45f56715bc12369b8261c1490989273fc0cb648c
Workflow ref: main
Requested jobs: double-onboard-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
double-onboard-e2e ✅ success

…, fix sleep child

PR #4072 review fixes.

- `onboard.ts`: collapse `ensureDashboardForward` callsite so net diff
  against main stays negative (entrypoint budget). `fwdDiagnostic` is
  destructured directly from the helper outcome and the explicit
  `{ ignoreError: true }` on `forward list` is dropped so genuine
  gateway/list failures propagate into the diagnostic instead of being
  silently coerced to `""`.
- `forward-start.ts`: extend `DetachedForwardSpawnRunner` with an
  optional `onAsyncError` callback. `buildDetachedForwardStartSpawn`
  registers `child.on("error", …)` so post-spawn ENOENT/EACCES events
  no longer escape as unhandled errors; the helper surfaces them via
  the existing `spawn-error` outcome path.
- `forward-start.test.ts`: new regression test simulating an async
  spawn error fired during the helper's poll-loop sleep.
- `test/onboard.test.ts`: fix two leftover spawn mocks that came in
  during the merge with main — they used the legacy `args[1][1]`
  capture shape and lacked `child.unref`, so the new detached spawn
  path would crash with `child.unref is not a function` in CI.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…ach-poll-4064

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard/forward-start.ts (1)

164-221: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Ensure asyncSpawnError can be delivered during the polling loop
buildDetachedForwardStartSpawn() registers child.on("error", onAsyncError), but runDetachedForwardStartWithDiagnostics() runs a synchronous while loop that repeatedly calls fetchForwardList() and (by default) blocks in sleepImpl via blockingSleepMs() (spawnSync). While this code is running, Node can’t dispatch the queued ChildProcess "error" event handler, so asyncSpawnError is unlikely to be set before the loop ends—making the reason: "spawn-error" path unreliable and turning async launch failures into "timeout".
Change the loop to yield to the event loop between polls (e.g., make the helper async and use real async sleep) or use an approach that doesn’t depend on a queued "error" callback.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/forward-start.ts` around lines 164 - 221, The polling loop in
runDetachedForwardStartWithDiagnostics relies on a synchronous while/sleepImpl
(blockingSleepMs) so Node cannot deliver the ChildProcess "error" event to set
asyncSpawnError; change the loop to be asynchronous so events can run: make
runDetachedForwardStartWithDiagnostics (or the helper containing the while)
async, replace the blocking sleepImpl/blockingSleepMs call with a non-blocking
awaitable sleep (e.g., await new Promise(r => setTimeout(r, pollIntervalMs)) or
an exported async sleepImpl) and ensure runDetachedSpawn/registering via
buildDetachedForwardStartSpawn still sets asyncSpawnError; afterward the loop
should check asyncSpawnError between awaits so the "spawn-error" path is
reachable.
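
A minimal sketch of the suggestion above, assuming an async helper. The names `pollUntil`, `PollOutcome`, and `getAsyncError` are hypothetical and do not come from the PR; the point is only that awaiting a timer hands control back to the event loop, so a queued `error` callback can set the flag between polls.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the PR's API.
type PollOutcome =
  | { ok: true }
  | { ok: false; reason: "spawn-error" | "timeout"; detail?: string };

interface PollOptions {
  overallTimeoutMs: number;
  pollIntervalMs: number;
  // Returns the error recorded by a child "error" listener, if any.
  getAsyncError: () => Error | undefined;
}

// Non-blocking sleep: awaiting a timer yields to the event loop,
// so queued "error" callbacks can run between polls.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function pollUntil(
  isReady: () => boolean,
  opts: PollOptions,
): Promise<PollOutcome> {
  const deadline = Date.now() + opts.overallTimeoutMs;
  while (Date.now() < deadline) {
    // Check the async error first so a failed launch is not reported as a timeout.
    const err = opts.getAsyncError();
    if (err) return { ok: false, reason: "spawn-error", detail: err.message };
    if (isReady()) return { ok: true };
    await sleep(opts.pollIntervalMs);
  }
  return { ok: false, reason: "timeout" };
}
```

The trade-off is that every caller up the chain has to become async as well, which is a larger change than the blocking loop it replaces.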
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 8641-8645: The retry cleanup currently calls
bestEffortForwardStop(runOpenshell, actualPort) unconditionally and can stop a
forward owned by a different sandbox; update the retry callback passed to
runDetachedForwardStartWithPortReleaseRetries so it first queries current
forwards (e.g., via runCaptureOpenshell(["forward", "list"]) or the same helper
used elsewhere), finds the entry for actualPort and verifies its sandbox/owner
equals sandboxName, and only then calls bestEffortForwardStop(runOpenshell,
actualPort); keep all existing symbols
(runDetachedForwardStartWithPortReleaseRetries, buildDetachedForwardStartSpawn,
runCaptureOpenshell, bestEffortForwardStop, runOpenshell, actualPort,
sandboxName) and ensure the ownership check prevents stopping another sandbox's
forward.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4650746e-cdba-45c8-a2c7-707f5a8771d5

📥 Commits

Reviewing files that changed from the base of the PR and between 45f5671 and 2a7d7b9.

📒 Files selected for processing (4)
  • src/lib/onboard.ts
  • src/lib/onboard/forward-start.test.ts
  • src/lib/onboard/forward-start.ts
  • test/onboard.test.ts

Comment thread src/lib/onboard.ts Outdated
@github-actions
Contributor

github-actions Bot commented May 23, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • None. No scenario workflow, scenario metadata, scenario runtime, or validation-suite files changed.

Optional scenario E2E

  • None.

Relevant changed files

  • None.

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26324778429
Target ref: 9b635e4f2d7e053900ac3b94dff718c26a935536
Workflow ref: main
Requested jobs: cloud-onboard-e2e,double-onboard-e2e,device-auth-health-e2e
Summary: 3 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
device-auth-health-e2e ✅ success
double-onboard-e2e ✅ success

…ate on timeout (#4064)

User reproduced #4064 on the merged branch: spawnSync ETIMEDOUT is gone
but the helper now times out at 60s with "forward did not appear in
list". The user's openshell-gateway runs inside the Docker compatibility
container (host glibc < required) and per-call gRPC latency pushes the
forward-registration handshake past 60s. Manual
`openshell forward start <port> <sandbox> -d` against an unrelated
sandbox completes immediately, so the CLI path is healthy.

- `forward-start.ts`: bump default `overallTimeoutMs` from 60_000 to
  180_000 so Docker-compat gateways have headroom for the
  forward-registration handshake.
- `forward-start.ts`: add `onProgress` / `progressIntervalMs` options so
  the helper can emit a periodic "still waiting" line during the
  longer wait, and append the last `openshell forward list` snapshot to
  the timeout diagnostic so users (and triage) can see whether the
  gateway returned an empty list or an entry in an unexpected state.
- `forward-start.ts`: export `buildForwardStartProgressLogger(port)` so
  the onboard call site stays a one-liner — the entrypoint budget for
  `src/lib/onboard.ts` is still net-zero against main.
- `forward-start.test.ts`: new regression test for the progress callback
  cadence and the `last forward list:` diagnostic suffix.
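
The wait described in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `forward-start.ts`: `waitForForward`, `matches`, and the default poll and progress intervals are hypothetical; only `overallTimeoutMs`, `progressIntervalMs`, `onProgress`, the 180_000 default, and the `last forward list:` suffix come from the commit message. The clock and the blocking sleep are injected so the cadence can be checked without real waiting.

```typescript
// Hypothetical sketch; option names mirror the commit message, the rest is assumed.
interface WaitOptions {
  sleep: (ms: number) => void; // blocking sleep supplied by the caller
  overallTimeoutMs?: number; // 180_000 by default, per the commit message
  pollIntervalMs?: number; // assumed default
  progressIntervalMs?: number; // assumed default
  onProgress?: (elapsedMs: number) => void;
  now?: () => number; // injectable clock for tests
}

interface WaitResult {
  ok: boolean;
  diagnostic: string;
}

function waitForForward(
  fetchForwardList: () => string,
  matches: (listOutput: string) => boolean,
  opts: WaitOptions,
): WaitResult {
  const {
    sleep,
    overallTimeoutMs = 180_000,
    pollIntervalMs = 1_000,
    progressIntervalMs = 15_000,
    onProgress,
    now = Date.now,
  } = opts;
  const start = now();
  let lastProgress = start;
  let lastList = "";
  while (now() - start < overallTimeoutMs) {
    lastList = fetchForwardList();
    if (matches(lastList)) return { ok: true, diagnostic: "" };
    // Emit a progress callback at most once per progress interval.
    if (onProgress && now() - lastProgress >= progressIntervalMs) {
      lastProgress = now();
      onProgress(lastProgress - start);
    }
    sleep(pollIntervalMs);
  }
  // Keep the last snapshot so triage can see what the gateway returned.
  const tail = lastList.trim() || "(empty)";
  return {
    ok: false,
    diagnostic: `forward did not appear in list; last forward list: ${tail}`,
  };
}
```

With a fake clock that advances only inside `sleep`, a 10s deadline and a 3s progress interval yield progress callbacks at 3s, 6s, and 9s before the timeout diagnostic is returned.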

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…ht openshell argv[0]

PR #4072 review fixes.

- `forward-cleanup.ts`: add `bestEffortForwardStopForSandbox` which
  consults `openshell forward list` before stopping a port. Skips the
  stop when the port is owned by a different sandbox, so a TOCTOU /
  port-conflict retry in `ensureDashboardForward` can no longer kill
  another sandbox's forward.
- `onboard.ts`: use `bestEffortForwardStopForSandbox` for both the
  pre-start cleanup and the `beforeRetry` callback. Pass an explicit
  `timeout: 5_000` to `runCaptureOpenshell(["forward", "list"])` so a
  hung gateway probe cannot bypass the helper's overall deadline.
- `forward-start.ts`: preflight `argv[0]` with `fs.accessSync(_, X_OK)`
  before spawning. The detached-spawn path runs inside a synchronous
  poll loop, so Node's async `error` event cannot reach the helper
  while it sleeps in `spawnSync`; surfacing ENOENT/EACCES at preflight
  time turns the would-be 180s timeout into an immediate spawn-error.
  The async error listener remains as belt-and-braces.
- `forward-start.ts`: move `maxRetries` into the options object so the
  onboard callsite stays a single positional argument and the
  entrypoint budget remains net-zero against main.
- `forward-cleanup.test.ts`: new file covering the four owner outcomes
  (stopped, owned-other, no-entry, list-failed) plus the non-live
  status filter.
- `forward-start.test.ts`: replace the synthetic async-callback test
  with a real `buildDetachedForwardStartSpawn(["/missing"])` assertion
  so we exercise the actual preflight; update existing
  `runDetachedForwardStartWithPortReleaseRetries` callers to pass
  `maxRetries` via the options object.
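
As a rough illustration of the preflight step, the sketch below checks the executable bit before any spawn is attempted. `preflightExecutable` is a hypothetical name; in the PR the check lives inside `buildDetachedForwardStartSpawn`. `accessSync` and `constants.X_OK` are the standard Node.js `fs` APIs.

```typescript
import { accessSync, constants } from "node:fs";

// Hypothetical helper: returns the access error, or undefined when the
// path exists and is executable by the current user.
function preflightExecutable(path: string): Error | undefined {
  try {
    accessSync(path, constants.X_OK);
    return undefined;
  } catch (err) {
    // Typically ENOENT (missing) or EACCES (not executable).
    return err instanceof Error ? err : new Error(String(err));
  }
}
```

Because the check is synchronous, a missing or non-executable binary surfaces before the poll loop starts. A race between the check and the actual exec is still possible, which is why the commit keeps the async listener as a fallback.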

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 8620-8622: The poll of the external CLI inside
runDetachedForwardStartWithPortReleaseRetries is currently fatal because
runCaptureOpenshell(["forward", "list"], { timeout: 5_000 }) will surface
errors; change the polling call to be non-fatal by passing ignoreError: true
(e.g. runCaptureOpenshell(["forward", "list"], { timeout: 5_000, ignoreError:
true })) or otherwise catch/log errors and return a falsy/neutral diagnostic so
that runDetachedForwardStartWithPortReleaseRetries can continue its
retry/diagnostic flow; update the call site that constructs the polling lambda
(the second argument to runDetachedForwardStartWithPortReleaseRetries) used
alongside buildDetachedForwardStartSpawn and openshellArgv to ensure transient
openshell failures do not abort onboarding.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0361f22d-7b7f-41ef-9b01-4e342619df38

📥 Commits

Reviewing files that changed from the base of the PR and between 9b635e4 and 5a98412.

📒 Files selected for processing (5)
  • src/lib/onboard.ts
  • src/lib/onboard/forward-cleanup.test.ts
  • src/lib/onboard/forward-cleanup.ts
  • src/lib/onboard/forward-start.test.ts
  • src/lib/onboard/forward-start.ts

Comment thread src/lib/onboard.ts Outdated
Comment on lines +8620 to +8622
const { ok: fwdOk, diagnostic: fwdDiagnostic } = runDetachedForwardStartWithPortReleaseRetries(
buildDetachedForwardStartSpawn(openshellArgv(["forward", "start", "--background", actualTarget, sandboxName])),
() => runCaptureOpenshell(["forward", "list"], { timeout: 5_000 }),
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make forward list polling non-fatal.

Line 8622 polls an external CLI without ignoreError: true. A transient openshell forward list failure can bypass runDetachedForwardStartWithPortReleaseRetries's retry/diagnostic path and still abort onboarding or trigger the rollback path for a healthy sandbox.

Suggested fix
   const { ok: fwdOk, diagnostic: fwdDiagnostic } = runDetachedForwardStartWithPortReleaseRetries(
     buildDetachedForwardStartSpawn(openshellArgv(["forward", "start", "--background", actualTarget, sandboxName])),
-    () => runCaptureOpenshell(["forward", "list"], { timeout: 5_000 }),
+    () => runCaptureOpenshell(["forward", "list"], {
+      ignoreError: true,
+      timeout: 5_000,
+    }),
     { port: actualPort, sandboxName },
     () => { sleep(1); bestEffortForwardStopForSandbox(runOpenshell, runCaptureOpenshell, actualPort, sandboxName); },
     { onProgress: buildForwardStartProgressLogger(actualPort) },
   );

As per coding guidelines, src/lib/onboard.ts: “This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow.”
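
One way to make the poll non-fatal without losing the failure detail is a small wrapper around the list call. This is a sketch under assumptions: `makeNonFatalFetch` and `lastError` are hypothetical names, and the wrapper assumes the underlying call throws on failure. It records the most recent error for the diagnostic and clears it after a successful call, so a recovered gateway leaves no stale failure behind.

```typescript
// Hypothetical wrapper: swallows list failures but remembers the latest one.
function makeNonFatalFetch(fetchList: () => string) {
  let lastFetchError: string | undefined;
  return {
    fetch(): string {
      try {
        const out = fetchList();
        lastFetchError = undefined; // clear stale error once the call succeeds
        return out;
      } catch (err) {
        lastFetchError = err instanceof Error ? err.message : String(err);
        return ""; // neutral result: the poll loop simply sees no match
      }
    },
    lastError: (): string | undefined => lastFetchError,
  };
}
```

This keeps transient failures from aborting onboarding while still letting the timeout diagnostic report why the list was empty.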

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26325788526
Target ref: 5a984120cde9f1b0feeb2e33464905d3d2edc264
Workflow ref: main
Requested jobs: double-onboard-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
double-onboard-e2e ✅ success

PR #4072 review fixes.

- `forward-cleanup.ts`: when `openshell forward list` fails, the
  owner-scoped stop now skips entirely instead of falling through to a
  port-only `forward stop` that could kill an unrelated sandbox's
  forward. Owner-confirmed and no-entry paths use the sandbox-scoped
  `forward stop <port> <sandbox>` form, so a TOCTOU window between list
  and stop cannot take down another sandbox's forward either. The probe
  timeout now comes from the shared `OPENSHELL_PROBE_TIMEOUT_MS`
  constant (15s) instead of an ad-hoc 5_000; a slow Docker-compat
  gateway could otherwise time out every list call and push the helper
  into a false rollback.
- `onboard.ts`: import `OPENSHELL_PROBE_TIMEOUT_MS` and use it for the
  forward-list poll's per-call timeout, matching the other openshell
  probe sites.
- `forward-start.ts`: drop the `onAsyncError` parameter and the
  `asyncSpawnError` check. The detached-spawn path runs inside a
  synchronous poll loop, so Node's async `error` event cannot reach the
  helper while it sleeps inside `spawnSync`. The preflight
  `fs.accessSync(argv[0], X_OK)` already catches the common
  ENOENT/EACCES cases; the helper now keeps only a no-op `error`
  listener to swallow any late event so it cannot crash the process.
- `forward-start.ts`: on timeout / port-conflict outcomes, send SIGTERM
  to the detached `openshell forward start --background` pid so the
  child cannot still register a forward minutes after onboard rolled
  the sandbox back (which would race the next onboard attempt for the
  same port).
- `forward-cleanup.test.ts`: cover the new behaviours — list-failed now
  skips stop entirely, owner-confirmed and no-entry use the
  sandbox-scoped 4-arg `forward stop` form, and the probe timeout
  asserts `OPENSHELL_PROBE_TIMEOUT_MS` (15s) rather than the previous
  5_000 literal.
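
The owner-scoped decision described in this commit can be sketched as a pure function. Everything here is illustrative: `ForwardEntry`, `StopDecision`, and `decideForwardStop` are hypothetical names, and because the output format of `openshell forward list` is not shown in this thread, entries are assumed to arrive already parsed (with `null` standing in for a failed list call). The non-live status filter mentioned in the tests is left out.

```typescript
// Hypothetical types: parsing of the real list output is deliberately omitted.
interface ForwardEntry {
  port: number;
  sandbox: string;
}

type StopDecision =
  | { action: "stop"; args: string[]; why: "owned" | "no-entry" }
  | { action: "skip"; why: "owned-other" | "list-failed" };

function decideForwardStop(
  entries: ForwardEntry[] | null, // null models a failed list call
  port: number,
  sandbox: string,
): StopDecision {
  // Without a trustworthy list, do nothing rather than risk a port-only stop.
  if (entries === null) return { action: "skip", why: "list-failed" };
  const entry = entries.find((e) => e.port === port);
  if (entry && entry.sandbox !== sandbox) {
    return { action: "skip", why: "owned-other" };
  }
  // Sandbox-scoped form, so a race between list and stop stays harmless.
  return {
    action: "stop",
    args: ["forward", "stop", String(port), sandbox],
    why: entry ? "owned" : "no-entry",
  };
}
```

Keeping the decision separate from the CLI call makes the four outcomes easy to cover in unit tests without a gateway.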

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
src/lib/onboard/forward-start.ts (1)

113-121: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat child.pid === undefined as a spawn failure.

This still has a false-timeout path: spawnChild() can return a ChildProcess with no pid and then emit "error" asynchronously. Right now that event is swallowed and the helper falls through into the 180s poll loop, so a launch failure is reported as timeout instead of spawn-error.

🛠️ Minimal fix
       const child = spawnChild(argv[0], argv.slice(1), {
         stdio: ["ignore", stdout, stderr],
         detached: true,
       });
       // Swallow any belated `error` event so a race between accessSync and
       // execve does not crash the process via an unhandled emitter.
       child.on("error", () => {});
+      if (child.pid == null) {
+        return {
+          error: new Error(`Failed to spawn detached forward start: ${argv[0]}`),
+        };
+      }
       child.unref();
       return { pid: child.pid };
In Node.js child_process.spawn(), if process launch fails, can spawn() return a ChildProcess with pid undefined and emit an "error" event asynchronously instead of throwing?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/forward-start.ts` around lines 113 - 121, The spawned
ChildProcess may have an undefined pid and later emit "error", so treat
child.pid === undefined as an immediate spawn failure: after calling
spawnChild(...) check if child.pid is undefined and if so remove the no-op error
swallow, attach an error handler that surfaces/returns a spawn-error (or throw)
and ensure you unref/cleanup the child if necessary; keep the existing
child.on("error", () => {}) behavior only for cases with a valid pid and
otherwise return/propagate a spawn failure immediately from the function that
calls spawnChild.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6b7147b9-541b-4549-8303-3407c6ad44b7

📥 Commits

Reviewing files that changed from the base of the PR and between 5a98412 and 10f4f32.

📒 Files selected for processing (4)
  • src/lib/onboard.ts
  • src/lib/onboard/forward-cleanup.test.ts
  • src/lib/onboard/forward-cleanup.ts
  • src/lib/onboard/forward-start.ts

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26327142805
Target ref: 10f4f329d6e075861ceb43a876444b78bba6c6b7
Workflow ref: main
Requested jobs: double-onboard-e2e,cloud-onboard-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
double-onboard-e2e ✅ success

…tests

PR #4072 review fixes.

- `forward-start.ts`: `buildDetachedForwardStartSpawn` now returns a
  synchronous spawn-error when `spawn` came back with `child.pid ==
  null`. Without this the swallowed `error` listener would let the
  caller wait the full 180s deadline for a child that never actually
  ran. `child.unref()` only runs once the pid is known.
- `forward-start.ts`: clear `lastFetchError` after a successful
  `fetchForwardList` so a recovered gateway does not leave a stale
  "openshell forward list failed: …" tail on the eventual timeout
  diagnostic.
- `forward-start.ts` / `forward-start.test.ts`: drop the issue/PR
  references from the new comments; keep the generic rationale.
- `forward-start.test.ts`: cover the SIGTERM paths — timeout SIGTERMs
  the detached pid, port-conflict diagnostic SIGTERMs the detached pid
  (driven by writing a real EADDRINUSE line to the captured stderr
  fd), and a spawn-error outcome (no pid) leaves `process.kill`
  untouched. Also pin the lastFetchError clearing behaviour.
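
A minimal sketch of the best-effort SIGTERM behaviour these tests pin down. `bestEffortTerminate` and the injectable `kill` parameter are hypothetical; `process.kill(pid, "SIGTERM")` is the standard Node.js call and throws when the target process no longer exists.

```typescript
// Hypothetical sketch: `kill` is injectable so tests need not signal a real process.
type Kill = (pid: number, signal: "SIGTERM") => void;

function bestEffortTerminate(
  pid: number | undefined,
  kill: Kill = (p, s) => {
    process.kill(p, s);
  },
): boolean {
  // Spawn-error outcome: there is no child, so nothing is signalled.
  if (pid == null) return false;
  try {
    kill(pid, "SIGTERM");
    return true;
  } catch {
    // For example ESRCH when the child has already exited; ignored by design.
    return false;
  }
}
```

The return value is only informational; callers on the timeout and port-conflict paths would continue with rollback either way.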

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26327552584
Target ref: 169d51fb79d90581e15ffe0026a3b082eb644c53
Workflow ref: main
Requested jobs: double-onboard-e2e,cloud-onboard-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
double-onboard-e2e ✅ success

…ach-poll-4064

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26329354954
Target ref: 3e80e72b1fdf086857e47d7841c7fd65098d26d2
Workflow ref: main
Requested jobs: double-onboard-e2e,cloud-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-e2e ✅ success
double-onboard-e2e ✅ success

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26334005293
Target ref: dddbe5d16042aa6ebb5d597a493ed1f001124c36
Workflow ref: main
Requested jobs: double-onboard-e2e,tunnel-lifecycle-e2e,cloud-onboard-e2e
Summary: 3 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
double-onboard-e2e ✅ success
tunnel-lifecycle-e2e ✅ success

… in test

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng added the v0.0.51 Release target label May 23, 2026
@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26335995740
Target ref: 707d647a8a7d6fbb0c51a5871b5bc8f148221e82
Workflow ref: main
Requested jobs: cloud-onboard-e2e,double-onboard-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
double-onboard-e2e ✅ success

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa merged commit 0f48781 into main May 23, 2026
29 checks passed

Labels

  • fix
  • Sandbox: Use this label to identify issues related to the NemoClaw isolated environment based on OpenShell.
  • v0.0.51 Release target

Development

Successfully merging this pull request may close these issues.

nemoclaw onboard rolls back after dashboard forward detection fails, although OpenShell forwarding works independently

2 participants