fix(onboard): poll forward list instead of waiting on openshell CLI exit#4072
Conversation
…xit (#4064) `ensureDashboardForward` invoked `openshell forward start --background` via `spawnSync` with a 30-second timeout. On hosts that fall back to the Docker compatibility gateway (host glibc older than the openshell-gateway requirement), the daemonised forward inherits the parent CLI's stdio, so spawnSync waits on those file descriptors long after `--background` returned. The wait routinely exceeded the 30-second timeout, producing `spawnSync ... openshell ETIMEDOUT` and tripping the rollback path even though the dashboard was healthy inside the sandbox (port 18789 listening, `/health` returning 200). Replace the spawnSync-with-timeout path with a detached spawn plus a poll of `openshell forward list`. The CLI's exit code is no longer the success signal — the appearance of a matching `(port, sandboxName)` entry in the live forward list is. EADDRINUSE-style diagnostics from the parent process are still surfaced before the deadline so the existing port-conflict retry path keeps working. Adds focused unit tests under `src/lib/onboard/forward-start.test.ts` for the new helpers and updates the broader onboard tests that exercise `createSandbox` so their `childProcess.spawn` mocks capture the full argv and their `forward list` mocks include a matching entry for the polled port. Fixes #4064 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughSwitches dashboard forward creation from a background runner to a detached spawn + polling flow; adds detached spawn builder, diagnostic capture, readiness polling via ChangesDetached Forward-Start Implementation and Integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 8647-8682: The new detached forward-launching block (building
forwardStartArgv and the inline launcher passed into
runDetachedForwardStartWithPortReleaseRetries) should be extracted into a new
module (e.g., src/lib/onboard/forward-start.ts) so ensureDashboardForward
remains a thin call site; move the logic that constructs forwardStartArgv, the
detached spawn wrapper (the function that does spawn(..., { stdio: ["ignore",
stdout, stderr], detached: true }) and child.unref()), the retry/backoff wiring
that calls runCaptureOpenshell(["forward","list"], ...) and the cleanup callback
(sleep + bestEffortForwardStop) into exported helper(s) like
startDetachedForwardWithRetries or runDetachedForwardStart, then update
ensureDashboardForward to call that helper (passing actualTarget, sandboxName,
actualPort, runOpenshell, etc.) and return the existing fwdOutcome handling
unchanged.
In `@src/lib/onboard/forward-start.ts`:
- Around line 153-158: The current try/catch around fetchForwardList() swallows
errors so the timeout diagnostic loses the last failure; change the logic in
forward-start.ts where list is set (the try/catch using fetchForwardList()) and
the similar block later to capture and store the last thrown Error (e.g.,
lastFetchError) whenever fetchForwardList() throws, and then include that
error.message (or the Error object) in the timeout/“did not appear” diagnostic
so the timeout path reports the actual fetch failure instead of a generic
message. Ensure both occurrences (the initial fetch and the later polling block)
follow the same pattern and propagate the preserved error into the final
diagnostic.
- Around line 65-75: blockingSleepMs currently spawns a child with an unref'd
timer so the child can exit immediately; change the spawn to a blocking child
(remove .unref() and ensure the timer keeps the process alive, e.g., run node -e
"setTimeout(()=>{}, ms)" or otherwise use a synchronous sleep primitive) so
spawnSync truly waits for ms, and keep the injectable spawnSync pattern in
blockingSleepMs; also stop swallowing fetchForwardList() errors — in the polling
loop capture the thrown error (do not use catch { list = "" }) and include that
error message/stack in the returned diagnostic (or attach it to the
stderr/stdout context) so failures to fetch the list are reflected alongside the
child spawn output when reporting timeouts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ae78cde6-68e6-47af-86f3-5c211a1c1459
📒 Files selected for processing (7)
src/lib/onboard.tssrc/lib/onboard/forward-start.test.tssrc/lib/onboard/forward-start.tstest/onboard-custom-dockerfile.test.tstest/onboard-messaging.test.tstest/onboard.test.tstest/shellquote-sandbox.test.ts
PR Review AdvisorFindings: 0 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
Selective E2E Results — ✅ All requested jobs passedRun: 26295520377
|
… list errors PR #4072 review fixes. - `forward-start.ts`: export `buildDetachedForwardStartSpawn(argv)` so the detached-spawn wiring lives next to the consumer. `onboard.ts`'s call site shrinks from a 27-line lambda to a single helper call, putting `src/lib/onboard.ts` back inside the entrypoint budget. - `forward-start.ts:blockingSleepMs`: drop `setTimeout(...).unref()`. The unref'd timer let the sleep child's event loop drain immediately so spawnSync returned right away and the caller spun the poll loop instead of pausing between forward-list fetches. - `forward-start.ts:runDetachedForwardStartWithDiagnostics`: record the last `fetchForwardList()` error and append it to the diagnostic on timeout. Previously a persistent gateway/list failure was swallowed and the user only saw a generic "forward did not appear" message. - `test/onboard.test.ts`: align the last remaining spawn mock with the full-argv capture pattern used elsewhere. - `src/lib/onboard/forward-start.test.ts`: new regression test for the diagnostic suffix when `fetchForwardList()` keeps throwing. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
8647-8652:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTrim this callsite back under the onboard-entrypoint budget.
The helper extraction is good, but this file still fails CI at
+6 netlines. These new inline explanation/build-up lines are enough to keeponboard-entrypoint-budgetred, so please collapse or move them out ofsrc/lib/onboard.ts.Suggested minimal trim
- // Detached spawn + forward-list poll (`#4064`) so the openshell CLI's - // inherited stdio cannot trip spawnSync's timeout on Docker-compat hosts. const fwdOutcome = runDetachedForwardStartWithPortReleaseRetries( - buildDetachedForwardStartSpawn( - openshellArgv(["forward", "start", "--background", actualTarget, sandboxName]), - ), + buildDetachedForwardStartSpawn(openshellArgv(["forward", "start", "--background", actualTarget, sandboxName])), () => runCaptureOpenshell(["forward", "list"], { ignoreError: true }), { port: actualPort, sandboxName },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 8647 - 8652, The extra inline explanatory/build-up lines around the detached-forward call cause the onboard-entrypoint budget to exceed; collapse this callsite by removing or relocating the commentary and intermediate staging lines in src/lib/onboard.ts and replace them with a single concise invocation of runDetachedForwardStartWithPortReleaseRetries(buildDetachedForwardStartSpawn(openshellArgv(["forward","start","--background", actualTarget, sandboxName]))) (or move the expanded explanation to a separate helper file or doc), keeping only the minimal assignment to fwdOutcome and any essential error handling; ensure the unique identifiers runDetachedForwardStartWithPortReleaseRetries, buildDetachedForwardStartSpawn, openshellArgv, and fwdOutcome are preserved so callers remain unchanged.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
8647-8659: Please run the onboarding E2E slice for this path.This changes dashboard-forward creation/detection inside
src/lib/onboard.ts, so I'd want the relevant onboarding regressions exercised before merge:cloud-e2e,sandbox-operations-e2e,rebuild-openclaw-e2e,channels-stop-start-e2e, andopenshell-gateway-upgrade-e2e.As per coding guidelines, "src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow." and the listed E2E test recommendation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 8647 - 8659, The change affects dashboard-forward creation/detection in src/lib/onboard.ts around runDetachedForwardStartWithPortReleaseRetries / buildDetachedForwardStartSpawn / openshellArgv / runCaptureOpenshell / bestEffortForwardStop (and uses runOpenshell, actualPort, sandboxName), so run the onboarding E2E slice exercising those flows: execute cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, channels-stop-start-e2e, and openshell-gateway-upgrade-e2e (locally or in CI) targeting the modified path; if any test fails, reproduce with logs for the forward start/list/stop sequence, trace through the above functions to fix detection/creation issues, and re-run the listed E2E tests until they pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/forward-start.ts`:
- Around line 94-104: The current buildDetachedForwardStartSpawn only catches
synchronous spawnChild errors; fix it by attaching a one-time "error" listener
to the ChildProcess returned from spawnChild inside
buildDetachedForwardStartSpawn so async launch failures (ENOENT/EACCES) are
captured and propagated into the DetachedForwardSpawnRunner outcome used by
runDetachedForwardStartWithDiagnostics; specifically, after calling
spawnChild(...) in buildDetachedForwardStartSpawn, call child.once("error", err
=> set spawnResult.error = err instanceof Error ? err : new Error(String(err)))
and ensure the function waits/returns a result that reflects that error (or
otherwise resolves the runner) before calling child.unref() or returning { pid:
child.pid } so runDetachedForwardStartWithDiagnostics can see reason:
"spawn-error" when pid is undefined.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 8647-8652: The extra inline explanatory/build-up lines around the
detached-forward call cause the onboard-entrypoint budget to exceed; collapse
this callsite by removing or relocating the commentary and intermediate staging
lines in src/lib/onboard.ts and replace them with a single concise invocation of
runDetachedForwardStartWithPortReleaseRetries(buildDetachedForwardStartSpawn(openshellArgv(["forward","start","--background",
actualTarget, sandboxName]))) (or move the expanded explanation to a separate
helper file or doc), keeping only the minimal assignment to fwdOutcome and any
essential error handling; ensure the unique identifiers
runDetachedForwardStartWithPortReleaseRetries, buildDetachedForwardStartSpawn,
openshellArgv, and fwdOutcome are preserved so callers remain unchanged.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 8647-8659: The change affects dashboard-forward creation/detection
in src/lib/onboard.ts around runDetachedForwardStartWithPortReleaseRetries /
buildDetachedForwardStartSpawn / openshellArgv / runCaptureOpenshell /
bestEffortForwardStop (and uses runOpenshell, actualPort, sandboxName), so run
the onboarding E2E slice exercising those flows: execute cloud-e2e,
sandbox-operations-e2e, rebuild-openclaw-e2e, channels-stop-start-e2e, and
openshell-gateway-upgrade-e2e (locally or in CI) targeting the modified path; if
any test fails, reproduce with logs for the forward start/list/stop sequence,
trace through the above functions to fix detection/creation issues, and re-run
the listed E2E tests until they pass.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d8bc0303-3a73-45e7-8649-9e385d8f5c1e
📒 Files selected for processing (4)
src/lib/onboard.tssrc/lib/onboard/forward-start.test.tssrc/lib/onboard/forward-start.tstest/onboard.test.ts
Selective E2E Results — ✅ All requested jobs passedRun: 26298405268
|
…, fix sleep child PR #4072 review fixes. - `onboard.ts`: collapse `ensureDashboardForward` callsite so net diff against main stays negative (entrypoint budget). `fwdDiagnostic` is destructured directly from the helper outcome and the explicit `{ ignoreError: true }` on `forward list` is dropped so genuine gateway/list failures propagate into the diagnostic instead of being silently coerced to `""`. - `forward-start.ts`: extend `DetachedForwardSpawnRunner` with an optional `onAsyncError` callback. `buildDetachedForwardStartSpawn` registers `child.on("error", …)` so post-spawn ENOENT/EACCES events no longer escape as unhandled errors; the helper surfaces them via the existing `spawn-error` outcome path. - `forward-start.test.ts`: new regression test simulating an async spawn error fired during the helper's poll-loop sleep. - `test/onboard.test.ts`: fix two leftover spawn mocks that came in during the merge with main — they used the legacy `args[1][1]` capture shape and lacked `child.unref`, so the new detached spawn path would crash with `child.unref is not a function` in CI. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…ach-poll-4064 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard/forward-start.ts (1)
164-221:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftEnsure
asyncSpawnErrorcan be delivered during the polling loop
buildDetachedForwardStartSpawn()registerschild.on("error", onAsyncError), butrunDetachedForwardStartWithDiagnostics()runs a synchronouswhileloop that repeatedly callsfetchForwardList()and (by default) blocks insleepImplviablockingSleepMs()(spawnSync). While this code is running, Node can’t dispatch the queued ChildProcess"error"event handler, soasyncSpawnErroris unlikely to be set before the loop ends—making thereason: "spawn-error"path unreliable and turning async launch failures into"timeout".
Change the loop to yield to the event loop between polls (e.g., make the helper async and use real async sleep) or use an approach that doesn’t depend on a queued"error"callback.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/forward-start.ts` around lines 164 - 221, The polling loop in runDetachedForwardStartWithDiagnostics relies on a synchronous while/sleepImpl (blockingSleepMs) so Node cannot deliver the ChildProcess "error" event to set asyncSpawnError; change the loop to be asynchronous so events can run: make runDetachedForwardStartWithDiagnostics (or the helper containing the while) async, replace the blocking sleepImpl/blockingSleepMs call with a non-blocking awaitable sleep (e.g., await new Promise(r => setTimeout(r, pollIntervalMs)) or an exported async sleepImpl) and ensure runDetachedSpawn/registering via buildDetachedForwardStartSpawn still sets asyncSpawnError; afterward the loop should check asyncSpawnError between awaits so the "spawn-error" path is reachable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 8641-8645: The retry cleanup currently calls
bestEffortForwardStop(runOpenshell, actualPort) unconditionally and can stop a
forward owned by a different sandbox; update the retry callback passed to
runDetachedForwardStartWithPortReleaseRetries so it first queries current
forwards (e.g., via runCaptureOpenshell(["forward", "list"]) or the same helper
used elsewhere), finds the entry for actualPort and verifies its sandbox/owner
equals sandboxName, and only then calls bestEffortForwardStop(runOpenshell,
actualPort); keep all existing symbols
(runDetachedForwardStartWithPortReleaseRetries, buildDetachedForwardStartSpawn,
runCaptureOpenshell, bestEffortForwardStop, runOpenshell, actualPort,
sandboxName) and ensure the ownership check prevents stopping another sandbox's
forward.
---
Outside diff comments:
In `@src/lib/onboard/forward-start.ts`:
- Around line 164-221: The polling loop in
runDetachedForwardStartWithDiagnostics relies on a synchronous while/sleepImpl
(blockingSleepMs) so Node cannot deliver the ChildProcess "error" event to set
asyncSpawnError; change the loop to be asynchronous so events can run: make
runDetachedForwardStartWithDiagnostics (or the helper containing the while)
async, replace the blocking sleepImpl/blockingSleepMs call with a non-blocking
awaitable sleep (e.g., await new Promise(r => setTimeout(r, pollIntervalMs)) or
an exported async sleepImpl) and ensure runDetachedSpawn/registering via
buildDetachedForwardStartSpawn still sets asyncSpawnError; afterward the loop
should check asyncSpawnError between awaits so the "spawn-error" path is
reachable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4650746e-cdba-45c8-a2c7-707f5a8771d5
📒 Files selected for processing (4)
src/lib/onboard.tssrc/lib/onboard/forward-start.test.tssrc/lib/onboard/forward-start.tstest/onboard.test.ts
E2E Scenario Advisor RecommendationRequired scenario E2E: None Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
Selective E2E Results — ✅ All requested jobs passedRun: 26324778429
|
…ate on timeout (#4064) User reproduced #4064 on the merged branch: spawnSync ETIMEDOUT is gone but the helper now times out at 60s with "forward did not appear in list". The user's openshell-gateway runs inside the Docker compatibility container (host glibc < required) and per-call gRPC latency pushes the forward-registration handshake past 60s. Manual `openshell forward start <port> <sandbox> -d` against an unrelated sandbox completes immediately, so the CLI path is healthy. - `forward-start.ts`: bump default `overallTimeoutMs` from 60_000 to 180_000 so Docker-compat gateways have headroom for the forward-registration handshake. - `forward-start.ts`: add `onProgress` / `progressIntervalMs` options so the helper can emit a periodic "still waiting" line during the longer wait, and append the last `openshell forward list` snapshot to the timeout diagnostic so users (and triage) can see whether the gateway returned an empty list or an entry in an unexpected state. - `forward-start.ts`: export `buildForwardStartProgressLogger(port)` so the onboard call site stays a one-liner — the entrypoint budget for `src/lib/onboard.ts` is still net-zero against main. - `forward-start.test.ts`: new regression test for the progress callback cadence and the `last forward list:` diagnostic suffix. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…ht openshell argv[0] PR #4072 review fixes. - `forward-cleanup.ts`: add `bestEffortForwardStopForSandbox` which consults `openshell forward list` before stopping a port. Skips the stop when the port is owned by a different sandbox, so a TOCTOU / port-conflict retry in `ensureDashboardForward` can no longer kill another sandbox's forward. - `onboard.ts`: use `bestEffortForwardStopForSandbox` for both the pre-start cleanup and the `beforeRetry` callback. Pass an explicit `timeout: 5_000` to `runCaptureOpenshell(["forward", "list"])` so a hung gateway probe cannot bypass the helper's overall deadline. - `forward-start.ts`: preflight `argv[0]` with `fs.accessSync(_, X_OK)` before spawning. The detached-spawn path runs inside a synchronous poll loop, so Node's async `error` event cannot reach the helper while it sleeps in `spawnSync`; surfacing ENOENT/EACCES at preflight time turns the would-be 180s timeout into an immediate spawn-error. The async error listener remains as belt-and-braces. - `forward-start.ts`: move `maxRetries` into the options object so the onboard callsite stays a single positional argument and the entrypoint budget remains net-zero against main. - `forward-cleanup.test.ts`: new file covering the four owner outcomes (stopped, owned-other, no-entry, list-failed) plus the non-live status filter. - `forward-start.test.ts`: replace the synthetic async-callback test with a real `buildDetachedForwardStartSpawn(["/missing"])` assertion so we exercise the actual preflight; update existing `runDetachedForwardStartWithPortReleaseRetries` callers to pass `maxRetries` via the options object. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 8620-8622: The poll of the external CLI inside
runDetachedForwardStartWithPortReleaseRetries is currently fatal because
runCaptureOpenshell(["forward", "list"], { timeout: 5_000 }) will surface
errors; change the polling call to be non-fatal by passing ignoreError: true
(e.g. runCaptureOpenshell(["forward", "list"], { timeout: 5_000, ignoreError:
true })) or otherwise catch/log errors and return a falsy/neutral diagnostic so
that runDetachedForwardStartWithPortReleaseRetries can continue its
retry/diagnostic flow; update the call site that constructs the polling lambda
(the second argument to runDetachedForwardStartWithPortReleaseRetries) used
alongside buildDetachedForwardStartSpawn and openshellArgv to ensure transient
openshell failures do not abort onboarding.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0361f22d-7b7f-41ef-9b01-4e342619df38
📒 Files selected for processing (5)
src/lib/onboard.tssrc/lib/onboard/forward-cleanup.test.tssrc/lib/onboard/forward-cleanup.tssrc/lib/onboard/forward-start.test.tssrc/lib/onboard/forward-start.ts
| const { ok: fwdOk, diagnostic: fwdDiagnostic } = runDetachedForwardStartWithPortReleaseRetries( | ||
| buildDetachedForwardStartSpawn(openshellArgv(["forward", "start", "--background", actualTarget, sandboxName])), | ||
| () => runCaptureOpenshell(["forward", "list"], { timeout: 5_000 }), |
There was a problem hiding this comment.
Make forward list polling non-fatal.
Line 8622 polls an external CLI without ignoreError: true. A transient openshell forward list failure can bypass runDetachedForwardStartWithPortReleaseRetries's retry/diagnostic path and still abort onboarding or trigger the rollback path for a healthy sandbox.
Suggested fix
const { ok: fwdOk, diagnostic: fwdDiagnostic } = runDetachedForwardStartWithPortReleaseRetries(
buildDetachedForwardStartSpawn(openshellArgv(["forward", "start", "--background", actualTarget, sandboxName])),
- () => runCaptureOpenshell(["forward", "list"], { timeout: 5_000 }),
+ () => runCaptureOpenshell(["forward", "list"], {
+ ignoreError: true,
+ timeout: 5_000,
+ }),
{ port: actualPort, sandboxName },
() => { sleep(1); bestEffortForwardStopForSandbox(runOpenshell, runCaptureOpenshell, actualPort, sandboxName); },
{ onProgress: buildForwardStartProgressLogger(actualPort) },
);As per coding guidelines, src/lib/onboard.ts: “This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/onboard.ts` around lines 8620 - 8622, The poll of the external CLI
inside runDetachedForwardStartWithPortReleaseRetries is currently fatal because
runCaptureOpenshell(["forward", "list"], { timeout: 5_000 }) will surface
errors; change the polling call to be non-fatal by passing ignoreError: true
(e.g. runCaptureOpenshell(["forward", "list"], { timeout: 5_000, ignoreError:
true })) or otherwise catch/log errors and return a falsy/neutral diagnostic so
that runDetachedForwardStartWithPortReleaseRetries can continue its
retry/diagnostic flow; update the call site that constructs the polling lambda
(the second argument to runDetachedForwardStartWithPortReleaseRetries) used
alongside buildDetachedForwardStartSpawn and openshellArgv to ensure transient
openshell failures do not abort onboarding.
Selective E2E Results — ✅ All requested jobs passedRun: 26325788526
|
PR #4072 review fixes. - `forward-cleanup.ts`: when `openshell forward list` fails, the owner- scoped stop now skips entirely instead of falling through to a port-only `forward stop` that could kill an unrelated sandbox's forward. Owner-confirmed and no-entry paths use the sandbox-scoped `forward stop <port> <sandbox>` form so a TOCTOU window between list and stop cannot collateral another sandbox's forward either. Pulled the probe timeout from the shared `OPENSHELL_PROBE_TIMEOUT_MS` constant (15s) instead of an ad-hoc 5_000 — a slow Docker-compat gateway can otherwise time out every list call and flip the helper into false-rollback territory. - `onboard.ts`: import `OPENSHELL_PROBE_TIMEOUT_MS` and use it for the forward-list poll's per-call timeout, matching the other openshell probe sites. - `forward-start.ts`: drop the `onAsyncError` parameter and the `asyncSpawnError` check. The detached-spawn path runs inside a synchronous poll loop, so Node's async `error` event cannot reach the helper while it sleeps inside `spawnSync`. The preflight `fs.accessSync(argv[0], X_OK)` already catches the common ENOENT/EACCES cases; the helper now keeps only a no-op `error` listener to swallow any late event so it cannot crash the process. - `forward-start.ts`: on timeout / port-conflict outcomes, send SIGTERM to the detached `openshell forward start --background` pid so the child cannot still register a forward minutes after onboard rolled the sandbox back (which would race the next onboard attempt for the same port). - `forward-cleanup.test.ts`: cover the new behaviours — list-failed now skips stop entirely, owner-confirmed and no-entry use the sandbox-scoped 4-arg `forward stop` form, and the probe timeout asserts `OPENSHELL_PROBE_TIMEOUT_MS` (15s) rather than the previous 5_000 literal. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/lib/onboard/forward-start.ts (1)
113-121:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTreat
child.pid === undefinedas a spawn failure.This still has a false-timeout path:
spawnChild()can return aChildProcesswith nopidand then emit"error"asynchronously. Right now that event is swallowed and the helper falls through into the 180s poll loop, so a launch failure is reported astimeoutinstead ofspawn-error.🛠️ Minimal fix
const child = spawnChild(argv[0], argv.slice(1), { stdio: ["ignore", stdout, stderr], detached: true, }); // Swallow any belated `error` event so a race between accessSync and // execve does not crash the process via an unhandled emitter. child.on("error", () => {}); + if (child.pid == null) { + return { + error: new Error(`Failed to spawn detached forward start: ${argv[0]}`), + }; + } child.unref(); return { pid: child.pid };In Node.js child_process.spawn(), if process launch fails, can spawn() return a ChildProcess with pid undefined and emit an "error" event asynchronously instead of throwing?🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/forward-start.ts` around lines 113 - 121, The spawned ChildProcess may have an undefined pid and later emit "error", so treat child.pid === undefined as an immediate spawn failure: after calling spawnChild(...) check if child.pid is undefined and if so remove the no-op error swallow, attach an error handler that surfaces/returns a spawn-error (or throw) and ensure you unref/cleanup the child if necessary; keep the existing child.on("error", () => {}) behavior only for cases with a valid pid and otherwise return/propagate a spawn failure immediately from the function that calls spawnChild.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/lib/onboard/forward-start.ts`:
- Around line 113-121: The spawned ChildProcess may have an undefined pid and
later emit "error", so treat child.pid === undefined as an immediate spawn
failure: after calling spawnChild(...) check if child.pid is undefined and if so
remove the no-op error swallow, attach an error handler that surfaces/returns a
spawn-error (or throw) and ensure you unref/cleanup the child if necessary; keep
the existing child.on("error", () => {}) behavior only for cases with a valid
pid and otherwise return/propagate a spawn failure immediately from the function
that calls spawnChild.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6b7147b9-541b-4549-8303-3407c6ad44b7
📒 Files selected for processing (4)
src/lib/onboard.tssrc/lib/onboard/forward-cleanup.test.tssrc/lib/onboard/forward-cleanup.tssrc/lib/onboard/forward-start.ts
Selective E2E Results — ✅ All requested jobs passedRun: 26327142805
|
…tests PR #4072 review fixes. - `forward-start.ts`: `buildDetachedForwardStartSpawn` now returns a synchronous spawn-error when `spawn` came back with `child.pid == null`. Without this the swallowed `error` listener would let the caller wait the full 180s deadline for a child that never actually ran. `child.unref()` only runs once the pid is known. - `forward-start.ts`: clear `lastFetchError` after a successful `fetchForwardList` so a recovered gateway does not leave a stale "openshell forward list failed: …" tail on the eventual timeout diagnostic. - `forward-start.ts` / `forward-start.test.ts`: drop the issue/PR references from the new comments; keep the generic rationale. - `forward-start.test.ts`: cover the SIGTERM paths — timeout SIGTERMs the detached pid, port-conflict diagnostic SIGTERMs the detached pid (driven by writing a real EADDRINUSE line to the captured stderr fd), and a spawn-error outcome (no pid) leaves `process.kill` untouched. Also pin the lastFetchError clearing behaviour. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26327552584
|
…ach-poll-4064 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26329354954
|
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26334005293
|
… in test Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26335995740
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Summary
ensureDashboardForwardinvokedopenshell forward start --backgroundviaspawnSyncwith a 30-second timeout. On hosts using the Docker compatibility gateway (host glibc older than the openshell-gateway requirement), the daemonised forward inherits the parent CLI's stdio, so spawnSync waited on those file descriptors well past 30s, producingspawnSync ... openshell ETIMEDOUTand tripping the rollback path even though the dashboard was healthy inside the sandbox.Switch from spawnSync-with-timeout to a detached spawn plus a poll of
openshell forward list. The CLI's exit code is no longer the success signal — the appearance of a matching(port, sandboxName)entry in the live forward list is.Related Issue
Fixes #4064
Changes
src/lib/onboard/forward-start.ts— new helpers replacerunBackgroundForwardStart*:runDetachedForwardStartWithDiagnostics/runDetachedForwardStartWithPortReleaseRetrieswrite child stdio to a temp file pair, close the host fds, and pollopenshell forward listuntil a match appears or the deadline expires.buildDetachedForwardStartSpawn(argv)preflightsargv[0]withfs.accessSync(_, X_OK), returns an immediatespawn-errorwhen the spawn produces no pid, and registers a no-operrorlistener so a belated async failure cannot crash onboard.buildForwardStartProgressLogger(port)prints a periodic "still waiting" line during the poll.last forward list: …tail. On timeout / port-conflict outcomes the helper sends a best-effort SIGTERM to the detached child to reduce the risk of an orphan registering a forward after rollback.src/lib/onboard/forward-cleanup.ts— newbestEffortForwardStopForSandbox(run, fetch, port, sandbox):openshell forward list(timeoutOPENSHELL_PROBE_TIMEOUT_MS, throws on failure) and returnsstopped/owned-other/no-entry/list-failed.forward stop <port> <sandbox>form; skips entirely whenforward listfails so a port-only fallback cannot kill an unrelated sandbox's forward.src/lib/onboard.ts— wireensureDashboardForwardto the new helpers and passtimeout: OPENSHELL_PROBE_TIMEOUT_MSto the forward-list poll. Entrypoint budget against main is net-negative.src/lib/onboard/forward-start.test.tsandsrc/lib/onboard/forward-cleanup.test.tssuites cover the spawn/poll/SIGTERM paths and the four ownership outcomes (including thelist-failedbranch); existingtest/onboard*.test.tsspawn mocks are updated to capture full argv and emit a matchingforward listentry for the polled port.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests