-
-
Notifications
You must be signed in to change notification settings - Fork 911
fix(runner): reduce restore recovery time and deprecated runner false positives #2523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🦋 Changeset detectedLatest commit: 9716038 The changes in this PR will be included in the next version bump. This PR includes changesets to release 23 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
WalkthroughThe PR adds a changeset for patch releases. In the supervisor app, the debug-logs route is gated by an env flag with two route variants (real vs. mock) and minor server construction refactor. In CLI v3, controller reconnection logic is revised to act on runner/supervisor changes and re-subscribe selectively; automatic resubscribe on connect is removed. Execution gains connection-error awareness, retries on restore, and expands processEnvOverrides to return runnerIdChanged/supervisorChanged. Snapshot manager adds de-duplication for deprecated detections and new tests (duplicated block noted). Core workload HTTP client centralizes connection-error detection and adds timeouts to heartbeat/snapshots. Estimated code review effort🎯 4 (Complex) | ⏱️ ~60–75 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/cli-v3/src/entryPoints/managed/controller.ts (1)
504-510: WorkloadHttpClient updates OK; add guarded re-subscribe on socket reconnect
- WorkloadHttpClient supports dynamic updates (updateApiUrl/updateRunnerId) and execution.processEnvOverrides already calls them — no need to recreate the client or ManagedRunLogger. (packages/cli-v3/src/entryPoints/managed/execution.ts ≈ lines 974–978; ManagedRunLogger holds the httpClient reference in packages/cli-v3/src/entryPoints/managed/logger.ts).
- createSupervisorSocket's "connect" handler only logs and does not re-subscribe. Add a guarded re-subscribe on connect when an execution is active (e.g. if this.runFriendlyId && this.snapshotFriendlyId then call this.subscribeToRunNotifications(...)) to handle servers that don’t retain subscriptions across reconnects. (packages/cli-v3/src/entryPoints/managed/controller.ts — inside createSupervisorSocket socket.on("connect")).
🧹 Nitpick comments (7)
packages/cli-v3/src/entryPoints/managed/snapshot.ts (2)
52-55: Use a Set for O(1) de‑dup and simpler pruning.Array scans are fine at size 50, but a Set is cleaner and future‑proof.
Apply:
- // Track seen deprecated snapshots to prevent false positives - private seenDeprecatedSnapshotIds: string[] = []; - private readonly maxSeenDeprecatedSnapshotIds = 50; + // Track seen deprecated snapshots to prevent false positives + private seenDeprecatedSnapshotIds = new Set<string>(); + private readonly maxSeenDeprecatedSnapshotIds = 50;
291-297: Align filtering and pruning with Set semantics.Switch membership checks/additions to Set and prune in insertion order.
- const deprecatedSnapshots = previousSnapshots.filter((snap) => { - const isDeprecated = deprecatedStatus.includes(snap.snapshot.executionStatus); - const previouslySeen = this.seenDeprecatedSnapshotIds.some( - (s) => s === snap.snapshot.friendlyId - ); - return isDeprecated && !previouslySeen; - }); + const deprecatedSnapshots = previousSnapshots.filter((snap) => { + const isDeprecated = deprecatedStatus.includes(snap.snapshot.executionStatus); + const previouslySeen = this.seenDeprecatedSnapshotIds.has(snap.snapshot.friendlyId); + return isDeprecated && !previouslySeen; + }); @@ - // Add the deprecated snapshot IDs to the seen list - this.seenDeprecatedSnapshotIds.push( - ...deprecatedSnapshots.map((s) => s.snapshot.friendlyId) - ); - - if (this.seenDeprecatedSnapshotIds.length > this.maxSeenDeprecatedSnapshotIds) { - // Only keep the latest maxSeenDeprecatedSnapshotIds - this.seenDeprecatedSnapshotIds = this.seenDeprecatedSnapshotIds.slice( - -this.maxSeenDeprecatedSnapshotIds - ); - } + // Add the deprecated snapshot IDs to the seen set + for (const id of deprecatedSnapshots.map((s) => s.snapshot.friendlyId)) { + this.seenDeprecatedSnapshotIds.add(id); + } + // Prune to most recent N (Set keeps insertion order) + while (this.seenDeprecatedSnapshotIds.size > this.maxSeenDeprecatedSnapshotIds) { + const first = this.seenDeprecatedSnapshotIds.values().next().value as string | undefined; + if (!first) break; + this.seenDeprecatedSnapshotIds.delete(first); + }Also applies to: 310-320
packages/core/src/v3/runEngineWorker/workload/http.ts (1)
55-69: Extend the wrapper to start/complete as well and add timeouts for parity.Start/complete failures from network issues should also surface isConnectionError; adding timeouts keeps all calls bounded.
Outside the selected lines, consider:
async startRunAttempt( runId: string, snapshotId: string, body: WorkloadRunAttemptStartRequestBody ) { - return wrapZodFetch( + return this.withConnectionErrorDetection(() => + wrapZodFetch( WorkloadRunAttemptStartResponseBody, `${this.apiUrl}/api/v1/workload-actions/runs/${runId}/snapshots/${snapshotId}/attempts/start`, { method: "POST", headers: { ...this.defaultHeaders(), }, body: JSON.stringify(body), + signal: AbortSignal.timeout(10_000), } - ); + ) + ); } @@ async completeRunAttempt( runId: string, snapshotId: string, body: WorkloadRunAttemptCompleteRequestBody ) { - return wrapZodFetch( + return this.withConnectionErrorDetection(() => + wrapZodFetch( WorkloadRunAttemptCompleteResponseBody, `${this.apiUrl}/api/v1/workload-actions/runs/${runId}/snapshots/${snapshotId}/attempts/complete`, { method: "POST", headers: { ...this.defaultHeaders(), }, body: JSON.stringify(body), + signal: AbortSignal.timeout(10_000), } - ); + ) + ); }packages/cli-v3/src/entryPoints/managed/execution.ts (2)
918-990: Good: env overrides now signal runner/supervisor changes. Consider immediate re‑poll toggle.You already support shouldPollForSnapshotChanges; wire it from callers that refresh due to errors to reduce recovery latency.
1234-1243: Trigger an immediate re‑poll after refreshing metadata on snapshot fetch errors.This shortens recovery when URLs/runner IDs change.
- // Always trigger metadata refresh on snapshot fetch errors - await this.processEnvOverrides("snapshots since error"); + // Always trigger metadata refresh on snapshot fetch errors, then re-poll once + await this.processEnvOverrides("snapshots since error", true);packages/cli-v3/src/entryPoints/managed/controller.ts (1)
536-553: Current execution is still bound to the old socket after supervisor changeYou create a new socket but never rebind
currentExecution’srun:notifylisteners to it. AfterremoveAllListeners()on the old socket, notifications won’t reach the in-flight execution, which can delay restore recovery and other reactions.Add a rebind call here and implement it in
RunExecution:// Create a new socket with the updated URL and headers this.socket = this.createSupervisorSocket(); + // Rebind the in-flight execution to the new socket so run:notify continues to flow + this.currentExecution?.rebindSocket(this.socket); // Re-subscribe to notifications if we have an active execution if (this.runFriendlyId && this.snapshotFriendlyId) { this.subscribeToRunNotifications(this.runFriendlyId, this.snapshotFriendlyId); }And in RunExecution (example, adjust to your structure):
// packages/cli-v3/src/entryPoints/managed/execution.ts export class RunExecution { private supervisorSocket: SupervisorSocket; // Ensure the notify handler is a stable reference private readonly onRunNotify = (payload: unknown) => this.handleRunNotify(payload); public rebindSocket(socket: SupervisorSocket) { // Detach from old socket if present this.supervisorSocket?.off?.("run:notify", this.onRunNotify); // Attach to new socket this.supervisorSocket = socket; this.supervisorSocket.on("run:notify", this.onRunNotify); } private handleRunNotify(payload: unknown) { // existing logic… } }packages/cli-v3/src/entryPoints/managed/snapshot.test.ts (1)
701-758: Test looks good; add call-count assertions to harden against regressionsTighten expectations so we catch extra/deferred handler calls and verify a single metadata lookup.
// First call should be deprecated=false (restore detected) expect(onSnapshotChange).toHaveBeenCalledWith( expect.objectContaining({ snapshot: expect.objectContaining({ friendlyId: "snapshot-2" }) }), false ); + expect(onSnapshotChange).toHaveBeenCalledTimes(1); + expect(mockMetadataClient.getEnvOverrides).toHaveBeenCalledTimes(1); onSnapshotChange.mockClear(); @@ // Should call onSnapshotChange with deprecated = false (no new deprecated snapshots) expect(onSnapshotChange).toHaveBeenCalledWith( expect.objectContaining({ snapshot: expect.objectContaining({ friendlyId: "snapshot-3" }) }), false ); + expect(onSnapshotChange).toHaveBeenCalledTimes(1);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
.changeset/six-cougars-play.md(1 hunks)apps/supervisor/src/workloadServer/index.ts(2 hunks)packages/cli-v3/src/entryPoints/managed/controller.ts(2 hunks)packages/cli-v3/src/entryPoints/managed/execution.ts(6 hunks)packages/cli-v3/src/entryPoints/managed/snapshot.test.ts(1 hunks)packages/cli-v3/src/entryPoints/managed/snapshot.ts(3 hunks)packages/core/src/v3/runEngineWorker/workload/http.ts(3 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations
Files:
packages/cli-v3/src/entryPoints/managed/snapshot.tspackages/cli-v3/src/entryPoints/managed/snapshot.test.tspackages/cli-v3/src/entryPoints/managed/execution.tsapps/supervisor/src/workloadServer/index.tspackages/cli-v3/src/entryPoints/managed/controller.tspackages/core/src/v3/runEngineWorker/workload/http.ts
**/*.test.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Our tests are all vitest
Files:
packages/cli-v3/src/entryPoints/managed/snapshot.test.ts
**/*.{test,spec}.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{test,spec}.{ts,tsx,js,jsx}: Unit tests must use Vitest
Tests should avoid mocks or stubs and use helpers from @internal/testcontainers when Redis or Postgres are needed
Test files live beside the files under test and should use descriptive describe and it blocks
Files:
packages/cli-v3/src/entryPoints/managed/snapshot.test.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
We use zod a lot in packages/core and in the webapp
Files:
packages/core/src/v3/runEngineWorker/workload/http.ts
🧬 Code graph analysis (4)
packages/cli-v3/src/entryPoints/managed/snapshot.test.ts (1)
packages/cli-v3/src/entryPoints/managed/snapshot.ts (1)
SnapshotManager(36-416)
packages/cli-v3/src/entryPoints/managed/execution.ts (1)
packages/cli-v3/src/entryPoints/managed/overrides.ts (1)
Metadata(1-13)
apps/supervisor/src/workloadServer/index.ts (3)
packages/core/src/v3/serverOnly/httpServer.ts (1)
HttpServer(64-379)apps/supervisor/src/env.ts (1)
env(118-118)packages/core/src/v3/runEngineWorker/workload/schemas.ts (2)
WorkloadDebugLogRequestBody(61-61)WorkloadDebugLogRequestBody(62-62)
packages/core/src/v3/runEngineWorker/workload/http.ts (1)
packages/core/src/v3/apiClient/core.ts (1)
wrapZodFetch(727-767)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: typecheck / typecheck
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (7)
.changeset/six-cougars-play.md (1)
1-6: Changeset looks correct for a patch release.Packages and summary align with the code changes.
apps/supervisor/src/workloadServer/index.ts (1)
389-395: Lightweight mock route is a good trade‑off.Keeps the same path with minimal overhead when debug logs are disabled.
packages/core/src/v3/runEngineWorker/workload/http.ts (3)
92-107: Heartbeat wrapper + timeout: LGTM.Consistent shape with isConnectionError and a sane 10s timeout.
123-135: Continue wrapper: LGTM.Good to surface connection errors here.
175-187: Snapshots since wrapper + timeout: LGTM.Matches heartbeat behavior and enables upstream recovery.
packages/cli-v3/src/entryPoints/managed/execution.ts (2)
875-893: Restore retry on connection error: LGTM.Clear, single retry after metadata refresh.
1014-1019: Heartbeat connection‑error refresh: LGTM.Keeps metadata in sync after transient failures.
Other changes: