fix(runner): reduce restore recovery time and deprecated runner false positives #2523

nicktrn · 2025-09-18T15:11:02Z

Other changes:

supervisor: use lightweight mock route when debug logs are disabled

changeset-bot · 2025-09-18T15:11:07Z

🦋 Changeset detected

Latest commit: 9716038

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 23 packages

Name	Type
trigger.dev	Patch
@trigger.dev/core	Patch
d3-chat	Patch
references-d3-openai-agents	Patch
references-nextjs-realtime	Patch
@trigger.dev/build	Patch
@trigger.dev/python	Patch
@trigger.dev/redis-worker	Patch
@trigger.dev/schema-to-json	Patch
@trigger.dev/sdk	Patch
@internal/cache	Patch
@internal/clickhouse	Patch
@internal/redis	Patch
@internal/replication	Patch
@internal/run-engine	Patch
@internal/schedule-engine	Patch
@internal/testcontainers	Patch
@internal/tracing	Patch
@internal/zod-worker	Patch
@trigger.dev/react-hooks	Patch
@trigger.dev/rsc	Patch
@trigger.dev/database	Patch
@trigger.dev/otlp-importer	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

coderabbitai · 2025-09-18T15:11:12Z

Walkthrough

The PR adds a changeset for patch releases. In the supervisor app, the debug-logs route is gated by an env flag with two route variants (real vs. mock) and minor server construction refactor. In CLI v3, controller reconnection logic is revised to act on runner/supervisor changes and re-subscribe selectively; automatic resubscribe on connect is removed. Execution gains connection-error awareness, retries on restore, and expands processEnvOverrides to return runnerIdChanged/supervisorChanged. Snapshot manager adds de-duplication for deprecated detections and new tests (duplicated block noted). Core workload HTTP client centralizes connection-error detection and adds timeouts to heartbeat/snapshots.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–75 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description Check	⚠️ Warning	The PR description is incomplete and does not follow the repository template: it only notes a single "Other changes" line about the supervisor mock route and omits the required "Closes #" line, the checklist, Testing, Changelog, and Screenshots sections, and it fails to summarize other significant edits present in the diff such as CLI restore/recovery behavior, the processEnvOverrides signature change, snapshot de-duplication and tests, and workload client connection-error handling.	Please update the PR description to use the repository template: add "Closes #" (or N/A), complete the checklist, provide concrete testing steps and results, include a short changelog summarizing all major changes (CLI restore/reconnect and signature change, snapshot de-duplication and tests, workload client connection-error handling, supervisor mock route, and the changeset file), and attach any relevant test logs or screenshots.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title "fix(runner): reduce restore recovery time and deprecated runner false positives" is concise, follows the conventional prefix, and accurately captures the primary intent of the changeset (faster restore/reconnect handling and fixes to avoid deprecated-runner false positives as reflected in the CLI, snapshot, and related modules), so it appropriately summarizes the main change for a reviewer scanning history.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix/restore-detection

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

packages/cli-v3/src/entryPoints/managed/controller.ts (1)

504-510: WorkloadHttpClient updates OK; add guarded re-subscribe on socket reconnect

WorkloadHttpClient supports dynamic updates (updateApiUrl/updateRunnerId) and execution.processEnvOverrides already calls them — no need to recreate the client or ManagedRunLogger. (packages/cli-v3/src/entryPoints/managed/execution.ts ≈ lines 974–978; ManagedRunLogger holds the httpClient reference in packages/cli-v3/src/entryPoints/managed/logger.ts).

createSupervisorSocket's "connect" handler only logs and does not re-subscribe. Add a guarded re-subscribe on connect when an execution is active (e.g. if this.runFriendlyId && this.snapshotFriendlyId then call this.subscribeToRunNotifications(...)) to handle servers that don’t retain subscriptions across reconnects. (packages/cli-v3/src/entryPoints/managed/controller.ts — inside createSupervisorSocket socket.on("connect")).

🧹 Nitpick comments (7)

packages/cli-v3/src/entryPoints/managed/snapshot.ts (2)

52-55: Use a Set for O(1) de‑dup and simpler pruning.

Array scans are fine at size 50, but a Set is cleaner and future‑proof.

Apply:

-  // Track seen deprecated snapshots to prevent false positives
-  private seenDeprecatedSnapshotIds: string[] = [];
-  private readonly maxSeenDeprecatedSnapshotIds = 50;
+  // Track seen deprecated snapshots to prevent false positives
+  private seenDeprecatedSnapshotIds = new Set<string>();
+  private readonly maxSeenDeprecatedSnapshotIds = 50;

291-297: Align filtering and pruning with Set semantics.

Switch membership checks/additions to Set and prune in insertion order.

-        const deprecatedSnapshots = previousSnapshots.filter((snap) => {
-          const isDeprecated = deprecatedStatus.includes(snap.snapshot.executionStatus);
-          const previouslySeen = this.seenDeprecatedSnapshotIds.some(
-            (s) => s === snap.snapshot.friendlyId
-          );
-          return isDeprecated && !previouslySeen;
-        });
+        const deprecatedSnapshots = previousSnapshots.filter((snap) => {
+          const isDeprecated = deprecatedStatus.includes(snap.snapshot.executionStatus);
+          const previouslySeen = this.seenDeprecatedSnapshotIds.has(snap.snapshot.friendlyId);
+          return isDeprecated && !previouslySeen;
+        });
@@
-          // Add the deprecated snapshot IDs to the seen list
-          this.seenDeprecatedSnapshotIds.push(
-            ...deprecatedSnapshots.map((s) => s.snapshot.friendlyId)
-          );
-
-          if (this.seenDeprecatedSnapshotIds.length > this.maxSeenDeprecatedSnapshotIds) {
-            // Only keep the latest maxSeenDeprecatedSnapshotIds
-            this.seenDeprecatedSnapshotIds = this.seenDeprecatedSnapshotIds.slice(
-              -this.maxSeenDeprecatedSnapshotIds
-            );
-          }
+          // Add the deprecated snapshot IDs to the seen set
+          for (const id of deprecatedSnapshots.map((s) => s.snapshot.friendlyId)) {
+            this.seenDeprecatedSnapshotIds.add(id);
+          }
+          // Prune to most recent N (Set keeps insertion order)
+          while (this.seenDeprecatedSnapshotIds.size > this.maxSeenDeprecatedSnapshotIds) {
+            const first = this.seenDeprecatedSnapshotIds.values().next().value as string | undefined;
+            if (!first) break;
+            this.seenDeprecatedSnapshotIds.delete(first);
+          }

Also applies to: 310-320

packages/core/src/v3/runEngineWorker/workload/http.ts (1)

55-69: Extend the wrapper to start/complete as well and add timeouts for parity.

Start/complete failures from network issues should also surface isConnectionError; adding timeouts keeps all calls bounded.

Outside the selected lines, consider:

   async startRunAttempt(
     runId: string,
     snapshotId: string,
     body: WorkloadRunAttemptStartRequestBody
   ) {
-    return wrapZodFetch(
+    return this.withConnectionErrorDetection(() =>
+      wrapZodFetch(
         WorkloadRunAttemptStartResponseBody,
         `${this.apiUrl}/api/v1/workload-actions/runs/${runId}/snapshots/${snapshotId}/attempts/start`,
         {
           method: "POST",
           headers: {
             ...this.defaultHeaders(),
             },
           body: JSON.stringify(body),
+          signal: AbortSignal.timeout(10_000),
         }
-    );
+      )
+    );
   }
@@
   async completeRunAttempt(
     runId: string,
     snapshotId: string,
     body: WorkloadRunAttemptCompleteRequestBody
   ) {
-    return wrapZodFetch(
+    return this.withConnectionErrorDetection(() =>
+      wrapZodFetch(
         WorkloadRunAttemptCompleteResponseBody,
         `${this.apiUrl}/api/v1/workload-actions/runs/${runId}/snapshots/${snapshotId}/attempts/complete`,
         {
           method: "POST",
           headers: {
             ...this.defaultHeaders(),
           },
           body: JSON.stringify(body),
+          signal: AbortSignal.timeout(10_000),
         }
-    );
+      )
+    );
   }

packages/cli-v3/src/entryPoints/managed/execution.ts (2)

918-990: Good: env overrides now signal runner/supervisor changes. Consider immediate re‑poll toggle.

You already support shouldPollForSnapshotChanges; wire it from callers that refresh due to errors to reduce recovery latency.

1234-1243: Trigger an immediate re‑poll after refreshing metadata on snapshot fetch errors.

This shortens recovery when URLs/runner IDs change.
-      // Always trigger metadata refresh on snapshot fetch errors
-      await this.processEnvOverrides("snapshots since error");
+      // Always trigger metadata refresh on snapshot fetch errors, then re-poll once
+      await this.processEnvOverrides("snapshots since error", true);

packages/cli-v3/src/entryPoints/managed/controller.ts (1)

536-553: Current execution is still bound to the old socket after supervisor change

You create a new socket but never rebind currentExecution’s run:notify listeners to it. After removeAllListeners() on the old socket, notifications won’t reach the in-flight execution, which can delay restore recovery and other reactions.

Add a rebind call here and implement it in RunExecution:

           // Create a new socket with the updated URL and headers
           this.socket = this.createSupervisorSocket();

+          // Rebind the in-flight execution to the new socket so run:notify continues to flow
+          this.currentExecution?.rebindSocket(this.socket);

           // Re-subscribe to notifications if we have an active execution
           if (this.runFriendlyId && this.snapshotFriendlyId) {
             this.subscribeToRunNotifications(this.runFriendlyId, this.snapshotFriendlyId);
           }

And in RunExecution (example, adjust to your structure):

// packages/cli-v3/src/entryPoints/managed/execution.ts
export class RunExecution {
  private supervisorSocket: SupervisorSocket;
  // Ensure the notify handler is a stable reference
  private readonly onRunNotify = (payload: unknown) => this.handleRunNotify(payload);

  public rebindSocket(socket: SupervisorSocket) {
    // Detach from old socket if present
    this.supervisorSocket?.off?.("run:notify", this.onRunNotify);
    // Attach to new socket
    this.supervisorSocket = socket;
    this.supervisorSocket.on("run:notify", this.onRunNotify);
  }

  private handleRunNotify(payload: unknown) {
    // existing logic…
  }
}

packages/cli-v3/src/entryPoints/managed/snapshot.test.ts (1)

701-758: Test looks good; add call-count assertions to harden against regressions

Tighten expectations so we catch extra/deferred handler calls and verify a single metadata lookup.

     // First call should be deprecated=false (restore detected)
     expect(onSnapshotChange).toHaveBeenCalledWith(
       expect.objectContaining({ snapshot: expect.objectContaining({ friendlyId: "snapshot-2" }) }),
       false
     );
+    expect(onSnapshotChange).toHaveBeenCalledTimes(1);
+    expect(mockMetadataClient.getEnvOverrides).toHaveBeenCalledTimes(1);

     onSnapshotChange.mockClear();

@@
     // Should call onSnapshotChange with deprecated = false (no new deprecated snapshots)
     expect(onSnapshotChange).toHaveBeenCalledWith(
       expect.objectContaining({ snapshot: expect.objectContaining({ friendlyId: "snapshot-3" }) }),
       false
     );
+    expect(onSnapshotChange).toHaveBeenCalledTimes(1);

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3ef6ea and 9716038.

📒 Files selected for processing (7)

.changeset/six-cougars-play.md (1 hunks)
apps/supervisor/src/workloadServer/index.ts (2 hunks)
packages/cli-v3/src/entryPoints/managed/controller.ts (2 hunks)
packages/cli-v3/src/entryPoints/managed/execution.ts (6 hunks)
packages/cli-v3/src/entryPoints/managed/snapshot.test.ts (1 hunks)
packages/cli-v3/src/entryPoints/managed/snapshot.ts (3 hunks)
packages/core/src/v3/runEngineWorker/workload/http.ts (3 hunks)

🧰 Additional context used

📓 Path-based instructions (4)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

packages/cli-v3/src/entryPoints/managed/snapshot.ts
packages/cli-v3/src/entryPoints/managed/snapshot.test.ts
packages/cli-v3/src/entryPoints/managed/execution.ts
apps/supervisor/src/workloadServer/index.ts
packages/cli-v3/src/entryPoints/managed/controller.ts
packages/core/src/v3/runEngineWorker/workload/http.ts

**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Our tests are all vitest

Files:

packages/cli-v3/src/entryPoints/managed/snapshot.test.ts

**/*.{test,spec}.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.{test,spec}.{ts,tsx,js,jsx}: Unit tests must use Vitest
Tests should avoid mocks or stubs and use helpers from @internal/testcontainers when Redis or Postgres are needed
Test files live beside the files under test and should use descriptive describe and it blocks

Files:

packages/cli-v3/src/entryPoints/managed/snapshot.test.ts

{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

We use zod a lot in packages/core and in the webapp

Files:

packages/core/src/v3/runEngineWorker/workload/http.ts

🧬 Code graph analysis (4)

packages/cli-v3/src/entryPoints/managed/snapshot.test.ts (1)

packages/cli-v3/src/entryPoints/managed/snapshot.ts (1)

SnapshotManager (36-416)

packages/cli-v3/src/entryPoints/managed/execution.ts (1)

packages/cli-v3/src/entryPoints/managed/overrides.ts (1)

Metadata (1-13)

apps/supervisor/src/workloadServer/index.ts (3)

packages/core/src/v3/serverOnly/httpServer.ts (1)

HttpServer (64-379)

apps/supervisor/src/env.ts (1)

env (118-118)

packages/core/src/v3/runEngineWorker/workload/schemas.ts (2)

WorkloadDebugLogRequestBody (61-61)

WorkloadDebugLogRequestBody (62-62)

packages/core/src/v3/runEngineWorker/workload/http.ts (1)

packages/core/src/v3/apiClient/core.ts (1)

wrapZodFetch (727-767)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)

GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
GitHub Check: typecheck / typecheck
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
GitHub Check: Analyze (javascript-typescript)

🔇 Additional comments (7)

.changeset/six-cougars-play.md (1)

1-6: Changeset looks correct for a patch release.

Packages and summary align with the code changes.

apps/supervisor/src/workloadServer/index.ts (1)

389-395: Lightweight mock route is a good trade‑off.

Keeps the same path with minimal overhead when debug logs are disabled.

packages/core/src/v3/runEngineWorker/workload/http.ts (3)

92-107: Heartbeat wrapper + timeout: LGTM.

Consistent shape with isConnectionError and a sane 10s timeout.

123-135: Continue wrapper: LGTM.

Good to surface connection errors here.

175-187: Snapshots since wrapper + timeout: LGTM.

Matches heartbeat behavior and enables upstream recovery.

packages/cli-v3/src/entryPoints/managed/execution.ts (2)

875-893: Restore retry on connection error: LGTM.

Clear, single retry after metadata refresh.

1014-1019: Heartbeat connection‑error refresh: LGTM.

Keeps metadata in sync after transient failures.

apps/supervisor/src/workloadServer/index.ts

packages/core/src/v3/runEngineWorker/workload/http.ts

nicktrn added 5 commits September 17, 2025 19:51

fix(runner): improve restore detection

aad7840

chore(supervisor): skip schema parsing when debug logs disabled

358d338

fix(runner): deprecation race condition

7c58b3a

add changeset

ea66896

Merge remote-tracking branch 'origin/main' into fix/restore-detection

9716038

coderabbitai bot reviewed Sep 18, 2025

View reviewed changes

apps/supervisor/src/workloadServer/index.ts Show resolved Hide resolved

packages/core/src/v3/runEngineWorker/workload/http.ts Show resolved Hide resolved

ericallam approved these changes Sep 18, 2025

View reviewed changes

nicktrn merged commit db87295 into main Sep 18, 2025
32 checks passed

nicktrn deleted the fix/restore-detection branch September 18, 2025 15:59

github-actions bot mentioned this pull request Sep 18, 2025

chore: Update version for release #2486

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix(runner): reduce restore recovery time and deprecated runner false positives #2523

fix(runner): reduce restore recovery time and deprecated runner false positives #2523

Uh oh!

nicktrn commented Sep 18, 2025

Uh oh!

changeset-bot bot commented Sep 18, 2025

Uh oh!

coderabbitai bot commented Sep 18, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

fix(runner): reduce restore recovery time and deprecated runner false positives #2523

fix(runner): reduce restore recovery time and deprecated runner false positives #2523

Uh oh!

Conversation

nicktrn commented Sep 18, 2025

Uh oh!

changeset-bot bot commented Sep 18, 2025

🦋 Changeset detected

Uh oh!

coderabbitai bot commented Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coderabbitai bot commented Sep 18, 2025 •

edited

Loading