feat(supervisor): verify warm-start delivery, cold-start silently lost dispatches by myftija · Pull Request #3918 · triggerdotdev/trigger.dev

myftija · 2026-06-12T11:14:32Z

Problem

Firestarter's didWarmStart: true means the response was written to a long-poll socket — not that the runner received it. A silently dead poller (no FIN, e.g. a VM torn down mid-poll) leaves the dispatched run stuck in PENDING_EXECUTING until the run engine's heartbeat redrive, and each redrive burns a queue redelivery toward TASK_RUN_DEQUEUED_MAX_RETRIES.

Change

After a warm-start hit, the supervisor retains the DequeuedMessage (TimerWheel, default 10s), then probes the existing getLatestSnapshot API. If the run is still on the exact dequeued snapshot, no runner ever acted — it falls through to the regular cold-create path. Recovery: ~10s + cold start, no new APIs, no CLI changes.

Double-start safe: startRunAttempt runs under a per-run lock and 409s stale snapshot ids, so a reviving runner and the fallback workload can't both execute; the loser exits before running anything.
Probe errors → do nothing: healthy runners legitimately act late during platform brownouts (nested attempt-start retries), so falling back on uncertainty would stampede duplicates. The heartbeat redrive stays as the backstop (also covers supervisor restarts dropping timers).
Off by default: TRIGGER_WARM_START_VERIFY_ENABLED (+ TRIGGER_WARM_START_VERIFY_DELAY_MS, 1–60s, default 10s). Disabled = complete no-op. Works for all workload managers (compute/k8s/docker) since it hooks the shared dequeue path.
Emits warmstart.verify wide events (outcome: delivered | fallback | probe_error), making the silent-loss rate directly measurable.
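The probe decision described above can be sketched as a pure function. This is a minimal illustration, not the PR's code: `ProbeResult` and `classifyProbe` are hypothetical names, and only the three outcome strings and the "still on the exact dequeued snapshot" rule come from the description.

```typescript
// Outcomes named in the PR description for the warmstart.verify wide event.
type VerifyOutcome = "delivered" | "fallback" | "probe_error";

// Hypothetical shape for the result of probing the latest snapshot.
type ProbeResult =
  | { kind: "snapshot"; snapshotId: string }
  | { kind: "error"; message: string };

// Decide what to do once the verification delay has elapsed.
function classifyProbe(dequeuedSnapshotId: string, probe: ProbeResult): VerifyOutcome {
  if (probe.kind === "error") {
    // Uncertainty: do nothing and leave recovery to the heartbeat redrive.
    return "probe_error";
  }
  // Same snapshot as at dequeue time means no runner ever acted on the run.
  return probe.snapshotId === dequeuedSnapshotId ? "fallback" : "delivered";
}

// Only a confirmed non-delivery triggers the cold-create path.
function shouldColdCreate(outcome: VerifyOutcome): boolean {
  return outcome === "fallback";
}
```

Keeping `probe_error` distinct from `fallback` mirrors the stated design choice: an unanswered probe is not evidence that the dispatch was lost.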

changeset-bot · 2026-06-12T11:14:42Z

⚠️ No Changeset found

Latest commit: 58cef9a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


coderabbitai · 2026-06-12T11:14:48Z

Walkthrough

This pull request adds an opt-in warm-start delivery verification feature to the supervisor. The feature validates whether warm-start dispatches reached runners and automatically falls back to cold-start workload creation if delivery is not confirmed within a configurable delay window. Configuration is gated by TRIGGER_WARM_START_VERIFY_ENABLED (default false) with a configurable probe delay between 1 and 60 seconds (default 10 seconds). The new WarmStartVerificationService uses timer-wheel scheduling and limits concurrent snapshot probes to 10. Integration into the supervisor includes conditional service initialization, scheduling verification on successful warm-start, cancellation when a run connects, and graceful shutdown ordering. A createWorkload helper was extracted to centralize cold-create validation and logging.
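As an illustration of the configuration surface summarized above, the following sketch parses the two environment variables. The variable names, the default of 10 seconds, and the 1 to 60 second range come from the PR; the function name, the `"true"` string convention, and rejecting (rather than clamping) out-of-range values are assumptions, since the supervisor's actual env schema is not shown here.

```typescript
interface WarmStartVerifyConfig {
  enabled: boolean;
  delayMs: number;
}

const DEFAULT_DELAY_MS = 10_000;
const MIN_DELAY_MS = 1_000;
const MAX_DELAY_MS = 60_000;

// Hypothetical parser; takes the env as a plain record so it is easy to test.
function parseWarmStartVerifyConfig(
  env: Record<string, string | undefined>
): WarmStartVerifyConfig {
  // Assumption: the feature is on only when the flag is exactly "true".
  const enabled = env["TRIGGER_WARM_START_VERIFY_ENABLED"] === "true";

  const raw = env["TRIGGER_WARM_START_VERIFY_DELAY_MS"];
  const delayMs = raw === undefined ? DEFAULT_DELAY_MS : Number(raw);

  // Assumption: invalid or out-of-range delays are rejected at startup.
  if (!Number.isInteger(delayMs) || delayMs < MIN_DELAY_MS || delayMs > MAX_DELAY_MS) {
    throw new Error(
      `TRIGGER_WARM_START_VERIFY_DELAY_MS must be an integer between ${MIN_DELAY_MS} and ${MAX_DELAY_MS}`
    );
  }

  return { enabled, delayMs };
}
```

With an empty environment this yields `{ enabled: false, delayMs: 10000 }`, matching the off-by-default behavior.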

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | PR description is comprehensive and detailed, covering problem, solution, implementation details, and safety guarantees. However, it does not follow the repository's required template structure (missing Closes `#issue`, checklist, Testing, Changelog sections). | Reorganize the description to match the template: add 'Closes #' at top, include the required checklist with items marked, add Testing section describing test coverage, and add Changelog section with brief summary. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main feature: verifying warm-start delivery and implementing cold-start fallback for silently lost dispatches. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

devin-ai-integration

Devin Review found 1 potential issue.

devin-ai-integration · 2026-06-12T11:18:47Z

+      recordPhaseSince("workload_create", createStart, undefined);
+
+      // Disabled for now
+      // this.resourceMonitor.blockResources({
+      //   cpu: message.run.machine.cpu,
+      //   memory: message.run.machine.memory,
+      // });
+    } catch (error) {
+      recordPhaseSince(
+        "workload_create",
+        createStart,
+        error instanceof Error ? error : new Error(String(error))
+      );


📝 Info: recordPhaseSince is a silent no-op when called from the verification fallback path

When createWorkload is invoked from the verification service's timer callback (the fallback cold-create path), it runs outside any runWideEvent context. The recordPhaseSince calls at apps/supervisor/src/index.ts:576 and apps/supervisor/src/index.ts:584-588 use fromContext() which returns null in this case, making recordPhase a no-op (apps/supervisor/src/wideEvents/record.ts:27). This means the workload_create phase timing data is silently dropped for fallback cold-creates. The separate emitOneShot call with outcome "fallback" in the verification service (warmStartVerificationService.ts:140) does capture the event, so there is observability — just not at the same phase-level granularity as the normal dequeue path.


The condition you flagged holds: recordPhaseSince -> recordPhase guards with if (!state) return (apps/supervisor/src/wideEvents/record.ts:27), and fromContext() returns null outside the ALS scope (wideEvents/context.ts:12) - so from the verifier's timer path it's a clean no-op, no throw, no corrupted phase data. The fallback path's observability instead comes from the warmstart.verify wide event plus the runId-attributed error log in createWorkload's catch.
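The no-op behavior discussed in this thread can be reproduced with Node's `AsyncLocalStorage`. The sketch below reuses the names `fromContext`, `recordPhase`, and `runWideEvent` from the review, but their signatures and the `WideEventState` shape are assumptions made for illustration; the real implementations live in the supervisor's `wideEvents` module and may differ.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical per-event state: phase name -> duration in milliseconds.
interface WideEventState {
  phases: Record<string, number>;
}

const storage = new AsyncLocalStorage<WideEventState>();

// Returns the active wide-event state, or null outside any runWideEvent scope.
function fromContext(): WideEventState | null {
  return storage.getStore() ?? null;
}

// Records a phase if a wide event is active; otherwise does nothing.
// The boolean return exists only to make the no-op observable in this sketch.
function recordPhase(name: string, durationMs: number): boolean {
  const state = fromContext();
  if (!state) return false; // no active context: clean no-op, no throw
  state.phases[name] = durationMs;
  return true;
}

// Runs fn inside a fresh wide-event scope and returns the collected state.
function runWideEvent<T>(fn: () => T): { result: T; state: WideEventState } {
  const state: WideEventState = { phases: {} };
  const result = storage.run(state, fn);
  return { result, state };
}
```

Calling `recordPhase` directly, as a callback fired by a timer started outside the scope would, returns `false` and records nothing, while the same call inside `runWideEvent` lands in that event's `phases`.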

Firestarter's didWarmStart: true means the response was written to a socket, not that the runner received it. A silently dead poller (no FIN, e.g. a VM torn down mid-poll) leaves the dispatched run stuck in PENDING_EXECUTING until the run engine's heartbeat redrive minutes later, burning a queue redelivery toward TASK_RUN_DEQUEUED_MAX_RETRIES each time.

After a warm-start hit the supervisor now retains the DequeuedMessage, waits TRIGGER_WARM_START_VERIFY_DELAY_MS (default 10s), then asks the platform for the run's latest snapshot. If it is still the exact snapshot that was dequeued, no runner ever started the attempt - the run falls through to the regular cold-create path.

Double-starts are prevented by the engine: startRunAttempt runs under a per-run lock and rejects stale snapshot ids, so a reviving runner and the fallback workload can't both execute. On probe errors nothing happens - during platform brownouts healthy runners legitimately act late, and falling back on uncertainty would stampede duplicates; the heartbeat redrive stays as the backstop.

Off by default; enable with TRIGGER_WARM_START_VERIFY_ENABLED. When disabled the code path is a no-op. Emits warmstart.verify wide events (outcome: delivered / fallback / probe_error). Resolves TRI-10659.
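To see why a stale snapshot id makes the double start safe, consider this in-memory sketch. `RunEngineSketch` and its result types are hypothetical and do not reflect the run engine's real API; the synchronous check-and-advance stands in for the per-run lock, and the 409 status mirrors the rejection described in the PR.

```typescript
type StartResult =
  | { kind: "started"; snapshotId: string }
  | { kind: "conflict"; httpStatus: 409 };

class RunEngineSketch {
  private latestSnapshot = new Map<string, string>();
  private counter = 0;

  // Dequeuing a run creates the snapshot that both contenders will hold.
  dequeue(runId: string): string {
    const snapshotId = `snap_${++this.counter}`;
    this.latestSnapshot.set(runId, snapshotId);
    return snapshotId;
  }

  // Compare-and-advance: only a caller holding the latest snapshot id wins.
  // Being synchronous, this check-and-set cannot interleave with another call,
  // which is the guarantee the real per-run lock provides.
  startRunAttempt(runId: string, snapshotId: string): StartResult {
    if (this.latestSnapshot.get(runId) !== snapshotId) {
      return { kind: "conflict", httpStatus: 409 };
    }
    const next = `snap_${++this.counter}`;
    this.latestSnapshot.set(runId, next);
    return { kind: "started", snapshotId: next };
  }
}
```

Whichever of the reviving runner or the fallback workload calls `startRunAttempt` first advances the snapshot; the other receives a conflict and exits before executing anything.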

Review follow-ups: the workload-create error log now carries the run id (fallback creates run outside the dequeue wide event, so the log was the only attribution), and the verifier stops before the workload server and session so its timer can't cold-create a workload mid-shutdown.

pkg-pr-new · 2026-06-12T14:14:29Z

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@58cef9a

trigger.dev

npm i https://pkg.pr.new/trigger.dev@58cef9a

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@58cef9a

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@58cef9a

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@58cef9a

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@58cef9a

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@58cef9a

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@58cef9a

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@58cef9a

commit: 58cef9a

devin-ai-integration Bot reviewed Jun 12, 2026

myftija added 3 commits June 12, 2026 16:11

chore: add server-changes entry

5ac2f22

myftija force-pushed the tri-10659-warm-start-delivery-verification branch from b6c35ac to 58cef9a Compare June 12, 2026 14:11

myftija wants to merge 3 commits into main from tri-10659-warm-start-delivery-verification
