feat(supervisor): verify warm-start delivery, cold-start silently lost dispatches #3918

Open
myftija wants to merge 3 commits into main from tri-10659-warm-start-delivery-verification
@myftija

@myftija myftija commented Jun 12, 2026

Collaborator

Problem

Firestarter's didWarmStart: true means the response was written to a long-poll socket — not that the runner received it. A silently dead poller (no FIN, e.g. a VM torn down mid-poll) leaves the dispatched run stuck in PENDING_EXECUTING until the run engine's heartbeat redrive, and each redrive burns a queue redelivery toward TASK_RUN_DEQUEUED_MAX_RETRIES.

Change

After a warm-start hit, the supervisor retains the DequeuedMessage (TimerWheel, default 10s), then probes the existing getLatestSnapshot API. If the run is still on the exact dequeued snapshot, no runner ever acted — it falls through to the regular cold-create path. Recovery: ~10s + cold start, no new APIs, no CLI changes.

  • Double-start safe: startRunAttempt runs under a per-run lock and 409s stale snapshot ids, so a reviving runner and the fallback workload can't both execute; the loser exits before running anything.
  • Probe errors → do nothing: healthy runners legitimately act late during platform brownouts (nested attempt-start retries), so falling back on uncertainty would stampede duplicates. The heartbeat redrive stays as the backstop (also covers supervisor restarts dropping timers).
  • Off by default: TRIGGER_WARM_START_VERIFY_ENABLED (+ TRIGGER_WARM_START_VERIFY_DELAY_MS, 1–60s, default 10s). Disabled = complete no-op. Works for all workload managers (compute/k8s/docker) since it hooks the shared dequeue path.
  • Emits warmstart.verify wide events (outcome: delivered | fallback | probe_error), making the silent-loss rate directly measurable.
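The decision rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `ProbeResult`, `decideOutcome`, `verifyDelivery`, and the injected `probeLatestSnapshot` and `coldCreate` callbacks are hypothetical names, and only the three outcome labels come from the description.

```typescript
// Hypothetical shapes for illustration; not the supervisor's real types.
type VerifyOutcome = "delivered" | "fallback" | "probe_error";

type ProbeResult =
  | { ok: true; latestSnapshotId: string }
  | { ok: false; error: string };

// Pure decision step: compare the snapshot id seen at dequeue time with the
// latest snapshot id reported by the platform.
function decideOutcome(dequeuedSnapshotId: string, probe: ProbeResult): VerifyOutcome {
  if (!probe.ok) return "probe_error"; // uncertainty: do nothing, keep the backstop
  return probe.latestSnapshotId === dequeuedSnapshotId
    ? "fallback" // still on the dequeued snapshot: no runner acted
    : "delivered"; // snapshot moved on: a runner picked the run up
}

// Wires the decision to an injected probe and an injected cold-create callback.
async function verifyDelivery(
  dequeuedSnapshotId: string,
  probeLatestSnapshot: () => Promise<string>,
  coldCreate: () => Promise<void>
): Promise<VerifyOutcome> {
  const probe: ProbeResult = await probeLatestSnapshot().then(
    (latestSnapshotId): ProbeResult => ({ ok: true, latestSnapshotId }),
    (err): ProbeResult => ({ ok: false, error: String(err) })
  );
  const outcome = decideOutcome(dequeuedSnapshotId, probe);
  if (outcome === "fallback") await coldCreate();
  return outcome;
}
```

Keeping the comparison in a pure function makes the "probe errors do nothing" rule easy to test in isolation from timers and network calls.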

@changeset-bot

changeset-bot Bot commented Jun 12, 2026


⚠️ No Changeset found

Latest commit: 58cef9a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

Walkthrough

This pull request adds an opt-in warm-start delivery verification feature to the supervisor. The feature validates whether warm-start dispatches reached runners and automatically falls back to cold-start workload creation if delivery is not confirmed within a configurable delay window. Configuration is gated by TRIGGER_WARM_START_VERIFY_ENABLED (default false) with a configurable probe delay between 1 and 60 seconds (default 10 seconds). The new WarmStartVerificationService uses timer-wheel scheduling and limits concurrent snapshot probes to 10. Integration into the supervisor includes conditional service initialization, scheduling verification on successful warm-start, cancellation when a run connects, and graceful shutdown ordering. A createWorkload helper was extracted to centralize cold-create validation and logging.
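As a rough sketch of the configuration described in the walkthrough, the two environment variables could be parsed like this. The function name, the `"true"`-only boolean parsing, and the choice to reject out-of-range delays rather than clamp them are assumptions; only the variable names, the default, and the 1 to 60 second range come from the PR.

```typescript
interface WarmStartVerifyConfig {
  enabled: boolean;
  delayMs: number;
}

// Hypothetical parser: takes an env-like record so it can be tested without
// touching the real process environment.
function parseWarmStartVerifyConfig(
  env: Record<string, string | undefined>
): WarmStartVerifyConfig {
  // Assumption: only the literal string "true" enables the feature.
  const enabled = env.TRIGGER_WARM_START_VERIFY_ENABLED === "true";

  const raw = env.TRIGGER_WARM_START_VERIFY_DELAY_MS;
  const delayMs = raw === undefined ? 10_000 : Number(raw); // default 10s

  // Assumption: values outside 1s..60s are rejected instead of clamped.
  if (!Number.isInteger(delayMs) || delayMs < 1_000 || delayMs > 60_000) {
    throw new Error(
      `TRIGGER_WARM_START_VERIFY_DELAY_MS must be an integer between 1000 and 60000, got ${raw}`
    );
  }
  return { enabled, delayMs };
}
```

With an empty environment this yields the disabled default, which matches the "off by default" behaviour the PR describes.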

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | PR description is comprehensive and detailed, covering problem, solution, implementation details, and safety guarantees. However, it does not follow the repository's required template structure (missing Closes #issue, checklist, Testing, Changelog sections). | Reorganize the description to match the template: add 'Closes #' at top, include the required checklist with items marked, add Testing section describing test coverage, and add Changelog section with brief summary. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main feature: verifying warm-start delivery and implementing cold-start fallback for silently lost dispatches. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 1 potential issue.

Comment on lines +576 to +588
  recordPhaseSince("workload_create", createStart, undefined);

  // Disabled for now
  // this.resourceMonitor.blockResources({
  //   cpu: message.run.machine.cpu,
  //   memory: message.run.machine.memory,
  // });
} catch (error) {
  recordPhaseSince(
    "workload_create",
    createStart,
    error instanceof Error ? error : new Error(String(error))
  );

@devin-ai-integration devin-ai-integration Bot Jun 12, 2026

Contributor

📝 Info: recordPhaseSince is a silent no-op when called from the verification fallback path

When createWorkload is invoked from the verification service's timer callback (the fallback cold-create path), it runs outside any runWideEvent context. The recordPhaseSince calls at apps/supervisor/src/index.ts:576 and apps/supervisor/src/index.ts:584-588 use fromContext() which returns null in this case, making recordPhase a no-op (apps/supervisor/src/wideEvents/record.ts:27). This means the workload_create phase timing data is silently dropped for fallback cold-creates. The separate emitOneShot call with outcome "fallback" in the verification service (warmStartVerificationService.ts:140) does capture the event, so there is observability — just not at the same phase-level granularity as the normal dequeue path.

Collaborator Author

The condition you flagged holds: recordPhaseSince -> recordPhase guards with if (!state) return (apps/supervisor/src/wideEvents/record.ts:27), and fromContext() returns null outside the ALS scope (wideEvents/context.ts:12) - so from the verifier's timer path it's a clean no-op, no throw, no corrupted phase data. The fallback path's observability instead comes from the warmstart.verify wide event plus the runId-attributed error log in createWorkload's catch.
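The no-op behaviour discussed in this thread can be reproduced with a small sketch built on Node's `AsyncLocalStorage`. The `WideEventState` shape, the simplified `runWideEvent` helper, and the boolean return value of `recordPhase` are assumptions made so the effect is observable; the real functions in `wideEvents/` are not shown in this PR page and may differ.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical wide-event state; the real structure is not shown in the PR.
interface WideEventState {
  phases: Record<string, number>;
}

const als = new AsyncLocalStorage<WideEventState>();

// Mirrors the described behaviour: null when no wide-event scope is active.
function fromContext(): WideEventState | null {
  return als.getStore() ?? null;
}

// Returns whether the phase was recorded (an assumption made for visibility).
function recordPhase(name: string, durationMs: number): boolean {
  const state = fromContext();
  if (!state) return false; // outside the ALS scope: silent no-op, no throw
  state.phases[name] = durationMs;
  return true;
}

// Simplified stand-in for a wide-event scope around a unit of work.
function runWideEvent<T>(fn: () => T): { result: T; state: WideEventState } {
  const state: WideEventState = { phases: {} };
  const result = als.run(state, fn);
  return { result, state };
}

// Inside the scope the phase is recorded; a bare call outside it is dropped.
const scoped = runWideEvent(() => recordPhase("workload_create", 12));
const unscoped = recordPhase("workload_create", 34);
```

Here `scoped.result` is `true` and `unscoped` is `false`, which is the same split as the normal dequeue path versus the verifier's timer path.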

myftija added 3 commits June 12, 2026 16:11
Firestarter's didWarmStart: true means the response was written to a
socket, not that the runner received it. A silently dead poller (no FIN,
e.g. a VM torn down mid-poll) leaves the dispatched run stuck in
PENDING_EXECUTING until the run engine's heartbeat redrive minutes
later, burning a queue redelivery toward TASK_RUN_DEQUEUED_MAX_RETRIES
each time.

After a warm-start hit the supervisor now retains the DequeuedMessage,
waits TRIGGER_WARM_START_VERIFY_DELAY_MS (default 10s), then asks the
platform for the run's latest snapshot. If it is still the exact
snapshot that was dequeued, no runner ever started the attempt - the
run falls through to the regular cold-create path. Double-starts are
prevented by the engine: startRunAttempt runs under a per-run lock and
rejects stale snapshot ids, so a reviving runner and the fallback
workload can't both execute. On probe errors nothing happens - during
platform brownouts healthy runners legitimately act late, and falling
back on uncertainty would stampede duplicates; the heartbeat redrive
stays as the backstop.

Off by default; enable with TRIGGER_WARM_START_VERIFY_ENABLED. When
disabled the code path is a no-op. Emits warmstart.verify wide events
(outcome: delivered / fallback / probe_error). Resolves TRI-10659.

Review follow-ups: the workload-create error log now carries the run id
(fallback creates run outside the dequeue wide event, so the log was the
only attribution), and the verifier stops before the workload server and
session so its timer can't cold-create a workload mid-shutdown.
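The double-start guard mentioned in the first commit message (a per-run lock plus rejection of stale snapshot ids) can be illustrated with a toy in-memory model. `InMemoryRunEngine` and its methods are hypothetical stand-ins, not the run engine's API, and a synchronous method body takes the place of the real per-run lock.

```typescript
// Toy model: tracks the latest snapshot id per run.
class InMemoryRunEngine {
  private latest = new Map<string, string>();
  private counter = 0;

  // Dequeuing creates the snapshot that a runner must present to start.
  dequeue(runId: string): string {
    const snapshotId = `snap_${++this.counter}`;
    this.latest.set(runId, snapshotId);
    return snapshotId;
  }

  // The check-and-advance below runs without interleaving, which stands in
  // for "under a per-run lock". A stale snapshot id gets a 409.
  startRunAttempt(
    runId: string,
    snapshotId: string
  ): { status: 200 | 409; snapshotId?: string } {
    if (this.latest.get(runId) !== snapshotId) return { status: 409 };
    const next = `snap_${++this.counter}`;
    this.latest.set(runId, next);
    return { status: 200, snapshotId: next };
  }
}

// A reviving warm runner and the fallback workload both hold the dequeued
// snapshot id; whichever calls second is rejected.
const engine = new InMemoryRunEngine();
const dequeued = engine.dequeue("run_1");
const first = engine.startRunAttempt("run_1", dequeued);
const second = engine.startRunAttempt("run_1", dequeued);
```

In this model `first.status` is 200 and `second.status` is 409, so only one of the two contenders ever begins executing.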
@myftija myftija force-pushed the tri-10659-warm-start-delivery-verification branch from b6c35ac to 58cef9a Compare June 12, 2026 14:11
@pkg-pr-new

pkg-pr-new Bot commented Jun 12, 2026


@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@58cef9a

trigger.dev

npm i https://pkg.pr.new/trigger.dev@58cef9a

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@58cef9a

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@58cef9a

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@58cef9a

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@58cef9a

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@58cef9a

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@58cef9a

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@58cef9a

commit: 58cef9a
