feat(supervisor): verify warm-start delivery, cold-start silently lost dispatches#3918
feat(supervisor): verify warm-start delivery, cold-start silently lost dispatches#3918myftija wants to merge 3 commits into
Conversation
|
WalkthroughThis pull request adds an opt-in warm-start delivery verification feature to the supervisor. The feature validates whether warm-start dispatches reached runners and automatically falls back to cold-start workload creation if delivery is not confirmed within a configurable delay window. Configuration is gated by 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| recordPhaseSince("workload_create", createStart, undefined); | ||
|
|
||
| // Disabled for now | ||
| // this.resourceMonitor.blockResources({ | ||
| // cpu: message.run.machine.cpu, | ||
| // memory: message.run.machine.memory, | ||
| // }); | ||
| } catch (error) { | ||
| recordPhaseSince( | ||
| "workload_create", | ||
| createStart, | ||
| error instanceof Error ? error : new Error(String(error)) | ||
| ); |
There was a problem hiding this comment.
📝 Info: recordPhaseSince is a silent no-op when called from the verification fallback path
When createWorkload is invoked from the verification service's timer callback (the fallback cold-create path), it runs outside any runWideEvent context. The recordPhaseSince calls at apps/supervisor/src/index.ts:576 and apps/supervisor/src/index.ts:584-588 use fromContext() which returns null in this case, making recordPhase a no-op (apps/supervisor/src/wideEvents/record.ts:27). This means the workload_create phase timing data is silently dropped for fallback cold-creates. The separate emitOneShot call with outcome "fallback" in the verification service (warmStartVerificationService.ts:140) does capture the event, so there is observability — just not at the same phase-level granularity as the normal dequeue path.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
The condition you flagged holds: recordPhaseSince -> recordPhase guards with if (!state) return (apps/supervisor/src/wideEvents/record.ts:27), and fromContext() returns null outside the ALS scope (wideEvents/context.ts:12) - so from the verifier's timer path it's a clean no-op, no throw, no corrupted phase data. The fallback path's observability instead comes from the warmstart.verify wide event plus the runId-attributed error log in createWorkload's catch.
Firestarter's didWarmStart: true means the response was written to a socket, not that the runner received it. A silently dead poller (no FIN, e.g. a VM torn down mid-poll) leaves the dispatched run stuck in PENDING_EXECUTING until the run engine's heartbeat redrive minutes later, burning a queue redelivery toward TASK_RUN_DEQUEUED_MAX_RETRIES each time. After a warm-start hit the supervisor now retains the DequeuedMessage, waits TRIGGER_WARM_START_VERIFY_DELAY_MS (default 10s), then asks the platform for the run's latest snapshot. If it is still the exact snapshot that was dequeued, no runner ever started the attempt - the run falls through to the regular cold-create path. Double-starts are prevented by the engine: startRunAttempt runs under a per-run lock and rejects stale snapshot ids, so a reviving runner and the fallback workload can't both execute. On probe errors nothing happens - during platform brownouts healthy runners legitimately act late, and falling back on uncertainty would stampede duplicates; the heartbeat redrive stays as the backstop. Off by default; enable with TRIGGER_WARM_START_VERIFY_ENABLED. When disabled the code path is a no-op. Emits warmstart.verify wide events (outcome: delivered / fallback / probe_error). Resolves TRI-10659.
Review follow-ups: the workload-create error log now carries the run id (fallback creates run outside the dequeue wide event, so the log was the only attribution), and the verifier stops before the workload server and session so its timer can't cold-create a workload mid-shutdown.
b6c35ac to
58cef9a
Compare
@trigger.dev/build
trigger.dev
@trigger.dev/core
@trigger.dev/python
@trigger.dev/react-hooks
@trigger.dev/redis-worker
@trigger.dev/rsc
@trigger.dev/schema-to-json
@trigger.dev/sdk
commit: |
Problem
Firestarter's
didWarmStart: truemeans the response was written to a long-poll socket — not that the runner received it. A silently dead poller (no FIN, e.g. a VM torn down mid-poll) leaves the dispatched run stuck inPENDING_EXECUTINGuntil the run engine's heartbeat redrive, and each redrive burns a queue redelivery towardTASK_RUN_DEQUEUED_MAX_RETRIES.Change
After a warm-start hit, the supervisor retains the
DequeuedMessage(TimerWheel, default 10s), then probes the existinggetLatestSnapshotAPI. If the run is still on the exact dequeued snapshot, no runner ever acted — it falls through to the regular cold-create path. Recovery: ~10s + cold start, no new APIs, no CLI changes.startRunAttemptruns under a per-run lock and 409s stale snapshot ids, so a reviving runner and the fallback workload can't both execute; the loser exits before running anything.TRIGGER_WARM_START_VERIFY_ENABLED(+TRIGGER_WARM_START_VERIFY_DELAY_MS, 1–60s, default 10s). Disabled = complete no-op. Works for all workload managers (compute/k8s/docker) since it hooks the shared dequeue path.warmstart.verifywide events (outcome: delivered | fallback | probe_error), making the silent-loss rate directly measurable.