feat(cron): add build status reconciliation cron job #1641
Conversation
Adds a cron job that runs every 15 minutes to reconcile stale build_requests rows by querying the builder API directly. For builds with a builder_job_id that have been stale for >5 minutes: fetches real status from the builder, updates status and last_error, and records build time for completed builds. For builds without a builder_job_id that are older than 1 hour: marks them as failed since they were never submitted to the builder. This ensures the dashboard shows accurate build state even when the CLI disconnects before calling /build/status (e.g., network issues).
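As a rough sketch of the two cases described above (helper names and the row shape are illustrative assumptions, not the actual implementation):

```ts
interface StaleBuild {
  id: string
  builder_job_id: string | null
  created_at: string // ISO timestamp
}

// Hypothetical helpers standing in for the real handler logic.
declare function reconcileFromBuilder(b: StaleBuild): Promise<void>
declare function markOrphanFailed(b: StaleBuild): Promise<void>

const ORPHAN_CUTOFF = new Date(Date.now() - 60 * 60_000).toISOString() // older than 1 hour

async function reconcile(staleBuilds: StaleBuild[]) {
  // Rows are assumed pre-filtered to those stale for more than 5 minutes.
  for (const build of staleBuilds) {
    if (build.builder_job_id)
      await reconcileFromBuilder(build) // case 1: fetch real status from the builder
    else if (build.created_at < ORPHAN_CUTOFF)
      await markOrphanFailed(build) // case 2: never submitted, mark as failed
  }
}
```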
📝 Walkthrough

Adds a new Supabase Edge Function HTTP POST endpoint that reconciles stale non-terminal `build_requests` rows.
Sequence Diagram

```mermaid
sequenceDiagram
participant Scheduler as Cron Scheduler
participant EdgeFunc as Edge Function (Hono app)
participant DB as Supabase DB
participant BuilderAPI as External Builder API
Scheduler->>EdgeFunc: POST /cron_reconcile_build_status
EdgeFunc->>DB: Query stale non-terminal build_requests (batch)
DB-->>EdgeFunc: Return build batch
alt builds found
loop for each build
alt builder_job_id present
EdgeFunc->>BuilderAPI: GET /jobs/:id/status
BuilderAPI-->>EdgeFunc: Return job status
EdgeFunc->>DB: UPDATE build_request with status, error, timestamps
else orphaned (no builder_job_id)
EdgeFunc->>DB: UPDATE build_request -> mark failed/orphaned
end
alt build reached terminal status with timestamps
EdgeFunc->>DB: INSERT/RECORD build duration metrics
end
end
end
    EdgeFunc-->>Scheduler: Respond with summary (processed, reconciled, orphaned, errors)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@supabase/functions/_backend/triggers/cron_reconcile_build_status.ts`:
- Around line 119-121: The code silently maps unknown build.platform values to
'ios' via resolvedPlatform; instead, detect invalid platform values before using
resolvedPlatform (check build.platform strictly equals 'ios' or 'android'), and
if invalid (null/empty/other) log a warning including the build identifier and
the unexpected platform value and skip calling recordBuildTime for that build;
adjust the code paths around resolvedPlatform and recordBuildTime so
recordBuildTime is only invoked for valid 'ios' or 'android' platforms.
- Around line 33-34: The code reads BUILDER_URL and BUILDER_API_KEY via getEnv
into builderUrl and builderApiKey but doesn't validate them; add an early check
after retrieving these values (using the builderUrl and builderApiKey symbols)
to verify they are non-empty and a valid URL (for builderUrl) and non-empty for
builderApiKey, and if invalid either throw/return an error or log a clear
message and exit so downstream fetch calls are never made with empty/malformed
values; use getEnv as the source and fail fast with a descriptive error
mentioning the missing/invalid env var.
In `@supabase/functions/triggers/index.ts`:
- Around line 4-5: Reorder the two imports so they follow alphabetical order:
import cron_email before cron_reconcile_build_status; update the import
statements that currently reference app as cron_reconcile_build_status and app
as cron_email so that the line importing app as cron_email appears first to
satisfy the perfectionist/sort-imports ESLint rule.
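Hedged sketches for the two prompts on the first file above; `getEnv` semantics, `cloudlogErr` availability, and the exact call sites are assumptions:

```ts
// Fail fast on missing or malformed builder env vars (second prompt).
const builderUrl = getEnv(c, 'BUILDER_URL')
const builderApiKey = getEnv(c, 'BUILDER_API_KEY')
if (!builderApiKey)
  throw new Error('Missing env var BUILDER_API_KEY')
try {
  // new URL() throws a TypeError on an empty or malformed URL.
  new URL(builderUrl)
}
catch {
  throw new Error(`Invalid or missing env var BUILDER_URL: "${builderUrl}"`)
}

// Only record build time for known platforms (first prompt).
if (build.platform === 'ios' || build.platform === 'android') {
  await recordBuildTime(c, build.owner_org, build.requested_by, build.builder_job_id!, build.platform, buildTimeSeconds)
}
else {
  cloudlogErr({ requestId: c.get('requestId'), message: 'Unexpected platform, skipping recordBuildTime', buildId: build.id, platform: build.platform })
}
```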
🧹 Nitpick comments (3)
supabase/functions/_backend/triggers/cron_reconcile_build_status.ts (3)
82-85: Missing timeout on external `fetch` call.

If the builder API is slow or unresponsive, this `fetch` will hang indefinitely, potentially causing the cron function to time out or block the queue. Consider adding an `AbortSignal.timeout()` or similar mechanism.

Proposed fix

```diff
-const response = await fetch(`${builderUrl}/jobs/${build.builder_job_id}`, {
-  method: 'GET',
-  headers: { 'x-api-key': builderApiKey },
-})
+const response = await fetch(`${builderUrl}/jobs/${build.builder_job_id}`, {
+  method: 'GET',
+  headers: { 'x-api-key': builderApiKey },
+  signal: AbortSignal.timeout(10_000),
+})
```
93-94: Unsafe cast of builder API response with no validation.

`response.json() as BuilderStatusResponse` does not validate the shape. If the builder returns an unexpected payload (e.g., missing `job` field), accessing `builderJob.job.status` will throw a runtime error. The outer `catch` block handles it, but the error message won't be descriptive. Consider adding a minimal guard:

```ts
const builderJob = await response.json() as BuilderStatusResponse
if (!builderJob?.job?.status) {
  cloudlogErr({ requestId: c.get('requestId'), message: 'Invalid builder response shape', buildId: build.id })
  errors++
  continue
}
```
44-47: Query failure returns `BRES` (success) — intentional?

When the query fails, the handler logs the error but returns `{ status: 'ok' }`. This means the cron system won't retry or flag the failure. If this is intentional (to prevent cascading retries), consider at least including the error signal in the response or a metric so it's observable.
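One hedged way to keep the 200 response while making the failure observable (the `queryFailed` field is illustrative, not an existing convention):

```ts
// Hypothetical: still return 200 so the cron isn't retried, but surface the failure.
if (dbError) {
  cloudlogErr({ requestId: c.get('requestId'), message: 'Stale build query failed', error: dbError.message })
  return c.json({ status: 'ok', queryFailed: true, error: dbError.message })
}
```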
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@supabase/functions/_backend/triggers/cron_reconcile_build_status.ts`:
- Around line 98-108: The code writes an unvalidated builder API status
(builderJob.job.status stored in jobStatus) directly into the
build_requests.status column; instead map or validate jobStatus against your
known set of allowed statuses (e.g., "pending", "in_progress", "completed",
"failed") before calling supabase.from('build_requests').update, and convert
unknown values to a safe fallback (or reject/update to a normalized enum) while
also saving builderJob.job.error to last_error; update the logic around
builderJob/jobStatus to perform this mapping/validation prior to the DB update
so only allowed status values are written.
- Around line 86-96: The fetch to `${builderUrl}/jobs/${build.builder_job_id}`
can hang; modify the request inside the try where `fetch` is called to use an
AbortSignal timeout (e.g., `AbortSignal.timeout(5000)`) and pass the resulting
`signal` option to `fetch`; ensure the catch block detects an abort/timeout (and
logs via `cloudlogErr` including `c.get('requestId')`, `build.id`,
`build.builder_job_id`) and increments `errors` and `continue`s just like other
failures so a single slow request won't stall the loop.
- Line 24: Replace direct construction of the Hono app with the shared
initializer: change the app initialization from "new
Hono<MiddlewareKeyVariables>()" to calling
"createHono<MiddlewareKeyVariables>()" and update imports to import createHono
from "utils/hono.ts" (remove or keep Hono import only if still needed). This
ensures middleware wired by createHono (e.g., requestId via c.get('requestId'))
is present for the exported "app".
```ts
const ORPHAN_THRESHOLD_HOURS = 1
const BATCH_LIMIT = 50

export const app = new Hono<MiddlewareKeyVariables>()
```
🛠️ Refactor suggestion | 🟠 Major
Use createHono() instead of new Hono<MiddlewareKeyVariables>().
The coding guidelines require using createHono from utils/hono.ts for all Hono app initialization. This likely wires in shared middleware (e.g., request ID generation) that you rely on via c.get('requestId').
Proposed fix

```diff
-import { BRES, middlewareAPISecret } from '../utils/hono.ts'
+import { BRES, createHono, middlewareAPISecret } from '../utils/hono.ts'
 ...
-export const app = new Hono<MiddlewareKeyVariables>()
+export const app = createHono()
```

As per coding guidelines: "Use createHono from utils/hono.ts for all Hono framework application initialization and routing."
🤖 Prompt for AI Agents
In `@supabase/functions/_backend/triggers/cron_reconcile_build_status.ts` at line
24, Replace direct construction of the Hono app with the shared initializer:
change the app initialization from "new Hono<MiddlewareKeyVariables>()" to
calling "createHono<MiddlewareKeyVariables>()" and update imports to import
createHono from "utils/hono.ts" (remove or keep Hono import only if still
needed). This ensures middleware wired by createHono (e.g., requestId via
c.get('requestId')) is present for the exported "app".
```ts
try {
  const response = await fetch(`${builderUrl}/jobs/${build.builder_job_id}`, {
    method: 'GET',
    headers: { 'x-api-key': builderApiKey },
  })

  if (!response.ok) {
    cloudlogErr({ requestId: c.get('requestId'), message: 'Builder status fetch failed', buildId: build.id, jobId: build.builder_job_id, status: response.status })
    errors++
    continue
  }
```
External fetch call has no timeout — risk of the cron job hanging indefinitely.
If the builder API is slow or unresponsive, this fetch will block without limit. Since this runs inside a loop over up to 50 builds, a single hung request can stall the entire cron invocation (and potentially hit the function execution time limit silently).
Consider adding an AbortSignal.timeout:
Proposed fix

```diff
 const response = await fetch(`${builderUrl}/jobs/${build.builder_job_id}`, {
   method: 'GET',
   headers: { 'x-api-key': builderApiKey },
+  signal: AbortSignal.timeout(10_000),
 })
```

🤖 Prompt for AI Agents
In `@supabase/functions/_backend/triggers/cron_reconcile_build_status.ts` around
lines 86 - 96, The fetch to `${builderUrl}/jobs/${build.builder_job_id}` can
hang; modify the request inside the try where `fetch` is called to use an
AbortSignal timeout (e.g., `AbortSignal.timeout(5000)`) and pass the resulting
`signal` option to `fetch`; ensure the catch block detects an abort/timeout (and
logs via `cloudlogErr` including `c.get('requestId')`, `build.id`,
`build.builder_job_id`) and increments `errors` and `continue`s just like other
failures so a single slow request won't stall the loop.
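A sketch of that abort-aware error path, assuming it sits inside the existing per-build loop; the 5-second timeout and the error-shape check are assumptions:

```ts
try {
  const response = await fetch(`${builderUrl}/jobs/${build.builder_job_id}`, {
    method: 'GET',
    headers: { 'x-api-key': builderApiKey },
    signal: AbortSignal.timeout(5_000), // abort slow requests after 5 s
  })
  // ... handle response as in the existing code
}
catch (err) {
  // AbortSignal.timeout() rejects fetch with a DOMException named 'TimeoutError'.
  const timedOut = err instanceof DOMException && err.name === 'TimeoutError'
  cloudlogErr({
    requestId: c.get('requestId'),
    message: timedOut ? 'Builder status fetch timed out' : 'Builder status fetch threw',
    buildId: build.id,
    jobId: build.builder_job_id,
  })
  errors++
  continue // move on so one slow request doesn't stall the batch
}
```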
```ts
const builderJob = await response.json() as BuilderStatusResponse
const jobStatus = builderJob.job.status

const { error: updateError } = await supabase
  .from('build_requests')
  .update({
    status: jobStatus,
    last_error: builderJob.job.error || null,
    updated_at: new Date().toISOString(),
  })
  .eq('id', build.id)
```
Unbounded external status value written directly to the database.
jobStatus (Line 99) is whatever string the builder API returns, and it's written directly into build_requests.status (Line 104). If the builder returns an unexpected status value (e.g., "queued", "running", or a future new status), it will be stored as-is. If the status column has a CHECK constraint or enum, the update will fail (handled). If not, it could introduce values that other parts of the system don't expect.
Consider validating or mapping jobStatus to your known set of statuses before writing.
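A minimal mapping sketch; the allowed set is taken from the AI prompt below and the fallback choice is an assumption:

```ts
// Hypothetical allow-list; align with the real build_requests.status enum.
const ALLOWED_STATUSES = new Set(['pending', 'in_progress', 'completed', 'failed'])

function normalizeJobStatus(raw: string): string {
  // Write a safe fallback instead of persisting unknown builder statuses as-is.
  return ALLOWED_STATUSES.has(raw) ? raw : 'failed'
}

const safeStatus = normalizeJobStatus(jobStatus) // use safeStatus in the .update() call
```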
🤖 Prompt for AI Agents
In `@supabase/functions/_backend/triggers/cron_reconcile_build_status.ts` around
lines 98 - 108, The code writes an unvalidated builder API status
(builderJob.job.status stored in jobStatus) directly into the
build_requests.status column; instead map or validate jobStatus against your
known set of allowed statuses (e.g., "pending", "in_progress", "completed",
"failed") before calling supabase.from('build_requests').update, and convert
unknown values to a safe fallback (or reject/update to a normalized enum) while
also saving builderJob.job.error to last_error; update the logic around
builderJob/jobStatus to perform this mapping/validation prior to the DB update
so only allowed status values are written.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be09942df3
```ts
TERMINAL_STATUSES.has(jobStatus)
  && builderJob.job.started_at
  && builderJob.job.completed_at
```
Restrict build-time billing to billable terminal statuses
The reconciliation path currently records build time for any status in TERMINAL_STATUSES, so jobs that end as cancelled, expired, or released will be billed whenever they include timestamps. In contrast, the normal /public/build/status.ts flow only calls recordBuildTime for succeeded/failed, so this introduces inconsistent and potentially inflated billing depending on which path updates the row first.
Pull request overview
This PR adds automated reconciliation for stale build statuses in the build_requests table. It implements a cron-based solution to handle cases where the CLI disconnects before reporting final build status, ensuring dashboard accuracy and billing correctness.
Changes:
- New migration creates a pgmq queue and configures two cron tasks: a producer that triggers every 15 minutes and a consumer that processes the queue
- New trigger handler `cron_reconcile_build_status.ts` fetches stale builds, queries the builder API for actual status, updates the database, and records build time for billing
- Trigger registration in `index.ts` exposes the new endpoint
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| supabase/migrations/20260216102420_add_build_status_reconciliation_cron.sql | Creates pgmq queue and two cron_tasks entries (producer every 15m, consumer every 1m) |
| supabase/functions/_backend/triggers/cron_reconcile_build_status.ts | Implements reconciliation logic: fetches stale builds from DB, queries builder API, updates status, marks orphans as failed, records build time |
| supabase/functions/triggers/index.ts | Registers the cron_reconcile_build_status route |
```sql
'reconcile_build_status_queue',
'Process build status reconciliation queue',
'function_queue',
'["cron_reconcile_build_status"]',
null,
null,
1,
```
The PR description states the consumer runs "every 5m" but the minute_interval is set to 1, which means it will actually run every 1 minute. This is inconsistent with the description and may cause unexpected resource usage. If 5 minutes was intentional, change minute_interval to 5. If 1 minute was intentional, update the PR description to match.
```ts
throw new Error(updateError.message)

if (
  TERMINAL_STATUSES.has(jobStatus)
```
The recordBuildTime function should only be called for 'succeeded' or 'failed' statuses, not for all terminal statuses. The existing pattern in supabase/functions/_backend/public/build/status.ts:128 explicitly checks for these two statuses before calling recordBuildTime. Other terminal statuses like 'expired', 'released', or 'cancelled' may not have valid started_at/completed_at times or should not be billed. Add an explicit status check: if ((jobStatus === 'succeeded' || jobStatus === 'failed') && builderJob.job.started_at && builderJob.job.completed_at)
```diff
-  TERMINAL_STATUSES.has(jobStatus)
+  (jobStatus === 'succeeded' || jobStatus === 'failed')
```
```ts
await recordBuildTime(
  c,
  build.owner_org,
  build.requested_by,
  build.builder_job_id!,
  build.platform,
  buildTimeSeconds,
)
```
Missing logging before recordBuildTime call. The existing pattern in supabase/functions/_backend/public/build/status.ts:129-136 includes a cloudlog call before recordBuildTime to track when build time is being recorded. This is helpful for debugging billing issues and understanding cron job behavior. Add a similar log here with buildId, orgId, platform, and buildTimeSeconds.
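A sketch of such a log call, assuming `cloudlog` takes the same object shape as `cloudlogErr` elsewhere in this file (field names are illustrative):

```ts
// Hypothetical log before billing, mirroring the status.ts pattern.
cloudlog({
  requestId: c.get('requestId'),
  message: 'Recording build time from reconciliation cron',
  buildId: build.id,
  orgId: build.owner_org,
  platform: build.platform,
  buildTimeSeconds,
})
```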



Summary (AI generated)

- Reconciles stale `build_requests` status by querying the builder API directly
- Adds `cron_reconcile_build_status.ts`

Motivation (AI generated)
Build status in `build_requests` can become stale when:

- the CLI disconnects before calling `/build/status` (network issues, user closes terminal, etc.)

Without this cron, these builds remain stuck at "pending" forever in the dashboard, even though the builder has long since finished them.
The cron handles two cases:

- Builds with a `builder_job_id` stale for >5 minutes: fetches real status from the builder API, updates `status` and `last_error`, and records build time for billing
- Builds without a `builder_job_id` older than 1 hour: marks as failed (never submitted to builder)

Companion PRs:
- `supabaseAdmin` for status writes
- `/build/status` after WebSocket delivers terminal status

Business Impact (AI generated)
- Records build time via `recordBuildTime` for completed builds that the CLI missed

Test Plan (AI generated)
- Run `bunx supabase db reset` locally to verify migration applies cleanly (pgmq queue + cron_tasks entries created)
- Verify `reconcile_build_status` (queue, 15m) and `reconcile_build_status_queue` (function_queue, 5m) entries
- Insert a `build_requests` row with stale `updated_at` and a mock `builder_job_id`, verify the handler fetches from builder and updates the row
- Insert a `build_requests` row without `builder_job_id` and `created_at` >1h ago, verify it gets marked as failed

Generated with AI