ci: add preview janitor and recreate control#2930
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
|
TL;DR — Adds a scheduled/manual "Preview Janitor" workflow that deletes stale Railway and Vercel preview environments left behind by missed close events, and adds a Key changes
Summary | 8 files | 3 commits | base: Scheduled janitor for stale Railway preview environments
The janitor supports dry-run mode (default for manual dispatch), an opt-in flag to delete orphaned environments whose PR returns 404, and writes a summary table to
Manual
|
There was a problem hiding this comment.
Low urgency. Solid operational hardening — the janitor, recreate path, SpiceDB retry, and structured logging all address real gaps. Two minor items below; neither is blocking.
Claude Opus | 𝕏
| runtime_var_is_unresolved() { | ||
| local value="${1:-}" | ||
|
|
||
| [ -z "${value}" ] || printf '%s' "${value}" | grep -q '\$[{][{]' | ||
| } |
There was a problem hiding this comment.
Nit (readability): runtime_var_is_unresolved is defined here on line 238 but first referenced inside the resolve_runtime_vars function body starting at line 187. This works because bash evaluates function bodies lazily — resolve_runtime_vars isn't called until line 245 — but the ordering is surprising and fragile. Moving the definition above resolve_runtime_vars() (or into common.sh alongside the other helpers) would make the dependency explicit and prevent a future refactor from breaking call order.
Not blocking — the current code is correct at runtime.
| RAILWAY_PROJECT_ID: ${{ vars.RAILWAY_PROJECT_ID }} | ||
| GH_TOKEN: ${{ github.token }} | ||
| GITHUB_REPOSITORY: ${{ github.repository }} | ||
| DRY_RUN: ${{ github.event_name == 'workflow_dispatch' && inputs.dry_run && 'true' || 'false' }} |
There was a problem hiding this comment.
Confirm intent: On schedule triggers, inputs.dry_run is undefined so this expression evaluates to 'false', meaning scheduled runs execute real deletes. The PR description indicates this is intentional (the janitor is meant to repair missed teardowns automatically). Just flagging for visibility — if you ever want scheduled runs to default to dry-run, this expression would need adjustment.
There was a problem hiding this comment.
Pull request overview
This PR hardens the CI automation around ephemeral preview environments by adding a scheduled/manual “janitor” cleanup path and a manual “recreate preview” path, plus improved logging and retries to make long-running preview steps more diagnosable and resilient.
Changes:
- Add a new scheduled/manual Preview Janitor workflow to delete stale Railway
pr-*environments and stale Vercel branch-scoped env vars. - Add a manual
recreate_previewinput to force teardown + reprovision of a PR’s Railway preview environment. - Improve preview script resilience and diagnosability via structured phase logging, safer teardown, and transient SpiceDB transport retries.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
.github/workflows/preview-janitor.yml |
New scheduled/manual janitor workflow for Railway + Vercel stale preview cleanup. |
.github/workflows/preview-environments.yml |
Adds recreate_preview input plumbed into Railway provisioning. |
.github/scripts/preview/common.sh |
Adds preview_log, wait-loop logging helpers, safer JSON edge handling, and env-deletion wait helper. |
.github/scripts/preview/provision-railway.sh |
Adds “recreate existing env” path and more structured provisioning logs. |
.github/scripts/preview/teardown-railway.sh |
Hardens teardown for missing/disappearing envs and uses shared deletion-wait helper. |
.github/scripts/preview/bootstrap-preview-auth.sh |
Retries transient SpiceDB transport failures and adds more phase logging. |
.github/scripts/preview/cleanup-vercel-stale-env.sh |
Ensures PR lookup is repo-scoped when GITHUB_REPOSITORY is set. |
.github/scripts/preview/cleanup-stale-railway-envs.sh |
New Railway janitor script to enumerate pr-* envs, map to PR state, and delete stale envs with verification. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| status="$( | ||
| curl -sS \ | ||
| --connect-timeout 10 \ | ||
| --max-time 30 \ | ||
| -o "${response_file}" \ | ||
| -w '%{http_code}' \ | ||
| -H "Authorization: Bearer ${GH_TOKEN}" \ | ||
| -H "Accept: application/vnd.github+json" \ | ||
| "https://api.github.com/repos/${GITHUB_REPOSITORY}/pulls/${pr_number}" | ||
| )" |
There was a problem hiding this comment.
The GitHub PR lookup is hard-coded to https://api.github.com/.... GitHub Actions exposes GITHUB_API_URL (and this also matters for GHES); consider building the endpoint from ${GITHUB_API_URL} (fallbacking to api.github.com) to avoid breaking the janitor in non-public-GitHub contexts.
There was a problem hiding this comment.
PR Review Summary
(3) Total Issues | Risk: Medium
🟠⚠️ Major (2) 🟠⚠️
Inline Comments:
- 🟠 Major:
cleanup-stale-railway-envs.sh:59Deletion errors silently ignored — return status not checked - 🟠 Major:
cleanup-stale-railway-envs.sh:100-105Deletion failures not tracked in exit code — janitor may exit 0 despite failed cleanups
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor:
preview-environments.yml:389-390Unused env vars passed to teardown job
💭 Consider (2) 💭
💭 1) provision-railway.sh:238 Function definition ordering
Issue: runtime_var_is_unresolved() is defined after resolve_runtime_vars() which references it.
Why: While this works correctly in bash (functions are resolved at call time, not definition time), the code structure is confusing. If someone reorders the script, it could break.
Fix: Move runtime_var_is_unresolved() definition to before line 167 for clearer declaration-before-use ordering.
💭 2) preview-janitor.yml:56 Scheduled runs delete without dry-run
Issue: The DRY_RUN expression evaluates to false for scheduled runs, meaning the cron job will perform actual deletions.
Why: This differs from the manual dispatch default (dry_run: true). While this appears intentional for automated cleanup, it could surprise operators.
Fix: Add a comment documenting this intentional behavior, or change to default dry-run for scheduled runs if that's preferred.
💡 APPROVE WITH SUGGESTIONS
Summary: This is a solid infrastructure hardening PR that closes real gaps in preview environment lifecycle management. The janitor workflow and recreate path are well-designed. The two Major findings around error tracking in the janitor script should be addressed to ensure operators can trust the exit code — without this, failed cleanups could silently accumulate. The other items are minor polish.
Discarded (3)
| Location | Issue | Reason Discarded |
|---|---|---|
provision-railway.sh:238 |
Function runtime_var_is_unresolved defined after use causes runtime error |
False positive — bash resolves function references at call time, not definition time. The function IS defined before resolve_runtime_vars is called. Reclassified as Consider (code style). |
common.sh:27 |
String equality instead of numeric comparison | Works correctly for numeric strings; purely stylistic preference |
bootstrap-preview-auth.sh:42 |
Retry pattern may miss some transient errors | Low risk — current pattern covers common failure modes; adding more patterns is speculative |
Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-errors |
5 | 0 | 0 | 0 | 2 | 0 | 3 |
pr-review-devops |
8 | 0 | 2 | 0 | 1 | 0 | 5 |
pr-review-standards |
2 | 0 | 0 | 0 | 0 | 0 | 2 |
| Total | 15 | 0 | 2 | 0 | 3 | 0 | 10 |
Note: Standards reviewer findings were duplicates of devops findings (function order, dry-run behavior).
| preview_log "Deleting stale Railway preview environment ${env_name}." | ||
| railway_environment_delete_by_id "${env_id}" >/dev/null | ||
| railway_wait_for_environment_absent "${RAILWAY_PROJECT_ID}" "${env_name}" 10 2 | ||
| } |
There was a problem hiding this comment.
🟠 MAJOR: Deletion errors are silently ignored
Issue: railway_environment_delete_by_id return status is not checked. If deletion fails, the function proceeds to railway_wait_for_environment_absent which will eventually fail with a generic timeout message.
Why: When debugging deletion failures, operators see "Railway environment still exists after waiting for deletion" but won't know why (permission denied, API error, rate limit). The actual Railway API error is discarded.
Fix: Check the return status before proceeding:
delete_env_and_verify() {
local env_id="$1"
local env_name="$2"
preview_log "Deleting stale Railway preview environment ${env_name}."
if ! railway_environment_delete_by_id "${env_id}"; then
echo "Failed to delete Railway environment ${env_name}." >&2
return 1
fi
railway_wait_for_environment_absent "${RAILWAY_PROJECT_ID}" "${env_name}" 10 2
}Refs:
- teardown-railway.sh:27 — similar pattern with explicit error check
| preview_log "[dry-run] Would delete stale Railway preview environment ${env_name}." | ||
| else | ||
| delete_env_and_verify "${env_id}" "${env_name}" | ||
| deleted=$((deleted + 1)) | ||
| deleted_names+=("${env_name}") | ||
| fi |
There was a problem hiding this comment.
🟠 MAJOR: Deletion failures not tracked in exit code
Issue: When delete_env_and_verify fails during the janitor loop, the error is not counted. The final exit code (line 153) only reflects GitHub API lookup errors (errors variable), not Railway deletion failures.
Why: The janitor could fail to delete multiple Railway environments due to API issues, but exit with code 0 if all GitHub PR lookups succeeded. Operators checking CI may incorrectly believe all targeted environments were cleaned up.
Fix: Add a deletion_failures counter:
# Add at line 77 with other counters:
deletion_failures=0
# In the delete blocks (lines 100-105 and 113-118):
if ! delete_env_and_verify "${env_id}" "${env_name}"; then
deletion_failures=$((deletion_failures + 1))
else
deleted=$((deleted + 1))
deleted_names+=("${env_name}")
fi
# Update exit check at line 153:
if [ "${errors}" -gt 0 ] || [ "${deletion_failures}" -gt 0 ]; then
echo "Encountered ${errors} GitHub lookup error(s) and ${deletion_failures} Railway deletion failure(s)." >&2
exit 1
fiThere was a problem hiding this comment.
PR Review Summary
(0) Total Issues | Risk: Low
✅ Prior Feedback Addressed
All issues from the previous review have been resolved in commit 8973aad63:
| Prior Issue | Resolution |
|---|---|
| 🟠 Major: Deletion errors silently ignored | ✅ Fixed — delete_env_and_verify now checks return status and returns 1 on failure (L58-61) |
| 🟠 Major: Deletion failures not tracked in exit code | ✅ Fixed — Added deletion_failures counter, tracked in loop, checked at exit (L82, L107-112, L163-165) |
| 🟡 Minor: Unused env vars passed to teardown | ✅ Fixed — RAILWAY_TEMPLATE_ENVIRONMENT and RAILWAY_OUTPUT_SERVICE removed from teardown job |
| 💭 Consider: Function definition ordering | ✅ Fixed — runtime_var_is_unresolved() moved to L121-125, before resolve_runtime_vars() which uses it |
| 💭 Consider: Scheduled runs delete without dry-run (undocumented) | ✅ Fixed — Comment added at L57 documenting intentional behavior |
| 🔵 Copilot: Hard-coded GitHub API URL | ✅ Fixed — GITHUB_API_URL now configurable with fallback (L16, L32) and passed from workflow (L55) |
Additional Improvements in Delta
The delta also includes GraphQL error handling improvements in common.sh:
railway_environment_create_from_source,railway_environment_delete_by_id, andrailway_variable_collection_upsertnow properly check for GraphQL errors and return failure status- This hardens the entire Railway mutation surface against silent GraphQL-level failures
✅ APPROVE
Summary: All prior review feedback has been addressed comprehensively. The janitor now properly tracks and reports deletion failures, function ordering is cleaner, and the behavior of scheduled vs manual runs is documented. The additional GraphQL error handling improvements further strengthen the reliability of the preview infrastructure. Ship it! 🚀
Reviewers (0)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note: Delta review — no new reviewer dispatches needed. All prior issues resolved.

Summary
Add state-recovery hardening for preview environments so missed close-event teardowns and stale preview state stop accumulating, and add clearer phase logging for the long-running preview steps.
Changes
Preview Janitorworkflow that cleans stale Railway preview environments and can also run the existing Vercel preview env cleanuppr-*environments, maps them back to GitHub PR state, deletes merged/closed PR envs, and verifies deletionrecreate_previewworkflow input so a PR can force teardown and reprovision of its Railway preview environment when reused preview state has driftedWhy
We were still seeing two preview-state gaps that the earlier Railway/Vercel hardening did not fully close:
pr-*environments could drift and require manual cleanup or full reprovision.This PR closes both gaps. It adds an explicit recreate path for one PR and a janitor path for system-wide repair.
What We Verified
bash -npasses for the preview shell scripts changed in this PRgit diff --checkpassesagentsRailway project in dry-run mode before cleanupConfirmed Stale Railway Envs
Before cleanup, the Railway project still had these stale preview environments for merged PRs:
pr-2901pr-2902pr-2906pr-2907pr-2908pr-2909I confirmed each of those PRs is closed/merged in GitHub.
After the live cleanup, a follow-up janitor dry-run reported
19Railway preview environments, all tied to open PRs, with0stale environments left to delete.What Caused Those Misses
Those stale environments all came from a transition window before the earlier preview hardening had landed on
main:2026-03-30T21:31:52Z2026-03-31T02:14:47ZAll six stale PRs above were merged before those fixes were on
main. Their close-eventPreview Environmentsworkflow did fire, but the run failed in the oldTeardown Tier 1 (Railway)path immediately after merge, which left the Railway env behind.This PR accounts for that class of miss in two ways:
Recent Preview Run Read
The newest completed failure I checked was branch-specific, not preview infra:
23801865110forimplementation/datasetsProvision Tier 1 (Railway)passed quicklySo the recent run slate looks materially healthier than yesterday. The dominant failures are no longer the old Railway provisioning/control-plane failures this repo was hitting earlier.
Test Plan
bash -n .github/scripts/preview/*.shfor the changed preview scripts.github/workflows/preview-environments.yml.github/workflows/preview-janitor.ymlgit diff --checkcleanup-stale-railway-envs.shagainst the Railway projectPreview Janitorworkflow in GitHub after mergeworkflow_dispatchwithrecreate_preview=trueon a PR that needs forced reprovisionNotes
pnpm checkis still failing in this workspace because@inkeep/agents-apitest boot requiresINKEEP_AGENTS_MANAGE_DATABASE_URLandINKEEP_AGENTS_RUN_DATABASE_URLto be set. That failure is repo/test-environment baseline and unrelated to the preview-shell/workflow changes in this PR.