Skip to content

ci: add preview janitor and recreate control#2930

Open
vnv-varun wants to merge 3 commits intomainfrom
varun/preview-state-hardening
Open

ci: add preview janitor and recreate control#2930
vnv-varun wants to merge 3 commits intomainfrom
varun/preview-state-hardening

Conversation

@vnv-varun
Copy link
Copy Markdown
Contributor

@vnv-varun vnv-varun commented Mar 31, 2026

Summary

Add state-recovery hardening for preview environments so missed close-event teardowns and stale preview state stop accumulating, and add clearer phase logging for the long-running preview steps.

Changes

  • add a scheduled/manual Preview Janitor workflow that cleans stale Railway preview environments and can also run the existing Vercel preview env cleanup
  • add a new Railway janitor script that enumerates pr-* environments, maps them back to GitHub PR state, deletes merged/closed PR envs, and verifies deletion
  • add a manual recreate_preview workflow input so a PR can force teardown and reprovision of its Railway preview environment when reused preview state has drifted
  • retry transient SpiceDB transport failures during preview bootstrap, while still failing fast on real migration drift
  • add structured preview phase logging and wait-loop logging across provisioning, bootstrap, and teardown so future timeouts are diagnosable from CI logs
  • harden teardown and cleanup helpers for missing/disappearing environments and repo-scoped GitHub queries

Why

We were still seeing two preview-state gaps that the earlier Railway/Vercel hardening did not fully close:

  1. Reused pr-* environments could drift and require manual cleanup or full reprovision.
  2. Close-event teardown was still event-driven, so if it failed once, stale preview environments remained indefinitely.

This PR closes both gaps. It adds an explicit recreate path for one PR and a janitor path for system-wide repair.

What We Verified

  • bash -n passes for the preview shell scripts changed in this PR
  • workflow YAML parses successfully for both preview workflows
  • git diff --check passes
  • the new Railway janitor script ran successfully against the live agents Railway project in dry-run mode before cleanup
  • the same Railway janitor script was then run live and deleted stale merged-PR environments that had been left behind

Confirmed Stale Railway Envs

Before cleanup, the Railway project still had these stale preview environments for merged PRs:

  • pr-2901
  • pr-2902
  • pr-2906
  • pr-2907
  • pr-2908
  • pr-2909

I confirmed each of those PRs is closed/merged in GitHub.

After the live cleanup, a follow-up janitor dry-run reported 19 Railway preview environments, all tied to open PRs, with 0 stale environments left to delete.

What Caused Those Misses

Those stale environments all came from a transition window before the earlier preview hardening had landed on main:

  • #2900 merged at 2026-03-30T21:31:52Z
  • #2921 merged at 2026-03-31T02:14:47Z

All six stale PRs above were merged before those fixes were on main. Their close-event Preview Environments workflow did fire, but the run failed in the old Teardown Tier 1 (Railway) path immediately after merge, which left the Railway env behind.

This PR accounts for that class of miss in two ways:

  • a hardier teardown path with better logs
  • a scheduled/manual janitor that repairs missed teardown events after the fact

Recent Preview Run Read

The newest completed failure I checked was branch-specific, not preview infra:

  • run 23801865110 for implementation/datasets
    • Provision Tier 1 (Railway) passed quickly
    • the failure was an outdated lockfile during bootstrap / Vercel deploy

So the recent run slate looks materially healthier than yesterday. The dominant failures are no longer the old Railway provisioning/control-plane failures this repo was hitting earlier.

Test Plan

  • bash -n .github/scripts/preview/*.sh for the changed preview scripts
  • parse .github/workflows/preview-environments.yml
  • parse .github/workflows/preview-janitor.yml
  • git diff --check
  • live dry-run of cleanup-stale-railway-envs.sh against the Railway project
  • live cleanup of confirmed stale merged-PR Railway preview environments
  • run the new Preview Janitor workflow in GitHub after merge
  • exercise workflow_dispatch with recreate_preview=true on a PR that needs forced reprovision

Notes

  • pnpm check is still failing in this workspace because @inkeep/agents-api test boot requires INKEEP_AGENTS_MANAGE_DATABASE_URL and INKEEP_AGENTS_RUN_DATABASE_URL to be set. That failure is repo/test-environment baseline and unrelated to the preview-shell/workflow changes in this PR.

@vercel
Copy link
Copy Markdown

vercel bot commented Mar 31, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agents-api Ready Ready Preview, Comment Mar 31, 2026 3:21pm
agents-docs Ready Ready Preview, Comment Mar 31, 2026 3:21pm
agents-manage-ui Ready Ready Preview, Comment Mar 31, 2026 3:21pm

Request Review

@changeset-bot
Copy link
Copy Markdown

changeset-bot bot commented Mar 31, 2026

⚠️ No Changeset found

Latest commit: 8973aad

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog bot commented Mar 31, 2026

TL;DR — Adds a scheduled/manual "Preview Janitor" workflow that deletes stale Railway and Vercel preview environments left behind by missed close events, and adds a recreate_preview dispatch input so a single PR can force a full teardown-and-reprovision. Also hardens SpiceDB bootstrap with transient-error retries, adds railway_require_graphql_success checks on all mutation wrappers, and adds structured preview_log timestamps across all preview scripts for diagnosability.

Key changes

  • Add cleanup-stale-railway-envs.sh janitor script — enumerates all pr-* Railway environments, maps each to GitHub PR state, and deletes environments for closed/merged PRs with post-delete verification and per-deletion failure tracking.
  • Add preview-janitor.yml workflow — runs every 6 hours on a cron schedule (and manually) to invoke the Railway janitor and the existing Vercel stale-env cleanup.
  • Add recreate_preview workflow dispatch input — allows force-deleting and reprovisioning a PR's Railway environment when reused state has drifted.
  • Retry transient SpiceDB transport failures during bootstrap — wraps wait-for-spicedb and db:auth:init calls with run_with_transient_spicedb_retry, matching a curated pattern of transport error strings.
  • Add railway_require_graphql_success to mutation wrappersrailway_environment_create_from_source, railway_environment_delete_by_id, and railway_variable_collection_upsert now validate GraphQL responses and propagate failures instead of silently succeeding.
  • Add preview_log structured logging — introduces a shared preview_log helper with UTC timestamps and adds phase-level log lines across provisioning, bootstrap, teardown, and wait loops.
  • Harden Railway helpers for missing data — adds // [] fallbacks on .edges in jq filters, a new railway_wait_for_environment_absent wait loop, and drops unused env vars from teardown's require_env_vars.
  • Scope cleanup-vercel-stale-env.sh to repo — passes GITHUB_REPOSITORY to gh pr list when available and refreshes the header comment from one-time script to reusable janitor description.

Summary | 8 files | 3 commits | base: mainvarun/preview-state-hardening


Scheduled janitor for stale Railway preview environments

Before: If a PR close event's teardown failed, the Railway pr-* environment was left behind indefinitely — cleanup was purely event-driven with no fallback.
After: A new cleanup-stale-railway-envs.sh script runs on a 6-hour cron (and manually) via preview-janitor.yml, cross-referencing every pr-* Railway environment against GitHub PR state and deleting environments for closed/merged PRs.

The janitor supports dry-run mode (default for manual dispatch), an opt-in flag to delete orphaned environments whose PR returns 404, and writes a summary table to GITHUB_STEP_SUMMARY with counts of kept, deleted, unknown, and errored environments. Individual deletion failures are now tracked separately so a single failed delete does not abort the entire sweep.

How does the janitor decide what to delete?

It lists all Railway environments matching ^pr-[0-9]+$, extracts the PR number, queries the GitHub API for PR state, and categorizes each as open (keep), closed (delete), or missing (skip unless DELETE_UNKNOWN_PREVIEW_ENVS=true). Deletions are verified with railway_wait_for_environment_absent.

cleanup-stale-railway-envs.sh · preview-janitor.yml


Manual recreate_preview for drifted PR environments

Before: A PR with a drifted Railway environment required manual Railway console intervention to reprovision.
After: A recreate_preview boolean input on the workflow_dispatch trigger deletes the existing Railway environment and creates a fresh one from the template before continuing provisioning.

The flag flows from the workflow input through compute-context outputs to provision-railway.sh via RECREATE_PREVIEW_ENV.

provision-railway.sh · preview-environments.yml


Transient SpiceDB retry during preview bootstrap

Before: A single SpiceDB transport failure (e.g. UNAVAILABLE, ECONNRESET) during wait-for-spicedb or db:auth:init failed the entire preview run.
After: A new run_with_transient_spicedb_retry wrapper retries up to 2 attempts with exponential backoff, matching a curated regex of transport error strings while still failing fast on real migration or schema drift errors.

bootstrap-preview-auth.sh


GraphQL error checking on Railway mutation wrappers

Before: railway_environment_create_from_source, railway_environment_delete_by_id, and railway_variable_collection_upsert returned raw GraphQL responses without checking for errors — callers could silently succeed on a mutation that actually failed.
After: Each wrapper captures the response, runs it through railway_require_graphql_success, and returns a non-zero exit code on GraphQL-level errors.

common.sh


Structured preview_log and wait-loop observability

Before: Preview scripts used bare echo with no timestamps; wait loops were silent until failure.
After: A shared preview_log function emits [preview][UTC timestamp] lines to stderr. Wait loops log on attempts 1, every 5th attempt, and the final attempt via preview_should_log_wait_attempt. Phase transitions (resolving vars, seeding runtime vars, validating SpiceDB key) are logged throughout provisioning.

The resolve_runtime_vars loop now also reports which variables are still unresolved at each logged attempt.

common.sh · provision-railway.sh · teardown-railway.sh

Pullfrog  | View workflow run | Triggered by Pullfrog | Using Claude Opus𝕏

Copy link
Copy Markdown
Contributor

@pullfrog pullfrog bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low urgency. Solid operational hardening — the janitor, recreate path, SpiceDB retry, and structured logging all address real gaps. Two minor items below; neither is blocking.

Pullfrog  | View workflow run | Using Claude Opus𝕏

Comment on lines +238 to +242
runtime_var_is_unresolved() {
local value="${1:-}"

[ -z "${value}" ] || printf '%s' "${value}" | grep -q '\$[{][{]'
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (readability): runtime_var_is_unresolved is defined here on line 238 but first referenced inside the resolve_runtime_vars function body starting at line 187. This works because bash evaluates function bodies lazily — resolve_runtime_vars isn't called until line 245 — but the ordering is surprising and fragile. Moving the definition above resolve_runtime_vars() (or into common.sh alongside the other helpers) would make the dependency explicit and prevent a future refactor from breaking call order.

Not blocking — the current code is correct at runtime.

RAILWAY_PROJECT_ID: ${{ vars.RAILWAY_PROJECT_ID }}
GH_TOKEN: ${{ github.token }}
GITHUB_REPOSITORY: ${{ github.repository }}
DRY_RUN: ${{ github.event_name == 'workflow_dispatch' && inputs.dry_run && 'true' || 'false' }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirm intent: On schedule triggers, inputs.dry_run is undefined so this expression evaluates to 'false', meaning scheduled runs execute real deletes. The PR description indicates this is intentional (the janitor is meant to repair missed teardowns automatically). Just flagging for visibility — if you ever want scheduled runs to default to dry-run, this expression would need adjustment.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the CI automation around ephemeral preview environments by adding a scheduled/manual “janitor” cleanup path and a manual “recreate preview” path, plus improved logging and retries to make long-running preview steps more diagnosable and resilient.

Changes:

  • Add a new scheduled/manual Preview Janitor workflow to delete stale Railway pr-* environments and stale Vercel branch-scoped env vars.
  • Add a manual recreate_preview input to force teardown + reprovision of a PR’s Railway preview environment.
  • Improve preview script resilience and diagnosability via structured phase logging, safer teardown, and transient SpiceDB transport retries.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
.github/workflows/preview-janitor.yml New scheduled/manual janitor workflow for Railway + Vercel stale preview cleanup.
.github/workflows/preview-environments.yml Adds recreate_preview input plumbed into Railway provisioning.
.github/scripts/preview/common.sh Adds preview_log, wait-loop logging helpers, safer JSON edge handling, and env-deletion wait helper.
.github/scripts/preview/provision-railway.sh Adds “recreate existing env” path and more structured provisioning logs.
.github/scripts/preview/teardown-railway.sh Hardens teardown for missing/disappearing envs and uses shared deletion-wait helper.
.github/scripts/preview/bootstrap-preview-auth.sh Retries transient SpiceDB transport failures and adds more phase logging.
.github/scripts/preview/cleanup-vercel-stale-env.sh Ensures PR lookup is repo-scoped when GITHUB_REPOSITORY is set.
.github/scripts/preview/cleanup-stale-railway-envs.sh New Railway janitor script to enumerate pr-* envs, map to PR state, and delete stale envs with verification.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +23 to +32
status="$(
curl -sS \
--connect-timeout 10 \
--max-time 30 \
-o "${response_file}" \
-w '%{http_code}' \
-H "Authorization: Bearer ${GH_TOKEN}" \
-H "Accept: application/vnd.github+json" \
"https://api.github.com/repos/${GITHUB_REPOSITORY}/pulls/${pr_number}"
)"
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GitHub PR lookup is hard-coded to https://api.github.com/.... GitHub Actions exposes GITHUB_API_URL (and this also matters for GHES); consider building the endpoint from ${GITHUB_API_URL} (fallbacking to api.github.com) to avoid breaking the janitor in non-public-GitHub contexts.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(3) Total Issues | Risk: Medium

🟠⚠️ Major (2) 🟠⚠️

Inline Comments:

  • 🟠 Major: cleanup-stale-railway-envs.sh:59 Deletion errors silently ignored — return status not checked
  • 🟠 Major: cleanup-stale-railway-envs.sh:100-105 Deletion failures not tracked in exit code — janitor may exit 0 despite failed cleanups

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: preview-environments.yml:389-390 Unused env vars passed to teardown job

💭 Consider (2) 💭

💭 1) provision-railway.sh:238 Function definition ordering

Issue: runtime_var_is_unresolved() is defined after resolve_runtime_vars() which references it.
Why: While this works correctly in bash (functions are resolved at call time, not definition time), the code structure is confusing. If someone reorders the script, it could break.
Fix: Move runtime_var_is_unresolved() definition to before line 167 for clearer declaration-before-use ordering.

💭 2) preview-janitor.yml:56 Scheduled runs delete without dry-run

Issue: The DRY_RUN expression evaluates to false for scheduled runs, meaning the cron job will perform actual deletions.
Why: This differs from the manual dispatch default (dry_run: true). While this appears intentional for automated cleanup, it could surprise operators.
Fix: Add a comment documenting this intentional behavior, or change to default dry-run for scheduled runs if that's preferred.


💡 APPROVE WITH SUGGESTIONS

Summary: This is a solid infrastructure hardening PR that closes real gaps in preview environment lifecycle management. The janitor workflow and recreate path are well-designed. The two Major findings around error tracking in the janitor script should be addressed to ensure operators can trust the exit code — without this, failed cleanups could silently accumulate. The other items are minor polish.

Discarded (3)
Location Issue Reason Discarded
provision-railway.sh:238 Function runtime_var_is_unresolved defined after use causes runtime error False positive — bash resolves function references at call time, not definition time. The function IS defined before resolve_runtime_vars is called. Reclassified as Consider (code style).
common.sh:27 String equality instead of numeric comparison Works correctly for numeric strings; purely stylistic preference
bootstrap-preview-auth.sh:42 Retry pattern may miss some transient errors Low risk — current pattern covers common failure modes; adding more patterns is speculative
Reviewers (3)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-errors 5 0 0 0 2 0 3
pr-review-devops 8 0 2 0 1 0 5
pr-review-standards 2 0 0 0 0 0 2
Total 15 0 2 0 3 0 10

Note: Standards reviewer findings were duplicates of devops findings (function order, dry-run behavior).

preview_log "Deleting stale Railway preview environment ${env_name}."
railway_environment_delete_by_id "${env_id}" >/dev/null
railway_wait_for_environment_absent "${RAILWAY_PROJECT_ID}" "${env_name}" 10 2
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR: Deletion errors are silently ignored

Issue: railway_environment_delete_by_id return status is not checked. If deletion fails, the function proceeds to railway_wait_for_environment_absent which will eventually fail with a generic timeout message.

Why: When debugging deletion failures, operators see "Railway environment still exists after waiting for deletion" but won't know why (permission denied, API error, rate limit). The actual Railway API error is discarded.

Fix: Check the return status before proceeding:

delete_env_and_verify() {
  local env_id="$1"
  local env_name="$2"

  preview_log "Deleting stale Railway preview environment ${env_name}."
  if ! railway_environment_delete_by_id "${env_id}"; then
    echo "Failed to delete Railway environment ${env_name}." >&2
    return 1
  fi
  railway_wait_for_environment_absent "${RAILWAY_PROJECT_ID}" "${env_name}" 10 2
}

Refs:

Comment on lines +100 to +105
preview_log "[dry-run] Would delete stale Railway preview environment ${env_name}."
else
delete_env_and_verify "${env_id}" "${env_name}"
deleted=$((deleted + 1))
deleted_names+=("${env_name}")
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR: Deletion failures not tracked in exit code

Issue: When delete_env_and_verify fails during the janitor loop, the error is not counted. The final exit code (line 153) only reflects GitHub API lookup errors (errors variable), not Railway deletion failures.

Why: The janitor could fail to delete multiple Railway environments due to API issues, but exit with code 0 if all GitHub PR lookups succeeded. Operators checking CI may incorrectly believe all targeted environments were cleaned up.

Fix: Add a deletion_failures counter:

# Add at line 77 with other counters:
deletion_failures=0

# In the delete blocks (lines 100-105 and 113-118):
if ! delete_env_and_verify "${env_id}" "${env_name}"; then
  deletion_failures=$((deletion_failures + 1))
else
  deleted=$((deleted + 1))
  deleted_names+=("${env_name}")
fi

# Update exit check at line 153:
if [ "${errors}" -gt 0 ] || [ "${deletion_failures}" -gt 0 ]; then
  echo "Encountered ${errors} GitHub lookup error(s) and ${deletion_failures} Railway deletion failure(s)." >&2
  exit 1
fi

@github-actions github-actions bot deleted a comment from claude bot Mar 31, 2026
Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(0) Total Issues | Risk: Low

✅ Prior Feedback Addressed

All issues from the previous review have been resolved in commit 8973aad63:

Prior Issue Resolution
🟠 Major: Deletion errors silently ignored ✅ Fixed — delete_env_and_verify now checks return status and returns 1 on failure (L58-61)
🟠 Major: Deletion failures not tracked in exit code ✅ Fixed — Added deletion_failures counter, tracked in loop, checked at exit (L82, L107-112, L163-165)
🟡 Minor: Unused env vars passed to teardown ✅ Fixed — RAILWAY_TEMPLATE_ENVIRONMENT and RAILWAY_OUTPUT_SERVICE removed from teardown job
💭 Consider: Function definition ordering ✅ Fixed — runtime_var_is_unresolved() moved to L121-125, before resolve_runtime_vars() which uses it
💭 Consider: Scheduled runs delete without dry-run (undocumented) ✅ Fixed — Comment added at L57 documenting intentional behavior
🔵 Copilot: Hard-coded GitHub API URL ✅ Fixed — GITHUB_API_URL now configurable with fallback (L16, L32) and passed from workflow (L55)

Additional Improvements in Delta

The delta also includes GraphQL error handling improvements in common.sh:

  • railway_environment_create_from_source, railway_environment_delete_by_id, and railway_variable_collection_upsert now properly check for GraphQL errors and return failure status
  • This hardens the entire Railway mutation surface against silent GraphQL-level failures

✅ APPROVE

Summary: All prior review feedback has been addressed comprehensively. The janitor now properly tracks and reports deletion failures, function ordering is cleaner, and the behavior of scheduled vs manual runs is documented. The additional GraphQL error handling improvements further strengthen the reliability of the preview infrastructure. Ship it! 🚀

Reviewers (0)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
Total 0 0 0 0 0 0 0

Note: Delta review — no new reviewer dispatches needed. All prior issues resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants