ci: seed preview auth in PR previews#2775
Conversation
|
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Ito Test Report ❌14 test cases ran. 12 passed, 2 failed. ✅ The authenticated login, tenant authorization, resilience, and adversarial-input flows included in this report mostly behaved as expected. 🔍 Code-first verification confirms two real auth-gating defects on protected UI routes: deep links and tampered session cookies can still reach protected pages because route protection checks for cookie presence rather than validated session state. ✅ Passed (12)
❌ Failed (2)
Direct deep-link entry to protected screen without session redirects safely – Failed
Tampered or stale session cookie is rejected – Failed
📋 View Recording |
|
@claude --review |
There was a problem hiding this comment.
Solid PR. The refactoring centralizes Railway helpers into common.sh, adds a proper auth bootstrap job, and extends smoke coverage to exercise the real Better Auth sign-in path. Secret handling is correct — BETTER_AUTH_SECRET and SPICEDB_PRESHARED_KEY are now upserted into Vercel, masked in CI logs, and redacted in diagnostic output (including the new Set-Cookie: better-auth.* patterns). The ensure_runtime_var_seeded reorder to prioritize explicit template overrides is a deliberate semantic change and the right call. No actionable issues found.
|
TL;DR — Adds deterministic admin-user seeding to PR preview environments so Better Auth login works before smoke tests run. The preview pipeline now provisions TCP proxies for databases/SpiceDB, validates auth secrets match between Railway and GitHub, bootstraps the admin user via Key changes
Summary | 7 files | 8 commits | base:
|
| Helper | Purpose |
|---|---|
sleep_with_jitter() |
Sleeps base × (0.5 + random()) to decorrelate retries |
railway_link_service() |
Links Railway CLI to a project/service/environment |
railway_extract_runtime_var() |
Polls for a var with retry, distinguishing missing vs. unresolved (${{}} still interpolating) |
railway_graphql() |
Authenticated GraphQL client with connect/max timeouts |
railway_ensure_tcp_proxy() |
Creates TCP proxy via mutation, polls until ACTIVE (up to 30 attempts with jitter) |
railway_environment_id() |
Resolves environment name → ID via GraphQL |
railway_service_id_for_env() |
Resolves service name → ID within an environment |
redact_preview_logs() |
Strips postgres URLs, secrets, Better Auth cookies, and Bearer tokens |
How does
railway_extract_runtime_var()handle interpolation delays?Railway variables may contain
${{...}}references that take time to resolve. The function polls up to 20 times (with jittered 2s base sleep), checking each iteration whether the value is empty (missing) or still contains${{(unresolved). It returns distinct error messages for each case so callers know whether the variable doesn't exist or just hasn't finished interpolating.
common.sh · upsert-vercel-preview-env.sh
provision-railway.sh — TCP proxies, runtime-var seeding, and SpiceDB key validation
Before: Databases and SpiceDB were only reachable via Railway-internal URLs; SpiceDB key mismatches caused silent auth failures.
After: TCP proxies are provisioned for all three services with active-status polling, runtime vars are seeded from template with explicit override support, and the SpiceDB preshared key is validated against the GitHub secret before proceeding.
Environment creation uses an idempotent retry pattern — if railway environment new fails, the script re-checks existence before aborting, handling race conditions gracefully.
How does TCP proxy creation work?
railway_ensure_tcp_proxy()queries Railway's GraphQL API for existing proxies on the target service+port. If none exist, it creates one via thetcpProxyCreatemutation, then polls up to 30 times (2s base with jitter) until the proxy status reportsACTIVE.
provision-railway.sh · common.sh
smoke-preview.sh — end-to-end auth smoke test with UI-aware health checks
Before: Smoke tests only verified URL health checks (HTTP 200 on API health + UI endpoints).
After: A dedicatedwait_for_ui_urlfunction accepts redirect and auth-challenge status codes (301/302/307/308/401/403) as valid "alive" signals. A newrun_preview_auth_smokefunction signs in viaPOST /api/auth/sign-in/email, validates thebetter-auth.*session cookie, and uses it to callGET /manage/tenants/{tenant_id}/projects— confirming the full auth flow works.
All error output is piped through redact_preview_logs to prevent secrets from leaking into CI logs on failure.
smoke-preview.sh · capture-preview-failure-diagnostics.sh
preview-environments.yml — workflow orchestration with deferred-failure diagnostics
Before: The workflow had no auth bootstrap step and smoke failures produced minimal diagnostic output.
After: Abootstrap-preview-authjob runs between provisioning and Vercel configuration. The smoke job usescontinue-on-errorso that on failure,capture-preview-failure-diagnostics.shandfetch-vercel-runtime-logs.shalways execute before a final step fails the job.
This deferred-failure pattern ensures rich post-mortem data is always captured — including authenticated sign-in attempts and Vercel runtime logs — regardless of which smoke check failed.
preview-environments.yml · smoke-preview.sh
There was a problem hiding this comment.
PR Review Summary
(4) Total Issues | Risk: Low
🟠⚠️ Major (1) 🟠⚠️
🟠 1) common.sh · provision-railway.sh Retry loops without jitter risk thundering herd
files:
.github/scripts/preview/common.sh:71-88(railway_extract_runtime_var).github/scripts/preview/common.sh:230-251(railway_ensure_tcp_proxy polling).github/scripts/preview/provision-railway.sh:159-173(extract_runtime_var)
Issue: All retry loops use fixed sleep intervals (2 seconds) without any randomized jitter. This pattern appears in three locations across two files.
Why: When multiple concurrent PR preview environments experience Railway variable resolution delays or TCP proxy provisioning slowdowns, they'll all retry at synchronized intervals. This creates a thundering herd pattern that can overwhelm Railway's API and cause cascading timeouts. The risk is amplified when Railway experiences any degradation, as all pending PRs will hammer the API in lockstep.
Fix: Add randomized jitter to sleep intervals. The AWS Builders Library recommends jitter for all retry loops:
# Add jitter: base_sleep * (0.5 to 1.5)
jittered_sleep=$(awk "BEGIN {srand(); print ${sleep_seconds} * (0.5 + rand())}")
sleep "${jittered_sleep}"Consider extracting a shared retry_with_jitter helper function in common.sh to ensure consistency across all retry loops.
Refs:
Inline Comments:
- 🟠 Major:
common.sh:85-88Retry without jitter causes thundering herd
🟡 Minor (3) 🟡
🟡 1) smoke-preview.sh:86-87 · smoke-preview.sh:107-108 Error output not redacted
Issue: On sign-in and manage auth check failures, raw response bodies are output without passing through redact_preview_logs.
Why: Auth error responses may contain diagnostic data that shouldn't appear in CI logs. The redact_preview_logs function is used elsewhere (diagnostics script, header output) but not consistently here.
Fix: Pipe through redaction: cat "${sign_in_body}" | redact_preview_logs >&2
Refs: redact_preview_logs
🟡 2) common.sh:216-227 TCP proxy mutation discards error response
Issue: The tcpProxyCreate mutation response is redirected to /dev/null, discarding any error information.
Why: If the mutation fails (quota exceeded, invalid parameters), error details are lost. The polling loop will timeout after ~60s without context.
Fix: Capture the mutation response and check for GraphQL errors before entering the polling loop.
Refs: common.sh:216-227
🟡 3) provision-railway.sh:195 Masking order allows brief log exposure window
Issue: The mask_env_vars call occurs after variables are extracted. If extraction fails mid-way with verbose logging, values could appear before masking takes effect.
Why: GitHub Actions ::add-mask:: only redacts values from log output produced after the mask command runs.
Fix: Mask immediately after each variable assignment, or ensure error paths don't log the raw values.
Refs: GitHub Actions: Masking a value in log
Inline Comments:
- 🟡 Minor:
smoke-preview.sh:85-87Error output should be redacted - 🟡 Minor:
smoke-preview.sh:106-108Error output should be redacted - 🟡 Minor:
common.sh:216-227TCP proxy creation mutation discards error response
💭 Consider (2) 💭
💭 1) bootstrap-preview-auth.sh:48-54 Add explicit timeouts to migration/auth init
Issue: pnpm db:run:migrate and pnpm db:auth:init run without explicit timeouts.
Why: If a database connection hangs, the job waits until the 20-minute job timeout. Explicit command timeouts provide faster feedback.
Fix: Wrap with timeout 300 pnpm db:run:migrate and timeout 180 pnpm db:auth:init.
Refs: bootstrap-preview-auth.sh:48-54
💭 2) common.sh:107-121 Railway GraphQL API lacks retry on transient failures
Issue: railway_graphql makes a single HTTP request without retry logic. Transient 5xx or network errors cause immediate failure.
Why: Railway's GraphQL API may experience momentary availability issues, causing unnecessary preview provisioning failures.
Fix: Add 2-3 retry attempts for 5xx and connection errors with exponential backoff.
Inline Comments:
- 💭 Consider:
bootstrap-preview-auth.sh:48-54Add explicit timeouts to migration and auth init commands
💡 APPROVE WITH SUGGESTIONS
Summary: This PR achieves its goals of making preview login deterministic and removing insecure preview auth defaults. The implementation is well-structured with good secret masking, SpiceDB key validation, and comprehensive smoke testing. The main area for improvement is adding jitter to retry loops to prevent thundering herd issues when multiple PRs are being provisioned concurrently. The minor items around error output redaction are straightforward fixes that improve consistency.
Discarded (4)
| Location | Issue | Reason Discarded |
|---|---|---|
workflow:185,229,269,373 |
Railway CLI installed via npm without integrity verification | Low risk — version is pinned, and Railway doesn't provide official checksums. Standard practice for CLI tools in CI. |
workflow:212 |
bootstrap-preview-auth job has no outputs declaration | Inconsistent but not blocking — the job doesn't need to pass data downstream currently. |
bootstrap-preview-auth.sh |
No cleanup on partial failure during auth bootstrap | Migrations are generally idempotent and auth init is idempotent. Re-running the job should recover. |
smoke-preview.sh:16 |
Health check retry without jitter | Lower priority — polls the preview environment itself, not a shared external API. Less likely to cause issues. |
Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-devops |
7 | 3 | 0 | 0 | 3 | 0 | 4 |
pr-review-sre |
11 | 1 | 2 | 0 | 2 | 0 | 0 |
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 18 | 4 | 2 | 0 | 5 | 0 | 4 |
Note: pr-review-standards returned no findings — shell scripts passed correctness checks. Duplicates between reviewers were merged (SRE and DevOps both flagged the redaction issues).
| if [ "${attempt}" -lt "${max_attempts}" ]; then | ||
| sleep "${sleep_seconds}" | ||
| fi | ||
| done |
There was a problem hiding this comment.
🟠 MAJOR: Retry without jitter causes thundering herd
Issue: The retry loops in railway_extract_runtime_var and railway_ensure_tcp_proxy use fixed sleep intervals without jitter.
Why: When multiple concurrent PRs experience Railway variable resolution delays, they'll all retry at synchronized intervals, creating a thundering herd pattern that can overwhelm Railway's API and cause cascading timeouts.
Fix: Add randomized jitter to the sleep duration:
if [ "${attempt}" -lt "${max_attempts}" ]; then
# Add jitter: sleep_seconds * (0.5 to 1.5)
jittered_sleep=$(awk "BEGIN {srand(); print ${sleep_seconds} * (0.5 + rand())}")
sleep "${jittered_sleep}"
fiRefs:
| echo "Preview sign-in failed with status ${sign_in_status}" >&2 | ||
| cat "${sign_in_body}" >&2 | ||
| return 1 |
There was a problem hiding this comment.
🟡 Minor: Error output should be redacted
Issue: On sign-in failure, the raw response body is output without redaction. Auth error responses may contain diagnostic data that shouldn't appear in CI logs.
Why: The redact_preview_logs function is used elsewhere (e.g., diagnostics script, line 92 for headers) but not here for the body.
Fix:
| echo "Preview sign-in failed with status ${sign_in_status}" >&2 | |
| cat "${sign_in_body}" >&2 | |
| return 1 | |
| if [ "${sign_in_status}" != "200" ]; then | |
| echo "Preview sign-in failed with status ${sign_in_status}" >&2 | |
| cat "${sign_in_body}" | redact_preview_logs >&2 |
Refs:
| echo "Preview manage auth check failed with status ${manage_status}" >&2 | ||
| cat "${manage_body}" >&2 | ||
| return 1 |
There was a problem hiding this comment.
🟡 Minor: Error output should be redacted
Issue: On manage auth check failure, the raw response body is output without redaction.
Why: Consistent redaction prevents accidental exposure of auth-related diagnostics in CI logs.
Fix:
| echo "Preview manage auth check failed with status ${manage_status}" >&2 | |
| cat "${manage_body}" >&2 | |
| return 1 | |
| if [ "${manage_status}" != "200" ]; then | |
| echo "Preview manage auth check failed with status ${manage_status}" >&2 | |
| cat "${manage_body}" | redact_preview_logs >&2 |
.github/scripts/preview/common.sh
Outdated
| railway_graphql "$(cat <<EOF | ||
| mutation { | ||
| tcpProxyCreate(input: { | ||
| environmentId: "${env_id}" | ||
| serviceId: "${service_id}" | ||
| applicationPort: ${application_port} | ||
| }) { | ||
| id | ||
| } | ||
| } | ||
| EOF | ||
| )" >/dev/null |
There was a problem hiding this comment.
🟡 Minor: TCP proxy creation mutation discards error response
Issue: The tcpProxyCreate mutation response is redirected to /dev/null, discarding any error information.
Why: If the mutation fails (e.g., quota exceeded, invalid parameters), the error details are lost. The subsequent polling loop will timeout after ~60s, but without the creation error context, debugging becomes difficult.
Fix: Capture the mutation response and check for GraphQL errors:
if [ "${count}" = "0" ]; then
mutation_response="$(railway_graphql "$(cat <<EOF
mutation {
tcpProxyCreate(input: {
environmentId: "${env_id}"
serviceId: "${service_id}"
applicationPort: ${application_port}
}) {
id
}
}
EOF
)")"
if echo "${mutation_response}" | jq -e '.errors' >/dev/null 2>&1; then
echo "Failed to create TCP proxy: $(echo "${mutation_response}" | jq -r '.errors[0].message')" >&2
return 1
fi
fi| echo "::group::Run preview runtime migrations" | ||
| pnpm db:run:migrate | ||
| echo "::endgroup::" | ||
|
|
||
| echo "::group::Initialize preview auth" | ||
| pnpm db:auth:init | ||
| echo "::endgroup::" |
There was a problem hiding this comment.
💭 Consider: Add explicit timeouts to migration and auth init commands
Issue: The pnpm db:run:migrate and pnpm db:auth:init commands run without explicit timeouts. If a database connection hangs, the job will wait until the 20-minute job timeout.
Why: Database migrations during preview bootstrapping may encounter transient connection issues to Railway-hosted Postgres. Explicit timeouts provide faster feedback and clearer failure attribution.
Fix: Consider wrapping with explicit timeouts:
echo "::group::Run preview runtime migrations"
timeout 300 pnpm db:run:migrate
echo "::endgroup::"
echo "::group::Initialize preview auth"
timeout 180 pnpm db:auth:init
echo "::endgroup::"This is optional — the job timeout is a backstop, but explicit command timeouts give faster feedback.
There was a problem hiding this comment.
PR Review Summary
(0) Total Issues | Risk: Low
Re-review Summary
This is a delta review following up on the prior automated review. The new commit cca2bba9c ci: harden preview retry and error logging addresses all actionable findings from the previous review:
| Prior Finding | Status |
|---|---|
| 🟠 Major: Retry loops without jitter risk thundering herd | ✅ Fixed — Added sleep_with_jitter() helper using randomized multiplier (0.5–1.5×), applied to all three retry loops |
| 🟡 Minor: Error output not redacted in smoke-preview.sh | ✅ Fixed — Both error paths now pipe through redact_preview_logs |
| 🟡 Minor: TCP proxy mutation discards error response | ✅ Fixed — Response is now captured and checked for GraphQL errors before polling |
Delta Changes Analysis
The commit introduces a clean sleep_with_jitter() function that uses Python's random module to generate jitter in the range [0.5 × base, 1.5 × base):
sleep_with_jitter() {
local sleep_seconds="$1"
jittered_sleep="$(python3 - <<PY
import random
base = float(${sleep_seconds})
print(base * (0.5 + random.random()))
PY
)"
sleep "${jittered_sleep}"
}This is correctly applied to:
railway_extract_runtime_var(common.sh:101)railway_ensure_tcp_proxypolling loop (common.sh:271)extract_runtime_varin provision-railway.sh (line 170)
Remaining Optional Improvements
The following items from the prior review remain unaddressed but are not blocking:
| Item | Reason Not Blocking |
|---|---|
| 💭 Add explicit timeouts to migration/auth init | Job timeout (20 min) provides backstop; explicit timeouts are nice-to-have |
| 💭 Railway GraphQL API lacks retry on transient failures | Single failures will fail the job which can be re-run; retry complexity not justified |
| 🟡 Masking order allows brief log exposure window | Low practical risk — values are masked before use; error paths don't log raw values |
✅ APPROVE
Summary: All actionable feedback from the prior review has been addressed. The jitter implementation follows AWS best practices for retry loops, error output is now properly redacted, and TCP proxy creation errors are captured. This PR is ready to merge.
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
orchestrator-delta |
3 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
Note: Delta review — all prior findings resolved by commit cca2bba9c.
There was a problem hiding this comment.
PR Review Summary
(0) Total Issues | Risk: Low
Re-review Summary
This is a delta review following up on the prior APPROVE.
Changes since last review:
| Commit | Description |
|---|---|
0b1d393ad |
Merge branch 'main' into varun/preview-auth-bootstrap |
The only change is a clean merge from main that incorporates 7 commits:
d83ba0bb4— Update pullfrog + dependabot config65c151d20— Adding app ID to runAuth middlewareb396a8878— Fix scheduled trigger invocations being skippedfddbd38ac— Make OpenTelemetry startup idempotentc71e1eabe— Version Packagesb1e6ced4f— SSO configuration and auth method management
Delta Analysis
| Check | Status |
|---|---|
| Preview auth files unchanged | ✅ No changes to .github/scripts/preview/ or preview-environments.yml |
| Merge conflicts | ✅ Clean merge, no conflicts |
| Integration impact | ✅ Merged commits do not touch preview environment functionality |
The PR's core implementation (preview auth bootstrap, smoke tests, Railway TCP proxies, jitter retry loops, secret redaction) remains exactly as approved in the previous review.
Prior Feedback Status
All findings from the initial review were addressed in commit cca2bba9c:
| Prior Finding | Status |
|---|---|
| 🟠 Retry loops without jitter | ✅ Fixed — sleep_with_jitter() applied to all retry loops |
| 🟡 Error output not redacted | ✅ Fixed — Both error paths pipe through redact_preview_logs |
| 🟡 TCP proxy mutation discards error | ✅ Fixed — Response captured and checked for GraphQL errors |
✅ APPROVE
Summary: The merge from main is clean and does not affect the preview auth implementation. All actionable feedback from prior reviews was addressed. This PR is ready to merge.
Note: Unable to submit formal approval due to GitHub App permissions — this is a COMMENT review with approval recommendation.
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
orchestrator-delta |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note: Delta review of merge commit 0b1d393ad — no PR-specific changes since last approval.
Ito Test Report ✅20 test cases ran. 20 passed. Across 7 report shards, all 20 of 20 test cases passed with no failures, confirming preview readiness (UI reachable and /health correctly returning 204 No Content), successful seeded-admin login/session behavior, and proper protected-route access when authenticated. The most important findings are that auth security and resilience controls held under adversarial and edge conditions—unauthenticated, forged, malformed, injection, origin-spoofed, tenant-tampered, burst, open-redirect, cross-context, mobile, double-submit, and refresh/history scenarios all failed closed or recovered safely without secret/session leakage or instability. ✅ Passed (20)
Commit: Tell us how we did: Give Ito Feedback |
* ci: bootstrap preview auth * ci: require secure preview auth config * ci: recover preview auth runtime vars * ci: install railway in preview bootstrap * ci: provision preview db tcp proxies * ci: proxy preview spicedb bootstrap * ci: harden preview retry and error logging --------- Co-authored-by: Andrew Mikofalvy <5668128+amikofalvy@users.noreply.github.com>













































Summary
Seed a deterministic preview admin user in each PR environment, remove insecure preview auth defaults, and extend preview smoke coverage to verify the real Better Auth email login path against the preview API.
Changes
Bootstrap Preview Authjob to the preview workflow after Railway provisioningpnpm db:run:migrateandpnpm db:auth:initagainst the PR runtime database before smoke checks/api/auth/sign-in/email, assert a Better Auth session cookie is returned, and verify an authenticated manage route succeedsBETTER_AUTH_SECRETandSPICEDB_PRESHARED_KEYinto the Vercel API preview envspicedbservice key matches the GitHub preview SpiceDB secretWhy
The earlier preview work proved infra boot and backend wiring, but it did not guarantee a seeded preview user. That made human login unreliable even when preview CI was green. It also relied on CI-style fallback auth secrets that are not appropriate for public preview backing services.
This change makes preview login deterministic for reviewers while failing closed if the preview auth configuration is missing or mismatched.
Current Status
69aa37376eb6530e6e85803417875442ad580f8fTest Plan
pnpm formatbash -n .github/scripts/preview/common.sh .github/scripts/preview/bootstrap-preview-auth.sh .github/scripts/preview/provision-railway.sh .github/scripts/preview/upsert-vercel-preview-env.sh .github/scripts/preview/smoke-preview.sh .github/scripts/preview/capture-preview-failure-diagnostics.sh.github/workflows/preview-environments.ymlgit diff --check69aa37376passes with the seeded preview login path enabled end to endNotes
The repo-level preview auth config expected by this PR is:
PREVIEW_MANAGE_UI_USERNAMEPREVIEW_MANAGE_UI_PASSWORDPREVIEW_BETTER_AUTH_SECRETPREVIEW_SPICEDB_PRESHARED_KEYThe preview UI smoke remains at the reachability layer because Vercel preview auth still blocks unauthenticated headless UI access from CI. The authenticated smoke runs against the preview API layer instead.