ci: seed preview auth in PR previews #2775

Merged: amikofalvy merged 8 commits into main from varun/preview-auth-bootstrap on Mar 20, 2026

@vnv-varun vnv-varun commented Mar 19, 2026

Summary

Seed a deterministic preview admin user in each PR environment, remove insecure preview auth defaults, and extend preview smoke coverage to verify the real Better Auth email login path against the preview API.

Changes

  • add a Bootstrap Preview Auth job to the preview workflow after Railway provisioning
  • run pnpm db:run:migrate and pnpm db:auth:init against the PR runtime database before smoke checks
  • extend preview smoke to sign in via /api/auth/sign-in/email, assert a Better Auth session cookie is returned, and verify an authenticated manage route succeeds
  • remove hardcoded preview auth defaults and require explicit preview auth config from GitHub secrets/vars
  • upsert BETTER_AUTH_SECRET and SPICEDB_PRESHARED_KEY into the Vercel API preview env
  • validate that the Railway spicedb service key matches the GitHub preview SpiceDB secret
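The cookie assertion in the extended smoke check can be sketched as a small header check. This is a minimal illustration, not the PR's implementation: `has_better_auth_cookie` is a hypothetical helper, and the `__Secure-` prefix is an assumption about how the session cookie may be named over HTTPS.

```shell
# Hypothetical helper (not from the PR): succeed only if a saved response-header
# file contains a Set-Cookie header for a cookie named "better-auth.*".
# Assumption: the cookie may carry a "__Secure-" prefix on HTTPS deployments.
has_better_auth_cookie() {
  local headers_file="$1"
  grep -Eiq '^set-cookie:[[:space:]]*(__Secure-)?better-auth\.[^=]+=' "${headers_file}"
}

# Intended use (not executed here): save the sign-in response headers with
#   curl -sS -D headers.txt -o body.json -X POST ... "<api-url>/api/auth/sign-in/email"
# and then fail the smoke check unless has_better_auth_cookie headers.txt succeeds.
```

Checking the header name rather than the cookie value keeps the secret out of any log output the check produces.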

Why

The earlier preview work proved infra boot and backend wiring, but it did not guarantee a seeded preview user. That made human login unreliable even when preview CI was green. It also relied on CI-style fallback auth secrets that are not appropriate for public preview backing services.

This change makes preview login deterministic for reviewers while failing closed if the preview auth configuration is missing or mismatched.

Current Status

  • remote PR head is 69aa37376eb6530e6e85803417875442ad580f8f
  • the secure preview auth config is now present in GitHub and Railway
  • a fresh workflow run was triggered automatically from that head update
  • current validation target is: Railway provisioning passes, preview auth bootstrap passes, and authenticated preview smoke passes in a real PR environment

Test Plan

  • pnpm format
  • bash -n .github/scripts/preview/common.sh .github/scripts/preview/bootstrap-preview-auth.sh .github/scripts/preview/provision-railway.sh .github/scripts/preview/upsert-vercel-preview-env.sh .github/scripts/preview/smoke-preview.sh .github/scripts/preview/capture-preview-failure-diagnostics.sh
  • YAML parse of .github/workflows/preview-environments.yml
  • git diff --check
  • preview workflow rerun on 69aa37376 passes with the seeded preview login path enabled end to end

Notes

The repo-level preview auth config expected by this PR is:

  • repo var PREVIEW_MANAGE_UI_USERNAME
  • repo secret PREVIEW_MANAGE_UI_PASSWORD
  • repo secret PREVIEW_BETTER_AUTH_SECRET
  • repo secret PREVIEW_SPICEDB_PRESHARED_KEY
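A fail-closed check for these four values might look like the sketch below. `require_env_vars` is a hypothetical helper name, and the sketch assumes the workflow exports the repo var and secrets as environment variables under the same names.

```shell
# Hypothetical helper (not from the PR): fail closed when any named variable is
# unset or empty. Only pass trusted, literal variable names, because the name is
# expanded through eval.
require_env_vars() {
  local missing=0 name value
  for name in "$@"; do
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      # Report the variable name only, never its value.
      echo "Missing required preview auth config: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

A script could call `require_env_vars PREVIEW_MANAGE_UI_USERNAME PREVIEW_MANAGE_UI_PASSWORD PREVIEW_BETTER_AUTH_SECRET PREVIEW_SPICEDB_PRESHARED_KEY || exit 1` before doing any work, so a missing value stops the job instead of falling back to a default.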

The preview UI smoke remains at the reachability layer because Vercel preview auth still blocks unauthenticated headless UI access from CI. The authenticated smoke runs against the preview API layer instead.

changeset-bot bot commented Mar 19, 2026

⚠️ No Changeset found

Latest commit: 0b1d393

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

vercel bot commented Mar 19, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agents-api | Ready | Preview, Comment | Mar 20, 2026 0:20am |
| agents-docs | Ready | Preview, Comment | Mar 20, 2026 0:20am |
| agents-manage-ui | Ready | Preview, Comment | Mar 20, 2026 0:20am |

itoqa bot commented Mar 19, 2026

Ito Test Report ✅

10 test cases ran. 10 passed.

✅ The executed UI authentication and resilience scenarios completed successfully in this run. Verified results indicate stable login, redirect, returnUrl safety, session handling under refresh/double-submit, and consistent behavior across mobile and parallel-tab flows.

✅ Passed (10)
| Test Case | Summary | Timestamp | Screenshot |
| --- | --- | --- | --- |
| ROUTE-5 | Login form submitted successfully and redirected to /default/projects. | 8:31 | ROUTE-5_8-31.png |
| EDGE-1 | Unauthenticated deep-link redirected to login and preserved returnUrl. | 8:31 | EDGE-1_8-31.png |
| EDGE-2 | Internal returnUrl redirected correctly to /default/projects after authentication. | 8:31 | EDGE-2_8-31.png |
| EDGE-3 | Post-login navigation stayed on trusted local origin and did not redirect externally. | 8:31 | EDGE-3_8-31.png |
| EDGE-4 | Page recovered after refresh interruption and allowed successful login completion. | 8:31 | EDGE-4_8-31.png |
| EDGE-5 | Back and forward transitions stayed in authenticated routes without oscillation. | 8:31 | EDGE-5_8-31.png |
| EDGE-6 | At 390x844 viewport, email/password/sign-in controls were usable and login succeeded. | 8:31 | EDGE-6_8-31.png |
| EDGE-7 | Rapid double-submit resulted in one stable authenticated session without corruption. | 8:31 | EDGE-7_8-31.png |
| EDGE-8 | Concurrent sign-in in two tabs produced authenticated projects access in both tabs. | 8:31 | EDGE-8_8-31.png |
| ADV-3 | A pre-seeded forged better-auth cookie was replaced/ignored after valid sign-in. | 8:31 | ADV-3_8-31.png |
itoqa bot commented Mar 19, 2026

Ito Test Report ❌

14 test cases ran. 12 passed, 2 failed.

✅ The authenticated login, tenant authorization, resilience, and adversarial-input flows included in this report mostly behaved as expected. 🔍 Code-first verification confirms two real auth-gating defects on protected UI routes: deep links and tampered session cookies can still reach protected pages because route protection checks for cookie presence rather than validated session state.

✅ Passed (12)
| Test Case | Summary | Timestamp | Screenshot |
| --- | --- | --- | --- |
| ROUTE-3 | Valid sign-in issued Better Auth cookies and authenticated tenant projects request returned 200. | 2:20 | ROUTE-3_2-20.png |
| ROUTE-4 | Login form accepted valid credentials and redirected to tenant projects with persistence after reload. | 2:20 | ROUTE-4_2-20.png |
| ROUTE-5 | Unauthenticated protected tenant projects endpoint correctly returned 401 with no project data exposure. | 4:55 | ROUTE-5_4-55.png |
| LOGIC-1 | Three repeated login-clear-relogin cycles completed successfully and remained authenticated after reload. | 10:15 | LOGIC-1_10-15.png |
| LOGIC-2 | Authenticated UI context could call API and second tab stayed authenticated. | 2:21 | LOGIC-2_2-21.png |
| LOGIC-3 | Invalid password returned 401, no session cookie was set, and protected endpoint stayed unauthorized. | 4:55 | LOGIC-3_4-55.png |
| EDGE-1 | Authorized tenant succeeded while mismatched tenant request was forbidden without data leakage. | 4:55 | EDGE-1_4-55.png |
| EDGE-2 | Reload plus back/forward interruption stabilized on authenticated projects page without persistent loop. | 10:15 | EDGE-2_10-15.png |
| EDGE-4 | Mobile viewport login and projects flow remained usable with no horizontal overflow. | 10:15 | EDGE-4_10-15.png |
| EDGE-5 | Rapid repeated submits converged to a stable authenticated state and stayed on projects after reload. | 10:15 | EDGE-5_10-15.png |
| ADV-1 | Malicious login payloads did not execute scripts or bypass authentication. | 12:25 | ADV-1_12-25.png |
| ADV-3 | Path manipulation attempts were denied while baseline tenant access still worked. | 4:55 | ADV-3_4-55.png |
❌ Failed (2)
| Test Case | Summary | Timestamp | Screenshot |
| --- | --- | --- | --- |
| ADV-4 | Deep-link to /default/projects rendered protected UI instead of redirecting to /login after unauthenticated state. | 4:55 | ADV-4_4-55.png |
| EDGE-6 | Tampered cookie was rejected by protected API (401), but protected UI route still rendered instead of requiring re-authentication. | 10:15 | EDGE-6_10-15.png |

Direct deep-link entry to protected screen without session redirects safely – Failed
  • Where: Manage UI protected routing at /{tenantId}/projects (middleware/proxy auth gate)

  • Steps to reproduce: Clear/lose valid session, then open a protected route like /default/projects directly.

  • What failed: Expected redirect to /login, but protected route is allowed when a session cookie name is present even if the session is not validated.

  • Code analysis: I reviewed the UI request gate in agents-manage-ui/src/proxy.ts, session-cookie matching in packages/agents-core/src/auth/cookie-names.ts, and API session validation in agents-api/src/middleware/manageAuth.ts. The proxy permits access based only on cookie-name presence, while API routes validate the session and reject invalid tokens.

  • Relevant code:

    agents-manage-ui/src/proxy.ts (lines 39-48)

    const hasSession = request.cookies.getAll().some((c) => isSessionCookie(c.name));
    
    if (hasSession) {
      if (request.cookies.has(LOGGED_OUT_COOKIE)) {
        const response = NextResponse.next();
        response.cookies.delete(LOGGED_OUT_COOKIE);
        return response;
      }
      return NextResponse.next();
    }

    packages/agents-core/src/auth/cookie-names.ts (lines 14-17)

    export function isSessionCookie(cookieName: string): boolean {
      return (
        cookieName === SESSION_COOKIE_NAME || cookieName === `${SECURE_PREFIX}${SESSION_COOKIE_NAME}`
      );
    }

    agents-api/src/middleware/manageAuth.ts (lines 85-107)

    // 2. Try to validate as a better-auth session token (from device authorization flow or cookie)
    const auth = c.get('auth');
    try {
      const headers = new Headers();
      headers.set('Authorization', authHeader);
      const cookie = c.req.header('cookie');
      if (cookie) {
        headers.set('cookie', cookie);
      }
      const session = await auth.api.getSession({ headers });
  • Why this is likely a bug: Protected UI route access is granted by cookie presence alone, which is weaker than the API's actual session validation and allows unauthenticated deep-link entry behavior.

  • Introduced by this PR: No - pre-existing bug (code not changed in this PR)

  • Timestamp: 4:55

Tampered or stale session cookie is rejected – Failed
  • Where: Manage UI protected-route guard for tenant/project pages.

  • Steps to reproduce: Tamper the better-auth.session_token cookie, then navigate to /default/projects.

  • What failed: API correctly returns 401 for tampered cookie, but UI still renders protected route instead of redirecting to /login.

  • Code analysis: I inspected the same auth gate and found no token/session validity check in agents-manage-ui/src/proxy.ts; any cookie with a recognized session name is treated as authenticated. This conflicts with API-side validation behavior that rejects invalid sessions.

  • Relevant code:

    agents-manage-ui/src/proxy.ts (lines 39-48)

    const hasSession = request.cookies.getAll().some((c) => isSessionCookie(c.name));
    
    if (hasSession) {
      if (request.cookies.has(LOGGED_OUT_COOKIE)) {
        const response = NextResponse.next();
        response.cookies.delete(LOGGED_OUT_COOKIE);
        return response;
      }
      return NextResponse.next();
    }

    packages/agents-core/src/auth/cookie-names.ts (lines 14-17)

    export function isSessionCookie(cookieName: string): boolean {
      return (
        cookieName === SESSION_COOKIE_NAME || cookieName === `${SECURE_PREFIX}${SESSION_COOKIE_NAME}`
      );
    }

    agents-api/src/middleware/manageAuth.ts (lines 106-110)

    const session = await auth.api.getSession({ headers });
    
    if (session?.user) {
      logger.info({ userId: session.user.id }, 'Better-auth session authenticated successfully');
  • Why this is likely a bug: A tampered session token should force re-authentication everywhere, but the UI gate trusts cookie presence and can expose protected route rendering despite invalid auth.

  • Introduced by this PR: No - pre-existing bug (code not changed in this PR)

  • Timestamp: 10:15

@vnv-varun

@claude --review

@vnv-varun vnv-varun marked this pull request as ready for review March 19, 2026 23:13
@pullfrog pullfrog bot left a comment

Solid PR. The refactoring centralizes Railway helpers into common.sh, adds a proper auth bootstrap job, and extends smoke coverage to exercise the real Better Auth sign-in path. Secret handling is correct — BETTER_AUTH_SECRET and SPICEDB_PRESHARED_KEY are now upserted into Vercel, masked in CI logs, and redacted in diagnostic output (including the new Set-Cookie: better-auth.* patterns). The ensure_runtime_var_seeded reorder to prioritize explicit template overrides is a deliberate semantic change and the right call. No actionable issues found.

pullfrog bot commented Mar 19, 2026

TL;DR — Adds deterministic admin-user seeding to PR preview environments so Better Auth login works before smoke tests run. The preview pipeline now provisions TCP proxies for databases/SpiceDB, validates auth secrets match between Railway and GitHub, bootstraps the admin user via db:auth:init, and runs a full sign-in → session-cookie → authenticated-API smoke test. Shared Railway helpers are consolidated into common.sh with jittered retries, self-healing Railway fallbacks, and comprehensive secret redaction to harden the entire flow against transient failures.

Key changes

  • bootstrap-preview-auth.sh — Seeds the preview admin user by running runtime DB migrations and db:auth:init, with a Railway fallback for resolving connection strings when job outputs are missing.
  • common.sh — Shared helper library with Railway GraphQL client, TCP proxy lifecycle, jittered retry loops (sleep_with_jitter), runtime-var polling that distinguishes missing vs. unresolved values, and a redact_preview_logs sed pipeline for safe CI output.
  • provision-railway.sh — Creates Railway environment from template, provisions TCP proxies for manage DB / run DB / SpiceDB, seeds runtime vars with template-override support, and validates the SpiceDB preshared key against the GitHub secret.
  • smoke-preview.sh — Health-checks API and UI (accepting redirect/auth-challenge status codes for UI), then performs end-to-end sign-in, validates the session cookie, and hits an authenticated manage API route.
  • upsert-vercel-preview-env.sh — Pushes all required env vars (BETTER_AUTH_SECRET, DB URLs, SpiceDB config) to Vercel preview deployments with branch-scoped targeting and Railway self-healing fallback.
  • preview-environments.yml — Orchestrates the full lifecycle: provision → bootstrap auth → configure Vercel → smoke test → capture diagnostics on failure → teardown on PR close.
  • capture-preview-failure-diagnostics.sh — Collects auth-specific diagnostics (sign-in attempt + authenticated API call) on smoke failure, all piped through redact_preview_logs to prevent secret leakage.

Summary | 7 files | 8 commits | base: main ← varun/preview-auth-bootstrap


bootstrap-preview-auth.sh — preview auth seeding with Railway fallback

Before: Preview environments had no admin user seeded; login depended on runtime defaults or manual intervention.
After: A dedicated CI step runs pnpm db:run:migrate and pnpm db:auth:init against the preview databases, creating a deterministic admin user with SpiceDB permissions.

If RUN_DB_URL or SPICEDB_ENDPOINT are not available from prior job outputs, the script self-heals by linking to Railway and extracting the values via railway_extract_runtime_var with jittered retries. All resolved secrets are masked before any work begins.
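The "use the job output if present, otherwise fall back" step described above can be sketched as follows. `resolve_with_fallback` is a hypothetical helper; in the PR the fallback is the Railway lookup via railway_extract_runtime_var, which is represented here by whatever command the caller passes in.

```shell
# Hypothetical helper (not from the PR): print the value handed over from an
# earlier job if it is non-empty; otherwise run the fallback command, which is
# expected to print the value itself.
resolve_with_fallback() {
  local current="$1"
  shift
  if [ -n "${current}" ]; then
    printf '%s\n' "${current}"
    return 0
  fi
  # Fallback path: the command's exit status becomes the function's status.
  "$@"
}
```

A caller would capture the result with command substitution and mask it before using it anywhere that might be logged.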

bootstrap-preview-auth.sh · preview-environments.yml


common.sh — shared Railway helpers, jittered retries, and log redaction

Before: Railway CLI link/extract logic was duplicated inline across scripts; TCP proxy management was unsupported; no retry jitter or structured error reporting existed.
After: Thirteen shared functions centralize Railway operations with jittered retries, GraphQL-based TCP proxy lifecycle, and comprehensive secret redaction.

| Helper | Purpose |
| --- | --- |
| sleep_with_jitter() | Sleeps base × (0.5 + random()) to decorrelate retries |
| railway_link_service() | Links Railway CLI to a project/service/environment |
| railway_extract_runtime_var() | Polls for a var with retry, distinguishing missing vs. unresolved (${{}} still interpolating) |
| railway_graphql() | Authenticated GraphQL client with connect/max timeouts |
| railway_ensure_tcp_proxy() | Creates TCP proxy via mutation, polls until ACTIVE (up to 30 attempts with jitter) |
| railway_environment_id() | Resolves environment name → ID via GraphQL |
| railway_service_id_for_env() | Resolves service name → ID within an environment |
| redact_preview_logs() | Strips postgres URLs, secrets, Better Auth cookies, and Bearer tokens |

How does railway_extract_runtime_var() handle interpolation delays?

Railway variables may contain ${{...}} references that take time to resolve. The function polls up to 20 times (with jittered 2s base sleep), checking each iteration whether the value is empty (missing) or still contains ${{ (unresolved). It returns distinct error messages for each case so callers know whether the variable doesn't exist or just hasn't finished interpolating.
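The missing-versus-unresolved distinction can be sketched as a small classifier. `classify_runtime_value` is a hypothetical name used for illustration; the actual function in common.sh may structure this differently.

```shell
# Hypothetical helper (not from the PR): classify a polled variable value.
#   empty string            -> "missing"    (the variable is not there)
#   still contains "${{"    -> "unresolved" (a reference has not interpolated yet)
#   anything else           -> "resolved"
classify_runtime_value() {
  local value="$1"
  case "${value}" in
    '') echo "missing" ;;
    *'${{'*) echo "unresolved" ;;
    *) echo "resolved" ;;
  esac
}
```

A polling loop would keep retrying while the result is anything other than "resolved" and report the last classification when it gives up.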

common.sh · upsert-vercel-preview-env.sh


provision-railway.sh — TCP proxies, runtime-var seeding, and SpiceDB key validation

Before: Databases and SpiceDB were only reachable via Railway-internal URLs; SpiceDB key mismatches caused silent auth failures.
After: TCP proxies are provisioned for all three services with active-status polling, runtime vars are seeded from template with explicit override support, and the SpiceDB preshared key is validated against the GitHub secret before proceeding.

Environment creation uses an idempotent retry pattern — if railway environment new fails, the script re-checks existence before aborting, handling race conditions gracefully.
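The re-check-before-abort pattern can be sketched generically. `ensure_created` and its two command arguments are hypothetical; in the PR the create step is railway environment new, and the existence check is whatever lookup the script already performs.

```shell
# Hypothetical helper (not from the PR): idempotent "create if absent".
# $1 = command that succeeds when the resource exists
# $2 = command that attempts to create it
ensure_created() {
  local exists_cmd="$1" create_cmd="$2"
  if "${exists_cmd}"; then
    return 0
  fi
  if "${create_cmd}"; then
    return 0
  fi
  # Creation failed. A concurrent run may have created the resource in the
  # meantime, so check again before treating this as an error.
  if "${exists_cmd}"; then
    return 0
  fi
  echo "Resource could not be created and does not exist" >&2
  return 1
}
```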

How does TCP proxy creation work?

railway_ensure_tcp_proxy() queries Railway's GraphQL API for existing proxies on the target service+port. If none exist, it creates one via the tcpProxyCreate mutation, then polls up to 30 times (2s base with jitter) until the proxy status reports ACTIVE.
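The polling half of that flow can be sketched without the GraphQL details. `poll_until_status` is a hypothetical helper, the status command is supplied by the caller, and a plain sleep stands in for the PR's sleep_with_jitter.

```shell
# Hypothetical helper (not from the PR): run a status command until it prints
# the wanted value, or give up after max_attempts.
# Usage: poll_until_status WANTED MAX_ATTEMPTS SLEEP_SECONDS command [args...]
poll_until_status() {
  local wanted="$1" max_attempts="$2" sleep_seconds="$3"
  shift 3
  local attempt=1 status
  while [ "${attempt}" -le "${max_attempts}" ]; do
    # A failing status command is treated like "not ready yet".
    status="$("$@" 2>/dev/null || true)"
    if [ "${status}" = "${wanted}" ]; then
      return 0
    fi
    if [ "${attempt}" -lt "${max_attempts}" ]; then
      sleep "${sleep_seconds}"  # the PR's scripts add jitter at this point
    fi
    attempt=$((attempt + 1))
  done
  echo "Timed out after ${max_attempts} attempts waiting for ${wanted}" >&2
  return 1
}
```

With the numbers quoted above, a call would look like `poll_until_status ACTIVE 30 2 some_status_command`, where `some_status_command` is a placeholder for the proxy status query.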

provision-railway.sh · common.sh


smoke-preview.sh — end-to-end auth smoke test with UI-aware health checks

Before: Smoke tests only verified URL health checks (HTTP 200 on API health + UI endpoints).
After: A dedicated wait_for_ui_url function accepts redirect and auth-challenge status codes (301/302/307/308/401/403) as valid "alive" signals. A new run_preview_auth_smoke function signs in via POST /api/auth/sign-in/email, validates the better-auth.* session cookie, and uses it to call GET /manage/tenants/{tenant_id}/projects — confirming the full auth flow works.
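The status-code rule for the UI health check can be sketched as a single case statement. `ui_status_is_alive` is a hypothetical name, and treating 200 as alive alongside the listed codes is an assumption based on the earlier behaviour.

```shell
# Hypothetical helper (not from the PR): decide whether an HTTP status code
# counts as "the UI is up". Redirects and auth challenges are accepted because
# preview protection can answer before the app does.
ui_status_is_alive() {
  case "$1" in
    200|301|302|307|308|401|403) return 0 ;;
    *) return 1 ;;
  esac
}
```

The status itself could come from something like `curl -sS -o /dev/null -w '%{http_code}' "<ui-url>"`.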

All error output is piped through redact_preview_logs to prevent secrets from leaking into CI logs on failure.
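A redaction filter of this kind can be sketched with a few sed substitutions. This is an illustration only: `redact_sketch` and its patterns are assumptions, and the real redact_preview_logs in common.sh covers more cases.

```shell
# Hypothetical filter (not the PR's redact_preview_logs): read log text on
# stdin and mask three kinds of secrets before it reaches CI output.
redact_sketch() {
  sed -E \
    -e 's#(postgres(ql)?://)[^[:space:]"]+#\1[REDACTED]#g' \
    -e 's#(Bearer )[A-Za-z0-9._~+/=-]+#\1[REDACTED]#g' \
    -e 's#(better-auth\.[A-Za-z0-9_.-]+=)[^;[:space:]]+#\1[REDACTED]#g'
}
```

It is used as the last stage of a pipeline, for example `cat "${sign_in_body}" | redact_sketch >&2`, so the cookie name and URL scheme stay visible for debugging while the values do not.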

smoke-preview.sh · capture-preview-failure-diagnostics.sh


preview-environments.yml — workflow orchestration with deferred-failure diagnostics

Before: The workflow had no auth bootstrap step and smoke failures produced minimal diagnostic output.
After: A bootstrap-preview-auth job runs between provisioning and Vercel configuration. The smoke job uses continue-on-error so that on failure, capture-preview-failure-diagnostics.sh and fetch-vercel-runtime-logs.sh always execute before a final step fails the job.

This deferred-failure pattern ensures rich post-mortem data is always captured — including authenticated sign-in attempts and Vercel runtime logs — regardless of which smoke check failed.
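The PR implements this in the workflow with continue-on-error; the same idea expressed inside a single shell script might look like the sketch below. `run_with_deferred_failure` and both command arguments are hypothetical.

```shell
# Hypothetical helper (not from the PR): run the smoke command, always give the
# diagnostics command a chance to run on failure, then surface the original
# smoke exit status.
run_with_deferred_failure() {
  local smoke_cmd="$1" diagnostics_cmd="$2" status=0
  "${smoke_cmd}" || status=$?
  if [ "${status}" -ne 0 ]; then
    # A diagnostics failure must not hide the smoke failure.
    "${diagnostics_cmd}" || echo "Diagnostics capture failed" >&2
  fi
  return "${status}"
}
```

Returning the smoke status rather than the diagnostics status keeps the failure attributed to the check that actually broke.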

preview-environments.yml · smoke-preview.sh

@claude claude bot left a comment

PR Review Summary

(4) Total Issues | Risk: Low

🟠⚠️ Major (1) 🟠⚠️

🟠 1) common.sh · provision-railway.sh Retry loops without jitter risk thundering herd

files:

  • .github/scripts/preview/common.sh:71-88 (railway_extract_runtime_var)
  • .github/scripts/preview/common.sh:230-251 (railway_ensure_tcp_proxy polling)
  • .github/scripts/preview/provision-railway.sh:159-173 (extract_runtime_var)

Issue: All retry loops use fixed sleep intervals (2 seconds) without any randomized jitter. This pattern appears in three locations across two files.

Why: When multiple concurrent PR preview environments experience Railway variable resolution delays or TCP proxy provisioning slowdowns, they'll all retry at synchronized intervals. This creates a thundering herd pattern that can overwhelm Railway's API and cause cascading timeouts. The risk is amplified when Railway experiences any degradation, as all pending PRs will hammer the API in lockstep.

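Fix: Add randomized jitter to sleep intervals. The Amazon Builders' Library recommends jitter for retry loops:

# Add jitter: base_sleep * (0.5 to 1.5). Seed awk from bash's $RANDOM: a bare srand()
# seeds from the clock, so jobs started in the same second would draw the same jitter.
jittered_sleep=$(awk -v base="${sleep_seconds}" -v seed="${RANDOM}" 'BEGIN {srand(seed); print base * (0.5 + rand())}')
sleep "${jittered_sleep}"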

Consider extracting a shared retry_with_jitter helper function in common.sh to ensure consistency across all retry loops.

Inline Comments:

  • 🟠 Major: common.sh:85-88 Retry without jitter causes thundering herd

🟡 Minor (3) 🟡

🟡 1) smoke-preview.sh:86-87 · smoke-preview.sh:107-108 Error output not redacted

Issue: On sign-in and manage auth check failures, raw response bodies are output without passing through redact_preview_logs.
Why: Auth error responses may contain diagnostic data that shouldn't appear in CI logs. The redact_preview_logs function is used elsewhere (diagnostics script, header output) but not consistently here.
Fix: Pipe through redaction: cat "${sign_in_body}" | redact_preview_logs >&2
Refs: redact_preview_logs

🟡 2) common.sh:216-227 TCP proxy mutation discards error response

Issue: The tcpProxyCreate mutation response is redirected to /dev/null, discarding any error information.
Why: If the mutation fails (quota exceeded, invalid parameters), error details are lost. The polling loop will timeout after ~60s without context.
Fix: Capture the mutation response and check for GraphQL errors before entering the polling loop.
Refs: common.sh:216-227

🟡 3) provision-railway.sh:195 Masking order allows brief log exposure window

Issue: The mask_env_vars call occurs after variables are extracted. If extraction fails mid-way with verbose logging, values could appear before masking takes effect.
Why: GitHub Actions ::add-mask:: only redacts values from log output produced after the mask command runs.
Fix: Mask immediately after each variable assignment, or ensure error paths don't log the raw values.
Refs: GitHub Actions: Masking a value in log
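Masking each value as soon as it is assigned could be sketched like this. `mask_value` is a hypothetical helper; `::add-mask::` is the GitHub Actions workflow command the finding refers to.

```shell
# Hypothetical helper (not from the PR): register a value with the GitHub
# Actions log masker. Empty values are skipped so no empty mask is emitted.
# Note: a multi-line secret would need one add-mask command per line.
mask_value() {
  local value="$1"
  if [ -n "${value}" ]; then
    printf '::add-mask::%s\n' "${value}"
  fi
}
```

Calling `mask_value "${extracted}"` on the line directly after each extraction closes the window in which a later failure could print the raw value.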

Inline Comments:

  • 🟡 Minor: smoke-preview.sh:85-87 Error output should be redacted
  • 🟡 Minor: smoke-preview.sh:106-108 Error output should be redacted
  • 🟡 Minor: common.sh:216-227 TCP proxy creation mutation discards error response

💭 Consider (2) 💭

💭 1) bootstrap-preview-auth.sh:48-54 Add explicit timeouts to migration/auth init
Issue: pnpm db:run:migrate and pnpm db:auth:init run without explicit timeouts.
Why: If a database connection hangs, the job waits until the 20-minute job timeout. Explicit command timeouts provide faster feedback.
Fix: Wrap with timeout 300 pnpm db:run:migrate and timeout 180 pnpm db:auth:init.
Refs: bootstrap-preview-auth.sh:48-54

💭 2) common.sh:107-121 Railway GraphQL API lacks retry on transient failures
Issue: railway_graphql makes a single HTTP request without retry logic. Transient 5xx or network errors cause immediate failure.
Why: Railway's GraphQL API may experience momentary availability issues, causing unnecessary preview provisioning failures.
Fix: Add 2-3 retry attempts for 5xx and connection errors with exponential backoff.
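A retry wrapper along these lines might look like the sketch below. `retry_with_backoff` is a hypothetical helper, delays are whole seconds, and the wrapped command is assumed to return non-zero only for failures worth retrying (the sketch does not inspect HTTP status codes itself).

```shell
# Hypothetical helper (not from the PR): retry a command with a delay that
# doubles after every failed attempt.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_with_backoff() {
  local max_attempts="$1" base_delay="$2"
  shift 2
  local attempt=1 delay="${base_delay}"
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "Command failed after ${attempt} attempts" >&2
      return 1
    fi
    sleep "${delay}"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 3 1 some_request_command` would wait 1s and then 2s between its three attempts, where `some_request_command` is a placeholder.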

Inline Comments:

  • 💭 Consider: bootstrap-preview-auth.sh:48-54 Add explicit timeouts to migration and auth init commands

💡 APPROVE WITH SUGGESTIONS

Summary: This PR achieves its goals of making preview login deterministic and removing insecure preview auth defaults. The implementation is well-structured with good secret masking, SpiceDB key validation, and comprehensive smoke testing. The main area for improvement is adding jitter to retry loops to prevent thundering herd issues when multiple PRs are being provisioned concurrently. The minor items around error output redaction are straightforward fixes that improve consistency.

Discarded (4)
| Location | Issue | Reason Discarded |
| --- | --- | --- |
| workflow:185,229,269,373 | Railway CLI installed via npm without integrity verification | Low risk: version is pinned, and Railway doesn't provide official checksums. Standard practice for CLI tools in CI. |
| workflow:212 | bootstrap-preview-auth job has no outputs declaration | Inconsistent but not blocking: the job doesn't need to pass data downstream currently. |
| bootstrap-preview-auth.sh | No cleanup on partial failure during auth bootstrap | Migrations are generally idempotent and auth init is idempotent. Re-running the job should recover. |
| smoke-preview.sh:16 | Health check retry without jitter | Lower priority: polls the preview environment itself, not a shared external API. Less likely to cause issues. |

Reviewers (3)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-devops | 7 | 3 | 0 | 0 | 3 | 0 | 4 |
| pr-review-sre | 11 | 1 | 2 | 0 | 2 | 0 | 0 |
| pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 18 | 4 | 2 | 0 | 5 | 0 | 4 |

Note: pr-review-standards returned no findings — shell scripts passed correctness checks. Duplicates between reviewers were merged (SRE and DevOps both flagged the redaction issues).

Comment on lines +85 to +88
if [ "${attempt}" -lt "${max_attempts}" ]; then
sleep "${sleep_seconds}"
fi
done
🟠 MAJOR: Retry without jitter causes thundering herd

Issue: The retry loops in railway_extract_runtime_var and railway_ensure_tcp_proxy use fixed sleep intervals without jitter.

Why: When multiple concurrent PRs experience Railway variable resolution delays, they'll all retry at synchronized intervals, creating a thundering herd pattern that can overwhelm Railway's API and cause cascading timeouts.

Fix: Add randomized jitter to the sleep duration:

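    if [ "${attempt}" -lt "${max_attempts}" ]; then
      # Add jitter: sleep_seconds * (0.5 to 1.5). Seed awk from bash's $RANDOM: a bare
      # srand() seeds from the clock, so jobs started in the same second would match.
      jittered_sleep=$(awk -v base="${sleep_seconds}" -v seed="${RANDOM}" 'BEGIN {srand(seed); print base * (0.5 + rand())}')
      sleep "${jittered_sleep}"
    fi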

Comment on lines +85 to +87
echo "Preview sign-in failed with status ${sign_in_status}" >&2
cat "${sign_in_body}" >&2
return 1
🟡 Minor: Error output should be redacted

Issue: On sign-in failure, the raw response body is output without redaction. Auth error responses may contain diagnostic data that shouldn't appear in CI logs.

Why: The redact_preview_logs function is used elsewhere (e.g., diagnostics script, line 92 for headers) but not here for the body.

Fix:

Suggested change
- echo "Preview sign-in failed with status ${sign_in_status}" >&2
- cat "${sign_in_body}" >&2
- return 1
+ if [ "${sign_in_status}" != "200" ]; then
+ echo "Preview sign-in failed with status ${sign_in_status}" >&2
+ cat "${sign_in_body}" | redact_preview_logs >&2

Comment on lines +106 to +108
echo "Preview manage auth check failed with status ${manage_status}" >&2
cat "${manage_body}" >&2
return 1
🟡 Minor: Error output should be redacted

Issue: On manage auth check failure, the raw response body is output without redaction.

Why: Consistent redaction prevents accidental exposure of auth-related diagnostics in CI logs.

Fix:

Suggested change
- echo "Preview manage auth check failed with status ${manage_status}" >&2
- cat "${manage_body}" >&2
- return 1
+ if [ "${manage_status}" != "200" ]; then
+ echo "Preview manage auth check failed with status ${manage_status}" >&2
+ cat "${manage_body}" | redact_preview_logs >&2

Comment on lines +216 to +227
railway_graphql "$(cat <<EOF
mutation {
tcpProxyCreate(input: {
environmentId: "${env_id}"
serviceId: "${service_id}"
applicationPort: ${application_port}
}) {
id
}
}
EOF
)" >/dev/null
🟡 Minor: TCP proxy creation mutation discards error response

Issue: The tcpProxyCreate mutation response is redirected to /dev/null, discarding any error information.

Why: If the mutation fails (e.g., quota exceeded, invalid parameters), the error details are lost. The subsequent polling loop will timeout after ~60s, but without the creation error context, debugging becomes difficult.

Fix: Capture the mutation response and check for GraphQL errors:

  if [ "${count}" = "0" ]; then
    mutation_response="$(railway_graphql "$(cat <<EOF
mutation {
  tcpProxyCreate(input: {
    environmentId: "${env_id}"
    serviceId: "${service_id}"
    applicationPort: ${application_port}
  }) {
    id
  }
}
EOF
)")"
    if echo "${mutation_response}" | jq -e '.errors' >/dev/null 2>&1; then
      echo "Failed to create TCP proxy: $(echo "${mutation_response}" | jq -r '.errors[0].message')" >&2
      return 1
    fi
  fi

Comment on lines +48 to +54
echo "::group::Run preview runtime migrations"
pnpm db:run:migrate
echo "::endgroup::"

echo "::group::Initialize preview auth"
pnpm db:auth:init
echo "::endgroup::"
💭 Consider: Add explicit timeouts to migration and auth init commands

Issue: The pnpm db:run:migrate and pnpm db:auth:init commands run without explicit timeouts. If a database connection hangs, the job will wait until the 20-minute job timeout.

Why: Database migrations during preview bootstrapping may encounter transient connection issues to Railway-hosted Postgres. Explicit timeouts provide faster feedback and clearer failure attribution.

Fix: Consider wrapping with explicit timeouts:

echo "::group::Run preview runtime migrations"
timeout 300 pnpm db:run:migrate
echo "::endgroup::"

echo "::group::Initialize preview auth"
timeout 180 pnpm db:auth:init
echo "::endgroup::"

This is optional — the job timeout is a backstop, but explicit command timeouts give faster feedback.
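To make the failure attribution explicit, a thin wrapper can report when the limit was hit. This is a sketch under the assumption that GNU coreutils `timeout` is available on the runner (it exits with status 124 when the limit is reached); `run_with_timeout` is a hypothetical helper.

```shell
# Hypothetical helper (not from the PR): run a command under a time limit and
# say so explicitly when the limit, rather than the command, caused the failure.
# Usage: run_with_timeout SECONDS command [args...]
run_with_timeout() {
  local limit="$1" status=0
  shift
  timeout "${limit}" "$@" || status=$?
  if [ "${status}" -eq 124 ]; then
    echo "Timed out after ${limit}s: $*" >&2
  fi
  return "${status}"
}
```

With the values suggested above this would read `run_with_timeout 300 pnpm db:run:migrate` and `run_with_timeout 180 pnpm db:auth:init`.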

@github-actions github-actions bot deleted a comment from claude bot Mar 19, 2026
@claude claude bot left a comment

PR Review Summary

(0) Total Issues | Risk: Low

Re-review Summary

This is a delta review following up on the prior automated review. The new commit cca2bba9c ci: harden preview retry and error logging addresses all actionable findings from the previous review:

| Prior Finding | Status |
| --- | --- |
| 🟠 Major: Retry loops without jitter risk thundering herd | Fixed: added sleep_with_jitter() helper using randomized multiplier (0.5–1.5×), applied to all three retry loops |
| 🟡 Minor: Error output not redacted in smoke-preview.sh | Fixed: both error paths now pipe through redact_preview_logs |
| 🟡 Minor: TCP proxy mutation discards error response | Fixed: response is now captured and checked for GraphQL errors before polling |

Delta Changes Analysis

The commit introduces a clean sleep_with_jitter() function that uses Python's random module to generate jitter in the range [0.5 × base, 1.5 × base):

sleep_with_jitter() {
  local sleep_seconds="$1"
  jittered_sleep="$(python3 - <<PY
import random
base = float(${sleep_seconds})
print(base * (0.5 + random.random()))
PY
  )"
  sleep "${jittered_sleep}"
}

This is correctly applied to:

  • railway_extract_runtime_var (common.sh:101)
  • railway_ensure_tcp_proxy polling loop (common.sh:271)
  • extract_runtime_var in provision-railway.sh (line 170)
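
For readers unfamiliar with the pattern, a retry loop that sleeps for a jittered delay between attempts could look like the sketch below. This is not the PR's code: the `retry_with_jitter` and `jittered_delay_ms` names are hypothetical, and the jitter is computed with bash's `$RANDOM` instead of python3, which is a swapped-in technique chosen to keep the example dependency-free.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the PR's code): retry a command with a jittered
# delay between attempts, computing the jitter in pure bash with $RANDOM.
jittered_delay_ms() {
  local base_ms="$1"
  # Integer in [0.5 * base, 1.5 * base); assumes 1 <= base_ms <= 32767
  # because $RANDOM only yields values from 0 to 32767.
  echo $(( base_ms / 2 + RANDOM % base_ms ))
}

retry_with_jitter() {
  local max_attempts="$1" base_ms="$2"
  shift 2
  local attempt delay_ms
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      return 0
    fi
    if (( attempt < max_attempts )); then
      delay_ms="$(jittered_delay_ms "${base_ms}")"
      # Fractional seconds are accepted by GNU and BSD sleep.
      sleep "$(printf '%d.%03d' $(( delay_ms / 1000 )) $(( delay_ms % 1000 )))"
    fi
  done
  return 1
}
```

For example, `retry_with_jitter 5 2000 curl --fail --silent "${URL}"` would try up to five times with a delay between 1 and 3 seconds; `URL` here is a placeholder.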

Remaining Optional Improvements

The following items from the prior review remain unaddressed but are not blocking:

| Item | Reason Not Blocking |
|---|---|
| 💭 Add explicit timeouts to migration/auth init | Job timeout (20 min) provides backstop; explicit timeouts are nice-to-have |
| 💭 Railway GraphQL API lacks retry on transient failures | Single failures will fail the job which can be re-run; retry complexity not justified |
| 🟡 Masking order allows brief log exposure window | Low practical risk: values are masked before use; error paths don't log raw values |
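
On the masking-order item, the usual way to close the exposure window is to register a value with the log masker as the very first thing done with it. GitHub Actions provides the `::add-mask::` workflow command for this. The sketch below is illustrative only: the `mask_value` helper is hypothetical, and it masks each line separately on the assumption that a value may span several lines.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: register a value with the GitHub Actions log masker
# before any other command can print it. `::add-mask::` is the documented
# workflow command; the helper name is illustrative only.
mask_value() {
  local line
  # Mask line by line so multi-line values are covered; skip empty lines.
  printf '%s\n' "$1" | while IFS= read -r line; do
    if [ -n "${line}" ]; then
      echo "::add-mask::${line}"
    fi
  done
}
```

The intended order is to fetch the value into a variable, call `mask_value "${value}"` immediately, and only then export it or pass it to other commands.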

✅ APPROVE

Summary: All actionable feedback from the prior review has been addressed. The jitter implementation follows AWS best practices for retry loops, error output is now properly redacted, and TCP proxy creation errors are captured. This PR is ready to merge.

Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| orchestrator-delta | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 3 | 0 | 0 | 0 | 0 | 0 | 0 |

Note: Delta review — all prior findings resolved by commit cca2bba9c.

@claude claude bot left a comment

PR Review Summary

(0) Total Issues | Risk: Low

Re-review Summary

This is a delta review following up on the prior APPROVE.

Changes since last review:

| Commit | Description |
|---|---|
| 0b1d393ad | Merge branch 'main' into varun/preview-auth-bootstrap |

The only change is a clean merge from main that incorporates 7 commits, including:

  • d83ba0bb4 — Update pullfrog + dependabot config
  • 65c151d20 — Adding app ID to runAuth middleware
  • b396a8878 — Fix scheduled trigger invocations being skipped
  • fddbd38ac — Make OpenTelemetry startup idempotent
  • c71e1eabe — Version Packages
  • b1e6ced4f — SSO configuration and auth method management

Delta Analysis

| Check | Status |
|---|---|
| Preview auth files unchanged | ✅ No changes to .github/scripts/preview/ or preview-environments.yml |
| Merge conflicts | ✅ Clean merge, no conflicts |
| Integration impact | ✅ Merged commits do not touch preview environment functionality |

The PR's core implementation (preview auth bootstrap, smoke tests, Railway TCP proxies, jitter retry loops, secret redaction) remains exactly as approved in the previous review.

Prior Feedback Status

All findings from the initial review were addressed in commit cca2bba9c:

| Prior Finding | Status |
|---|---|
| 🟠 Retry loops without jitter | ✅ Fixed: sleep_with_jitter() applied to all retry loops |
| 🟡 Error output not redacted | ✅ Fixed: Both error paths pipe through redact_preview_logs |
| 🟡 TCP proxy mutation discards error | ✅ Fixed: Response captured and checked for GraphQL errors |
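
The TCP proxy fix matters because a GraphQL server can return HTTP 200 with an `errors` array in the body, so checking the HTTP status alone is not enough. A minimal sketch of such a check follows. The `graphql_response_ok` helper is hypothetical and not taken from the PR, and it uses python3 for JSON parsing on the assumption that python3 is available, as it is for sleep_with_jitter().

```shell
#!/usr/bin/env bash
# Hypothetical sketch: fail when a GraphQL response body carries a non-empty
# "errors" array, which can happen even when the HTTP status is 200.
# Exit codes: 0 = no errors, 1 = GraphQL errors present, 2 = unexpected body.
graphql_response_ok() {
  local response="$1"
  python3 -c '
import json, sys
try:
    body = json.loads(sys.argv[1])
except ValueError:
    sys.exit(2)  # body is not valid JSON
if not isinstance(body, dict):
    sys.exit(2)  # a GraphQL response should be a JSON object
errors = body.get("errors")
if errors:
    # Print the errors for the job log; callers may want to redact this.
    print(json.dumps(errors), file=sys.stderr)
    sys.exit(1)
' "${response}"
}
```

A caller would capture the mutation response into a variable, run `graphql_response_ok "${response}"`, and start polling only when it succeeds.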

✅ APPROVE

Summary: The merge from main is clean and does not affect the preview auth implementation. All actionable feedback from prior reviews was addressed. This PR is ready to merge.

Note: Unable to submit formal approval due to GitHub App permissions — this is a COMMENT review with approval recommendation.

Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| orchestrator-delta | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Note: Delta review of merge commit 0b1d393ad — no PR-specific changes since last approval.

itoqa bot commented Mar 20, 2026

Ito Test Report ✅

20 test cases ran. 20 passed.

Across 7 report shards, all 20 of 20 test cases passed with no failures, confirming preview readiness (UI reachable and /health correctly returning 204 No Content), successful seeded-admin login/session behavior, and proper protected-route access when authenticated. The most important findings are that auth security and resilience controls held under adversarial and edge conditions—unauthenticated, forged, malformed, injection, origin-spoofed, tenant-tampered, burst, open-redirect, cross-context, mobile, double-submit, and refresh/history scenarios all failed closed or recovered safely without secret/session leakage or instability.

✅ Passed (20)
| Category | Summary | Screenshot |
|---|---|---|
| Adversarial | Forged Origin request was rejected with 403 INVALID_ORIGIN and did not issue auth cookies, satisfying rejection/constrained-session expectation. | ADV-1 |
| Adversarial | Rapid invalid burst remained controlled at 401 without service destabilization, and a follow-up valid sign-in succeeded normally with session issuance. | ADV-2 |
| Adversarial | SQL-like, script-like, and unicode-control payloads were safely rejected with 4xx responses, with no cookie issuance, no 5xx errors, and no secret leakage indicators. | ADV-3 |
| Adversarial | A forged session cookie alone was rejected with HTTP 401 on the protected route. | ADV-4 |
| Adversarial | Isolated contexts behaved correctly: authenticated Context A accessed protected data (200) while fresh Context B remained unauthorized (401). | ADV-5 |
| Adversarial | External returnUrl open-redirect attempt was neutralized; login ended on a same-origin internal route. | ADV-6 |
| Adversarial | All negative/error auth scenarios returned controlled 4xx/401 responses without leaking secret material or session token values; forged Origin was rejected (403). | ADV-7 |
| Edge | Wrong-password sign-in returned 401 auth error and did not issue better-auth cookies; malformed/injection companion checks stayed controlled errors without secret leakage. | EDGE-1 |
| Edge | Authenticated request to nonexistent tenant endpoint returned 403 and did not expose default tenant project data. | EDGE-2 |
| Edge | Rapid double-click submit authenticated once and stabilized on projects without auth-loop/crash indicators. | EDGE-3 |
| Edge | At 390x844, login controls were visible/operable and successful sign-in led to a readable projects page. | EDGE-4 |
| Edge | Reload plus history navigation during login did not deadlock; app remained recoverable on /login/authenticated routes. | EDGE-5 |
| Edge | Anonymous deep-link redirected to login with returnUrl, and successful login returned user to protected /default/projects route. | EDGE-6 |
| Edge | Malformed auth inputs consistently returned controlled 4xx validation/auth errors, with no session cookie issuance and no secret leakage in inspected headers/bodies. | EDGE-7 |
| Happy-path | UI root at http://localhost:3000 loaded successfully without browser hard-fail, and readiness endpoint was reachable (HTTP 204). | ROUTE-1 |
| Happy-path | API health check behavior is expected: GET /health returns 204 No Content with an empty body, matching source and verification evidence. | ROUTE-2 |
| Happy-path | From an anonymous deep-link redirect to /login, email+password sign-in with seeded admin succeeded and landed in authenticated projects UI. | ROUTE-3 |
| Happy-path | Direct Better Auth email sign-in returned 200 and issued required better-auth session cookies. | ROUTE-4 |
| Happy-path | Protected manage projects API call with valid Better Auth cookie returned HTTP 200 and JSON listing payload. | ROUTE-5 |
| Happy-path | Anonymous request to protected manage projects endpoint returned 401 Unauthorized and did not leak project data. | ROUTE-6 |
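
The ROUTE-4 check (sign in by email, then confirm a session cookie was issued) can be reproduced by hand with curl. The sketch below is not the PR's smoke-preview.sh. It assumes the caller sets `API_URL`, `PREVIEW_EMAIL`, and `PREVIEW_PASSWORD` (all hypothetical names), and that the session cookie name contains `better-auth.session_token`, optionally with a `__Secure-` prefix; the actual cookie name depends on the deployment's Better Auth configuration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the sign-in check (ROUTE-4); not the PR's script.
# Assumptions: API_URL, PREVIEW_EMAIL and PREVIEW_PASSWORD are set by the
# caller, and the session cookie is named better-auth.session_token.

# Succeeds if the response headers on stdin set a Better Auth session cookie.
has_session_cookie() {
  grep -qiE '^set-cookie:[[:space:]]*(__Secure-)?better-auth\.session_token='
}

# Posts credentials to the email sign-in route and checks the response headers.
# The JSON body is built by interpolation, so this assumes the credentials
# contain no double quotes or backslashes.
sign_in_and_check() {
  local headers
  headers="$(curl --silent --show-error --max-time 30 \
    --dump-header - --output /dev/null \
    --header 'Content-Type: application/json' \
    --data "{\"email\":\"${PREVIEW_EMAIL}\",\"password\":\"${PREVIEW_PASSWORD}\"}" \
    "${API_URL}/api/auth/sign-in/email")" || return 1
  printf '%s\n' "${headers}" | has_session_cookie
}
```

Running `sign_in_and_check` against a preview API should succeed for the seeded admin and fail for a wrong password, matching ROUTE-4 and EDGE-1 in the table.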

Commit: 0b1d393


@amikofalvy amikofalvy added this pull request to the merge queue Mar 20, 2026
Merged via the queue into main with commit 8da1e7e Mar 20, 2026
21 checks passed
@amikofalvy amikofalvy deleted the varun/preview-auth-bootstrap branch March 20, 2026 02:01
dimaMachina pushed a commit that referenced this pull request Mar 20, 2026
* ci: bootstrap preview auth

* ci: require secure preview auth config

* ci: recover preview auth runtime vars

* ci: install railway in preview bootstrap

* ci: provision preview db tcp proxies

* ci: proxy preview spicedb bootstrap

* ci: harden preview retry and error logging

---------

Co-authored-by: Andrew Mikofalvy <5668128+amikofalvy@users.noreply.github.com>