Pre-push gate race condition: 200s gate causes repeated push failures on active main branch #1821

@OAGr

Problem

The pre-push gate takes ~200 seconds to run (16 parallel checks, including build-data, TypeScript, tests, and linting). During that window, other agents frequently push to main, so the push is rejected with a stale-ref error. The agent must then run git reset --hard origin/main, re-merge, and re-run the full 200s gate. This cycle can repeat 5–6 times before a push succeeds, turning a single merge commit into a 20–30 minute ordeal.

Observed in: PR #1792 (release: 2026-03-06 #2) — resolving production→main merge conflicts required ~6 gate attempts, ~30 minutes total.

Root cause: The gate is a local pre-push hook. There is no mechanism to "reserve" a push slot. Any commit that another agent or human pushes to main during the gate window causes the push to be rejected.
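
To make the cost of the gate window concrete, here is a rough back-of-the-envelope model. It is a sketch, not a measurement: it assumes other pushes to main arrive independently at a constant rate (a Poisson process), which is an assumption of this example and not something the issue establishes. The function names are illustrative. Under that model, the chance that one gate run finishes with origin/main unchanged is exp(-rate × gate_seconds), and the expected number of runs is the inverse of that.

```python
import math


def push_success_probability(pushes_per_second: float, gate_seconds: float) -> float:
    """Chance that no other push lands during one gate run (Poisson assumption)."""
    return math.exp(-pushes_per_second * gate_seconds)


def expected_attempts(pushes_per_second: float, gate_seconds: float) -> float:
    """Mean number of gate runs until one finishes with origin/main unchanged."""
    return 1.0 / push_success_probability(pushes_per_second, gate_seconds)


def rate_from_observed_attempts(observed_attempts: float, gate_seconds: float) -> float:
    """Invert the model: the push rate that would explain an observed retry count."""
    return math.log(observed_attempts) / gate_seconds


# Figures taken from this issue: roughly 6 attempts with a ~200s gate.
rate = rate_from_observed_attempts(6, 200)

for gate in (200, 60, 15):
    attempts = expected_attempts(rate, gate)
    print(
        f"gate={gate:>3}s  expected attempts={attempts:.2f}  "
        f"expected gate time={attempts * gate:.0f}s"
    )
```

Under these assumptions the retry count grows exponentially with gate duration, so shortening the local gate helps far more than linearly: at the push rate implied by 6 attempts at 200s, a ~15s gate would need about 1.1 attempts on average.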

Options to Consider

1. Move the gate to CI only (remove pre-push hook)

  • Pro: Eliminates the race entirely. Push is instant. Gate runs as a GitHub Actions check.
  • Con: Failures are only visible after push; local feedback loop is slower.
  • Variant: Keep a lightweight ~10–15s local check (conflict markers and TypeScript, but no build-data), and move the expensive checks (build-data, tests) to CI only.
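
As an illustration of how cheap the conflict-marker part of a lightweight local check can be, here is a minimal sketch. It is not the gate's existing implementation; the function names and the dict-of-file-contents input are assumptions made for this example. It only flags lines that start with `<<<<<<< ` or `>>>>>>> `, and deliberately ignores a bare `=======` line because that can appear legitimately in Markdown.

```python
import re

# Lines that open or close an unresolved git conflict hunk.
# A bare "=======" is skipped on purpose to avoid false positives in Markdown.
CONFLICT_MARKER = re.compile(r"^(<{7} |>{7} )", re.MULTILINE)


def find_conflict_markers(text: str) -> list[int]:
    """Return 1-based line numbers that look like unresolved conflict markers."""
    return [
        text.count("\n", 0, match.start()) + 1
        for match in CONFLICT_MARKER.finditer(text)
    ]


def files_with_conflicts(files: dict[str, str]) -> dict[str, list[int]]:
    """Map each path to its marker lines, omitting files that are clean."""
    report = {}
    for path, content in files.items():
        lines = find_conflict_markers(content)
        if lines:
            report[path] = lines
    return report
```

A hook could call `files_with_conflicts` on the files changed in the commits being pushed and abort if the result is non-empty; how those files are collected is left out here.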

2. Gate result caching + fast re-push

  • The gate already caches by commit hash ("Gate already passed for this commit, fast mode").
  • Problem: A merge commit is a new commit with a different hash. If origin/main moves and we re-merge, the merge commit hash changes — cache miss, full re-run.
  • Potential fix: Cache by tree hash or content hash instead of commit hash. If the working tree is identical, skip re-running.
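
A minimal sketch of keying the cache on the tree hash instead of the commit hash. `git rev-parse HEAD^{tree}` is a standard git command that prints the hash of the tree HEAD points at; everything else (the `GateCache` class, its in-memory storage) is a hypothetical stand-in, since the issue does not say where the gate keeps its "already passed" records.

```python
import subprocess


def current_tree_hash(repo_dir: str = ".") -> str:
    """Hash of the tree that HEAD points at; identical content gives an identical hash."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD^{tree}"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


class GateCache:
    """In-memory stand-in for wherever the gate stores its pass records."""

    def __init__(self) -> None:
        self._passed: set[str] = set()

    def record_pass(self, tree_hash: str) -> None:
        self._passed.add(tree_hash)

    def already_passed(self, tree_hash: str) -> bool:
        return tree_hash in self._passed
```

With this key, two commits that differ only in metadata (message, parents, timestamp) but carry the same files would hit the cache. Note the limit of the approach: a re-merge after origin/main has moved usually produces a different tree, because it now includes the other agent's changes, so it would still miss.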

3. Optimistic push with CI enforcement

  • Push immediately (skip gate locally), let CI enforce quality.
  • Add branch protection: require all CI checks to pass before merging PRs.
  • Agents push fast; gate runs in CI; merges only happen when green.
  • Con: Main can have a briefly broken state between push and CI completing.

4. Gate triage / skip unchanged checks

  • Gate already has LLM-based triage that tries to skip irrelevant checks.
  • A merge commit that only resolves conflict markers (no code changes) could skip TypeScript, tests, etc.
  • Problem: Merge commits look like they touch all the conflicted files, making triage conservative.

5. Agent coordination / push queue

  • Add a "push lock" — agents check a shared file/DB entry before starting the gate, and claim a push slot.
  • Complex to implement; could deadlock if an agent crashes while holding the lock.
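
One common way to keep a crashed agent from holding a lock forever is to make the lock a lease that expires. The sketch below shows only that idea, under stated assumptions: it is in-memory and single-process, the class and method names are hypothetical, and a real shared file or DB entry would need an atomic compare-and-set, which this example does not provide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: str
    expires_at: float


class PushLock:
    """Push slot that frees itself once the holder's lease runs out."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._lease: Optional[Lease] = None

    def try_acquire(self, owner: str, now: float) -> bool:
        """Claim the slot unless another owner holds an unexpired lease."""
        lease = self._lease
        if lease is not None and lease.owner != owner and now < lease.expires_at:
            return False
        # Free, expired, or already ours: take (or renew) the lease.
        self._lease = Lease(owner, now + self.ttl_seconds)
        return True

    def release(self, owner: str) -> None:
        """Give the slot back; a no-op if this owner does not hold it."""
        if self._lease is not None and self._lease.owner == owner:
            self._lease = None
```

The TTL would need to exceed the gate duration (for example, a few minutes for a ~200s gate), otherwise a healthy agent could lose its slot mid-run. Passing `now` explicitly keeps the logic testable without real clocks.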

6. Reduce gate time

  • Build-data alone takes ~4–5 minutes (fetch from wiki-server, compute derived data, etc.).
  • For merge commits on main → production (no content changes), most of this is unnecessary.
  • The gate could skip build-data for commits that only touch .github/, crux/, or apps/wiki-server/ and contain no MDX/YAML changes.
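
The path rule above could be expressed as a small deterministic predicate. This is a sketch under assumptions: the function name is hypothetical, the three prefixes are copied from the bullet above, and treating `.yml` as YAML alongside `.yaml` is a guess. It reads the rule conservatively, so a workflow file under .github/ that ends in `.yml` still triggers build-data.

```python
from pathlib import PurePosixPath

# Directories named in this issue as safe to change without rebuilding data.
SKIPPABLE_PREFIXES = (".github/", "crux/", "apps/wiki-server/")
# Content file types that always require build-data (".yml" is an assumption).
CONTENT_SUFFIXES = {".mdx", ".yaml", ".yml"}


def needs_build_data(changed_paths: list[str]) -> bool:
    """Return True unless every changed path is a non-content file in a skippable directory."""
    for path in changed_paths:
        if PurePosixPath(path).suffix.lower() in CONTENT_SUFFIXES:
            return True
        if not path.startswith(SKIPPABLE_PREFIXES):
            return True
    return False
```

The changed paths would come from something like a diff between the commit being pushed and origin/main; that plumbing is omitted here.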

Immediate Impact

  • High-traffic periods (multiple agents active) make routine release PRs unreliable
  • The 200s gate is the primary bottleneck; everything else is fast
  • SSH connections also drop during the wait, requiring keep-alive workarounds

Recommendation

Likely best path: the Option 1 variant. Keep a fast local pre-push check (~15s: conflict markers + TypeScript), remove build-data and tests from the pre-push hook, and enforce those in CI. This shrinks the race window from ~200s to ~15s while preserving fast local feedback for the most common errors.

Recommended Model

Haiku — well-scoped for this model.

Acceptance Criteria

  • Solution chosen and documented
  • Gate push time or race frequency measurably reduced
  • No regression in catching real errors before push
