perf(staged): move PR polling to a backend scheduler and coalesce status events by matt2e · Pull Request #783 · block/builderbot

matt2e · 2026-06-12T02:48:48Z

Reworks PR-status polling to fix the main-thread jank that showed up when switching projects. The frontend used to own the polling cadence and fan out one refresh_all_pr_statuses call per project; this moves cadence, concurrency, and dedup into the backend and reduces the reactive churn a polling cycle causes on the client.

What changed

Coalesce status events + bound fetch concurrency (perf)

ProjectHome.svelte now buffers incoming pr-status-changed events and applies them in a single requestAnimationFrame flush, so a storm of N per-branch events coalesces into one reactive rebuild instead of N new Map(...) allocations + N derivation re-runs.
prs.rs splits the polling core out into refresh_project_pr_statuses, which fans per-branch gh fetches across a bounded semaphore pool (PR_REFRESH_CONCURRENCY = 6) instead of awaiting them serially. The semaphore is passed in so a single pool can be shared across all projects.

Backend PR-poll scheduler (refactor)

New pr_poll_scheduler.rs owns polling cadence and concurrency: it derives the project list from the DB, decides what is due across the pending/selected/background tiers, dedups in-flight work, backs off failures, and runs everything on one bounded pool. Spawned once the store is ready; the interest/hint commands are managed unconditionally so they resolve even before the store exists.
prPollingService.ts becomes a thin shim that forwards UI interest to the backend as hints (selected project, pending checks, window focus, explicit refresh nudges) and re-broadcasts the backend's refresh/stale lifecycle events, preserving the onRefreshing / onStale / isRefreshing API that BranchCardPrButton relies on.
App.svelte drops the app-side project-list $effect and wires prPollingService.init() / dispose() into the component lifecycle.

Per-client interest (refactor)

Interest is tracked per connected client and unioned across clients. Each hint carries a clientId (tauri-main for native, a per-tab UUID for web), and disconnect evicts a client's interest. New interest/hint commands: set_foreground_project, set_focus, set_branch_pending, refresh_now, disconnect_client.

Latency instrumentation used during the investigation was added and then removed; only the PR-poll rework remains in the net diff.

…ncy, yield background polling (plan phase 0) Phase 0 of moving PR polling into a backend-owned scheduler. Three self-contained changes that fix the intermittent project-switch freeze caused by a PR-polling storm starving the renderer's event loop: - ProjectHome.svelte: buffer pr-status-changed events and apply a single branchesByProject rebuild per animation frame instead of one Map allocation per branch; the pending flush is cancelled on destroy. A storm of N branch events now collapses into one reactive flush. - prs.rs: fan the per-branch fetches in refresh_all_pr_statuses out across a Semaphore-bounded pool (cap 6) instead of awaiting them serially, so a project resolves in ~1 round-trip's wall-clock instead of N. Per-branch and final emissions, error handling, and return value are unchanged — only the scheduling is now parallel. - prPollingService.ts: yield to the event loop between projects in the poll() drain loop, and hold background-tier polling for a short cooldown after setSelectedProject so a switch can interleave instead of waiting out the serial chain. Interval tiers, focus/blur gating, failure backoff, and the due logic are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matt Toohey <contact@matttoohey.com>

…end scheduler (plan phase 1) Phase 1 of moving PR polling into a backend-owned scheduler. Makes the backend the single owner of polling cadence and concurrency and collapses `prPollingService.ts` into a thin interest/hint layer. Behavior-preserving: the same effective tiers and triggers, now driven from the backend. Single frontend only — no multi-client registry (that is Phase 2). Backend: - New `pr_poll_scheduler` module: one long-lived `tokio::time::interval` tick loop (granularity 5s, well below the 15s pending tier). It owns the poll-state that used to live in the frontend — per-project `last_polled_at`, the three interval tiers (15s pending / 60s selected / 5m background, moved verbatim from the frontend), and failure/stale tracking. Poll-state is not persisted: on restart everything is "due" (per the plan's open question). On each tick it derives the project list from the DB (`list_projects`), computes which projects are due from owned state + interest, dedups in-flight work, and spawns refreshes. Interest changes and `refresh_now` nudges wake the loop immediately via a `Notify`. - All due/dedup/backoff decisions live in a pure `PollState` whose methods take an explicit `now` (ms) and in-flight set — the injectable seam that makes the logic unit-testable without `gh` calls or wall-clock sleeps. The loop is the only place that touches the real clock and fetcher. 7 unit tests cover the three tiers, the focus=off pause, in-flight dedup, `refresh_now` folding into dedup, the stale threshold, failure backoff, and pruning. - Reuses Phase 0's bounded concurrency: `refresh_all_pr_statuses` is split so its semaphore-bounded fan-out lives in a reusable `refresh_project_pr_statuses` core that both the command and the scheduler call. The scheduler passes one shared `Semaphore` (cap `PR_REFRESH_CONCURRENCY` = 6) across all projects it refreshes, so a focus-regain / launch burst still caps total `gh` subprocesses — the bounded pool is the herd control (explicit time-jitter was left out to keep the due logic pure/testable; a possible follow-up). - New interest/hint commands, registered in the invoke handler: `set_foreground_project`, `set_focus`, `set_branch_pending`, `refresh_now`. The effective tier for a project is the union of interest (any foregrounding => selected; any pending => fast; nothing focused => pause), structured as a union so Phase 2 can extend it. - The scheduler is managed unconditionally (so the hint commands resolve even during the needs-reset prompt) and the tick loop is spawned in setup's `Ok` branch alongside `background_sync`. - Per-branch `pr-status-changed` and final `pr-statuses-refreshed` events are unchanged. Two small events replace state the frontend used to compute in its own poll loop: `pr-refresh-state` (per-project refreshing on/off) and `pr-status-stale` (crossed/recovered the failure threshold). Frontend (`prPollingService.ts` + callers): - Removed the timer-driven drain loop, the interval tiers, the `getProjectsDue` math, and the failure backoff — these now live in the backend. Phase 0's yield helper and switch cooldown went with them (the foreground serial chain they mitigated no longer exists). - Rewired the public methods to backend commands, preserving the API callers use: `setSelectedProject` -> `set_foreground_project`, `updateChecksStatus` -> `set_branch_pending`, `refreshNow` -> `refresh_now`, focus/blur listeners -> `set_focus`. `setProjects` was removed (the backend derives the project list from the DB); `App.svelte`'s project-list `$effect` and its now-unused import were dropped accordingly. - `onStale` / `onRefreshing` / `isRefreshing` are unchanged for callers but are now driven by the new backend lifecycle events, subscribed in a new `init()` (with a matching `dispose()`), wired from `App.svelte`'s onMount/onDestroy. Verification: `cargo check` clean; `cargo test pr_poll_scheduler` 7/7 pass (291 existing tests unaffected); `pnpm check` 0 errors / 0 warnings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Matt Toohey <contact@matttoohey.com>

…disconnect eviction (plan phase 2) Phase 2 of the PR-poll-scheduler rearchitecture. Turns the scheduler's process-global, single-valued interest into per-connected-client interest, with the effective cadence computed as the union across all clients and per-client eviction on disconnect. Clients are the native Tauri window plus each WebSocket browser session (web path lands in parallel). Per-client data model + union cadence (pr_poll_scheduler.rs): - New `ClientInterest { foreground_project, focused, pending_branches, last_seen }`. `PollState` drops the scalar `foreground_project`/`focused`/ `pending_branches` and gains `clients: HashMap<String, ClientInterest>`. The per-project *work* bookkeeping (`last_polled_at`/`failures`/`stale`/ `forced`) stays project-keyed and shared — it is about the work, not the observer, so N clients still trigger only one poll per project per tier. - Union helpers keep `PollState` clock-free (all `now` passed in): `any_focused()`, `is_foreground(p)`, and `project_has_pending(p)` rewritten over `clients`. `interval_for` keeps its precedence (any pending => 15s; else foregrounded => 60s; else 5m); `due` swaps `!self.focused` for `!self.any_focused()`. `forced`/dedup/interval logic is untouched. - Client-keyed setters bump `last_seen`: `set_foreground`/`set_focus`/ `set_branch_pending(client_id, .., now)`. `force(project_id)` stays global. Lifecycle + TTL: - New `touch(client_id, now)` (create-or-bump heartbeat), `disconnect_client` (clean WS close), and `evict_stale_clients(now, ttl_ms)` (dirty-drop fallback). `CLIENT_TTL_MS = 90_000` (~3x an expected <=30s WS keepalive); the tick loop calls `evict_stale_clients` inside the existing locked block before `prune`. `TAURI_CLIENT_ID = "tauri-main"` is exempt from TTL eviction (process death is the native window's teardown). - `prune` keeps project-keyed pruning and now also clears each client's foreground/pending interest in dead projects; client lifecycle eviction stays separate from project prune. Phase-1 equivalence: `PollState::new()` pre-seeds `clients` with the Tauri id as `focused: true, last_seen: 0`, so the first immediate tick still polls at launch and single-client behavior is byte-for-byte equivalent to Phase 1 (the Tauri frontend's later hints update that same entry). Commands (pr_poll_scheduler.rs + lib.rs): all four hint commands gain a `client_id` arg and `rename_all = "camelCase"` (added to `set_focus`, which lacked it); `refresh_now` carries `client_id` only to `touch` that client. New `disconnect_client` command, registered in the lib.rs invoke handler. The public `PrPollScheduler` methods (`set_foreground`/`set_focus`/ `set_branch_pending`/`force`/`touch`/`disconnect_client`) read the clock and delegate to the pure `PollState`, and are `pub` so the parallel web `handle_ws`/`dispatch` can drive them via `app_handle.state::<Arc<...>>()`. Frontend (prPollingService.ts + commands.ts): the service owns a module-level client id — the fixed `"tauri-main"` under Tauri (via `isTauri`), else a `crypto.randomUUID()` per page load — threaded through all four hint wrappers (now carrying `clientId`). New `disconnectPrPollClient(clientId)` wrapper, called from `dispose()`. The id is exposed via `getPrPollClientId()` so the web transport can append the same value as `?clientId=<id>` on the WS connect. Backend<->web contract the parallel work must honor: the same `clientId` flows on both channels — every interest invoke and the events WS connect query. The web work owns adding the four hint arms + `disconnect_client` to `dispatch` (reading `clientId` from args) and the `handle_ws` `touch` (connect/ping) / `disconnect_client` (close) calls, plus ensuring a keepalive actually flows so the TTL fallback works. Intentionally out of scope: `web_server.rs` is left to the parallel web work (touching it here would conflict); the two open code-review bugs — `record_success`/`record_failure` clearing `forced` (mid-flight re-force drop) and pruning `pending_branches` by branch id — are left for their own commits. `forced` stays exactly as today (global, `record_*` behavior unchanged) so the bug-(a) fix applies cleanly on top. Tests: the 7 existing PollState tests are ported to the per-client structure (seeding the single Tauri client); 9 new pure-`PollState` tests cover foreground/pending/focus union across two clients, disconnect recomputing the union and dropping pending, TTL eviction with the Tauri-id exemption, `touch` keeping a client alive past the boundary, `forced` surviving the forcing client's disconnect, and single-client equivalence to Phase 1. Verification: - `cargo check` (src-tauri) — clean, no warnings. - `cargo test pr_poll_scheduler` — 16/16 pass (7 ported + 9 new; 291 other tests unaffected). - `pnpm check` — 0 errors / 0 warnings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matt Toohey <contact@matttoohey.com>

…er + slow-path logs) Port the project-switch diagnostics from the upstream working tree so the PR-poll-scheduler rework on this branch can be measured against the original freeze. Console-only; no behavior change. - switchTracer.ts (new): traces a project→project switch — synchronous milestones (selectProject, ProjectSection mount/destroy, the App selectedProjectId effect), component mount/unmount counts, and requestAnimationFrame gaps as main-thread stalls — so genuine render cost is distinguishable from event-loop starvation. Settles on quiet (300ms) or a 3s cap; frame gaps >700ms are treated as background throttling and not counted as stalls. Lifecycle lines use `[switch #N ...]`; one-off slow-path warnings from other modules use the bare `[switch]` prefix. - navigation.svelte.ts: beginSwitch() at the top of selectProject and a "sync complete" mark at the end, bracketing the synchronous switch work. - App.svelte: mark the selectedProjectId effect that hints prPolling. - ProjectSection.svelte / BranchCard.svelte: count mounts/unmounts and mark ProjectSection mount/destroy; BranchCard also records its synchronous timeline-cache hydration cost (recordSync). - DiffModal.svelte: time selected-file → painted via a double rAF (`[diff] render painted in Nms`). - persistentStore.ts: warn when a store.set disk write blocks >=50ms. Skipped from the upstream tree: the prPollingService.ts slow-`refreshAllPrStatuses` log. That diagnostic instrumented the frontend poll loop, which this branch already moved into the backend scheduler (phases 0–2) — the instrumented code no longer exists here, so there is no landing site for it. Verification: `pnpm check` — 0 errors / 0 warnings. Signed-off-by: Matt Toohey <contact@matttoohey.com>

… in switchTracer Switch #11 was a ~7.5s continuous main-thread freeze that switchTracer reported as `maxGap 0 / stalls 0`: not a single requestAnimationFrame fired the entire switch, so onFrame never ran, and the safety-net timer won the post-freeze race to finalize — which folded the missing frame into `firstFrame ?? settled` and recorded no stall. The freeze was invisible in exactly the case we most need to catch. This makes the next reproduction name its own cause. switchTracer.ts: - finalize now closes the post-freeze race. It measures the trailing gap from the last frame it saw to settle time and synthesizes the implied stall — symmetric with onFrame (record at >= STALL_MS, log immediately at >= FREEZE_MS), gated on foreground visibility so a hidden window's paused rAF isn't mistaken for a freeze. A continuous freeze (firstFrame still null) is logged as "continuous — no frame fired the entire switch", a tail freeze as "at switch tail"; either way the stall now lands in maxGap and the stall stats instead of vanishing. - the done line reports `firstFrame none (no frame fired)` instead of the misleading `firstFrame +<settled>ms`, so the no-frame fingerprint is unambiguous. - the done line gains `freezeBracket` — the largest gap between two adjacent synchronous milestones (begin, selectProject sync complete, ProjectSection destroy/mount, the App effect). Marks are recorded even during a total freeze (they fire from the reactive flush itself), so this names the span the freeze lives in (e.g. "navigation.selectProject sync complete → ProjectSection destroy 6220ms"), pointing the next investigation straight at the un-instrumented reactive cascade to profile — as close to the cause as the logs can get without a JS profiler. Gated at FREEZE_MS so healthy switches stay uncluttered. Also lands the previously-uncommitted diagnostics this builds on (the "separate slow disk write from blocked main thread" work): persistentStore schedules an event-loop-lag probe alongside each store.set and logs `set=<n>ms eventLoopLag=<n>ms`, so a slow set can be attributed to a genuine backend write (set >> lag) vs. a co-victim of a main-thread freeze (set ≈ lag); and switchTracer's stall classification moved from a size upper bound (THROTTLE_MS) to foreground visibility, with maxGap always reported. Console-only; no behavior change. Verification: pnpm check — 0 errors / 0 warnings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matt Toohey <contact@matttoohey.com>

…tch cascade The timeline recording in the latest analysis proved the worst project switch is ~670ms of pure JavaScript in the synchronous reactive cascade (not rendering, layout, the persistentStore write, or PR-polling), but the export carried no sampling data so it couldn't name the offending function — and switchTracer's freezeBracket could only point at one opaque span, "selectProject sync complete → ProjectSection destroy". Add the finer in-app marks the analysis recommended so the next reproduction names the stage in the console without a profiler. The marks subdivide that previously-opaque span into derive / old-subtree teardown / new-subtree build: - ProjectHome.svelte: a `$effect.pre` emits `ProjectHome derive` (with the visible-project count). `$effect.pre` runs inside the flush *before* the keyed `{#each visibleProjects}` block reconciles, so the mark lands between `navigation.selectProject sync complete` and the old `ProjectSection destroy` — splitting the freeze bracket into the derive stage (up to this mark) and the old-subtree teardown that follows. - ProjectSection.svelte: a top-level `ProjectSection construct` mark. Top-level script runs once per new instance, before the template builds any child BranchCards, so the gap to the existing `ProjectSection mount` (onMount, fired after all cards are constructed) isolates the new-subtree build — BranchCard construction plus each card's synchronous timeline-cache hydration (already captured as `BranchCard.hydrateSync`) — from the teardown. The construction-time `project.id` read is intentional, suppressed with `svelte-ignore state_referenced_locally` (same pattern BranchCard already uses for its hydration read). - switchTracer.ts: doc comment updated to list the new milestones and note that a freeze now names its stage (derive vs. teardown vs. build) rather than landing in one span. Console-only; no behavior change (the marks read pure deriveds/props and `mark()` no-ops when no switch is active). Verification: pnpm check — 0 errors / 0 warnings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matt Toohey <contact@matttoohey.com>

… just the PR-poll fix The project-switch slowdown will be pursued on a separate branch, so strip all the diagnostic/logging code that was added here to measure the freeze. This reverses the three console-only instrumentation commits (65ec3ad, fd9b7a6, 66d6820) in full, leaving the branch with only the PR-management fix (the backend-owned PR-poll scheduler, plan phases 0-2). Removed: - switchTracer.ts (the whole project-switch tracer module). - navigation.svelte.ts: the beginSwitch() / "sync complete" marks bracketing selectProject. - App.svelte: the selectedProjectId-effect mark. - ProjectHome.svelte: the `$effect.pre` "ProjectHome derive" mark (the Phase 0 pr-status-changed event buffering / rAF flush stays — that is the fix). - ProjectSection.svelte: the construct/mount/destroy marks and mount/unmount counts. - BranchCard.svelte: the mount/unmount counts and the hydrateSync timing. - DiffModal.svelte: the selected-file -> painted double-rAF timing log. - persistentStore.ts: the slow-set warning and its event-loop-lag probe; setStoreValue is back to a plain awaited backend.store.set. The resulting tree is byte-for-byte equal to the phase-2 fix commit (fa22aa6); no PR-poll-scheduler code is touched. Verification: `pnpm check` - 0 errors / 0 warnings; no remaining references to the removed instrumentation. Signed-off-by: Matt Toohey <contact@matttoohey.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c74fce2d32

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-12T02:51:56Z

+    fn record_success(&mut self, project_id: &str, now: i64) -> bool {
+        self.last_polled_at.insert(project_id.to_string(), now);
+        self.failures.remove(project_id);
+        self.forced.remove(project_id);


Preserve forced refreshes queued during in-flight polls

When refresh_now is called for a project that is already in in_flight, due() skips it before consuming forced, so the nudge should survive until the current poll finishes. This removal clears that pending nudge on successful completion, meaning a push/PR-create refresh requested during an existing poll is dropped and the UI may keep pre-push PR status until the normal 15s/60s/5m interval elapses; the old frontend explicitly queued these same-project refreshes after the in-flight operation.

Useful? React with 👍 / 👎.

Keep refresh_now nudges that arrive while a project refresh is already in flight by only consuming forced entries when the scheduler schedules work, not when an older refresh completes. Add a scheduler regression covering both successful and failed completions. Signed-off-by: Matt Toohey <contact@matttoohey.com>

Collect every spawned branch refresh result before deciding whether the project refresh failed. A single branch task panic or join error is now logged and ignored when other branch tasks complete, so the refreshed count is preserved and the final pr-statuses-refreshed event is emitted. Keep returning an error when every spawned branch refresh task fails, and cover both partial and total task-failure cases with targeted async regressions. Signed-off-by: Matt Toohey <contact@matttoohey.com>

Validate scheduler pending-branch interest against live branch ids during prune so deleted branches cannot pin a surviving project to the pending poll tier. Add coverage for the deleted-branch scheduler regression and the store branch-id query. Signed-off-by: Matt Toohey <contact@matttoohey.com>

Treat branch-level PR status fetch/update failures as failed refreshes in the aggregation path, so an all-false project refresh returns an error and drives scheduler backoff/stale handling. Keep mixed success/failure refreshes tolerant and cover the all-branches-failed regression. Signed-off-by: Matt Toohey <contact@matttoohey.com>

matt2e and others added 7 commits June 10, 2026 17:45

matt2e requested review from baxen and wesbillman as code owners June 12, 2026 02:48

chatgpt-codex-connector Bot reviewed Jun 12, 2026

View reviewed changes

matt2e added 4 commits June 12, 2026 13:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(staged): move PR polling to a backend scheduler and coalesce status events#783

perf(staged): move PR polling to a backend scheduler and coalesce status events#783
matt2e wants to merge 11 commits into
mainfrom
project-swith-slowdown

matt2e commented Jun 12, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

matt2e commented Jun 12, 2026

What changed

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant