perf(staged): move PR polling to a backend scheduler and coalesce status events#783
perf(staged): move PR polling to a backend scheduler and coalesce status events#783matt2e wants to merge 11 commits into
Conversation
…ncy, yield background polling (plan phase 0) Phase 0 of moving PR polling into a backend-owned scheduler. Three self-contained changes that fix the intermittent project-switch freeze caused by a PR-polling storm starving the renderer's event loop: - ProjectHome.svelte: buffer pr-status-changed events and apply a single branchesByProject rebuild per animation frame instead of one Map allocation per branch; the pending flush is cancelled on destroy. A storm of N branch events now collapses into one reactive flush. - prs.rs: fan the per-branch fetches in refresh_all_pr_statuses out across a Semaphore-bounded pool (cap 6) instead of awaiting them serially, so a project resolves in ~1 round-trip's wall-clock instead of N. Per-branch and final emissions, error handling, and return value are unchanged — only the scheduling is now parallel. - prPollingService.ts: yield to the event loop between projects in the poll() drain loop, and hold background-tier polling for a short cooldown after setSelectedProject so a switch can interleave instead of waiting out the serial chain. Interval tiers, focus/blur gating, failure backoff, and the due logic are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matt Toohey <contact@matttoohey.com>
…end scheduler (plan phase 1) Phase 1 of moving PR polling into a backend-owned scheduler. Makes the backend the single owner of polling cadence and concurrency and collapses `prPollingService.ts` into a thin interest/hint layer. Behavior-preserving: the same effective tiers and triggers, now driven from the backend. Single frontend only — no multi-client registry (that is Phase 2). Backend: - New `pr_poll_scheduler` module: one long-lived `tokio::time::interval` tick loop (granularity 5s, well below the 15s pending tier). It owns the poll-state that used to live in the frontend — per-project `last_polled_at`, the three interval tiers (15s pending / 60s selected / 5m background, moved verbatim from the frontend), and failure/stale tracking. Poll-state is not persisted: on restart everything is "due" (per the plan's open question). On each tick it derives the project list from the DB (`list_projects`), computes which projects are due from owned state + interest, dedups in-flight work, and spawns refreshes. Interest changes and `refresh_now` nudges wake the loop immediately via a `Notify`. - All due/dedup/backoff decisions live in a pure `PollState` whose methods take an explicit `now` (ms) and in-flight set — the injectable seam that makes the logic unit-testable without `gh` calls or wall-clock sleeps. The loop is the only place that touches the real clock and fetcher. 7 unit tests cover the three tiers, the focus=off pause, in-flight dedup, `refresh_now` folding into dedup, the stale threshold, failure backoff, and pruning. - Reuses Phase 0's bounded concurrency: `refresh_all_pr_statuses` is split so its semaphore-bounded fan-out lives in a reusable `refresh_project_pr_statuses` core that both the command and the scheduler call. The scheduler passes one shared `Semaphore` (cap `PR_REFRESH_CONCURRENCY` = 6) across all projects it refreshes, so a focus-regain / launch burst still caps total `gh` subprocesses — the bounded pool is the herd control (explicit time-jitter was left out to keep the due logic pure/testable; a possible follow-up). - New interest/hint commands, registered in the invoke handler: `set_foreground_project`, `set_focus`, `set_branch_pending`, `refresh_now`. The effective tier for a project is the union of interest (any foregrounding => selected; any pending => fast; nothing focused => pause), structured as a union so Phase 2 can extend it. - The scheduler is managed unconditionally (so the hint commands resolve even during the needs-reset prompt) and the tick loop is spawned in setup's `Ok` branch alongside `background_sync`. - Per-branch `pr-status-changed` and final `pr-statuses-refreshed` events are unchanged. Two small events replace state the frontend used to compute in its own poll loop: `pr-refresh-state` (per-project refreshing on/off) and `pr-status-stale` (crossed/recovered the failure threshold). Frontend (`prPollingService.ts` + callers): - Removed the timer-driven drain loop, the interval tiers, the `getProjectsDue` math, and the failure backoff — these now live in the backend. Phase 0's yield helper and switch cooldown went with them (the foreground serial chain they mitigated no longer exists). - Rewired the public methods to backend commands, preserving the API callers use: `setSelectedProject` -> `set_foreground_project`, `updateChecksStatus` -> `set_branch_pending`, `refreshNow` -> `refresh_now`, focus/blur listeners -> `set_focus`. `setProjects` was removed (the backend derives the project list from the DB); `App.svelte`'s project-list `$effect` and its now-unused import were dropped accordingly. - `onStale` / `onRefreshing` / `isRefreshing` are unchanged for callers but are now driven by the new backend lifecycle events, subscribed in a new `init()` (with a matching `dispose()`), wired from `App.svelte`'s onMount/onDestroy. Verification: `cargo check` clean; `cargo test pr_poll_scheduler` 7/7 pass (291 existing tests unaffected); `pnpm check` 0 errors / 0 warnings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Matt Toohey <contact@matttoohey.com>
…disconnect eviction (plan phase 2)
Phase 2 of the PR-poll-scheduler rearchitecture. Turns the scheduler's
process-global, single-valued interest into per-connected-client interest,
with the effective cadence computed as the union across all clients and
per-client eviction on disconnect. Clients are the native Tauri window plus
each WebSocket browser session (web path lands in parallel).
Per-client data model + union cadence (pr_poll_scheduler.rs):
- New `ClientInterest { foreground_project, focused, pending_branches,
last_seen }`. `PollState` drops the scalar `foreground_project`/`focused`/
`pending_branches` and gains `clients: HashMap<String, ClientInterest>`.
The per-project *work* bookkeeping (`last_polled_at`/`failures`/`stale`/
`forced`) stays project-keyed and shared — it is about the work, not the
observer, so N clients still trigger only one poll per project per tier.
- Union helpers keep `PollState` clock-free (all `now` passed in):
`any_focused()`, `is_foreground(p)`, and `project_has_pending(p)` rewritten
over `clients`. `interval_for` keeps its precedence (any pending => 15s;
else foregrounded => 60s; else 5m); `due` swaps `!self.focused` for
`!self.any_focused()`. `forced`/dedup/interval logic is untouched.
- Client-keyed setters bump `last_seen`: `set_foreground`/`set_focus`/
`set_branch_pending(client_id, .., now)`. `force(project_id)` stays global.
Lifecycle + TTL:
- New `touch(client_id, now)` (create-or-bump heartbeat), `disconnect_client`
(clean WS close), and `evict_stale_clients(now, ttl_ms)` (dirty-drop
fallback). `CLIENT_TTL_MS = 90_000` (~3x an expected <=30s WS keepalive);
the tick loop calls `evict_stale_clients` inside the existing locked block
before `prune`. `TAURI_CLIENT_ID = "tauri-main"` is exempt from TTL
eviction (process death is the native window's teardown).
- `prune` keeps project-keyed pruning and now also clears each client's
foreground/pending interest in dead projects; client lifecycle eviction
stays separate from project prune.
Phase-1 equivalence: `PollState::new()` pre-seeds `clients` with the Tauri id
as `focused: true, last_seen: 0`, so the first immediate tick still polls at
launch and single-client behavior is byte-for-byte equivalent to Phase 1
(the Tauri frontend's later hints update that same entry).
Commands (pr_poll_scheduler.rs + lib.rs): all four hint commands gain a
`client_id` arg and `rename_all = "camelCase"` (added to `set_focus`, which
lacked it); `refresh_now` carries `client_id` only to `touch` that client.
New `disconnect_client` command, registered in the lib.rs invoke handler. The
public `PrPollScheduler` methods (`set_foreground`/`set_focus`/
`set_branch_pending`/`force`/`touch`/`disconnect_client`) read the clock and
delegate to the pure `PollState`, and are `pub` so the parallel web
`handle_ws`/`dispatch` can drive them via `app_handle.state::<Arc<...>>()`.
Frontend (prPollingService.ts + commands.ts): the service owns a module-level
client id — the fixed `"tauri-main"` under Tauri (via `isTauri`), else a
`crypto.randomUUID()` per page load — threaded through all four hint wrappers
(now carrying `clientId`). New `disconnectPrPollClient(clientId)` wrapper,
called from `dispose()`. The id is exposed via `getPrPollClientId()` so the
web transport can append the same value as `?clientId=<id>` on the WS connect.
Backend<->web contract the parallel work must honor: the same `clientId` flows
on both channels — every interest invoke and the events WS connect query. The
web work owns adding the four hint arms + `disconnect_client` to `dispatch`
(reading `clientId` from args) and the `handle_ws` `touch` (connect/ping) /
`disconnect_client` (close) calls, plus ensuring a keepalive actually flows so
the TTL fallback works.
Intentionally out of scope: `web_server.rs` is left to the parallel web work
(touching it here would conflict); the two open code-review bugs —
`record_success`/`record_failure` clearing `forced` (mid-flight re-force drop)
and pruning `pending_branches` by branch id — are left for their own commits.
`forced` stays exactly as today (global, `record_*` behavior unchanged) so the
bug-(a) fix applies cleanly on top.
Tests: the 7 existing PollState tests are ported to the per-client structure
(seeding the single Tauri client); 9 new pure-`PollState` tests cover
foreground/pending/focus union across two clients, disconnect recomputing the
union and dropping pending, TTL eviction with the Tauri-id exemption, `touch`
keeping a client alive past the boundary, `forced` surviving the forcing
client's disconnect, and single-client equivalence to Phase 1.
Verification:
- `cargo check` (src-tauri) — clean, no warnings.
- `cargo test pr_poll_scheduler` — 16/16 pass (7 ported + 9 new; 291 other
tests unaffected).
- `pnpm check` — 0 errors / 0 warnings.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matt Toohey <contact@matttoohey.com>
…er + slow-path logs) Port the project-switch diagnostics from the upstream working tree so the PR-poll-scheduler rework on this branch can be measured against the original freeze. Console-only; no behavior change. - switchTracer.ts (new): traces a project→project switch — synchronous milestones (selectProject, ProjectSection mount/destroy, the App selectedProjectId effect), component mount/unmount counts, and requestAnimationFrame gaps as main-thread stalls — so genuine render cost is distinguishable from event-loop starvation. Settles on quiet (300ms) or a 3s cap; frame gaps >700ms are treated as background throttling and not counted as stalls. Lifecycle lines use `[switch #N ...]`; one-off slow-path warnings from other modules use the bare `[switch]` prefix. - navigation.svelte.ts: beginSwitch() at the top of selectProject and a "sync complete" mark at the end, bracketing the synchronous switch work. - App.svelte: mark the selectedProjectId effect that hints prPolling. - ProjectSection.svelte / BranchCard.svelte: count mounts/unmounts and mark ProjectSection mount/destroy; BranchCard also records its synchronous timeline-cache hydration cost (recordSync). - DiffModal.svelte: time selected-file → painted via a double rAF (`[diff] render painted in Nms`). - persistentStore.ts: warn when a store.set disk write blocks >=50ms. Skipped from the upstream tree: the prPollingService.ts slow-`refreshAllPrStatuses` log. That diagnostic instrumented the frontend poll loop, which this branch already moved into the backend scheduler (phases 0–2) — the instrumented code no longer exists here, so there is no landing site for it. Verification: `pnpm check` — 0 errors / 0 warnings. Signed-off-by: Matt Toohey <contact@matttoohey.com>
… in switchTracer Switch #11 was a ~7.5s continuous main-thread freeze that switchTracer reported as `maxGap 0 / stalls 0`: not a single requestAnimationFrame fired the entire switch, so onFrame never ran, and the safety-net timer won the post-freeze race to finalize — which folded the missing frame into `firstFrame ?? settled` and recorded no stall. The freeze was invisible in exactly the case we most need to catch. This makes the next reproduction name its own cause. switchTracer.ts: - finalize now closes the post-freeze race. It measures the trailing gap from the last frame it saw to settle time and synthesizes the implied stall — symmetric with onFrame (record at >= STALL_MS, log immediately at >= FREEZE_MS), gated on foreground visibility so a hidden window's paused rAF isn't mistaken for a freeze. A continuous freeze (firstFrame still null) is logged as "continuous — no frame fired the entire switch", a tail freeze as "at switch tail"; either way the stall now lands in maxGap and the stall stats instead of vanishing. - the done line reports `firstFrame none (no frame fired)` instead of the misleading `firstFrame +<settled>ms`, so the no-frame fingerprint is unambiguous. - the done line gains `freezeBracket` — the largest gap between two adjacent synchronous milestones (begin, selectProject sync complete, ProjectSection destroy/mount, the App effect). Marks are recorded even during a total freeze (they fire from the reactive flush itself), so this names the span the freeze lives in (e.g. "navigation.selectProject sync complete → ProjectSection destroy 6220ms"), pointing the next investigation straight at the un-instrumented reactive cascade to profile — as close to the cause as the logs can get without a JS profiler. Gated at FREEZE_MS so healthy switches stay uncluttered. Also lands the previously-uncommitted diagnostics this builds on (the "separate slow disk write from blocked main thread" work): persistentStore schedules an event-loop-lag probe alongside each store.set and logs `set=<n>ms eventLoopLag=<n>ms`, so a slow set can be attributed to a genuine backend write (set >> lag) vs. a co-victim of a main-thread freeze (set ≈ lag); and switchTracer's stall classification moved from a size upper bound (THROTTLE_MS) to foreground visibility, with maxGap always reported. Console-only; no behavior change. Verification: pnpm check — 0 errors / 0 warnings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matt Toohey <contact@matttoohey.com>
…tch cascade
The timeline recording in the latest analysis proved the worst project
switch is ~670ms of pure JavaScript in the synchronous reactive cascade
(not rendering, layout, the persistentStore write, or PR-polling), but
the export carried no sampling data so it couldn't name the offending
function — and switchTracer's freezeBracket could only point at one
opaque span, "selectProject sync complete → ProjectSection destroy".
Add the finer in-app marks the analysis recommended so the next
reproduction names the stage in the console without a profiler. The marks
subdivide that previously-opaque span into derive / old-subtree teardown
/ new-subtree build:
- ProjectHome.svelte: a `$effect.pre` emits `ProjectHome derive` (with the
visible-project count). `$effect.pre` runs inside the flush *before* the
keyed `{#each visibleProjects}` block reconciles, so the mark lands
between `navigation.selectProject sync complete` and the old
`ProjectSection destroy` — splitting the freeze bracket into the derive
stage (up to this mark) and the old-subtree teardown that follows.
- ProjectSection.svelte: a top-level `ProjectSection construct` mark.
Top-level script runs once per new instance, before the template builds
any child BranchCards, so the gap to the existing `ProjectSection mount`
(onMount, fired after all cards are constructed) isolates the new-subtree
build — BranchCard construction plus each card's synchronous
timeline-cache hydration (already captured as `BranchCard.hydrateSync`)
— from the teardown. The construction-time `project.id` read is
intentional, suppressed with `svelte-ignore state_referenced_locally`
(same pattern BranchCard already uses for its hydration read).
- switchTracer.ts: doc comment updated to list the new milestones and note
that a freeze now names its stage (derive vs. teardown vs. build) rather
than landing in one span.
Console-only; no behavior change (the marks read pure deriveds/props and
`mark()` no-ops when no switch is active). Verification: pnpm check —
0 errors / 0 warnings.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matt Toohey <contact@matttoohey.com>
… just the PR-poll fix The project-switch slowdown will be pursued on a separate branch, so strip all the diagnostic/logging code that was added here to measure the freeze. This reverses the three console-only instrumentation commits (65ec3ad, fd9b7a6, 66d6820) in full, leaving the branch with only the PR-management fix (the backend-owned PR-poll scheduler, plan phases 0-2). Removed: - switchTracer.ts (the whole project-switch tracer module). - navigation.svelte.ts: the beginSwitch() / "sync complete" marks bracketing selectProject. - App.svelte: the selectedProjectId-effect mark. - ProjectHome.svelte: the `$effect.pre` "ProjectHome derive" mark (the Phase 0 pr-status-changed event buffering / rAF flush stays — that is the fix). - ProjectSection.svelte: the construct/mount/destroy marks and mount/unmount counts. - BranchCard.svelte: the mount/unmount counts and the hydrateSync timing. - DiffModal.svelte: the selected-file -> painted double-rAF timing log. - persistentStore.ts: the slow-set warning and its event-loop-lag probe; setStoreValue is back to a plain awaited backend.store.set. The resulting tree is byte-for-byte equal to the phase-2 fix commit (fa22aa6); no PR-poll-scheduler code is touched. Verification: `pnpm check` - 0 errors / 0 warnings; no remaining references to the removed instrumentation. Signed-off-by: Matt Toohey <contact@matttoohey.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c74fce2d32
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| fn record_success(&mut self, project_id: &str, now: i64) -> bool { | ||
| self.last_polled_at.insert(project_id.to_string(), now); | ||
| self.failures.remove(project_id); | ||
| self.forced.remove(project_id); |
There was a problem hiding this comment.
Preserve forced refreshes queued during in-flight polls
When refresh_now is called for a project that is already in in_flight, due() skips it before consuming forced, so the nudge should survive until the current poll finishes. This removal clears that pending nudge on successful completion, meaning a push/PR-create refresh requested during an existing poll is dropped and the UI may keep pre-push PR status until the normal 15s/60s/5m interval elapses; the old frontend explicitly queued these same-project refreshes after the in-flight operation.
Useful? React with 👍 / 👎.
Keep refresh_now nudges that arrive while a project refresh is already in flight by only consuming forced entries when the scheduler schedules work, not when an older refresh completes. Add a scheduler regression covering both successful and failed completions. Signed-off-by: Matt Toohey <contact@matttoohey.com>
Collect every spawned branch refresh result before deciding whether the project refresh failed. A single branch task panic or join error is now logged and ignored when other branch tasks complete, so the refreshed count is preserved and the final pr-statuses-refreshed event is emitted. Keep returning an error when every spawned branch refresh task fails, and cover both partial and total task-failure cases with targeted async regressions. Signed-off-by: Matt Toohey <contact@matttoohey.com>
Validate scheduler pending-branch interest against live branch ids during prune so deleted branches cannot pin a surviving project to the pending poll tier. Add coverage for the deleted-branch scheduler regression and the store branch-id query. Signed-off-by: Matt Toohey <contact@matttoohey.com>
Treat branch-level PR status fetch/update failures as failed refreshes in the aggregation path, so an all-false project refresh returns an error and drives scheduler backoff/stale handling. Keep mixed success/failure refreshes tolerant and cover the all-branches-failed regression. Signed-off-by: Matt Toohey <contact@matttoohey.com>
Reworks PR-status polling to fix the main-thread jank that showed up when switching projects. The frontend used to own the polling cadence and fan out one
refresh_all_pr_statusescall per project; this moves cadence, concurrency, and dedup into the backend and reduces the reactive churn a polling cycle causes on the client.What changed
Coalesce status events + bound fetch concurrency (perf)
ProjectHome.sveltenow buffers incomingpr-status-changedevents and applies them in a singlerequestAnimationFrameflush, so a storm of N per-branch events coalesces into one reactive rebuild instead of Nnew Map(...)allocations + N derivation re-runs.prs.rssplits the polling core out intorefresh_project_pr_statuses, which fans per-branchghfetches across a bounded semaphore pool (PR_REFRESH_CONCURRENCY = 6) instead of awaiting them serially. The semaphore is passed in so a single pool can be shared across all projects.Backend PR-poll scheduler (refactor)
pr_poll_scheduler.rsowns polling cadence and concurrency: it derives the project list from the DB, decides what is due across the pending/selected/background tiers, dedups in-flight work, backs off failures, and runs everything on one bounded pool. Spawned once the store is ready; the interest/hint commands are managed unconditionally so they resolve even before the store exists.prPollingService.tsbecomes a thin shim that forwards UI interest to the backend as hints (selected project, pending checks, window focus, explicit refresh nudges) and re-broadcasts the backend's refresh/stale lifecycle events, preserving theonRefreshing/onStale/isRefreshingAPI thatBranchCardPrButtonrelies on.App.sveltedrops the app-side project-list$effectand wiresprPollingService.init()/dispose()into the component lifecycle.Per-client interest (refactor)
clientId(tauri-mainfor native, a per-tab UUID for web), and disconnect evicts a client's interest. New interest/hint commands:set_foreground_project,set_focus,set_branch_pending,refresh_now,disconnect_client.Latency instrumentation used during the investigation was added and then removed; only the PR-poll rework remains in the net diff.