Skip to content

perf(staged): move PR polling to a backend scheduler and coalesce status events#783

Open
matt2e wants to merge 11 commits into
mainfrom
project-swith-slowdown
Open

perf(staged): move PR polling to a backend scheduler and coalesce status events#783
matt2e wants to merge 11 commits into
mainfrom
project-swith-slowdown

Conversation

@matt2e

@matt2e matt2e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Reworks PR-status polling to fix the main-thread jank that showed up when switching projects. The frontend used to own the polling cadence and fan out one refresh_all_pr_statuses call per project; this moves cadence, concurrency, and dedup into the backend and reduces the reactive churn a polling cycle causes on the client.

What changed

Coalesce status events + bound fetch concurrency (perf)

  • ProjectHome.svelte now buffers incoming pr-status-changed events and applies them in a single requestAnimationFrame flush, so a storm of N per-branch events coalesces into one reactive rebuild instead of N new Map(...) allocations + N derivation re-runs.
  • prs.rs splits the polling core out into refresh_project_pr_statuses, which fans per-branch gh fetches across a bounded semaphore pool (PR_REFRESH_CONCURRENCY = 6) instead of awaiting them serially. The semaphore is passed in so a single pool can be shared across all projects.

Backend PR-poll scheduler (refactor)

  • New pr_poll_scheduler.rs owns polling cadence and concurrency: it derives the project list from the DB, decides what is due across the pending/selected/background tiers, dedups in-flight work, backs off failures, and runs everything on one bounded pool. Spawned once the store is ready; the interest/hint commands are managed unconditionally so they resolve even before the store exists.
  • prPollingService.ts becomes a thin shim that forwards UI interest to the backend as hints (selected project, pending checks, window focus, explicit refresh nudges) and re-broadcasts the backend's refresh/stale lifecycle events, preserving the onRefreshing / onStale / isRefreshing API that BranchCardPrButton relies on.
  • App.svelte drops the app-side project-list $effect and wires prPollingService.init() / dispose() into the component lifecycle.

Per-client interest (refactor)

  • Interest is tracked per connected client and unioned across clients. Each hint carries a clientId (tauri-main for native, a per-tab UUID for web), and disconnect evicts a client's interest. New interest/hint commands: set_foreground_project, set_focus, set_branch_pending, refresh_now, disconnect_client.

Latency instrumentation used during the investigation was added and then removed; only the PR-poll rework remains in the net diff.

matt2e and others added 7 commits June 10, 2026 17:45
…ncy, yield background polling (plan phase 0)

Phase 0 of moving PR polling into a backend-owned scheduler. Three
self-contained changes that fix the intermittent project-switch freeze
caused by a PR-polling storm starving the renderer's event loop:

- ProjectHome.svelte: buffer pr-status-changed events and apply a single
  branchesByProject rebuild per animation frame instead of one Map
  allocation per branch; the pending flush is cancelled on destroy. A
  storm of N branch events now collapses into one reactive flush.
- prs.rs: fan the per-branch fetches in refresh_all_pr_statuses out
  across a Semaphore-bounded pool (cap 6) instead of awaiting them
  serially, so a project resolves in ~1 round-trip's wall-clock instead
  of N. Per-branch and final emissions, error handling, and return value
  are unchanged — only the scheduling is now parallel.
- prPollingService.ts: yield to the event loop between projects in the
  poll() drain loop, and hold background-tier polling for a short
  cooldown after setSelectedProject so a switch can interleave instead
  of waiting out the serial chain. Interval tiers, focus/blur gating,
  failure backoff, and the due logic are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matt Toohey <contact@matttoohey.com>
…end scheduler (plan phase 1)

Phase 1 of moving PR polling into a backend-owned scheduler. Makes the
backend the single owner of polling cadence and concurrency and collapses
`prPollingService.ts` into a thin interest/hint layer. Behavior-preserving:
the same effective tiers and triggers, now driven from the backend. Single
frontend only — no multi-client registry (that is Phase 2).

Backend:
- New `pr_poll_scheduler` module: one long-lived `tokio::time::interval`
  tick loop (granularity 5s, well below the 15s pending tier). It owns the
  poll-state that used to live in the frontend — per-project `last_polled_at`,
  the three interval tiers (15s pending / 60s selected / 5m background, moved
  verbatim from the frontend), and failure/stale tracking. Poll-state is not
  persisted: on restart everything is "due" (per the plan's open question).
  On each tick it derives the project list from the DB (`list_projects`),
  computes which projects are due from owned state + interest, dedups
  in-flight work, and spawns refreshes. Interest changes and `refresh_now`
  nudges wake the loop immediately via a `Notify`.
- All due/dedup/backoff decisions live in a pure `PollState` whose methods
  take an explicit `now` (ms) and in-flight set — the injectable seam that
  makes the logic unit-testable without `gh` calls or wall-clock sleeps. The
  loop is the only place that touches the real clock and fetcher. 7 unit
  tests cover the three tiers, the focus=off pause, in-flight dedup,
  `refresh_now` folding into dedup, the stale threshold, failure backoff, and
  pruning.
- Reuses Phase 0's bounded concurrency: `refresh_all_pr_statuses` is split so
  its semaphore-bounded fan-out lives in a reusable `refresh_project_pr_statuses`
  core that both the command and the scheduler call. The scheduler passes one
  shared `Semaphore` (cap `PR_REFRESH_CONCURRENCY` = 6) across all projects it
  refreshes, so a focus-regain / launch burst still caps total `gh`
  subprocesses — the bounded pool is the herd control (explicit time-jitter
  was left out to keep the due logic pure/testable; a possible follow-up).
- New interest/hint commands, registered in the invoke handler:
  `set_foreground_project`, `set_focus`, `set_branch_pending`, `refresh_now`.
  The effective tier for a project is the union of interest (any foregrounding
  => selected; any pending => fast; nothing focused => pause), structured as a
  union so Phase 2 can extend it.
- The scheduler is managed unconditionally (so the hint commands resolve even
  during the needs-reset prompt) and the tick loop is spawned in setup's `Ok`
  branch alongside `background_sync`.
- Per-branch `pr-status-changed` and final `pr-statuses-refreshed` events are
  unchanged. Two small events replace state the frontend used to compute in
  its own poll loop: `pr-refresh-state` (per-project refreshing on/off) and
  `pr-status-stale` (crossed/recovered the failure threshold).

Frontend (`prPollingService.ts` + callers):
- Removed the timer-driven drain loop, the interval tiers, the
  `getProjectsDue` math, and the failure backoff — these now live in the
  backend. Phase 0's yield helper and switch cooldown went with them (the
  foreground serial chain they mitigated no longer exists).
- Rewired the public methods to backend commands, preserving the API callers
  use: `setSelectedProject` -> `set_foreground_project`, `updateChecksStatus`
  -> `set_branch_pending`, `refreshNow` -> `refresh_now`, focus/blur listeners
  -> `set_focus`. `setProjects` was removed (the backend derives the project
  list from the DB); `App.svelte`'s project-list `$effect` and its now-unused
  import were dropped accordingly.
- `onStale` / `onRefreshing` / `isRefreshing` are unchanged for callers but are
  now driven by the new backend lifecycle events, subscribed in a new `init()`
  (with a matching `dispose()`), wired from `App.svelte`'s onMount/onDestroy.

Verification: `cargo check` clean; `cargo test pr_poll_scheduler` 7/7 pass
(291 existing tests unaffected); `pnpm check` 0 errors / 0 warnings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Matt Toohey <contact@matttoohey.com>
…disconnect eviction (plan phase 2)

Phase 2 of the PR-poll-scheduler rearchitecture. Turns the scheduler's
process-global, single-valued interest into per-connected-client interest,
with the effective cadence computed as the union across all clients and
per-client eviction on disconnect. Clients are the native Tauri window plus
each WebSocket browser session (web path lands in parallel).

Per-client data model + union cadence (pr_poll_scheduler.rs):
- New `ClientInterest { foreground_project, focused, pending_branches,
  last_seen }`. `PollState` drops the scalar `foreground_project`/`focused`/
  `pending_branches` and gains `clients: HashMap<String, ClientInterest>`.
  The per-project *work* bookkeeping (`last_polled_at`/`failures`/`stale`/
  `forced`) stays project-keyed and shared — it is about the work, not the
  observer, so N clients still trigger only one poll per project per tier.
- Union helpers keep `PollState` clock-free (all `now` passed in):
  `any_focused()`, `is_foreground(p)`, and `project_has_pending(p)` rewritten
  over `clients`. `interval_for` keeps its precedence (any pending => 15s;
  else foregrounded => 60s; else 5m); `due` swaps `!self.focused` for
  `!self.any_focused()`. `forced`/dedup/interval logic is untouched.
- Client-keyed setters bump `last_seen`: `set_foreground`/`set_focus`/
  `set_branch_pending(client_id, .., now)`. `force(project_id)` stays global.

Lifecycle + TTL:
- New `touch(client_id, now)` (create-or-bump heartbeat), `disconnect_client`
  (clean WS close), and `evict_stale_clients(now, ttl_ms)` (dirty-drop
  fallback). `CLIENT_TTL_MS = 90_000` (~3x an expected <=30s WS keepalive);
  the tick loop calls `evict_stale_clients` inside the existing locked block
  before `prune`. `TAURI_CLIENT_ID = "tauri-main"` is exempt from TTL
  eviction (process death is the native window's teardown).
- `prune` keeps project-keyed pruning and now also clears each client's
  foreground/pending interest in dead projects; client lifecycle eviction
  stays separate from project prune.

Phase-1 equivalence: `PollState::new()` pre-seeds `clients` with the Tauri id
as `focused: true, last_seen: 0`, so the first immediate tick still polls at
launch and single-client behavior is byte-for-byte equivalent to Phase 1
(the Tauri frontend's later hints update that same entry).

Commands (pr_poll_scheduler.rs + lib.rs): all four hint commands gain a
`client_id` arg and `rename_all = "camelCase"` (added to `set_focus`, which
lacked it); `refresh_now` carries `client_id` only to `touch` that client.
New `disconnect_client` command, registered in the lib.rs invoke handler. The
public `PrPollScheduler` methods (`set_foreground`/`set_focus`/
`set_branch_pending`/`force`/`touch`/`disconnect_client`) read the clock and
delegate to the pure `PollState`, and are `pub` so the parallel web
`handle_ws`/`dispatch` can drive them via `app_handle.state::<Arc<...>>()`.

Frontend (prPollingService.ts + commands.ts): the service owns a module-level
client id — the fixed `"tauri-main"` under Tauri (via `isTauri`), else a
`crypto.randomUUID()` per page load — threaded through all four hint wrappers
(now carrying `clientId`). New `disconnectPrPollClient(clientId)` wrapper,
called from `dispose()`. The id is exposed via `getPrPollClientId()` so the
web transport can append the same value as `?clientId=<id>` on the WS connect.

Backend<->web contract the parallel work must honor: the same `clientId` flows
on both channels — every interest invoke and the events WS connect query. The
web work owns adding the four hint arms + `disconnect_client` to `dispatch`
(reading `clientId` from args) and the `handle_ws` `touch` (connect/ping) /
`disconnect_client` (close) calls, plus ensuring a keepalive actually flows so
the TTL fallback works.

Intentionally out of scope: `web_server.rs` is left to the parallel web work
(touching it here would conflict); the two open code-review bugs —
`record_success`/`record_failure` clearing `forced` (mid-flight re-force drop)
and pruning `pending_branches` by branch id — are left for their own commits.
`forced` stays exactly as today (global, `record_*` behavior unchanged) so the
bug-(a) fix applies cleanly on top.

Tests: the 7 existing PollState tests are ported to the per-client structure
(seeding the single Tauri client); 9 new pure-`PollState` tests cover
foreground/pending/focus union across two clients, disconnect recomputing the
union and dropping pending, TTL eviction with the Tauri-id exemption, `touch`
keeping a client alive past the boundary, `forced` surviving the forcing
client's disconnect, and single-client equivalence to Phase 1.

Verification:
- `cargo check` (src-tauri) — clean, no warnings.
- `cargo test pr_poll_scheduler` — 16/16 pass (7 ported + 9 new; 291 other
  tests unaffected).
- `pnpm check` — 0 errors / 0 warnings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matt Toohey <contact@matttoohey.com>
…er + slow-path logs)

Port the project-switch diagnostics from the upstream working tree so the
PR-poll-scheduler rework on this branch can be measured against the original
freeze. Console-only; no behavior change.

- switchTracer.ts (new): traces a project→project switch — synchronous
  milestones (selectProject, ProjectSection mount/destroy, the App
  selectedProjectId effect), component mount/unmount counts, and
  requestAnimationFrame gaps as main-thread stalls — so genuine render cost
  is distinguishable from event-loop starvation. Settles on quiet (300ms) or
  a 3s cap; frame gaps >700ms are treated as background throttling and not
  counted as stalls. Lifecycle lines use `[switch #N ...]`; one-off slow-path
  warnings from other modules use the bare `[switch]` prefix.
- navigation.svelte.ts: beginSwitch() at the top of selectProject and a
  "sync complete" mark at the end, bracketing the synchronous switch work.
- App.svelte: mark the selectedProjectId effect that hints prPolling.
- ProjectSection.svelte / BranchCard.svelte: count mounts/unmounts and mark
  ProjectSection mount/destroy; BranchCard also records its synchronous
  timeline-cache hydration cost (recordSync).
- DiffModal.svelte: time selected-file → painted via a double rAF
  (`[diff] render painted in Nms`).
- persistentStore.ts: warn when a store.set disk write blocks >=50ms.

Skipped from the upstream tree: the prPollingService.ts slow-`refreshAllPrStatuses`
log. That diagnostic instrumented the frontend poll loop, which this branch
already moved into the backend scheduler (phases 0–2) — the instrumented code
no longer exists here, so there is no landing site for it.

Verification: `pnpm check` — 0 errors / 0 warnings.
Signed-off-by: Matt Toohey <contact@matttoohey.com>
… in switchTracer

Switch #11 was a ~7.5s continuous main-thread freeze that switchTracer
reported as `maxGap 0 / stalls 0`: not a single requestAnimationFrame
fired the entire switch, so onFrame never ran, and the safety-net timer
won the post-freeze race to finalize — which folded the missing frame into
`firstFrame ?? settled` and recorded no stall. The freeze was invisible in
exactly the case we most need to catch. This makes the next reproduction
name its own cause.

switchTracer.ts:
- finalize now closes the post-freeze race. It measures the trailing gap
  from the last frame it saw to settle time and synthesizes the implied
  stall — symmetric with onFrame (record at >= STALL_MS, log immediately at
  >= FREEZE_MS), gated on foreground visibility so a hidden window's paused
  rAF isn't mistaken for a freeze. A continuous freeze (firstFrame still
  null) is logged as "continuous — no frame fired the entire switch", a tail
  freeze as "at switch tail"; either way the stall now lands in maxGap and
  the stall stats instead of vanishing.
- the done line reports `firstFrame none (no frame fired)` instead of the
  misleading `firstFrame +<settled>ms`, so the no-frame fingerprint is
  unambiguous.
- the done line gains `freezeBracket` — the largest gap between two adjacent
  synchronous milestones (begin, selectProject sync complete, ProjectSection
  destroy/mount, the App effect). Marks are recorded even during a total
  freeze (they fire from the reactive flush itself), so this names the span
  the freeze lives in (e.g. "navigation.selectProject sync complete →
  ProjectSection destroy 6220ms"), pointing the next investigation straight
  at the un-instrumented reactive cascade to profile — as close to the cause
  as the logs can get without a JS profiler. Gated at FREEZE_MS so healthy
  switches stay uncluttered.

Also lands the previously-uncommitted diagnostics this builds on (the
"separate slow disk write from blocked main thread" work): persistentStore
schedules an event-loop-lag probe alongside each store.set and logs
`set=<n>ms eventLoopLag=<n>ms`, so a slow set can be attributed to a genuine
backend write (set >> lag) vs. a co-victim of a main-thread freeze
(set ≈ lag); and switchTracer's stall classification moved from a size upper
bound (THROTTLE_MS) to foreground visibility, with maxGap always reported.

Console-only; no behavior change. Verification: pnpm check — 0 errors / 0
warnings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matt Toohey <contact@matttoohey.com>
…tch cascade

The timeline recording in the latest analysis proved the worst project
switch is ~670ms of pure JavaScript in the synchronous reactive cascade
(not rendering, layout, the persistentStore write, or PR-polling), but
the export carried no sampling data so it couldn't name the offending
function — and switchTracer's freezeBracket could only point at one
opaque span, "selectProject sync complete → ProjectSection destroy".

Add the finer in-app marks the analysis recommended so the next
reproduction names the stage in the console without a profiler. The marks
subdivide that previously-opaque span into derive / old-subtree teardown
/ new-subtree build:

- ProjectHome.svelte: a `$effect.pre` emits `ProjectHome derive` (with the
  visible-project count). `$effect.pre` runs inside the flush *before* the
  keyed `{#each visibleProjects}` block reconciles, so the mark lands
  between `navigation.selectProject sync complete` and the old
  `ProjectSection destroy` — splitting the freeze bracket into the derive
  stage (up to this mark) and the old-subtree teardown that follows.
- ProjectSection.svelte: a top-level `ProjectSection construct` mark.
  Top-level script runs once per new instance, before the template builds
  any child BranchCards, so the gap to the existing `ProjectSection mount`
  (onMount, fired after all cards are constructed) isolates the new-subtree
  build — BranchCard construction plus each card's synchronous
  timeline-cache hydration (already captured as `BranchCard.hydrateSync`)
  — from the teardown. The construction-time `project.id` read is
  intentional, suppressed with `svelte-ignore state_referenced_locally`
  (same pattern BranchCard already uses for its hydration read).
- switchTracer.ts: doc comment updated to list the new milestones and note
  that a freeze now names its stage (derive vs. teardown vs. build) rather
  than landing in one span.

Console-only; no behavior change (the marks read pure deriveds/props and
`mark()` no-ops when no switch is active). Verification: pnpm check —
0 errors / 0 warnings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matt Toohey <contact@matttoohey.com>
… just the PR-poll fix

The project-switch slowdown will be pursued on a separate branch, so strip
all the diagnostic/logging code that was added here to measure the freeze.
This reverses the three console-only instrumentation commits (65ec3ad,
fd9b7a6, 66d6820) in full, leaving the branch with only the PR-management
fix (the backend-owned PR-poll scheduler, plan phases 0-2).

Removed:
- switchTracer.ts (the whole project-switch tracer module).
- navigation.svelte.ts: the beginSwitch() / "sync complete" marks bracketing
  selectProject.
- App.svelte: the selectedProjectId-effect mark.
- ProjectHome.svelte: the `$effect.pre` "ProjectHome derive" mark (the Phase 0
  pr-status-changed event buffering / rAF flush stays — that is the fix).
- ProjectSection.svelte: the construct/mount/destroy marks and mount/unmount
  counts.
- BranchCard.svelte: the mount/unmount counts and the hydrateSync timing.
- DiffModal.svelte: the selected-file -> painted double-rAF timing log.
- persistentStore.ts: the slow-set warning and its event-loop-lag probe;
  setStoreValue is back to a plain awaited backend.store.set.

The resulting tree is byte-for-byte equal to the phase-2 fix commit
(fa22aa6); no PR-poll-scheduler code is touched.

Verification: `pnpm check` - 0 errors / 0 warnings; no remaining references
to the removed instrumentation.

Signed-off-by: Matt Toohey <contact@matttoohey.com>
@matt2e matt2e requested review from baxen and wesbillman as code owners June 12, 2026 02:48

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c74fce2d32

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

fn record_success(&mut self, project_id: &str, now: i64) -> bool {
self.last_polled_at.insert(project_id.to_string(), now);
self.failures.remove(project_id);
self.forced.remove(project_id);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve forced refreshes queued during in-flight polls

When refresh_now is called for a project that is already in in_flight, due() skips it before consuming forced, so the nudge should survive until the current poll finishes. This removal clears that pending nudge on successful completion, meaning a push/PR-create refresh requested during an existing poll is dropped and the UI may keep pre-push PR status until the normal 15s/60s/5m interval elapses; the old frontend explicitly queued these same-project refreshes after the in-flight operation.

Useful? React with 👍 / 👎.

matt2e added 4 commits June 12, 2026 13:04
Keep refresh_now nudges that arrive while a project refresh is already in flight by only consuming forced entries when the scheduler schedules work, not when an older refresh completes.

Add a scheduler regression covering both successful and failed completions.

Signed-off-by: Matt Toohey <contact@matttoohey.com>
Collect every spawned branch refresh result before deciding whether the project refresh failed. A single branch task panic or join error is now logged and ignored when other branch tasks complete, so the refreshed count is preserved and the final pr-statuses-refreshed event is emitted.

Keep returning an error when every spawned branch refresh task fails, and cover both partial and total task-failure cases with targeted async regressions.

Signed-off-by: Matt Toohey <contact@matttoohey.com>
Validate scheduler pending-branch interest against live branch ids during prune so deleted branches cannot pin a surviving project to the pending poll tier.

Add coverage for the deleted-branch scheduler regression and the store branch-id query.

Signed-off-by: Matt Toohey <contact@matttoohey.com>
Treat branch-level PR status fetch/update failures as failed refreshes in the aggregation path, so an all-false project refresh returns an error and drives scheduler backoff/stale handling.

Keep mixed success/failure refreshes tolerant and cover the all-branches-failed regression.

Signed-off-by: Matt Toohey <contact@matttoohey.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant