
hotfix: remove gc-after-disconnect, add /livez liveness probe, enable Porter metrics scraping#2353

Merged
isaiahb merged 5 commits into main from
hotfix/remove-forced-gc-add-livez
Mar 29, 2026

Conversation

Contributor

@isaiahb isaiahb commented Mar 29, 2026

What

Three changes to improve pod stability and observability:

  1. Remove gc-after-disconnect — eliminates forced GC blocking on every session disconnect
  2. Configure separate liveness/readiness probes — liveness → /livez (zero computation, 3s timeout), readiness → /health (5s timeout)
  3. Enable Prometheus metrics scraping — session count, event loop lag, UDP stats visible in Porter's dashboard

Why

gc-after-disconnect

The forced GC was confirmed wasteful: on US Central it ran 31 times per hour, blocked the event loop for 2,242ms in total, and freed 0 bytes every single time. All objects on the heap are live session data, so there is nothing to collect after a single disconnect. The call only adds event loop blocking during disconnect storms, which contributes to liveness probe failures.
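
For illustration, here is a minimal sketch of how a forced GC could be measured to reach numbers like these. It is not the code removed from UserSession.ts: `measureForcedGc`, `summarizeGcSamples`, and the `GcDeps` shape are hypothetical names, and the runtime calls are injected so the helper runs anywhere. Under Bun, `forceGc` would plausibly be `() => Bun.gc(true)` and `heapUsed` would be `() => process.memoryUsage().heapUsed`. If the 2,242ms total covers the same 31 calls, the mean works out to roughly 72ms of blocking per call.

```typescript
// Hypothetical helper: names and shapes are illustrative, not taken from UserSession.ts.
interface GcSample {
  blockedMs: number;  // wall-clock time the synchronous GC call took
  freedBytes: number; // heap bytes reclaimed (clamped to 0 if the heap grew)
}

interface GcDeps {
  forceGc: () => void;    // assumption: e.g. () => Bun.gc(true) under Bun
  heapUsed: () => number; // assumption: e.g. () => process.memoryUsage().heapUsed
  nowMs: () => number;    // assumption: e.g. () => performance.now()
}

// Runs one forced GC and records how long it blocked and how much it freed.
function measureForcedGc(deps: GcDeps): GcSample {
  const heapBefore = deps.heapUsed();
  const start = deps.nowMs();
  deps.forceGc(); // synchronous: the event loop is blocked for the duration
  const blockedMs = deps.nowMs() - start;
  const freedBytes = Math.max(0, heapBefore - deps.heapUsed());
  return { blockedMs, freedBytes };
}

// Aggregates samples into the kind of totals quoted in the PR description.
function summarizeGcSamples(samples: readonly GcSample[]) {
  const totalBlockedMs = samples.reduce((sum, s) => sum + s.blockedMs, 0);
  const totalFreedBytes = samples.reduce((sum, s) => sum + s.freedBytes, 0);
  const meanBlockedMs = samples.length > 0 ? totalBlockedMs / samples.length : 0;
  return { calls: samples.length, totalBlockedMs, totalFreedBytes, meanBlockedMs };
}
```

A summary with a non-trivial `totalBlockedMs` and a `totalFreedBytes` of 0 is the signature of a GC call that costs latency without reclaiming anything.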

Liveness probe on /livez

Previously, liveness and readiness both hit /health, which iterates all sessions, counts WebSockets, updates metric gauges, and serializes JSON. With 60 sessions under load, this could take >1 second, causing liveness failures and then a SIGKILL. /livez just returns "ok": if the event loop can return 2 bytes, the process is alive. /health stays as the readiness probe, so a slow response gets the pod removed from the LB gracefully instead of killed.
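
The difference between the two endpoints can be sketched roughly as follows. This is an assumption-laden illustration, not the service's actual handlers: `SessionLike`, `ProbeResponse`, `handleLivez`, `handleHealth`, and `routeProbe` are hypothetical names, and the real /health also updates metric gauges, which is omitted here.

```typescript
// Hypothetical types; the real UserSession shape is not shown in this PR.
interface SessionLike {
  openWebSockets: number;
}

interface ProbeResponse {
  status: number;
  contentType: string;
  body: string;
}

// Liveness: constant time, touches no application state, returns 2 bytes.
function handleLivez(): ProbeResponse {
  return { status: 200, contentType: "text/plain", body: "ok" };
}

// Readiness: cost grows with the number of sessions (iteration plus JSON serialization).
function handleHealth(sessions: readonly SessionLike[]): ProbeResponse {
  let sessionCount = 0;
  let webSocketCount = 0;
  for (const session of sessions) {
    sessionCount += 1;
    webSocketCount += session.openWebSockets;
  }
  const body = JSON.stringify({
    status: "ok",
    sessions: sessionCount,
    webSockets: webSocketCount,
  });
  return { status: 200, contentType: "application/json", body };
}

// Minimal router: only the two probe paths are handled in this sketch.
function routeProbe(path: string, sessions: readonly SessionLike[]): ProbeResponse {
  if (path === "/livez") return handleLivez();
  if (path === "/health") return handleHealth(sessions);
  return { status: 404, contentType: "text/plain", body: "not found" };
}
```

Because `handleLivez` does no work proportional to load, a timeout on it points at a stalled event loop rather than at an expensive handler.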

Metrics scraping

The /metrics endpoint already exposes Prometheus gauges (mentra_user_sessions, mentra_event_loop_lag_ms, mentra_miniapp_sessions, UDP stats, WS message counts). Enabling metricsScraping in porter.yaml tells Porter's Prometheus to scrape it, making these visible in Porter's built-in dashboard. The whole team can see session count and event loop health without needing BetterStack access.
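
As a rough picture of what a scraper reads from /metrics, the sketch below renders gauges in the Prometheus text exposition format. The gauge names come from the description above; `renderPrometheusGauges`, the `Gauge` type, and any help strings or values passed to it are hypothetical, and the service's real endpoint may well be produced by a metrics library instead.

```typescript
// Hypothetical gauge record; help text and values are supplied by the caller.
interface Gauge {
  name: string;
  help: string;
  value: number;
}

// Renders gauges in the Prometheus text exposition format:
//   # HELP <name> <help>
//   # TYPE <name> gauge
//   <name> <value>
function renderPrometheusGauges(gauges: readonly Gauge[]): string {
  const lines: string[] = [];
  for (const gauge of gauges) {
    lines.push(`# HELP ${gauge.name} ${gauge.help}`);
    lines.push(`# TYPE ${gauge.name} gauge`);
    lines.push(`${gauge.name} ${gauge.value}`);
  }
  // The format expects a trailing newline after the last sample.
  return lines.join("\n") + "\n";
}
```

For example, passing `{ name: "mentra_user_sessions", help: "Active user sessions", value: 1 }` yields three lines ending in `mentra_user_sessions 1`, which is the sample Porter's Prometheus would record on each scrape.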

Tested on

Deployed to cloud-debug on US Central. Verified:

  • Logs flowing via Vector with structured fields
  • Zero gc-after-disconnect events (removal confirmed working)
  • GC probes healthy (12-23ms)
  • 1 session connected, stable RSS

Changes

| File | Change |
| --- | --- |
| UserSession.ts | Remove gc-after-disconnect block, canRunPostDisconnectGc(), static fields |
| porter.yaml | Add livenessCheck/livez, readinessCheck/health, metricsScraping, terminationGracePeriodSeconds |
| porter-debug.yml | Add branch to debug deploy workflow |
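
For readers unfamiliar with the removed pieces, a cooldown gate of the kind that `canRunPostDisconnectGc()` and its static fields suggest might look like the sketch below. The removed code itself is not shown on this page, so this is a guess at the general shape: the wrapper class `PostDisconnectGcGate`, the 60000ms cooldown value, and the explicit `nowMs` parameter are all assumptions.

```typescript
// Hypothetical reconstruction of a cooldown gate; not the code that was removed.
class PostDisconnectGcGate {
  // Assumed value: the real cooldown is not stated in the PR.
  static readonly POST_DISCONNECT_GC_COOLDOWN_MS = 60000;

  // Timestamp of the last permitted run; -Infinity means "never ran".
  private static lastPostDisconnectGc = Number.NEGATIVE_INFINITY;

  // Returns true at most once per cooldown window and records the run time.
  static canRunPostDisconnectGc(nowMs: number): boolean {
    const elapsed = nowMs - this.lastPostDisconnectGc;
    if (elapsed < this.POST_DISCONNECT_GC_COOLDOWN_MS) {
      return false;
    }
    this.lastPostDisconnectGc = nowMs;
    return true;
  }
}
```

A gate like this caps how often the forced GC runs, but each permitted run still blocks the event loop, which is why the PR removes the GC call outright instead of tuning the cooldown.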

Summary by CodeRabbit

  • Chores
    • Removed automatic garbage collection behavior triggered after user disconnects.
    • Added liveness and readiness health check endpoints.
    • Enabled Prometheus metrics scraping for service monitoring.
    • Extended graceful shutdown period to 30 seconds.

…king, 0 bytes freed), contributes to liveness probe failures under load
@isaiahb isaiahb requested a review from a team as a code owner March 29, 2026 13:19
@github-actions

📋 PR Review Helper

📱 Mobile App Build

Waiting for build...

🕶️ ASG Client Build

Waiting for build...


🔀 Test Locally

gh pr checkout 2353

@coderabbitai

coderabbitai bot commented Mar 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e38b248c-3431-4de2-8903-1139552e2f0e

📥 Commits

Reviewing files that changed from the base of the PR and between 579a54a and 24ecba5.

📒 Files selected for processing (1)
  • cloud/porter.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • cloud/porter.yaml

📝 Walkthrough


Removed the post-disconnect forced GC from UserSession.dispose(), added Kubernetes-style liveness/readiness probes, metrics scraping, and a 30s termination grace period in cloud/porter.yaml, and expanded a CI workflow trigger to include branch hotfix/remove-forced-gc-add-livez.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| UserSession cleanup (cloud/packages/cloud/src/services/session/UserSession.ts) | Deleted post-disconnect setTimeout that invoked Bun.gc(true), removed heap/RSS delta measurement and logging, and removed static rate-limiter state and helper (lastPostDisconnectGc, POST_DISCONNECT_GC_COOLDOWN_MS, canRunPostDisconnectGc()); simplified dispose() body. |
| Kubernetes probes & metrics (cloud/porter.yaml) | Added livenessCheck (httpPath: /livez, timeoutSeconds: 3, initialDelaySeconds: 15), readinessCheck (httpPath: /health, timeoutSeconds: 5, initialDelaySeconds: 15), enabled metricsScraping (path: /metrics, port: 80, scrapeIntervalSeconds: 30), and set terminationGracePeriodSeconds: 30. |
| CI trigger (.github/workflows/porter-debug.yml) | Expanded push trigger to include branch hotfix/remove-forced-gc-add-livez (no other workflow changes). |
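
To make the effect of the two timeouts concrete, the following sketch models how an orchestrator reacts to slow probe responses. It is a simplified model, not Porter's or Kubernetes' implementation: `actionAfterProbes`, `ProbeConfig`, and the action names are hypothetical, and the failure threshold of 3 in the usage note is the Kubernetes default, assumed here because porter.yaml does not appear to set one.

```typescript
type ProbeKind = "liveness" | "readiness";
type ProbeAction = "none" | "restart-container" | "remove-from-load-balancer";

interface ProbeConfig {
  kind: ProbeKind;
  timeoutSeconds: number;   // 3 for /livez and 5 for /health in this PR
  failureThreshold: number; // assumption: consecutive failures needed before acting
}

// Walks a series of observed probe latencies and returns the first action triggered.
function actionAfterProbes(config: ProbeConfig, latenciesMs: readonly number[]): ProbeAction {
  let consecutiveFailures = 0;
  for (const latencyMs of latenciesMs) {
    const timedOut = latencyMs > config.timeoutSeconds * 1000;
    // A single success resets the failure streak.
    consecutiveFailures = timedOut ? consecutiveFailures + 1 : 0;
    if (consecutiveFailures >= config.failureThreshold) {
      return config.kind === "liveness" ? "restart-container" : "remove-from-load-balancer";
    }
  }
  return "none";
}
```

With `{ kind: "readiness", timeoutSeconds: 5, failureThreshold: 3 }`, three consecutive 6-second /health responses only take the pod out of rotation, whereas the same streak against a liveness probe would restart the container. That asymmetry is the reason for pointing liveness at the cheap /livez path.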

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I hopped through code at break of dawn,

No sudden GC to startle the lawn,
Probes keep watch while metrics sing,
Graceful exits softly bring,
My burrow’s tidy—joys are drawn.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes all three main changes in the pull request: removing forced GC on disconnect, adding the /livez liveness probe, and enabling Prometheus metrics scraping. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


…obes in porter.yaml — liveness is zero-computation, 3s timeout

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
cloud/porter.yaml (1)

45-49: Optional: deduplicate repeated ingress timeout rationale comments.

Lines 45–49 repeat the rationale already documented at lines 21–25. Keeping one canonical block will reduce drift risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cloud/porter.yaml` around lines 45 - 49, The YAML contains a duplicated
comment block explaining "Extended timeouts for WebSocket connections
(/glasses-ws, /app-ws)" (same rationale appears twice); remove the repeated
comment so only the canonical rationale block remains (keep the first occurrence
that documents the 3600s proxy timeout rationale and delete the later duplicate)
to avoid drift while preserving the explanation for the nginx/Porter timeout
tweak.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 352dc3cf-48ff-4840-ab1c-e8ce1effe435

📥 Commits

Reviewing files that changed from the base of the PR and between a22fd53 and 0377ade.

📒 Files selected for processing (1)
  • cloud/porter.yaml

@isaiahb isaiahb changed the title from "hotfix: remove gc-after-disconnect — eliminates forced GC blocking on every session disconnect" to "hotfix: remove gc-after-disconnect, add /livez liveness probe, enable Porter metrics scraping" Mar 29, 2026
@isaiahb isaiahb self-assigned this Mar 29, 2026
@isaiahb isaiahb merged commit c4bb538 into main Mar 29, 2026
10 checks passed