feat(069): WebSocket disconnect observability — prove client-side issues with data by isaiahb · Pull Request #2355 · Mentra-Community/MentraOS

isaiahb · 2026-03-29T18:39:18Z

What

Adds diagnostic instrumentation that definitively proves whether WebSocket disconnects are client-initiated or server-initiated. Three new structured log events (ws-close, ws-reconnect, ws-dispose) and connection churn counters in SystemVitalsLogger.

Why

The team keeps being told that disconnect churn is a client-side issue, but there's no instrumented proof. Today's production data shows:

One user disconnected 30 times in 42 minutes — all code 1006 (client dropped)
Another user has an exact 5-minute disconnect cycle — code 1000 (deliberate close)
Some users are stable for hours on the same server, same code, same region

But we can't prove the server was healthy at the moment of disconnect because we don't log what was happening. This PR adds that evidence.

What it adds

`ws-close` event (on every glasses WebSocket close)

{"feature": "ws-close", "code": 1006, "timeSinceLastClientMessage": 34500, "timeSinceLastAppPong": 34200, "sessionDurationSeconds": 142, "reconnectCount": 3}

If timeSinceLastClientMessage > 10000 on a 1006 close → the CLIENT went silent. The server was sending pings every 2 seconds.

`ws-reconnect` event (on every session reconnection)

{"feature": "ws-reconnect", "reconnectCount": 4, "downtimeMs": 7200, "lastCloseCode": 1006, "timeSinceLastClientMessage": 41700}

`ws-dispose` event (when session is disposed after grace period)

{"feature": "ws-dispose", "sessionDurationSeconds": 142, "reconnectCount": 3, "lastCloseCode": 1006, "timeSinceLastClientMessage": 94500, "disposalReason": "grace_period_timeout"}

Connection churn in system-vitals (every 30s)

wsDisconnects, wsReconnects, wsAvgDowntimeMs, wsCloseCodeDist

Changes

File	Change
`UserSession.ts`	Add 5 tracking fields, `ws-reconnect` log in `createOrReconnect()`, `ws-dispose` log in `dispose()`, make `lastPongTime` public
`bun-websocket.ts`	Track `lastClientMessageTime`/`lastAppLevelPongTime` on every message, `ws-close` structured log, `connectionChurnTracker.recordDisconnect()`
`SystemVitalsLogger.ts`	Add `ConnectionChurnTracker` class, include churn counters in vitals

Investigation docs

069 spike — full investigation, evidence gap, BetterStack queries to prove client-side issues

…t, ws-dispose events with client silence tracking and connection churn counters

github-actions · 2026-03-29T18:39:26Z

📋 PR Review Helper

📱 Mobile App Build

⏳ Waiting for build...

🕶️ ASG Client Build

⏳ Waiting for build...

🔀 Test Locally

gh pr checkout 2355

coderabbitai · 2026-03-29T18:39:47Z

📝 Walkthrough

Walkthrough

Adds client-side WebSocket disconnect/reconnect observability: new UserSession fields for reconnects, timestamps, and close info; structured lifecycle logs (ws-close, ws-reconnect, ws-dispose); a ConnectionChurnTracker exported singleton; and SystemVitalsLogger emission of churn metrics every 30s.

Changes

Cohort / File(s)	Summary
Specification `cloud/issues/069-ws-disconnect-observability/spike.md`	New spike doc specifying session state fields, structured events (`ws-close`, `ws-reconnect`, `ws-dispose`), connection-churn tracking and example queries.
Metrics & Vitals `cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts`	Adds `ConnectionChurnTracker` and exported `connectionChurnTracker` singleton; SystemVitalsLogger reads/resets tracker every 30s and emits `wsDisconnects`, `wsReconnects`, `wsAvgDowntimeMs`, and `wsCloseCodeDist`.
Session State & Lifecycle `cloud/packages/cloud/src/services/session/UserSession.ts`	Adds public fields: `reconnectCount`, `lastCloseCode`, `lastCloseReason`, `lastClientMessageTime`, `lastAppLevelPongTime`; makes `lastPongTime` public; updates `createOrReconnect()` to compute downtime, increment `reconnectCount`, emit `ws-reconnect` and call tracker; updates `dispose()` to emit structured `ws-dispose`.
WebSocket Handling `cloud/packages/cloud/src/services/websocket/bun-websocket.ts`	Tracks `lastClientMessageTime` and `lastAppLevelPongTime` on messages, stashes close `code`/`reason` to session, calls `connectionChurnTracker.recordDisconnect(code)` on close, and replaces simple close log with structured `ws-close` metrics (durations, time-since, reconnectCount).

Sequence Diagram

sequenceDiagram
    participant Client
    participant WS as WebSocket Handler
    participant Session as UserSession
    participant Churn as ConnectionChurnTracker
    participant Vitals as SystemVitalsLogger

    Client->>WS: send message
    WS->>Session: lastClientMessageTime = now()
    Client->>WS: send {type: "pong"}
    WS->>Session: lastAppLevelPongTime = now()

    Client-->>WS: disconnect (code, reason)
    WS->>Session: lastCloseCode = code\nlastCloseReason = reason
    WS->>Churn: recordDisconnect(code)
    WS->>WS: emit structured ws-close (duration, silent ms, reconnectCount, metrics)

    Note over Client: downtime window

    Client->>WS: reconnect
    WS->>Session: createOrReconnect() -> compute downtime\nreconnectCount++
    Session->>Churn: recordReconnect(downtimeMs)
    Session->>Session: emit structured ws-reconnect (downtime, deltas)

    Vitals->>Churn: read & reset metrics (30s)
    Churn-->>Vitals: {disconnects, reconnects, avgDowntime, closeCodeDist}
    Vitals->>Vitals: emit vitals including churn fields

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

hotfix: remove gc-after-disconnect, add /livez liveness probe, enable Porter metrics scraping #2353: Modifies UserSession.dispose(); touches same disposal/logging area instrumented here.

Poem

🐰 I hopped through sockets, counted each goodbye—

Reconnects, silent beats, timestamps passing by.
Close codes whispered, churn tallied neat,
Vitals hum a rhythm of every heartbeat.
Crunch the numbers, let observability bloom. 🥕✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title clearly and specifically summarizes the main change: adding WebSocket disconnect observability instrumentation with client-side diagnostic capabilities, matching the core objective of the changeset.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch hotfix/ws-disconnect-observability

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78dd826d14

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-29T18:43:50Z

cloud/packages/cloud/src/services/session/UserSession.ts

      );

+      // Record reconnect for churn tracking in SystemVitalsLogger
+      connectionChurnTracker.recordReconnect(downtimeMs);


Gate reconnect metrics on prior disconnect state

createOrReconnect() records a reconnect even when disconnectedAt is null, so any second socket attach while the previous session is still considered connected (for example, reconnect-open arriving before old-close, which is explicitly handled elsewhere) increments reconnectCount/wsReconnects despite no confirmed disconnect. This skews the new churn telemetry and can make stable clients look noisy.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-29T18:43:50Z

cloud/packages/cloud/src/services/websocket/bun-websocket.ts

+  );
+
+  // Record disconnect for churn tracking in SystemVitalsLogger
+  connectionChurnTracker.recordDisconnect(code);


Count stale-socket closes in disconnect observability

Disconnect accounting is executed only after the stale-WebSocket early return path, so when an old socket closes after a newer one is already active (the race described in this function), recordDisconnect() is never called and no structured ws-close event is emitted for that close. This systematically underreports disconnect churn in fast-reconnect scenarios and weakens the evidence this change is meant to collect.

Useful? React with 👍 / 👎.

coderabbitai

🧹 Nitpick comments (1)

cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts (1)

320-328: Consider extracting the IIFE to a local variable for clarity.

The IIFE works correctly but is a bit unconventional. A simple variable assignment would be more readable:

♻️ Optional: Extract to local variable

       const operationSnapshot = operationTimers.getAndReset();
       const totalOperationMs = Object.values(operationSnapshot).reduce((a, b) => a + b, 0);
+      const churn = connectionChurnTracker.getAndReset();

       logger.info(
         {
           // ... existing fields ...

-          // Connection churn — the key evidence for client-side vs server-side disconnects.
-          // If disconnects >> reconnects, sessions are being disposed (not surviving grace period).
-          // Close code distribution tells us WHO is killing the connection:
-          //   1006 = abnormal (client went dark / network loss — CLIENT-SIDE)
-          //   1001 = going away (server shutting down — SERVER-SIDE)
-          //   1000 = normal close (clean disconnect — EITHER SIDE)
-          ...(() => {
-            const churn = connectionChurnTracker.getAndReset();
-            return {
-              wsDisconnects: churn.disconnects,
-              wsReconnects: churn.reconnects,
-              wsAvgDowntimeMs: churn.avgDowntimeMs,
-              wsCloseCodeDist: Object.keys(churn.closeCodes).length > 0 ? JSON.stringify(churn.closeCodes) : undefined,
-            };
-          })(),
+          // Connection churn — the key evidence for client-side vs server-side disconnects.
+          wsDisconnects: churn.disconnects,
+          wsReconnects: churn.reconnects,
+          wsAvgDowntimeMs: churn.avgDowntimeMs,
+          wsCloseCodeDist: Object.keys(churn.closeCodes).length > 0 ? JSON.stringify(churn.closeCodes) : undefined,

           // Uptime

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts` around lines
320 - 328, Extract the IIFE that builds the churn object into a clearly named
local variable to improve readability: call connectionChurnTracker.getAndReset()
into a variable (e.g., churn or churnMetrics) and then create the object with
wsDisconnects, wsReconnects, wsAvgDowntimeMs and wsCloseCodeDist, replacing the
inline (() => { ... })() expression; update the surrounding code that used the
IIFE to use this new local variable instead (symbols to touch:
connectionChurnTracker.getAndReset(), wsDisconnects, wsReconnects,
wsAvgDowntimeMs, wsCloseCodeDist).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts`:
- Around line 320-328: Extract the IIFE that builds the churn object into a
clearly named local variable to improve readability: call
connectionChurnTracker.getAndReset() into a variable (e.g., churn or
churnMetrics) and then create the object with wsDisconnects, wsReconnects,
wsAvgDowntimeMs and wsCloseCodeDist, replacing the inline (() => { ... })()
expression; update the surrounding code that used the IIFE to use this new local
variable instead (symbols to touch: connectionChurnTracker.getAndReset(),
wsDisconnects, wsReconnects, wsAvgDowntimeMs, wsCloseCodeDist).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0be65fa0-9efa-4afd-96d0-21925f8afb1b

📥 Commits

Reviewing files that changed from the base of the PR and between 475c667 and 78dd826.

📒 Files selected for processing (4)

cloud/issues/069-ws-disconnect-observability/spike.md
cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts
cloud/packages/cloud/src/services/session/UserSession.ts
cloud/packages/cloud/src/services/websocket/bun-websocket.ts

…ocket upgrades as reconnects (Codex review)

isaiahb · 2026-03-29T18:46:20Z

Addressed review comments in 0436f0c:

Codex P2 #1 (Gate reconnect metrics on prior disconnect state): Valid — fixed. ws-reconnect log and connectionChurnTracker.recordReconnect() now only fire when disconnectedAt is set. If a new WebSocket arrives before the old one closes (upgrade, not a reconnect), we log it separately without incrementing the reconnect counter.

Codex P2 #2 (Count stale-socket closes in disconnect observability): Intentionally excluded. Stale closes happen when a new WebSocket is already active — the session is still connected. Counting these as disconnects would inflate the churn metrics and produce misleading data. The stale path already logs "stale — newer WebSocket already active, ignoring" for debugging.

CodeRabbit (Extract IIFE): Minor style preference — keeping as-is for consistency with the existing operationTimers pattern in the same function.

feat(069): WebSocket disconnect observability — ws-close, ws-reconnec…

78dd826

…t, ws-dispose events with client silence tracking and connection churn counters

isaiahb requested a review from a team as a code owner March 29, 2026 18:39

chatgpt-codex-connector bot reviewed Mar 29, 2026

View reviewed changes

coderabbitai bot reviewed Mar 29, 2026

View reviewed changes

fix(069): gate reconnect metrics on disconnectedAt — don't count WebS…

0436f0c

…ocket upgrades as reconnects (Codex review)

isaiahb self-assigned this Mar 29, 2026

isaiahb merged commit 06a36cd into main Mar 29, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(069): WebSocket disconnect observability — prove client-side issues with data#2355

feat(069): WebSocket disconnect observability — prove client-side issues with data#2355
isaiahb merged 2 commits intomainfrom
hotfix/ws-disconnect-observability

isaiahb commented Mar 29, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 29, 2026

Uh oh!

coderabbitai bot commented Mar 29, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 29, 2026

Uh oh!

chatgpt-codex-connector bot Mar 29, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

isaiahb commented Mar 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

isaiahb commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

What it adds

ws-close event (on every glasses WebSocket close)

ws-reconnect event (on every session reconnection)

ws-dispose event (when session is disposed after grace period)

Connection churn in system-vitals (every 30s)

Changes

Investigation docs

Uh oh!

github-actions bot commented Mar 29, 2026

📋 PR Review Helper

📱 Mobile App Build

🕶️ ASG Client Build

🔀 Test Locally

Uh oh!

coderabbitai bot commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

isaiahb commented Mar 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

isaiahb commented Mar 29, 2026 •

edited

Loading

`ws-close` event (on every glasses WebSocket close)

`ws-reconnect` event (on every session reconnection)

`ws-dispose` event (when session is disposed after grace period)

coderabbitai bot commented Mar 29, 2026 •

edited

Loading