Skip to content

feat(069): WebSocket disconnect observability — prove client-side issues with data#2355

Merged
isaiahb merged 2 commits intomainfrom
hotfix/ws-disconnect-observability
Mar 29, 2026
Merged

feat(069): WebSocket disconnect observability — prove client-side issues with data#2355
isaiahb merged 2 commits intomainfrom
hotfix/ws-disconnect-observability

Conversation

@isaiahb
Copy link
Copy Markdown
Contributor

@isaiahb isaiahb commented Mar 29, 2026

What

Adds diagnostic instrumentation that definitively proves whether WebSocket disconnects are client-initiated or server-initiated. Three new structured log events (ws-close, ws-reconnect, ws-dispose) and connection churn counters in SystemVitalsLogger.

Why

The team keeps being told that disconnect churn is a client-side issue, but there's no instrumented proof. Today's production data shows:

  • One user disconnected 30 times in 42 minutes — all code 1006 (client dropped)
  • Another user has an exact 5-minute disconnect cycle — code 1000 (deliberate close)
  • Some users are stable for hours on the same server, same code, same region

But we can't prove the server was healthy at the moment of disconnect because we don't log what was happening. This PR adds that evidence.

What it adds

ws-close event (on every glasses WebSocket close)

{"feature": "ws-close", "code": 1006, "timeSinceLastClientMessage": 34500, "timeSinceLastAppPong": 34200, "sessionDurationSeconds": 142, "reconnectCount": 3}

If timeSinceLastClientMessage > 10000 on a 1006 close → the CLIENT went silent. The server was sending pings every 2 seconds.

ws-reconnect event (on every session reconnection)

{"feature": "ws-reconnect", "reconnectCount": 4, "downtimeMs": 7200, "lastCloseCode": 1006, "timeSinceLastClientMessage": 41700}

ws-dispose event (when session is disposed after grace period)

{"feature": "ws-dispose", "sessionDurationSeconds": 142, "reconnectCount": 3, "lastCloseCode": 1006, "timeSinceLastClientMessage": 94500, "disposalReason": "grace_period_timeout"}

Connection churn in system-vitals (every 30s)

wsDisconnects, wsReconnects, wsAvgDowntimeMs, wsCloseCodeDist

Changes

File Change
UserSession.ts Add 5 tracking fields, ws-reconnect log in createOrReconnect(), ws-dispose log in dispose(), make lastPongTime public
bun-websocket.ts Track lastClientMessageTime/lastAppLevelPongTime on every message, ws-close structured log, connectionChurnTracker.recordDisconnect()
SystemVitalsLogger.ts Add ConnectionChurnTracker class, include churn counters in vitals

Investigation docs

  • 069 spike — full investigation, evidence gap, BetterStack queries to prove client-side issues

…t, ws-dispose events with client silence tracking and connection churn counters
@isaiahb isaiahb requested a review from a team as a code owner March 29, 2026 18:39
@github-actions
Copy link
Copy Markdown

📋 PR Review Helper

📱 Mobile App Build

Waiting for build...

🕶️ ASG Client Build

Waiting for build...


🔀 Test Locally

gh pr checkout 2355

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 29, 2026

📝 Walkthrough

Walkthrough

Adds client-side WebSocket disconnect/reconnect observability: new UserSession fields for reconnects, timestamps, and close info; structured lifecycle logs (ws-close, ws-reconnect, ws-dispose); a ConnectionChurnTracker exported singleton; and SystemVitalsLogger emission of churn metrics every 30s.

Changes

Cohort / File(s) Summary
Specification
cloud/issues/069-ws-disconnect-observability/spike.md
New spike doc specifying session state fields, structured events (ws-close, ws-reconnect, ws-dispose), connection-churn tracking and example queries.
Metrics & Vitals
cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts
Adds ConnectionChurnTracker and exported connectionChurnTracker singleton; SystemVitalsLogger reads/resets tracker every 30s and emits wsDisconnects, wsReconnects, wsAvgDowntimeMs, and wsCloseCodeDist.
Session State & Lifecycle
cloud/packages/cloud/src/services/session/UserSession.ts
Adds public fields: reconnectCount, lastCloseCode, lastCloseReason, lastClientMessageTime, lastAppLevelPongTime; makes lastPongTime public; updates createOrReconnect() to compute downtime, increment reconnectCount, emit ws-reconnect and call tracker; updates dispose() to emit structured ws-dispose.
WebSocket Handling
cloud/packages/cloud/src/services/websocket/bun-websocket.ts
Tracks lastClientMessageTime and lastAppLevelPongTime on messages, stashes close code/reason to session, calls connectionChurnTracker.recordDisconnect(code) on close, and replaces simple close log with structured ws-close metrics (durations, time-since, reconnectCount).

Sequence Diagram

sequenceDiagram
    participant Client
    participant WS as WebSocket Handler
    participant Session as UserSession
    participant Churn as ConnectionChurnTracker
    participant Vitals as SystemVitalsLogger

    Client->>WS: send message
    WS->>Session: lastClientMessageTime = now()
    Client->>WS: send {type: "pong"}
    WS->>Session: lastAppLevelPongTime = now()

    Client-->>WS: disconnect (code, reason)
    WS->>Session: lastCloseCode = code\nlastCloseReason = reason
    WS->>Churn: recordDisconnect(code)
    WS->>WS: emit structured ws-close (duration, silent ms, reconnectCount, metrics)

    Note over Client: downtime window

    Client->>WS: reconnect
    WS->>Session: createOrReconnect() -> compute downtime\nreconnectCount++
    Session->>Churn: recordReconnect(downtimeMs)
    Session->>Session: emit structured ws-reconnect (downtime, deltas)

    Vitals->>Churn: read & reset metrics (30s)
    Churn-->>Vitals: {disconnects, reconnects, avgDowntime, closeCodeDist}
    Vitals->>Vitals: emit vitals including churn fields
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 I hopped through sockets, counted each goodbye—

Reconnects, silent beats, timestamps passing by.
Close codes whispered, churn tallied neat,
Vitals hum a rhythm of every heartbeat.
Crunch the numbers, let observability bloom. 🥕✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title clearly and specifically summarizes the main change: adding WebSocket disconnect observability instrumentation with client-side diagnostic capabilities, matching the core objective of the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch hotfix/ws-disconnect-observability

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78dd826d14

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

);

// Record reconnect for churn tracking in SystemVitalsLogger
connectionChurnTracker.recordReconnect(downtimeMs);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate reconnect metrics on prior disconnect state

createOrReconnect() records a reconnect even when disconnectedAt is null, so any second socket attach while the previous session is still considered connected (for example, reconnect-open arriving before old-close, which is explicitly handled elsewhere) increments reconnectCount/wsReconnects despite no confirmed disconnect. This skews the new churn telemetry and can make stable clients look noisy.

Useful? React with 👍 / 👎.

);

// Record disconnect for churn tracking in SystemVitalsLogger
connectionChurnTracker.recordDisconnect(code);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Count stale-socket closes in disconnect observability

Disconnect accounting is executed only after the stale-WebSocket early return path, so when an old socket closes after a newer one is already active (the race described in this function), recordDisconnect() is never called and no structured ws-close event is emitted for that close. This systematically underreports disconnect churn in fast-reconnect scenarios and weakens the evidence this change is meant to collect.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts (1)

320-328: Consider extracting the IIFE to a local variable for clarity.

The IIFE works correctly but is a bit unconventional. A simple variable assignment would be more readable:

♻️ Optional: Extract to local variable
       const operationSnapshot = operationTimers.getAndReset();
       const totalOperationMs = Object.values(operationSnapshot).reduce((a, b) => a + b, 0);
+      const churn = connectionChurnTracker.getAndReset();

       logger.info(
         {
           // ... existing fields ...

-          // Connection churn — the key evidence for client-side vs server-side disconnects.
-          // If disconnects >> reconnects, sessions are being disposed (not surviving grace period).
-          // Close code distribution tells us WHO is killing the connection:
-          //   1006 = abnormal (client went dark / network loss — CLIENT-SIDE)
-          //   1001 = going away (server shutting down — SERVER-SIDE)
-          //   1000 = normal close (clean disconnect — EITHER SIDE)
-          ...(() => {
-            const churn = connectionChurnTracker.getAndReset();
-            return {
-              wsDisconnects: churn.disconnects,
-              wsReconnects: churn.reconnects,
-              wsAvgDowntimeMs: churn.avgDowntimeMs,
-              wsCloseCodeDist: Object.keys(churn.closeCodes).length > 0 ? JSON.stringify(churn.closeCodes) : undefined,
-            };
-          })(),
+          // Connection churn — the key evidence for client-side vs server-side disconnects.
+          wsDisconnects: churn.disconnects,
+          wsReconnects: churn.reconnects,
+          wsAvgDowntimeMs: churn.avgDowntimeMs,
+          wsCloseCodeDist: Object.keys(churn.closeCodes).length > 0 ? JSON.stringify(churn.closeCodes) : undefined,

           // Uptime
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts` around lines
320 - 328, Extract the IIFE that builds the churn object into a clearly named
local variable to improve readability: call connectionChurnTracker.getAndReset()
into a variable (e.g., churn or churnMetrics) and then create the object with
wsDisconnects, wsReconnects, wsAvgDowntimeMs and wsCloseCodeDist, replacing the
inline (() => { ... })() expression; update the surrounding code that used the
IIFE to use this new local variable instead (symbols to touch:
connectionChurnTracker.getAndReset(), wsDisconnects, wsReconnects,
wsAvgDowntimeMs, wsCloseCodeDist).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts`:
- Around line 320-328: Extract the IIFE that builds the churn object into a
clearly named local variable to improve readability: call
connectionChurnTracker.getAndReset() into a variable (e.g., churn or
churnMetrics) and then create the object with wsDisconnects, wsReconnects,
wsAvgDowntimeMs and wsCloseCodeDist, replacing the inline (() => { ... })()
expression; update the surrounding code that used the IIFE to use this new local
variable instead (symbols to touch: connectionChurnTracker.getAndReset(),
wsDisconnects, wsReconnects, wsAvgDowntimeMs, wsCloseCodeDist).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0be65fa0-9efa-4afd-96d0-21925f8afb1b

📥 Commits

Reviewing files that changed from the base of the PR and between 475c667 and 78dd826.

📒 Files selected for processing (4)
  • cloud/issues/069-ws-disconnect-observability/spike.md
  • cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts
  • cloud/packages/cloud/src/services/session/UserSession.ts
  • cloud/packages/cloud/src/services/websocket/bun-websocket.ts

@isaiahb
Copy link
Copy Markdown
Contributor Author

isaiahb commented Mar 29, 2026

Addressed review comments in 0436f0c:

Codex P2 #1 (Gate reconnect metrics on prior disconnect state): Valid — fixed. ws-reconnect log and connectionChurnTracker.recordReconnect() now only fire when disconnectedAt is set. If a new WebSocket arrives before the old one closes (upgrade, not a reconnect), we log it separately without incrementing the reconnect counter.

Codex P2 #2 (Count stale-socket closes in disconnect observability): Intentionally excluded. Stale closes happen when a new WebSocket is already active — the session is still connected. Counting these as disconnects would inflate the churn metrics and produce misleading data. The stale path already logs "stale — newer WebSocket already active, ignoring" for debugging.

CodeRabbit (Extract IIFE): Minor style preference — keeping as-is for consistency with the existing operationTimers pattern in the same function.

@isaiahb isaiahb self-assigned this Mar 29, 2026
@isaiahb isaiahb merged commit 06a36cd into main Mar 29, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant