feat(069): WebSocket disconnect observability — prove client-side issues with data #2355
…t, ws-dispose events with client silence tracking and connection churn counters
📝 Walkthrough

Adds client-side WebSocket disconnect/reconnect observability: new UserSession fields for reconnects, timestamps, and close info; structured lifecycle logs (`ws-close`, `ws-reconnect`, `ws-dispose`); and connection churn counters in SystemVitalsLogger.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant WS as WebSocket Handler
    participant Session as UserSession
    participant Churn as ConnectionChurnTracker
    participant Vitals as SystemVitalsLogger
    Client->>WS: send message
    WS->>Session: lastClientMessageTime = now()
    Client->>WS: send {type: "pong"}
    WS->>Session: lastAppLevelPongTime = now()
    Client-->>WS: disconnect (code, reason)
    WS->>Session: lastCloseCode = code<br/>lastCloseReason = reason
    WS->>Churn: recordDisconnect(code)
    WS->>WS: emit structured ws-close (duration, silent ms, reconnectCount, metrics)
    Note over Client: downtime window
    Client->>WS: reconnect
    WS->>Session: createOrReconnect() -> compute downtime<br/>reconnectCount++
    Session->>Churn: recordReconnect(downtimeMs)
    Session->>Session: emit structured ws-reconnect (downtime, deltas)
    Vitals->>Churn: read & reset metrics (30s)
    Churn-->>Vitals: {disconnects, reconnects, avgDowntime, closeCodeDist}
    Vitals->>Vitals: emit vitals including churn fields
```
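The close path in the diagram can be sketched as follows. This is an illustrative sketch, not the PR's actual handler: the field names follow the `ws-close` example in the PR description, but the `SessionTimes` shape and the function itself are assumptions.

```typescript
// Sketch: derive a ws-close log payload from session timestamps.
// All timestamps are epoch milliseconds.
interface SessionTimes {
  lastClientMessageTime: number; // updated on every inbound client message
  lastAppLevelPongTime: number;  // updated on every app-level {type: "pong"}
  connectedAt: number;           // when this socket attached
  reconnectCount: number;        // incremented on each reconnect
}

function buildWsCloseEvent(s: SessionTimes, code: number, now: number) {
  return {
    feature: "ws-close",
    code,
    timeSinceLastClientMessage: now - s.lastClientMessageTime,
    timeSinceLastAppPong: now - s.lastAppLevelPongTime,
    sessionDurationSeconds: Math.round((now - s.connectedAt) / 1000),
    reconnectCount: s.reconnectCount,
  };
}
```

A large `timeSinceLastClientMessage` paired with close code 1006 is the "client went silent" signature the PR is designed to surface.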
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 78dd826d14
```ts
);

// Record reconnect for churn tracking in SystemVitalsLogger
connectionChurnTracker.recordReconnect(downtimeMs);
```
Gate reconnect metrics on prior disconnect state
createOrReconnect() records a reconnect even when disconnectedAt is null, so any second socket attach while the previous session is still considered connected (for example, reconnect-open arriving before old-close, which is explicitly handled elsewhere) increments reconnectCount/wsReconnects despite no confirmed disconnect. This skews the new churn telemetry and can make stable clients look noisy.
```ts
);

// Record disconnect for churn tracking in SystemVitalsLogger
connectionChurnTracker.recordDisconnect(code);
```
Count stale-socket closes in disconnect observability
Disconnect accounting is executed only after the stale-WebSocket early return path, so when an old socket closes after a newer one is already active (the race described in this function), recordDisconnect() is never called and no structured ws-close event is emitted for that close. This systematically underreports disconnect churn in fast-reconnect scenarios and weakens the evidence this change is meant to collect.
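The ordering issue this finding describes can be illustrated with a small sketch. This is not the PR's actual handler; whether stale closes should count at all is a policy question (the author's reply below argues they should not), so the sketch exposes it as an explicit flag.

```typescript
// Sketch: disconnect accounting placed after a stale-socket early return is
// skipped when an old socket closes after a newer one is already active.
interface ChurnTracker {
  recordDisconnect(code: number): void;
}

function onClose(
  isStaleSocket: boolean,    // true when a newer socket is already active for this session
  code: number,
  tracker: ChurnTracker,
  countStaleCloses: boolean, // policy knob; this PR intentionally treats it as false
): boolean {
  if (isStaleSocket && !countStaleCloses) {
    // Stale path: the session is still connected via the newer socket,
    // so this close is not treated as a disconnect.
    return false;
  }
  tracker.recordDisconnect(code);
  return true;
}
```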
🧹 Nitpick comments (1)
cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts (1)
Lines 320-328: Consider extracting the IIFE to a local variable for clarity. The IIFE works correctly but is a bit unconventional. A simple variable assignment would be more readable:
♻️ Optional: Extract to local variable

```diff
 const operationSnapshot = operationTimers.getAndReset();
 const totalOperationMs = Object.values(operationSnapshot).reduce((a, b) => a + b, 0);
+const churn = connectionChurnTracker.getAndReset();
 logger.info(
   {
     // ... existing fields ...
-    // Connection churn — the key evidence for client-side vs server-side disconnects.
-    // If disconnects >> reconnects, sessions are being disposed (not surviving grace period).
-    // Close code distribution tells us WHO is killing the connection:
-    //   1006 = abnormal (client went dark / network loss — CLIENT-SIDE)
-    //   1001 = going away (server shutting down — SERVER-SIDE)
-    //   1000 = normal close (clean disconnect — EITHER SIDE)
-    ...(() => {
-      const churn = connectionChurnTracker.getAndReset();
-      return {
-        wsDisconnects: churn.disconnects,
-        wsReconnects: churn.reconnects,
-        wsAvgDowntimeMs: churn.avgDowntimeMs,
-        wsCloseCodeDist: Object.keys(churn.closeCodes).length > 0 ? JSON.stringify(churn.closeCodes) : undefined,
-      };
-    })(),
+    // Connection churn — the key evidence for client-side vs server-side disconnects.
+    wsDisconnects: churn.disconnects,
+    wsReconnects: churn.reconnects,
+    wsAvgDowntimeMs: churn.avgDowntimeMs,
+    wsCloseCodeDist: Object.keys(churn.closeCodes).length > 0 ? JSON.stringify(churn.closeCodes) : undefined,
     // Uptime
```
📒 Files selected for processing (4)

- cloud/issues/069-ws-disconnect-observability/spike.md
- cloud/packages/cloud/src/services/metrics/SystemVitalsLogger.ts
- cloud/packages/cloud/src/services/session/UserSession.ts
- cloud/packages/cloud/src/services/websocket/bun-websocket.ts
…ocket upgrades as reconnects (Codex review)
Addressed review comments:

- Codex P2 #1 (Gate reconnect metrics on prior disconnect state): Valid — fixed.
- Codex P2 #2 (Count stale-socket closes in disconnect observability): Intentionally excluded. Stale closes happen when a new WebSocket is already active — the session is still connected. Counting these as disconnects would inflate the churn metrics and produce misleading data. The stale path already logs …
- CodeRabbit (Extract IIFE): Minor style preference — keeping as-is for consistency with the existing …
What

Adds diagnostic instrumentation that definitively proves whether WebSocket disconnects are client-initiated or server-initiated: three new structured log events (`ws-close`, `ws-reconnect`, `ws-dispose`) and connection churn counters in SystemVitalsLogger.

Why
The team keeps being told that disconnect churn is a client-side issue, but there's no instrumented proof. Today's production data shows the churn, but we can't prove the server was healthy at the moment of disconnect because we don't log what was happening. This PR adds that evidence.
What it adds

`ws-close` event (on every glasses WebSocket close)

```json
{"feature": "ws-close", "code": 1006, "timeSinceLastClientMessage": 34500, "timeSinceLastAppPong": 34200, "sessionDurationSeconds": 142, "reconnectCount": 3}
```

If `timeSinceLastClientMessage > 10000` on a 1006 close → the CLIENT went silent. The server was sending pings every 2 seconds.

`ws-reconnect` event (on every session reconnection)

```json
{"feature": "ws-reconnect", "reconnectCount": 4, "downtimeMs": 7200, "lastCloseCode": 1006, "timeSinceLastClientMessage": 41700}
```

`ws-dispose` event (when session is disposed after grace period)

```json
{"feature": "ws-dispose", "sessionDurationSeconds": 142, "reconnectCount": 3, "lastCloseCode": 1006, "timeSinceLastClientMessage": 94500, "disposalReason": "grace_period_timeout"}
```

Connection churn in system-vitals (every 30s)

`wsDisconnects`, `wsReconnects`, `wsAvgDowntimeMs`, `wsCloseCodeDist`

Changes
- `UserSession.ts`: `ws-reconnect` log in `createOrReconnect()`, `ws-dispose` log in `dispose()`, make `lastPongTime` public
- `bun-websocket.ts`: update `lastClientMessageTime`/`lastAppLevelPongTime` on every message, `ws-close` structured log, `connectionChurnTracker.recordDisconnect()`
- `SystemVitalsLogger.ts`: add `ConnectionChurnTracker` class, include churn counters in vitals
- Investigation docs
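The churn counters described above could be aggregated with read-and-reset semantics along these lines. This is a sketch assuming the 30s vitals loop calls `getAndReset()`; the PR's actual `ConnectionChurnTracker` implementation is not shown on this page and may differ.

```typescript
// Sketch: per-interval churn aggregator. The vitals logger reads and resets
// the counters every 30s, so each snapshot covers one reporting window.
class ConnectionChurnTracker {
  private disconnects = 0;
  private reconnects = 0;
  private totalDowntimeMs = 0;
  private closeCodes: Record<number, number> = {};

  recordDisconnect(code: number): void {
    this.disconnects++;
    this.closeCodes[code] = (this.closeCodes[code] ?? 0) + 1;
  }

  recordReconnect(downtimeMs: number): void {
    this.reconnects++;
    this.totalDowntimeMs += downtimeMs;
  }

  getAndReset() {
    const snapshot = {
      disconnects: this.disconnects,
      reconnects: this.reconnects,
      avgDowntimeMs: this.reconnects > 0 ? Math.round(this.totalDowntimeMs / this.reconnects) : 0,
      closeCodes: this.closeCodes,
    };
    this.disconnects = 0;
    this.reconnects = 0;
    this.totalDowntimeMs = 0;
    this.closeCodes = {};
    return snapshot;
  }
}
```

The close-code distribution in each snapshot is what separates client-side churn (1006 abnormal closes) from server-initiated closes (1001) in the vitals output.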