Conversation
Staging => Main
Bug fixes:
- A1: Store and clear ManagedStreamingExtension interval (leaked on every disposed UserSession)
- A2: Store Soniox listener refs and use typed .off() in close() (7 listeners were pinning the UserSession chain)
- A3: Add 4 missing manager dispose calls in UserSession.dispose()
- A4: Identity check before sessions.delete() (prevents orphaning a newer session)
- A5: Email case normalization in WebSocket upgrade
- A6: Disposed guard on TranscriptionManager scheduled reconnects
- A7: Disposed guard on TranslationManager scheduled retry

Observability:
- Event loop lag warnings (>100ms) logged to BetterStack with heap/session context
- /health enriched with heapUsedMB, rssMB, eventLoopLagMs, activeSessions, uptimeSeconds
- /livez lightweight liveness endpoint (zero computation)
- /api/admin/heap-snapshot endpoint (Bun.generateHeapSnapshot)
- SystemVitalsLogger: 30s periodic structured log with Golden Signals + operation timing framework
- MetricsService.getCurrentLag() public method
- porter-debug.yml updated to deploy from this branch

Docs: spikes and specs for issues 055, 056, 057
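The lag-warning mechanism can be sketched as a timer that measures how late it fires; a minimal illustration, not the actual MetricsService code (`computeLagMs` and `startLagMonitor` are hypothetical names):

```typescript
// How late a timer fired relative to its scheduled time, clamped at 0.
// A setInterval callback scheduled for time T that runs at T+lag means
// the event loop was busy for roughly `lag` ms.
function computeLagMs(scheduledAt: number, firedAt: number): number {
  return Math.max(0, firedAt - scheduledAt);
}

// Illustrative monitor wiring (not the real MetricsService):
function startLagMonitor(
  intervalMs: number,
  warnThresholdMs: number,
  onWarn: (lagMs: number) => void,
) {
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const lag = computeLagMs(expected, Date.now());
    expected = Date.now() + intervalMs;
    if (lag > warnThresholdMs) onWarn(lag); // e.g. log with heap/session context
  }, intervalMs);
  return () => clearInterval(timer);
}
```

The 100ms threshold from the commit maps to `warnThresholdMs`; the real implementation may sample differently.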
- Never commit tokens, secrets, API keys, or credentials (use placeholders)
- Never commit PII (customer emails, user IDs — anonymize)
- Never include private conversations verbatim
- AI agent-specific guidelines (build before push, confirm before infra changes)
- Remove duplicate Anti-Patterns section, add security anti-patterns
- Clean up old Anti-Patterns that were duplicated
…ision log
- Replace removeAllListeners approach with stored refs + typed .off()
- Document all three attempts and why each was chosen/rejected
- Add context explaining what this.session is (Soniox, not UserSession)
- Show the reference chain that causes the memory leak
- Update README design.md template to include Decision Log and guidance on keeping design docs current when implementation changes
Critical:
- Remove unauthenticated heap snapshot from hono-app.ts (already exists behind validateAdminEmail in admin.routes.ts)
- Add Bun-native GET /memory/heap-snapshot-bun to admin routes (with auth)
- Normalize userId to lowercase in handleAppUpgrade + legacy CONNECTION_INIT

Major:
- Move listener .off() to finally block after session.finish() so finalized/finished handlers can flush final transcript data
- Track timer handles in pendingTimers Set for both TranscriptionManager and TranslationManager, clear all on dispose()

Minor:
- Add /livez to noisy endpoint filter (prevents probe log spam)
- Throttle event loop lag warnings (30s cooldown)
- Add getDisposedPendingGCCount() to MemoryLeakDetector (typed getter)
- Use typed getter in SystemVitalsLogger instead of as any cast
- Simplify getAndReset() to replace object
- Add try/catch around stream count access in SystemVitalsLogger
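The pendingTimers pattern from the Major item, as a minimal sketch (the manager class and method names here are illustrative): every scheduled retry handle goes into a Set, removes itself when it fires, and dispose() clears whatever is still pending.

```typescript
class RetryingManager {
  private pendingTimers = new Set<ReturnType<typeof setTimeout>>();
  fired = 0;

  scheduleRetry(delayMs: number) {
    const handle = setTimeout(() => {
      this.pendingTimers.delete(handle); // self-remove once fired
      this.fired++;
      // ...attempt the reconnect/retry here...
    }, delayMs);
    this.pendingTimers.add(handle);
  }

  dispose() {
    // Clear every outstanding handle so no callback runs after disposal
    // and no closure keeps this manager (or its session) alive.
    for (const handle of this.pendingTimers) clearTimeout(handle);
    this.pendingTimers.clear();
  }

  pendingCount(): number { return this.pendingTimers.size; }
}
```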
- StreamRegistry: add dispose() that clears userStreams, streamToUser, cfInputToUser Maps
- UserSettingsManager: improve dispose() to release loadPromise closure + mark loaded
- DeviceManager: already had dispose() (no-op, no state to clean)
- CalendarManager: already had dispose() (clears events + subscribedApps)
- UserSession: replace (as any).dispose?.() casts with typed calls
…tension startPlaybackUrlPolling() had an untracked 60s setTimeout that could keep a disposed ManagedStreamingExtension (and transitively UserSession) alive. Now stored as this.playbackUrlTimeout, cleared in dispose().
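A minimal sketch of the fix (startPlaybackUrlPolling, playbackUrlTimeout, and dispose follow the text above; the class body is illustrative):

```typescript
class ManagedStreamingExtensionSketch {
  private playbackUrlTimeout?: ReturnType<typeof setTimeout>;

  startPlaybackUrlPolling(delayMs = 60_000) {
    // The original bug: this handle was never stored, so the pending
    // callback's closure kept the extension (and transitively the
    // UserSession) reachable even after dispose().
    this.playbackUrlTimeout = setTimeout(() => {
      this.playbackUrlTimeout = undefined;
      // ...poll the playback URL here...
    }, delayMs);
  }

  hasPendingTimer(): boolean {
    return this.playbackUrlTimeout !== undefined;
  }

  dispose() {
    if (this.playbackUrlTimeout) {
      clearTimeout(this.playbackUrlTimeout); // releases the pinning closure
      this.playbackUrlTimeout = undefined;
    }
  }
}
```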
…lity cloud/issues-057: memory leak fixes + cloud observability overhaul
…, connection counting, slow query monitoring
Adds instrumentation to identify WHY pods become unresponsive before crashing:
A1. GC Probe (SystemVitalsLogger): Forces Bun.gc(true) every 60s, times it,
logs duration + memory freed. Tells us if GC pauses are 5ms or 500ms.
A2. GC on Session Disconnect (UserSession): After dispose(), forces GC on
next tick with timing. Rate-limited to 1 per 10s to avoid thrashing
during crash cascades. Shows if sessions are properly freed.
A3. Health Check Timing (hono-app): Wraps /health in performance.now().
Warns at >50ms. Tells us if the health handler itself is slow vs the
event loop being blocked before the handler runs.
A4. Soniox Send Timing (SonioxSdkStream): Times sendAudio() calls. Warns
at >50ms, rate-limited to 1 per 30s per stream. Tells us if Soniox
WebSocket I/O is blocking the event loop.
A5. Connection Counting (SystemVitalsLogger): Adds glassesWebSockets,
totalConnections, micActiveCount to 30s vitals. Correlates crashes
with connection count (200-325) not just session count (65).
A6. MongoDB Slow Query Plugin (mongodb.connection): Global Mongoose plugin
that times all queries. Warns when >MONGOOSE_SLOW_QUERY_MS (env var,
default disabled). Set to 100 in Doppler to enable.
Also: spec.md for the full plan, tsconfig excludes scripts/ from build.
See: cloud/issues/061-crash-investigation/spec.md
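The A1 probe can be sketched with the collector and memory reader injected, since Bun.gc is Bun-only (on Node you would substitute global.gc behind --expose-gc). This mirrors the idea, not the actual SystemVitalsLogger code:

```typescript
// Times a forced synchronous full GC and reports how much heap it freed.
// `forceGc` would be () => Bun.gc(true) on Bun; `memoryUsage` would be
// process.memoryUsage. Both are parameters here so the probe is testable.
function runGcProbe(
  forceGc: () => void,
  memoryUsage: () => { heapUsed: number },
): { durationMs: number; freedBytes: number } {
  const before = memoryUsage().heapUsed;
  const t0 = performance.now();
  forceGc(); // synchronous full collection — this is the pause being measured
  const durationMs = performance.now() - t0;
  const freedBytes = Math.max(0, before - memoryUsage().heapUsed);
  return { durationMs, freedBytes };
}
```

A 60s setInterval calling this and logging the result gives the "5ms or 500ms" answer the commit describes.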
…g, cleanup
- Fix glassesWebSockets: check session.websocket (actual field name)
- Move mongoose.plugin() to module load time (before models import)
- Remove .env fallback from analyze-heap — just require the env var
- Add division-by-zero guard in analyze-heap
hotfix/crash-diagnostics
…oof framework, tech debt
- Complete audit of every MongoDB call in cloud server (11 collections, 262 call sites)
- 18 hot-path calls identified (session connect, message handling, location updates)
- Only 18% of reads use .lean() — 82% instantiate full Mongoose Documents unnecessarily
- France event loop blocked 162s/1800s (9%) by MongoDB RTT alone
- GC confirmed NOT primary cause (54ms probes, 0MB freed)
- Proof framework: 4 steps to definitively confirm/deny MongoDB as crash cause
- Tech debt doc: 8 items found (app cache, atomic updates, lean, N+1, JWT centralization)
- Key finding: latency is pure network RTT, not query execution (indexes work, 0ms on server)
…mory app cache
Three changes:
B1. Event loop gap detector — 1s setInterval that detects >2s gaps, logs the
blocking duration. The missing link between 'query was slow' and 'health
check failed because the event loop was blocked at that exact moment.'
B2. Cumulative MongoDB blocking metric — mongoQueryCount, mongoTotalBlockingMs,
mongoMaxQueryMs in 30s system vitals. Extends existing slow-query plugin.
B3. In-memory app cache — loads all 1,314 apps (~2MB) at boot, refreshes
every 5 min with write-through invalidation. Eliminates 18 hot-path DB
round-trips (80-370ms each) per session. Largest single-change impact
on crash frequency.
Migration: 9 hot-path call sites switch from App.findOne() to appCache.
Cold paths left as-is.
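The B1 detector can be sketched with an injectable clock (illustrative, not the actual SystemVitalsLogger code). A 1s interval records when it last fired; a gap well beyond the interval means the loop was synchronously blocked for roughly (gap - interval) ms:

```typescript
function createGapDetector(
  intervalMs: number,
  thresholdMs: number,
  onGap: (blockedMs: number) => void,
  now: () => number = () => performance.now(),
) {
  let last = now();
  const tick = () => {
    const t = now();
    const gap = t - last;
    last = t;
    // A gap over the threshold means the interval callback could not run
    // on time — the event loop was blocked for approximately gap - interval.
    if (gap > thresholdMs) onGap(gap - intervalMs);
  };
  const timer = setInterval(tick, intervalMs);
  return { stop: () => clearInterval(timer), tick }; // tick exposed for testing
}
```

With intervalMs=1000 and thresholdMs=2000 this matches the ">2s gaps" rule in the text.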
…plica guide, B4 operation timing
- Detailed every write path (16 across 5 files) that must call invalidate()
- Documented every staleness edge case: publicUrl, hashedApiKey, permissions are critical — bounded to 30s max staleness (down from 5 min)
- Added refresh failure detection (warn if 3 missed cycles)
- Multi-pod staleness: write-through only works on the local pod; other regions wait for the 30s timer — this is the tradeoff vs 9% event loop blocking
- Atlas read replica step-by-step: UI steps, readPreference=nearest, cost (~$228/mo for 4 replicas). Future work, not in scope for this spec.
- B4: wire operationTimers to audio processing, glasses/app message handlers, display rendering. Measures actual synchronous CPU budget consumption.
…tion timing
B1. Event Loop Gap Detector (SystemVitalsLogger)
1s setInterval detects >2s gaps — catches ALL blocking regardless of cause.
Logs 'event-loop-gap' with gap duration, RSS, session count.
The missing link between 'query was slow' and 'health check failed.'
B2. Cumulative MongoDB Blocking Metric (mongodb.connection + SystemVitalsLogger)
MongoQueryStats accumulator records all slow queries (>MONGOOSE_SLOW_QUERY_MS).
SystemVitalsLogger reads/resets every 30s: mongoQueryCount, mongoTotalBlockingMs,
mongoMaxQueryMs. Shows exact % of event loop time consumed by MongoDB.
B3. In-Memory App Cache (app-cache.service.ts + 9 hot-path migrations)
Loads all 1,314 apps (~2MB) at boot. Refreshes every 30s.
9 hot-path call sites migrated from App.findOne() to appCache.getByPackageName()
with DB fallback. 16 write paths call appCache.invalidate() fire-and-forget.
Eliminates 80-370ms RTT per hot-path app lookup.
Files: AppManager, SubscriptionManager, app-message-handler, sdk.auth.service,
system-app.api (5 call sites). Invalidation in: console.apps.service, app.service,
developer.service, developer.routes, permissions.routes, admin.routes.
B4. Hot Path Operation Timing (UdpAudioServer, bun-websocket, DisplayManager)
Wires existing operationTimers framework to: audioProcessing (3,250 ops/sec),
glassesMessage, appMessage, displayRendering. Vitals now show op_*_ms and
opBudgetUsedPct — exact % of event loop consumed by application code.
Also: .gitignore for .heap/
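The B3 cache shape, as a minimal sketch (getByPackageName, invalidate, and refresh come from the text above; the internals and AppRecord type are illustrative). Reads serve from memory, fall back to the DB loader on a miss, and invalidate() evicts a single entry so the next read goes back to the DB:

```typescript
type AppRecord = { packageName: string; publicUrl?: string };

class AppCache {
  private byName = new Map<string, AppRecord>();
  // `loadOne` stands in for the App.findOne() DB fallback.
  constructor(private readonly loadOne: (name: string) => Promise<AppRecord | null>) {}

  // Periodic full refresh: replace the map wholesale (atomic swap).
  refresh(all: AppRecord[]) {
    this.byName = new Map(all.map((a) => [a.packageName, a]));
  }

  async getByPackageName(name: string): Promise<AppRecord | null> {
    const hit = this.byName.get(name);
    if (hit) return hit;
    const fromDb = await this.loadOne(name); // DB fallback on cache miss
    if (fromDb) this.byName.set(name, fromDb);
    return fromDb;
  }

  // Write-through invalidation: called fire-and-forget from write paths.
  invalidate(name: string) { this.byName.delete(name); }
}
```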
…robustness

CRITICAL:
- SDK auth: revert to DB-only credential validation — never use stale cache for hashedApiKey. Security: stale cache = revoked keys still validate.
- SDK auth: export clearSdkAuthCache(), called from appCache.invalidate() so key rotation clears both caches.
- developer.service: stop logging API key material (raw key + hash were in debug logs — credential leak to all log sinks).

MAJOR:
- bun-websocket: restore ws.close(1011) in app message error handler (accidentally removed when adding finally block — regression).
- system-app.api + AppManager: partial cache hits now fall back to DB. Changed cachedApps.length > 0 to cachedApps.length === names.length.
- app-cache: concurrent refresh protection (refreshing flag prevents overlapping refresh() calls from racing).
- app-cache: start interval before initial refresh() so transient first-load failures don't permanently disable the cache.
- UdpAudioServer: move audioProcessing timing to finally block.

MINOR:
- spec.md: fix 5-min vs 30-sec inconsistency in B3 description and code.
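The partial-hit fix from the MAJOR section, reduced to its core check (the function and names are illustrative): only trust the cache when every requested app was found, otherwise signal the caller to query the DB for the full set.

```typescript
function resolveFromCache<T>(
  names: string[],
  lookup: (name: string) => T | undefined,
): T[] | null {
  const hits = names
    .map(lookup)
    .filter((x): x is T => x !== undefined);
  // Before the fix the condition was `hits.length > 0`, which silently
  // dropped any apps missing from the cache. Requiring a full match
  // (`=== names.length`) makes partial hits fall back to the DB.
  return hits.length === names.length ? hits : null; // null ⇒ caller queries DB
}
```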
Two App.updateOne() calls for onboardingStatus were missing cache invalidation. Found during manual review — bots missed these.
Cloud/062 mongodb audit
… exit
Adds SIGTERM/SIGINT handler that sends WebSocket close frames (code 1001
'Going Away') to every connected glasses and app WebSocket before the
process exits. This reduces deploy-caused user disruption from 30-60s
(waiting for ping timeout on unclean disconnect) to <2s (immediate close
frame detection).
A1. SIGTERM handler (index.ts): closes all glasses + app WebSockets,
stops timers (vitals, cache, metrics), closes MongoDB, exits 0.
A2. Health drain mode (hono-app.ts): /health returns 503 during shutdown
so LB stops routing new requests to the dying pod.
A3. Reject new WS upgrades (bun-websocket.ts): returns 503 during
shutdown so new sessions go to the new pod.
Shared shutdown state via services/shutdown.ts module.
Issue doc: cloud/issues/063-graceful-shutdown/spec.md
Deploy workflow: added hotfix/graceful-shutdown to porter-debug.yml
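The A1 close-all step, sketched with an injected socket list (the socket shape and registry are illustrative; 1001 "Going Away" is the close code from the text). One failing socket must not abort the sweep:

```typescript
interface ClosableSocket {
  close(code: number, reason: string): void;
}

function closeAllSockets(sockets: Iterable<ClosableSocket>): number {
  let closed = 0;
  for (const ws of sockets) {
    try {
      // Immediate close frame — clients detect this in <2s instead of
      // waiting 30-60s for a ping timeout on an unclean disconnect.
      ws.close(1001, "Going Away");
      closed++;
    } catch {
      // Socket may already be gone; keep closing the rest.
    }
  }
  return closed;
}

// Wiring (not executed here):
// process.once("SIGTERM", () => { closeAllSockets(allGlassesAndAppSockets); /* then drain + exit */ });
```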
…ng shutdown

Without this, the LB could route REST requests to the dying pod during the SIGTERM grace period. Those requests hit a pod with no sessions → 503. This was likely causing some of the intermittent 503s users reported. The middleware returns 503 on every request except /livez (so K8s can still check process liveness during drain).
…ents

process.exit(0) immediately after ws.close() can kill the runtime before Bun flushes the close handshake. Adding a 2-second drain delay ensures close frames are sent on the wire before the process exits.
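A sketch of the drain delay (the 2000ms default comes from the commit text; the exit callback is injected here only for testability):

```typescript
// Schedule exit(0) after the drain window so the runtime has time to
// flush WebSocket close frames onto the wire before the process dies.
function exitAfterDrain(exit: (code: number) => void, drainMs = 2000) {
  const handle: any = setTimeout(() => exit(0), drainMs);
  // unref (where available) so the drain timer alone cannot keep an
  // otherwise-finished process alive.
  handle.unref?.();
  return handle;
}

// Wiring (not executed here):
// process.once("SIGTERM", () => { /* close sockets, stop timers */ exitAfterDrain(process.exit); });
```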
hotfix/graceful-shutdown
📋 PR Review Helper
📱 Mobile App Build: ⏳ Waiting for build...
🕶️ ASG Client Build: ✅ Ready to test!
🔀 Test locally: gh pr checkout 2349
Deploying mentra-store-dev with commit 5d6ade6
- Status: ✅ Deploy successful!
- Preview URL: https://f85f4d60.augmentos-appstore-2.pages.dev
- Branch Preview URL: https://cloud-sync-main-to-dev.augmentos-appstore-2.pages.dev

Deploying dev-augmentos-console with commit 5d6ade6
- Status: ✅ Deploy successful!
- Preview URL: https://823bdd1f.dev-augmentos-console.pages.dev
- Branch Preview URL: https://cloud-sync-main-to-dev.dev-augmentos-console.pages.dev

Deploying prod-augmentos-account with commit 5d6ade6
- Status: ✅ Deploy successful!
- Preview URL: https://f9eded3d.augmentos-e84.pages.dev
- Branch Preview URL: https://cloud-sync-main-to-dev.augmentos-e84.pages.dev

Deploying mentra-live-ota-site with commit 5d6ade6
- Status: ✅ Deploy successful!
- Preview URL: https://1cc6f944.mentra-live-ota-site.pages.dev
- Branch Preview URL: https://cloud-sync-main-to-dev.mentra-live-ota-site.pages.dev
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d6ade6dcc
    const memBefore = process.memoryUsage();
    const t0 = performance.now();
    Bun.gc(true);
runGcProbe() executes Bun.gc(true) (a synchronous full GC) and start() schedules this probe every 60 seconds for all deployments, which adds guaranteed stop-the-world pauses on live traffic. Under moderate/high heap pressure this can itself cause event-loop stalls and health-check failures, so the observability feature can degrade availability in production; this should be gated behind an explicit debug/diagnostic flag (default off).
No description provided.