
Cloud/sync main to dev#2349

Merged
isaiahb merged 34 commits into dev from cloud/sync-main-to-dev
Mar 28, 2026
Conversation

@isaiahb
Contributor

@isaiahb isaiahb commented Mar 28, 2026

No description provided.

isaiahb and others added 30 commits March 2, 2026 16:45
Bug fixes:
- A1: Store and clear ManagedStreamingExtension interval (leaked every disposed UserSession)
- A2: Store Soniox listener refs, use typed .off() in close() (7 listeners pinning UserSession chain)
- A3: Add 4 missing manager dispose calls in UserSession.dispose()
- A4: Identity check before sessions.delete() (prevents orphaning newer session)
- A5: Email case normalization in WebSocket upgrade
- A6: Disposed guard on TranscriptionManager scheduled reconnects
- A7: Disposed guard on TranslationManager scheduled retry
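The A2/A6/A7 fixes above share one pattern: keep typed references to every registered listener so teardown can detach exactly what was attached, and guard scheduled callbacks against running after disposal. A minimal TypeScript sketch of that pattern (class and field names are hypothetical, and Node's EventEmitter stands in for the Soniox client):

```typescript
import { EventEmitter } from "node:events";

// Hypothetical wrapper illustrating stored listener refs + typed .off()
// plus a disposed guard on scheduled work.
class TranscriptionStreamWrapper {
  private disposed = false;
  private buffer: string[] = [];
  // Keep a reference to each handler so .off() removes exactly what .on() added.
  private readonly onToken = (t: string) => this.buffer.push(t);
  private readonly onError = (e: Error) => console.error("stream error", e);

  constructor(private readonly stream: EventEmitter) {
    stream.on("token", this.onToken);
    stream.on("error", this.onError);
  }

  scheduleReconnect(): void {
    setTimeout(() => {
      if (this.disposed) return; // disposed guard: never reconnect a dead session
      // ...reconnect logic would go here...
    }, 1000);
  }

  close(): void {
    this.disposed = true;
    // Typed .off() with stored refs breaks the stream -> wrapper reference
    // chain, so the wrapper (and anything it pins) can be garbage collected.
    this.stream.off("token", this.onToken);
    this.stream.off("error", this.onError);
  }
}
```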

Observability:
- Event loop lag warnings (>100ms) logged to BetterStack with heap/session context
- /health enriched with heapUsedMB, rssMB, eventLoopLagMs, activeSessions, uptimeSeconds
- /livez lightweight liveness endpoint (zero computation)
- /api/admin/heap-snapshot endpoint (Bun.generateHeapSnapshot)
- SystemVitalsLogger: 30s periodic structured log with Golden Signals + operation timing framework
- MetricsService.getCurrentLag() public method
- porter-debug.yml updated to deploy from this branch
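For context on the lag warnings above, event loop lag is typically measured by scheduling a timer at a known interval and treating any extra delay as lag. A hedged sketch of that approach (the PR's actual MetricsService implementation may differ; thresholds mirror the >100ms warning mentioned above):

```typescript
// Illustrative event-loop-lag probe: a timer that fires "late" by N ms
// implies the loop was blocked for roughly N ms.
const INTERVAL_MS = 500;
const LAG_WARN_MS = 100;

let currentLagMs = 0;
let expected = performance.now() + INTERVAL_MS;

const probe = setInterval(() => {
  const now = performance.now();
  currentLagMs = Math.max(0, now - expected); // how late the timer fired
  expected = now + INTERVAL_MS;
  if (currentLagMs > LAG_WARN_MS) {
    console.warn(`event loop lag ${currentLagMs.toFixed(1)}ms`, {
      heapUsedMB: process.memoryUsage().heapUsed / 1024 / 1024,
    });
  }
}, INTERVAL_MS);
probe.unref(); // don't keep the process alive just for the probe

function getCurrentLag(): number {
  return currentLagMs;
}
```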

Docs: spikes and specs for issues 055, 056, 057
- Never commit tokens, secrets, API keys, or credentials (use placeholders)
- Never commit PII (customer emails, user IDs — anonymize)
- Never include private conversations verbatim
- AI agent-specific guidelines (build before push, confirm before infra changes)
- Remove duplicate Anti-Patterns section, add security anti-patterns
- Clean up old Anti-Patterns that were duplicated
…ision log

- Replace removeAllListeners approach with stored refs + typed .off()
- Document all three attempts and why each was chosen/rejected
- Add context explaining what this.session is (Soniox, not UserSession)
- Show the reference chain that causes the memory leak
- Update README design.md template to include Decision Log and
  guidance on keeping design docs current when implementation changes
Critical:
- Remove unauthenticated heap snapshot from hono-app.ts (already exists
  behind validateAdminEmail in admin.routes.ts)
- Add Bun-native GET /memory/heap-snapshot-bun to admin routes (with auth)
- Normalize userId to lowercase in handleAppUpgrade + legacy CONNECTION_INIT

Major:
- Move listener .off() to finally block after session.finish() so
  finalized/finished handlers can flush final transcript data
- Track timer handles in pendingTimers Set for both TranscriptionManager
  and TranslationManager, clear all on dispose()
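The pendingTimers change above can be sketched as follows (names are hypothetical; the real TranscriptionManager/TranslationManager code surely differs): every timer handle is tracked in a Set so dispose() can clear them all, preventing a late-firing retry from reviving a disposed manager.

```typescript
// Illustrative timer-tracking pattern from the Major fixes above.
class RetryingManager {
  private disposed = false;
  private readonly pendingTimers = new Set<ReturnType<typeof setTimeout>>();

  scheduleRetry(fn: () => void, delayMs: number): void {
    if (this.disposed) return;
    const handle = setTimeout(() => {
      this.pendingTimers.delete(handle); // timer fired; stop tracking it
      if (!this.disposed) fn();
    }, delayMs);
    this.pendingTimers.add(handle);
  }

  dispose(): void {
    this.disposed = true;
    for (const handle of this.pendingTimers) clearTimeout(handle);
    this.pendingTimers.clear();
  }

  get pendingCount(): number {
    return this.pendingTimers.size;
  }
}
```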

Minor:
- Add /livez to noisy endpoint filter (prevents probe log spam)
- Throttle event loop lag warnings (30s cooldown)
- Add getDisposedPendingGCCount() to MemoryLeakDetector (typed getter)
- Use typed getter in SystemVitalsLogger instead of as any cast
- Simplify getAndReset() to replace object
- Add try/catch around stream count access in SystemVitalsLogger
- StreamRegistry: add dispose() that clears userStreams, streamToUser, cfInputToUser Maps
- UserSettingsManager: improve dispose() to release loadPromise closure + mark loaded
- DeviceManager: already had dispose() (no-op, no state to clean)
- CalendarManager: already had dispose() (clears events + subscribedApps)
- UserSession: replace (as any).dispose?.() casts with typed calls
…tension

startPlaybackUrlPolling() had an untracked 60s setTimeout that could keep
a disposed ManagedStreamingExtension (and transitively UserSession) alive.
Now stored as this.playbackUrlTimeout, cleared in dispose().
…lity

cloud/issues-057: memory leak fixes + cloud observability overhaul
…, connection counting, slow query monitoring

Adds instrumentation to identify WHY pods become unresponsive before crashing:

A1. GC Probe (SystemVitalsLogger): Forces Bun.gc(true) every 60s, times it,
    logs duration + memory freed. Tells us if GC pauses are 5ms or 500ms.

A2. GC on Session Disconnect (UserSession): After dispose(), forces GC on
    next tick with timing. Rate-limited to 1 per 10s to avoid thrashing
    during crash cascades. Shows if sessions are properly freed.

A3. Health Check Timing (hono-app): Wraps /health in performance.now().
    Warns at >50ms. Tells us if the health handler itself is slow vs the
    event loop being blocked before the handler runs.

A4. Soniox Send Timing (SonioxSdkStream): Times sendAudio() calls. Warns
    at >50ms, rate-limited to 1 per 30s per stream. Tells us if Soniox
    WebSocket I/O is blocking the event loop.

A5. Connection Counting (SystemVitalsLogger): Adds glassesWebSockets,
    totalConnections, micActiveCount to 30s vitals. Correlates crashes
    with connection count (200-325) not just session count (65).

A6. MongoDB Slow Query Plugin (mongodb.connection): Global Mongoose plugin
    that times all queries. Warns when >MONGOOSE_SLOW_QUERY_MS (env var,
    default disabled). Set to 100 in Doppler to enable.

Also: spec.md for the full plan, tsconfig excludes scripts/ from build.

See: cloud/issues/061-crash-investigation/spec.md
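The A1 GC probe described above can be sketched as a small timing wrapper. This is an assumption-laden illustration, not the PR's code: Bun.gc(true) is Bun-specific, so this version falls back to Node's --expose-gc hook or a no-op, and the measure-and-report shape is the point.

```typescript
// Illustrative GC probe: time a forced full collection and report
// duration plus memory freed, per the A1 description above.
interface GcProbeResult {
  durationMs: number;
  freedMB: number;
}

function runGcProbe(): GcProbeResult {
  const before = process.memoryUsage().heapUsed;
  const t0 = performance.now();
  const bunGc = (globalThis as any).Bun?.gc;
  if (typeof bunGc === "function") {
    bunGc(true); // synchronous full GC under Bun
  } else if (typeof (globalThis as any).gc === "function") {
    (globalThis as any).gc(); // Node with --expose-gc
  }
  const durationMs = performance.now() - t0;
  const freedMB = Math.max(
    0,
    (before - process.memoryUsage().heapUsed) / 1024 / 1024,
  );
  return { durationMs, freedMB };
}
```

Note the tradeoff: a forced full GC is itself a stop-the-world pause, which is why the Codex review on this PR suggests gating the probe behind an explicit diagnostic flag.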
…g, cleanup

- Fix glassesWebSockets: check session.websocket (actual field name)
- Move mongoose.plugin() to module load time (before models import)
- Remove .env fallback from analyze-heap — just require the env var
- Add division-by-zero guard in analyze-heap
…oof framework, tech debt

- Complete audit of every MongoDB call in cloud server (11 collections, 262 call sites)
- 18 hot-path calls identified (session connect, message handling, location updates)
- Only 18% of reads use .lean() — 82% instantiate full Mongoose Documents unnecessarily
- France event loop blocked 162s/1800s (9%) by MongoDB RTT alone
- GC confirmed NOT primary cause (54ms probes, 0MB freed)
- Proof framework: 4 steps to definitively confirm/deny MongoDB as crash cause
- Tech debt doc: 8 items found (app cache, atomic updates, lean, N+1, JWT centralization)
- Key finding: latency is pure network RTT, not query execution (indexes work, 0ms on server)
…mory app cache

Three changes:
B1. Event loop gap detector — 1s setInterval that detects >2s gaps, logs the
    blocking duration. The missing link between 'query was slow' and 'health
    check failed because the event loop was blocked at that exact moment.'

B2. Cumulative MongoDB blocking metric — mongoQueryCount, mongoTotalBlockingMs,
    mongoMaxQueryMs in 30s system vitals. Extends existing slow-query plugin.

B3. In-memory app cache — loads all 1,314 apps (~2MB) at boot, refreshes
    every 5 min with write-through invalidation. Eliminates 18 hot-path DB
    round-trips (80-370ms each) per session. Largest single-change impact
    on crash frequency.

Migration: 9 hot-path call sites switch from App.findOne() to appCache.
Cold paths left as-is.
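The B2 accumulator can be sketched as below. The field names come from the commit messages; the implementation details (including getAndReset() replacing the object rather than zeroing fields, per the Minor fixes earlier) are assumptions:

```typescript
// Illustrative slow-query accumulator: the Mongoose plugin would call
// record() per slow query, and the vitals logger drains it every 30s.
class MongoQueryStats {
  private stats = {
    mongoQueryCount: 0,
    mongoTotalBlockingMs: 0,
    mongoMaxQueryMs: 0,
  };

  record(durationMs: number): void {
    this.stats.mongoQueryCount += 1;
    this.stats.mongoTotalBlockingMs += durationMs;
    if (durationMs > this.stats.mongoMaxQueryMs) {
      this.stats.mongoMaxQueryMs = durationMs;
    }
  }

  // Replace the object rather than mutating it, so the caller gets an
  // immutable snapshot of the last 30s window.
  getAndReset() {
    const snapshot = this.stats;
    this.stats = { mongoQueryCount: 0, mongoTotalBlockingMs: 0, mongoMaxQueryMs: 0 };
    return snapshot;
  }
}
```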
…plica guide, B4 operation timing

- Detailed every write path (16 across 5 files) that must call invalidate()
- Documented every staleness edge case: publicUrl, hashedApiKey, permissions
  are critical — bounded to 30s max staleness (down from 5 min)
- Added refresh failure detection (warn if 3 missed cycles)
- Multi-pod staleness: write-through only works on local pod, other regions
  wait for 30s timer — this is the tradeoff vs 9% event loop blocking
- Atlas read replica step-by-step: UI steps, readPreference=nearest, cost
  (~$228/mo for 4 replicas). Future work, not in scope for this spec.
- B4: wire operationTimers to audio processing, glasses/app message handlers,
  display rendering. Measures actual synchronous CPU budget consumption.
…tion timing

B1. Event Loop Gap Detector (SystemVitalsLogger)
    1s setInterval detects >2s gaps — catches ALL blocking regardless of cause.
    Logs 'event-loop-gap' with gap duration, RSS, session count.
    The missing link between 'query was slow' and 'health check failed.'

B2. Cumulative MongoDB Blocking Metric (mongodb.connection + SystemVitalsLogger)
    MongoQueryStats accumulator records all slow queries (>MONGOOSE_SLOW_QUERY_MS).
    SystemVitalsLogger reads/resets every 30s: mongoQueryCount, mongoTotalBlockingMs,
    mongoMaxQueryMs. Shows exact % of event loop time consumed by MongoDB.

B3. In-Memory App Cache (app-cache.service.ts + 9 hot-path migrations)
    Loads all 1,314 apps (~2MB) at boot. Refreshes every 30s.
    9 hot-path call sites migrated from App.findOne() to appCache.getByPackageName()
    with DB fallback. 16 write paths call appCache.invalidate() fire-and-forget.
    Eliminates 80-370ms RTT per hot-path app lookup.
    Files: AppManager, SubscriptionManager, app-message-handler, sdk.auth.service,
    system-app.api (5 call sites). Invalidation in: console.apps.service, app.service,
    developer.service, developer.routes, permissions.routes, admin.routes.

B4. Hot Path Operation Timing (UdpAudioServer, bun-websocket, DisplayManager)
    Wires existing operationTimers framework to: audioProcessing (3,250 ops/sec),
    glassesMessage, appMessage, displayRendering. Vitals now show op_*_ms and
    opBudgetUsedPct — exact % of event loop consumed by application code.

Also: .gitignore for .heap/
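The B3 cache described above can be sketched under stated assumptions (the App shape, loader signature, and method names are stand-ins inferred from the commit messages): a full refresh on an interval, write-through invalidation, and the concurrent-refresh guard and interval-before-first-refresh ordering called out in the robustness commit below.

```typescript
// Illustrative in-memory app cache with periodic refresh + invalidation.
interface App { packageName: string; name: string }
type Loader = () => Promise<App[]>;

class AppCache {
  private byPackageName = new Map<string, App>();
  private refreshing = false;

  constructor(private readonly loadAll: Loader, refreshMs = 30_000) {
    // Start the interval before the initial refresh so a transient
    // first-load failure doesn't permanently disable the cache.
    setInterval(() => void this.refresh(), refreshMs).unref();
    void this.refresh();
  }

  async refresh(): Promise<void> {
    if (this.refreshing) return; // prevent overlapping refreshes from racing
    this.refreshing = true;
    try {
      const apps = await this.loadAll();
      this.byPackageName = new Map(apps.map((a) => [a.packageName, a]));
    } finally {
      this.refreshing = false;
    }
  }

  getByPackageName(pkg: string): App | undefined {
    return this.byPackageName.get(pkg); // caller falls back to DB on miss
  }

  // Write-through invalidation: only effective on the local pod; other
  // pods wait for the refresh timer (the 30s staleness bound above).
  invalidate(pkg: string): void {
    this.byPackageName.delete(pkg);
  }
}
```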
…robustness

CRITICAL:
- SDK auth: revert to DB-only credential validation — never use stale cache
  for hashedApiKey. Security: stale cache = revoked keys still validate.
- SDK auth: export clearSdkAuthCache(), called from appCache.invalidate()
  so key rotation clears both caches.
- developer.service: stop logging API key material (raw key + hash were in
  debug logs — credential leak to all log sinks).

MAJOR:
- bun-websocket: restore ws.close(1011) in app message error handler
  (accidentally removed when adding finally block — regression).
- system-app.api + AppManager: partial cache hits now fall back to DB.
  Changed cachedApps.length > 0 to cachedApps.length === names.length.
- app-cache: concurrent refresh protection (refreshing flag prevents
  overlapping refresh() calls from racing).
- app-cache: start interval before initial refresh() so transient first-load
  failures don't permanently disable the cache.
- UdpAudioServer: move audioProcessing timing to finally block.

MINOR:
- spec.md: fix 5-min vs 30-sec inconsistency in B3 description and code.
Two App.updateOne() calls for onboardingStatus were missing cache
invalidation. Found during manual review — bots missed these.
… exit

Adds SIGTERM/SIGINT handler that sends WebSocket close frames (code 1001
'Going Away') to every connected glasses and app WebSocket before the
process exits. This reduces deploy-caused user disruption from 30-60s
(waiting for ping timeout on unclean disconnect) to <2s (immediate close
frame detection).

A1. SIGTERM handler (index.ts): closes all glasses + app WebSockets,
    stops timers (vitals, cache, metrics), closes MongoDB, exits 0.
A2. Health drain mode (hono-app.ts): /health returns 503 during shutdown
    so LB stops routing new requests to the dying pod.
A3. Reject new WS upgrades (bun-websocket.ts): returns 503 during
    shutdown so new sessions go to the new pod.

Shared shutdown state via services/shutdown.ts module.
Issue doc: cloud/issues/063-graceful-shutdown/spec.md
Deploy workflow: added hotfix/graceful-shutdown to porter-debug.yml
…ng shutdown

Without this, the LB could route REST requests to the dying pod during
the SIGTERM grace period. Those requests hit a pod with no sessions → 503.
This was likely causing some of the intermittent 503s users reported.

The middleware returns 503 on every request except /livez (so K8s can
still check process liveness during drain).
isaiahb added 4 commits March 28, 2026 12:57
…ents

process.exit(0) immediately after ws.close() can kill the runtime before
Bun flushes the close handshake. Adding a 2-second drain delay ensures
close frames are sent on the wire before the process exits.
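Putting A1-A3 and the drain delay together, the shutdown flow can be sketched as below. The shared flag, status codes, and 2s delay come from the descriptions above; the socket type and function names are hypothetical stand-ins:

```typescript
// Illustrative graceful-shutdown state (cf. services/shutdown.ts).
let shuttingDown = false;

function isShuttingDown(): boolean {
  return shuttingDown;
}

// /health returns 503 during drain so the LB stops routing here;
// /livez would stay 200 so K8s still sees the process as alive.
function healthStatus(): number {
  return shuttingDown ? 503 : 200;
}

function beginShutdown(
  sockets: Iterable<{ close(code: number, reason: string): void }>,
): void {
  shuttingDown = true;
  for (const ws of sockets) {
    ws.close(1001, "Going Away"); // immediate close frame vs. ping timeout
  }
  // Exiting immediately after ws.close() can kill the runtime before the
  // close handshake is flushed; a ~2s drain delay lets frames hit the wire.
  setTimeout(() => process.exit(0), 2_000).unref();
}
```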
@isaiahb isaiahb self-assigned this Mar 28, 2026
@isaiahb isaiahb requested a review from a team as a code owner March 28, 2026 23:05
@coderabbitai

coderabbitai bot commented Mar 28, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@github-actions

github-actions bot commented Mar 28, 2026

📋 PR Review Helper

📱 Mobile App Build

Waiting for build...

🕶️ ASG Client Build

Ready to test! (commit 5d6ade6)

📥 Download ASG APK


🔀 Test Locally

gh pr checkout 2349

@cloudflare-workers-and-pages

Deploying mentra-store-dev with Cloudflare Pages

Latest commit: 5d6ade6
Status: ✅  Deploy successful!
Preview URL: https://f85f4d60.augmentos-appstore-2.pages.dev
Branch Preview URL: https://cloud-sync-main-to-dev.augmentos-appstore-2.pages.dev

View logs

@cloudflare-workers-and-pages

Deploying dev-augmentos-console with Cloudflare Pages

Latest commit: 5d6ade6
Status: ✅  Deploy successful!
Preview URL: https://823bdd1f.dev-augmentos-console.pages.dev
Branch Preview URL: https://cloud-sync-main-to-dev.dev-augmentos-console.pages.dev

View logs

@cloudflare-workers-and-pages

Deploying prod-augmentos-account with Cloudflare Pages

Latest commit: 5d6ade6
Status: ✅  Deploy successful!
Preview URL: https://f9eded3d.augmentos-e84.pages.dev
Branch Preview URL: https://cloud-sync-main-to-dev.augmentos-e84.pages.dev

View logs

@cloudflare-workers-and-pages

Deploying mentra-live-ota-site with Cloudflare Pages

Latest commit: 5d6ade6
Status: ✅  Deploy successful!
Preview URL: https://1cc6f944.mentra-live-ota-site.pages.dev
Branch Preview URL: https://cloud-sync-main-to-dev.mentra-live-ota-site.pages.dev

View logs

@isaiahb isaiahb merged commit 77cb79d into dev Mar 28, 2026
15 checks passed

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d6ade6dcc


const memBefore = process.memoryUsage();

const t0 = performance.now();
Bun.gc(true);

P1 Badge Make forced GC probe opt-in

runGcProbe() executes Bun.gc(true) (a synchronous full GC) and start() schedules this probe every 60 seconds for all deployments, which adds guaranteed stop-the-world pauses on live traffic. Under moderate/high heap pressure this can itself cause event-loop stalls and health-check failures, so the observability feature can degrade availability in production; this should be gated behind an explicit debug/diagnostic flag (default off).


