
Cloud/sync main to dev#2349

Merged
isaiahb merged 34 commits into dev from cloud/sync-main-to-dev
Mar 28, 2026
Conversation

@isaiahb
Contributor

@isaiahb isaiahb commented Mar 28, 2026

No description provided.

isaiahb and others added 30 commits March 2, 2026 16:45
Bug fixes:
- A1: Store and clear ManagedStreamingExtension interval (leaked every disposed UserSession)
- A2: Store Soniox listener refs, use typed .off() in close() (7 listeners pinning UserSession chain)
- A3: Add 4 missing manager dispose calls in UserSession.dispose()
- A4: Identity check before sessions.delete() (prevents orphaning newer session)
- A5: Email case normalization in WebSocket upgrade
- A6: Disposed guard on TranscriptionManager scheduled reconnects
- A7: Disposed guard on TranslationManager scheduled retry
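The A2/A6/A7 fixes above share one pattern: keep typed references to every registered listener so teardown can detach exactly what was attached, and guard scheduled callbacks against running after disposal. A minimal TypeScript sketch of that pattern (class and field names are hypothetical, and Node's EventEmitter stands in for the Soniox client):

```typescript
import { EventEmitter } from "node:events";

// Hypothetical wrapper illustrating stored listener refs + typed .off()
// plus a disposed guard on scheduled work.
class TranscriptionStreamWrapper {
  private disposed = false;
  private buffer: string[] = [];
  // Keep a reference to each handler so .off() removes exactly what .on() added.
  private readonly onToken = (t: string) => this.buffer.push(t);
  private readonly onError = (e: Error) => console.error("stream error", e);

  constructor(private readonly stream: EventEmitter) {
    stream.on("token", this.onToken);
    stream.on("error", this.onError);
  }

  scheduleReconnect(): void {
    setTimeout(() => {
      if (this.disposed) return; // disposed guard: never reconnect a dead session
      // ...reconnect logic would go here...
    }, 1000);
  }

  close(): void {
    this.disposed = true;
    // Typed .off() with stored refs breaks the stream -> wrapper reference
    // chain, so the wrapper (and anything it pins) can be garbage collected.
    this.stream.off("token", this.onToken);
    this.stream.off("error", this.onError);
  }
}
```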

Observability:
- Event loop lag warnings (>100ms) logged to BetterStack with heap/session context
- /health enriched with heapUsedMB, rssMB, eventLoopLagMs, activeSessions, uptimeSeconds
- /livez lightweight liveness endpoint (zero computation)
- /api/admin/heap-snapshot endpoint (Bun.generateHeapSnapshot)
- SystemVitalsLogger: 30s periodic structured log with Golden Signals + operation timing framework
- MetricsService.getCurrentLag() public method
- porter-debug.yml updated to deploy from this branch
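For context on the lag warnings above, event loop lag is typically measured by scheduling a timer at a known interval and treating any extra delay as lag. A hedged sketch of that approach (the PR's actual MetricsService implementation may differ; thresholds mirror the >100ms warning mentioned above):

```typescript
// Illustrative event-loop-lag probe: a timer that fires "late" by N ms
// implies the loop was blocked for roughly N ms.
const INTERVAL_MS = 500;
const LAG_WARN_MS = 100;

let currentLagMs = 0;
let expected = performance.now() + INTERVAL_MS;

const probe = setInterval(() => {
  const now = performance.now();
  currentLagMs = Math.max(0, now - expected); // how late the timer fired
  expected = now + INTERVAL_MS;
  if (currentLagMs > LAG_WARN_MS) {
    console.warn(`event loop lag ${currentLagMs.toFixed(1)}ms`, {
      heapUsedMB: process.memoryUsage().heapUsed / 1024 / 1024,
    });
  }
}, INTERVAL_MS);
probe.unref(); // don't keep the process alive just for the probe

function getCurrentLag(): number {
  return currentLagMs;
}
```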

Docs: spikes and specs for issues 055, 056, 057
- Never commit tokens, secrets, API keys, or credentials (use placeholders)
- Never commit PII (customer emails, user IDs — anonymize)
- Never include private conversations verbatim
- AI agent-specific guidelines (build before push, confirm before infra changes)
- Remove duplicate Anti-Patterns section, add security anti-patterns
- Clean up old Anti-Patterns that were duplicated
…ision log

- Replace removeAllListeners approach with stored refs + typed .off()
- Document all three attempts and why each was chosen/rejected
- Add context explaining what this.session is (Soniox, not UserSession)
- Show the reference chain that causes the memory leak
- Update README design.md template to include Decision Log and
  guidance on keeping design docs current when implementation changes
Critical:
- Remove unauthenticated heap snapshot from hono-app.ts (already exists
  behind validateAdminEmail in admin.routes.ts)
- Add Bun-native GET /memory/heap-snapshot-bun to admin routes (with auth)
- Normalize userId to lowercase in handleAppUpgrade + legacy CONNECTION_INIT

Major:
- Move listener .off() to finally block after session.finish() so
  finalized/finished handlers can flush final transcript data
- Track timer handles in pendingTimers Set for both TranscriptionManager
  and TranslationManager, clear all on dispose()
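The pendingTimers change above can be sketched as follows (names are hypothetical; the real TranscriptionManager/TranslationManager code surely differs): every timer handle is tracked in a Set so dispose() can clear them all, preventing a late-firing retry from reviving a disposed manager.

```typescript
// Illustrative timer-tracking pattern from the Major fixes above.
class RetryingManager {
  private disposed = false;
  private readonly pendingTimers = new Set<ReturnType<typeof setTimeout>>();

  scheduleRetry(fn: () => void, delayMs: number): void {
    if (this.disposed) return;
    const handle = setTimeout(() => {
      this.pendingTimers.delete(handle); // timer fired; stop tracking it
      if (!this.disposed) fn();
    }, delayMs);
    this.pendingTimers.add(handle);
  }

  dispose(): void {
    this.disposed = true;
    for (const handle of this.pendingTimers) clearTimeout(handle);
    this.pendingTimers.clear();
  }

  get pendingCount(): number {
    return this.pendingTimers.size;
  }
}
```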

Minor:
- Add /livez to noisy endpoint filter (prevents probe log spam)
- Throttle event loop lag warnings (30s cooldown)
- Add getDisposedPendingGCCount() to MemoryLeakDetector (typed getter)
- Use typed getter in SystemVitalsLogger instead of as any cast
- Simplify getAndReset() to replace object
- Add try/catch around stream count access in SystemVitalsLogger
- StreamRegistry: add dispose() that clears userStreams, streamToUser, cfInputToUser Maps
- UserSettingsManager: improve dispose() to release loadPromise closure + mark loaded
- DeviceManager: already had dispose() (no-op, no state to clean)
- CalendarManager: already had dispose() (clears events + subscribedApps)
- UserSession: replace (as any).dispose?.() casts with typed calls
…tension

startPlaybackUrlPolling() had an untracked 60s setTimeout that could keep
a disposed ManagedStreamingExtension (and transitively UserSession) alive.
Now stored as this.playbackUrlTimeout, cleared in dispose().
…lity

cloud/issues-057: memory leak fixes + cloud observability overhaul
…, connection counting, slow query monitoring

Adds instrumentation to identify WHY pods become unresponsive before crashing:

A1. GC Probe (SystemVitalsLogger): Forces Bun.gc(true) every 60s, times it,
    logs duration + memory freed. Tells us if GC pauses are 5ms or 500ms.

A2. GC on Session Disconnect (UserSession): After dispose(), forces GC on
    next tick with timing. Rate-limited to 1 per 10s to avoid thrashing
    during crash cascades. Shows if sessions are properly freed.

A3. Health Check Timing (hono-app): Wraps /health in performance.now().
    Warns at >50ms. Tells us if the health handler itself is slow vs the
    event loop being blocked before the handler runs.

A4. Soniox Send Timing (SonioxSdkStream): Times sendAudio() calls. Warns
    at >50ms, rate-limited to 1 per 30s per stream. Tells us if Soniox
    WebSocket I/O is blocking the event loop.

A5. Connection Counting (SystemVitalsLogger): Adds glassesWebSockets,
    totalConnections, micActiveCount to 30s vitals. Correlates crashes
    with connection count (200-325) not just session count (65).

A6. MongoDB Slow Query Plugin (mongodb.connection): Global Mongoose plugin
    that times all queries. Warns when >MONGOOSE_SLOW_QUERY_MS (env var,
    default disabled). Set to 100 in Doppler to enable.

Also: spec.md for the full plan, tsconfig excludes scripts/ from build.

See: cloud/issues/061-crash-investigation/spec.md
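The A1 GC probe described above can be sketched as a small timing wrapper. This is an assumption-laden illustration, not the PR's code: Bun.gc(true) is Bun-specific, so this version falls back to Node's --expose-gc hook or a no-op, and the measure-and-report shape is the point.

```typescript
// Illustrative GC probe: time a forced full collection and report
// duration plus memory freed, per the A1 description above.
interface GcProbeResult {
  durationMs: number;
  freedMB: number;
}

function runGcProbe(): GcProbeResult {
  const before = process.memoryUsage().heapUsed;
  const t0 = performance.now();
  const bunGc = (globalThis as any).Bun?.gc;
  if (typeof bunGc === "function") {
    bunGc(true); // synchronous full GC under Bun
  } else if (typeof (globalThis as any).gc === "function") {
    (globalThis as any).gc(); // Node with --expose-gc
  }
  const durationMs = performance.now() - t0;
  const freedMB = Math.max(
    0,
    (before - process.memoryUsage().heapUsed) / 1024 / 1024,
  );
  return { durationMs, freedMB };
}
```

Note the tradeoff: a forced full GC is itself a stop-the-world pause, which is why the Codex review on this PR suggests gating the probe behind an explicit diagnostic flag.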
…g, cleanup

- Fix glassesWebSockets: check session.websocket (actual field name)
- Move mongoose.plugin() to module load time (before models import)
- Remove .env fallback from analyze-heap — just require the env var
- Add division-by-zero guard in analyze-heap
…oof framework, tech debt

- Complete audit of every MongoDB call in cloud server (11 collections, 262 call sites)
- 18 hot-path calls identified (session connect, message handling, location updates)
- Only 18% of reads use .lean() — 82% instantiate full Mongoose Documents unnecessarily
- France event loop blocked 162s/1800s (9%) by MongoDB RTT alone
- GC confirmed NOT primary cause (54ms probes, 0MB freed)
- Proof framework: 4 steps to definitively confirm/deny MongoDB as crash cause
- Tech debt doc: 8 items found (app cache, atomic updates, lean, N+1, JWT centralization)
- Key finding: latency is pure network RTT, not query execution (indexes work, 0ms on server)
…mory app cache

Three changes:
B1. Event loop gap detector — 1s setInterval that detects >2s gaps, logs the
    blocking duration. The missing link between 'query was slow' and 'health
    check failed because the event loop was blocked at that exact moment.'

B2. Cumulative MongoDB blocking metric — mongoQueryCount, mongoTotalBlockingMs,
    mongoMaxQueryMs in 30s system vitals. Extends existing slow-query plugin.

B3. In-memory app cache — loads all 1,314 apps (~2MB) at boot, refreshes
    every 5 min with write-through invalidation. Eliminates 18 hot-path DB
    round-trips (80-370ms each) per session. Largest single-change impact
    on crash frequency.

Migration: 9 hot-path call sites switch from App.findOne() to appCache.
Cold paths left as-is.
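The B2 accumulator can be sketched as below. The field names come from the commit messages; the implementation details (including getAndReset() replacing the object rather than zeroing fields, per the Minor fixes earlier) are assumptions:

```typescript
// Illustrative slow-query accumulator: the Mongoose plugin would call
// record() per slow query, and the vitals logger drains it every 30s.
class MongoQueryStats {
  private stats = {
    mongoQueryCount: 0,
    mongoTotalBlockingMs: 0,
    mongoMaxQueryMs: 0,
  };

  record(durationMs: number): void {
    this.stats.mongoQueryCount += 1;
    this.stats.mongoTotalBlockingMs += durationMs;
    if (durationMs > this.stats.mongoMaxQueryMs) {
      this.stats.mongoMaxQueryMs = durationMs;
    }
  }

  // Replace the object rather than mutating it, so the caller gets an
  // immutable snapshot of the last 30s window.
  getAndReset() {
    const snapshot = this.stats;
    this.stats = { mongoQueryCount: 0, mongoTotalBlockingMs: 0, mongoMaxQueryMs: 0 };
    return snapshot;
  }
}
```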
…plica guide, B4 operation timing

- Detailed every write path (16 across 5 files) that must call invalidate()
- Documented every staleness edge case: publicUrl, hashedApiKey, permissions
  are critical — bounded to 30s max staleness (down from 5 min)
- Added refresh failure detection (warn if 3 missed cycles)
- Multi-pod staleness: write-through only works on local pod, other regions
  wait for 30s timer — this is the tradeoff vs 9% event loop blocking
- Atlas read replica step-by-step: UI steps, readPreference=nearest, cost
  (~$228/mo for 4 replicas). Future work, not in scope for this spec.
- B4: wire operationTimers to audio processing, glasses/app message handlers,
  display rendering. Measures actual synchronous CPU budget consumption.
…tion timing

B1. Event Loop Gap Detector (SystemVitalsLogger)
    1s setInterval detects >2s gaps — catches ALL blocking regardless of cause.
    Logs 'event-loop-gap' with gap duration, RSS, session count.
    The missing link between 'query was slow' and 'health check failed.'

B2. Cumulative MongoDB Blocking Metric (mongodb.connection + SystemVitalsLogger)
    MongoQueryStats accumulator records all slow queries (>MONGOOSE_SLOW_QUERY_MS).
    SystemVitalsLogger reads/resets every 30s: mongoQueryCount, mongoTotalBlockingMs,
    mongoMaxQueryMs. Shows exact % of event loop time consumed by MongoDB.

B3. In-Memory App Cache (app-cache.service.ts + 9 hot-path migrations)
    Loads all 1,314 apps (~2MB) at boot. Refreshes every 30s.
    9 hot-path call sites migrated from App.findOne() to appCache.getByPackageName()
    with DB fallback. 16 write paths call appCache.invalidate() fire-and-forget.
    Eliminates 80-370ms RTT per hot-path app lookup.
    Files: AppManager, SubscriptionManager, app-message-handler, sdk.auth.service,
    system-app.api (5 call sites). Invalidation in: console.apps.service, app.service,
    developer.service, developer.routes, permissions.routes, admin.routes.

B4. Hot Path Operation Timing (UdpAudioServer, bun-websocket, DisplayManager)
    Wires existing operationTimers framework to: audioProcessing (3,250 ops/sec),
    glassesMessage, appMessage, displayRendering. Vitals now show op_*_ms and
    opBudgetUsedPct — exact % of event loop consumed by application code.

Also: .gitignore for .heap/
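The B3 cache described above can be sketched under stated assumptions (the App shape, loader signature, and method names are stand-ins inferred from the commit messages): a full refresh on an interval, write-through invalidation, and the concurrent-refresh guard and interval-before-first-refresh ordering called out in the robustness commit below.

```typescript
// Illustrative in-memory app cache with periodic refresh + invalidation.
interface App { packageName: string; name: string }
type Loader = () => Promise<App[]>;

class AppCache {
  private byPackageName = new Map<string, App>();
  private refreshing = false;

  constructor(private readonly loadAll: Loader, refreshMs = 30_000) {
    // Start the interval before the initial refresh so a transient
    // first-load failure doesn't permanently disable the cache.
    setInterval(() => void this.refresh(), refreshMs).unref();
    void this.refresh();
  }

  async refresh(): Promise<void> {
    if (this.refreshing) return; // prevent overlapping refreshes from racing
    this.refreshing = true;
    try {
      const apps = await this.loadAll();
      this.byPackageName = new Map(apps.map((a) => [a.packageName, a]));
    } finally {
      this.refreshing = false;
    }
  }

  getByPackageName(pkg: string): App | undefined {
    return this.byPackageName.get(pkg); // caller falls back to DB on miss
  }

  // Write-through invalidation: only effective on the local pod; other
  // pods wait for the refresh timer (the 30s staleness bound above).
  invalidate(pkg: string): void {
    this.byPackageName.delete(pkg);
  }
}
```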
…robustness

CRITICAL:
- SDK auth: revert to DB-only credential validation — never use stale cache
  for hashedApiKey. Security: stale cache = revoked keys still validate.
- SDK auth: export clearSdkAuthCache(), called from appCache.invalidate()
  so key rotation clears both caches.
- developer.service: stop logging API key material (raw key + hash were in
  debug logs — credential leak to all log sinks).

MAJOR:
- bun-websocket: restore ws.close(1011) in app message error handler
  (accidentally removed when adding finally block — regression).
- system-app.api + AppManager: partial cache hits now fall back to DB.
  Changed cachedApps.length > 0 to cachedApps.length === names.length.
- app-cache: concurrent refresh protection (refreshing flag prevents
  overlapping refresh() calls from racing).
- app-cache: start interval before initial refresh() so transient first-load
  failures don't permanently disable the cache.
- UdpAudioServer: move audioProcessing timing to finally block.

MINOR:
- spec.md: fix 5-min vs 30-sec inconsistency in B3 description and code.
Two App.updateOne() calls for onboardingStatus were missing cache
invalidation. Found during manual review — bots missed these.
… exit

Adds SIGTERM/SIGINT handler that sends WebSocket close frames (code 1001
'Going Away') to every connected glasses and app WebSocket before the
process exits. This reduces deploy-caused user disruption from 30-60s
(waiting for ping timeout on unclean disconnect) to <2s (immediate close
frame detection).

A1. SIGTERM handler (index.ts): closes all glasses + app WebSockets,
    stops timers (vitals, cache, metrics), closes MongoDB, exits 0.
A2. Health drain mode (hono-app.ts): /health returns 503 during shutdown
    so LB stops routing new requests to the dying pod.
A3. Reject new WS upgrades (bun-websocket.ts): returns 503 during
    shutdown so new sessions go to the new pod.

Shared shutdown state via services/shutdown.ts module.
Issue doc: cloud/issues/063-graceful-shutdown/spec.md
Deploy workflow: added hotfix/graceful-shutdown to porter-debug.yml
…ng shutdown

Without this, the LB could route REST requests to the dying pod during
the SIGTERM grace period. Those requests hit a pod with no sessions → 503.
This was likely causing some of the intermittent 503s users reported.

The middleware returns 503 on every request except /livez (so K8s can
still check process liveness during drain).
isaiahb added 4 commits March 28, 2026 12:57
…ents

process.exit(0) immediately after ws.close() can kill the runtime before
Bun flushes the close handshake. Adding a 2-second drain delay ensures
close frames are sent on the wire before the process exits.
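Putting A1-A3 and the drain delay together, the shutdown flow can be sketched as below. The shared flag, status codes, and 2s delay come from the descriptions above; the socket type and function names are hypothetical stand-ins:

```typescript
// Illustrative graceful-shutdown state (cf. services/shutdown.ts).
let shuttingDown = false;

function isShuttingDown(): boolean {
  return shuttingDown;
}

// /health returns 503 during drain so the LB stops routing here;
// /livez would stay 200 so K8s still sees the process as alive.
function healthStatus(): number {
  return shuttingDown ? 503 : 200;
}

function beginShutdown(
  sockets: Iterable<{ close(code: number, reason: string): void }>,
): void {
  shuttingDown = true;
  for (const ws of sockets) {
    ws.close(1001, "Going Away"); // immediate close frame vs. ping timeout
  }
  // Exiting immediately after ws.close() can kill the runtime before the
  // close handshake is flushed; a ~2s drain delay lets frames hit the wire.
  setTimeout(() => process.exit(0), 2_000).unref();
}
```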
@isaiahb isaiahb self-assigned this Mar 28, 2026
@isaiahb isaiahb requested a review from a team as a code owner March 28, 2026 23:05
@coderabbitai

coderabbitai bot commented Mar 28, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@github-actions

github-actions bot commented Mar 28, 2026

📋 PR Review Helper

📱 Mobile App Build

Waiting for build...

🕶️ ASG Client Build

Ready to test! (commit 5d6ade6)

📥 Download ASG APK


🔀 Test Locally

gh pr checkout 2349

@cloudflare-workers-and-pages

Deploying mentra-store-dev with Cloudflare Pages

Latest commit: 5d6ade6
Status: ✅  Deploy successful!
Preview URL: https://f85f4d60.augmentos-appstore-2.pages.dev
Branch Preview URL: https://cloud-sync-main-to-dev.augmentos-appstore-2.pages.dev

View logs

@cloudflare-workers-and-pages

Deploying dev-augmentos-console with Cloudflare Pages

Latest commit: 5d6ade6
Status: ✅  Deploy successful!
Preview URL: https://823bdd1f.dev-augmentos-console.pages.dev
Branch Preview URL: https://cloud-sync-main-to-dev.dev-augmentos-console.pages.dev

View logs

@cloudflare-workers-and-pages

Deploying prod-augmentos-account with Cloudflare Pages

Latest commit: 5d6ade6
Status: ✅  Deploy successful!
Preview URL: https://f9eded3d.augmentos-e84.pages.dev
Branch Preview URL: https://cloud-sync-main-to-dev.augmentos-e84.pages.dev

View logs

@cloudflare-workers-and-pages

Deploying mentra-live-ota-site with Cloudflare Pages

Latest commit: 5d6ade6
Status: ✅  Deploy successful!
Preview URL: https://1cc6f944.mentra-live-ota-site.pages.dev
Branch Preview URL: https://cloud-sync-main-to-dev.mentra-live-ota-site.pages.dev

View logs

@isaiahb isaiahb merged commit 77cb79d into dev Mar 28, 2026
15 checks passed

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d6ade6dcc


const memBefore = process.memoryUsage();

const t0 = performance.now();
Bun.gc(true);

P1 Badge Make forced GC probe opt-in

runGcProbe() executes Bun.gc(true) (a synchronous full GC) and start() schedules this probe every 60 seconds for all deployments, which adds guaranteed stop-the-world pauses on live traffic. Under moderate/high heap pressure this can itself cause event-loop stalls and health-check failures, so the observability feature can degrade availability in production; this should be gated behind an explicit debug/diagnostic flag (default off).


