@@ -1040,50 +1040,137 @@ function cmdHelp() {
10401040 console.log(`
10411041bstack — BetterStack CLI for MentraCloud SRE
10421042
1043- Investigation commands (new):
1043+ ═══════════════════════════════════════════════════════════════════════════
1044+ HOW IT WORKS (for humans and AI agents)
1045+ ═══════════════════════════════════════════════════════════════════════════
1046+
1047+ This CLI sends ClickHouse SQL queries to BetterStack's HTTP API at:
1048+ ${SQL_ENDPOINT}
1049+
1050+ Logs are stored in two places with different tradeoffs:
1051+ HOT storage — last ~2-5 minutes only, fast (<1s queries)
1052+ Prod: remote(t373499_mentracloud_prod_logs)
1053+ Dev: remote(t373499_augmentos_logs)
1054+ COLD storage — full history, slower (3-5s queries), needs _row_type = 1
1055+ Prod: s3Cluster(primary, t373499_mentracloud_prod_s3)
1056+ Dev: s3Cluster(primary, t373499_augmentos_s3)
1057+
1058+ The CLI auto-selects hot vs cold based on --duration:
1059+ ≤5 min → hot table (fast, recent data only)
1060+ >5 min → cold/S3 table (slow, full history, adds WHERE _row_type = 1)
1061+ If a query returns zero rows where you expect data, --duration may have
1062+ selected the wrong table: try a value on the other side of the 5-minute cutoff.
1063+
1064+ Log fields are in a JSON blob called 'raw'. To extract fields use:
1065+ JSONExtractString(raw, 'level') → "error", "warn", "info", "debug"
1066+ JSONExtractString(raw, 'message') → the log message
1067+ JSONExtractString(raw, 'service') → "AppManager", "UserSession", etc.
1068+ JSONExtractString(raw, 'region') → "us-central", "france", etc.
1069+ JSONExtractString(raw, 'feature') → "system-vitals", "gc-probe", etc.
1070+ JSONExtractString(raw, 'userId') → user email
1071+ JSONExtractFloat(raw, 'rssMB') → RSS memory in MB
1072+ JSONExtractInt(raw, 'activeSessions') → session count
1073+ Do NOT use json.field dot notation — it doesn't work on these tables.
1074+
1075+ Credentials are loaded in this order:
1076+ 1. Environment variables (BETTERSTACK_USERNAME, BETTERSTACK_PASSWORD, etc.)
1077+ 2. Auto-load from Doppler: doppler secrets get --project mentra-sre --config dev
1078+ (runs automatically if env vars are missing and doppler CLI is installed)
1079+ SRE credentials live in Doppler project "mentra-sre" (NOT "mentraos-cloud").
1080+ Cloud runtime secrets (MONGO_URL, etc.) are in "mentraos-cloud" — don't mix them.
1081+
1082+ The admin API (for 'session' command) hits the cloud's /api/admin/memory/now
1083+ endpoint with a Bearer JWT. The JWT is MENTRA_ADMIN_JWT from mentra-sre.
1084+
1085+ Health checks hit each region's /health endpoint directly (no BetterStack):
1086+ ${Object.entries(REGIONS)
1087+ .map(([id, r]) => ` ${id.padEnd(12)} → ${r.healthUrl}`)
1088+ .join("\n")}
1089+
1090+ If a command isn't doing what you need, use 'bstack sql' to run raw
1091+ ClickHouse SQL directly. The patterns above show exactly how to query.
1092+
1093+ ═══════════════════════════════════════════════════════════════════════════
1094+ COMMANDS
1095+ ═══════════════════════════════════════════════════════════════════════════
1096+
1097+ Investigation:
10441098 bstack logs <user|keyword> Search logs by user or keyword
1045- --level error,warn Filter by log level
1046- --duration 1h How far back to search (default: 15m)
1047- --region france Filter by region
1048- --service AppManager Filter by service
1049- --env dev Search dev/debug source (default: prod)
1050- bstack errors --region <r> Top errors by count
1051- --duration 4h Time window (default: 4h)
1052- bstack leaks --region <r> Memory leak detection
1053- --duration 12h Time window (default: 12h)
1054- bstack session <userId> --host <hostname> Live session inspection via admin API
1055-
1056- Diagnostics commands:
1057- bstack health Quick health check across all regions
1058- bstack diagnostics --region <r> Full diagnostics (GC, gaps, MongoDB, budget)
1059- bstack crash-timeline --region <r> What happened before the last crash
1060- bstack memory --region <r> [--duration 1h] Memory trend over time
1061- bstack gc --region <r> [--duration 1h] GC probe analysis
1062- bstack gaps --region <r> [--duration 1h] Event loop gap analysis
1063- bstack budget --region <r> Operation budget (CPU consumers)
1064- bstack slow-queries --region <r> MongoDB slow query breakdown
1065- bstack cache --region <r> App cache status
1066-
1067- Infrastructure commands:
1068- bstack incidents [--limit 10] Recent uptime incidents
1069- bstack sources List all BetterStack sources/collectors
1070- bstack sql "SELECT ..." Raw ClickHouse SQL query
1071- bstack runbook <name> Open a runbook
1099+ --level error,warn Filters: JSONExtractString(raw, 'level') IN (...)
1100+ --duration 1h How far back (default: 15m). Controls hot vs cold table.
1101+ --region france Filters: JSONExtractString(raw, 'region') = '...'
1102+ --service AppManager Filters: JSONExtractString(raw, 'service') = '...'
1103+ --env dev Search dev/debug source instead of prod (default: prod)
1104+ Internally: SELECT dt, level, message, service FROM <table> WHERE raw LIKE '%<query>%' ...
10721105
1073- Regions: ${getAllRegions().join(", ")}
1106+ bstack errors --region <r> Top errors grouped by service + message
1107+ --duration 4h Time window (default: 4h)
1108+ Internally: GROUP BY service, message ORDER BY count() DESC
1109+
1110+ bstack leaks --region <r> Memory leak detection — 3 queries:
1111+ --duration 12h Time window (default: 12h)
1112+ 1. disposedSessionsPendingGC trend (from system-vitals, grouped by hour)
1113+ Should be 0. If climbing, sessions are stuck in memory after dispose().
1114+ 2. GC probe trend (from gc-probe). avg_freed_mb should be > 0.
1115+ If GC consistently frees 0 MB, objects are still reachable when they should not be.
1116+ 3. MemoryLeakDetector events — "Potential leak" = disposed but not GC'd
1117+ within 60s. "Object finalized by GC" = eventually collected (ok).
1118+
1119+ bstack session <userId> --host <hostname> Hits GET /api/admin/memory/now on the host,
1120+ finds the user's session, shows running apps, subscriptions, mic state.
1121+ Requires MENTRA_ADMIN_JWT. Host must be the actual cloud hostname.
1122+
1123+ Diagnostics:
1124+ bstack health Fetches /health from ALL regions in parallel.
1125+ Shows: sessions, uptime, RSS, heap, event loop lag.
1126+ Also fetches BetterStack uptime monitors if API_TOKEN is set.
1127+
1128+ bstack diagnostics --region <r> Runs 5 queries: GC probes, event loop gaps,
1129+ MongoDB slow queries, operation budget, app cache. All from system-vitals/gc-probe/etc.
1130+
1131+ bstack crash-timeline --region <r> Shows interleaved system-vitals, gc-probe,
1132+ event-loop-gap, and slow-query events to reconstruct what happened before a crash.
1133+
1134+ bstack memory --region <r> [--duration 1h] RSS/heap/external/arraybuf trend over time.
1135+ Calculates growth rate and estimates time to 1GB.
1136+
1137+ bstack gc --region <r> [--duration 1h] GC probe durations and freed MB.
1138+ Warns if max GC > 100ms (contributing to event loop blocking).
1139+
1140+ bstack gaps --region <r> [--duration 1h] Event loop gaps (>1s freezes).
1141+ Zero gaps = healthy. Any gaps = something blocking (GC, MongoDB, etc.).
10741142
1075- Credentials:
1076- Auto-loaded from Doppler (mentra-sre project) if available.
1077- Or set manually:
1078- BETTERSTACK_USERNAME ClickHouse HTTP API username
1079- BETTERSTACK_PASSWORD ClickHouse HTTP API password
1080- BETTERSTACK_API_TOKEN Management API token (for uptime/incidents)
1081- MENTRA_ADMIN_JWT Admin API token (for session inspection)
1082-
1083- 📖 Runbooks:
1084- bstack runbook pod-crash What to do when a pod crashes
1085- bstack runbook weekly-error-audit Weekly error audit process
1086- bstack runbook client-disconnect Investigate client disconnection patterns
1143+ bstack budget --region <r> Per-operation CPU time breakdown.
1144+ Shows audio processing, app messages, display rendering, MongoDB, etc.
1145+ Budget > 50% = event loop is CPU-bound. < 20% = healthy headroom.
1146+
1147+ bstack slow-queries --region <r> MongoDB queries > 100ms, grouped by
1148+ collection + operation. Shows count, avg/max duration, total blocking time.
1149+
1150+ bstack cache --region <r> App cache refresh stats (count, timing).
1151+
1152+ Infrastructure:
1153+ bstack incidents [--limit 10] Fetches from BetterStack Uptime API.
1154+ Requires BETTERSTACK_API_TOKEN. Shows start time, cause, resolution.
1155+
1156+ bstack sources Lists all BetterStack log sources, collectors,
1157+ dashboards, and uptime monitors with their IDs and table names.
1158+
1159+ bstack sql "SELECT ..." Runs raw ClickHouse SQL. Appends FORMAT JSON
1160+ automatically. Use this when no built-in command covers your query.
1161+
1162+ bstack runbook <name> Prints a runbook from cloud/tools/bstack/runbooks/.
1163+ Runbooks contain step-by-step investigation procedures with example queries.
1164+
1165+ Regions: ${getAllRegions().join(", ")}
1166+ Cluster IDs: ${Object.entries(REGIONS)
1167+ .map(([id, r]) => `${id}=${r.clusterId}`)
1168+ .join(", ")}
1169+
1170+ 📖 Runbooks (run 'bstack runbook <name>' to read):
1171+ pod-crash What to do when a pod crashes (exit codes, heap analysis, timer audit)
1172+ weekly-error-audit Weekly error audit process (top errors, log volume, churn, memory)
1173+ client-disconnect Investigate client disconnection patterns (ws-close, ws-reconnect)
10871174`)
10881175}
10891176
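The help text describes two rules that are easy to get wrong when writing raw SQL by hand: the hot/cold table choice driven by --duration, and field access through JSONExtractString on the 'raw' column. The sketch below illustrates both, assuming nothing about the real implementation. The table expressions are copied from the help text; every function name, the accepted duration suffixes, the time filter on dt, and the ORDER BY / LIMIT clauses are assumptions made for illustration.

```javascript
// Table expressions copied from the help text; every function name here is hypothetical.
const TABLES = {
  prod: {
    hot: "remote(t373499_mentracloud_prod_logs)",
    cold: "s3Cluster(primary, t373499_mentracloud_prod_s3)",
  },
  dev: {
    hot: "remote(t373499_augmentos_logs)",
    cold: "s3Cluster(primary, t373499_augmentos_s3)",
  },
};

const SECONDS_PER_UNIT = { s: 1, m: 60, h: 3600, d: 86400 };

// Convert "90s", "15m", "4h", "2d" to seconds. The accepted suffixes are an assumption.
function parseDurationSeconds(duration) {
  const match = /^(\d+)([smhd])$/.exec(duration);
  if (!match) throw new Error(`Unrecognized duration: ${duration}`);
  return Number(match[1]) * SECONDS_PER_UNIT[match[2]];
}

// <=5 min reads the hot table; anything longer reads cold storage with _row_type = 1.
function selectTable(duration, env = "prod") {
  const useHot = parseDurationSeconds(duration) <= 5 * 60;
  return {
    table: useHot ? TABLES[env].hot : TABLES[env].cold,
    extraWhere: useHot ? [] : ["_row_type = 1"],
  };
}

// Quote a value as a ClickHouse string literal (backslash-escapes \ and ').
function sqlString(value) {
  return "'" + String(value).replace(/\\/g, "\\\\").replace(/'/g, "\\'") + "'";
}

function buildLogsQuery({ query, duration = "15m", env = "prod", levels = [], region, service }) {
  const { table, extraWhere } = selectTable(duration, env);
  const where = [
    ...extraWhere,
    // Assumption: dt is the timestamp column and the window is applied like this.
    `dt >= now() - INTERVAL ${parseDurationSeconds(duration)} SECOND`,
    // Note: % and _ inside the keyword still act as LIKE wildcards.
    `raw LIKE ${sqlString("%" + query + "%")}`,
  ];
  if (levels.length > 0) {
    where.push(`JSONExtractString(raw, 'level') IN (${levels.map(sqlString).join(", ")})`);
  }
  if (region) where.push(`JSONExtractString(raw, 'region') = ${sqlString(region)}`);
  if (service) where.push(`JSONExtractString(raw, 'service') = ${sqlString(service)}`);

  return [
    "SELECT dt,",
    "  JSONExtractString(raw, 'level') AS level,",
    "  JSONExtractString(raw, 'message') AS message,",
    "  JSONExtractString(raw, 'service') AS service",
    `FROM ${table}`,
    `WHERE ${where.join("\n  AND ")}`,
    "ORDER BY dt DESC", // assumption: newest first
    "LIMIT 100", // assumption: arbitrary cap for the example
  ].join("\n");
}
```

A string built this way could be passed to 'bstack sql' for ad hoc investigation. Keeping the table choice in one function means the _row_type = 1 filter cannot be forgotten when a query crosses the 5-minute cutoff.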