Commit 0508261

docs(bstack): transparent help text — show internals for humans and AI agents
Help text now explains:
- What tables exist (hot vs cold, prod vs dev) and their actual names
- Auto-selection logic (≤5min → hot, >5min → cold/S3 with _row_type=1)
- JSONExtract syntax with all common field names
- Credential loading order (env vars → Doppler mentra-sre auto-load)
- Health endpoint URLs per region
- What each command does internally (which feature/table it queries)
- When and how to fall back to raw SQL
- Cluster IDs in the help output

The goal: if the CLI has a bug or returns unexpected results, the reader (human or AI) can reason about what's happening and work around it.
1 parent 628ce29 commit 0508261

1 file changed: +128 -41 lines changed

cloud/tools/bstack/bstack.ts

Lines changed: 128 additions & 41 deletions
@@ -1040,50 +1040,137 @@ function cmdHelp() {
10401040
console.log(`
10411041
bstack — BetterStack CLI for MentraCloud SRE
10421042
1043-
Investigation commands (new):
1043+
═══════════════════════════════════════════════════════════════════════════
1044+
HOW IT WORKS (for humans and AI agents)
1045+
═══════════════════════════════════════════════════════════════════════════
1046+
1047+
This CLI sends ClickHouse SQL queries to BetterStack's HTTP API at:
1048+
${SQL_ENDPOINT}
1049+
1050+
Logs are stored in two places with different tradeoffs:
1051+
HOT storage — last ~2-5 minutes only, fast (<1s queries)
1052+
Prod: remote(t373499_mentracloud_prod_logs)
1053+
Dev: remote(t373499_augmentos_logs)
1054+
COLD storage — full history, slower (3-5s queries), needs _row_type = 1
1055+
Prod: s3Cluster(primary, t373499_mentracloud_prod_s3)
1056+
Dev: s3Cluster(primary, t373499_augmentos_s3)
1057+
1058+
The CLI auto-selects hot vs cold based on --duration:
1059+
≤5 min → hot table (fast, recent data only)
1060+
>5 min → cold/S3 table (slow, full history, adds WHERE _row_type = 1)
1061+
If a query returns zero rows where you expect data, check --duration: the
1062+
window may be too short for cold storage or too long for hot storage.
1063+
1064+
Log fields are in a JSON blob called 'raw'. To extract fields use:
1065+
JSONExtractString(raw, 'level') → "error", "warn", "info", "debug"
1066+
JSONExtractString(raw, 'message') → the log message
1067+
JSONExtractString(raw, 'service') → "AppManager", "UserSession", etc.
1068+
JSONExtractString(raw, 'region') → "us-central", "france", etc.
1069+
JSONExtractString(raw, 'feature') → "system-vitals", "gc-probe", etc.
1070+
JSONExtractString(raw, 'userId') → user email
1071+
JSONExtractFloat(raw, 'rssMB') → RSS memory in MB
1072+
JSONExtractInt(raw, 'activeSessions') → session count
1073+
Do NOT use json.field dot notation — it doesn't work on these tables.
1074+
1075+
Credentials are loaded in this order:
1076+
1. Environment variables (BETTERSTACK_USERNAME, BETTERSTACK_PASSWORD, etc.)
1077+
2. Auto-load from Doppler: doppler secrets get --project mentra-sre --config dev
1078+
(runs automatically if env vars are missing and doppler CLI is installed)
1079+
SRE credentials live in Doppler project "mentra-sre" (NOT "mentraos-cloud").
1080+
Cloud runtime secrets (MONGO_URL, etc.) are in "mentraos-cloud" — don't mix them.
1081+
1082+
The admin API (for 'session' command) hits the cloud's /api/admin/memory/now
1083+
endpoint with a Bearer JWT. The JWT is MENTRA_ADMIN_JWT from mentra-sre.
1084+
1085+
Health checks hit each region's /health endpoint directly (no BetterStack):
1086+
${Object.entries(REGIONS)
1087+
.map(([id, r]) => ` ${id.padEnd(12)}${r.healthUrl}`)
1088+
.join("\n")}
1089+
1090+
If a command isn't doing what you need, use 'bstack sql' to run raw
1091+
ClickHouse SQL directly. The patterns above show exactly how to query.
1092+
1093+
═══════════════════════════════════════════════════════════════════════════
1094+
COMMANDS
1095+
═══════════════════════════════════════════════════════════════════════════
1096+
1097+
Investigation:
10441098
bstack logs <user|keyword> Search logs by user or keyword
1045-
--level error,warn Filter by log level
1046-
--duration 1h How far back to search (default: 15m)
1047-
--region france Filter by region
1048-
--service AppManager Filter by service
1049-
--env dev Search dev/debug source (default: prod)
1050-
bstack errors --region <r> Top errors by count
1051-
--duration 4h Time window (default: 4h)
1052-
bstack leaks --region <r> Memory leak detection
1053-
--duration 12h Time window (default: 12h)
1054-
bstack session <userId> --host <hostname> Live session inspection via admin API
1055-
1056-
Diagnostics commands:
1057-
bstack health Quick health check across all regions
1058-
bstack diagnostics --region <r> Full diagnostics (GC, gaps, MongoDB, budget)
1059-
bstack crash-timeline --region <r> What happened before the last crash
1060-
bstack memory --region <r> [--duration 1h] Memory trend over time
1061-
bstack gc --region <r> [--duration 1h] GC probe analysis
1062-
bstack gaps --region <r> [--duration 1h] Event loop gap analysis
1063-
bstack budget --region <r> Operation budget (CPU consumers)
1064-
bstack slow-queries --region <r> MongoDB slow query breakdown
1065-
bstack cache --region <r> App cache status
1066-
1067-
Infrastructure commands:
1068-
bstack incidents [--limit 10] Recent uptime incidents
1069-
bstack sources List all BetterStack sources/collectors
1070-
bstack sql "SELECT ..." Raw ClickHouse SQL query
1071-
bstack runbook <name> Open a runbook
1099+
--level error,warn Filters: JSONExtractString(raw, 'level') IN (...)
1100+
--duration 1h How far back (default: 15m). Controls hot vs cold table.
1101+
--region france Filters: JSONExtractString(raw, 'region') = '...'
1102+
--service AppManager Filters: JSONExtractString(raw, 'service') = '...'
1103+
--env dev Search dev/debug source instead of prod (default: prod)
1104+
Internally: SELECT dt, level, message, service FROM <table> WHERE raw LIKE '%<query>%' ...
10721105
1073-
Regions: ${getAllRegions().join(", ")}
1106+
bstack errors --region <r> Top errors grouped by service + message
1107+
--duration 4h Time window (default: 4h)
1108+
Internally: GROUP BY service, message ORDER BY count() DESC
1109+
1110+
bstack leaks --region <r> Memory leak detection — 3 queries:
1111+
--duration 12h Time window (default: 12h)
1112+
1. disposedSessionsPendingGC trend (from system-vitals, grouped by hour)
1113+
Should be 0. If climbing, sessions are stuck in memory after dispose().
1114+
2. GC probe trend (from gc-probe). avg_freed_mb should be > 0.
1115+
If GC frees 0MB consistently, objects are reachable but shouldn't be.
1116+
3. MemoryLeakDetector events — "Potential leak" = disposed but not GC'd
1117+
within 60s. "Object finalized by GC" = eventually collected (ok).
1118+
1119+
bstack session <userId> --host <hostname> Hits GET /api/admin/memory/now on the host,
1120+
finds the user's session, shows running apps, subscriptions, mic state.
1121+
Requires MENTRA_ADMIN_JWT. Host must be the actual cloud hostname.
1122+
1123+
Diagnostics:
1124+
bstack health Fetches /health from ALL regions in parallel.
1125+
Shows: sessions, uptime, RSS, heap, event loop lag.
1126+
Also fetches BetterStack uptime monitors if API_TOKEN is set.
1127+
1128+
bstack diagnostics --region <r> Runs 5 queries: GC probes, event loop gaps,
1129+
MongoDB slow queries, operation budget, app cache. All from system-vitals/gc-probe/etc.
1130+
1131+
bstack crash-timeline --region <r> Shows interleaved system-vitals, gc-probe,
1132+
event-loop-gap, and slow-query events to reconstruct what happened before a crash.
1133+
1134+
bstack memory --region <r> [--duration 1h] RSS/heap/external/arraybuf trend over time.
1135+
Calculates growth rate and estimates time to 1GB.
1136+
1137+
bstack gc --region <r> [--duration 1h] GC probe durations and freed MB.
1138+
Warns if max GC > 100ms (contributing to event loop blocking).
1139+
1140+
bstack gaps --region <r> [--duration 1h] Event loop gaps (>1s freezes).
1141+
Zero gaps = healthy. Any gaps = something blocking (GC, MongoDB, etc.).
10741142
1075-
Credentials:
1076-
Auto-loaded from Doppler (mentra-sre project) if available.
1077-
Or set manually:
1078-
BETTERSTACK_USERNAME ClickHouse HTTP API username
1079-
BETTERSTACK_PASSWORD ClickHouse HTTP API password
1080-
BETTERSTACK_API_TOKEN Management API token (for uptime/incidents)
1081-
MENTRA_ADMIN_JWT Admin API token (for session inspection)
1082-
1083-
📖 Runbooks:
1084-
bstack runbook pod-crash What to do when a pod crashes
1085-
bstack runbook weekly-error-audit Weekly error audit process
1086-
bstack runbook client-disconnect Investigate client disconnection patterns
1143+
bstack budget --region <r> Per-operation CPU time breakdown.
1144+
Shows audio processing, app messages, display rendering, MongoDB, etc.
1145+
Budget > 50% = event loop is CPU-bound. < 20% = healthy headroom.
1146+
1147+
bstack slow-queries --region <r> MongoDB queries > 100ms, grouped by
1148+
collection + operation. Shows count, avg/max duration, total blocking time.
1149+
1150+
bstack cache --region <r> App cache refresh stats (count, timing).
1151+
1152+
Infrastructure:
1153+
bstack incidents [--limit 10] Fetches from BetterStack Uptime API.
1154+
Requires BETTERSTACK_API_TOKEN. Shows start time, cause, resolution.
1155+
1156+
bstack sources Lists all BetterStack log sources, collectors,
1157+
dashboards, and uptime monitors with their IDs and table names.
1158+
1159+
bstack sql "SELECT ..." Runs raw ClickHouse SQL. Appends FORMAT JSON
1160+
automatically. Use this when no built-in command covers your query.
1161+
1162+
bstack runbook <name> Prints a runbook from cloud/tools/bstack/runbooks/.
1163+
Runbooks contain step-by-step investigation procedures with example queries.
1164+
1165+
Regions: ${getAllRegions().join(", ")}
1166+
Cluster IDs: ${Object.entries(REGIONS)
1167+
.map(([id, r]) => `${id}=${r.clusterId}`)
1168+
.join(", ")}
1169+
1170+
📖 Runbooks (run 'bstack runbook <name>' to read):
1171+
pod-crash What to do when a pod crashes (exit codes, heap analysis, timer audit)
1172+
weekly-error-audit Weekly error audit process (top errors, log volume, churn, memory)
1173+
client-disconnect Investigate client disconnection patterns (ws-close, ws-reconnect)
10871174
`)
10881175
}
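
The hot-versus-cold auto-selection described in the new help text can be sketched as follows. This is a minimal illustration, not the code in bstack.ts: the table names are copied from the help text, but `selectTable`, `parseDurationSeconds`, `TableChoice`, and the accepted duration syntax (an integer followed by `s`, `m`, `h`, or `d`) are assumptions made for the example.

```typescript
// Table names are copied from the help text. The type and function names
// are hypothetical and exist only for this sketch.
type Env = "prod" | "dev";

interface TableChoice {
  storage: "hot" | "cold";
  table: string;
  // Extra WHERE condition the table needs, or null if none.
  rowTypeFilter: string | null;
}

const HOT_TABLES: Record<Env, string> = {
  prod: "remote(t373499_mentracloud_prod_logs)",
  dev: "remote(t373499_augmentos_logs)",
};

const COLD_TABLES: Record<Env, string> = {
  prod: "s3Cluster(primary, t373499_mentracloud_prod_s3)",
  dev: "s3Cluster(primary, t373499_augmentos_s3)",
};

const SECONDS_PER_UNIT: Record<string, number> = { s: 1, m: 60, h: 3600, d: 86400 };

// Assumed duration syntax: an integer followed by s, m, h, or d (e.g. "15m").
function parseDurationSeconds(duration: string): number {
  const match = /^(\d+)([smhd])$/.exec(duration.trim());
  if (match === null) {
    throw new Error(`Unrecognized duration: ${duration}`);
  }
  const [, digits = "", unit = ""] = match;
  const perUnit = SECONDS_PER_UNIT[unit];
  if (perUnit === undefined) {
    throw new Error(`Unrecognized duration unit: ${unit}`);
  }
  return Number(digits) * perUnit;
}

// Up to and including 5 minutes selects the hot table; anything longer
// selects the cold/S3 table, which also needs the _row_type = 1 condition.
function selectTable(duration: string, env: Env = "prod"): TableChoice {
  const seconds = parseDurationSeconds(duration);
  if (seconds <= 5 * 60) {
    return { storage: "hot", table: HOT_TABLES[env], rowTypeFilter: null };
  }
  return { storage: "cold", table: COLD_TABLES[env], rowTypeFilter: "_row_type = 1" };
}
```

Under these assumptions, `selectTable("15m")` picks the prod S3 table with the `_row_type = 1` filter, which matches the documented 15-minute default of `bstack logs` landing on cold storage.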
10891176

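
The filter mapping documented for `bstack logs` (each flag becoming a `JSONExtractString(raw, ...)` condition) can be sketched as a small query builder, which may also help when falling back to `bstack sql`. This is a hedged illustration only: `buildLogsQuery`, `quote`, and `LogFilters` are hypothetical names, the `AS level` / `AS message` / `AS service` aliases are an assumption about how the documented `SELECT dt, level, message, service` is produced, and the part of the real query that the help text elides after the `WHERE` clause is deliberately left out.

```typescript
// Hypothetical shape of the flags accepted by `bstack logs`.
interface LogFilters {
  query: string;       // user or keyword, matched against the raw JSON blob
  levels?: string[];   // --level error,warn
  region?: string;     // --region france
  service?: string;    // --service AppManager
}

// Quote a value as a ClickHouse string literal, escaping backslashes and
// single quotes. LIKE wildcards (% and _) inside the value are NOT escaped.
function quote(value: string): string {
  return `'${value.replace(/\\/g, "\\\\").replace(/'/g, "\\'")}'`;
}

// Build the SELECT ... WHERE part of a logs query. `rowTypeFilter` is the
// extra condition needed for cold storage ("_row_type = 1"), or null for hot.
function buildLogsQuery(
  table: string,
  filters: LogFilters,
  rowTypeFilter: string | null = null,
): string {
  const conditions: string[] = [`raw LIKE ${quote(`%${filters.query}%`)}`];

  if (filters.levels && filters.levels.length > 0) {
    const list = filters.levels.map((level) => quote(level)).join(", ");
    conditions.push(`JSONExtractString(raw, 'level') IN (${list})`);
  }
  if (filters.region) {
    conditions.push(`JSONExtractString(raw, 'region') = ${quote(filters.region)}`);
  }
  if (filters.service) {
    conditions.push(`JSONExtractString(raw, 'service') = ${quote(filters.service)}`);
  }
  if (rowTypeFilter) {
    conditions.push(rowTypeFilter);
  }

  // The aliases below are an assumption; the help text only lists the
  // output column names and elides the rest of the query.
  return [
    "SELECT dt,",
    "  JSONExtractString(raw, 'level') AS level,",
    "  JSONExtractString(raw, 'message') AS message,",
    "  JSONExtractString(raw, 'service') AS service",
    `FROM ${table}`,
    `WHERE ${conditions.join("\n  AND ")}`,
  ].join("\n");
}
```

A string built this way could be passed to `bstack sql` when the built-in command does not cover a query. Per the help text, field access has to go through the `JSONExtract*` functions rather than `json.field` dot notation.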