hotfix(067): Switch from @logtail/pino to Vector log collection — fixes heap growth crash #2352
> [!CAUTION]
> Review failed: the pull request was closed or merged during review.

📝 Walkthrough

This pull request introduces BetterStack observability infrastructure: a new CLI diagnostic tool, a Vector-based log transformation pipeline, a refactored Pino logger with JSON stdout support, and a GitHub Actions workflow for US West deployments. Configuration and tooling are provided via a centralized config module.
Sequence diagram — log pipeline:

```mermaid
sequenceDiagram
    participant Pino as Pino Logger
    participant Stream as Output Stream
    participant Vector as Vector (transforms)
    participant BS as BetterStack HTTP Sink
    alt LOG_STDOUT_JSON enabled
        Pino->>Stream: JSON formatted log
    else LOG_STDOUT_JSON disabled
        Pino->>Stream: Pretty-printed log
    end
    alt Filters configured
        Stream->>Vector: Forward to cloud_only_filter
        Vector->>Vector: Select cloud container logs
        Vector->>Vector: Apply flatten_pino remap
        Vector->>Vector: Normalize fields (msg→message, time→dt)
        Vector->>Vector: Convert Pino levels to BetterStack format
        Vector->>Vector: Annotate with ._meta (pod, container, source)
        Vector->>BS: HTTP POST transformed events
    else No filters
        Stream->>Stream: Pass-through (early return)
    end
```
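The transform chain in the diagram above could be expressed in the Vector Helm chart's values roughly as follows. This is a hedged sketch, not the actual `values.yaml` from this PR: the transform names (`cloud_only_filter`, `flatten_pino`), container name, sink URI, and field paths are assumptions based on the walkthrough.

```yaml
# Hypothetical sketch of the Vector transform chain — names and field
# mappings are assumptions, not the exact values.yaml in this PR.
transforms:
  cloud_only_filter:
    type: filter
    inputs: [kubernetes_logs]
    condition: '.kubernetes.container_name == "cloud"'
  flatten_pino:
    type: remap
    inputs: [cloud_only_filter]
    source: |
      # Pino writes one JSON object per line; lift its fields to the top level.
      parsed = object(parse_json(string!(.message)) ?? {}) ?? {}
      . = merge(., parsed)
      .message = del(.msg)        # msg -> message
      .dt = del(.time)            # time -> dt (BetterStack timestamp field)
      lvl = to_int(.level) ?? 30  # Pino numeric levels (30=info, 40=warn, 50=error)
      if lvl >= 50 { .level = "error" } else if lvl >= 40 { .level = "warn" } else { .level = "info" }
      ._meta = {
        "pod": .kubernetes.pod_name,
        "container": .kubernetes.container_name,
        "source": "vector",
      }
sinks:
  betterstack:
    type: http
    inputs: [flatten_pino]
    uri: https://in.logs.betterstack.com  # hypothetical ingest endpoint
    encoding: { codec: json }
```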
Sequence diagram — bstack CLI:

```mermaid
sequenceDiagram
    participant User as User/CLI
    participant CLI as bstack.ts
    participant Config as config.ts
    participant CH as ClickHouse SQL Endpoint
    participant BS as BetterStack API
    participant Output as ASCII Table Output
    User->>CLI: Execute command (e.g., diagnostics)
    CLI->>Config: Validate credentials & get endpoint
    Config->>Config: Check SQL_USERNAME, SQL_PASSWORD
    rect rgba(100, 150, 255, 0.5)
        CLI->>CH: Authenticate with Basic auth
        CLI->>CH: Execute SQL query (+ FORMAT JSON)
        CH-->>CLI: Return JSON results
    end
    alt Command requires uptime API
        CLI->>Config: Get UPTIME_API & token
        CLI->>BS: Fetch incidents/health status
        BS-->>CLI: Return API response
    end
    CLI->>Output: Format results to ASCII table
    Output-->>User: Display table output
```
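The ClickHouse leg of the diagram above can be sketched in TypeScript. This is an illustrative assumption, not the actual `bstack.ts` code: the endpoint URL and the `buildClickHouseRequest` helper are hypothetical, but ClickHouse's HTTP interface does accept the query in the POST body with a trailing `FORMAT JSON` clause.

```typescript
// Hedged sketch of how the bstack CLI might assemble its ClickHouse call.
// Endpoint, env-var fallbacks, and helper name are assumptions.
interface SqlRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildClickHouseRequest(
  endpoint: string,
  username: string,
  password: string,
  sql: string,
): SqlRequest {
  // ClickHouse's HTTP interface takes the query in the POST body;
  // appending FORMAT JSON returns structured rows instead of TSV.
  const auth = Buffer.from(`${username}:${password}`).toString("base64");
  return {
    url: endpoint,
    method: "POST",
    headers: { Authorization: `Basic ${auth}` },
    body: `${sql.trim()} FORMAT JSON`,
  };
}

// Usage: the CLI would pass this straight to fetch().
const req = buildClickHouseRequest(
  "https://sql.example.betterstack.com", // hypothetical SQL endpoint
  process.env.SQL_USERNAME ?? "user",
  process.env.SQL_PASSWORD ?? "pass",
  "SELECT level, count() FROM logs GROUP BY level",
);
console.log(req.body); // SELECT level, count() FROM logs GROUP BY level FORMAT JSON
```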
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 3 passed
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8624b74d8e
```typescript
if (region === "france" || region === "east-asia") {
  // These regions may still send to the legacy source until redeployed
  // with the new BETTERSTACK_SOURCE_TOKEN. Check both — prefer prod.
  return getLogsTable("prod");
}
```
Route legacy regions to the correct log source
`getSourceForRegion` returns the prod logs table for every region, including `france` and `east-asia`. That contradicts the migration notes in this commit (`LOG_SOURCES.dev` still receives those regions until redeploy), so during that window all region-scoped commands (`diagnostics`, `memory`, `gc`, etc.) query the wrong table and can falsely show no data while incidents are ongoing.
Not applicable anymore. France and East Asia now ship logs to the prod source via two paths:

- Vector (just installed on all 5 clusters tonight) — tails container stdout, ships to the MentraCloud-Prod source
- `@logtail/pino` (still running until this PR deploys) — also sends to the MentraCloud-Prod source via the `BETTERSTACK_SOURCE_TOKEN` restored in Doppler

The `getSourceForRegion` helper was written as a future-proofing hook in case regions diverge again, but right now all regions route to prod. Once this PR deploys, Vector is the only log path for all regions — all pointing at the same prod source.
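The current shape of that hook, as described in the reply above, can be sketched as follows. The region union and table name are hypothetical; only the "every region resolves to prod" behavior comes from the thread.

```typescript
// Minimal sketch of getSourceForRegion as a future-proofing hook:
// today every region resolves to the prod source. Region and table
// names here are hypothetical, not the project's actual values.
type Region = "us-central" | "us-west" | "france" | "east-asia";

const LOG_SOURCES = {
  prod: "mentracloud_prod_logs", // hypothetical logs table name
} as const;

function getSourceForRegion(_region: Region): string {
  // All regions ship to the prod source now that Vector runs on every
  // cluster; branch here again only if regions diverge in the future.
  return LOG_SOURCES.prod;
}

console.log(getSourceForRegion("france")); // mentracloud_prod_logs
```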
## What

Removes the in-process `@logtail/pino` Pino transport that causes unbounded heap growth (~15 MB/min) and replaces it with out-of-process log collection via BetterStack's Vector Helm chart.

## Why

The `@logtail/pino` transport uses Pino's `thread-stream` worker thread to send logs to BetterStack's HTTP API. When the API can't consume as fast as we produce (~100-170 logs/sec on US Central), the buffer grows without bound. The heap reaches 500-600 MB, JSC triggers a full GC (3+ second pause), `/health` can't respond, and Kubernetes kills the pod.

Proved definitively:
## What changed

- `pino-logger.ts`: When `LOG_STDOUT_JSON=true` (set in Doppler for all prod configs), writes raw JSON to stdout instead of using `@logtail/pino`. Also removes the `createFilteredStream` double-parse overhead when no log filters are set.
- `cloud/infra/betterstack-logs/values.yaml`: Vector Helm chart config with transforms that flatten Pino JSON to top-level fields, convert `msg`→`message`, `time`→`dt`, numeric `level`→string, filter to only cloud containers, and nest Kubernetes metadata in `_meta`.
- `bstack` CLI fixes: addresses 4 code review comments (nested health response, `validateApiToken` throw, region source routing, duration normalization).
## Already deployed

- `LOG_STDOUT_JSON=true` set in Doppler for all prod configs

## Rollout

Merging to main triggers a deploy to all regions. Each region already has Vector running and `LOG_STDOUT_JSON=true` in Doppler, so the code change takes effect immediately on deploy.

## Investigation docs