
hotfix(067): Switch from @logtail/pino to Vector log collection — fixes heap growth crash#2352

Merged
isaiahb merged 6 commits into main from hotfix/vector-logging
Mar 29, 2026

Conversation

Contributor

@isaiahb isaiahb commented Mar 29, 2026

What

Removes the in-process @logtail/pino Pino transport that causes unbounded heap growth (~15 MB/min) and replaces it with out-of-process log collection via BetterStack's Vector Helm chart.

Why

The @logtail/pino transport uses Pino's thread-stream worker thread to send logs to BetterStack's HTTP API. When the API can't consume as fast as we produce (~100-170 logs/sec on US Central), the buffer grows without bound. The heap reaches 500-600MB, JSC triggers a full GC (3+ second pause), /health can't respond, and Kubernetes kills the pod.
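
To make the failure arithmetic concrete, here is a small illustrative simulation (not project code): when a buffered transport's sink drains slower than the app produces, the backlog grows linearly and is never reclaimed. The rates below are picked from the numbers above.

```typescript
// Illustrative sketch: why an in-process transport buffer grows without
// bound when the HTTP sink is slower than the producer. At ~150 logs/sec
// produced vs ~100 logs/sec drained, the backlog grows ~50 entries/sec.
function simulateBacklog(
  produceRate: number, // logs per second written by the app
  drainRate: number,   // logs per second the sink accepts
  seconds: number,
): number {
  let buffered = 0;
  for (let s = 0; s < seconds; s++) {
    buffered += produceRate;
    buffered = Math.max(0, buffered - drainRate);
  }
  return buffered; // log entries still held in memory
}

// 10 minutes at these rates leaves a five-figure backlog of retained log
// objects -- the "buffer grows without bound" failure mode described above.
console.log(simulateBacklog(150, 100, 600)); // 30000
```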

The root cause was proved definitively via heap snapshot analysis in the 067 spike (linked under Investigation docs below).

What changed

  1. pino-logger.ts: When LOG_STDOUT_JSON=true (set in Doppler for all prod configs), writes raw JSON to stdout instead of using @logtail/pino. Also removes the createFilteredStream double-parse overhead when no log filters are set.

  2. cloud/infra/betterstack-logs/values.yaml: Vector Helm chart config with transforms that flatten Pino JSON to top-level fields, convert msg→message, time→dt, numeric level→string, filter to only cloud containers, and nest Kubernetes metadata in _meta.

  3. bstack CLI fixes: Addresses 4 code review comments (nested health response, validateApiToken throw, region source routing, duration normalization).
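
The branching in item 1 can be sketched as follows. This is illustrative only — apart from the `LOG_STDOUT_JSON` env var, the function and field names are not the real pino-logger.ts API. The point is that JSON mode emits one self-describing line per log with no in-process transport (and no worker thread) in the path:

```typescript
// Hedged sketch of the LOG_STDOUT_JSON branch (names other than the env
// var are hypothetical, not the real module's API). In prod, raw
// newline-delimited JSON goes to stdout for Vector to tail; locally,
// output stays human-readable.
type Sink = (line: string) => void;

function makeLogLine(level: string, msg: string, jsonMode: boolean): string {
  if (jsonMode) {
    // Raw Pino-style JSON: one object per line, nothing buffered in-process.
    return JSON.stringify({ level, time: Date.now(), msg });
  }
  // Dev mode: human-readable single line.
  return `[${level.toUpperCase()}] ${msg}`;
}

function log(
  level: string,
  msg: string,
  sink: Sink = (l) => process.stdout.write(l + "\n"),
): void {
  const jsonMode = process.env.LOG_STDOUT_JSON === "true";
  sink(makeLogLine(level, msg, jsonMode));
}
```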

Already deployed

  • Vector Helm chart installed on all 5 clusters (US Central, France, East Asia, US West, US East)
  • LOG_STDOUT_JSON=true set in Doppler for all prod configs
  • Tested end-to-end on US West with a live phone session — logs identical in BetterStack Live Tail
  • Heap confirmed stable (not growing) on US West
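
One way to make a "heap stable" claim measurable (a sketch under assumptions — not necessarily how the US West check was performed) is to sample `process.memoryUsage().heapUsed` over time and compute the growth rate, which was ~15 MB/min before this fix:

```typescript
// Illustrative helper: given heap samples (e.g. collected periodically via
// process.memoryUsage().heapUsed), compute the growth rate in MB/min.
// A leaking transport shows a steadily positive rate; a healthy process
// hovers around zero after GC.
function heapTrendMbPerMin(
  samples: { t: number; heapUsed: number }[], // t = epoch ms
): number {
  const first = samples[0];
  const last = samples[samples.length - 1];
  const mb = (last.heapUsed - first.heapUsed) / (1024 * 1024);
  const minutes = (last.t - first.t) / 60_000;
  return mb / minutes;
}
```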

Rollout

Merging to main triggers deploy to all regions. Each region already has Vector running and LOG_STDOUT_JSON=true in Doppler, so the code change takes effect immediately on deploy.

Investigation docs

  • 067 spike — heap snapshot analysis proving 200K objects/10min from transport
  • 067 spec — full spec for the Vector migration
  • 066 spike — proved disconnect churn is NOT the cause of heap growth

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced BetterStack CLI diagnostic tool with commands for monitoring health, memory, garbage collection, event loops, database queries, and incidents across cloud environments
    • Enhanced logging infrastructure with configurable JSON output support and improved external log delivery
  • Refactoring

    • Optimized logging filters to reduce overhead when not in use

@isaiahb isaiahb requested a review from a team as a code owner March 29, 2026 06:09
@github-actions

📋 PR Review Helper


🔀 Test Locally

gh pr checkout 2352

@isaiahb isaiahb self-assigned this Mar 29, 2026
@coderabbitai

coderabbitai bot commented Mar 29, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

This pull request introduces comprehensive BetterStack observability infrastructure including a new CLI diagnostic tool, Vector-based log transformation pipeline, refactored Pino logger with JSON stdout support, and a GitHub Actions workflow for US West deployments. Configuration and tooling are provided via a centralized config module.

Changes

Cohort / File(s) Summary
GitHub Actions Deployment
.github/workflows/porter-us-west.yml
New workflow for manual/branch-triggered deployments to US West using Porter CLI with short SHA tagging and 30-minute timeout.
Observability Infrastructure
cloud/infra/betterstack-logs/values.yaml, cloud/packages/cloud/src/services/logging/pino-logger.ts
Helm configuration for Vector-based log collection with cloud-environment filtering, Pino JSON parsing, field normalization, and HTTP sinking to BetterStack. Logger refactored to support JSON stdout mode, removes in-process BetterStack transport, and optimizes filtering logic.
BetterStack CLI Tooling
cloud/tools/bstack/bstack.ts, cloud/tools/bstack/config.ts
New Bun-based CLI for BetterStack SQL/uptime diagnostics with subcommands (health, diagnostics, memory, gc, gaps, budget, slow-queries, cache, incidents, sources, sql, runbook). Config module centralizes credentials, endpoints, log source definitions, and helper validators.

Sequence Diagram(s)

sequenceDiagram
    participant Pino as Pino Logger
    participant Stream as Output Stream
    participant Vector as Vector (transforms)
    participant BS as BetterStack HTTP Sink

    alt LOG_STDOUT_JSON enabled
        Pino->>Stream: JSON formatted log
    else LOG_STDOUT_JSON disabled
        Pino->>Stream: Pretty-printed log
    end

    alt Filters configured
        Stream->>Vector: Forward to cloud_only_filter
        Vector->>Vector: Select cloud container logs
        Vector->>Vector: Apply flatten_pino remap
        Vector->>Vector: Normalize fields (msg→message, time→dt)
        Vector->>Vector: Convert Pino levels to BetterStack format
        Vector->>Vector: Annotate with ._meta (pod, container, source)
        Vector->>BS: HTTP POST transformed events
    else No filters
        Stream->>Stream: Pass-through (early return)
    end
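
The transform chain in the diagram above is written declaratively in VRL in `values.yaml`; it can be mirrored as a plain function to make the field mapping concrete. The mappings (msg→message, time→dt, numeric level→string, metadata nested in `_meta`) come from the PR description; the level table assumes Pino's standard numeric levels, and the `_meta` shape is illustrative.

```typescript
// TypeScript mirror of the Vector remap (the real transform is VRL, not
// TS). Assumes standard Pino levels 10..60.
const PINO_LEVELS: Record<number, string> = {
  10: "trace", 20: "debug", 30: "info", 40: "warn", 50: "error", 60: "fatal",
};

interface PinoLine {
  level: number;
  time: number; // epoch ms
  msg: string;
  [key: string]: unknown;
}

function flattenPino(raw: PinoLine, meta: { pod: string; container: string }) {
  const { level, time, msg, ...rest } = raw;
  return {
    ...rest,                                    // extra fields stay top-level
    message: msg,                               // msg -> message
    dt: new Date(time).toISOString(),           // time -> dt
    level: PINO_LEVELS[level] ?? String(level), // numeric -> string level
    _meta: meta,                                // Kubernetes metadata nested
  };
}
```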
sequenceDiagram
    participant User as User/CLI
    participant CLI as bstack.ts
    participant Config as config.ts
    participant CH as ClickHouse SQL Endpoint
    participant BS as BetterStack API
    participant Output as ASCII Table Output

    User->>CLI: Execute command (e.g., diagnostics)
    CLI->>Config: Validate credentials & get endpoint
    Config->>Config: Check SQL_USERNAME, SQL_PASSWORD
    
    rect rgba(100, 150, 255, 0.5)
        CLI->>CH: Authenticate with Basic auth
        CLI->>CH: Execute SQL query (+ FORMAT JSON)
        CH-->>CLI: Return JSON results
    end

    alt Command requires uptime API
        CLI->>Config: Get UPTIME_API & token
        CLI->>BS: Fetch incidents/health status
        BS-->>CLI: Return API response
    end

    CLI->>Output: Format results to ASCII table
    Output-->>User: Display table output
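
The SQL path in the diagram above (Basic auth against a ClickHouse HTTP endpoint, results requested as JSON) can be sketched like this. It is not the actual bstack.ts implementation: the endpoint URL is a caller-supplied placeholder, and only `SQL_USERNAME` / `SQL_PASSWORD` mirror the env vars the config module checks.

```typescript
// Hedged sketch of the CLI's query path: POST the query with Basic auth
// and append FORMAT JSON so ClickHouse returns parseable JSON.
function buildSqlBody(sql: string): string {
  // Strip any trailing semicolon before appending the FORMAT clause.
  return `${sql.trim().replace(/;$/, "")} FORMAT JSON`;
}

async function runSql(sql: string, endpoint: string): Promise<unknown[]> {
  const user = process.env.SQL_USERNAME;
  const pass = process.env.SQL_PASSWORD;
  if (!user || !pass) throw new Error("SQL_USERNAME / SQL_PASSWORD not set");

  const res = await fetch(endpoint, {
    method: "POST",
    headers: { Authorization: "Basic " + btoa(`${user}:${pass}`) },
    body: buildSqlBody(sql),
  });
  if (!res.ok) throw new Error(`query failed: ${res.status}`);
  const payload = (await res.json()) as { data: unknown[] };
  return payload.data; // FORMAT JSON wraps result rows in a "data" array
}
```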

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 From logs to CLI, a rabbit's delight—
Vector transforms dancing through cloud's misty night,
JSON flows swift to BetterStack's door,
While CLI diagnostics reveal what's in store!
Observability blooms in the warren once more. 🌿

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: switching from @logtail/pino to Vector for log collection and identifies the critical issue being fixed (heap growth crash).
Docstring Coverage ✅ Passed Docstring coverage is 82.76% which is sufficient. The required threshold is 80.00%.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8624b74d8e


Comment on lines +112 to +116
if (region === "france" || region === "east-asia") {
  // These regions may still send to the legacy source until redeployed
  // with the new BETTERSTACK_SOURCE_TOKEN. Check both — prefer prod.
  return getLogsTable("prod");
}

P2: Route legacy regions to the correct log source

getSourceForRegion returns the prod logs table for every region, including france and east-asia. That contradicts the migration notes in this commit (LOG_SOURCES.dev still receives those regions until redeploy), so during that window all region-scoped commands (diagnostics, memory, gc, etc.) query the wrong table and can falsely show no data while incidents are ongoing.


@isaiahb (Contributor, Author)

Not applicable anymore. France and East Asia now ship logs to the prod source via two paths:

  1. Vector (just installed on all 5 clusters tonight) — tails container stdout, ships to MentraCloud-Prod source
  2. @logtail/pino (still running until this PR deploys) — also sends to MentraCloud-Prod source via the BETTERSTACK_SOURCE_TOKEN restored in Doppler

The getSourceForRegion helper was written as a future-proofing hook in case regions diverge again, but right now all regions route to prod. Once this PR deploys, Vector is the only log path for all regions — all pointing at the same prod source.

@isaiahb isaiahb merged commit a89372d into main Mar 29, 2026
9 of 10 checks passed
