Roadmap: complete out_stackdriver record-drop error handling #11541

@erain

Background

plugins/out_stackdriver/stackdriver.c:stackdriver_format() still contains a mix of:

  • recoverable per-record validation failures,
  • request-scoped/resource-derivation failures, and
  • batch-fatal decoder/serialization failures.

PR #11539 is the first step in cleaning this up.

Task Breakdown

  • Task 1: convert invalid logging.googleapis.com/labels type from batch-fatal to per-record drop

  • Task 2: classify every stackdriver_format() failure path by recovery model

    • For each failure site in or directly used by stackdriver_format(), classify it as:
      • record-fatal: drop only the current record
      • request-fatal: cannot safely build the Cloud Logging request for the surviving records
      • batch-fatal: decoder/serialization/internal failure where recovery is not possible
    • This should cover at least:
      • flb_log_event_decoder_init() failure
      • flb_log_event_decoder_next() failure
      • invalid insertId
      • invalid labels
      • k8s local_resource_id extraction / processing failures
      • final msgpack-to-JSON serialization failure
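One way to make the Task 2 classification explicit is a small lookup that every failure site goes through. The sketch below is illustrative only: all identifiers (`sd_failure_class`, `sd_failure_site`, `sd_classify`) are hypothetical and do not exist in out_stackdriver, and the k8s entry is deliberately left undecided because Task 3 has not been resolved.

```c
#include <assert.h>

/* Hypothetical names: none of these exist in out_stackdriver today. */
enum sd_failure_class {
    SD_RECORD_FATAL,   /* drop only the current record */
    SD_REQUEST_FATAL,  /* cannot safely build the request for survivors */
    SD_BATCH_FATAL,    /* decoder/serialization/internal failure */
    SD_UNDECIDED       /* pending the Task 3 decision */
};

enum sd_failure_site {
    SD_SITE_DECODER_INIT,
    SD_SITE_DECODER_NEXT,
    SD_SITE_INSERT_ID,
    SD_SITE_LABELS,
    SD_SITE_K8S_LOCAL_RESOURCE_ID,
    SD_SITE_JSON_SERIALIZATION,
    SD_SITE_COUNT
};

/* Maps a failure site to its recovery model. */
enum sd_failure_class sd_classify(enum sd_failure_site site)
{
    assert(site < SD_SITE_COUNT);

    switch (site) {
    case SD_SITE_INSERT_ID:              /* already per-record */
    case SD_SITE_LABELS:                 /* per-record after PR #11539 */
        return SD_RECORD_FATAL;
    case SD_SITE_K8S_LOCAL_RESOURCE_ID:  /* open question, see Task 3 */
        return SD_UNDECIDED;
    case SD_SITE_DECODER_INIT:           /* likely intentionally fatal */
    case SD_SITE_DECODER_NEXT:
    case SD_SITE_JSON_SERIALIZATION:
        return SD_BATCH_FATAL;
    default:
        return SD_BATCH_FATAL;           /* unknown sites fail closed */
    }
}
```

Keeping the mapping in one place means a reviewer can audit the recovery model without reading every error branch in the formatter.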
  • Task 3: define the desired behavior for k8s local_resource_id failures

    • The main unresolved gap is request-level monitored resource derivation for:
      • k8s_container
      • k8s_node
      • k8s_pod
    • We need an explicit decision for mixed-validity batches:
      • continue using the first raw record,
      • use the first surviving valid record after prescan,
      • fall back to another monitored resource such as global, or
      • keep these cases request-fatal / batch-fatal.
    • This needs to be decided before expanding per-record recovery further, because monitored resource selection is request-scoped.
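To compare the four options side by side, the sketch below encodes each one as a policy and shows which record (if any) would define the monitored resource. Everything here is hypothetical: the names are invented for illustration, and the exact semantics of each policy (for example, that the fallback triggers when the first record is unusable) are one possible reading, not a decision.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding of the four options listed in Task 3. */
enum resource_policy {
    POLICY_FIRST_RAW,        /* continue using the first raw record */
    POLICY_FIRST_SURVIVING,  /* first valid record after prescan */
    POLICY_FALLBACK_GLOBAL,  /* fall back to another resource, e.g. global */
    POLICY_FATAL             /* keep the case request-fatal / batch-fatal */
};

#define SOURCE_GLOBAL (-1L)  /* use the fallback monitored resource */
#define SOURCE_FAIL   (-2L)  /* give up on the request */

/*
 * valid[i] is nonzero when record i has a usable local_resource_id.
 * Returns the index of the request-defining record, or a SOURCE_* code.
 */
long select_resource_source(enum resource_policy policy,
                            const int *valid, size_t n)
{
    size_t i;

    assert(valid != NULL && n > 0);

    switch (policy) {
    case POLICY_FIRST_RAW:
        return valid[0] ? 0 : SOURCE_FAIL;
    case POLICY_FIRST_SURVIVING:
        for (i = 0; i < n; i++) {
            if (valid[i]) {
                return (long) i;
            }
        }
        return SOURCE_FAIL;
    case POLICY_FALLBACK_GLOBAL:
        /* Assumption: fall back as soon as the first record is unusable. */
        return valid[0] ? 0 : SOURCE_GLOBAL;
    case POLICY_FATAL:
        /* Assumption: any invalid record fails the whole request. */
        for (i = 0; i < n; i++) {
            if (!valid[i]) {
                return SOURCE_FAIL;
            }
        }
        return 0;
    }
    return SOURCE_FAIL;
}
```

For a batch whose first record is invalid and whose later records are valid, the four policies give four different answers, which is why the choice has to be made before per-record recovery is extended.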
  • Task 4: refactor request-scoped resource derivation to match the chosen design

    • If Task 3 chooses a recoverable model, refactor formatter flow so request metadata is derived consistently with record dropping.
    • Likely requirements:
      • prescan surviving records first,
      • identify the request-defining record if needed,
      • derive monitored resource and request-scoped labels from that source,
      • pack only surviving records.
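The likely requirements above could be sketched as a two-pass flow. This is a minimal illustration under stated assumptions: `mock_record` is a stand-in for a decoded record rather than a fluent-bit type, all function names are hypothetical, and the flow assumes Task 3 picks the "first surviving valid record" option. The key point is that both passes call the same validity predicate, so the prescan and the main loop cannot disagree about which records survive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a decoded log record. */
struct mock_record {
    int labels_ok;                  /* 0 if labels has an invalid type */
    const char *local_resource_id;  /* NULL if missing */
};

struct prescan_result {
    size_t survivors;     /* number of records that will be packed */
    long defining_index;  /* first surviving record, or -1 if none */
};

/* Single predicate shared by the prescan and the pack loop. */
static int record_survives(const struct mock_record *r)
{
    return r->labels_ok && r->local_resource_id != NULL;
}

/* Pass 1: count survivors and find the request-defining record. */
struct prescan_result prescan(const struct mock_record *recs, size_t n)
{
    struct prescan_result res = { 0, -1 };
    size_t i;

    assert(recs != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (!record_survives(&recs[i])) {
            continue;
        }
        if (res.defining_index < 0) {
            res.defining_index = (long) i;
        }
        res.survivors++;
    }
    return res;
}

/*
 * Pass 2: "pack" only surviving records. Here packing just writes the
 * surviving indexes into out (capacity n) and returns how many were written.
 */
size_t pack_survivors(const struct mock_record *recs, size_t n, size_t *out)
{
    size_t i;
    size_t packed = 0;

    for (i = 0; i < n; i++) {
        if (record_survives(&recs[i])) {
            out[packed++] = i;
        }
    }
    return packed;
}
```

A caller would derive the monitored resource and request-scoped labels from `recs[defining_index]`, and could check that `pack_survivors()` returns exactly `survivors` as a guard against divergence.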
  • Task 5: add TDD coverage for request-scoped k8s/resource failures

    • Add tests before implementation for at least:
      • k8s_container: first record invalid/missing local_resource_id, later record valid
      • k8s_node: first record invalid/missing local_resource_id, later record valid
      • k8s_pod: first record invalid/missing local_resource_id, later record valid
      • all-invalid k8s batches
      • mixed valid/invalid batches where request-level resource fields come from record content
    • Each test should explicitly assert whether the expected result is:
      • no output,
      • one surviving entry,
      • multiple surviving entries, or
      • request-fatal behavior.
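The four expected results can be made explicit in test code with a small helper, so that each test names its expected outcome instead of checking ad hoc conditions. The sketch below is hypothetical: `format_observation` is an invented summary of one formatter call, not part of the existing test harness, and it assumes "no output" means a successful call that produced zero entries.

```c
#include <assert.h>  /* callers are expected to wrap outcome_matches() in assert() */
#include <stddef.h>

/* The four outcomes a Task 5 test must choose between. */
enum expected_outcome {
    EXPECT_NO_OUTPUT,
    EXPECT_ONE_ENTRY,
    EXPECT_MULTIPLE_ENTRIES,
    EXPECT_REQUEST_FATAL
};

/* Hypothetical summary of what one formatter call produced. */
struct format_observation {
    int failed;          /* nonzero if the formatter reported failure */
    size_t entry_count;  /* entries in the produced request, 0 if none */
};

/* Returns nonzero when the observation matches the declared expectation. */
int outcome_matches(enum expected_outcome want, struct format_observation got)
{
    switch (want) {
    case EXPECT_NO_OUTPUT:
        return !got.failed && got.entry_count == 0;
    case EXPECT_ONE_ENTRY:
        return !got.failed && got.entry_count == 1;
    case EXPECT_MULTIPLE_ENTRIES:
        return !got.failed && got.entry_count > 1;
    case EXPECT_REQUEST_FATAL:
        return got.failed != 0;
    }
    return 0;
}
```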
  • Task 6: add lower-level tests for unrecoverable decoder/serialization paths

    • Some failures are hard to reproduce through the runtime harness because valid msgpack is usually fed into the formatter.
    • Add lower-level/unit coverage for:
      • decoder init failure
      • decoder next/read failure mid-batch
      • serialization failure handling where feasible
    • These tests should confirm which paths intentionally remain batch-fatal.
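One possible shape for such lower-level coverage is fault injection through a fake decoder that fails at a chosen record index. The sketch below does not use the real flb_log_event_decoder API; `fake_decoder`, `fake_decoder_next`, and `format_batch` are hypothetical stand-ins meant only to show how a mid-batch read failure can be forced and how a batch-fatal result (nothing packed) can be asserted.

```c
#include <assert.h>
#include <stddef.h>

enum fake_status { FAKE_OK, FAKE_END, FAKE_ERROR };

/* Hypothetical decoder stub with a configurable failure point. */
struct fake_decoder {
    size_t pos;      /* index of the next record to return */
    size_t count;    /* total records in the batch */
    size_t fail_at;  /* index at which to fail; >= count disables failure */
};

/* Advances the stub, reporting an injected error at fail_at. */
enum fake_status fake_decoder_next(struct fake_decoder *d)
{
    if (d->fail_at < d->count && d->pos == d->fail_at) {
        return FAKE_ERROR;
    }
    if (d->pos >= d->count) {
        return FAKE_END;
    }
    d->pos++;
    return FAKE_OK;
}

/*
 * Toy formatter loop. On success returns 0 and stores the record count in
 * *packed. On a decoder error it discards partial work, stores 0 in
 * *packed, and returns -1, modelling batch-fatal behavior.
 */
int format_batch(struct fake_decoder *d, size_t *packed)
{
    enum fake_status st;
    size_t n = 0;

    assert(d != NULL && packed != NULL);

    while ((st = fake_decoder_next(d)) == FAKE_OK) {
        n++;
    }
    if (st == FAKE_ERROR) {
        *packed = 0;
        return -1;
    }
    *packed = n;
    return 0;
}
```

Setting `fail_at` to 0 approximates a failure on the first read, while a value between 1 and `count - 1` exercises the mid-batch case.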
  • Task 7: document intentional non-recoverable behavior

    • If some formatter failures remain batch-fatal after review, document that explicitly rather than leaving them ambiguous.
    • In particular, these likely remain intentionally unrecoverable:
      • decoder initialization failure
      • decoder iteration failure
      • final serialization failure

Current Technical Status

  • PR #11539 ("out_stackdriver: fix batch drop on invalid labels") addresses one recoverable formatter failure: an invalid logging.googleapis.com/labels type.
  • An invalid logging.googleapis.com/insertId was already handled as a per-record drop.
  • The main remaining technical gap is k8s local_resource_id and other request-scoped resource-derivation behavior.
  • Decoder init/read failures and final serialization failure are likely intentionally batch-fatal, but that should be explicitly codified.

Done Definition

This issue is complete when:

  • every known stackdriver_format() failure path is classified as record-fatal, request-fatal, or batch-fatal,
  • the expected behavior for k8s local_resource_id failures is explicitly defined,
  • any newly recoverable cases are implemented without prescan/main-loop divergence,
  • tests cover every documented scenario,
  • intentionally batch-fatal paths are documented as such.
