Roadmap: complete out_stackdriver record-drop error handling #11541

## Background

`plugins/out_stackdriver/stackdriver.c:stackdriver_format()` still contains a mix of:
- recoverable per-record validation failures,
- request-scoped/resource-derivation failures, and
- batch-fatal decoder/serialization failures.

PR #11539 is the first step in cleaning this up.

## Task Breakdown

- [x] Task 1: convert invalid `logging.googleapis.com/labels` type from batch-fatal to per-record drop
  - Implemented in PR #11539: https://github.com/fluent/fluent-bit/pull/11539
  - This also centralizes current per-record skip logic in `should_skip_record()` so prescan and packing stay aligned.

- [ ] Task 2: classify every `stackdriver_format()` failure path by recovery model
  - For each failure site in or directly used by `stackdriver_format()`, classify it as:
    - `record-fatal`: drop only the current record
    - `request-fatal`: cannot safely build the Cloud Logging request for the surviving records
    - `batch-fatal`: decoder/serialization/internal failure where recovery is not possible
  - This should cover at least:
    - `flb_log_event_decoder_init()` failure
    - `flb_log_event_decoder_next()` failure
    - invalid `insertId`
    - invalid `labels`
    - k8s `local_resource_id` extraction / processing failures
    - final msgpack-to-JSON serialization failure
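
One way to record the outcome of this classification is a small lookup table. The sketch below is illustrative only: the enum, the struct, the site labels, and `sd_classify()` are hypothetical names that do not exist in fluent-bit, and the classes assigned simply mirror the provisional expectations stated in this issue (k8s `local_resource_id` is left undecided pending Task 3).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical recovery-model enum; not part of fluent-bit. */
enum sd_failure_class {
    SD_FAILURE_UNDECIDED = 0,   /* still needs a decision (see Task 3) */
    SD_FAILURE_RECORD_FATAL,    /* drop only the current record */
    SD_FAILURE_REQUEST_FATAL,   /* cannot build the request for survivors */
    SD_FAILURE_BATCH_FATAL      /* decoder/serialization/internal failure */
};

struct sd_failure_site {
    const char *site;           /* illustrative label, not a real symbol */
    enum sd_failure_class cls;
};

/* Provisional classification, taken from the expectations in this issue. */
static const struct sd_failure_site sd_failure_sites[] = {
    { "decoder_init",          SD_FAILURE_BATCH_FATAL  },
    { "decoder_next",          SD_FAILURE_BATCH_FATAL  },
    { "invalid_insert_id",     SD_FAILURE_RECORD_FATAL },
    { "invalid_labels",        SD_FAILURE_RECORD_FATAL },
    { "k8s_local_resource_id", SD_FAILURE_UNDECIDED    },
    { "json_serialization",    SD_FAILURE_BATCH_FATAL  },
};

/* Look up a failure site; unknown sites are reported as undecided. */
static enum sd_failure_class sd_classify(const char *site)
{
    size_t i;
    size_t n = sizeof(sd_failure_sites) / sizeof(sd_failure_sites[0]);

    assert(site != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(sd_failure_sites[i].site, site) == 0) {
            return sd_failure_sites[i].cls;
        }
    }
    return SD_FAILURE_UNDECIDED;
}
```

Treating unknown sites as undecided rather than batch-fatal keeps unclassified paths visible during review instead of silently assigning them a recovery model.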

- [ ] Task 3: define the desired behavior for k8s `local_resource_id` failures
  - The main unresolved gap is request-level monitored resource derivation for:
    - `k8s_container`
    - `k8s_node`
    - `k8s_pod`
  - We need an explicit decision for mixed-validity batches:
    - continue using the first raw record,
    - use the first surviving valid record after prescan,
    - fall back to another monitored resource such as `global`, or
    - keep these cases request-fatal / batch-fatal.
  - This needs to be decided before expanding per-record recovery further, because monitored resource selection is request-scoped.

- [ ] Task 4: refactor request-scoped resource derivation to match the chosen design
  - If Task 3 chooses a recoverable model, refactor formatter flow so request metadata is derived consistently with record dropping.
  - Likely requirements:
    - prescan surviving records first,
    - identify the request-defining record if needed,
    - derive monitored resource and request-scoped labels from that source,
    - pack only surviving records.
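
The flow listed above could look roughly like the following two-pass sketch. It assumes Task 3 picks the "first surviving valid record after prescan" option, which is not yet decided. All types and functions here (`sd_record`, `sd_request_plan`, `sd_keep_record()`, `sd_prescan()`, `sd_pack_survivors()`) are hypothetical simplifications, not the real formatter code, and records that survive but lack a `local_resource_id` are kept purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of one decoded record. */
struct sd_record {
    int fields_valid;               /* 1 if per-record checks (labels, insertId) pass */
    const char *local_resource_id;  /* NULL when missing or unusable */
};

struct sd_request_plan {
    size_t survivors;     /* records that will be packed */
    long defining_index;  /* record supplying request-scoped metadata, -1 if none */
};

/* Single predicate shared by both passes so prescan and packing cannot diverge. */
static int sd_keep_record(const struct sd_record *rec)
{
    return rec->fields_valid != 0;
}

/* Pass 1: count survivors and pick the first survivor with a usable id. */
static struct sd_request_plan sd_prescan(const struct sd_record *records, size_t count)
{
    struct sd_request_plan plan = { 0, -1 };
    size_t i;

    assert(records != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (!sd_keep_record(&records[i])) {
            continue;
        }
        plan.survivors++;
        if (plan.defining_index < 0 && records[i].local_resource_id != NULL) {
            plan.defining_index = (long) i;
        }
    }
    return plan;
}

/* Pass 2: write the indices of surviving records; returns how many were written. */
static size_t sd_pack_survivors(const struct sd_record *records, size_t count,
                                size_t *out_indices)
{
    size_t i;
    size_t packed = 0;

    assert(out_indices != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (sd_keep_record(&records[i])) {
            out_indices[packed++] = i;
        }
    }
    return packed;
}
```

A caller would run `sd_prescan()` first, treat `defining_index == -1` according to whatever Task 3 decides (fallback resource or request-fatal), and only then call `sd_pack_survivors()`.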

- [ ] Task 5: add TDD coverage for request-scoped k8s/resource failures
  - Add tests before implementation for at least:
    - `k8s_container`: first record invalid/missing `local_resource_id`, later record valid
    - `k8s_node`: first record invalid/missing `local_resource_id`, later record valid
    - `k8s_pod`: first record invalid/missing `local_resource_id`, later record valid
    - all-invalid k8s batches
    - mixed valid/invalid batches where request-level resource fields come from record content
  - Each test should explicitly assert whether the expected result is:
    - no output,
    - one surviving entry,
    - multiple surviving entries, or
    - request-fatal behavior.
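
To keep the four expected results unambiguous across tests, each case could declare its expectation as an enum value and compare it against what was observed. This is a sketch with hypothetical names (`sd_test_outcome`, `sd_case`, `sd_observed_outcome()`, `sd_case_passes()`); it assumes a test can observe the formatter's return code and the number of emitted entries, and it is not tied to the existing runtime test harness.

```c
#include <assert.h>
#include <stddef.h>

/* The four results a test is expected to assert explicitly. */
enum sd_test_outcome {
    SD_OUTCOME_NO_OUTPUT,      /* formatter succeeded, nothing to send */
    SD_OUTCOME_ONE_ENTRY,      /* exactly one surviving entry */
    SD_OUTCOME_MANY_ENTRIES,   /* more than one surviving entry */
    SD_OUTCOME_REQUEST_FATAL   /* formatter reported failure */
};

/* Hypothetical test-case descriptor. */
struct sd_case {
    const char *name;
    enum sd_test_outcome expected;
};

/*
 * Map observations to an outcome. Assumption: a non-zero return code means
 * the request could not be built, regardless of the entry count.
 */
static enum sd_test_outcome sd_observed_outcome(int format_ret, size_t entries)
{
    if (format_ret != 0) {
        return SD_OUTCOME_REQUEST_FATAL;
    }
    if (entries == 0) {
        return SD_OUTCOME_NO_OUTPUT;
    }
    if (entries == 1) {
        return SD_OUTCOME_ONE_ENTRY;
    }
    return SD_OUTCOME_MANY_ENTRIES;
}

/* Returns 1 when the observation matches the declared expectation. */
static int sd_case_passes(const struct sd_case *c, int format_ret, size_t entries)
{
    assert(c != NULL);
    return sd_observed_outcome(format_ret, entries) == c->expected;
}
```

Separating "no output" from "request-fatal" matters because both leave nothing to send, yet only the latter signals an error to the caller.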

- [ ] Task 6: add lower-level tests for unrecoverable decoder/serialization paths
  - Some failures are hard to reproduce through the runtime harness because valid msgpack is usually fed into the formatter.
  - Add lower-level/unit coverage for:
    - decoder init failure
    - decoder next/read failure mid-batch
    - serialization failure handling where feasible
  - These tests should confirm which paths intentionally remain batch-fatal.
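
Since these paths are hard to trigger with valid msgpack, one option is fault injection through a stand-in decoder. The sketch below uses a hypothetical `fake_decoder` that fails at a chosen record index; it does not use or mirror the real `flb_log_event_decoder` API, and `format_batch()` is only a stand-in for the formatter loop that shows the batch-fatal contract a unit test would pin down.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_DECODER_OK      0
#define FAKE_DECODER_END     1
#define FAKE_DECODER_ERROR (-1)

/* Hypothetical stand-in decoder with an injectable failure point. */
struct fake_decoder {
    size_t total;    /* records in the batch */
    size_t fail_at;  /* index at which next() fails; >= total means never */
    size_t pos;      /* next record index to read */
};

/* Advance by one record, or report end-of-batch or an injected error. */
static int fake_decoder_next(struct fake_decoder *dec)
{
    assert(dec != NULL);
    if (dec->pos >= dec->total) {
        return FAKE_DECODER_END;
    }
    if (dec->pos == dec->fail_at) {
        return FAKE_DECODER_ERROR;
    }
    dec->pos++;
    return FAKE_DECODER_OK;
}

/*
 * Batch-fatal contract: a decoder error discards the whole batch, including
 * records already read. Returns the number of records formatted, or -1.
 */
static long format_batch(struct fake_decoder *dec)
{
    long formatted = 0;
    int ret;

    while ((ret = fake_decoder_next(dec)) == FAKE_DECODER_OK) {
        formatted++;
    }
    if (ret == FAKE_DECODER_ERROR) {
        return -1;
    }
    return formatted;
}
```

A test built this way can assert that a failure at record 2 of 5 yields no partial output, which is exactly the property that should be confirmed as intentional.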

- [ ] Task 7: document intentional non-recoverable behavior
  - If some formatter failures remain batch-fatal after review, document that explicitly rather than leaving them ambiguous.
  - In particular, these likely remain intentionally unrecoverable:
    - decoder initialization failure
    - decoder iteration failure
    - final serialization failure

## Current Technical Status

- PR #11539 addresses one recoverable formatter failure: invalid `logging.googleapis.com/labels` type.
- Invalid `logging.googleapis.com/insertId` was already per-record.
- The main remaining technical gap is k8s `local_resource_id` and other request-scoped resource-derivation behavior.
- Decoder init/read failures and final serialization failure are likely intentionally batch-fatal, but that should be explicitly codified.

## Done Definition

This issue is complete when:
- every known `stackdriver_format()` failure path is classified as `record-fatal`, `request-fatal`, or `batch-fatal`,
- the expected behavior for k8s `local_resource_id` failures is explicitly defined,
- any newly recoverable cases are implemented without prescan/main-loop divergence,
- tests cover every documented scenario,
- intentionally batch-fatal paths are documented as such.
