Add metrics to track batch trigger reasons and sink performance #66

@rafael-gumiero

Use case

We operate high-volume data pipelines processing observability data through Flink to ClickHouse using the async sink with batching parameters:

MAX_BATCH_SIZE = 100000
MAX_BATCH_SIZE_IN_BYTES = 100 * 1024 * 1024
MAX_TIME_IN_BUFFER_MS = 30000
MAX_IN_FLIGHT_REQUESTS = 2
MAX_BUFFERED_REQUESTS = 1000000
MAX_RECORD_SIZE_IN_BYTES = 5 * 1024 * 1024
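
As a rough illustration of how these limits interact, the sketch below estimates which limit would be reached first at a steady input rate. The steady-rate model, the class and method names, and the sample rates are all assumptions made for illustration; they are not measurements from our pipelines or part of the connector.

```java
/** Back-of-the-envelope estimate; the steady-rate model and all names are hypothetical. */
final class FirstTriggerEstimate {
    // Limits copied from the configuration above.
    static final long MAX_BATCH_SIZE = 100_000;
    static final long MAX_BATCH_SIZE_IN_BYTES = 100L * 1024 * 1024;
    static final long MAX_TIME_IN_BUFFER_MS = 30_000;

    /** Returns the name of the limit a steady stream would reach first. */
    static String firstTrigger(double recordsPerSecond, double avgRecordBytes) {
        // Time needed to accumulate enough records or bytes to fill one batch.
        double msToCount = MAX_BATCH_SIZE / recordsPerSecond * 1000.0;
        double msToBytes = MAX_BATCH_SIZE_IN_BYTES / (recordsPerSecond * avgRecordBytes) * 1000.0;

        if (MAX_TIME_IN_BUFFER_MS <= msToCount && MAX_TIME_IN_BUFFER_MS <= msToBytes) {
            return "MAX_TIME_IN_BUFFER_MS";
        }
        return msToCount <= msToBytes ? "MAX_BATCH_SIZE" : "MAX_BATCH_SIZE_IN_BYTES";
    }

    public static void main(String[] args) {
        // 2,000 rec/s at 1 KiB: 50 s to the count limit, 51.2 s to the byte limit.
        System.out.println(firstTrigger(2_000, 1024));
        // 20,000 rec/s at 512 B: 5 s to the count limit, 10.24 s to the byte limit.
        System.out.println(firstTrigger(20_000, 512));
        // 20,000 rec/s at 4 KiB: 5 s to the count limit, 1.28 s to the byte limit.
        System.out.println(firstTrigger(20_000, 4096));
    }
}
```

Real traffic is bursty and record sizes vary, so an estimate like this only narrows down the candidates; it does not replace measuring the trigger in the sink itself.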

Problem: We cannot determine which trigger condition (record count, byte size, or time threshold) is actually causing batch flushes in production. This makes it impossible to optimize sink configuration and troubleshoot performance issues effectively.

Describe the solution you'd like

Add metrics to expose batch trigger attribution - specifically, counters that track which condition triggered each batch flush:

  • Batches triggered by MAX_BATCH_SIZE (record count limit)
  • Batches triggered by MAX_BATCH_SIZE_IN_BYTES (byte size limit)
  • Batches triggered by MAX_TIME_IN_BUFFER_MS (timeout)
  • Batches triggered by backpressure conditions
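
A minimal sketch of what such attribution could look like, in plain Java so it stands alone. Every name here (`BatchTriggerAttribution`, `Trigger`, `recordFlush`) is hypothetical, the precedence order between conditions is an assumption, and `OTHER` is an extra bucket added here for flushes that match none of the four conditions. In the connector the counters would presumably be registered with Flink's metric system rather than held in a map.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch of per-trigger flush counters; not the connector's actual code. */
final class BatchTriggerAttribution {

    /** The four flush causes listed above, plus a catch-all (an assumption of this sketch). */
    enum Trigger { RECORD_COUNT, BYTE_SIZE, TIMEOUT, BACKPRESSURE, OTHER }

    private final long maxBatchSize;
    private final long maxBatchSizeInBytes;
    private final long maxTimeInBufferMs;
    private final Map<Trigger, LongAdder> counters = new EnumMap<>(Trigger.class);

    BatchTriggerAttribution(long maxBatchSize, long maxBatchSizeInBytes, long maxTimeInBufferMs) {
        this.maxBatchSize = maxBatchSize;
        this.maxBatchSizeInBytes = maxBatchSizeInBytes;
        this.maxTimeInBufferMs = maxTimeInBufferMs;
        for (Trigger t : Trigger.values()) {
            counters.put(t, new LongAdder());
        }
    }

    /**
     * Classifies one flush and increments the matching counter.
     * The check order (count, bytes, time, backpressure) is an assumed precedence
     * for the case where several conditions hold at once.
     */
    Trigger recordFlush(long records, long bytes, long oldestRecordAgeMs, boolean bufferFull) {
        Trigger trigger;
        if (records >= maxBatchSize) {
            trigger = Trigger.RECORD_COUNT;
        } else if (bytes >= maxBatchSizeInBytes) {
            trigger = Trigger.BYTE_SIZE;
        } else if (oldestRecordAgeMs >= maxTimeInBufferMs) {
            trigger = Trigger.TIMEOUT;
        } else if (bufferFull) {
            trigger = Trigger.BACKPRESSURE;
        } else {
            trigger = Trigger.OTHER;
        }
        counters.get(trigger).increment();
        return trigger;
    }

    /** Current value of one counter. */
    long count(Trigger trigger) {
        return counters.get(trigger).sum();
    }
}
```

`LongAdder` is used so that increments stay cheap if flushes are recorded from more than one thread; a plain `long` would do if the writer is single-threaded.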

Additional useful metrics to consider:

  • Batch characteristics (histograms): actual records per batch, actual bytes per batch, actual time in buffer
  • Backpressure indicators (gauges): current in-flight requests, current buffered requests count
  • Data quality (counters): serialization errors, records rejected for exceeding MAX_RECORD_SIZE_IN_BYTES
  • Write operations (counters/histograms): successful/failed writes, write latency, retry attempts

Example use case:
If metrics show that 95% of batches are triggered by MAX_TIME_IN_BUFFER_MS, we can reduce the timeout from 30s to 5s to improve latency. If batches consistently hit MAX_BATCH_SIZE_IN_BYTES, we know the byte limit is the binding constraint and can tune that setting rather than guessing.
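
To make that analysis concrete, the sketch below turns raw per-trigger counter values into shares and picks the dominant trigger. The class name, the counter keys, and the sample counts are made up for illustration; they are not data from a real deployment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for reading per-trigger flush counters. */
final class TriggerShare {

    /** Returns each trigger's fraction of all flushes; an empty map if there were none. */
    static Map<String, Double> shares(Map<String, Long> counts) {
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        if (total == 0) {
            return result;
        }
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue() / (double) total);
        }
        return result;
    }

    /** Returns the trigger with the highest count, or null if the map is empty. */
    static String dominant(Map<String, Long> counts) {
        String best = null;
        long bestCount = -1;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative counter values only.
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("MAX_TIME_IN_BUFFER_MS", 950L);
        counts.put("MAX_BATCH_SIZE_IN_BYTES", 30L);
        counts.put("MAX_BATCH_SIZE", 20L);

        System.out.println(dominant(counts)); // prints MAX_TIME_IN_BUFFER_MS
        System.out.println(shares(counts).get("MAX_TIME_IN_BUFFER_MS")); // prints 0.95
    }
}
```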

Describe the alternatives you've considered

  1. Application-level instrumentation - Wrapping the connector adds maintenance burden and doesn't help other users
  2. ClickHouse server-side metrics - Only show what reached ClickHouse, not the Flink sink's internal batching behavior
  3. Flink TaskManager metrics - Too coarse-grained; they don't expose sink-specific batching decisions
  4. Debug logging - High overhead, unsuitable for production

None of these expose the internal batching decision logic that is unique to this connector.

Labels: enhancement (New feature or request)
