Description
Use case
We operate high-volume data pipelines processing observability data through Flink to ClickHouse using the async sink with batching parameters:
MAX_BATCH_SIZE = 100000
MAX_BATCH_SIZE_IN_BYTES = 100 * 1024 * 1024
MAX_TIME_IN_BUFFER_MS = 30000
MAX_IN_FLIGHT_REQUESTS = 2
MAX_BUFFERED_REQUESTS = 1000000
MAX_RECORD_SIZE_IN_BYTES = 5 * 1024 * 1024
Problem: We cannot determine which trigger condition (record count, byte size, or time threshold) is actually causing batch flushes in production. This makes it impossible to optimize sink configuration and troubleshoot performance issues effectively.
Describe the solution you'd like
Add metrics to expose batch trigger attribution - specifically, counters that track which condition triggered each batch flush:
- Batches triggered by MAX_BATCH_SIZE (record count limit)
- Batches triggered by MAX_BATCH_SIZE_IN_BYTES (byte size limit)
- Batches triggered by MAX_TIME_IN_BUFFER_MS (timeout)
- Batches triggered by backpressure conditions
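To make the request concrete, here is a minimal sketch of what trigger attribution could look like. It uses only the Java standard library; every type and method name (FlushTrigger, FlushTriggerCounters, recordFlush) is hypothetical and not part of the connector. The precedence order among the thresholds, and treating "no threshold reached" as backpressure or another forced flush, are assumptions. In a real implementation the counters would presumably be registered through Flink's MetricGroup rather than held in a map.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical names throughout; none of these types exist in the connector.
enum FlushTrigger { RECORD_COUNT, BYTE_SIZE, TIMEOUT, BACKPRESSURE_OR_OTHER }

final class FlushTriggerCounters {
    private final Map<FlushTrigger, LongAdder> counters = new EnumMap<>(FlushTrigger.class);
    private final int maxBatchSize;
    private final long maxBatchSizeInBytes;
    private final long maxTimeInBufferMs;

    FlushTriggerCounters(int maxBatchSize, long maxBatchSizeInBytes, long maxTimeInBufferMs) {
        this.maxBatchSize = maxBatchSize;
        this.maxBatchSizeInBytes = maxBatchSizeInBytes;
        this.maxTimeInBufferMs = maxTimeInBufferMs;
        for (FlushTrigger t : FlushTrigger.values()) {
            counters.put(t, new LongAdder());
        }
    }

    // Attribute a flush to one trigger from the buffer state at flush time.
    // Assumption: record count is checked first, then bytes, then time.
    FlushTrigger classify(int bufferedRecords, long bufferedBytes, long oldestRecordAgeMs) {
        if (bufferedRecords >= maxBatchSize) return FlushTrigger.RECORD_COUNT;
        if (bufferedBytes >= maxBatchSizeInBytes) return FlushTrigger.BYTE_SIZE;
        if (oldestRecordAgeMs >= maxTimeInBufferMs) return FlushTrigger.TIMEOUT;
        // No threshold reached: assume the flush was forced, e.g. by backpressure.
        return FlushTrigger.BACKPRESSURE_OR_OTHER;
    }

    // Classify the flush and increment the matching counter.
    FlushTrigger recordFlush(int bufferedRecords, long bufferedBytes, long oldestRecordAgeMs) {
        FlushTrigger trigger = classify(bufferedRecords, bufferedBytes, oldestRecordAgeMs);
        counters.get(trigger).increment();
        return trigger;
    }

    long count(FlushTrigger trigger) {
        return counters.get(trigger).sum();
    }
}
```

With the configuration above, a flush of 100000 buffered records would be counted under RECORD_COUNT, while a flush of a small buffer whose oldest record is 30 seconds old would be counted under TIMEOUT.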
Additional useful metrics to consider:
- Batch characteristics (histograms): actual records per batch, actual bytes per batch, actual time in buffer
- Backpressure indicators (gauges): current in-flight requests, current buffered requests count
- Data quality (counters): serialization errors, records rejected for exceeding MAX_RECORD_SIZE_IN_BYTES
- Write operations (counters/histograms): successful/failed writes, write latency, retry attempts
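As an illustration of two of the items above, the following sketch tracks an in-flight request gauge and a counter of records rejected for exceeding MAX_RECORD_SIZE_IN_BYTES. It is plain Java with hypothetical names (SinkHealthMetrics, accept, requestStarted, requestFinished); where the connector would actually call these hooks is an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper; not part of the connector.
final class SinkHealthMetrics {
    private final long maxRecordSizeInBytes;
    private final AtomicInteger inFlightRequests = new AtomicInteger();
    private final LongAdder oversizedRecords = new LongAdder();

    SinkHealthMetrics(long maxRecordSizeInBytes) {
        this.maxRecordSizeInBytes = maxRecordSizeInBytes;
    }

    // Returns false and counts the record if it exceeds the size limit.
    boolean accept(long recordSizeInBytes) {
        if (recordSizeInBytes > maxRecordSizeInBytes) {
            oversizedRecords.increment();
            return false;
        }
        return true;
    }

    // Assumed to be called when an async write is submitted.
    void requestStarted() {
        inFlightRequests.incrementAndGet();
    }

    // Assumed to be called when an async write completes or fails.
    void requestFinished() {
        inFlightRequests.decrementAndGet();
    }

    // Gauge value: requests currently in flight.
    int currentInFlightRequests() {
        return inFlightRequests.get();
    }

    // Counter value: records rejected so far for being too large.
    long oversizedRecordCount() {
        return oversizedRecords.sum();
    }
}
```

An in-flight value that sits at MAX_IN_FLIGHT_REQUESTS for long periods would suggest the sink is limited by write throughput rather than by batching thresholds.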
Example use case:
If metrics show 95% of batches are triggered by MAX_TIME_IN_BUFFER_MS, we can reduce the timeout from 30s to 5s to improve latency. If batches consistently hit MAX_BATCH_SIZE_IN_BYTES, we know byte size is the bottleneck and can adjust accordingly.
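The analysis described here amounts to turning per-trigger counters into shares of all flushes. A small sketch, assuming the counters have already been read into a map keyed by hypothetical trigger names (the class TriggerShare and the key strings are illustrative only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for interpreting trigger counters; not part of the connector.
final class TriggerShare {
    // Fraction of all flushes attributed to each trigger (0.0 when there are no flushes).
    static Map<String, Double> shares(Map<String, Long> counts) {
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            result.put(e.getKey(), total == 0 ? 0.0 : e.getValue() / (double) total);
        }
        return result;
    }

    // Name of the trigger with the highest count, or null for an empty map.
    static String dominant(Map<String, Long> counts) {
        String best = null;
        long bestCount = -1;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

For counts of 95 timeout flushes, 3 byte-size flushes, and 2 record-count flushes, the timeout share is 0.95, which matches the scenario above where lowering MAX_TIME_IN_BUFFER_MS would be the natural tuning step.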
Describe the alternatives you've considered
- Application-level instrumentation - Wrapping the connector adds maintenance burden and doesn't help other users
- ClickHouse server-side metrics - Only show what reached ClickHouse, not the Flink sink's internal batching behavior
- Flink TaskManager metrics - Too coarse-grained; they don't expose sink-specific batching decisions
- Debug logging - High overhead, unsuitable for production
None of these expose the internal batching decision logic that is unique to this connector.