Add metrics to track batch trigger reasons and sink performance #66

### Use case

We operate high-volume pipelines that stream observability data from Flink to ClickHouse through the async sink, configured with the following batching parameters:
```java
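int  MAX_BATCH_SIZE           = 100_000;             // records per batch
long MAX_BATCH_SIZE_IN_BYTES  = 100L * 1024 * 1024;  // 100 MiB per batch
long MAX_TIME_IN_BUFFER_MS    = 30_000L;             // 30 s
int  MAX_IN_FLIGHT_REQUESTS   = 2;
int  MAX_BUFFERED_REQUESTS    = 1_000_000;
long MAX_RECORD_SIZE_IN_BYTES = 5L * 1024 * 1024;    // 5 MiB per record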
```

**Problem**: We cannot determine which trigger condition (record count, byte size, or time threshold) is actually causing batch flushes in production. This makes it impossible to optimize sink configuration and troubleshoot performance issues effectively.

### Describe the solution you'd like

Add metrics to expose **batch trigger attribution** - specifically, counters that track which condition triggered each batch flush:

- Batches triggered by `MAX_BATCH_SIZE` (record count limit)
- Batches triggered by `MAX_BATCH_SIZE_IN_BYTES` (byte size limit)
- Batches triggered by `MAX_TIME_IN_BUFFER_MS` (timeout)
- Batches triggered by backpressure conditions
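
To make the request concrete, here is a minimal sketch of what the attribution could look like. All names (`FlushTrigger`, `FlushAttribution`, `recordFlush`) are hypothetical and not part of the connector. The check order and the rule that a flush which hit no limit counts as backpressure are assumptions, not a description of how the sink actually decides.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// All names here are hypothetical; this is not the connector's actual API.
enum FlushTrigger { RECORD_COUNT, BYTE_SIZE, TIMEOUT, BACKPRESSURE }

final class FlushAttribution {
    private final int maxBatchSize;
    private final long maxBatchSizeInBytes;
    private final long maxTimeInBufferMs;
    private final Map<FlushTrigger, LongAdder> counters = new EnumMap<>(FlushTrigger.class);

    FlushAttribution(int maxBatchSize, long maxBatchSizeInBytes, long maxTimeInBufferMs) {
        this.maxBatchSize = maxBatchSize;
        this.maxBatchSizeInBytes = maxBatchSizeInBytes;
        this.maxTimeInBufferMs = maxTimeInBufferMs;
        for (FlushTrigger t : FlushTrigger.values()) {
            counters.put(t, new LongAdder());
        }
    }

    /** Attributes one flush to the first limit it reached and bumps that counter. */
    FlushTrigger recordFlush(int records, long bytes, long oldestRecordAgeMs) {
        FlushTrigger trigger;
        if (records >= maxBatchSize) {
            trigger = FlushTrigger.RECORD_COUNT;
        } else if (bytes >= maxBatchSizeInBytes) {
            trigger = FlushTrigger.BYTE_SIZE;
        } else if (oldestRecordAgeMs >= maxTimeInBufferMs) {
            trigger = FlushTrigger.TIMEOUT;
        } else {
            // Assumption: a flush that hit no limit was forced by backpressure.
            trigger = FlushTrigger.BACKPRESSURE;
        }
        counters.get(trigger).increment();
        return trigger;
    }

    /** Returns how many flushes have been attributed to the given trigger. */
    long count(FlushTrigger trigger) {
        return counters.get(trigger).sum();
    }
}
```

In a real sink the `LongAdder` values would presumably be replaced by counters registered on the writer's Flink metric group (for example via `MetricGroup.counter(name)`), so they show up in the existing metric reporters.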

**Additional useful metrics to consider:**

- **Batch characteristics** (histograms): actual records per batch, actual bytes per batch, actual time in buffer
- **Backpressure indicators** (gauges): current in-flight requests, current buffered requests count
- **Data quality** (counters): serialization errors, records rejected for exceeding `MAX_RECORD_SIZE_IN_BYTES`
- **Write operations** (counters/histograms): successful/failed writes, write latency, retry attempts

**Example use case:**
If metrics show 95% of batches are triggered by `MAX_TIME_IN_BUFFER_MS`, we can reduce the timeout from 30s to 5s to improve latency. If batches consistently hit `MAX_BATCH_SIZE_IN_BYTES`, we know the byte limit is the binding constraint and can tune that setting instead.
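
The kind of analysis described above comes down to simple arithmetic over the per-trigger counters. A rough sketch, assuming the counters are available as a map from trigger name to batch count (the `TriggerShare` helper and the trigger names are hypothetical):

```java
import java.util.Map;

// Hypothetical helper; trigger names are illustrative, not real metric names.
final class TriggerShare {
    /** Returns the fraction of all batches attributed to the given trigger. */
    static double share(Map<String, Long> batchesByTrigger, String trigger) {
        long total = 0;
        for (long n : batchesByTrigger.values()) {
            total += n;
        }
        if (total == 0) {
            return 0.0; // no batches flushed yet
        }
        return (double) batchesByTrigger.getOrDefault(trigger, 0L) / total;
    }

    /** Returns the trigger with the most batches, or null if the map is empty. */
    static String dominant(Map<String, Long> batchesByTrigger) {
        String best = null;
        long bestCount = -1;
        for (Map.Entry<String, Long> e : batchesByTrigger.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

With made-up counts of 950 timeout flushes, 30 record-count flushes, and 20 byte-size flushes, `dominant` would return the timeout trigger and `share` would give roughly 0.95 for it, which is the signal that the time threshold is what to tune.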

### Describe the alternatives you've considered

1. **Application-level instrumentation** - Wrapping the connector adds maintenance burden and doesn't help other users
2. **ClickHouse server-side metrics** - Only show what reached ClickHouse, not the Flink sink's internal batching behavior
3. **Flink TaskManager metrics** - Too coarse-grained; they don't expose sink-specific batching decisions
4. **Debug logging** - High overhead, unsuitable for production

None of these expose the **internal batching decision logic** that is unique to this connector.

