Commit 18bdb9b

administration: monitoring: metrics: add new output latency metrics
Signed-off-by: Eduardo Silva <[email protected]>
1 parent ab21009 commit 18bdb9b

1 file changed: +84 -0 lines changed

administration/monitoring.md

Lines changed: 84 additions & 0 deletions
@@ -203,6 +203,7 @@ The following terms are key to understanding how Fluent Bit processes metrics:
| `fluentbit_output_retried_records_total` | name: the name or alias for the output instance | The number of log records that experienced a retry. This metric is calculated at the chunk level: the count increases when an entire chunk is marked for retry. An output plugin might perform multiple actions that generate many error messages when uploading a single chunk. | counter | records |
| `fluentbit_output_retries_failed_total` | name: the name or alias for the output instance | The number of times that retries expired for a chunk. Each plugin configures a `Retry_Limit`, which applies to chunks. When the `Retry_Limit` is exceeded, the chunk is discarded and this metric is incremented. | counter | chunks |
| `fluentbit_output_retries_total` | name: the name or alias for the output instance | The number of times this output instance requested a retry for a chunk. | counter | chunks |
| `fluentbit_output_latency_seconds` | input: the name of the input plugin instance, output: the name of the output plugin instance | End-to-end latency from chunk creation to successful delivery. Provides observability into chunk-level pipeline performance. | histogram | seconds |
| `fluentbit_uptime` | hostname: the hostname on running Fluent Bit | The number of seconds that Fluent Bit has been running. | counter | seconds |
| `fluentbit_process_start_time_seconds` | hostname: the hostname on running Fluent Bit | The Unix Epoch time stamp for when Fluent Bit started. | gauge | seconds |
| `fluentbit_build_info` | hostname: the hostname, version: the version of Fluent Bit, os: OS type | Build version information. The returned value originates from initializing the Unix Epoch time stamp of the configuration context. | gauge | seconds |
@@ -231,6 +232,89 @@ The following are detailed descriptions for the metrics collected by the storage
| `fluentbit_output_upstream_total_connections` | name: the name or alias for the output instance | The sum of the connection count of each output plugin. | gauge | bytes |
| `fluentbit_output_upstream_busy_connections` | name: the name or alias for the output instance | The sum of the connection count in a busy state of each output plugin. | gauge | bytes |

### Output latency metric

> note: feature introduced in v4.0.6.

The `fluentbit_output_latency_seconds` histogram metric captures end-to-end latency from the time a chunk is created by an input plugin until it is successfully delivered by an output plugin. This provides observability into chunk-level pipeline performance and helps identify slowdowns or bottlenecks in the output path.

#### Bucket configuration

The histogram uses the following default bucket boundaries, designed around Fluent Bit's typical flush interval of 1 second:

```
0.5, 1.0, 1.5, 2.5, 5.0, 10.0, 20.0, 30.0, +Inf
```

These boundaries provide:
- **High resolution around 1s latency**: Captures normal operation near the default flush interval
- **Small backpressure detection**: Identifies minor delays in the 1-2.5s range
- **Bottleneck identification**: Detects retry cycles, network stalls, or plugin bottlenecks in higher ranges
- **Complete coverage**: The `+Inf` bucket ensures all latencies are captured
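
As a rough illustration of how these boundaries behave, the following Python sketch assigns observations to buckets the way a Prometheus-style cumulative histogram does (an observation counts toward every bucket whose `le` bound is greater than or equal to it). The function names and the sample latencies are hypothetical, and this is not Fluent Bit's internal implementation.

```python
from bisect import bisect_left

# Default bucket upper bounds listed above; +Inf is the catch-all last bucket.
BOUNDS = [0.5, 1.0, 1.5, 2.5, 5.0, 10.0, 20.0, 30.0, float("inf")]


def bucket_for(latency_seconds):
    """Return the upper bound (le) of the first bucket containing the observation."""
    # Buckets are inclusive on the upper bound: an observation v lands in le when v <= le.
    return BOUNDS[bisect_left(BOUNDS, latency_seconds)]


def cumulative_counts(latencies):
    """Count observations per bucket cumulatively, as the exposition format reports them."""
    return [sum(1 for v in latencies if v <= le) for le in BOUNDS]


# Hypothetical chunk latencies in seconds (illustrative values only).
samples = [0.9, 1.1, 1.2, 2.0, 45.0]
print(bucket_for(1.2))
print(cumulative_counts(samples))
```

A latency of 45 seconds exceeds every finite boundary, so only the `+Inf` bucket records it, which is why that bucket always equals the total observation count.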

#### Example output

When exposed via Fluent Bit's built-in HTTP server, the metric appears in Prometheus format:

```prometheus
# HELP fluentbit_output_latency_seconds End-to-end latency in seconds
# TYPE fluentbit_output_latency_seconds histogram
fluentbit_output_latency_seconds_bucket{le="0.5",input="random.0",output="stdout.0"} 0
fluentbit_output_latency_seconds_bucket{le="1.0",input="random.0",output="stdout.0"} 1
fluentbit_output_latency_seconds_bucket{le="1.5",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="2.5",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="5.0",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="10.0",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="20.0",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="30.0",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="+Inf",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_sum{input="random.0",output="stdout.0"} 6.0015411376953125
fluentbit_output_latency_seconds_count{input="random.0",output="stdout.0"} 6
```
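
Because the bucket counts are cumulative, it can help to unpack them by hand. The sketch below copies the values from the example output into a Python dictionary (the variable and function names are illustrative, not part of any Fluent Bit API) and derives the number of chunks per bucket and the mean latency.

```python
# Cumulative bucket counts copied from the example output above.
cumulative = {
    "0.5": 0, "1.0": 1, "1.5": 6, "2.5": 6, "5.0": 6,
    "10.0": 6, "20.0": 6, "30.0": 6, "+Inf": 6,
}
latency_sum = 6.0015411376953125   # value of the _sum series
latency_count = 6                  # value of the _count series


def per_bucket(cumulative_counts):
    """Convert cumulative counts into the number of observations in each bucket."""
    result = {}
    previous = 0
    for le, count in cumulative_counts.items():  # relies on insertion order
        result[le] = count - previous
        previous = count
    return result


counts = per_bucket(cumulative)
mean_latency = latency_sum / latency_count
print(counts)
print(mean_latency)
```

In this sample, one chunk was delivered in the 0.5-1.0s range, the other five in the 1.0-1.5s range, and the mean latency is roughly one second, which is consistent with the default flush interval.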

#### Use cases

**Performance monitoring**: Monitor overall pipeline health by tracking latency percentiles:

```promql
# 95th percentile latency
histogram_quantile(0.95, rate(fluentbit_output_latency_seconds_bucket[5m]))

# Average latency
rate(fluentbit_output_latency_seconds_sum[5m]) / rate(fluentbit_output_latency_seconds_count[5m])
```
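
To make the percentile query more concrete, the following Python sketch reimplements, in simplified form, the linear interpolation that Prometheus documents for `histogram_quantile` on classic histograms. It is an illustration only, not Prometheus's actual code: the function name `estimate_quantile` is hypothetical, it works on raw cumulative counts rather than rates, and it assumes at least one finite bucket.

```python
def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs sorted by bound."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    # The lowest bucket is assumed to start at 0.
    lower_bound, lower_count = 0.0, 0
    for index, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if upper_bound == float("inf"):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[index - 1][0]
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return float("nan")


# Cumulative counts taken from the example output in this section.
example = [
    (0.5, 0), (1.0, 1), (1.5, 6), (2.5, 6), (5.0, 6),
    (10.0, 6), (20.0, 6), (30.0, 6), (float("inf"), 6),
]
p95 = estimate_quantile(0.95, example)
print(p95)
```

For the example data the estimate lands at about 1.47 seconds, inside the 1.0-1.5s bucket. The result is only as precise as the bucket widths allow, so estimates in the wide upper buckets (for example 10-20s) are correspondingly coarse.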

**Bottleneck detection**: Identify specific input/output pairs experiencing high latency:

```promql
# Outputs with highest average latency
topk(5, rate(fluentbit_output_latency_seconds_sum[5m]) / rate(fluentbit_output_latency_seconds_count[5m]))
```
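
**SLA monitoring**: Track how many chunks are delivered within acceptable time bounds. The threshold must match one of the bucket boundaries (the default buckets have no `2.0` boundary, so this example uses `2.5`), and `ignoring(le)` is needed because the `_count` series has no `le` label:

```promql
# Percentage of chunks delivered within 2.5 seconds
(
  rate(fluentbit_output_latency_seconds_bucket{le="2.5"}[5m])
  / ignoring(le) rate(fluentbit_output_latency_seconds_count[5m])
) * 100
```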
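
The same percentage can be checked by hand against a single scrape. The Python sketch below is a hypothetical helper (not part of Fluent Bit or Prometheus) that uses the cumulative counts from the example output. It also shows why a threshold has to coincide with an existing `le` boundary: the histogram stores counts only at those boundaries. This sketch raises an error for any other threshold, whereas a PromQL selector that matches no series returns an empty result.

```python
# Cumulative counts per bucket boundary, taken from the example output in this section.
CUMULATIVE = {
    0.5: 0, 1.0: 1, 1.5: 6, 2.5: 6, 5.0: 6,
    10.0: 6, 20.0: 6, 30.0: 6, float("inf"): 6,
}


def percent_within(threshold, cumulative):
    """Return the percentage of observations at or below a bucket boundary."""
    if threshold not in cumulative:
        raise KeyError(f"{threshold} is not a bucket boundary")
    total = cumulative[float("inf")]  # the +Inf bucket equals the total count
    if total == 0:
        return None  # no observations yet, so the ratio is undefined
    return 100.0 * cumulative[threshold] / total


print(percent_within(1.0, CUMULATIVE))
print(percent_within(2.5, CUMULATIVE))
```

Note that this operates on raw counters from one scrape, while the PromQL expression uses `rate()` over a time window, so the two only agree when the window covers the same observations.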

**Alerting**: Create alerts for degraded pipeline performance:

```yaml
# Example Prometheus alerting rule
- alert: FluentBitHighLatency
  expr: histogram_quantile(0.95, rate(fluentbit_output_latency_seconds_bucket[5m])) > 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Fluent Bit pipeline experiencing high latency"
    description: "95th percentile latency is {{ $value }}s for {{ $labels.input }} -> {{ $labels.output }}"
```

### Uptime example

Query the service uptime with the following command:
