administration: monitoring: metrics: add new output latency metrics #2037
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
|
|
@@ -203,6 +203,7 @@ | |||||||
| | `fluentbit_output_retried_records_total` | name: the name or alias for the output instance | The number of log records that experienced a retry. This metric is calculated at the chunk level: the count increases when an entire chunk is marked for retry. An output plugin might perform multiple actions that generate many error messages when uploading a single chunk. | counter | records | | ||||||||
| | `fluentbit_output_retries_failed_total` | name: the name or alias for the output instance | The number of times that retries expired for a chunk. Each plugin configures a `Retry_Limit`, which applies to chunks. When the `Retry_Limit` is exceeded, the chunk is discarded and this metric is incremented. | counter | chunks | | ||||||||
| | `fluentbit_output_retries_total` | name: the name or alias for the output instance | The number of times this output instance requested a retry for a chunk. | counter | chunks | | ||||||||
| | `fluentbit_output_latency_seconds` | input: the name of the input plugin instance, output: the name of the output plugin instance | End-to-end latency from chunk creation to successful delivery. Provides observability into chunk-level pipeline performance. | histogram | seconds | | ||||||||
| | `fluentbit_uptime` | hostname: the hostname of the machine running Fluent Bit | The number of seconds that Fluent Bit has been running. | counter | seconds | | ||||||||
| | `fluentbit_process_start_time_seconds` | hostname: the hostname of the machine running Fluent Bit | The Unix Epoch time stamp for when Fluent Bit started. | gauge | seconds | | ||||||||
| | `fluentbit_build_info` | hostname: the hostname, version: the version of Fluent Bit, os: OS type | Build version information. The returned value is the Unix Epoch time stamp recorded when the configuration context was initialized. | gauge | seconds | | ||||||||
|
|
@@ -231,6 +232,89 @@ | |||||||
| | `fluentbit_output_upstream_total_connections` | name: the name or alias for the output instance | The sum of the connection counts of each output plugin. | gauge | bytes | | ||||||||
| | `fluentbit_output_upstream_busy_connections` | name: the name or alias for the output instance | The sum of the busy connection counts of each output plugin. | gauge | bytes | | ||||||||
|
|
||||||||
| ### Output latency metric | ||||||||
|
|
||||||||
| > Note: This feature was introduced in v4.0.6. | ||||||||
|
Check failure on line 237 in administration/monitoring.md
|
||||||||
| The `fluentbit_output_latency_seconds` histogram metric captures end-to-end latency from the time a chunk is created by an input plugin until it is successfully delivered by an output plugin. This provides observability into chunk-level pipeline performance and helps identify slowdowns or bottlenecks in the output path. | ||||||||
|
Check warning on line 239 in administration/monitoring.md
|
||||||||
|
|
||||||||
| #### Bucket configuration | ||||||||
|
|
||||||||
| The histogram uses the following default bucket boundaries, designed around Fluent Bit's typical flush interval of 1 second: | ||||||||
|
Check warning on line 243 in administration/monitoring.md
|
||||||||
|
|
||||||||
| ``` | ||||||||
| 0.5, 1.0, 1.5, 2.5, 5.0, 10.0, 20.0, 30.0, +Inf | ||||||||
| ``` | ||||||||
|
|
||||||||
| These boundaries provide: | ||||||||
| - **High resolution around 1s latency**: Captures normal operation near the default flush interval | ||||||||
|
Check warning on line 250 in administration/monitoring.md
|
||||||||
|
|
||||||||
| - **Small backpressure detection**: Identifies minor delays in the 1-2.5s range | ||||||||
|
Check warning on line 251 in administration/monitoring.md
|
||||||||
| - **Bottleneck identification**: Detects retry cycles, network stalls, or plugin bottlenecks in higher ranges | ||||||||
| - **Complete coverage**: The `+Inf` bucket ensures all latencies are captured | ||||||||
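To make the cumulative nature of these buckets concrete, here is a minimal Python sketch that sorts latency observations into the default boundaries listed above. It is an illustration only, not Fluent Bit's implementation; the function name `cumulative_buckets` and the sample latencies are hypothetical.

```python
from bisect import bisect_left

# Default bucket upper bounds quoted above; +Inf is represented as float("inf").
BOUNDS = [0.5, 1.0, 1.5, 2.5, 5.0, 10.0, 20.0, 30.0, float("inf")]


def cumulative_buckets(latencies):
    """Return cumulative counts per bucket, where each bucket means 'le' (<= bound)."""
    counts = [0] * len(BOUNDS)
    for value in latencies:
        # bisect_left finds the first bound >= value: the smallest bucket holding it.
        counts[bisect_left(BOUNDS, value)] += 1

    # Convert per-bucket counts into running totals, as Prometheus histograms expose them.
    cumulative = []
    running = 0
    for c in counts:
        running += c
        cumulative.append(running)
    return cumulative


# Hypothetical latencies in seconds (not taken from a real run).
sample = [0.8, 1.1, 1.2, 1.0, 3.0, 45.0]
result = cumulative_buckets(sample)
for bound, count in zip(BOUNDS, result):
    print(f'le="{bound}": {count}')
```

Because the counts are cumulative, the `+Inf` bucket always equals the total number of observations, which is why it guarantees complete coverage.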
|
|
||||||||
| #### Example output | ||||||||
|
|
||||||||
| When exposed through Fluent Bit's built-in HTTP server, the metric appears in Prometheus format: | ||||||||
|
Check warning on line 257 in administration/monitoring.md
|
||||||||
|
|
||||||||
| ```prometheus | ||||||||
| # HELP fluentbit_output_latency_seconds End-to-end latency in seconds | ||||||||
| # TYPE fluentbit_output_latency_seconds histogram | ||||||||
| fluentbit_output_latency_seconds_bucket{le="0.5",input="random.0",output="stdout.0"} 0 | ||||||||
| fluentbit_output_latency_seconds_bucket{le="1.0",input="random.0",output="stdout.0"} 1 | ||||||||
| fluentbit_output_latency_seconds_bucket{le="1.5",input="random.0",output="stdout.0"} 6 | ||||||||
| fluentbit_output_latency_seconds_bucket{le="2.5",input="random.0",output="stdout.0"} 6 | ||||||||
| fluentbit_output_latency_seconds_bucket{le="5.0",input="random.0",output="stdout.0"} 6 | ||||||||
| fluentbit_output_latency_seconds_bucket{le="10.0",input="random.0",output="stdout.0"} 6 | ||||||||
| fluentbit_output_latency_seconds_bucket{le="20.0",input="random.0",output="stdout.0"} 6 | ||||||||
| fluentbit_output_latency_seconds_bucket{le="30.0",input="random.0",output="stdout.0"} 6 | ||||||||
| fluentbit_output_latency_seconds_bucket{le="+Inf",input="random.0",output="stdout.0"} 6 | ||||||||
| fluentbit_output_latency_seconds_sum{input="random.0",output="stdout.0"} 6.0015411376953125 | ||||||||
| fluentbit_output_latency_seconds_count{input="random.0",output="stdout.0"} 6 | ||||||||
| ``` | ||||||||
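As a rough illustration of how this output can be read programmatically, the following Python sketch parses an excerpt of the example above and derives the average latency from the `_sum` and `_count` series. The regular expression and the helper name `parse_histogram` are assumptions made for this sketch; a production scraper would normally rely on a Prometheus client library instead.

```python
import re

# Excerpt of the example exposition above (three bucket lines plus sum and count).
TEXT = """
fluentbit_output_latency_seconds_bucket{le="1.0",input="random.0",output="stdout.0"} 1
fluentbit_output_latency_seconds_bucket{le="1.5",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="+Inf",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_sum{input="random.0",output="stdout.0"} 6.0015411376953125
fluentbit_output_latency_seconds_count{input="random.0",output="stdout.0"} 6
"""

# Matches the three series kinds of this one histogram; labels are captured as raw text.
PATTERN = re.compile(
    r'^fluentbit_output_latency_seconds_(bucket|sum|count)\{([^}]*)\}\s+(\S+)$'
)


def parse_histogram(text):
    """Return (buckets, total, count); buckets maps upper bound -> cumulative count."""
    buckets = {}
    total = count = None
    for raw in text.splitlines():
        match = PATTERN.match(raw.strip())
        if match is None:
            continue  # skip blank lines and anything that is not part of the histogram
        kind, labels, value = match.groups()
        if kind == "bucket":
            le = re.search(r'le="([^"]+)"', labels).group(1)
            buckets[float(le)] = float(value)  # float("+Inf") parses to infinity
        elif kind == "sum":
            total = float(value)
        else:
            count = float(value)
    return buckets, total, count


buckets, total, count = parse_histogram(TEXT)
average = total / count
print(f"average latency: {average:.4f}s over {count:.0f} chunks")
```

In this example, five of the six chunks landed in the 1.0 to 1.5 second bucket, which matches an average of roughly one second.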
|
|
||||||||
| #### Use cases | ||||||||
|
|
||||||||
| **Performance monitoring**: Monitor overall pipeline health by tracking latency percentiles: | ||||||||
|
|
||||||||
| ```promql | ||||||||
| # 95th percentile latency | ||||||||
| histogram_quantile(0.95, rate(fluentbit_output_latency_seconds_bucket[5m])) | ||||||||
| # Average latency | ||||||||
| rate(fluentbit_output_latency_seconds_sum[5m]) / rate(fluentbit_output_latency_seconds_count[5m]) | ||||||||
| ``` | ||||||||
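To show what a quantile estimate over these buckets means, here is a simplified Python sketch of the linear interpolation that Prometheus documents for `histogram_quantile`. It works on raw cumulative counts rather than on `rate()` results and skips several edge cases of the real function, so treat it as an approximation; the name `estimate_quantile` is hypothetical.

```python
import math

# Cumulative bucket counts from the example output: (upper bound, count).
EXAMPLE = [
    (0.5, 0), (1.0, 1), (1.5, 6), (2.5, 6), (5.0, 6),
    (10.0, 6), (20.0, 6), (30.0, 6), (math.inf, 6),
]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from sorted (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound  # empty bucket, only reachable when rank is 0
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


p95 = estimate_quantile(0.95, EXAMPLE)
print(f"estimated p95: {p95:.2f}s")
```

The estimate can never be more precise than the bucket width, which is one reason the default boundaries are denser around 1 second.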
|
|
||||||||
| **Bottleneck detection**: Identify specific input/output pairs experiencing high latency: | ||||||||
|
|
||||||||
| ```promql | ||||||||
| # Outputs with highest average latency | ||||||||
| topk(5, rate(fluentbit_output_latency_seconds_sum[5m]) / rate(fluentbit_output_latency_seconds_count[5m])) | ||||||||
| ``` | ||||||||
|
|
||||||||
| **SLA monitoring**: Track how many chunks are delivered within acceptable time bounds: | ||||||||
|
|
||||||||
| ```promql | ||||||||
| # Percentage of chunks delivered within 2.5 seconds (le must match an existing bucket boundary) | ||||||||
| ( | ||||||||
| rate(fluentbit_output_latency_seconds_bucket{le="2.5"}[5m]) | ||||||||
| / ignoring(le) | ||||||||
| rate(fluentbit_output_latency_seconds_count[5m]) | ||||||||
| ) * 100 | ||||||||
| ``` | ||||||||
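The same ratio can be checked offline against scraped values. The Python sketch below computes the share of chunks at or under a bound from the cumulative counts in the example output. It assumes, as a deliberate simplification, that the bound must equal one of the default bucket boundaries, because a histogram carries no information about values between boundaries; the helper name `percent_within` is hypothetical.

```python
import math

# Cumulative counts from the example output, keyed by bucket upper bound.
BUCKETS = {
    0.5: 0, 1.0: 1, 1.5: 6, 2.5: 6, 5.0: 6,
    10.0: 6, 20.0: 6, 30.0: 6, math.inf: 6,
}


def percent_within(bound, buckets):
    """Return the percentage of observations with latency <= bound."""
    if bound not in buckets:
        # No series exists for this bound, so the share cannot be read directly.
        raise KeyError(f"{bound} is not a configured bucket boundary")
    total = buckets[math.inf]
    if total == 0:
        return math.nan
    return 100.0 * buckets[bound] / total


print(f"within 2.5s: {percent_within(2.5, BUCKETS):.1f}%")
print(f"within 1.0s: {percent_within(1.0, BUCKETS):.1f}%")
```

Requesting a bound such as 2.0 raises an error here, mirroring how a PromQL selector on a nonexistent `le` value returns no data.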
|
|
||||||||
| **Alerting**: Create alerts for degraded pipeline performance: | ||||||||
|
|
||||||||
| ```yaml | ||||||||
| # Example Prometheus alerting rule | ||||||||
| - alert: FluentBitHighLatency | ||||||||
| expr: histogram_quantile(0.95, rate(fluentbit_output_latency_seconds_bucket[5m])) > 5 | ||||||||
| for: 2m | ||||||||
| labels: | ||||||||
| severity: warning | ||||||||
| annotations: | ||||||||
| summary: "Fluent Bit pipeline experiencing high latency" | ||||||||
| description: "95th percentile latency is {{ $value }}s for {{ $labels.input }} -> {{ $labels.output }}" | ||||||||
| ``` | ||||||||
| ### Uptime example | ||||||||
| Query the service uptime with the following command: | ||||||||
|
|
||||||||
[markdownlint] reported by reviewdog 🐶
MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"]