Merged
84 changes: 84 additions & 0 deletions administration/monitoring.md
@@ -203,6 +203,7 @@
| `fluentbit_output_retried_records_total` | name: the name or alias for the output instance | The number of log records that experienced a retry. This metric is calculated at the chunk level: the count increases when an entire chunk is marked for retry. An output plugin might perform multiple actions that generate many error messages when uploading a single chunk. | counter | records |
| `fluentbit_output_retries_failed_total` | name: the name or alias for the output instance | The number of times that retries expired for a chunk. Each plugin configures a `Retry_Limit`, which applies to chunks. When the `Retry_Limit` is exceeded, the chunk is discarded and this metric is incremented. | counter | chunks |
| `fluentbit_output_retries_total` | name: the name or alias for the output instance | The number of times this output instance requested a retry for a chunk. | counter | chunks |
| `fluentbit_output_latency_seconds` | input: the name of the input plugin instance, output: the name of the output plugin instance | End-to-end latency from chunk creation to successful delivery. Provides observability into chunk-level pipeline performance. | histogram | seconds |
| `fluentbit_uptime` | hostname: the hostname of the machine running Fluent Bit | The number of seconds that Fluent Bit has been running. | counter | seconds |
| `fluentbit_process_start_time_seconds` | hostname: the hostname on running Fluent Bit | The Unix Epoch time stamp for when Fluent Bit started. | gauge | seconds |
| `fluentbit_build_info` | hostname: the hostname, version: the version of Fluent Bit, os: OS type | Build version information. The returned value originates from the Unix Epoch time stamp recorded when the configuration context is initialized. | gauge | seconds |
@@ -231,6 +232,89 @@
| `fluentbit_output_upstream_total_connections` | name: the name or alias for the output instance | The sum of the connection counts of each output plugin. | gauge | bytes |
| `fluentbit_output_upstream_busy_connections` | name: the name or alias for the output instance | The sum of the busy-state connection counts of each output plugin. | gauge | bytes |

### Output latency metric

{% hint style="info" %}
This feature was introduced in v4.0.6.
{% endhint %}

The `fluentbit_output_latency_seconds` histogram metric captures end-to-end latency from the time an input plugin creates a chunk until an output plugin successfully delivers it. This provides observability into chunk-level pipeline performance and helps identify slowdowns or bottlenecks in the output path.

#### Bucket configuration

The histogram uses the following default bucket boundaries, in seconds, designed around the typical Fluent Bit flush interval of 1 second:

```text
0.5, 1.0, 1.5, 2.5, 5.0, 10.0, 20.0, 30.0, +Inf
```

These boundaries provide:

- **High resolution around 1 second latency**: Captures normal operation near the default flush interval

- **Small backpressure detection**: Identifies minor delays in the 1 to 2.5 second range

- **Bottleneck identification**: Detects retry cycles, network stalls, or plugin bottlenecks in higher ranges
- **Complete coverage**: The `+Inf` bucket ensures all latencies are captured
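
The way a single latency observation lands in these cumulative buckets can be sketched in Python. This illustrates Prometheus-style cumulative histogram counting in general, not the Fluent Bit implementation. The `observe` helper and the sample latencies are hypothetical.

```python
import math

# Default bucket upper bounds from the list above, in seconds.
BUCKETS = [0.5, 1.0, 1.5, 2.5, 5.0, 10.0, 20.0, 30.0, math.inf]


def observe(counts, latency_seconds):
    """Increment every cumulative bucket whose upper bound covers the latency."""
    for i, upper in enumerate(BUCKETS):
        if latency_seconds <= upper:
            counts[i] += 1
    return counts


counts = [0] * len(BUCKETS)
for latency in (0.9, 1.2, 3.0, 45.0):  # hypothetical chunk latencies
    observe(counts, latency)

for upper, count in zip(BUCKETS, counts):
    print(f"le={upper}: {count}")
```

Because the buckets are cumulative, the 45 second observation only increments the `+Inf` bucket, while the 0.9 second observation increments every bucket from `1.0` upward.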

#### Example output

When exposed through the built-in Fluent Bit HTTP server, the metric appears in Prometheus format:

```prometheus
# HELP fluentbit_output_latency_seconds End-to-end latency in seconds
# TYPE fluentbit_output_latency_seconds histogram
fluentbit_output_latency_seconds_bucket{le="0.5",input="random.0",output="stdout.0"} 0
fluentbit_output_latency_seconds_bucket{le="1.0",input="random.0",output="stdout.0"} 1
fluentbit_output_latency_seconds_bucket{le="1.5",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="2.5",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="5.0",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="10.0",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="20.0",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="30.0",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_bucket{le="+Inf",input="random.0",output="stdout.0"} 6
fluentbit_output_latency_seconds_sum{input="random.0",output="stdout.0"} 6.0015411376953125
fluentbit_output_latency_seconds_count{input="random.0",output="stdout.0"} 6
```
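
To make these numbers concrete, the following Python sketch derives the average latency and an estimated 95th percentile from the sample values above. It's a simplified approximation of the linear interpolation that the Prometheus `histogram_quantile` function performs, not a replacement for it. The `estimate_quantile` helper is hypothetical.

```python
import math

# (upper bound, cumulative count) pairs copied from the example output above.
BUCKETS = [(0.5, 0), (1.0, 1), (1.5, 6), (2.5, 6), (5.0, 6),
           (10.0, 6), (20.0, 6), (30.0, 6), (math.inf, 6)]
LATENCY_SUM = 6.0015411376953125
LATENCY_COUNT = 6


def estimate_quantile(q, buckets):
    """Estimate a quantile by linear interpolation inside the matching bucket."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_upper, prev_count = 0.0, 0
    for upper, count in buckets:
        # Skip empty buckets so the interpolation never divides by zero.
        if count >= rank and count > prev_count:
            if math.isinf(upper):
                # Beyond the last finite bound: report that bound instead.
                return prev_upper
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_upper + (upper - prev_upper) * fraction
        prev_upper, prev_count = upper, count
    return prev_upper


average = LATENCY_SUM / LATENCY_COUNT
p95 = estimate_quantile(0.95, BUCKETS)
print(f"average: {average:.4f} s, estimated p95: {p95:.2f} s")
```

With this sample, five of the six observations fall between 1.0 and 1.5 seconds, so the estimate lands inside that bucket. The result is only as precise as the bucket boundaries allow.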

#### Use cases

**Performance monitoring**: Monitor overall pipeline health by tracking latency percentiles:

```promql
# 95th percentile latency
histogram_quantile(0.95, rate(fluentbit_output_latency_seconds_bucket[5m]))
# Average latency
rate(fluentbit_output_latency_seconds_sum[5m]) / rate(fluentbit_output_latency_seconds_count[5m])
```

**Bottleneck detection**: Identify specific input/output pairs experiencing high latency:

```promql
# Outputs with highest average latency
topk(5, rate(fluentbit_output_latency_seconds_sum[5m]) / rate(fluentbit_output_latency_seconds_count[5m]))
```

**SLA monitoring**: Track how many chunks are delivered within acceptable time bounds:

```promql
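# Percentage of chunks delivered within 2.5 seconds.
# 2.5 is the closest default bucket boundary to 2 seconds. ignoring(le) lets
# the bucket series match the count series, which has no le label.
(
  rate(fluentbit_output_latency_seconds_bucket{le="2.5"}[5m])
  / ignoring(le)
  rate(fluentbit_output_latency_seconds_count[5m])
) * 100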
```
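
The arithmetic behind this kind of query can be sketched in Python as a difference between two scrapes of the cumulative counters. Only the default bucket boundaries exist as `le` values, so the sketch uses the 2.5 second bucket. The `percent_within` helper and all counter values are hypothetical, and counter resets aren't handled.

```python
def percent_within(before, after, le):
    """Percentage of chunks delivered within `le` seconds between two scrapes."""
    delivered = after["count"] - before["count"]
    if delivered == 0:
        return None  # no chunks were delivered in the window
    within = after["buckets"][le] - before["buckets"][le]
    return 100.0 * within / delivered


# Hypothetical cumulative counter values from two scrapes, five minutes apart.
scrape_before = {"count": 100, "buckets": {"2.5": 96}}
scrape_after = {"count": 300, "buckets": {"2.5": 286}}

result = percent_within(scrape_before, scrape_after, "2.5")
print(f"{result:.1f}% of chunks delivered within 2.5 s")
```

Working from the difference between scrapes, rather than the raw counters, keeps the percentage focused on recent traffic, which is what `rate()` over a time window achieves in PromQL.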

**Alerting**: Create alerts for degraded pipeline performance:

```yaml
# Example Prometheus alerting rule
- alert: FluentBitHighLatency
  expr: histogram_quantile(0.95, rate(fluentbit_output_latency_seconds_bucket[5m])) > 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Fluent Bit pipeline experiencing high latency"
    description: "95th percentile latency is {{ $value }}s for {{ $labels.input }} -> {{ $labels.output }}"
```

### Uptime example

Query the service uptime with the following command: