Commit 83ccc05

Merge pull request #1647 from fluent/lynettemiles/sc-135594/update-fluent-bit-administration-monitoring
2 parents f65a25c + c13792c commit 83ccc05

2 files changed, 34 additions and 33 deletions

administration/monitoring.md

Lines changed: 32 additions & 33 deletions
@@ -1,12 +1,11 @@
 ---
-title: Monitor data pipelines
 description: Learn how to monitor your Fluent Bit data pipelines
 ---

-<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=e9ca51eb-7faf-491d-a62e-618a21c94506" />
-
 # Monitor data pipelines

+<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=e9ca51eb-7faf-491d-a62e-618a21c94506" />
+
 Fluent Bit includes features for monitoring the internals of your pipeline, in
 addition to connecting to Prometheus and Grafana, Health checks, and connectors to
 use external services:
@@ -23,7 +22,7 @@ metrics of each running plugin.

 You can integrate the monitoring interface with Prometheus.

-### Getting started
+### Get started

 To get started, enable the HTTP server from the configuration file. The following
 configuration instructs Fluent Bit to start an HTTP server on TCP port `2020` and
@@ -66,7 +65,7 @@ Use `curl` to gather information about the HTTP server. The following command se
 the command output to the `jq` program, which outputs human-readable JSON data to the
 terminal.

-```curl
+```bash
 curl -s http://127.0.0.1:2020 | jq
 {
 "fluent-bit": {
@@ -99,15 +98,15 @@ Fluent Bit exposes the following endpoints for monitoring.

 | URI | Description | Data format |
 | -------------------------- | ------------- | --------------------- |
-| / | Fluent Bit build information. | JSON |
-| /api/v1/uptime | Return uptime information in seconds. | JSON |
-| /api/v1/metrics | Display internal metrics per loaded plugin. | JSON |
-| /api/v1/metrics/prometheus | Display internal metrics per loaded plugin in Prometheus Server format. | Prometheus Text 0.0.4 |
-| /api/v1/storage | Get internal metrics of the storage layer / buffered data. This option is enabled only if in the `SERVICE` section of the property `storage.metrics` is enabled. | JSON |
-| /api/v1/health | Display the Fluent Bit health check result. | String |
-| /api/v2/metrics | Display internal metrics per loaded plugin. | [cmetrics text format](https://github.com/fluent/cmetrics) |
-| /api/v2/metrics/prometheus | Display internal metrics per loaded plugin ready in Prometheus Server format. | Prometheus Text 0.0.4 |
-| /api/v2/reload | Execute hot reloading or get the status of hot reloading. See the [hot-reloading documentation](hot-reload.md). | JSON |
+| `/` | Fluent Bit build information. | JSON |
+| `/api/v1/uptime` | Return uptime information in seconds. | JSON |
+| `/api/v1/metrics` | Display internal metrics per loaded plugin. | JSON |
+| `/api/v1/metrics/prometheus` | Display internal metrics per loaded plugin in Prometheus Server format. | Prometheus Text 0.0.4 |
+| `/api/v1/storage` | Get internal metrics of the storage layer / buffered data. This endpoint is enabled only if the `storage.metrics` property is enabled in the `SERVICE` section. | JSON |
+| `/api/v1/health` | Display the Fluent Bit health check result. | String |
+| `/api/v2/metrics` | Display internal metrics per loaded plugin. | [cmetrics text format](https://github.com/fluent/cmetrics) |
+| `/api/v2/metrics/prometheus` | Display internal metrics per loaded plugin ready in Prometheus Server format. | Prometheus Text 0.0.4 |
+| `/api/v2/reload` | Execute hot reloading or get the status of hot reloading. See the [hot-reloading documentation](hot-reload.md). | JSON |

 ### v1 metrics

@@ -131,14 +130,14 @@ The following terms are key to understanding how Fluent Bit processes metrics:
 as successful, or it can fail the chunk entirely if an unrecoverable error is
 encountered, or it can ask for the chunk to be retried.

-| Metric name | Labels | Description | Type | Unit |
-|----------------------------------------|-------------------------------------------------|-------------|---------|---------|
+| Metric name | Labels | Description | Type | Unit |
+| ----------- | ------ | ----------- | ---- | ---- |
 | `fluentbit_input_bytes_total` | name: the name or alias for the input instance | The number of bytes of log records that this input instance has ingested successfully. | counter | bytes |
 | `fluentbit_input_records_total` | name: the name or alias for the input instance | The number of log records this input ingested successfully. | counter | records |
 | `fluentbit_output_dropped_records_total` | name: the name or alias for the output instance | The number of log records dropped by the output. These records hit an unrecoverable error or retries expired for their chunk. | counter | records |
 | `fluentbit_output_errors_total` | name: the name or alias for the output instance | The number of chunks with an error that's either unrecoverable or unable to retry. This metric represents the number of times a chunk failed, and doesn't correspond with the number of error messages visible in the Fluent Bit log output. | counter | chunks |
-| `fluentbit_output_proc_bytes_total` | name: the name or alias for the output instance | The number of bytes of log records that this output instance sent successfully. This metric represents the total byte size of all unique chunks sent by this output. If a record is not sent due to some error, it doesn't count towards this metric. | counter | bytes |
-| `fluentbit_output_proc_records_total` | name: the name or alias for the output instance | The number of log records that this output instance sent successfully. This metric represents the total record count of all unique chunks sent by this output. If a record is not sent successfully, it doesn't count towards this metric. | counter | records |
+| `fluentbit_output_proc_bytes_total` | name: the name or alias for the output instance | The number of bytes of log records that this output instance sent successfully. This metric represents the total byte size of all unique chunks sent by this output. If a record isn't sent due to some error, it doesn't count towards this metric. | counter | bytes |
+| `fluentbit_output_proc_records_total` | name: the name or alias for the output instance | The number of log records that this output instance sent successfully. This metric represents the total record count of all unique chunks sent by this output. If a record isn't sent successfully, it doesn't count towards this metric. | counter | records |
 | `fluentbit_output_retried_records_total` | name: the name or alias for the output instance | The number of log records that experienced a retry. This metric is calculated at the chunk level, the count increased when an entire chunk is marked for retry. An output plugin might perform multiple actions that generate many error messages when uploading a single chunk. | counter | records |
 | `fluentbit_output_retries_failed_total` | name: the name or alias for the output instance | The number of times that retries expired for a chunk. Each plugin configures a `Retry_Limit`, which applies to chunks. When the `Retry_Limit` is exceeded, the chunk is discarded and this metric is incremented. | counter | chunks |
 | `fluentbit_output_retries_total` | name: the name or alias for the output instance | The number of times this output instance requested a retry for a chunk. | counter | chunks |
@@ -163,7 +162,7 @@ The following descriptions apply to metrics outputted in JSON format by the
 | `input_chunks.{plugin name}.chunks.total` | The current total number of chunks owned by this input instance. | chunks |
 | `input_chunks.{plugin name}.chunks.up` | The current number of chunks that are in memory for this input. If file system storage is enabled, chunks that are "up" are also stored in the filesystem layer. | chunks |
 | `input_chunks.{plugin name}.chunks.down` | The current number of chunks that are "down" in the filesystem for this input. | chunks |
-| `input_chunks.{plugin name}.chunks.busy` | Chunks are that are being processed or sent by outputs and are not eligible to have new data appended. | chunks |
+| `input_chunks.{plugin name}.chunks.busy` | Chunks that are being processed or sent by outputs and aren't eligible to have new data appended. | chunks |
 | `input_chunks.{plugin name}.chunks.busy_size` | The sum of the byte size of each chunk which is currently marked as busy. | bytes |

 ### v2 metrics
@@ -198,8 +197,8 @@ The following terms are key to understanding how Fluent Bit processes metrics:
 | `fluentbit_filter_drop_records_total` | name: the name or alias for the filter instance | The number of log records dropped by the filter and removed from the data pipeline. | counter | records |
 | `fluentbit_output_dropped_records_total` | name: the name or alias for the output instance | The number of log records dropped by the output. These records hit an unrecoverable error or retries expired for their chunk. | counter | records |
 | `fluentbit_output_errors_total` | name: the name or alias for the output instance | The number of chunks with an error that's either unrecoverable or unable to retry. This metric represents the number of times a chunk failed, and doesn't correspond with the number of error messages visible in the Fluent Bit log output. | counter | chunks |
-| `fluentbit_output_proc_bytes_total` | name: the name or alias for the output instance | The number of bytes of log records that this output instance sent successfully. This metric represents the total byte size of all unique chunks sent by this output. If a record is not sent due to some error, it doesn't count towards this metric. | counter | bytes |
-| `fluentbit_output_proc_records_total` | name: the name or alias for the output instance | The number of log records that this output instance sent successfully. This metric represents the total record count of all unique chunks sent by this output. If a record is not sent successfully, it doesn't count towards this metric. | counter | records |
+| `fluentbit_output_proc_bytes_total` | name: the name or alias for the output instance | The number of bytes of log records that this output instance sent successfully. This metric represents the total byte size of all unique chunks sent by this output. If a record isn't sent due to some error, it doesn't count towards this metric. | counter | bytes |
+| `fluentbit_output_proc_records_total` | name: the name or alias for the output instance | The number of log records that this output instance sent successfully. This metric represents the total record count of all unique chunks sent by this output. If a record isn't sent successfully, it doesn't count towards this metric. | counter | records |
 | `fluentbit_output_retried_records_total` | name: the name or alias for the output instance | The number of log records that experienced a retry. This metric is calculated at the chunk level, the count increased when an entire chunk is marked for retry. An output plugin might perform multiple actions that generate many error messages when uploading a single chunk. | counter | records |
 | `fluentbit_output_retries_failed_total` | name: the name or alias for the output instance | The number of times that retries expired for a chunk. Each plugin configures a `Retry_Limit`, which applies to chunks. When the `Retry_Limit` is exceeded, the chunk is discarded and this metric is incremented. | counter | chunks |
 | `fluentbit_output_retries_total` | name: the name or alias for the output instance | The number of times this output instance requested a retry for a chunk. | counter | chunks |
@@ -227,7 +226,7 @@ layer.
 | `fluentbit_input_storage_chunks` | name: the name or alias for the input instance | The current total number of chunks owned by this input instance. | gauge | chunks |
 | `fluentbit_input_storage_chunks_up` | name: the name or alias for the input instance | The current number of chunks that are in memory for this input. If file system storage is enabled, chunks that are "up" are also stored in the filesystem layer. | gauge | chunks |
 | `fluentbit_input_storage_chunks_down` | name: the name or alias for the input instance | The current number of chunks that are "down" in the filesystem for this input. | gauge | chunks |
-| `fluentbit_input_storage_chunks_busy` | name: the name or alias for the input instance | Chunks are that are being processed or sent by outputs and are not eligible to have new data appended. | gauge | chunks |
+| `fluentbit_input_storage_chunks_busy` | name: the name or alias for the input instance | Chunks that are being processed or sent by outputs and aren't eligible to have new data appended. | gauge | chunks |
 | `fluentbit_input_storage_chunks_busy_bytes` | name: the name or alias for the input instance | The sum of the byte size of each chunk which is currently marked as busy. | gauge | bytes |
 | `fluentbit_output_upstream_total_connections` | name: the name or alias for the output instance | The sum of the connection count of each output plugins. | gauge | bytes |
 | `fluentbit_output_upstream_busy_connections` | name: the name or alias for the output instance | The sum of the connection count in a busy state of each output plugins. | gauge | bytes |
@@ -236,8 +235,8 @@ layer.

 Query the service uptime with the following command:

-```curl
-$ curl -s http://127.0.0.1:2020/api/v1/uptime | jq
+```bash
+curl -s http://127.0.0.1:2020/api/v1/uptime | jq
 ```

 The command prints a similar output like this:
@@ -254,7 +253,7 @@ The command prints a similar output like this:
 Query internal metrics in JSON format with the following command:

 ```bash
-$ curl -s http://127.0.0.1:2020/api/v1/metrics | jq
+curl -s http://127.0.0.1:2020/api/v1/metrics | jq
 ```

 The command prints a similar output like this:
@@ -284,7 +283,7 @@ The command prints a similar output like this:
 Query internal metrics in Prometheus Text 0.0.4 format:

 ```bash
-$ curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
+curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
 ```

 This command returns the same metrics in Prometheus format instead of JSON:
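
As a reading aid for this hunk, here is a minimal Python sketch of how one sample line in the Prometheus text format could be split into a metric name, labels, and a value. The function name `parse_sample` and the sample line used with it are hypothetical illustrations, not output captured from Fluent Bit, and the parser is deliberately simplified: it doesn't unescape label values and assumes no `}` appears inside them.

```python
import re

# One sample line looks like: metric_name{label="value"} 57 1509150350542
# The label block and the trailing timestamp are both optional.
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<timestamp>-?\d+))?\s*$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value) for a sample line, or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # HELP/TYPE comments and empty lines carry no sample
    match = _SAMPLE.match(line)
    if match is None:
        raise ValueError(f"not a Prometheus text sample: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


if __name__ == "__main__":
    # Hypothetical sample line, using a metric name from the tables in this file.
    print(parse_sample('fluentbit_input_records_total{name="cpu.0"} 57 1509150350542'))
```

For real scraping, a maintained Prometheus client library is likely a better choice than a hand-written regular expression.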
@@ -371,14 +370,14 @@ Sample alerts are available [here](https://github.com/fluent/fluent-bit-docs/tre

 ## Health Check for Fluent Bit

-Fluent bit now supports four new configs to set up the health check.
+Fluent Bit supports the following configurations to set up the health check.

 | Configuration name | Description | Default |
 | ---------------------- | ------------| ------------- |
-| `Health_Check` | enable Health check feature | Off |
-| `HC_Errors_Count` | the error count to meet the unhealthy requirement, this is a sum for all output plugins in a defined HC_Period, example for output error: `[2022/02/16 10:44:10] [ warn] [engine] failed to flush chunk '1-1645008245.491540684.flb', retry in 7 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.3 (out_id=3)` | 5 |
-| `HC_Retry_Failure_Count` | the retry failure count to meet the unhealthy requirement, this is a sum for all output plugins in a defined HC_Period, example for retry failure: `[2022/02/16 20:11:36] [ warn] [engine] chunk '1-1645042288.260516436.flb' cannot be retried: task_id=0, input=tcp.3 > output=cloudwatch_logs.1` | 5 |
-| `HC_Period` | The time period by second to count the error and retry failure data point | 60 |
+| `Health_Check` | Enable Health check feature | `Off` |
+| `HC_Errors_Count` | the error count to meet the unhealthy requirement, this is a sum for all output plugins in a defined `HC_Period`, example for output error: `[2022/02/16 10:44:10] [ warn] [engine] failed to flush chunk '1-1645008245.491540684.flb', retry in 7 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.3 (out_id=3)` | `5` |
+| `HC_Retry_Failure_Count` | the retry failure count to meet the unhealthy requirement, this is a sum for all output plugins in a defined `HC_Period`, example for retry failure: `[2022/02/16 20:11:36] [ warn] [engine] chunk '1-1645042288.260516436.flb' cannot be retried: task_id=0, input=tcp.3 > output=cloudwatch_logs.1` | `5` |
+| `HC_Period` | The time period by second to count the error and retry failure data point | `60` |

 Not every error log means an error to be counted. The error retry failures count only
 on specific errors, which is the example in configuration table description.
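
To make the thresholds in this table concrete, the following Python sketch expresses the health decision as a predicate. It is an illustration only, not Fluent Bit's implementation: the function name `is_unhealthy`, its parameters, and the per-output dictionaries are hypothetical. The strict greater-than comparison is an assumption taken from the equation shown in the health endpoint section of this file, and the counts are assumed to be already restricted to one `HC_Period` window.

```python
def is_unhealthy(errors_by_output, retry_failures_by_output,
                 errors_limit=5, retry_failure_limit=5):
    """Return True when either summed count exceeds its limit.

    The dictionaries map an output instance name to the number of counted
    events seen within one HC_Period window. The limits default to the
    documented defaults of HC_Errors_Count and HC_Retry_Failure_Count.
    """
    # The table describes both counts as a sum over all output plugins.
    total_errors = sum(errors_by_output.values())
    total_retry_failures = sum(retry_failures_by_output.values())
    # Assumption: strictly greater than, following the equation in this file.
    return total_errors > errors_limit or total_retry_failures > retry_failure_limit


if __name__ == "__main__":
    # Hypothetical counts for two output instances within one period.
    print(is_unhealthy({"cloudwatch_logs.1": 4, "cloudwatch_logs.3": 2}, {}))
```

With these hypothetical counts the error total is 6, which exceeds the default limit of 5, so the predicate reports an unhealthy state even though neither output exceeds the limit on its own.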
@@ -425,7 +424,7 @@ Use the following command to call the health endpoint:
 curl -s http://127.0.0.1:2020/api/v1/health
 ```

-With the example config, the health status is determined by the following equation:
+With the example configuration, the health status is determined by the following equation:

 ```text
 Health status = (HC_Errors_Count > 5) OR (HC_Retry_Failure_Count > 5) IN 5 seconds
@@ -437,5 +436,5 @@ Health status = (HC_Errors_Count > 5) OR (HC_Retry_Failure_Count > 5) IN 5 secon
 ## Telemetry Pipeline

 [Telemetry Pipeline](https://chronosphere.io/platform/telemetry-pipeline/) is a
-hosted service that allows you to monitor your Fluent Bit agents including data flow,
+hosted service that lets you monitor your Fluent Bit agents including data flow,
 metrics, and configurations.

vale-styles/FluentBit/Spelling-exceptions.txt

Lines changed: 2 additions & 0 deletions
@@ -13,6 +13,7 @@ backpressure
 BitBake
 Blackhole
 blocklist
+boolean
 Buildkite
 cAdvisor
 Calyptia
@@ -21,6 +22,7 @@ clickstreams
 CloudWatch
 CMake
 cmdlet
+cmetrics
 Config
 Coralogix
 coroutine
