27 changes: 23 additions & 4 deletions administration/monitoring.md
Expand Up @@ -476,6 +476,11 @@
| `HC_Errors_Count` | The error count that meets the unhealthy requirement, summed across all output plugins within a defined `HC_Period`. Example output error: `[2022/02/16 10:44:10] [ warn] [engine] failed to flush chunk '1-1645008245.491540684.flb', retry in 7 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.3 (out_id=3)` | `5` |
| `HC_Retry_Failure_Count` | The retry failure count that meets the unhealthy requirement, summed across all output plugins within a defined `HC_Period`. Example retry failure: `[2022/02/16 20:11:36] [ warn] [engine] chunk '1-1645042288.260516436.flb' cannot be retried: task_id=0, input=tcp.3 > output=cloudwatch_logs.1` | `5` |
| `HC_Period` | The time period, in seconds, over which errors and retry failures are counted. | `60` |
| `HC_Throughput` | Enable throughput health checking. In this context, throughput means the `OUTPUT_RATE/INPUT_RATE` ratio, and the check happens in accordance with `HC_Period`. If this is `On`, all related options must be set because they have no default values. | `Off` |
| `HC_Throughput_Input_Plugins` | Comma-separated list of input plugins used to calculate the input rate. | _none_ |
| `HC_Throughput_Output_Plugins` | Comma-separated list of output plugins used to calculate the output rate. | _none_ |
| `HC_Throughput_Ratio_Threshold` | Failure threshold for the `OUTPUT_RATE/INPUT_RATE` ratio. If the ratio is under this number, the current check fails. A single failed check isn't enough to trigger a health error. See `HC_Throughput_Min_Failures` for details. | _none_ |
| `HC_Throughput_Min_Failures` | Minimum number of consecutive ratio check failures required before the health endpoint returns an error. For example, with a value of `60` and the default `HC_Period`, the ratio must stay under the threshold for 1 minute before an error is returned. | _none_ |

Not every error log counts toward these totals. Only specific errors and retry failures are counted, namely those that match the examples in the configuration table descriptions.

Expand Down Expand Up @@ -527,18 +532,34 @@
HC_Errors_Count 5
HC_Retry_Failure_Count 5
HC_Period 5



[INPUT]
Name cpu


[OUTPUT]
Name stdout
Match *
```

### Throughput health check

If `HC_Throughput` and its related options are set, Fluent Bit monitors the output/input ratio, and the health endpoint returns an error if the ratio falls under the configured threshold. For example:

```text
hc_throughput On
hc_throughput_input_plugins tail.0
hc_throughput_output_plugins http.0
hc_throughput_ratio_threshold 0.1
hc_throughput_min_failures 60
```

{% endtab %}
{% endtabs %}

In the previous example, if the HTTP output rate stays under 1/10 of the tail input rate for one consecutive minute, the `/api/v1/health` endpoint returns `error`. Once the ratio rises back over the threshold, the endpoint returns `OK` again until another full minute of consecutive failed checks occurs.
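
The consecutive-failure rule can be easier to follow as code. The following Python sketch is a hypothetical model of that rule, not Fluent Bit's implementation: the `ThroughputCheck` class, its `observe` method, and the choice to treat a zero input rate as a passing check are all assumptions made for illustration. Only the threshold and minimum-failure values come from the previous example.

```python
from dataclasses import dataclass


@dataclass
class ThroughputCheck:
    """Hypothetical model of the consecutive-failure rule; not Fluent Bit's code."""

    ratio_threshold: float  # mirrors hc_throughput_ratio_threshold
    min_failures: int       # mirrors hc_throughput_min_failures
    consecutive_failures: int = 0

    def observe(self, input_rate: float, output_rate: float) -> str:
        """Record one ratio check and return the resulting health status."""
        # Assumption: with no input there is nothing to flush, so the check passes.
        if input_rate <= 0:
            failed = False
        else:
            failed = (output_rate / input_rate) < self.ratio_threshold
        # A passing check resets the streak; a failing check extends it.
        self.consecutive_failures = self.consecutive_failures + 1 if failed else 0
        return "error" if self.consecutive_failures >= self.min_failures else "OK"


check = ThroughputCheck(ratio_threshold=0.1, min_failures=60)

# 59 failing checks in a row (ratio 0.05) aren't enough to report an error.
statuses = [check.observe(input_rate=1000.0, output_rate=50.0) for _ in range(59)]

# The 60th consecutive failure changes the status.
sixtieth = check.observe(input_rate=1000.0, output_rate=50.0)

# A single passing check (ratio 0.5) restores the status and resets the streak.
recovered = check.observe(input_rate=1000.0, output_rate=500.0)

print(statuses[-1], sixtieth, recovered)
```

In this model, one healthy check clears the failure streak, so a later error requires another full run of `min_failures` failed checks.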

Use the following command to call the health endpoint:

```shell
Expand All @@ -556,6 +577,4 @@

## Telemetry Pipeline

[Telemetry Pipeline](https://chronosphere.io/platform/telemetry-pipeline/) is a hosted service that lets you monitor your Fluent Bit agents including data flow, metrics, and configurations.