28 changes: 21 additions & 7 deletions administration/monitoring.md
@@ -476,6 +476,11 @@
| `HC_Errors_Count` | The error count that triggers an unhealthy status. This is a sum across all output plugins in a defined `HC_Period`. Example output error: `[2022/02/16 10:44:10] [ warn] [engine] failed to flush chunk '1-1645008245.491540684.flb', retry in 7 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.3 (out_id=3)` | `5` |
| `HC_Retry_Failure_Count` | The retry failure count that triggers an unhealthy status. This is a sum across all output plugins in a defined `HC_Period`. Example retry failure: `[2022/02/16 20:11:36] [ warn] [engine] chunk '1-1645042288.260516436.flb' cannot be retried: task_id=0, input=tcp.3 > output=cloudwatch_logs.1` | `5` |
| `HC_Period` | The time period, in seconds, over which errors and retry failures are counted. | `60` |
| `HC_Throughput` | Enable throughput health checking, described in the Throughput health check section. In this context, throughput means the `OUTPUT_RATE/INPUT_RATE` ratio, and the check happens in accordance with `HC_Period`. If this is `On`, you must set all related options because they have no default values. | `Off` |
| `HC_Throughput_Input_Plugins` | Comma-separated list of input plugins used to calculate the input rate. | _none_ |
| `HC_Throughput_Output_Plugins` | Comma-separated list of output plugins used to calculate the output rate. | _none_ |
| `HC_Throughput_Ratio_Threshold` | `OUTPUT_RATE/INPUT_RATE` ratio failure threshold. If the ratio is lower than this number, the current check fails. A single failed check isn't enough to trigger a health error. See `HC_Throughput_Min_Failures` for details. | _none_ |
| `HC_Throughput_Min_Failures` | Minimum number of consecutive ratio check failures required before the health endpoint returns an error. For example, if this is `60` with the default `HC_Period`, the ratio must stay lower than the threshold for 1 minute before an error is returned. | _none_ |

Not every error log counts as an error. Errors and retry failures are counted only for specific log messages, such as the examples shown in the configuration table descriptions.
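The way the counters combine can be sketched in a few lines of Python. This is an illustrative model only, not Fluent Bit's implementation: `HealthConfig` and `is_healthy` are hypothetical names, the strict `>` comparison is an assumption, and the bookkeeping that resets counters every `HC_Period` is left out.

```python
from dataclasses import dataclass


@dataclass
class HealthConfig:
    """Hypothetical container mirroring the HC_* options in the table."""
    errors_count: int = 5          # HC_Errors_Count
    retry_failure_count: int = 5   # HC_Retry_Failure_Count
    period: int = 60               # HC_Period, in seconds (window length)


def is_healthy(errors_by_output: dict[str, int],
               retry_failures_by_output: dict[str, int],
               config: HealthConfig) -> bool:
    """Return False when either summed counter exceeds its limit.

    Both dicts hold counts observed during one HC_Period window, keyed by
    output plugin instance. The strict `>` comparison is an assumption.
    """
    # Counts are summed across all output plugins, as the table describes.
    total_errors = sum(errors_by_output.values())
    total_retry_failures = sum(retry_failures_by_output.values())
    return not (total_errors > config.errors_count
                or total_retry_failures > config.retry_failure_count)


config = HealthConfig()
# Three errors in total: within the limit of 5.
print(is_healthy({"cloudwatch_logs.3": 2, "stdout.0": 1}, {}, config))
# Six errors in total across two outputs: over the limit of 5.
print(is_healthy({"cloudwatch_logs.3": 4, "stdout.0": 2}, {}, config))
```

The point of the sketch is that the limits apply to the sum over every output plugin, so two moderately failing outputs can trip the check together even when neither would alone.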

@@ -527,18 +532,29 @@
HC_Errors_Count 5
HC_Retry_Failure_Count 5
HC_Period 5

[INPUT]
Name cpu


[OUTPUT]
Name stdout
Match *
```

### Throughput health check

If `HC_Throughput` and the related options are set, Fluent Bit monitors the output/input ratio, and the health endpoint returns an error if the ratio is lower than the configured threshold. For example:

```text
hc_throughput On
hc_throughput_input_plugins tail.0
hc_throughput_output_plugins http.0
hc_throughput_ratio_threshold 0.1
hc_throughput_min_failures 60
```

{% endtab %}
{% endtabs %}

In this example, if the `http` output rate stays lower than 1/10 of the `tail` input rate for one consecutive minute, the `/api/v1/health` endpoint returns `error`. If the ratio rises past the threshold again, the status is restored to `OK` until another minute of consecutive failed checks occurs.
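The consecutive-failure behavior described here can be modeled with a small state machine. The following Python sketch is a hedged illustration, not Fluent Bit's code: `ThroughputHealth` and `check` are hypothetical names, the `ok`/`error` strings stand in for the endpoint status, and treating a zero input rate as a passing check is an assumption.

```python
class ThroughputHealth:
    """Hypothetical model of the throughput check; not Fluent Bit's code."""

    def __init__(self, ratio_threshold: float, min_failures: int) -> None:
        self.ratio_threshold = ratio_threshold  # hc_throughput_ratio_threshold
        self.min_failures = min_failures        # hc_throughput_min_failures
        self.consecutive_failures = 0

    def check(self, input_rate: float, output_rate: float) -> str:
        """Record one ratio check and return the resulting health status."""
        # Assumption: with no input there is nothing to compare, so the check passes.
        if input_rate > 0 and output_rate / input_rate < self.ratio_threshold:
            self.consecutive_failures += 1
        else:
            # A single passing check resets the failure streak.
            self.consecutive_failures = 0
        if self.consecutive_failures >= self.min_failures:
            return "error"
        return "ok"


# Use 3 failures instead of 60 to keep the walk-through short.
health = ThroughputHealth(ratio_threshold=0.1, min_failures=3)
statuses = [health.check(1000, rate) for rate in (50, 50, 50, 200, 50)]
print(statuses)  # ['ok', 'ok', 'error', 'ok', 'ok']
```

The third low-ratio check in a row flips the status to `error`, one healthy check restores `ok`, and the streak must then build up again from zero.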

Use the following command to call the health endpoint:

```shell
@@ -556,6 +572,4 @@

## Telemetry Pipeline

[Telemetry Pipeline](https://chronosphere.io/platform/telemetry-pipeline/) is a hosted service that lets you monitor your Fluent Bit agents, including data flow, metrics, and configurations.