Commit 713fd91

Adds InfluxDB Mixin (#1104)
* mixin
* update readme
* grammar fix
* add k8s support to aggregations
* adjust alerts
* revert matcher aggregation, aggregate instance stat panels
* remove matcher from histogram aggregation
* README log db description
* move go row
* suppress 0 values for non-static labels
1 parent 931f6b1 commit 713fd91

11 files changed, 3720 additions and 0 deletions

influxdb-mixin/.lint

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
exclusions:
  template-job-rule:
    reason: "Prometheus datasource variable is being named as prometheus_datasource now while linter expects 'datasource'"
  panel-datasource-rule:
    reason: "Loki datasource variable is being named as loki_datasource now while linter expects 'datasource'"
  template-datasource-rule:
    reason: "Based on new convention we are using variable names prometheus_datasource and loki_datasource where as linter expects 'datasource'"
  template-instance-rule:
    reason: "Based on new convention we are using variable names prometheus_datasource and loki_datasource where as linter expects 'datasource'"
  target-instance-rule:
    reason: "The dashboard is a 'cluster' dashboard where the instance refers to nodes, this dashboard focuses only on the cluster view."
    entries:
      - dashboard: "InfluxDB cluster overview"
  target-promql-rule:
    reason: "Linter does not support selector variable value as a scalar in top-k PromQL queries."
  template-label-promql-rule:
    reason: "Defining a selector for the value of top-k requires a predefined label that the linter considers invalid."
  panel-title-description-rule:
    reason: "Not required for logs volume"
  panel-units-rule:
    reason: "Logs volume has no unit"

influxdb-mixin/Makefile

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
JSONNET_FMT := jsonnetfmt -n 2 --max-blank-lines 1 --string-style s --comment-style s

.PHONY: all
all: build dashboards_out prometheus_alerts.yaml

vendor: jsonnetfile.json
	jb install

.PHONY: build
build: vendor

.PHONY: fmt
fmt:
	find . -name 'vendor' -prune -o -name '*.libsonnet' -print -o -name '*.jsonnet' -print | \
		xargs -n 1 -- $(JSONNET_FMT) -i

.PHONY: lint
lint: build
	find . -name 'vendor' -prune -o -name '*.libsonnet' -print -o -name '*.jsonnet' -print | \
		while read f; do \
			$(JSONNET_FMT) "$$f" | diff -u "$$f" -; \
		done
	mixtool lint mixin.libsonnet

dashboards_out: mixin.libsonnet config.libsonnet $(wildcard dashboards/*)
	@mkdir -p dashboards_out
	mixtool generate dashboards mixin.libsonnet -d dashboards_out

prometheus_alerts.yaml: mixin.libsonnet alerts/*.libsonnet
	mixtool generate alerts mixin.libsonnet -a prometheus_alerts.yaml

.PHONY: clean
clean:
	rm -rf dashboards_out prometheus_alerts.yaml

influxdb-mixin/README.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
# InfluxDB mixin

The InfluxDB mixin is a set of configurable Grafana dashboards and alerts.

The InfluxDB mixin contains the following dashboards:

- InfluxDB cluster overview
- InfluxDB instance overview
- InfluxDB logs overview

and the following alerts:

- InfluxDBWarningTaskSchedulerHighFailureRate
- InfluxDBCriticalTaskSchedulerHighFailureRate
- InfluxDBHighBusyWorkerPercentage
- InfluxDBHighHeapMemoryUsage
- InfluxDBHighAverageAPIRequestLatency
- InfluxDBSlowAverageIQLExecutionTime

## InfluxDB cluster overview

The InfluxDB cluster overview dashboard provides details on the cluster's performance and highlights top instances. The dashboard covers all available aspects of InfluxDB performance and integration health, including Golang performance, query/request load, and task scheduler activity.

![First screenshot of the InfluxDB cluster overview dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/influxdb/screenshots/influxdb_cluster_overview_1.png)
![Second screenshot of the InfluxDB cluster overview dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/influxdb/screenshots/influxdb_cluster_overview_2.png)
![Third screenshot of the InfluxDB cluster overview dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/influxdb/screenshots/influxdb_cluster_overview_3.png)

## InfluxDB instance overview

The InfluxDB instance overview dashboard provides details on one or more instances, including instance configuration stats, Golang performance, query/request load, and task scheduler activity.

![First screenshot of the InfluxDB instance overview dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/influxdb/screenshots/influxdb_instance_overview_1.png)
![Second screenshot of the InfluxDB instance overview dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/influxdb/screenshots/influxdb_instance_overview_2.png)
![Third screenshot of the InfluxDB instance overview dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/influxdb/screenshots/influxdb_instance_overview_3.png)


## InfluxDB logs overview

The InfluxDB logs overview dashboard allows users to view incoming InfluxDB logs. The dashboard also allows users to filter logs based on level, service, engine, and custom regex.

![Screenshot of the InfluxDB logs dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/influxdb/screenshots/influxdb_logs_overview.png)

InfluxDB system logs are enabled by default in `config.libsonnet`. To disable them, set `enableLokiLogs` to `false`, then run `make` again to regenerate the dashboards:

```
{
  _config+:: {
    enableLokiLogs: false,
  },
}
```

For the dashboard selectors to work with InfluxDB logs ingested into your logs datasource, include the `instance`, `job`, and `influxdb_cluster` labels in the [scrape_configs](https://grafana.com/docs/loki/latest/clients/promtail/configuration/#scrape_configs) so that they match the labels on the ingested metrics.

```yaml
scrape_configs:
  - job_name: integrations/influxdb
    static_configs:
      - targets: [localhost]
        labels:
          job: integrations/influxdb
          influxdb_cluster: "<your-cluster-name>"
          instance: "<your-instance-name>"
          __path__: /var/log/influxdb/influxdb.log
    pipeline_stages:
      - multiline:
          firstline: 'ts=\d{4}'
      - regex:
          expression: 'ts=(\S+) lvl=(?P<level>\w+) msg=.* log_id=(\S+) (service=(?P<service>\S+) ){0,1}(engine=(?P<engine>\S*) ){0,1}.*$'
      - labels:
          level:
          service:
          engine:
```

## Alerts overview

- InfluxDBWarningTaskSchedulerHighFailureRate: Automated data processing tasks are failing at a high rate.
- InfluxDBCriticalTaskSchedulerHighFailureRate: Automated data processing tasks are failing at a critical rate.
- InfluxDBHighBusyWorkerPercentage: There is a high percentage of busy workers.
- InfluxDBHighHeapMemoryUsage: There is a high amount of heap memory being used.
- InfluxDBHighAverageAPIRequestLatency: Average API request latency is too high. High latency will negatively affect system performance, degrading data availability and precision.
- InfluxDBSlowAverageIQLExecutionTime: InfluxQL execution times are too slow. Slow query execution times will negatively affect system performance, degrading data availability and precision.

Default thresholds can be configured in `config.libsonnet`.

```js
{
  _config+:: {
    alertsWarningTaskSchedulerHighFailureRate: 25, // %
    alertsCriticalTaskSchedulerHighFailureRate: 50, // %
    alertsWarningHighBusyWorkerPercentage: 80, // %
    alertsWarningHighHeapMemoryUsage: 80, // %
    alertsWarningHighAverageAPIRequestLatency: 0.3, // seconds
    alertsWarningSlowAverageIQLExecutionTime: 0.1, // seconds
  },
}
```

## Install tools

```bash
go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
go install github.com/monitoring-mixins/mixtool/cmd/mixtool@latest
```

For linting and formatting, you also need `jsonnetfmt` installed. If you
have a working Go development environment, it's easiest to run the following:

```bash
go install github.com/google/go-jsonnet/cmd/jsonnetfmt@latest
```

The files in `dashboards_out` need to be imported
into your Grafana server. The exact details will depend on your environment.

`prometheus_alerts.yaml` needs to be imported into Prometheus.

## Generate dashboards and alerts

Edit `config.libsonnet` if required, then build the JSON dashboard files for Grafana and the Prometheus alerts file:

```bash
make
```

For more advanced uses of mixins, see
https://github.com/monitoring-mixins/docs.
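
The `regex` stage in the README's Promtail example extracts the `level`, `service`, and `engine` labels from each log line. The Python sketch below applies the same expression to two made-up log lines to show which fields it captures. The sample lines and the `extract_labels` helper are assumptions for illustration only, and Promtail itself evaluates the pattern with Go's RE2 engine rather than Python's `re`.

```python
import re

# Expression copied from the README's Promtail `regex` stage.
LOG_PATTERN = re.compile(
    r'ts=(\S+) lvl=(?P<level>\w+) msg=.* log_id=(\S+) '
    r'(service=(?P<service>\S+) ){0,1}(engine=(?P<engine>\S*) ){0,1}.*$'
)


def extract_labels(line: str) -> dict:
    """Return the named groups (level, service, engine) captured from one log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return {}
    return match.groupdict()


# Hypothetical log lines, written only to exercise the pattern.
WITH_SERVICE = (
    'ts=2024-01-01T00:00:00.000000Z lvl=info msg="Opened shard" '
    'log_id=0abc service=storage-engine engine=tsm1 op_name=tsdb_open'
)
WITHOUT_SERVICE = (
    'ts=2024-01-01T00:00:01.000000Z lvl=warn msg="Slow request" '
    'log_id=0abc op_name=http'
)

print(extract_labels(WITH_SERVICE))
print(extract_labels(WITHOUT_SERVICE))
```

Because the `service` and `engine` groups are optional, a line that lacks them still matches and those two fields come back empty (`None` in Python).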
Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
{
  prometheusAlerts+:: {
    groups+: [
      {
        name: 'influxdb',
        rules: [
          {
            alert: 'InfluxDBWarningTaskSchedulerHighFailureRate',
            expr: |||
              100 * rate(task_scheduler_total_execute_failure[5m])/clamp_min(rate(task_scheduler_total_execution_calls[5m]), 1) >= %(alertsWarningTaskSchedulerHighFailureRate)s
            ||| % $._config,
            'for': '5m',
            labels: {
              severity: 'warning',
            },
            annotations: {
              summary: 'Automated data processing tasks are failing at a high rate.',
              description:
                (
                  'Task scheduler task executions for instance {{$labels.instance}} on cluster {{$labels.influxdb_cluster}} are failing at a rate of {{ printf "%%.0f" $value }} percent, ' +
                  'which is above the threshold of %(alertsWarningTaskSchedulerHighFailureRate)s percent.'
                ) % $._config,
            },
          },
          {
            alert: 'InfluxDBCriticalTaskSchedulerHighFailureRate',
            expr: |||
              100 * rate(task_scheduler_total_execute_failure[5m])/clamp_min(rate(task_scheduler_total_execution_calls[5m]), 1) >= %(alertsCriticalTaskSchedulerHighFailureRate)s
            ||| % $._config,
            'for': '5m',
            labels: {
              severity: 'critical',
            },
            annotations: {
              summary: 'Automated data processing tasks are failing at a critical rate.',
              description:
                (
                  'Task scheduler task executions for instance {{$labels.instance}} on cluster {{$labels.influxdb_cluster}} are failing at a rate of {{ printf "%%.0f" $value }} percent, ' +
                  'which is above the threshold of %(alertsCriticalTaskSchedulerHighFailureRate)s percent.'
                ) % $._config,
            },
          },
          {
            alert: 'InfluxDBHighBusyWorkerPercentage',
            expr: |||
              task_executor_workers_busy >= %(alertsWarningHighBusyWorkerPercentage)s
            ||| % $._config,
            'for': '5m',
            labels: {
              severity: 'critical',
            },
            annotations: {
              summary: 'There is a high percentage of busy workers.',
              description:
                (
                  'The busy worker percentage for instance {{$labels.instance}} on cluster {{$labels.influxdb_cluster}} is {{ printf "%%.0f" $value }} percent, ' +
                  'which is above the threshold of %(alertsWarningHighBusyWorkerPercentage)s percent.'
                ) % $._config,
            },
          },
          {
            alert: 'InfluxDBHighHeapMemoryUsage',
            expr: |||
              100 * go_memstats_heap_alloc_bytes/clamp_min((go_memstats_heap_idle_bytes + go_memstats_heap_alloc_bytes), 1) >= %(alertsWarningHighHeapMemoryUsage)s
            ||| % $._config,
            'for': '5m',
            labels: {
              severity: 'critical',
            },
            annotations: {
              summary: 'There is a high amount of heap memory being used.',
              description:
                (
                  'The heap memory usage for instance {{$labels.instance}} on cluster {{$labels.influxdb_cluster}} is {{ printf "%%.0f" $value }} percent, ' +
                  'which is above the threshold of %(alertsWarningHighHeapMemoryUsage)s percent.'
                ) % $._config,
            },
          },
          {
            alert: 'InfluxDBHighAverageAPIRequestLatency',
            expr: |||
              sum without(handler, method, path, response_code, status, user_agent) (increase(http_api_request_duration_seconds_sum[5m])/clamp_min(increase(http_api_requests_total[5m]), 1)) >= %(alertsWarningHighAverageAPIRequestLatency)s
            ||| % $._config,
            'for': '1m',
            labels: {
              severity: 'critical',
            },
            annotations: {
              summary: 'Average API request latency is too high. High latency will negatively affect system performance, degrading data availability and precision.',
              description:
                (
                  'The average API request latency for instance {{$labels.instance}} on cluster {{$labels.influxdb_cluster}} is {{ printf "%%.2f" $value }} seconds, which is above the threshold of %(alertsWarningHighAverageAPIRequestLatency)s seconds.'
                ) % $._config,
            },
          },
          {
            alert: 'InfluxDBSlowAverageIQLExecutionTime',
            expr: |||
              sum without(result) (increase(influxql_service_executing_duration_seconds_sum[5m])/clamp_min(increase(influxql_service_requests_total[5m]), 1)) >= %(alertsWarningSlowAverageIQLExecutionTime)s
            ||| % $._config,
            'for': '5m',
            labels: {
              severity: 'warning',
            },
            annotations: {
              summary: 'InfluxQL execution times are too slow. Slow query execution times will negatively affect system performance, degrading data availability and precision.',
              description:
                (
                  'The average InfluxQL query execution time for instance {{$labels.instance}} on cluster {{$labels.influxdb_cluster}} is {{ printf "%%.2f" $value }} seconds, ' +
                  'which is above the threshold of %(alertsWarningSlowAverageIQLExecutionTime)s seconds.'
                ) % $._config,
            },
          },
        ],
      },
    ],
  },
}
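
Both task scheduler alerts above divide the failure rate by `clamp_min(..., 1)`, which floors the denominator at 1 so the expression never divides by zero. The Python sketch below reproduces that arithmetic on made-up per-second rates. `clamp_min` and `failure_rate_percent` are illustrative helpers written for this note, not part of the mixin or of Prometheus.

```python
def clamp_min(value: float, minimum: float) -> float:
    """Mirror PromQL's clamp_min for a single sample: never return less than `minimum`."""
    return max(value, minimum)


def failure_rate_percent(failure_rate: float, call_rate: float) -> float:
    """Percentage computed the way the task scheduler alert expressions compute it."""
    return 100 * failure_rate / clamp_min(call_rate, 1)


# Hypothetical per-second rates, chosen only to show the three interesting cases.
busy = failure_rate_percent(failure_rate=2.0, call_rate=4.0)    # normal load
idle = failure_rate_percent(failure_rate=0.0, call_rate=0.0)    # no executions at all
quiet = failure_rate_percent(failure_rate=0.25, call_rate=0.5)  # under one call per second

print(busy, idle, quiet)
```

With these inputs, `busy` is 50.0 and `idle` is 0.0 instead of a division error. In the `quiet` case the denominator is held at 1, so the result is 25.0 even though half of the executions failed. Under the expression as written, the reported percentage is therefore lower than the true failure ratio whenever the execution rate is below one call per second.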

influxdb-mixin/config.libsonnet

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
{
  _config+:: {
    enableMultiCluster: false,
    influxdbSelector: if self.enableMultiCluster then 'job=~"$job", cluster=~"$cluster"' else 'job=~"$job"',
    multiclusterSelector: 'job=~"$job"',
    filterSelector: 'job=~"integrations/influxdb"',

    dashboardTags: ['influxdb-mixin'],
    dashboardPeriod: 'now-30m',
    dashboardTimezone: 'default',
    dashboardRefresh: '1m',

    // alerts thresholds
    alertsWarningTaskSchedulerHighFailureRate: 25, // %
    alertsCriticalTaskSchedulerHighFailureRate: 50, // %
    alertsWarningHighBusyWorkerPercentage: 80, // %
    alertsWarningHighHeapMemoryUsage: 80, // %
    alertsWarningHighAverageAPIRequestLatency: 0.3, // seconds
    alertsWarningSlowAverageIQLExecutionTime: 0.1, // seconds

    enableLokiLogs: true,
  },
}
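
The alert rules read these thresholds through Jsonnet's `%` operator (`... % $._config`), whose format strings follow Python's `%` formatting rules. As a rough illustration, the Python sketch below renders the tail of one description template against a small config dictionary. It mirrors the substitution only and is not how mixtool evaluates the mixin; the `config` dictionary is a hand-copied subset of the values above.

```python
# Subset of the thresholds from config.libsonnet, copied by hand for this sketch.
config = {
    "alertsWarningTaskSchedulerHighFailureRate": 25,
    "alertsCriticalTaskSchedulerHighFailureRate": 50,
}

# Tail of the warning alert's description template, as written in the alert rules.
template = (
    'failing at a rate of {{ printf "%%.0f" $value }} percent, '
    'which is above the threshold of %(alertsWarningTaskSchedulerHighFailureRate)s percent.'
)

# `%(name)s` looks the key up in the mapping; `%%` collapses to a literal `%`.
rendered = template % config
print(rendered)
```

The `%%` escape is what lets the Prometheus template `{{ printf "%.0f" $value }}` pass through the substitution intact, so Prometheus can still fill in the measured value when the alert fires.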
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
(import 'influxdb-cluster-overview.libsonnet') +
(import 'influxdb-instance-overview.libsonnet') +
(import 'influxdb-logs-overview.libsonnet')
