Commit f076e16

Add Apache HBase mixin (#1085)
* add mixin
* fix import errors
* fix lint
* update alert summaries
* larger table
* fmt
* alert label filter
* RS titles/descriptions
* vitalys feedback
* commit suggestion
* log dash enhancements
* master status, rs links
* fix link
* RS dashboard aggregation
* remove instance aggr
* fix alert filter
1 parent 27b5767 commit f076e16

11 files changed: +2779 -0 lines changed

apache-hbase-mixin/.lint

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
exclusions:
  template-job-rule:
    reason: "Prometheus datasource variable is being named as prometheus_datasource now while linter expects 'datasource'"
  panel-datasource-rule:
    reason: "Loki datasource variable is being named as loki_datasource now while linter expects 'datasource'"
  template-datasource-rule:
    reason: "Based on new convention we are using variable names prometheus_datasource and loki_datasource whereas linter expects 'datasource'"
  template-instance-rule:
    reason: "Based on new convention we are using variable names prometheus_datasource and loki_datasource whereas linter expects 'datasource'"
  target-instance-rule:
    reason: "The dashboard is a 'cluster' dashboard where the instance refers to nodes, this dashboard focuses only on the cluster view."
    entries:
      - dashboard: "Apache HBase cluster overview"
  panel-title-description-rule:
    reason: "Not required for logs volume"
  panel-units-rule:
    reason: "Logs volume has no unit"

apache-hbase-mixin/Makefile

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
JSONNET_FMT := jsonnetfmt -n 2 --max-blank-lines 1 --string-style s --comment-style s

.PHONY: all
all: build dashboards_out prometheus_alerts.yaml

vendor: jsonnetfile.json
	jb install

.PHONY: build
build: vendor

.PHONY: fmt
fmt:
	find . -name 'vendor' -prune -o -name '*.libsonnet' -print -o -name '*.jsonnet' -print | \
		xargs -n 1 -- $(JSONNET_FMT) -i

.PHONY: lint
lint: build
	find . -name 'vendor' -prune -o -name '*.libsonnet' -print -o -name '*.jsonnet' -print | \
		while read f; do \
			$(JSONNET_FMT) "$$f" | diff -u "$$f" -; \
		done
	mixtool lint mixin.libsonnet

dashboards_out: mixin.libsonnet config.libsonnet $(wildcard dashboards/*)
	@mkdir -p dashboards_out
	mixtool generate dashboards mixin.libsonnet -d dashboards_out

prometheus_alerts.yaml: mixin.libsonnet alerts/*.libsonnet
	mixtool generate alerts mixin.libsonnet -a prometheus_alerts.yaml

.PHONY: clean
clean:
	rm -rf dashboards_out prometheus_alerts.yaml

apache-hbase-mixin/README.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
# Apache HBase mixin

The Apache HBase mixin is a set of configurable Grafana dashboards and alerts.

The Apache HBase mixin contains the following dashboards:

- Apache HBase cluster overview
- Apache HBase RegionServer overview
- Apache HBase logs

and the following alerts:

- HBaseHighHeapMemUsage
- HBaseHighNonHeapMemUsage
- HBaseDeadRegionServer
- HBaseOldRegionsInTransition
- HBaseHighMasterAuthFailRate
- HBaseHighRSAuthFailRate

## Apache HBase cluster overview
The Apache HBase cluster overview dashboard provides details on integration status/alerts, current RegionServers, JVM memory usage, cluster connections, master queue performance, and transitioning regions.

![First screenshot of the Apache HBase cluster overview dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/apache-hbase/screenshots/apache_hbase_cluster_overview_1.png)
![Second screenshot of the Apache HBase cluster overview dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/apache-hbase/screenshots/apache_hbase_cluster_overview_2.png)

## Apache HBase RegionServer overview
The Apache HBase RegionServer overview dashboard provides details on data regions, storage, connections, and request handling performance for a RegionServer node.

![First screenshot of the Apache HBase RegionServer overview dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/apache-hbase/screenshots/apache_hbase_region_server_overview_1.png)
![Second screenshot of the Apache HBase RegionServer overview dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/apache-hbase/screenshots/apache_hbase_region_server_overview_2.png)


## Apache HBase logs
The Apache HBase logs dashboard provides details on incoming system logs.

![First screenshot of the Apache HBase logs dashboard](https://storage.googleapis.com/grafanalabs-integration-assets/apache-hbase/screenshots/apache_hbase_logs_1.png)

Apache HBase system logs are enabled by default in `config.libsonnet`. To disable them, set `enableLokiLogs` to `false`, then run `make` again to regenerate the dashboards:

```
{
  _config+:: {
    enableLokiLogs: false,
  },
}
```

For the dashboard selectors to work with system logs ingested into your logs datasource, add `instance`, `job`, and `hbase_cluster` labels to the Promtail [scrape_configs](https://grafana.com/docs/loki/latest/clients/promtail/configuration/#scrape_configs) so that they match the labels on the ingested metrics.

```yaml
scrape_configs:
  - job_name: integrations/apache-hbase
    static_configs:
      - targets: [localhost]
        labels:
          job: integrations/apache-hbase
          hbase_cluster: "<your-cluster-name>"
          instance: "<your-instance-name>"
          __path__: {hbase-home}/logs/*.log
    pipeline_stages:
      - multiline:
          firstline: '\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}'
      - regex:
          expression: '\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3} (?P<level>\w+) \[(?P<context>.*)\] (?P<message>(?s:.*))$'
      - labels:
          level:
          logger:
```

## Alerts overview

- HBaseHighHeapMemUsage: There is a limited amount of heap memory available to the JVM.
- HBaseHighNonHeapMemUsage: There is a limited amount of non-heap memory available to the JVM.
- HBaseDeadRegionServer: One or more RegionServers have become unresponsive.
- HBaseOldRegionsInTransition: Regions are in transition for longer than expected.
- HBaseHighMasterAuthFailRate: A high percentage of authentication attempts to the master are failing.
- HBaseHighRSAuthFailRate: A high percentage of authentication attempts to a RegionServer are failing.

Default thresholds can be configured in `config.libsonnet`.

```js
{
  _config+:: {
    alertsHighHeapMemUsage: 80, // percentage
    alertsHighNonHeapMemUsage: 80, // percentage
    alertsDeadRegionServer: 0, // count
    alertsOldRegionsInTransition: 50, // percentage
    alertsHighMasterAuthFailRate: 35, // percentage
    alertsHighRSAuthFailRate: 35, // percentage
  },
}
```

## Install tools

```bash
go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
go install github.com/monitoring-mixins/mixtool/cmd/mixtool@latest
```

For linting and formatting, you also need `jsonnetfmt` installed. If you
have a working Go development environment, it's easiest to run the following:

```bash
go install github.com/google/go-jsonnet/cmd/jsonnetfmt@latest
```

The files in `dashboards_out` need to be imported
into your Grafana server. The exact details depend on your environment.

`prometheus_alerts.yaml` needs to be imported into Prometheus.

## Generate dashboards and alerts

Edit `config.libsonnet` if required and then build the JSON dashboard files for Grafana and the Prometheus alerts file:

```bash
make
```

For more advanced uses of mixins, see
https://github.com/monitoring-mixins/docs.
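
The Promtail `regex` stage in the README above uses RE2 syntax, and the same expression is also accepted by Python's `re` module, so it can be tried against a sample line before deploying. The following is only a sketch: the sample log line is invented for illustration (it is not taken from a real HBase log) and `parse_line` is a hypothetical helper. Note that the expression defines the groups `level`, `context`, and `message`; it has no `logger` group.

```python
import re

# Expression copied from the README's Promtail `regex` stage.
HBASE_LINE = re.compile(
    r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3} '
    r'(?P<level>\w+) \[(?P<context>.*)\] (?P<message>(?s:.*))$'
)


def parse_line(line):
    """Return the named groups for a log entry, or None if it does not match."""
    match = HBASE_LINE.match(line)
    return match.groupdict() if match else None


# Hypothetical multi-line entry, as the multiline stage would hand it over.
sample = (
    "2023-06-01T12:34:56,789 INFO [main] master.HMaster: initialization done\n"
    "  continuation line"
)
fields = parse_line(sample)
print(fields)
```

Because `message` is wrapped in `(?s:...)`, it keeps the continuation lines that the `multiline` stage has joined onto the first line.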
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
{
  prometheusAlerts+:: {
    groups+: [
      {
        name: 'apache-hbase-alerts',
        rules: [
          {
            alert: 'HBaseHighHeapMemUsage',
            expr: |||
              100 * sum without(context, hostname, processname) (jvm_metrics_mem_heap_used_m{%(filterSelector)s} / clamp_min(jvm_metrics_mem_heap_committed_m{%(filterSelector)s}, 1)) > %(alertsHighHeapMemUsage)s
            ||| % $._config,
            'for': '5m',
            labels: {
              severity: 'warning',
            },
            annotations: {
              summary: 'There is a limited amount of heap memory available to the JVM.',
              description:
                (
                  'The heap memory usage for the JVM on instance {{$labels.instance}} in cluster {{$labels.hbase_cluster}} is {{printf "%%.0f" $value}} percent, which is above the threshold of %(alertsHighHeapMemUsage)s percent'
                ) % $._config,
            },
          },
          {
            alert: 'HBaseDeadRegionServer',
            expr: |||
              server_num_dead_region_servers > %(alertsDeadRegionServer)s
            ||| % $._config,
            'for': '5m',
            labels: {
              severity: 'warning',
            },
            annotations: {
              summary: 'One or more RegionServer(s) has become unresponsive.',
              description:
                (
                  '{{$value}} RegionServer(s) in cluster {{$labels.hbase_cluster}} are unresponsive, which is above the threshold of %(alertsDeadRegionServer)s. The name(s) of the dead RegionServer(s) are {{$labels.deadregionservers}}'
                ) % $._config,
            },
          },
          {
            alert: 'HBaseOldRegionsInTransition',
            expr: |||
              100 * assignment_manager_rit_count_over_threshold / clamp_min(assignment_manager_rit_count, 1) > %(alertsOldRegionsInTransition)s
            ||| % $._config,
            'for': '5m',
            labels: {
              severity: 'warning',
            },
            annotations: {
              summary: 'Regions are in transition for longer than expected.',
              description:
                (
                  '{{printf "%%.0f" $value}} percent of regions in transition in cluster {{$labels.hbase_cluster}} are transitioning for longer than expected, which is above the threshold of %(alertsOldRegionsInTransition)s percent'
                ) % $._config,
            },
          },
          {
            alert: 'HBaseHighMasterAuthFailRate',
            expr: |||
              100 * rate(master_authentication_failures[5m]) / (clamp_min(rate(master_authentication_successes[5m]), 1) + clamp_min(rate(master_authentication_failures[5m]), 1)) > %(alertsHighMasterAuthFailRate)s
            ||| % $._config,
            'for': '5m',
            labels: {
              severity: 'warning',
            },
            annotations: {
              summary: 'A high percentage of authentication attempts to the master are failing.',
              description:
                (
                  '{{printf "%%.0f" $value}} percent of authentication attempts to the master are failing in cluster {{$labels.hbase_cluster}}, which is above the threshold of %(alertsHighMasterAuthFailRate)s percent'
                ) % $._config,
            },
          },
          {
            alert: 'HBaseHighRSAuthFailRate',
            expr: |||
              100 * rate(region_server_authentication_failures[5m]) / (clamp_min(rate(region_server_authentication_successes[5m]), 1) + clamp_min(rate(region_server_authentication_failures[5m]), 1)) > %(alertsHighRSAuthFailRate)s
            ||| % $._config,
            'for': '5m',
            labels: {
              severity: 'warning',
            },
            annotations: {
              summary: 'A high percentage of authentication attempts to a RegionServer are failing.',
              description:
                (
                  '{{printf "%%.0f" $value}} percent of authentication attempts to the RegionServer {{$labels.instance}} are failing in cluster {{$labels.hbase_cluster}}, which is above the threshold of %(alertsHighRSAuthFailRate)s percent'
                ) % $._config,
            },
          },
        ],
      },
    ],
  },
}
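
The two authentication alerts clamp the success rate and the failure rate to a minimum of 1 before summing them in the denominator. A small numeric sketch may help when choosing thresholds; it only mirrors the arithmetic of the expression for a single sample, the helper names are hypothetical, and the input rates are invented.

```python
def clamp_min(value, minimum):
    """Mirror PromQL's clamp_min for a single sample value."""
    return max(value, minimum)


def auth_fail_percent(fail_rate, success_rate):
    """Evaluate 100 * fail / (clamp_min(success, 1) + clamp_min(fail, 1))."""
    denominator = clamp_min(success_rate, 1) + clamp_min(fail_rate, 1)
    return 100 * fail_rate / denominator


# Busy cluster: both per-second rates are well above 1, so clamping has no effect.
busy = auth_fail_percent(fail_rate=40.0, success_rate=60.0)

# Quiet cluster: both rates are below 1, so the denominator is clamped to 2.
quiet = auth_fail_percent(fail_rate=0.5, success_rate=0.5)

print(busy, quiet)
```

Under these assumed inputs the busy case evaluates to 40 percent, while the quiet case evaluates to 25 percent even though half of the attempts fail, so low-traffic clusters may stay below the default threshold of 35.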

apache-hbase-mixin/config.libsonnet

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
{
  _config+:: {
    filterSelector: 'job=~"integrations/apache-hbase"',

    dashboardTags: ['apache-hbase-mixin'],
    dashboardPeriod: 'now-30m',
    dashboardTimezone: 'default',
    dashboardRefresh: '1m',

    // alerts thresholds
    alertsHighHeapMemUsage: 80, // percentage
    alertsHighNonHeapMemUsage: 80, // percentage
    alertsDeadRegionServer: 0, // count
    alertsOldRegionsInTransition: 50, // percentage
    alertsHighMasterAuthFailRate: 35, // percentage
    alertsHighRSAuthFailRate: 35, // percentage

    enableLokiLogs: true,
  },
}
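
The values in `_config` reach the alert rules through Jsonnet's `%` string operator, which follows Python-style `%` formatting: `%(name)s` is replaced by the matching field and `%%` yields a literal `%`. The sketch below reproduces that substitution in Python for illustration only. The `config` dict copies two fields from `config.libsonnet`, and `description_template` is a shortened, hypothetical annotation rather than one of the mixin's actual strings.

```python
# Two fields copied from config.libsonnet.
config = {
    "filterSelector": 'job=~"integrations/apache-hbase"',
    "alertsHighHeapMemUsage": 80,
}

# Same text as the HBaseHighHeapMemUsage expression.
expr_template = (
    "100 * sum without(context, hostname, processname) "
    "(jvm_metrics_mem_heap_used_m{%(filterSelector)s} / "
    "clamp_min(jvm_metrics_mem_heap_committed_m{%(filterSelector)s}, 1)) "
    "> %(alertsHighHeapMemUsage)s"
)

# Shortened, hypothetical annotation: "%%" survives as "%" so that the
# Prometheus template still receives printf "%.0f".
description_template = (
    'usage is {{printf "%%.0f" $value}} percent, '
    "above %(alertsHighHeapMemUsage)s percent"
)

expr = expr_template % config
description = description_template % config

print(expr)
print(description)
```

This is why the annotation strings in the alert rules write `%%.0f`: a single `%` would be consumed by the config substitution before Prometheus ever saw the template.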
