Commit 98c6a42

Add envoy consul (#205)
1 parent f5badd3 commit 98c6a42

5 files changed: +1094 -19 lines changed
resources/consul/ALERTS.md

Lines changed: 7 additions & 0 deletions
@@ -53,3 +53,10 @@ RPC requests error rate is higher than 5%.
 ## Cache hit rate is low
 Cache hit rate is lower than 95%.
 
+## High Request Latency
+Envoy High Request Latency
+## High Response Latency
+Envoy High Response Latency
+
+## Certificate close to expire
+Certificate close to expire

resources/consul/alerts.yaml

Lines changed: 42 additions & 18 deletions
@@ -17,7 +17,7 @@ configurations:
     + 2* stddev_over_time(rate(consul_kvs_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: Consul KV Store update time had noticeable deviations from baseline over the previous hour.
 - alert: Transaction time anomaly
@@ -26,7 +26,7 @@ configurations:
     + 2* stddev_over_time(rate(consul_txn_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: Consul Transaction time had noticeable deviations from baseline over the previous hour.
 - alert: Raft transactions count anomaly
@@ -36,7 +36,7 @@ configurations:
     + 2* stddev_over_time(rate(consul_raft_apply{kube_pod_label_component="server"}[1m]) [1h:1m]) )
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: Consul Raft transactions count rate had noticeable deviations from baseline over the previous hour.
 - alert: Raft commit time anomaly
@@ -45,118 +45,142 @@ configurations:
     + 2* stddev_over_time(rate(consul_raft_commitTime_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: Consul Raft commit time had noticeable deviations from baseline over the previous hour.
 - alert: Leader time to contact followers too high
   expr: |
     consul_raft_leader_lastContact{quantile="0.9"} > 200
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: Consul Leader time to contact followers was greater than 200ms.
 - alert: Flapping leadership
   expr: |
     sum(rate(consul_raft_state_leader{kube_pod_label_component="server"}[1m])) > 0
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: There are too many leadership changes.
 - alert: Too many elections
   expr: |
     sum(rate(consul_raft_state_candidate{kube_pod_label_component="server"}[1m])) > 0
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: There are too many elections for leadership.
 - alert: Server cluster unhealthy
   expr: |
     consul_autopilot_healthy == 0
   for: 5m
   labels:
-    severity: high
+    severity: critical
   annotations:
     description: One or many Consul servers in the cluster are unhealthy.
 - alert: Zero failure tolerance
   expr: |
     consul_autopilot_failure_tolerance == 0
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: There is no failure tolerance in case one Consul server goes down.
 - alert: Client RPC requests anomaly
   expr: |
     avg(rate(consul_client_rpc[1m]) > 0) > (avg_over_time(rate(consul_client_rpc[1m]) [1h:1m])+ 2* stddev_over_time(rate(consul_client_rpc[1m]) [1h:1m]) )
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: Consul Client RPC requests had noticeable deviations from baseline over the previous hour.
 - alert: Client RPC requests rate limit exceeded
   expr: |
     rate(consul_client_rpc_exceeded[1m]) / rate(consul_client_rpc[1m]) > 0.1
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: Over 10% of Consul Client RPC requests have exceeded the rate limit.
 - alert: Client RPC requests failed
   expr: |
     rate(consul_client_rpc_failed[1m]) / rate(consul_client_rpc[1m]) > 0.1
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: Over 10% of Consul Client RPC requests are failing.
 - alert: License Expiry
   expr: |
     consul_system_licenseExpiration / 24 < 30
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: Consul license will expire in less than 30 days.
 - alert: Garbage Collection pause high
   expr: |
     (rate(consul_runtime_gc_pause_ns_sum[1m])) / (1000000000) > 2
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: Garbage Collection stop-the-world pauses were greater than 2 seconds per minute.
 - alert: Garbage Collection pause too high
   expr: |
     (min(consul_runtime_gc_pause_ns_sum)) / (1000000000) > 5
   for: 5m
   labels:
-    severity: high
+    severity: critical
   annotations:
     description: Garbage Collection stop-the-world pauses were greater than 5 seconds per minute.
 - alert: Raft restore duration too high
   expr: |
     consul_raft_leader_oldestLogAge < 2* max(consul_raft_fsm_lastRestoreDuration{kube_pod_label_component="server"})
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: The Raft FSM restore duration is too close to the Raft leader oldest log age.
 - alert: RPC requests error rate is high
   expr: |
     sum(rate(consul_rpc_request_error[1m])) / sum(rate(consul_rpc_request[1m])) > 0.05
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: RPC requests error rate is higher than 5%.
 - alert: Cache hit rate is low
   expr: |
     consul_consul_cache_fetch_success/(consul_consul_cache_fetch_success+ ignoring(result_not_modified) group_left consul_consul_cache_fetch_error)< 0.95
   for: 5m
   labels:
-    severity: medium
+    severity: warning
   annotations:
     description: Cache hit rate is lower than 95%.
+- alert: High Request Latency
+  expr: |
+    histogram_quantile(0.95,sum(rate(envoy_cluster_upstream_cx_connect_ms_bucket{kube_cluster_name=~$cluster}[5m])) by (le,kube_cluster_name,kube_pod_name,kube_namespace_name)) > 0.25
+  for: 5m
+  labels:
+    severity: critical
+  annotations:
+    description: "Envoy High Request Latency"
+- alert: High Response Latency
+  expr: |
+    histogram_quantile(0.95,sum(rate(envoy_cluster_upstream_rq_time_bucket{kube_cluster_name=~$cluster}[5m])) by (le,kube_cluster_name,kube_pod_name,kube_namespace_name)) > 0.25
+  for: 5m
+  labels:
+    severity: critical
+  annotations:
+    description: "Envoy High Response Latency"
+- alert: Certificate close to expire
+  expr: |
+    envoy_server_days_until_first_cert_expiring{kube_cluster_name=~$cluster} < 1
+  for: 5m
+  labels:
+    severity: critical
+  annotations:
+    description: "Certificate close to expire"
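
Two expression shapes recur in the rules in this file: the anomaly alerts compare a rate against its one-hour mean plus two standard deviations, and the Envoy latency alerts take a 95th percentile with histogram_quantile. The following Python sketch illustrates both calculations; it is not Prometheus's implementation. The function names and sample data are hypothetical. The quantile function assumes cumulative `le` buckets sorted by upper bound and ending in +Inf, a lower bound of 0 for the first bucket, and linear interpolation inside a bucket.

```python
import math
from statistics import fmean, pstdev
from typing import Sequence, Tuple


def anomaly_threshold(samples: Sequence[float], k: float = 2.0) -> float:
    """Mean of the window plus k population standard deviations.

    Mirrors the shape avg_over_time(x) + 2 * stddev_over_time(x); PromQL's
    stddev_over_time is a population standard deviation, hence pstdev.
    """
    return fmean(samples) + k * pstdev(samples)


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Assumes buckets are sorted by upper bound and the last bound is +Inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        # First non-empty bucket whose cumulative count reaches the rank.
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Hypothetical one-hour window of per-minute rates.
    print(anomaly_threshold([1, 1, 1, 1, 5]))
    # Hypothetical cumulative latency buckets (upper bound, count).
    print(histogram_quantile(0.5, [(0.1, 10), (0.25, 60), (0.5, 90), (math.inf, 100)]))
```

An alert of the anomaly shape fires when the current value exceeds `anomaly_threshold(window)`. A latency alert fires when the estimated quantile exceeds its threshold, which is expressed in the same unit as the bucket bounds.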

resources/consul/dashboards.yaml

Lines changed: 8 additions & 1 deletion
@@ -16,4 +16,11 @@ configurations:
     * Leadership
     * Network
     * Cache
-  file: include/consul_sysdig.json
+  file: include/consul_sysdig.json
+- name: Envoy
+  kind: Sysdig
+  image: consul/images/consul_envoy_sysdig.png
+  description: |
+    This dashboard offers information on:
+    * Envoy
+  file: include/consul_envoy_sysdig.json