@@ -17,7 +17,7 @@ configurations:
       + 2* stddev_over_time(rate(consul_kvs_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: Consul KV Store update time had noticeable deviations from baseline over the previous hour.
   - alert: Transaction time anomaly
@@ -26,7 +26,7 @@ configurations:
       + 2* stddev_over_time(rate(consul_txn_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: Consul Transaction time had noticeable deviations from baseline over the previous hour.
   - alert: Raft transactions count anomaly
@@ -36,7 +36,7 @@ configurations:
       + 2* stddev_over_time(rate(consul_raft_apply{kube_pod_label_component="server"}[1m]) [1h:1m]) )
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: Consul Raft transactions count rate had noticeable deviations from baseline over the previous hour.
   - alert: Raft commit time anomaly
@@ -45,118 +45,142 @@ configurations:
       + 2* stddev_over_time(rate(consul_raft_commitTime_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: Consul Raft commit time had noticeable deviations from baseline over the previous hour.
   - alert: Leader time to contact followers too high
     expr: |
       consul_raft_leader_lastContact{quantile="0.9"} > 200
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: Consul Leader time to contact followers was greater than 200ms.
   - alert: Flapping leadership
     expr: |
       sum(rate(consul_raft_state_leader{kube_pod_label_component="server"}[1m])) > 0
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: There are too many leadership changes.
   - alert: Too many elections
     expr: |
       sum(rate(consul_raft_state_candidate{kube_pod_label_component="server"}[1m])) > 0
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: There are too many elections for leadership.
   - alert: Server cluster unhealthy
     expr: |
       consul_autopilot_healthy == 0
     for: 5m
     labels:
-      severity: high
+      severity: critical
     annotations:
       description: One or many Consul servers in the cluster are unhealthy.
   - alert: Zero failure tolerance
     expr: |
       consul_autopilot_failure_tolerance == 0
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: There is no failure tolerance in case one Consul server goes down.
   - alert: Client RPC requests anomaly
     expr: |
       avg(rate(consul_client_rpc[1m]) > 0) > (avg_over_time(rate(consul_client_rpc[1m]) [1h:1m])+ 2* stddev_over_time(rate(consul_client_rpc[1m]) [1h:1m]) )
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: Consul Client RPC requests had noticeable deviations from baseline over the previous hour.
   - alert: Client RPC requests rate limit exceeded
     expr: |
       rate(consul_client_rpc_exceeded[1m]) / rate(consul_client_rpc[1m]) > 0.1
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: Over 10% of Consul Client RPC requests have exceeded the rate limit.
   - alert: Client RPC requests failed
     expr: |
       rate(consul_client_rpc_failed[1m]) / rate(consul_client_rpc[1m]) > 0.1
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: Over 10% of Consul Client RPC requests are failing.
   - alert: License Expiry
     expr: |
       consul_system_licenseExpiration / 24 < 30
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: Consul license will expire in less than 30 days.
   - alert: Garbage Collection pause high
     expr: |
       (rate(consul_runtime_gc_pause_ns_sum[1m])) / (1000000000) > 2
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: Garbage Collection stop-the-world pauses were greater than 2 seconds per minute.
   - alert: Garbage Collection pause too high
     expr: |
       (min(consul_runtime_gc_pause_ns_sum)) / (1000000000) > 5
     for: 5m
     labels:
-      severity: high
+      severity: critical
     annotations:
       description: Garbage Collection stop-the-world pauses were greater than 5 seconds per minute.
   - alert: Raft restore duration too high
     expr: |
       consul_raft_leader_oldestLogAge < 2* max(consul_raft_fsm_lastRestoreDuration{kube_pod_label_component="server"})
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: The Raft FSM restore duration is too close to the Raft leader oldest log age.
   - alert: RPC requests error rate is high
     expr: |
       sum(rate(consul_rpc_request_error[1m])) / sum(rate(consul_rpc_request[1m])) > 0.05
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: RPC requests error rate is higher than 5%.
   - alert: Cache hit rate is low
     expr: |
       consul_consul_cache_fetch_success/(consul_consul_cache_fetch_success+ ignoring(result_not_modified) group_left consul_consul_cache_fetch_error)< 0.95
     for: 5m
     labels:
-      severity: medium
+      severity: warning
     annotations:
       description: Cache hit rate is lower than 95%.
+  - alert: High Request Latency
+    expr: |
+      histogram_quantile(0.95,sum(rate(envoy_cluster_upstream_cx_connect_ms_bucket{kube_cluster_name=~$cluster}[5m])) by (le,kube_cluster_name,kube_pod_name,kube_namespace_name)) > 0.25
+    for: 5m
+    labels:
+      severity: critical
+    annotations:
+      description: Envoy 95th percentile upstream connection time is above the alert threshold.
+  - alert: High Response Latency
+    expr: |
+      histogram_quantile(0.95,sum(rate(envoy_cluster_upstream_rq_time_bucket{kube_cluster_name=~$cluster}[5m])) by (le,kube_cluster_name,kube_pod_name,kube_namespace_name)) > 0.25
+    for: 5m
+    labels:
+      severity: critical
+    annotations:
+      description: Envoy 95th percentile upstream request time is above the alert threshold.
+  - alert: Certificate close to expire
+    expr: |
+      envoy_server_days_until_first_cert_expiring{kube_cluster_name=~$cluster} < 1
+    for: 5m
+    labels:
+      severity: critical
+    annotations:
+      description: The Envoy certificate will expire in less than 1 day.