Commit f077103

Merge pull request #198 from sysdiglabs/staging
Staging
2 parents a6bbad9 + a6c5918 commit f077103

14 files changed, 2286 additions and 2 deletions

apps/consul.yaml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+---
+apiVersion: v1
+kind: App
+name: "Consul"
+keywords:
+  - Kubernetes
+  - Available
+availableVersions:
+  - '1.11.1'
+shortDescription: "Consul is a free and open-source service networking platform developed by HashiCorp."
+description: |
+  Consul uses service identity with automated networking to help organizations securely connect applications running in any environment.
+icon: https://raw.githubusercontent.com/sysdiglabs/promcat-resources/master/apps/images/consul.svg
+website: consul.io
+available: true

apps/images/consul.png

8.94 KB

resources/consul/ALERTS.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+# Alerts
+## KV Store update time anomaly
+Consul KV Store update time had noticeable deviations from baseline over the previous hour.
+
+## Transaction time anomaly
+Consul Transaction time had noticeable deviations from baseline over the previous hour.
+
+## Raft transactions count anomaly
+Consul Raft transactions count rate had noticeable deviations from baseline over the previous hour.
+
+## Raft commit time anomaly
+Consul Raft commit time had noticeable deviations from baseline over the previous hour.
+
+## Leader time to contact followers too high
+Consul Leader time to contact followers was greater than 200ms.
+
+## Flapping leadership
+There are too many leadership changes.
+
+## Too many elections
+There are too many elections for leadership.
+
+## Server cluster unhealthy
+One or more Consul servers in the cluster are unhealthy.
+
+## Zero failure tolerance
+There is no failure tolerance in case one Consul server goes down.
+
+## Client RPC requests anomaly
+Consul Client RPC requests had noticeable deviations from baseline over the previous hour.
+
+## Client RPC requests rate limit exceeded
+Over 10% of Consul Client RPC requests have exceeded the rate limit.
+
+## Client RPC requests failed
+Over 10% of Consul Client RPC requests are failing.
+
+## License Expiry
+Consul license will expire in less than 30 days.
+
+## Garbage Collection pause high
+Garbage Collection stop-the-world pauses were greater than 2 seconds per minute.
+
+## Garbage Collection pause too high
+Garbage Collection stop-the-world pauses were greater than 5 seconds per minute.
+
+## Raft restore duration too high
+The Raft FSM restore duration is too close to the Raft leader oldest log age.
+
+## RPC requests error rate is high
+RPC requests error rate is higher than 5%.
+
+## Cache hit rate is low
+Cache hit rate is lower than 95%.
+
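
As a rough illustration of what the "Zero failure tolerance" alert means, the sketch below computes how many servers a Raft cluster can lose while still keeping a majority quorum. The function names are hypothetical and are not part of Consul or of this commit; only the majority-quorum arithmetic is standard.

```python
def quorum(servers: int) -> int:
    """Smallest majority of a cluster with the given number of servers."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """Number of servers that can fail while a majority remains.

    Hypothetical helper for illustration; Consul reports this value
    itself through its autopilot metrics.
    """
    if servers <= 0:
        return 0
    return servers - quorum(servers)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(n, "servers -> quorum", quorum(n), "tolerance", failure_tolerance(n))
```

Under this rule a one-server or two-server cluster has a tolerance of zero, which is the condition the alert is meant to flag.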

resources/consul/INSTALL.md

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+# Prerequisites
+Consul is instrumented with Prometheus metrics and annotates its pods with Prometheus annotations.

resources/consul/README.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Consul
+Consul is a service mesh solution providing a full-featured control plane with service discovery, configuration, and segmentation functionality. Each of these features can be used individually as needed, or they can be used together to build a full service mesh. Consul requires a data plane and supports both a proxy and a native integration model. Consul ships with a simple built-in proxy so that everything works out of the box, but it also supports third-party proxy integrations such as Envoy.
+
+# Prometheus and exporters
+Consul already exposes a Prometheus endpoint with all its metrics on port 8500, and the Envoy proxies expose theirs on port 20200. In Kubernetes the pods are already annotated, so the Sysdig agent can scrape the endpoints right away.
+
+# Metrics
+- Consul server statistics
+- Consul client statistics
+- Envoy metrics
+
+# Attributions
+Configuration files, dashboards and alerts are maintained by the [Sysdig team](https://sysdig.com/).
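
To make the README's description of the Prometheus endpoint more concrete, here is a minimal sketch of reading such an endpoint by hand. It assumes the agent serves the Prometheus text format at `/v1/agent/metrics?format=prometheus` on port 8500 and that samples carry no trailing timestamp. The host name, the helper names, and the sample payload are all made up for illustration.

```python
def metrics_url(host: str, port: int = 8500) -> str:
    """Build the metrics URL (assumed path; adjust for your deployment)."""
    return f"http://{host}:{port}/v1/agent/metrics?format=prometheus"


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition lines into {series: value}.

    Simplification: comment lines are skipped and each sample line is
    assumed to end with its value (no optional timestamp).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical payload, not captured from a real Consul agent.
SAMPLE = """\
# TYPE consul_raft_apply counter
consul_raft_apply 42
consul_raft_leader_lastContact{quantile="0.9"} 35
"""

if __name__ == "__main__":
    print(metrics_url("consul-server"))
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```

In practice the Sysdig agent or a Prometheus server does this scraping; the sketch only shows the shape of the data the alerts are written against.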

resources/consul/alerts.yaml

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
+apiVersion: v1
+kind: Alert
+app: Consul
+version: 1.0.0
+appVersion:
+  - 1.11.1
+descriptionFile: ALERTS.md
+configurations:
+  - kind: Prometheus
+    data: |-
+      groups:
+      - name: Consul
+        rules:
+        - alert: KV Store update time anomaly
+          expr: |
+            avg(rate(consul_kvs_apply_sum{kube_pod_label_component="server"}[1m]) > 0)>(avg_over_time(rate(consul_kvs_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m])
+            + 2* stddev_over_time(rate(consul_kvs_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: Consul KV Store update time had noticeable deviations from baseline over the previous hour.
+        - alert: Transaction time anomaly
+          expr: |
+            avg(rate(consul_txn_apply_sum{kube_pod_label_component="server"}[1m]) > 0)>(avg_over_time(rate(consul_txn_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m])
+            + 2* stddev_over_time(rate(consul_txn_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: Consul Transaction time had noticeable deviations from baseline over the previous hour.
+        - alert: Raft transactions count anomaly
+          expr: |
+            avg(rate(consul_raft_apply{
+            kube_pod_label_component="server"}[1m]) > 0)>(avg_over_time(rate(consul_raft_apply{kube_pod_label_component="server"}[1m]) [1h:1m])
+            + 2* stddev_over_time(rate(consul_raft_apply{kube_pod_label_component="server"}[1m]) [1h:1m]) )
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: Consul Raft transactions count rate had noticeable deviations from baseline over the previous hour.
+        - alert: Raft commit time anomaly
+          expr: |
+            avg(rate(consul_raft_commitTime_sum{kube_pod_label_component="server"}[1m]) > 0)>(avg_over_time(rate(consul_raft_commitTime_sum{kube_pod_label_component="server"}[1m]) [1h:1m])
+            + 2* stddev_over_time(rate(consul_raft_commitTime_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: Consul Raft commit time had noticeable deviations from baseline over the previous hour.
+        - alert: Leader time to contact followers too high
+          expr: |
+            consul_raft_leader_lastContact{quantile="0.9"} > 200
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: Consul Leader time to contact followers was greater than 200ms.
+        - alert: Flapping leadership
+          expr: |
+            sum(rate(consul_raft_state_leader{kube_pod_label_component="server"}[1m])) > 0
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: There are too many leadership changes.
+        - alert: Too many elections
+          expr: |
+            sum(rate(consul_raft_state_candidate{kube_pod_label_component="server"}[1m])) > 0
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: There are too many elections for leadership.
+        - alert: Server cluster unhealthy
+          expr: |
+            consul_per_server_autopilot_healthy == 0
+          for: 5m
+          labels:
+            severity: high
+          annotations:
+            description: One or more Consul servers in the cluster are unhealthy.
+        - alert: Zero failure tolerance
+          expr: |
+            consul_per_server_autopilot_failure_tolerance == 0
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: There is no failure tolerance in case one Consul server goes down.
+        - alert: Client RPC requests anomaly
+          expr: |
+            avg(rate(consul_client_rpc[1m]) > 0) > (avg_over_time(rate(consul_client_rpc[1m]) [1h:1m])+ 2* stddev_over_time(rate(consul_client_rpc[1m]) [1h:1m]) )
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: Consul Client RPC requests had noticeable deviations from baseline over the previous hour.
+        - alert: Client RPC requests rate limit exceeded
+          expr: |
+            rate(consul_client_rpc_exceeded[1m]) / rate(consul_client_rpc[1m]) > 0.1
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: Over 10% of Consul Client RPC requests have exceeded the rate limit.
+        - alert: Client RPC requests failed
+          expr: |
+            rate(consul_client_rpc_failed[1m]) / rate(consul_client_rpc[1m]) > 0.1
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: Over 10% of Consul Client RPC requests are failing.
+        - alert: License Expiry
+          expr: |
+            consul_system_licenseExpiration / 24 < 30
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: Consul license will expire in less than 30 days.
+        - alert: Garbage Collection pause high
+          expr: |
+            (rate(consul_runtime_gc_pause_ns_sum[1m])) / (1000000000) > 2
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: Garbage Collection stop-the-world pauses were greater than 2 seconds per minute.
+        - alert: Garbage Collection pause too high
+          expr: |
+            (min(consul_runtime_gc_pause_ns_sum)) / (1000000000) > 5
+          for: 5m
+          labels:
+            severity: high
+          annotations:
+            description: Garbage Collection stop-the-world pauses were greater than 5 seconds per minute.
+        - alert: Raft restore duration too high
+          expr: |
+            consul_per_server_raft_leader_oldestLogAge < 2* max(consul_raft_fsm_lastRestoreDuration{kube_pod_label_component="server"})
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: The Raft FSM restore duration is too close to the Raft leader oldest log age.
+        - alert: RPC requests error rate is high
+          expr: |
+            sum(rate(consul_rpc_request_error[1m])) / sum(rate(consul_rpc_request[1m])) > 0.05
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: RPC requests error rate is higher than 5%.
+        - alert: Cache hit rate is low
+          expr: |
+            consul_consul_cache_fetch_success/(consul_consul_cache_fetch_success+ ignoring(result_not_modified) group_left consul_consul_cache_fetch_error)< 0.95
+          for: 5m
+          labels:
+            severity: medium
+          annotations:
+            description: Cache hit rate is lower than 95%.
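
The anomaly rules above share one pattern: the current rate is compared against the one-hour average plus two standard deviations. The sketch below shows that comparison in plain Python, assuming the inputs are already per-minute rate samples. The function names and the sample numbers are hypothetical, and the positivity check is only a simplification of the `> 0` filter in the expressions.

```python
from statistics import fmean, pstdev


def anomaly_threshold(baseline, k=2.0):
    """Mean of the baseline plus k population standard deviations."""
    return fmean(baseline) + k * pstdev(baseline)


def is_anomalous(current, baseline, k=2.0):
    """True when a positive current value exceeds the baseline threshold."""
    return current > 0 and current > anomaly_threshold(baseline, k)


# Made-up rate samples standing in for the previous hour.
BASELINE = [10.0, 12.0, 11.0, 9.0, 13.0, 10.0]

if __name__ == "__main__":
    print("threshold:", round(anomaly_threshold(BASELINE), 2))
    print("11.0 anomalous?", is_anomalous(11.0, BASELINE))
    print("25.0 anomalous?", is_anomalous(25.0, BASELINE))
```

The population standard deviation (`pstdev`) is used here because that is what PromQL's `stddev_over_time` computes. A larger `k` would make the rule less sensitive to normal variation.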

resources/consul/dashboards.yaml

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+apiVersion: v1
+kind: Dashboard
+app: Consul
+version: 1.0.0
+appVersion:
+  - '1.11.1'
+configurations:
+  - name: Overview
+    kind: Sysdig
+    image: consul/images/consul_sysdig.png
+    description: |
+      This dashboard offers information on:
+      * Overview
+      * Health
+      * Transaction
+      * Leadership
+      * Network
+      * Cache
+    file: include/consul_sysdig.json

resources/consul/description.yaml

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+apiVersion: v1
+kind: Description
+app: Consul
+version: 1.0.0
+appVersion:
+  - '1.11.1'
+descriptionFile: README.md
1.88 MB
