Commit 8185534

Staging (#202)
* Fix image logo to png
* Add control plane for version 1.22 (#200)
* SMPROD-6724 Updated improved Consul integration (#201)
* SMPROD-6724 Updated improved Consul integration
* SMPROD-6724 Updated Consul prerequisites
* SMPROD-6724 Fix Consul prerequisites

Co-authored-by: carlosadiegosysdig <[email protected]>
1 parent 1fc52a3 commit 8185534

24 files changed, 9306 additions and 19 deletions

apps/kubernetes-control-plane.yaml

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ keywords:
  availableVersions:
  - '1.14.0'
  - '1.18.0'
+ - '1.22.0'
  shortDescription: Open-source system for automating deployment, scaling, and management of containerized applications.
  description: |
  # Kubernetes (K8s) control plane monitor.

resources/consul/INSTALL.md

Lines changed: 11 additions & 0 deletions
@@ -1,2 +1,13 @@
  # Prerequisites
  Consul instruments Prometheus metrics and annotates the pods with Prometheus annotations.
+
+ As seen in Consul documentation pages (https://www.consul.io/docs/k8s/helm#v-global-metrics and https://www.consul.io/docs/agent/options#telemetry-prometheus_retention_time), to make Consul expose an endpoint for scraping metrics, you need to enable a few global.metrics configurations.
+ You also need to enable the telemetry.disable_hostname "extra configurations" in the Consul Server and Client, so the metrics don't contain the name of the instances.
+
+ If you install Consul with Helm, you need to use the following flags:
+ ```
+ --set 'global.metrics.enabled=true'
+ --set 'global.metrics.enableAgentMetrics=true'
+ --set 'server.extraConfig="{"telemetry": {"disable_hostname": true}}"'
+ --set 'client.extraConfig="{"telemetry": {"disable_hostname": true}}"'
+ ```
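The four `--set` flags added to INSTALL.md pass a small JSON document through the chart's `extraConfig` values. As a rough sketch of how those flag values are composed, the Python below rebuilds them from a dictionary. The helper name `consul_helm_flags` is hypothetical and not part of the commit; only the value keys and the telemetry payload are taken from the diff above.

```python
import json

# Telemetry override taken from the INSTALL.md diff above.
TELEMETRY = {"telemetry": {"disable_hostname": True}}


def consul_helm_flags():
    """Return the --set arguments listed in INSTALL.md (hypothetical helper)."""
    # json.dumps renders the Python True as the JSON literal true.
    extra = json.dumps(TELEMETRY)
    return [
        "--set", "global.metrics.enabled=true",
        "--set", "global.metrics.enableAgentMetrics=true",
        "--set", f'server.extraConfig="{extra}"',
        "--set", f'client.extraConfig="{extra}"',
    ]


if __name__ == "__main__":
    # Print one argument per line for inspection.
    for arg in consul_helm_flags():
        print(arg)
```

The single quotes around each value in INSTALL.md exist only to protect the inner double quotes from the shell, so they are not part of the strings built here.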

resources/consul/alerts.yaml

Lines changed: 3 additions & 3 deletions
@@ -74,15 +74,15 @@ configurations:
  description: There are too many elections for leadership."
  - alert: Server cluster unhealthy
  expr: |
- consul_per_server_autopilot_healthy == 0
+ consul_autopilot_healthy == 0
  for: 5m
  labels:
  severity: high
  annotations:
  description: One or many Consul servers in the cluster are unhealthy.
  - alert: Zero failure tolerance
  expr: |
- consul_per_server_autopilot_failure_tolerance == 0
+ consul_autopilot_failure_tolerance == 0
  for: 5m
  labels:
  severity: medium
@@ -138,7 +138,7 @@ configurations:
  description: Garbage Collection stop-the-world pauses were greater than 5 seconds per minute.
  - alert: Raft restore duration too high
  expr: |
- consul_per_server_raft_leader_oldestLogAge < 2* max(consul_raft_fsm_lastRestoreDuration{kube_pod_label_component="server"})
+ consul_raft_leader_oldestLogAge < 2* max(consul_raft_fsm_lastRestoreDuration{kube_pod_label_component="server"})
  for: 5m
  labels:
  severity: medium

resources/consul/include/consul_sysdig.json

Lines changed: 4 additions & 4 deletions
@@ -456,7 +456,7 @@
  "unit": "number",
  "yAxis": "auto"
  },
- "query": "min(consul_per_server_autopilot_healthy{kube_cluster_name=~$cluster, kube_namespace_name=~$namespace})"
+ "query": "min(consul_autopilot_healthy{kube_cluster_name=~$cluster, kube_namespace_name=~$namespace})"
  }
  ],
  "description": "",
@@ -655,7 +655,7 @@
  "unit": "number",
  "yAxis": "auto"
  },
- "query": "consul_per_server_autopilot_failure_tolerance{kube_cluster_name=~$cluster, kube_namespace_name=~$namespace}"
+ "query": "consul_autopilot_failure_tolerance{kube_cluster_name=~$cluster, kube_namespace_name=~$namespace}"
  }
  ],
  "axesConfiguration": {
@@ -718,7 +718,7 @@
  "unit": "number",
  "yAxis": "auto"
  },
- "query": "consul_per_server_autopilot_healthy{kube_cluster_name=~$cluster, kube_namespace_name=~$namespace}"
+ "query": "consul_autopilot_healthy{kube_cluster_name=~$cluster, kube_namespace_name=~$namespace}"
  }
  ],
  "axesConfiguration": {
@@ -1433,7 +1433,7 @@
  "unit": "relativeTime",
  "yAxis": "auto"
  },
- "query": "consul_per_server_raft_leader_oldestLogAge{kube_cluster_name=~$cluster, kube_namespace_name=~$namespace} > 0\n"
+ "query": "consul_raft_leader_oldestLogAge{kube_cluster_name=~$cluster, kube_namespace_name=~$namespace} > 0\n"
  },
  {
  "displayInfo": {

resources/consul/include/sysdig-agent.yaml

Lines changed: 4 additions & 12 deletions
@@ -29,6 +29,9 @@ data:
  scrape_interval: 10s
  scrape_configs:
  - job_name: 'consul-envoy-default'
+ metrics_path: '/v1/agent/metrics'
+ params:
+ format: ['prometheus']
  tls_config:
  insecure_skip_verify: true
  kubernetes_sd_configs:
@@ -102,20 +105,9 @@ data:
  - action: keep
  source_labels: [__address__]
  regex: (.*:8500)
- - action: replace
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
- target_label: __metrics_path__
- regex: (.+)
- replacement: '/v1/agent/metrics'
  - action: replace
  source_labels: [__meta_kubernetes_pod_uid]
  target_label: sysdig_k8s_pod_uid
  - action: replace
  source_labels: [__meta_kubernetes_pod_container_name]
- target_label: sysdig_k8s_pod_container_name
- metric_relabel_configs:
- # Change the name of the metric to remove the name of the pod
- - source_labels: ['__name__']
- target_label: '__name__'
- regex: '(consul_)([a-z]+_)+[0-9]+_(.+)'
- replacement: ${1}per_server_${3}
+ target_label: sysdig_k8s_pod_container_name
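The removed `metric_relabel_configs` rule is what produced the `consul_per_server_*` names that the alerts and dashboard queries in this commit stop using. A minimal Python sketch of that rewrite follows, assuming Prometheus's fully anchored regex matching. The function name `old_relabel` and the sample metric name carrying a sanitized pod hostname are hypothetical; the pattern and replacement come from the diff above.

```python
import re

# Pattern and replacement from the removed metric_relabel_configs rule.
# Prometheus anchors relabel regexes at both ends, so fullmatch is used.
PATTERN = re.compile(r"(consul_)([a-z]+_)+[0-9]+_(.+)")


def old_relabel(name):
    """Apply the removed rename rule to a metric name (hypothetical helper)."""
    match = PATTERN.fullmatch(name)
    if match is None:
        # Prometheus leaves the label untouched when the regex does not match.
        return name
    # Equivalent of the Prometheus replacement ${1}per_server_${3}.
    return match.expand(r"\g<1>per_server_\g<3>")


if __name__ == "__main__":
    # Hypothetical name as exposed when the hostname is embedded in the metric.
    print(old_relabel("consul_consul_server_0_autopilot_healthy"))
    # Name as exposed with telemetry.disable_hostname enabled.
    print(old_relabel("consul_autopilot_healthy"))
```

With `telemetry.disable_hostname` enabled the metric names no longer embed an instance name, so the pattern never matches, which is consistent with the rule being dropped and the queries switching to the plain `consul_*` names.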
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
+ # Alerts
+ ## [KubeProxy] Kube Proxy Down
+ KubeProxy detected down
+
+ ## [KubeProxy] High Rest Client Latency
+ High Rest Client Latency detected
+
+ ## [KubeProxy] High Rule Sync Latency
+ High Rule Sync Latency detected
+
+ ## [KubeProxy] Too Many 500 Code
+ Too Many 500 Code detected
+
+ ## [CoreDNS] Error High
+ High Request Duration
+
+ ## [CoreDNS] Latency High
+ Latency High
+
+ ## [Etcd] Etcd Members Down
+ There are members down.
+
+ ## [Etcd] Etcd Insufficient Members
+ Etcd cluster has insufficient members
+
+ ## [Etcd] Etcd No Leader
+ Member has no leader.
+
+ ## [Etcd] Etcd High Number Of Leader Changes
+ Leader changes within the last 15 minutes.
+
+ ## [Etcd] Etcd High Number Of Failed GRPC Requests
+ High number of failed gRPC requests
+
+ ## [Etcd] Etcd GRPC Requests Slow
+ gRPC requests are taking too much time
+
+ ## [Etcd] Etcd High Number Of Failed Proposals
+ High number of proposal failures within the last 30 minutes on etcd instance
+
+ ## [Etcd] Etcd High Fsync Durations
+ 99th percentile fsync durations are too high
+
+ ## [Etcd] Etcd High Commit Durations
+ 99th percentile commit durations are too high
+
+ ## [Etcd] Etcd HighNumber Of Failed HTTP Requests
+ High number of failed HTTP requests
+
+ ## [Etcd] Etcd HTTP Requests Slow
+ HTTP requests are slow
+
+ ## [Kubelet] PV Not Available
+ Persistent Volume not available
+
+ ## [Kubelet] High Storage Error Rate
+ High Storage Error Rate
+
+ ## [Kubelet] High Storage Latency
+ High Storage Latency
+
+ ## [Kubernetes Api Server] Deprecated APIs
+ API-Server Deprecated APIs
+
+ ## [Kubernetes Api Server] Certificate Expiry
+ API-Server Certificate Expiry
+
+ ## [Kubernetes Api Server] Admission Controller High Latency
+ API-Server Admission Controller High Latency
+
+ ## [Kubernetes Api Server] Webhook Admission Controller High Latency
+ API-Server Webhook Admission Controller High Latency
+
+ ## [Kubernetes Api Server] High 4xx RequestError Rate
+ API-Server High 4xx Request Error Rate
+
+ ## [Kubernetes Api Server] High 5xx RequestError Rate
+ API-Server High 5xx Request Error Rate
+
+ ## [Kubernetes Api Server] High Request Latency
+ API-Server High Request Latency
+
+ ## [k8s-kubelet] Kubelet Too Many Pods
+ Kubelet Too Many Pods
+
+ ## [k8s-kubelet] Kubelet Pod Lifecycle Event Generator Duration High
+ Kubelet Pod Lifecycle Event Generator Duration High
+
+ ## [k8s-kubelet] Kubelet Pod StartUp Latency High
+ Kubelet Pod StartUp Latency High
+
+ ## [k8s-kubelet] Kubelet Down
+ Kubelet Down
+
+ ## [k8s-pvc] PV Not Available
+ Persistent Volume not available
+
+ ## [k8s-pvc] PVC Pending For a Long Time
+ Persistent Volume Claim not available
+
+ ## [k8s-pvc] PVC Lost
+ Persistent Volume Claim lost
+
+ ## [k8s-pvc] PVC Storage Usage Is Reaching The Limit
+ Persistent Volume Claim storage at 95%
+
+ ## [k8s-pvc] PVC Inodes Usage Is Reaching The Limit
+ PVC inodes Usage Is Reaching The Limit
+
+ ## [k8s-pvc] PV Full In Four Days
+ Persistent Volume Full In Four Days
+
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+ ## Mount the etcd certificates in the sysdig agent
+ ```sh
+ kubectl -n sysdig-agent patch ds sysdig-agent -p '{"spec":{"template":{"spec":{"volumes":[{"hostPath":{"path":"/etc/kubernetes/pki/etcd-manager-main","type":"DirectoryOrCreate"},"name":"etcd-certificates"}]}}}}'
+
+ kubectl -n sysdig-agent patch ds sysdig-agent -p '{"spec":{"template":{"spec":{"containers":[{"name":"sysdig-agent","volumeMounts": [{"mountPath": "/etc/kubernetes/pki/etcd-manager","name": "etcd-certificates"}]}]}}}}'
+ ```
+
+ # Exposing the Proxy port in kops
+ If you are using kops, you will have to change the cluster spec to expose the port for the proxy. To edit the cluster, run:
+
+ ```
+ kops --state s3://name-of-s3 --name cluster-name edit cluster
+ ```
+
+ And add the following lines:
+
+ ```yaml
+ kubeProxy:
+   metricsBindAddress: 0.0.0.0
+ ```
+
+ And update the cluster:
+
+ ```
+ kops --state s3://name-of-s3 --name cluster-name rolling-update cluster --yes
+ ```
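The two `kubectl patch` commands in the file above carry their patches as dense one-line JSON. As a readability aid, here is a sketch that builds the same two patch bodies as Python dictionaries and serializes them into the `-p` argument. The helper name `patch_args` is hypothetical; the paths, names, and namespace are copied from the diff.

```python
import json

# Volume name shared by both patches; it must match for the mount to resolve.
VOLUME_NAME = "etcd-certificates"

# First patch: add a hostPath volume to the DaemonSet pod template.
volume_patch = {"spec": {"template": {"spec": {"volumes": [{
    "hostPath": {
        "path": "/etc/kubernetes/pki/etcd-manager-main",
        "type": "DirectoryOrCreate",
    },
    "name": VOLUME_NAME,
}]}}}}

# Second patch: mount that volume inside the sysdig-agent container.
mount_patch = {"spec": {"template": {"spec": {"containers": [{
    "name": "sysdig-agent",
    "volumeMounts": [{
        "mountPath": "/etc/kubernetes/pki/etcd-manager",
        "name": VOLUME_NAME,
    }],
}]}}}}


def patch_args(patch):
    """Build the kubectl argument list for one patch (hypothetical helper)."""
    return ["kubectl", "-n", "sysdig-agent", "patch", "ds", "sysdig-agent",
            "-p", json.dumps(patch)]


if __name__ == "__main__":
    for patch in (volume_patch, mount_patch):
        print(" ".join(patch_args(patch)[:-1]), "'" + patch_args(patch)[-1] + "'")
```

Keeping the volume name in one constant makes it harder for the volume and its mount to drift apart when the paths are adapted to another cluster layout.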
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+ # Kubernetes
+ Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications.
+
+ The metrics for the Kubernetes control plane are gathered from the pods located in the namespace kube-system.
+
+ # Metrics
+ With these metrics we can see information about:
+ - Api-server
+ - Kubelet
+ - Control manager
+ - Scheduler
+ - Proxy
+ - etcd
+ - coreDNS
+
+ # Attributions
+ Configuration files and dashboards maintained by [Sysdig team](https://sysdig.com/).
+
+ All dashboards and alerts are modified from the [kubernetes mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin) as reference.
