Commit 4d88d16

Add Kubernetes admin dashboards for api-server (#208)
* added workqueue-related and apiserver_storage_db_total_size_in_bytes (available in K8s/EKS v1.26+) metrics to the kube-admin scrape job
* changed the OTEL scrape config and recording rules file to make the apiserver Grafana dashboards work
* added API server Grafana dashboards to the variables and the Flux kustomization
* added the original kube-prom-stack kube-apiserver scrape config to OTEL
* Revert "added original kube-prom-stack kube-apiserver scrape config into OTEL" because this scrape config does not work :-( This reverts commit 715db89.
* added API server troubleshooting dashboard
* removed empty line for clarity
* made API server monitoring default to true; it can still be disabled
* Update eks-apiserver.md
* updated eks-monitoring/README.md according to pre-commit

Co-authored-by: Jens-Uwe Walther <[email protected]>
1 parent 9f90ec8 commit 4d88d16

File tree

10 files changed

+197
-8
lines changed

docs/eks/eks-apiserver.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+# Monitoring EKS API server
+
+AWS Distro for OpenTelemetry enables EKS API server monitoring by default and provides three Grafana dashboards:
+
+## Kube-apiserver (basic)
+
+The basic dashboard shows metrics recommended in [EKS Best Practices Guides - Monitor Control Plane Metrics](https://aws.github.io/aws-eks-best-practices/reliability/docs/controlplane/#monitor-control-plane-metrics) and provides request rate and latency for the API server, latency for the etcd server, and overall workqueue service time and latency. It allows a drill-down per API server.
+
+![image](https://github.com/youwalther65/terraform-aws-observability-accelerator/assets/29410195/9dcf2583-6630-4d3c-911d-8ca48ae2d26f)
+
+## Kube-apiserver (advanced)
+
+The advanced dashboard is derived from the kube-prometheus-stack "Kubernetes / API server" dashboard and provides a detailed metrics drill-down, for example per READ and WRITE operation per component (such as deployments, configmaps, etc.).
+
+![image](https://github.com/youwalther65/terraform-aws-observability-accelerator/assets/29410195/e76a6357-461f-416d-8bf0-5b7777848bea)
+
+## Kube-apiserver (troubleshooting)
+
+This dashboard can be used to troubleshoot API server problems such as latency and errors.
+
+A detailed description of its usage, with background information, can be found in the AWS Containers blog post [Troubleshooting Amazon EKS API servers with Prometheus](https://aws.amazon.com/blogs/containers/troubleshooting-amazon-eks-api-servers-with-prometheus/).
+
+![image](https://github.com/youwalther65/terraform-aws-observability-accelerator/assets/29410195/921d3453-dcda-4d8a-8223-7c02f1f08ee2)

examples/existing-cluster-with-base-and-infra/main.tf

Lines changed: 3 additions & 0 deletions
@@ -67,6 +67,9 @@ module "eks_monitoring" {
   # reusing existing certificate manager? defaults to true
   enable_cert_manager = true

+  # enable EKS API server monitoring
+  enable_apiserver_monitoring = true
+
   # deploys external-secrets in to the cluster
   enable_external_secrets = true
   grafana_api_key = var.grafana_api_key
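
Because `enable_apiserver_monitoring` defaults to `true`, the hunk above only makes the setting explicit. A minimal sketch of opting out instead might look like the following. The `source` path and the `var.eks_cluster_id` reference are assumptions for illustration and are not taken from this diff; other module inputs are omitted.

```hcl
module "eks_monitoring" {
  # Assumed relative path, for illustration only; point this at wherever the module lives.
  source = "../../modules/eks-monitoring"

  # The module README lists eks_cluster_id as required; the variable on the right is hypothetical.
  eks_cluster_id = var.eks_cluster_id

  # Other inputs omitted in this sketch.

  # Sets the flag that gates the 'apiserver' scrape job in the collector template.
  enable_apiserver_monitoring = false
}
```

With the flag set to `false`, the `{{ if .Values.enableAPIserver }}` block in the collector template should not be rendered, so the `apiserver` job would not be scraped. Whether the three dashboards are still deployed in that case is not visible in this diff.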

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ nav:
   - Concepts: concepts.md
   - Amazon EKS:
     - Infrastructure monitoring: eks/index.md
+    - EKS API server monitoring: eks/eks-apiserver.md
     - Multicluster monitoring: eks/multicluster.md
     - Java/JMX: eks/java.md
     - Nginx: eks/nginx.md

modules/eks-monitoring/README.md

Lines changed: 4 additions & 0 deletions
@@ -71,6 +71,7 @@ See examples using this Terraform modules in the **Amazon EKS** section of [this
 | <a name="input_eks_cluster_id"></a> [eks\_cluster\_id](#input\_eks\_cluster\_id) | EKS Cluster Id | `string` | n/a | yes |
 | <a name="input_enable_alerting_rules"></a> [enable\_alerting\_rules](#input\_enable\_alerting\_rules) | Enables or disables Managed Prometheus alerting rules | `bool` | `true` | no |
 | <a name="input_enable_amazon_eks_adot"></a> [enable\_amazon\_eks\_adot](#input\_enable\_amazon\_eks\_adot) | Enables the ADOT Operator on the EKS Cluster | `bool` | `true` | no |
+| <a name="input_enable_apiserver_monitoring"></a> [enable\_apiserver\_monitoring](#input\_enable\_apiserver\_monitoring) | Enable EKS kube-apiserver monitoring, alerting and dashboards | `bool` | `true` | no |
 | <a name="input_enable_cert_manager"></a> [enable\_cert\_manager](#input\_enable\_cert\_manager) | Allow reusing an existing installation of cert-manager | `bool` | `true` | no |
 | <a name="input_enable_custom_metrics"></a> [enable\_custom\_metrics](#input\_enable\_custom\_metrics) | Allows additional metrics collection for config elements in the `custom_metrics_config` config object. Automatic dashboards are not included | `bool` | `false` | no |
 | <a name="input_enable_dashboards"></a> [enable\_dashboards](#input\_enable\_dashboards) | Enables or disables curated dashboards | `bool` | `true` | no |
@@ -93,6 +94,9 @@ See examples using this Terraform modules in the **Amazon EKS** section of [this
 | <a name="input_flux_kustomization_path"></a> [flux\_kustomization\_path](#input\_flux\_kustomization\_path) | Flux Kustomization Path | `string` | `"./artifacts/grafana-operator-manifests/eks/infrastructure"` | no |
 | <a name="input_go_config"></a> [go\_config](#input\_go\_config) | Grafana Operator configuration | <pre>object({<br> create_namespace = bool<br> helm_chart = string<br> helm_name = string<br> k8s_namespace = string<br> helm_release_name = string<br> helm_chart_version = string<br> })</pre> | <pre>{<br> "create_namespace": true,<br> "helm_chart": "oci://ghcr.io/grafana-operator/helm-charts/grafana-operator",<br> "helm_chart_version": "v5.0.0-rc3",<br> "helm_name": "grafana-operator",<br> "helm_release_name": "grafana-operator",<br> "k8s_namespace": "grafana-operator"<br>}</pre> | no |
 | <a name="input_grafana_api_key"></a> [grafana\_api\_key](#input\_grafana\_api\_key) | Grafana API key for the Amazon Managed Grafana workspace. Required if `enable_external_secrets = true` | `string` | `""` | no |
+| <a name="input_grafana_apiserver_advanced_dashboard_url"></a> [grafana\_apiserver\_advanced\_dashboard\_url](#input\_grafana\_apiserver\_advanced\_dashboard\_url) | Dashboard URL for Kube-apiserver (advanced) Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/apiserver-advanced.json"` | no |
+| <a name="input_grafana_apiserver_basic_dashboard_url"></a> [grafana\_apiserver\_basic\_dashboard\_url](#input\_grafana\_apiserver\_basic\_dashboard\_url) | Dashboard URL for Kube-apiserver (basic) Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/apiserver-basic.json"` | no |
+| <a name="input_grafana_apiserver_troubleshooting_dashboard_url"></a> [grafana\_apiserver\_troubleshooting\_dashboard\_url](#input\_grafana\_apiserver\_troubleshooting\_dashboard\_url) | Dashboard URL for Kube-apiserver (troubleshooting) Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/apiserver-troubleshooting.json"` | no |
 | <a name="input_grafana_cluster_dashboard_url"></a> [grafana\_cluster\_dashboard\_url](#input\_grafana\_cluster\_dashboard\_url) | Dashboard URL for Cluster Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json"` | no |
 | <a name="input_grafana_kubelet_dashboard_url"></a> [grafana\_kubelet\_dashboard\_url](#input\_grafana\_kubelet\_dashboard\_url) | Dashboard URL for Kubelet Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json"` | no |
 | <a name="input_grafana_namespace_workloads_dashboard_url"></a> [grafana\_namespace\_workloads\_dashboard\_url](#input\_grafana\_namespace\_workloads\_dashboard\_url) | Dashboard URL for Namespace Workloads Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json"` | no |

modules/eks-monitoring/dashboards.tf

Lines changed: 3 additions & 0 deletions
@@ -37,6 +37,9 @@ spec:
 AMP_ENDPOINT_URL: ${var.managed_prometheus_workspace_endpoint}
 AMG_ENDPOINT_URL: ${var.grafana_url}
 GRAFANA_CLUSTER_DASH_URL: ${var.grafana_cluster_dashboard_url}
+GRAFANA_APISERVER_BASIC_DASH_URL: ${var.grafana_apiserver_basic_dashboard_url}
+GRAFANA_APISERVER_ADVANCED_DASH_URL: ${var.grafana_apiserver_advanced_dashboard_url}
+GRAFANA_APISERVER_TROUBLESHOOTING_DASH_URL: ${var.grafana_apiserver_troubleshooting_dashboard_url}
 GRAFANA_KUBELET_DASH_URL: ${var.grafana_kubelet_dashboard_url}
 GRAFANA_NSWRKLDS_DASH_URL: ${var.grafana_namespace_workloads_dashboard_url}
 GRAFANA_NODEEXP_DASH_URL: ${var.grafana_node_exporter_dashboard_url}

modules/eks-monitoring/main.tf

Lines changed: 4 additions & 0 deletions
@@ -153,6 +153,10 @@ module "helm_addon" {
       name = "javaPrometheusMetricsEndpoint"
       value = try(var.java_config.prometheus_metrics_endpoint, local.java_pattern_config.prometheus_metrics_endpoint)
     },
+    {
+      name = "enableAPIserver"
+      value = var.enable_apiserver_monitoring
+    },
     {
       name = "enableNginx"
       value = var.enable_nginx

modules/eks-monitoring/otel-config/templates/opentelemetrycollector.yaml

Lines changed: 19 additions & 8 deletions
@@ -68,24 +68,35 @@ spec:
       regex: (.+)
       target_label: __metrics_path__
       replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
-- job_name: 'kube-admin'
+
+{{ if .Values.enableAPIserver }}
+- job_name: 'apiserver'
   scheme: https
   tls_config:
     ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     insecure_skip_verify: true
   bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
   kubernetes_sd_configs:
-    - role: node
+    - role: endpoints
   relabel_configs:
-    - target_label: __address__
-      replacement: kubernetes.default.svc.cluster.local:443
-    - action: keep
-      regex: $K8S_NODE_NAME
-      source_labels: [__meta_kubernetes_node_name]
+    - source_labels:
+        [
+          __meta_kubernetes_namespace,
+          __meta_kubernetes_service_name,
+          __meta_kubernetes_endpoint_port_name,
+        ]
+      action: keep
+      regex: default;kubernetes;https
   metric_relabel_configs:
     - action: keep
       source_labels: [__name__]
-      regex: 'apiserver_(request_duration_seconds|storage_list_duration_seconds|admission_controller_admission_duration_seconds|flowcontrol_request_wait_duration_seconds).*|apiserver_(admission_webhook_fail_open_count|tls_handshake_errors_total|request_total)|rest_client_request_duration_seconds.*|rest_client_requests_total|etcd_(request_duration_seconds|db_total_size_in_bytes).*'
+    - source_labels: [__name__, le]
+      separator: ;
+      regex: apiserver_request_duration_seconds_bucket;(0.15|0.2|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2|3|3.5|4|4.5|6|7|8|9|15|25|40|50)
+      replacement: $1
+      action: drop
+{{ end }}
+
 - job_name: serviceMonitor/default/kube-prometheus-stack-prometheus-node-exporter/0
   honor_timestamps: true
   scrape_interval: {{ .Values.globalScrapeInterval }}

modules/eks-monitoring/otel-config/values.yaml

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@ accountId: ${account_id}
 globalScrapeTimeout: ${global_scrape_timeout}
 globalScrapeSampleLimit: ${global_scrape_sample_limit}

+enableAPIserver: ${enable_apiserver_monitoring}
+
 enableTracing: ${enable_tracing}
 otlpGrpcEndpoint: ${otlp_grpc_endpoint}
 otlpHttpEndpoint: ${otlp_http_endpoint}

modules/eks-monitoring/rules.tf

Lines changed: 114 additions & 0 deletions
@@ -238,5 +238,119 @@ groups:
       expr: max by(cluster, namespace, workload, pod) (label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="Job"}, "workload", "$1", "owner_name", "(.*)"))
       labels:
         workload_type: job
+  - name: infra-rules-05
+    rules:
+    - expr: sum by (cluster, code, verb) (increase(apiserver_request_total{job="apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"2.."}[1h]))
+      record: code_verb:apiserver_request_total:increase1h
+    - expr: sum by (cluster, code, verb) (increase(apiserver_request_total{job="apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"3.."}[1h]))
+      record: code_verb:apiserver_request_total:increase1h
+    - expr: sum by (cluster, code, verb) (increase(apiserver_request_total{job="apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"4.."}[1h]))
+      record: code_verb:apiserver_request_total:increase1h
+    - expr: sum by (cluster, code, verb) (increase(apiserver_request_total{job="apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"5.."}[1h]))
+      record: code_verb:apiserver_request_total:increase1h
+    - expr: sum by (cluster,code,resource) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[5m]))
+      labels:
+        verb: read
+      record: code_resource:apiserver_request_total:rate5m
+    - expr: sum by (cluster,code,resource) (rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
+      labels:
+        verb: write
+      record: code_resource:apiserver_request_total:rate5m
+    - expr: sum by (cluster, verb, scope, le) (increase(apiserver_request_slo_duration_seconds_bucket[1h]))
+      record: cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase1h
+    - expr: sum by (cluster, verb, scope, le) (avg_over_time(cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase1h[30d])
+        * 24 * 30)
+      record: cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d
+    - expr: |-
+        1 - (
+        (
+        # write too slow
+        sum by (cluster) (cluster_verb_scope:apiserver_request_slo_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"})
+        -
+        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"POST|PUT|PATCH|DELETE",le="1"})
+        ) +
+        (
+        # read too slow
+        sum by (cluster) (cluster_verb_scope:apiserver_request_slo_duration_seconds_count:increase30d{verb=~"LIST|GET"})
+        -
+        (
+        (
+        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope=~"resource|",le="1"})
+        or
+        vector(0)
+        )
+        +
+        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="namespace",le="5"})
+        +
+        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="cluster",le="30"})
+        )
+        ) +
+        # errors
+        sum by (cluster) (code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
+        )
+        /
+        sum by (cluster) (code:apiserver_request_total:increase30d)
+      labels:
+        verb: all
+      record: apiserver_request:availability30d
+    - expr: |-
+        1 - (
+        sum by (cluster) (cluster_verb_scope:apiserver_request_slo_duration_seconds_count:increase30d{verb=~"LIST|GET"})
+        -
+        (
+        # too slow
+        (
+        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope=~"resource|",le="1"})
+        or
+        vector(0)
+        )
+        +
+        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="namespace",le="5"})
+        +
+        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="cluster",le="30"})
+        )
+        +
+        # errors
+        sum by (cluster) (code:apiserver_request_total:increase30d{verb="read",code=~"5.."} or vector(0))
+        )
+        /
+        sum by (cluster) (code:apiserver_request_total:increase30d{verb="read"})
+      labels:
+        verb: read
+      record: apiserver_request:availability30d
+    - expr: |-
+        1 - (
+        (
+        # too slow
+        sum by (cluster) (cluster_verb_scope:apiserver_request_slo_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"})
+        -
+        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"POST|PUT|PATCH|DELETE",le="1"})
+        )
+        +
+        # errors
+        sum by (cluster) (code:apiserver_request_total:increase30d{verb="write",code=~"5.."} or vector(0))
+        )
+        /
+        sum by (cluster) (code:apiserver_request_total:increase30d{verb="write"})
+      labels:
+        verb: write
+      record: apiserver_request:availability30d
+    - expr: histogram_quantile(0.99, sum by (cluster, le, resource) (rate(apiserver_request_slo_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward"}[5m])))
+        > 0
+      labels:
+        quantile: "0.99"
+        verb: read
+      record: cluster_quantile:apiserver_request_slo_duration_seconds:histogram_quantile
+    - expr: histogram_quantile(0.99, sum by (cluster, le, resource) (rate(apiserver_request_slo_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward"}[5m])))
+        > 0
+      labels:
+        quantile: "0.99"
+        verb: write
+      record: cluster_quantile:apiserver_request_slo_duration_seconds:histogram_quantile
+    - expr: |
+        histogram_quantile(0.9, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod))
+      labels:
+        quantile: "0.9"
+      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
 EOF
 }

modules/eks-monitoring/variables.tf

Lines changed: 24 additions & 0 deletions
@@ -201,6 +201,12 @@ variable "prometheus_config" {
   nullable = false
 }

+variable "enable_apiserver_monitoring" {
+  description = "Enable EKS kube-apiserver monitoring, alerting and dashboards"
+  type = bool
+  default = true
+}
+
 variable "enable_tracing" {
   description = "Enables tracing with OTLP traces receiver to X-Ray"
   type = bool
@@ -440,6 +446,24 @@ variable "grafana_cluster_dashboard_url" {
   default = "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json"
 }

+variable "grafana_apiserver_basic_dashboard_url" {
+  description = "Dashboard URL for Kube-apiserver (basic) Grafana Dashboard JSON"
+  type = string
+  default = "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/apiserver-basic.json"
+}
+
+variable "grafana_apiserver_advanced_dashboard_url" {
+  description = "Dashboard URL for Kube-apiserver (advanced) Grafana Dashboard JSON"
+  type = string
+  default = "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/apiserver-advanced.json"
+}
+
+variable "grafana_apiserver_troubleshooting_dashboard_url" {
+  description = "Dashboard URL for Kube-apiserver (troubleshooting) Grafana Dashboard JSON"
+  type = string
+  default = "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/apiserver-troubleshooting.json"
+}
+
 variable "grafana_kubelet_dashboard_url" {
   description = "Dashboard URL for Kubelet Grafana Dashboard JSON"
   type = string
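
The three `grafana_apiserver_*_dashboard_url` variables default to dashboard JSON files on the upstream `main` branch, so a caller who maintains a modified dashboard could presumably point one of them elsewhere. A minimal sketch follows; the `source` path, the `var.eks_cluster_id` reference, and the `example.com` URL are all hypothetical and only serve to illustrate the override.

```hcl
module "eks_monitoring" {
  # Assumed relative path, for illustration only.
  source = "../../modules/eks-monitoring"

  # Listed as required in the module README; the variable on the right is hypothetical.
  eks_cluster_id = var.eks_cluster_id

  # Other inputs omitted in this sketch.

  # Hypothetical URL: replace with the raw URL of your own dashboard JSON.
  grafana_apiserver_basic_dashboard_url = "https://example.com/dashboards/apiserver-basic.json"
}
```

The advanced and troubleshooting variables, left unset here, would keep the defaults shown in the hunk above.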
