Add Kubernetes admin dashboards for api-server (#208)
* added workqueue related and apiserver_storage_db_total_size_in_bytes (available since K8s/EKS v1.26+) metrics into kube-admin scrape job
* changed OTEL scrape config and recording rules file to make apiserver Grafana dashboards work
* added APISERVER Grafana dashboards into variables, Flux kustomization
* added original kube-prom-stack kube-apiserver scrape config into OTEL
* Revert "added original kube-prom-stack kube-apiserver scrape config into OTEL"
because this scrape config does not work :-(
This reverts commit 715db89.
* added API server troubleshooting dashboard
* removed empty line for clarity
* made API server monitoring default to true, with the option to disable it
* Update eks-apiserver.md
* updated eks-monitoring/README.md according to pre-commit
---------
Co-authored-by: Jens-Uwe Walther <[email protected]>
AWS Distro for OpenTelemetry enables EKS API server monitoring by default and provides three Grafana dashboards:
## Kube-apiserver (basic)
The basic dashboard shows the metrics recommended in [EKS Best Practices Guides - Monitor Control Plane Metrics](https://aws.github.io/aws-eks-best-practices/reliability/docs/controlplane/#monitor-control-plane-metrics). It provides request rate and latency for the API server, latency for the etcd server, and overall workqueue service time and latency. It allows a drill-down per API server.
The advanced dashboard is derived from the kube-prometheus-stack "Kubernetes / API server" dashboard and provides a detailed metrics drill-down, for example by READ and WRITE operations per component (such as deployments, configmaps, etc.).
This dashboard can be used to troubleshoot API server problems such as latency and errors.
A detailed description of the dashboard's usage, along with background information, can be found in the AWS Containers blog post [Troubleshooting Amazon EKS API servers with Prometheus](https://aws.amazon.com/blogs/containers/troubleshooting-amazon-eks-api-servers-with-prometheus/).
| <a name="input_enable_alerting_rules"></a> [enable\_alerting\_rules](#input\_enable\_alerting\_rules) | Enables or disables Managed Prometheus alerting rules | `bool` | `true` | no |
| <a name="input_enable_amazon_eks_adot"></a> [enable\_amazon\_eks\_adot](#input\_enable\_amazon\_eks\_adot) | Enables the ADOT Operator on the EKS Cluster | `bool` | `true` | no |
| <a name="input_enable_apiserver_monitoring"></a> [enable\_apiserver\_monitoring](#input\_enable\_apiserver\_monitoring) | Enable EKS kube-apiserver monitoring, alerting and dashboards | `bool` | `true` | no |
| <a name="input_enable_cert_manager"></a> [enable\_cert\_manager](#input\_enable\_cert\_manager) | Allow reusing an existing installation of cert-manager | `bool` | `true` | no |
| <a name="input_enable_custom_metrics"></a> [enable\_custom\_metrics](#input\_enable\_custom\_metrics) | Allows additional metrics collection for config elements in the `custom_metrics_config` config object. Automatic dashboards are not included | `bool` | `false` | no |
| <a name="input_enable_dashboards"></a> [enable\_dashboards](#input\_enable\_dashboards) | Enables or disables curated dashboards | `bool` | `true` | no |
@@ -93,6 +94,9 @@ See examples using this Terraform modules in the **Amazon EKS** section of [this
| <a name="input_flux_kustomization_path"></a> [flux\_kustomization\_path](#input\_flux\_kustomization\_path) | Flux Kustomization Path | `string` | `"./artifacts/grafana-operator-manifests/eks/infrastructure"` | no |
| <a name="input_grafana_api_key"></a> [grafana\_api\_key](#input\_grafana\_api\_key) | Grafana API key for the Amazon Managed Grafana workspace. Required if `enable_external_secrets = true` | `string` | `""` | no |
| <a name="input_grafana_apiserver_advanced_dashboard_url"></a> [grafana\_apiserver\_advanced\_dashboard\_url](#input\_grafana\_apiserver\_advanced\_dashboard\_url) | Dashboard URL for Kube-apiserver (advanced) Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/apiserver-advanced.json"` | no |
| <a name="input_grafana_apiserver_basic_dashboard_url"></a> [grafana\_apiserver\_basic\_dashboard\_url](#input\_grafana\_apiserver\_basic\_dashboard\_url) | Dashboard URL for Kube-apiserver (basic) Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/apiserver-basic.json"` | no |
| <a name="input_grafana_apiserver_troubleshooting_dashboard_url"></a> [grafana\_apiserver\_troubleshooting\_dashboard\_url](#input\_grafana\_apiserver\_troubleshooting\_dashboard\_url) | Dashboard URL for Kube-apiserver (troubleshooting) Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/apiserver-troubleshooting.json"` | no |
| <a name="input_grafana_cluster_dashboard_url"></a> [grafana\_cluster\_dashboard\_url](#input\_grafana\_cluster\_dashboard\_url) | Dashboard URL for Cluster Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json"` | no |
| <a name="input_grafana_kubelet_dashboard_url"></a> [grafana\_kubelet\_dashboard\_url](#input\_grafana\_kubelet\_dashboard\_url) | Dashboard URL for Kubelet Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json"` | no |
| <a name="input_grafana_namespace_workloads_dashboard_url"></a> [grafana\_namespace\_workloads\_dashboard\_url](#input\_grafana\_namespace\_workloads\_dashboard\_url) | Dashboard URL for Namespace Workloads Grafana Dashboard JSON | `string` | `"https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json"` | no |
        sum by (cluster) (cluster_verb_scope:apiserver_request_slo_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"})
        -
        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"POST|PUT|PATCH|DELETE",le="1"})
      ) +
      (
        # read too slow
        sum by (cluster) (cluster_verb_scope:apiserver_request_slo_duration_seconds_count:increase30d{verb=~"LIST|GET"})
        -
        (
          (
            sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope=~"resource|",le="1"})
            or
            vector(0)
          )
          +
          sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="namespace",le="5"})
          +
          sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="cluster",le="30"})
        )
      ) +
      # errors
      sum by (cluster) (code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
    )
    /
    sum by (cluster) (code:apiserver_request_total:increase30d)
  labels:
    verb: all
  record: apiserver_request:availability30d
- expr: |-
    1 - (
      sum by (cluster) (cluster_verb_scope:apiserver_request_slo_duration_seconds_count:increase30d{verb=~"LIST|GET"})
      -
      (
        # too slow
        (
          sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope=~"resource|",le="1"})
          or
          vector(0)
        )
        +
        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="namespace",le="5"})
        +
        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="cluster",le="30"})
      )
      +
      # errors
      sum by (cluster) (code:apiserver_request_total:increase30d{verb="read",code=~"5.."} or vector(0))
    )
    /
    sum by (cluster) (code:apiserver_request_total:increase30d{verb="read"})
  labels:
    verb: read
  record: apiserver_request:availability30d
- expr: |-
    1 - (
      (
        # too slow
        sum by (cluster) (cluster_verb_scope:apiserver_request_slo_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"})
        -
        sum by (cluster) (cluster_verb_scope_le:apiserver_request_slo_duration_seconds_bucket:increase30d{verb=~"POST|PUT|PATCH|DELETE",le="1"})
      )
      +
      # errors
      sum by (cluster) (code:apiserver_request_total:increase30d{verb="write",code=~"5.."} or vector(0))
    )
    /
    sum by (cluster) (code:apiserver_request_total:increase30d{verb="write"})
  labels:
    verb: write
  record: apiserver_request:availability30d
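The arithmetic behind the `apiserver_request:availability30d` rules can be sketched in Python using the write-verb variant: availability is one minus the share of requests that were either too slow (not in the `le="1"` bucket) or failed with a 5xx code. The function name and all counts below are hypothetical and chosen only for illustration. In the actual rules the inputs come from the `increase30d` recording rules.

```python
def write_availability(slo_count, within_1s, errors_5xx, request_total):
    """Mirror of the write rule: 1 - ((slo_count - within_1s) + errors_5xx) / request_total.

    All arguments are 30-day increases (hypothetical values in this sketch).
    """
    if request_total <= 0:
        raise ValueError("request_total must be positive")
    too_slow = slo_count - within_1s  # requests that took longer than 1s
    return 1 - (too_slow + errors_5xx) / request_total


# Hypothetical 30-day counts for POST|PUT|PATCH|DELETE requests.
result = write_availability(
    slo_count=1_000_000,      # ..._seconds_count:increase30d
    within_1s=999_200,        # ..._seconds_bucket:increase30d{le="1"}
    errors_5xx=300,           # code:apiserver_request_total:increase30d{code=~"5.."}
    request_total=1_000_000,  # code:apiserver_request_total:increase30d
)
print(f"write availability: {result:.4%}")
```

With these made-up numbers, 800 slow requests plus 300 errors out of one million requests count against the availability. The read rule follows the same shape, except that the "fast enough" threshold depends on the request scope (1s, 5s or 30s).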
- expr: histogram_quantile(0.99, sum by (cluster, le, resource) (rate(apiserver_request_slo_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward"}[5m])))
- expr: histogram_quantile(0.99, sum by (cluster, le, resource) (rate(apiserver_request_slo_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward"}[5m])))
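As a rough illustration of what `histogram_quantile(0.99, ...)` computes in the two expressions above, the following Python sketch applies linear interpolation inside cumulative `le` buckets. It is a simplified approximation of the behavior for classic histograms, not Prometheus' actual implementation, and the bucket bounds and counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Simplified sketch: assumes cumulative counts, a +Inf bucket, and a
    lower bound of 0 for the first bucket.
    """
    buckets = sorted(buckets)
    top_bound, total = buckets[-1]
    if not math.isinf(top_bound) or total == 0:
        return math.nan  # no +Inf bucket or no observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls into the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative bucket counts for one resource (le in seconds).
example_buckets = [(0.5, 900), (1.0, 980), (5.0, 998), (math.inf, 1000)]
p99 = histogram_quantile(0.99, example_buckets)
print(f"estimated p99 latency: {p99:.3f}s")
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the `le` boundaries are spaced around the quantile of interest.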