
Commit f65afff

docs: Update troubleshooting and navigation (#228)
* Update docs structure with regrouped patterns
* Rename multi cluster to avoid ambiguity with cross account
* Separate troubleshooting for eks-monitoring
1 parent 9498a06 commit f65afff

File tree

4 files changed

+272
-82
lines changed


docs/eks/index.md

Lines changed: 0 additions & 73 deletions
@@ -191,76 +191,3 @@ sum(up{job="custom-metrics"}) by (container_name, cluster, nodename)

<img width="2560" alt="Screenshot 2023-01-31 at 11 16 21" src="https://user-images.githubusercontent.com/10175027/215869004-e05f557d-c81a-41fb-a452-ede9f986cb27.png">

## Troubleshooting

### 1. Grafana dashboards missing or Grafana API key expired

If you don't see the Grafana dashboards in your Amazon Managed Grafana console, check the logs of your grafana-operator pod with the commands below:

```bash
kubectl get pods -n grafana-operator
```

Output:

```console
NAME                                READY   STATUS    RESTARTS   AGE
grafana-operator-866d4446bb-nqq5c   1/1     Running   0          3h17m
```

```bash
kubectl logs grafana-operator-866d4446bb-nqq5c -n grafana-operator
```

Output:

```console
1.6857285045556655e+09 ERROR error reconciling datasource {"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample-amp","namespace":"grafana-operator"}, "namespace": "grafana-operator", "name": "grafanadatasource-sample-amp", "reconcileID": "72cfd60c-a255-44a1-bfbd-88b0cbc4f90c", "datasource": "grafanadatasource-sample-amp", "grafana": "external-grafana", "error": "status: 401, body: {\"message\":\"Expired API key\"}\n"}
github.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile
```

If you observe the above `Expired API key` error in the logs, your Grafana API key has expired. Use the following operational procedure to update your `grafana-api-key`:

- First, create a new Grafana API key:

```bash
export GO_AMG_API_KEY=$(aws grafana create-workspace-api-key \
  --key-name "grafana-operator-key-new" \
  --key-role "ADMIN" \
  --seconds-to-live 432000 \
  --workspace-id <YOUR_WORKSPACE_ID> \
  --query key \
  --output text)
```

- Then, update the Grafana API key secret in AWS SSM Parameter Store with the new key:

```bash
aws ssm put-parameter \
  --name "/terraform-accelerator/grafana-api-key" \
  --type "SecureString" \
  --value "{\"GF_SECURITY_ADMIN_APIKEY\": \"${GO_AMG_API_KEY}\"}" \
  --region <Your AWS Region>
```

- If the issue persists, you can force the synchronization by deleting the `externalsecret` Kubernetes object:

```bash
kubectl delete externalsecret/external-secrets-sm -n grafana-operator
```

### 2. Upgrade from 2.1.0 or earlier

When you upgrade the eks-monitoring module from v2.1.0 or earlier, the following error may occur:

```console
Error: cannot patch "prometheus-node-exporter" with kind DaemonSet: DaemonSet.apps "prometheus-node-exporter" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/instance":"prometheus-node-exporter", "app.kubernetes.io/name":"prometheus-node-exporter"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
```

This is due to the upgrade of the node-exporter chart from v2 to v4. Manually delete the node-exporter DaemonSet as described in [the chart's upgrade notes](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-node-exporter#3x-to-4x), then apply again:

```bash
kubectl -n prometheus-node-exporter delete daemonset -l app=prometheus-node-exporter
terraform apply
```
docs/eks/multicluster.md

Lines changed: 5 additions & 2 deletions
@@ -1,6 +1,9 @@
-# AWS EKS Multicluster Observability
+# AWS EKS Multicluster Observability (single AWS Account)
 
-This example shows how to use the [AWS Observability Accelerator](https://github.com/aws-observability/terraform-aws-observability-accelerator), with more than one EKS cluster and verify the collected metrics from all the clusters in the dashboards of a common `Amazon Managed Grafana` workspace.
+This example shows how to use the [AWS Observability Accelerator](https://github.com/aws-observability/terraform-aws-observability-accelerator),
+with more than one EKS cluster in a single account and visualize the collected
+metrics from all the clusters in the dashboards of a common
+`Amazon Managed Grafana` workspace.
 
 ## Prerequisites
docs/eks/troubleshooting.md

Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@

# Troubleshooting guide for the Amazon EKS monitoring module

Depending on your setup, you might face a few errors. If you encounter an error
not listed here, please open an issue in the [issues section](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues).

This guide applies to the [eks-monitoring Terraform module](https://github.com/aws-observability/terraform-aws-observability-accelerator/tree/main/modules/eks-monitoring).

## Cluster authentication issue

### Error message

```console
│ Error: cluster-secretstore-sm failed to create kubernetes rest client for update of resource: Get "https://FINGERPRINT.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup F867DE6CE883F9595FC8A73D84FB9F83.gr7.us-east-1.eks.amazonaws.com on 192.168.4.1:53: no such host
│
│   with module.eks_monitoring.module.external_secrets[0].kubectl_manifest.cluster_secretstore,
│   on ../../modules/eks-monitoring/add-ons/external-secrets/main.tf line 59, in resource "kubectl_manifest" "cluster_secretstore":
│   59: resource "kubectl_manifest" "cluster_secretstore" {
│
│ Error: grafana-operator/external-secrets-sm failed to create kubernetes rest client for update of resource: Get "https://FINGERPRINT.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup F867DE6CE883F9595FC8A73D84FB9F83.gr7.us-east-1.eks.amazonaws.com on 192.168.4.1:53: no such host
│
│   with module.eks_monitoring.module.external_secrets[0].kubectl_manifest.secret,
│   on ../../modules/eks-monitoring/add-ons/external-secrets/main.tf line 89, in resource "kubectl_manifest" "secret":
│   89: resource "kubectl_manifest" "secret" {
```

### Resolution

To provision the `eks-monitoring` module, the environment where you run
`terraform apply` must be authenticated against your cluster, and that cluster
must be your current context. To verify, run a single `kubectl get nodes`
command and check that you are using the correct Amazon EKS cluster.

To log in against the correct cluster, run:

```bash
aws eks update-kubeconfig --name <cluster name> --region <aws region>
```

## Missing Grafana dashboards

Terraform apply can run without apparent errors while your Grafana workspace
doesn't present any dashboards. Many situations can lead to this, as described
below. The best place to start is checking the logs of the `grafana-operator`,
`external-secrets` and `flux-system` pods.

### Wrong Grafana workspace

It might happen that you provided the wrong Grafana workspace. One way to verify
this is to run the following command:

```bash
kubectl describe grafanas external-grafana -n grafana-operator
```

You should see an output similar to this (truncated for brevity). Validate that
the URL is correct. If it is not, re-running Terraform with the correct
workspace ID and API key should fix the issue.

```console
...
Spec:
  External:
    API Key:
      Key:   GF_SECURITY_ADMIN_APIKEY
      Name:  grafana-admin-credentials
    URL:     https://g-workspaceid.grafana-workspace.eu-central-1.amazonaws.com
Status:
  Admin URL:  https://g-workspaceid.grafana-workspace.eu-central-1.amazonaws.com
  Dashboards:
    grafana-operator/apiserver-troubleshooting-grafanadashboard/V3y_Zcb7k
    grafana-operator/apiserver-basic-grafanadashboard/R6abPf9Zz
    grafana-operator/java-grafanadashboard/m9mHfAy7ks
    grafana-operator/grafana-dashboards-adothealth/reshmanat
    grafana-operator/apiserver-advanced-grafanadashboard/09ec8aa1e996d6ffcd6817bbaff4db1b
    grafana-operator/nginx-grafanadashboard/nginx
    grafana-operator/kubelet-grafanadashboard/3138fa155d5915769fbded898ac09fd9
    grafana-operator/cluster-grafanadashboard/efa86fd1d0c121a26444b636a3f509a8
    grafana-operator/workloads-grafanadashboard/a164a7f0339f99e89cea5cb47e9be617
    grafana-operator/grafana-dashboards-kubeproxy/632e265de029684c40b21cb76bca4f94
    grafana-operator/nodes-grafanadashboard/200ac8fdbfbb74b39aff88118e4d1c2c
    grafana-operator/node-exporter-grafanadashboard/v8yDYJqnz
    grafana-operator/namespace-workloads-grafanadashboard/a87fb0d919ec0ea5f6543124e16c42a5
```
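The workspace check can also be done mechanically: the Amazon Managed Grafana endpoint starts with the workspace ID. A minimal shell sketch, where `g-abcd1234` and the URL are illustrative placeholders rather than values from a real setup:

```shell
# Illustrative values only: substitute your own workspace ID and the URL
# reported by `kubectl describe grafanas`.
WORKSPACE_ID="g-abcd1234"
GRAFANA_URL="https://g-abcd1234.grafana-workspace.eu-central-1.amazonaws.com"

# The endpoint hostname begins with the workspace ID, so a simple
# prefix match tells you whether they agree.
case "$GRAFANA_URL" in
  "https://${WORKSPACE_ID}.grafana-workspace."*) RESULT="match" ;;
  *) RESULT="mismatch" ;;
esac
echo "$RESULT"
```

If the result is `mismatch`, the Grafana CR points at a different workspace than the one you intended.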
### Grafana API key expired

Check the logs of your grafana-operator pod with the commands below:

```bash
kubectl get pods -n grafana-operator
```

Output:

```console
NAME                                READY   STATUS    RESTARTS   AGE
grafana-operator-866d4446bb-nqq5c   1/1     Running   0          3h17m
```

```bash
kubectl logs grafana-operator-866d4446bb-nqq5c -n grafana-operator
```

Output:

```console
1.6857285045556655e+09 ERROR error reconciling datasource {"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample-amp","namespace":"grafana-operator"}, "namespace": "grafana-operator", "name": "grafanadatasource-sample-amp", "reconcileID": "72cfd60c-a255-44a1-bfbd-88b0cbc4f90c", "datasource": "grafanadatasource-sample-amp", "grafana": "external-grafana", "error": "status: 401, body: {\"message\":\"Expired API key\"}\n"}
github.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile
```

If you observe the above `Expired API key` error in the logs, your Grafana API
key has expired.

Use the following operational procedure to update your `grafana-api-key`:

- Create a new Grafana API key. You can use [this step](https://aws-observability.github.io/terraform-aws-observability-accelerator/eks/#6-grafana-api-key);
make sure the API key duration is not too short.

- Run Terraform with the new API key. Terraform will modify the AWS SSM
Parameter used by `externalsecret`.

- If the issue persists, you can force the synchronization by deleting the
`externalsecret` Kubernetes object:

```bash
kubectl delete externalsecret/external-secrets-sm -n grafana-operator
```
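For reference, the SSM parameter that `externalsecret` reads stores the key as a JSON object under a `GF_SECURITY_ADMIN_APIKEY` field (this is the format used by the module's documented rotation procedure). A sketch of building that payload before writing it, using a placeholder key value:

```shell
# Placeholder value; in practice this comes from
# `aws grafana create-workspace-api-key ... --query key --output text`.
GO_AMG_API_KEY="example-key-value"

# The secret payload must be a JSON object keyed by GF_SECURITY_ADMIN_APIKEY.
PAYLOAD="{\"GF_SECURITY_ADMIN_APIKEY\": \"${GO_AMG_API_KEY}\"}"
echo "$PAYLOAD"
```

You would then pass `$PAYLOAD` as the `--value` of `aws ssm put-parameter --name "/terraform-accelerator/grafana-api-key" --type "SecureString"` if you rotate the key manually instead of through Terraform.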
### Git repository errors

[Flux](https://fluxcd.io/flux/components/source/gitrepositories/) is responsible
for regularly pulling and synchronizing [dashboards and artifacts](https://github.com/aws-observability/aws-observability-accelerator)
into your EKS cluster. It might happen that its state gets corrupted.

You can check for these errors with the following command; the STATUS column
will show an error if Flux is not able to pull correctly:

```bash
kubectl get gitrepositories -n flux-system
NAME                            URL                                                                   AGE     READY   STATUS
aws-observability-accelerator   https://github.com/aws-observability/aws-observability-accelerator   6d12h   True    stored artifact for revision 'v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d'
```

Depending on the error, you can delete the repository and re-run Terraform to
force the synchronization:

```bash
kubectl delete gitrepositories aws-observability-accelerator -n flux-system
```

If you believe this is a bug, please open an issue [here](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues).

### Flux Kustomizations

After Flux pulls the repository into the cluster state, it applies [Kustomizations](https://fluxcd.io/flux/components/kustomize/kustomizations/)
to create Grafana data sources, folders and dashboards.

- Check the Kustomization objects. Here you should see the dashboards you have
enabled:

```bash
kubectl get kustomizations.kustomize.toolkit.fluxcd.io -A
NAMESPACE     NAME                                AGE   READY   STATUS
flux-system   grafana-dashboards-adothealth       18d   True    Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system   grafana-dashboards-apiserver        18d   True    Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system   grafana-dashboards-infrastructure   10d   True    Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system   grafana-dashboards-java             18d   True    Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system   grafana-dashboards-kubeproxy        10d   True    Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system   grafana-dashboards-nginx            18d   True    Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
```

- For more information on an error, view the Kustomize controller logs:

```bash
kubectl get pods -n flux-system
NAME                                          READY   STATUS    RESTARTS      AGE
helm-controller-65cc46469f-nsqd5              1/1     Running   2 (13d ago)   27d
image-automation-controller-d8f7bfcb4-k2m9j   1/1     Running   2 (13d ago)   27d
image-reflector-controller-68979dfd49-wh25h   1/1     Running   2 (13d ago)   27d
kustomize-controller-767677f7f5-c5xsp         1/1     Running   5 (13d ago)   63d
notification-controller-55d8c759f5-7df5l      1/1     Running   5 (13d ago)   63d
source-controller-58c66d55cd-4j6bl            1/1     Running   5 (13d ago)   63d
```

```bash
kubectl logs -f -n flux-system kustomize-controller-767677f7f5-c5xsp
```

If you believe there is a bug, please open an issue [here](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues).

- Depending on the error, delete the Kustomization object and re-apply
Terraform:

```bash
kubectl delete kustomizations -n flux-system grafana-dashboards-apiserver
```

### Grafana dashboards errors

If all of the above seem normal, finally inspect the deployed dashboards by
running this command:

```bash
kubectl get grafanadashboards -A
NAMESPACE          NAME                                         AGE
grafana-operator   apiserver-advanced-grafanadashboard          18d
grafana-operator   apiserver-basic-grafanadashboard             18d
grafana-operator   apiserver-troubleshooting-grafanadashboard   18d
grafana-operator   cluster-grafanadashboard                     10d
grafana-operator   grafana-dashboards-adothealth                18d
grafana-operator   grafana-dashboards-kubeproxy                 10d
grafana-operator   java-grafanadashboard                        18d
grafana-operator   kubelet-grafanadashboard                     10d
grafana-operator   namespace-workloads-grafanadashboard         10d
grafana-operator   nginx-grafanadashboard                       18d
grafana-operator   node-exporter-grafanadashboard               10d
grafana-operator   nodes-grafanadashboard                       10d
grafana-operator   workloads-grafanadashboard                   10d
```

- You can dive into the details of a dashboard by running:

```bash
kubectl describe grafanadashboards grafana-dashboards-kubeproxy -n grafana-operator
```

- Depending on the error, you can delete the dashboard object. In this case,
you don't need to re-run Terraform, as the Flux Kustomization will force its
recreation through the Grafana operator:

```bash
kubectl delete grafanadashboards grafana-dashboards-kubeproxy -n grafana-operator
```

If you believe there is a bug, please open an issue [here](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues).

## Upgrade to v2.5.0 or later

v2.5.0 removes the dependency on the Terraform Grafana provider in the EKS
monitoring module. As the Grafana Operator manages and syncs the Grafana
contents, Terraform is not required anymore in this context.

However, if you migrate from earlier versions, you might leave some resources
orphaned as the Grafana provider is dropped, and Terraform will throw an error.
We have released v2.5.0-rc.1, which removes all the Grafana resources
provisioned by Terraform in the EKS context without removing the provider
configurations.

- Step 1: migrate to v2.5.0-rc.1 and run apply
- Step 2: migrate to v2.5.0 or above
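One way to drive the two-step migration is to pin the module `ref` explicitly for each step. A sketch assuming a Git-sourced module block (the file name and the surrounding module inputs are illustrative, not prescribed by the accelerator):

```shell
# Step 1: pin the module to v2.5.0-rc.1, then run
# `terraform init -upgrade && terraform apply`.
# Step 2: change the ref to v2.5.0 (or later) and apply again.
cat > eks-monitoring-version.tf <<'EOF'
module "eks_monitoring" {
  source = "github.com/aws-observability/terraform-aws-observability-accelerator//modules/eks-monitoring?ref=v2.5.0-rc.1"
  # ... your existing module inputs ...
}
EOF

# Show the currently pinned ref (step 1).
grep -o 'ref=v[^"]*' eks-monitoring-version.tf
```

Pinning the `ref` this way makes each migration step an explicit, reviewable change rather than an implicit upgrade.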

mkdocs.yml

Lines changed: 10 additions & 7 deletions
@@ -26,15 +26,18 @@ nav:
   - Home: index.md
   - Concepts: concepts.md
   - Amazon EKS:
-    - Infrastructure monitoring: eks/index.md
-    - EKS API server monitoring: eks/eks-apiserver.md
-    - Multicluster monitoring: eks/multicluster.md
-    - Cross Account monitoring: eks/multiaccount.md
-    - Java/JMX: eks/java.md
-    - Nginx: eks/nginx.md
-    - Istio: eks/istio.md
+    - Infrastructure: eks/index.md
+    - EKS API server: eks/eks-apiserver.md
+    - Multicluster:
+      - Single AWS account: eks/multicluster.md
+      - Cross AWS account: eks/multiaccount.md
     - Viewing logs: eks/logs.md
     - Tracing: eks/tracing.md
+    - Patterns:
+      - Java/JMX: eks/java.md
+      - Nginx: eks/nginx.md
+      - Istio: eks/istio.md
+    - Troubleshooting: eks/troubleshooting.md
     - Teardown: eks/destroy.md
   - AWS Distro for OpenTelemetry (ADOT):
     - Monitoring ADOT collector health: adothealth/index.md
