|
| 1 | +# Troubleshooting guide for Amazon EKS monitoring module |
| 2 | + |
| 3 | +Depending on your setup, you might face a few errors. If you encounter an error |
| 4 | +not listed here, please open an issue in the [issues section](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues). |
| 5 | + |
| 6 | +This guide applies to the [eks-monitoring Terraform module](https://github.com/aws-observability/terraform-aws-observability-accelerator/tree/main/modules/eks-monitoring). |
| 7 | + |
| 8 | + |
| 9 | +## Cluster authentication issue |
| 10 | + |
| 11 | +### Error message |
| 12 | + |
| 13 | +```console |
| 14 | +╷ |
| 15 | +│ Error: cluster-secretstore-sm failed to create kubernetes rest client for update of resource: Get "https://FINGERPRINT.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup F867DE6CE883F9595FC8A73D84FB9F83.gr7.us-east-1.eks.amazonaws.com on 192.168.4.1:53: no such host |
| 16 | +│ |
| 17 | +│ with module.eks_monitoring.module.external_secrets[0].kubectl_manifest.cluster_secretstore, |
| 18 | +│ on ../../modules/eks-monitoring/add-ons/external-secrets/main.tf line 59, in resource "kubectl_manifest" "cluster_secretstore": |
| 19 | +│ 59: resource "kubectl_manifest" "cluster_secretstore" { |
| 20 | +│ |
| 21 | +╵ |
| 22 | +╷ |
| 23 | +│ Error: grafana-operator/external-secrets-sm failed to create kubernetes rest client for update of resource: Get "https://FINGERPRINT.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup F867DE6CE883F9595FC8A73D84FB9F83.gr7.us-east-1.eks.amazonaws.com on 192.168.4.1:53: no such host |
| 24 | +│ |
| 25 | +│ with module.eks_monitoring.module.external_secrets[0].kubectl_manifest.secret, |
| 26 | +│ on ../../modules/eks-monitoring/add-ons/external-secrets/main.tf line 89, in resource "kubectl_manifest" "secret": |
| 27 | +│ 89: resource "kubectl_manifest" "secret" { |
| 28 | +``` |
| 29 | + |
| 30 | +### Resolution |
| 31 | + |
| 32 | + |
| 33 | +To provision the `eks-monitoring` module, the environment where you run |
| 34 | +`terraform apply` must be authenticated against your cluster, and that cluster |
| 35 | +must be your current context. To verify, run `kubectl get nodes` and confirm |
| 36 | +that the nodes belong to the correct Amazon EKS cluster. |
| 37 | + |
| 38 | +To log in against the correct cluster, run: |
| 39 | + |
| 40 | +```console |
| 41 | +aws eks update-kubeconfig --name <cluster name> --region <aws region> |
| 42 | +``` |
| 43 | + |
| 44 | +## Missing Grafana dashboards |
| 45 | + |
| 46 | +`terraform apply` can complete without apparent errors while your Grafana |
| 47 | +workspace still shows no dashboards. Several situations can lead to this, as |
| 48 | +described below. The best place to start is the logs of the `grafana-operator`, |
| 49 | +`external-secrets` and `flux-system` pods. |
| 50 | + |
| 51 | + |
| 52 | +### Wrong Grafana workspace |
| 53 | + |
| 54 | +You might have provided the wrong Grafana workspace. One way to verify |
| 55 | +this is to run the following command: |
| 56 | + |
| 57 | +```bash |
| 58 | +kubectl describe grafanas external-grafana -n grafana-operator |
| 59 | +``` |
| 60 | + |
| 61 | +You should see output similar to this (truncated for brevity). Check that the |
| 62 | +URL matches your workspace. If it does not, re-running Terraform with the |
| 63 | +correct workspace ID and API key should fix this issue. |
| 64 | + |
| 65 | +```console |
| 66 | +... |
| 67 | +Spec: |
| 68 | + External: |
| 69 | + API Key: |
| 70 | + Key: GF_SECURITY_ADMIN_APIKEY |
| 71 | + Name: grafana-admin-credentials |
| 72 | + URL: https://g-workspaceid.grafana-workspace.eu-central-1.amazonaws.com |
| 73 | +Status: |
| 74 | + Admin URL: https://g-workspaceid.grafana-workspace.eu-central-1.amazonaws.com |
| 75 | + Dashboards: |
| 76 | + grafana-operator/apiserver-troubleshooting-grafanadashboard/V3y_Zcb7k |
| 77 | + grafana-operator/apiserver-basic-grafanadashboard/R6abPf9Zz |
| 78 | + grafana-operator/java-grafanadashboard/m9mHfAy7ks |
| 79 | + grafana-operator/grafana-dashboards-adothealth/reshmanat |
| 80 | + grafana-operator/apiserver-advanced-grafanadashboard/09ec8aa1e996d6ffcd6817bbaff4db1b |
| 81 | + grafana-operator/nginx-grafanadashboard/nginx |
| 82 | + grafana-operator/kubelet-grafanadashboard/3138fa155d5915769fbded898ac09fd9 |
| 83 | + grafana-operator/cluster-grafanadashboard/efa86fd1d0c121a26444b636a3f509a8 |
| 84 | + grafana-operator/workloads-grafanadashboard/a164a7f0339f99e89cea5cb47e9be617 |
| 85 | + grafana-operator/grafana-dashboards-kubeproxy/632e265de029684c40b21cb76bca4f94 |
| 86 | + grafana-operator/nodes-grafanadashboard/200ac8fdbfbb74b39aff88118e4d1c2c |
| 87 | + grafana-operator/node-exporter-grafanadashboard/v8yDYJqnz |
| 88 | + grafana-operator/namespace-workloads-grafanadashboard/a87fb0d919ec0ea5f6543124e16c42a5 |
| 89 | +``` |
| 90 | + |
| 91 | + |
| 92 | +### Grafana API key expired |
| 93 | + |
| 94 | +Find your Grafana operator pod, then check its logs, using the commands below: |
| 95 | + |
| 96 | +```bash |
| 97 | +kubectl get pods -n grafana-operator |
| 98 | +``` |
| 99 | + |
| 100 | +Output: |
| 101 | + |
| 102 | +```console |
| 103 | +NAME READY STATUS RESTARTS AGE |
| 104 | +grafana-operator-866d4446bb-nqq5c 1/1 Running 0 3h17m |
| 105 | +``` |
| 106 | + |
| 107 | +```bash |
| 108 | +kubectl logs grafana-operator-866d4446bb-nqq5c -n grafana-operator |
| 109 | +``` |
| 110 | + |
| 111 | +Output: |
| 112 | + |
| 113 | +```console |
| 114 | +1.6857285045556655e+09 ERROR error reconciling datasource {"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample-amp","namespace":"grafana-operator"}, "namespace": "grafana-operator", "name": "grafanadatasource-sample-amp", "reconcileID": "72cfd60c-a255-44a1-bfbd-88b0cbc4f90c", "datasource": "grafanadatasource-sample-amp", "grafana": "external-grafana", "error": "status: 401, body: {\"message\":\"Expired API key\"}\n"} |
| 115 | +github.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile |
| 116 | +``` |
| 117 | + |
| 118 | +If you see the above `Expired API key` error in the logs, |
| 119 | +your Grafana API key has expired. |
| 120 | + |
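To check for this condition without reading the full log, the error string can be matched directly. The following is a minimal sketch, not part of the module: `has_expired_api_key` is a hypothetical helper name, and it assumes the operator keeps logging the literal text `Expired API key` seen in the output above.

```bash
# Hypothetical helper: exits with status 0 when the log text read from stdin
# contains the "Expired API key" message, and with a non-zero status otherwise.
has_expired_api_key() {
  grep -q 'Expired API key'
}
```

It could be used as `kubectl logs <pod name> -n grafana-operator | has_expired_api_key && echo "Grafana API key expired"`, where `<pod name>` is the pod returned by `kubectl get pods -n grafana-operator`.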
| 121 | +Use the following procedure to update your `grafana-api-key`: |
| 122 | + |
| 123 | +- Create a new Grafana API key by following [this step](https://aws-observability.github.io/terraform-aws-observability-accelerator/eks/#6-grafana-api-key), |
| 124 | +and make sure the API key duration is not too short. |
| 125 | + |
| 126 | +- Run Terraform with the new API key. Terraform will modify the AWS SSM |
| 127 | +Parameter used by `externalsecret`. |
| 128 | + |
| 129 | +- If the issue persists, you can force the synchronization by deleting the |
| 130 | +`externalsecret` Kubernetes object. |
| 131 | + |
| 132 | +```bash |
| 133 | +kubectl delete externalsecret/external-secrets-sm -n grafana-operator |
| 134 | +``` |
| 135 | + |
| 136 | +### Git repository errors |
| 137 | + |
| 138 | +[Flux](https://fluxcd.io/flux/components/source/gitrepositories/) is responsible |
| 139 | +for regularly pulling and synchronizing [dashboards and artifacts](https://github.com/aws-observability/aws-observability-accelerator) |
| 140 | +into your EKS cluster. Its state can occasionally become corrupted. |
| 141 | + |
| 142 | +You can check for such errors with the following command. The output shows an |
| 143 | +error if Flux cannot pull the repository correctly: |
| 144 | + |
| 145 | +```bash |
| 146 | +kubectl get gitrepositories -n flux-system |
| 147 | +NAME URL AGE READY STATUS |
| 148 | +aws-observability-accelerator https://github.com/aws-observability/aws-observability-accelerator 6d12h True stored artifact for revision 'v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d' |
| 149 | +``` |
| 150 | + |
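When several objects are listed, it can help to print only the rows that are not ready. The following is a minimal sketch under stated assumptions: `not_ready` is a hypothetical helper, and it assumes the table layout shown above, with a header row that contains a `READY` column and no spaces in any column before it.

```bash
# Hypothetical helper: reads `kubectl get` table output on stdin and prints
# only the rows whose READY column is not "True".
not_ready() {
  awk '
    # Header row: remember which column is named READY, then skip the row.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "READY") col = i; next }
    # Data rows: print the row when READY holds anything other than True.
    col && $col != "True" { print }
  '
}
```

For example, `kubectl get gitrepositories -n flux-system | not_ready` prints nothing when every repository is ready. Because the column is located by its header name, the same filter should also work on output that starts with a `NAMESPACE` column.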
| 151 | +Depending on the error, you can delete the repository object and re-run |
| 152 | +Terraform to force the synchronization. |
| 153 | + |
| 154 | +```bash |
| 155 | +kubectl delete gitrepositories aws-observability-accelerator -n flux-system |
| 156 | +``` |
| 157 | + |
| 158 | +If you believe this is a bug, please open an issue [here](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues). |
| 159 | + |
| 160 | + |
| 161 | +### Flux Kustomizations |
| 162 | + |
| 163 | +After Flux pulls the repository into the cluster, it applies [Kustomizations](https://fluxcd.io/flux/components/kustomize/kustomizations/) |
| 164 | +to create Grafana data sources, folders and dashboards. |
| 165 | + |
| 166 | +- Check the Kustomization objects. You should see the dashboards you have |
| 167 | +enabled: |
| 168 | + |
| 169 | +```bash |
| 170 | +kubectl get kustomizations.kustomize.toolkit.fluxcd.io -A |
| 171 | +NAMESPACE NAME AGE READY STATUS |
| 172 | +flux-system grafana-dashboards-adothealth 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d |
| 173 | +flux-system grafana-dashboards-apiserver 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d |
| 174 | +flux-system grafana-dashboards-infrastructure 10d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d |
| 175 | +flux-system grafana-dashboards-java 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d |
| 176 | +flux-system grafana-dashboards-kubeproxy 10d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d |
| 177 | +flux-system grafana-dashboards-nginx 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d |
| 178 | +``` |
| 179 | + |
| 180 | +- For more information on an error, view the Kustomization controller logs: |
| 181 | + |
| 182 | +```bash |
| 183 | +kubectl get pods -n flux-system |
| 184 | +NAME READY STATUS RESTARTS AGE |
| 185 | +helm-controller-65cc46469f-nsqd5 1/1 Running 2 (13d ago) 27d |
| 186 | +image-automation-controller-d8f7bfcb4-k2m9j 1/1 Running 2 (13d ago) 27d |
| 187 | +image-reflector-controller-68979dfd49-wh25h 1/1 Running 2 (13d ago) 27d |
| 188 | +kustomize-controller-767677f7f5-c5xsp 1/1 Running 5 (13d ago) 63d |
| 189 | +notification-controller-55d8c759f5-7df5l 1/1 Running 5 (13d ago) 63d |
| 190 | +source-controller-58c66d55cd-4j6bl 1/1 Running 5 (13d ago) 63d |
| 191 | +``` |
| 192 | + |
| 193 | +```bash |
| 194 | +kubectl logs -f -n flux-system kustomize-controller-767677f7f5-c5xsp |
| 195 | +``` |
| 196 | + |
| 197 | +If you believe there is a bug, please open an issue [here](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues). |
| 198 | + |
| 199 | +- Depending on the error, delete the Kustomization object and re-apply Terraform: |
| 200 | + |
| 201 | +```bash |
| 202 | +kubectl delete kustomizations -n flux-system grafana-dashboards-apiserver |
| 203 | +``` |
| 204 | + |
| 205 | +### Grafana dashboards errors |
| 206 | + |
| 207 | +If all of the above seems normal, inspect the deployed dashboards by |
| 208 | +running this command: |
| 209 | + |
| 210 | +```bash |
| 211 | +kubectl get grafanadashboards -A |
| 212 | +NAMESPACE NAME AGE |
| 213 | +grafana-operator apiserver-advanced-grafanadashboard 18d |
| 214 | +grafana-operator apiserver-basic-grafanadashboard 18d |
| 215 | +grafana-operator apiserver-troubleshooting-grafanadashboard 18d |
| 216 | +grafana-operator cluster-grafanadashboard 10d |
| 217 | +grafana-operator grafana-dashboards-adothealth 18d |
| 218 | +grafana-operator grafana-dashboards-kubeproxy 10d |
| 219 | +grafana-operator java-grafanadashboard 18d |
| 220 | +grafana-operator kubelet-grafanadashboard 10d |
| 221 | +grafana-operator namespace-workloads-grafanadashboard 10d |
| 222 | +grafana-operator nginx-grafanadashboard 18d |
| 223 | +grafana-operator node-exporter-grafanadashboard 10d |
| 224 | +grafana-operator nodes-grafanadashboard 10d |
| 225 | +grafana-operator workloads-grafanadashboard 10d |
| 226 | +``` |
| 227 | + |
| 228 | +- You can dive into the details of a dashboard by running: |
| 229 | + |
| 230 | +```bash |
| 231 | +kubectl describe grafanadashboards grafana-dashboards-kubeproxy -n grafana-operator |
| 232 | +``` |
| 233 | + |
| 234 | +- Depending on the error, you can delete the dashboard object. In this case, |
| 235 | +you don't need to re-run Terraform as the Flux Kustomization will force its |
| 236 | +recreation through the Grafana operator: |
| 237 | + |
| 238 | +```bash |
| 239 | +kubectl delete grafanadashboards grafana-dashboards-kubeproxy -n grafana-operator |
| 240 | +``` |
| 241 | + |
| 242 | +If you believe there is a bug, please open an issue [here](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues). |
| 243 | + |
| 244 | +## Upgrade to v2.5.0 or later from an earlier version |
| 245 | + |
| 246 | +v2.5.0 removes the dependency on the Terraform Grafana provider in the EKS |
| 247 | +monitoring module. As Grafana Operator manages and syncs the Grafana contents, |
| 248 | +Terraform is not required anymore in this context. |
| 249 | + |
| 250 | +However, if you migrate from an earlier version, some resources might be left |
| 251 | +orphaned because the Grafana provider is dropped, and Terraform will throw an |
| 252 | +error. To avoid this, we released v2.5.0-rc.1, which removes all |
| 253 | +the Grafana resources provisioned by Terraform in the EKS context, |
| 254 | +without removing the provider configurations. |
| 255 | + |
| 256 | +- Step 1: migrate to v2.5.0-rc.1 and run apply |
| 257 | +- Step 2: migrate to v2.5.0 or above |
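
The two steps above amount to changing the module version twice and applying after each change. The following is a minimal sketch under stated assumptions: it assumes your configuration references the module through a Git source string ending in `//modules/eks-monitoring?ref=<tag>`, and `pin_module_ref` is a hypothetical helper, not something shipped with the module.

```bash
# Hypothetical helper: reads Terraform configuration on stdin and rewrites the
# "?ref=<tag>" suffix of the eks-monitoring module source to the given tag.
# Assumes the source string has the form
#   "...terraform-aws-observability-accelerator//modules/eks-monitoring?ref=<tag>"
pin_module_ref() {
  local tag="$1"
  sed -E "s|(terraform-aws-observability-accelerator//modules/eks-monitoring\?ref=)[^\"]+|\1${tag}|"
}
```

For instance, `pin_module_ref v2.5.0-rc.1 < main.tf > main.tf.new` produces a copy of a (hypothetical) `main.tf` pinned to the release candidate. After reviewing and replacing the file, run `terraform init -upgrade` followed by `terraform apply`, then repeat the same sequence with `v2.5.0` or a later tag.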