|
| 1 | +# Single New EKS Cluster Open Source Observability Accelerator |
| 2 | + |
| 3 | +## Architecture |
| 4 | + |
| 5 | +The following figure illustrates the architecture of the Single EKS Cluster Open Source Observability on Graviton pattern that we will deploy. It uses open source tooling such as AWS Distro for OpenTelemetry (ADOT), Amazon Managed Service for Prometheus (AMP), and Amazon Managed Grafana: |
| 6 | + |
| 7 | + |
| 8 | + |
| 9 | +Metrics monitoring for Amazon Elastic Kubernetes Service (Amazon EKS) falls into two categories: |
| 10 | +the control plane and the Amazon EKS nodes (with Kubernetes objects). |
| 11 | +The Amazon EKS control plane consists of control plane nodes that run the Kubernetes software, |
| 12 | +such as etcd and the Kubernetes API server. To learn more about the components of an Amazon EKS cluster, |
| 13 | +see the [service documentation](https://docs.aws.amazon.com/eks/latest/userguide/clusters.html). |
| 14 | + |
| 15 | +## Objective |
| 16 | + |
| 17 | +- Deploys one production-grade Amazon EKS cluster running on a Graviton3 processor. |
| 18 | +- Installs the AWS Distro for OpenTelemetry Operator and Collector for metrics and traces. |
| 19 | +- Collects logs with [AWS for FluentBit](https://github.com/aws/aws-for-fluent-bit). |
| 20 | +- Installs Grafana Operator to add AWS data sources and create Grafana dashboards in Amazon Managed Grafana. |
| 21 | +- Installs FluxCD to perform GitOps sync of a Git repository to the EKS cluster. We will use this later to create Grafana dashboards and AWS data sources in Amazon Managed Grafana. You can also use your own Git repository to sync your own Grafana resources, such as dashboards and data sources. See our One Observability module, [GitOps with Amazon Managed Grafana](https://catalog.workshops.aws/observability/en-US/aws-managed-oss/gitops-with-amg), to learn more. |
| 22 | +- Installs External Secrets Operator to retrieve and sync the Grafana API keys. |
| 23 | +- Provisions Amazon Managed Grafana dashboards and data sources. |
| 24 | +- Sets up alerts and recording rules with Amazon Managed Service for Prometheus. |
| 25 | + |
| 26 | +## Prerequisites |
| 27 | + |
| 28 | +Ensure that you have installed the following tools on your machine. |
| 29 | + |
| 30 | +1. [aws cli](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) |
| 31 | +2. [kubectl](https://Kubernetes.io/docs/tasks/tools/) |
| 32 | +3. [cdk](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install) |
| 33 | +4. [npm](https://docs.npmjs.com/cli/v8/commands/npm-install) |
| 34 | + |
| 35 | +## Deploying |
| 36 | + |
| 37 | +1. Clone the repository (or your fork of it): |
| 38 | + |
| 39 | +```sh |
| 40 | +git clone https://github.com/aws-observability/cdk-aws-observability-accelerator.git |
| 41 | +``` |
| 42 | + |
| 43 | +2. Install the AWS CDK Toolkit globally on your machine: |
| 44 | + |
| 45 | +```bash |
| 46 | +npm install -g aws-cdk |
| 47 | +``` |
| 48 | + |
| 49 | +3. Amazon Managed Grafana workspace: To visualize the collected metrics, you need an Amazon Managed Grafana workspace. If you have an existing workspace, set the environment variables as described below. To create a new workspace, visit [our supporting example for Grafana](https://aws-observability.github.io/terraform-aws-observability-accelerator/helpers/managed-grafana/). |
| 50 | + |
| 51 | +!!! note |
| 52 | + For the URL `https://g-xyz.grafana-workspace.us-east-1.amazonaws.com`, the workspace ID would be `g-xyz` |
| 53 | + |
| 54 | +```bash |
| 55 | +export AWS_REGION=<YOUR AWS REGION> |
| 56 | +export COA_AMG_WORKSPACE_ID=g-xyz |
| 57 | +export COA_AMG_ENDPOINT_URL=https://g-xyz.grafana-workspace.us-east-1.amazonaws.com |
| 58 | +``` |
| 59 | + |
| 60 | +!!! warning |
| 61 | + Setting up environment variables `COA_AMG_ENDPOINT_URL` and `AWS_REGION` is mandatory for successful execution of this pattern. |
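
As a minimal sketch of the note and warning above, the workspace ID can be derived from the endpoint URL, and the required variables can be checked before deploying. The helper names `amg_workspace_id` and `require_vars` are hypothetical and are not part of the accelerator; the sketch assumes the endpoint URL has the `https://<workspace-id>.grafana-workspace.<region>.amazonaws.com` form shown in the note.

```bash
# Hypothetical helper (not part of the accelerator): print the workspace ID
# embedded in an Amazon Managed Grafana endpoint URL.
amg_workspace_id() {
  host="${1#https://}"          # strip the scheme
  printf '%s\n' "${host%%.*}"   # keep everything before the first dot
}

# Hypothetical helper: report each named variable that is unset or empty.
# Returns non-zero if at least one is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Prints g-xyz for the sample URL from the note.
amg_workspace_id "https://g-xyz.grafana-workspace.us-east-1.amazonaws.com"

# Warn early if a variable this pattern needs is missing.
require_vars AWS_REGION COA_AMG_ENDPOINT_URL COA_AMG_WORKSPACE_ID \
  || echo "Set the variables listed above before deploying." >&2
```

Running the check before the deploy step surfaces a missing variable immediately instead of partway through the deployment.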
| 62 | + |
| 63 | +4. Grafana API key: Amazon Managed Grafana provides a control plane API for generating Grafana API keys. Create a key and store it in an environment variable: |
| 64 | + |
| 65 | +```bash |
| 66 | +export AMG_API_KEY=$(aws grafana create-workspace-api-key \ |
| 67 | + --key-name "grafana-operator-key" \ |
| 68 | + --key-role "ADMIN" \ |
| 69 | + --seconds-to-live 432000 \ |
| 70 | + --workspace-id $COA_AMG_WORKSPACE_ID \ |
| 71 | + --query key \ |
| 72 | + --output text) |
| 73 | +``` |
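
For reference, the `--seconds-to-live` value used above can be converted to days with shell arithmetic. This is only a sanity check on the number in the command; the variable names are illustrative.

```bash
# Lifetime passed to --seconds-to-live in the command above.
SECONDS_TO_LIVE=432000

# One day is 86400 seconds (24 * 60 * 60).
KEY_LIFETIME_DAYS=$((SECONDS_TO_LIVE / 86400))

echo "$KEY_LIFETIME_DAYS"   # prints 5
```

A key created with this value expires after five days, at which point the Grafana Operator can no longer authenticate until the key is rotated.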
| 74 | + |
| 75 | +5. AWS Secrets Manager for the Grafana API key: Create a secret in AWS Secrets Manager that stores the Grafana API key generated above. The Grafana Operator deployment of this solution references this secret to access Amazon Managed Grafana from the Amazon EKS cluster. |
| 76 | + |
| 77 | +```bash |
| 78 | +aws secretsmanager create-secret \ |
| 79 | + --name grafana-api-key \ |
| 80 | + --description "API Key of your Grafana Instance" \ |
| 81 | + --secret-string "${AMG_API_KEY}" \ |
| 82 | + --region $AWS_REGION \ |
| 83 | + --query ARN \ |
| 84 | + --output text |
| 85 | +``` |
| 86 | + |
| 87 | +6. Install project dependencies by running `npm install` in the main folder of this cloned repository. |
| 88 | + |
| 89 | +7. The dashboard URLs are expected to be specified in the CDK context. Typically this is the `cdk.json` file in the current directory or `~/.cdk.json` in your home directory. |
| 90 | + |
| 91 | +Example settings: Update the context in the `cdk.json` file located in the `cdk-eks-blueprints-patterns` directory: |
| 92 | + |
| 93 | +``` |
| 94 | + "context": { |
| 95 | + "cluster.dashboard.url": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json", |
| 96 | + "kubelet.dashboard.url": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json", |
| 97 | + "namespaceworkloads.dashboard.url": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json", |
| 98 | + "nodeexporter.dashboard.url": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json", |
| 99 | + "nodes.dashboard.url": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodes.json", |
| 100 | + "workloads.dashboard.url": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/workloads.json" |
| 101 | + } |
| 102 | +``` |
| 103 | + |
| 104 | +8. Once all prerequisites are set, you are ready to deploy the pipeline. Run the following commands from the root of this repository to deploy the pipeline stack: |
| 105 | + |
| 106 | +```bash |
| 107 | +make build |
| 108 | +make pattern single-new-eks-opensource-observability-graviton deploy |
| 109 | +``` |
| 110 | + |
| 111 | +## Verify the resources |
| 112 | + |
| 113 | +Run the `update-kubeconfig` command. You should be able to copy it from the CDK output message. |
| 114 | + |
| 115 | +```bash |
| 116 | +aws eks update-kubeconfig --name single-new-eks-opensource-graviton-observability-accelerator --region <your region> --role-arn arn:aws:iam::xxxxxxxxx:role/single-new-eks-opensource-singleneweksopensourceob-82N8N3BMJYYI |
| 117 | +``` |
| 118 | + |
| 119 | +Let's verify the resources created by the steps above. |
| 120 | + |
| 121 | +```bash |
| 122 | +kubectl get nodes -o wide |
| 123 | +``` |
| 124 | +Output: |
| 125 | + |
| 126 | +```console |
| 127 | +NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME |
| 128 | +ip-10-0-104-200.us-west-2.compute.internal Ready <none> 2d1h v1.27.1-eks-2f008fe 10.0.104.200 <none> Amazon Linux 2 5.10.179-168.710.amzn2.aarch64 containerd://1.6.19 |
| 129 | +``` |
| 130 | + |
| 131 | +Next, let's verify the namespaces in the cluster: |
| 132 | + |
| 133 | +```bash |
| 134 | +kubectl get ns # Output shows all namespaces |
| 135 | +``` |
| 136 | + |
| 137 | +Output: |
| 138 | + |
| 139 | +```console |
| 140 | +NAME STATUS AGE |
| 141 | +cert-manager Active 2d1h |
| 142 | +default Active 2d1h |
| 143 | +external-secrets Active 2d1h |
| 144 | +flux-system Active 2d1h |
| 145 | +grafana-operator Active 2d1h |
| 146 | +kube-node-lease Active 2d1h |
| 147 | +kube-public Active 2d1h |
| 148 | +kube-system Active 2d1h |
| 149 | +opentelemetry-operator-system Active 2d1h |
| 150 | +prometheus-node-exporter Active 2d1h |
| 151 | +``` |
| 152 | + |
| 153 | +Next, let's verify all resources in the `grafana-operator` namespace: |
| 154 | + |
| 155 | +```bash |
| 156 | +kubectl get all --namespace=grafana-operator |
| 157 | +``` |
| 158 | + |
| 159 | +Output: |
| 160 | + |
| 161 | +```console |
| 162 | +NAME READY STATUS RESTARTS AGE |
| 163 | +pod/grafana-operator-866d4446bb-g5srl 1/1 Running 0 2d1h |
| 164 | + |
| 165 | +NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE |
| 166 | +service/grafana-operator-metrics-service ClusterIP 172.20.223.125 <none> 9090/TCP 2d1h |
| 167 | + |
| 168 | +NAME READY UP-TO-DATE AVAILABLE AGE |
| 169 | +deployment.apps/grafana-operator 1/1 1 1 2d1h |
| 170 | + |
| 171 | +NAME DESIRED CURRENT READY AGE |
| 172 | +replicaset.apps/grafana-operator-866d4446bb 1 1 1 2d1h |
| 173 | +``` |
| 174 | + |
| 175 | +## Visualization |
| 176 | + |
| 177 | +### 1. Grafana dashboards |
| 178 | + |
| 179 | +Log in to your Grafana workspace and navigate to the Dashboards panel. You should see a list of dashboards under the `Observability Accelerator Dashboards` folder. |
| 180 | + |
| 181 | + |
| 182 | + |
| 183 | +Open the `Node Exporter` dashboard and you should be able to view its visualization as shown below: |
| 184 | + |
| 185 | + |
| 186 | + |
| 187 | + |
| 188 | +Open the `Kubelet` dashboard and you should be able to view its visualization as shown below: |
| 189 | + |
| 190 | + |
| 191 | + |
| 192 | +To view all dashboards as Kubernetes objects from the cluster, run: |
| 193 | + |
| 194 | +```bash |
| 195 | +kubectl get grafanadashboards -A |
| 196 | +``` |
| 197 | + |
| 198 | +```console |
| 199 | +NAMESPACE NAME AGE |
| 200 | +grafana-operator cluster-grafanadashboard 138m |
| 201 | +grafana-operator java-grafanadashboard 143m |
| 202 | +grafana-operator kubelet-grafanadashboard 13h |
| 203 | +grafana-operator namespace-workloads-grafanadashboard 13h |
| 204 | +grafana-operator nginx-grafanadashboard 134m |
| 205 | +grafana-operator node-exporter-grafanadashboard 13h |
| 206 | +grafana-operator nodes-grafanadashboard 13h |
| 207 | +grafana-operator workloads-grafanadashboard 13h |
| 208 | +``` |
| 209 | + |
| 210 | +You can inspect more details per dashboard using this command: |
| 211 | + |
| 212 | +```bash |
| 213 | +kubectl describe grafanadashboards cluster-grafanadashboard -n grafana-operator |
| 214 | +``` |
| 215 | + |
| 216 | +Grafana Operator and Flux always work together to synchronize your dashboards with Git. If you delete your dashboards by accident, they will be re-provisioned automatically. |
| 217 | + |
| 218 | +## Viewing Logs |
| 219 | + |
| 220 | +By default, we deploy a FluentBit daemon set in the cluster to collect worker logs for all namespaces. Logs are collected and exported to Amazon CloudWatch Logs, which lets you centralize the logs from all of your systems, applications, |
| 221 | +and AWS services in a single, highly scalable service. |
| 222 | + |
| 223 | +## Using CloudWatch Logs as data source in Grafana |
| 224 | + |
| 225 | +Follow [the documentation](https://docs.aws.amazon.com/grafana/latest/userguide/using-amazon-cloudwatch-in-AMG.html) |
| 226 | +to enable Amazon CloudWatch as a data source. Make sure to grant the required permissions. |
| 227 | + |
| 228 | +All logs are delivered to a CloudWatch log group that follows this naming pattern: |
| 229 | +`/aws/eks/single-new-eks-opensource-observability-accelerator`. |
| 230 | +Log streams follow the `{container-name}.{pod-name}` pattern. In Grafana, logs are queried and analyzed with [CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html). |
| 231 | + |
| 232 | +### Example - ADOT collector logs |
| 233 | + |
| 234 | +Select one or more log groups and run the following query. The example below |
| 235 | +queries AWS Distro for OpenTelemetry (ADOT) logs: |
| 236 | + |
| 237 | +```console |
| 238 | +fields @timestamp, log |
| 239 | +| order @timestamp desc |
| 240 | +| limit 100 |
| 241 | +``` |
| 242 | + |
| 243 | + |
| 244 | + |
| 245 | +### Example - Using time series visualizations |
| 246 | + |
| 247 | +[CloudWatch Logs syntax](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html) |
| 248 | +provides powerful functions to extract data from your logs. The `stats()` |
| 249 | +function allows you to calculate aggregate statistics from log field values. |
| 250 | +This is useful for visualizing non-metric data from your applications. |
| 251 | + |
| 252 | +In the example below, we use the following query to graph the number of metrics |
| 253 | +collected by the ADOT collector: |
| 254 | + |
| 255 | +```console |
| 256 | +fields @timestamp, log |
| 257 | +| parse log /"#metrics": (?<metrics_count>\d+)}/ |
| 258 | +| stats avg(metrics_count) by bin(5m) |
| 259 | +| limit 100 |
| 260 | +``` |
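
To see what the `parse` step extracts, the same pattern can be tried locally against a single line with `sed`. This is a rough sketch: the sample log line is hypothetical (real ADOT collector output may be formatted differently), and `sed` extended regular expressions use `[0-9]+` where the query uses `\d+`.

```bash
# Hypothetical log line; real ADOT collector output may differ.
sample_log='info MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging", "#metrics": 42}'

# Capture the digits that follow "#metrics": and precede the closing brace,
# mirroring the metrics_count capture group in the Logs Insights query.
metrics_count="$(printf '%s\n' "$sample_log" | sed -nE 's/.*"#metrics": ([0-9]+)\}.*/\1/p')"

echo "$metrics_count"   # prints 42
```

If the command prints nothing for one of your own log lines, the `parse` expression in the query will not match that line either and should be adjusted to the actual log format.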
| 261 | + |
| 262 | +!!! tip |
| 263 | + You can add logs in your dashboards with logs panel types or time series |
| 264 | + depending on your query results type. |
| 265 | + |
| 266 | + |
| 267 | + |
| 268 | +!!! warning |
| 269 | + Querying CloudWatch logs will incur costs per GB scanned. Use small time |
| 270 | + windows and limits in your queries. Check out the CloudWatch |
| 271 | + [pricing page](https://aws.amazon.com/cloudwatch/pricing/) for more information. |
| 272 | + |
| 273 | +## Troubleshooting |
| 274 | + |
| 275 | +### 1. Grafana dashboards missing or Grafana API key expired |
| 276 | + |
| 277 | +If you don't see the Grafana dashboards in your Amazon Managed Grafana console, check the logs of your Grafana Operator pod using the commands below: |
| 278 | + |
| 279 | +```bash |
| 280 | +kubectl get pods -n grafana-operator |
| 281 | +``` |
| 282 | + |
| 283 | +Output: |
| 284 | + |
| 285 | +```console |
| 286 | +NAME READY STATUS RESTARTS AGE |
| 287 | +grafana-operator-866d4446bb-nqq5c 1/1 Running 0 3h17m |
| 288 | +``` |
| 289 | + |
| 290 | +```bash |
| 291 | +kubectl logs grafana-operator-866d4446bb-nqq5c -n grafana-operator |
| 292 | +``` |
| 293 | + |
| 294 | +Output: |
| 295 | + |
| 296 | +```console |
| 297 | +1.6857285045556655e+09 ERROR error reconciling datasource {"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample-amp","namespace":"grafana-operator"}, "namespace": "grafana-operator", "name": "grafanadatasource-sample-amp", "reconcileID": "72cfd60c-a255-44a1-bfbd-88b0cbc4f90c", "datasource": "grafanadatasource-sample-amp", "grafana": "external-grafana", "error": "status: 401, body: {\"message\":\"Expired API key\"}\n"} |
| 298 | +github.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile |
| 299 | +``` |
| 300 | + |
| 301 | +If you see the `Expired API key` error shown above in the logs, your Grafana API key has expired. Use the following procedure to update your `grafana-api-key`: |
| 302 | + |
| 303 | +- First, let's create a new Grafana API key: |
| 304 | + |
| 305 | +```bash |
| 306 | +export GO_AMG_API_KEY=$(aws grafana create-workspace-api-key \ |
| 307 | + --key-name "grafana-operator-key-new" \ |
| 308 | + --key-role "ADMIN" \ |
| 309 | + --seconds-to-live 432000 \ |
| 310 | + --workspace-id $COA_AMG_WORKSPACE_ID \ |
| 311 | + --query key \ |
| 312 | + --output text) |
| 313 | +``` |
| 314 | + |
| 315 | +- Next, update the Grafana API key secret in AWS Secrets Manager using the new Grafana API key created above: |
| 316 | + |
| 317 | +```bash |
| 318 | +export API_KEY_SECRET_NAME="grafana-api-key" |
| 319 | +aws secretsmanager update-secret \ |
| 320 | + --secret-id $API_KEY_SECRET_NAME \ |
| 321 | + --secret-string "${GO_AMG_API_KEY}" \ |
| 322 | + --region $AWS_REGION |
| 323 | +``` |
| 324 | + |
| 325 | +- If the issue persists, you can force the synchronization by deleting the `externalsecret` Kubernetes object. |
| 326 | + |
| 327 | +```bash |
| 328 | +kubectl delete externalsecret/external-secrets-sm -n grafana-operator |
| 329 | +``` |
| 330 | + |
| 331 | + |