
Commit c3fcc67 · Feature/issue 114 monitor adot collector (#132)

* Issue-114-MonitorAdotCollector Health check and Monitoring

1 parent 8711e1f commit c3fcc67

File tree

12 files changed: +365 −1 lines changed
Lines changed: 190 additions & 0 deletions

@@ -0,0 +1,190 @@
# Single Cluster Open Source Observability - OTEL Collector Monitoring

## Objective

This pattern adds observability on top of an existing EKS cluster, including monitoring for ADOT collector health, using open source tools and managed AWS services.

## Prerequisites

Ensure that you have installed the following tools on your machine:

1. [aws cli](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html)
2. [kubectl](https://Kubernetes.io/docs/tasks/tools/)
3. [cdk](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install)
4. [npm](https://docs.npmjs.com/cli/v8/commands/npm-install)
You will also need:

1. Either an existing EKS cluster, or a new one set up with the [Single New EKS Cluster Observability Accelerator](../single-new-eks-observability-accelerators/single-new-eks-cluster.md)
2. An OpenID Connect (OIDC) provider associated with the above EKS cluster (note: the Single EKS Cluster pattern takes care of that for you)
## Deploying

1. Edit `~/.cdk.json` by setting the name of your existing cluster:

    ```json
    "context": {
        ...
        "existing.cluster.name": "...",
        ...
    }
    ```
2. Edit `~/.cdk.json` by setting the kubectl role name. If you used the Single New EKS Cluster Observability Accelerator to set up your cluster, the kubectl role name is provided in the deployment output on your command-line interface (CLI):

    ```json
    "context": {
        ...
        "existing.kubectl.rolename": "...",
        ...
    }
    ```
3. Amazon Managed Grafana workspace: to visualize the metrics collected, you need an Amazon Managed Grafana workspace. If you have an existing workspace, create the environment variables as described below. To create a new workspace, visit [our supporting example for Grafana](https://aws-observability.github.io/terraform-aws-observability-accelerator/helpers/managed-grafana/).

    !!! note
        For the URL `https://g-xyz.grafana-workspace.us-east-1.amazonaws.com`, the workspace ID would be `g-xyz`.

    ```bash
    export AWS_REGION=<YOUR AWS REGION>
    export COA_AMG_WORKSPACE_ID=g-xyz
    export COA_AMG_ENDPOINT_URL=https://g-xyz.grafana-workspace.us-east-1.amazonaws.com
    ```

    !!! warning
        Setting the environment variables `COA_AMG_ENDPOINT_URL` and `AWS_REGION` is mandatory for successful execution of this pattern.
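As the note above implies, the workspace ID is simply the first DNS label of the workspace endpoint hostname. A small illustrative snippet showing that relationship (this helper is not part of the pattern; the pattern just expects both variables to be set):

```typescript
// Derive the Amazon Managed Grafana workspace ID from the endpoint URL.
// Illustrative only -- not part of the pattern's codebase.
function workspaceIdFromEndpoint(endpointUrl: string): string {
    // The workspace ID is the first label of the hostname,
    // e.g. "g-xyz" in https://g-xyz.grafana-workspace.us-east-1.amazonaws.com
    return new URL(endpointUrl).hostname.split(".")[0];
}

console.log(workspaceIdFromEndpoint("https://g-xyz.grafana-workspace.us-east-1.amazonaws.com")); // g-xyz
```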
4. Grafana API key: Amazon Managed Grafana provides a control plane API for generating Grafana API keys.

    ```bash
    export AMG_API_KEY=$(aws grafana create-workspace-api-key \
      --key-name "grafana-operator-key" \
      --key-role "ADMIN" \
      --seconds-to-live 432000 \
      --workspace-id $COA_AMG_WORKSPACE_ID \
      --query key \
      --output text)
    ```
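The `--seconds-to-live 432000` flag above caps the key's lifetime. As a quick sanity check on that number (illustrative arithmetic only, not part of the pattern):

```typescript
// 432000 seconds is the key lifetime requested above; convert it to days.
// Illustrative only -- not part of the pattern's codebase.
const secondsToLive = 432000;
const secondsPerDay = 86400;
console.log(`API key lifetime: ${secondsToLive / secondsPerDay} days`); // 5 days
```

The key therefore expires after five days, so steps 4 and 5 may need to be repeated if you deploy later than that.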
5. AWS SSM Parameter Store for the Grafana API key: store the new Grafana API key in AWS SSM Parameter Store. It is referenced by the Grafana Operator deployment of this solution to access Amazon Managed Grafana from the Amazon EKS cluster.

    ```bash
    aws ssm put-parameter --name "/cdk-accelerator/grafana-api-key" \
      --type "SecureString" \
      --value $AMG_API_KEY \
      --region $AWS_REGION
    ```
6. Install project dependencies by running `npm install` in the main folder of this cloned repository.

7. The settings for the dashboard URLs are expected in the CDK context, typically in the `cdk.json` file of the current directory or in `~/.cdk.json` in your home directory.

    Example settings: update the context in the `cdk.json` file located in the `cdk-eks-blueprints-patterns` directory:
    ```json
    "context": {
        "fluxRepository": {
            "name": "grafana-dashboards",
            "namespace": "grafana-operator",
            "repository": {
                "repoUrl": "https://github.com/aws-observability/aws-observability-accelerator",
                "name": "grafana-dashboards",
                "targetRevision": "main",
                "path": "./artifacts/grafana-operator-manifests/eks/infrastructure"
            },
            "values": {
                "GRAFANA_CLUSTER_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json",
                "GRAFANA_KUBELET_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json",
                "GRAFANA_NSWRKLDS_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json",
                "GRAFANA_NODEEXP_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json",
                "GRAFANA_NODES_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodes.json",
                "GRAFANA_WORKLOADS_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/workloads.json",
                "GRAFANA_ADOTHEALTH_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/adot/adothealth.json"
            },
            "kustomizations": [
                {
                    "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/infrastructure"
                },
                {
                    "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/adot"
                }
            ]
        },
        "adotcollectormetrics.pattern.enabled": true
    }
    ```
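The `adotcollectormetrics.pattern.enabled` flag is what the pattern's TypeScript code reads from the CDK context to toggle the ADOT self-monitoring blocks in the collector configuration. A minimal sketch of how such a flag is consumed (the helper name below is illustrative; the real code in the pattern's `index.ts` reads `cdk.json` with `fs.readFileSync` and `JSON.parse` and inlines this check):

```typescript
// Minimal sketch: reading the feature flag from a parsed CDK context.
// The function name is an illustration, not an API from the repository.
function isAdotMonitoringEnabled(context: Record<string, unknown>): boolean {
    return context["adotcollectormetrics.pattern.enabled"] === true;
}

// Example: parse a cdk.json-style snippet and check the flag.
const parsed = JSON.parse('{"adotcollectormetrics.pattern.enabled": true}');
console.log(isAdotMonitoringEnabled(parsed)); // true
```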
8. Once all prerequisites are set, you are ready to deploy the pipeline. Run the following command from the root of this repository to deploy the pipeline stack:

    ```bash
    make build
    make pattern existing-eks-opensource-observability deploy
    ```
## Visualization

The OpenTelemetry collector produces metrics to monitor the entire pipeline.

Log in to your Grafana workspace and navigate to the Dashboards panel. You should see a new dashboard named `OpenTelemetry Health Collector` under `Observability Accelerator Dashboards`.

This dashboard shows useful telemetry about the ADOT collector itself, which can help when you want to troubleshoot issues with the collector or understand how many resources it is consuming.

The diagram below shows an example data flow and the components in an ADOT collector:

![ADOTCollectorComponents](../images/ADOTCollectorComponents.png)

In this dashboard, there are five sections. Each section has [metrics](https://aws-observability.github.io/observability-best-practices/guides/operational/adot-at-scale/operating-adot-collector/#collecting-health-metrics-from-the-collector) relevant to the various [components](https://opentelemetry.io/docs/demo/collector-data-flow-dashboard/#data-flow-overview) of the AWS Distro for OpenTelemetry (ADOT) collector:
### Receivers

Shows the receiver's accepted and refused rate/count of spans and metric points that are pushed into the telemetry pipeline.

### Processors

Shows the accepted and refused rate/count of spans and metric points pushed into the next component in the pipeline. The batch metrics can help you understand how often metrics are sent to the exporter and how large the batches are.

![receivers_processors](../images/ADOTReceiversProcessors.png)

### Exporters

Shows the exporter's accepted and refused rate/count of spans and metric points that are pushed to any of the destinations. It also shows the size and capacity of the retry queue. These metrics can be used to understand whether the collector is having issues sending trace or metric data to the configured destination.

![exporters](../images/ADOTExporters.png)

### Collectors

Shows the collector's operational metrics (memory, CPU, uptime). These can be used to understand how many resources the collector is consuming.

![collectors](../images/ADOTCollectors.png)

### Data Flow

Shows how metrics and spans flow through the collector's components.

![dataflow](../images/ADOTDataflow.png)

!!! note
    To read more about the metrics and the dashboard used, visit the upstream documentation [here](https://opentelemetry.io/docs/demo/collector-data-flow-dashboard/).
## Disable ADOT health monitoring

Update the context in the `cdk.json` file located in the `cdk-eks-blueprints-patterns` directory:

```json
"context": {
    "adotcollectormetrics.pattern.enabled": false
}
```
## Teardown

You can tear down the whole CDK stack with the following command:

```bash
make pattern existing-eks-opensource-observability destroy
```

If you set up your cluster with the Single New EKS Cluster Observability Accelerator, you also need to run:

```bash
make pattern single-new-eks-cluster destroy
```
Lines changed: 113 additions & 0 deletions

@@ -0,0 +1,113 @@
# Single Cluster Open Source Observability - OTEL Collector Monitoring

## Objective

This pattern demonstrates how to use the _New EKS Cluster Open Source Observability Accelerator_ with monitoring for ADOT collector health.

## Prerequisites

Ensure that you have installed the following tools on your machine:

1. [aws cli](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html)
2. [kubectl](https://Kubernetes.io/docs/tasks/tools/)
3. [cdk](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install)
4. [npm](https://docs.npmjs.com/cli/v8/commands/npm-install)
## Deploying

Please follow the _Deploying_ instructions of the [New EKS Cluster Open Source Observability Accelerator](./single-new-eks-opensource-observability.md) pattern, except for step 7, where you need to replace the "context" in `~/.cdk.json` with the following:
```json
"context": {
    "fluxRepository": {
        "name": "grafana-dashboards",
        "namespace": "grafana-operator",
        "repository": {
            "repoUrl": "https://github.com/aws-observability/aws-observability-accelerator",
            "name": "grafana-dashboards",
            "targetRevision": "main",
            "path": "./artifacts/grafana-operator-manifests/eks/infrastructure"
        },
        "values": {
            "GRAFANA_CLUSTER_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json",
            "GRAFANA_KUBELET_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json",
            "GRAFANA_NSWRKLDS_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json",
            "GRAFANA_NODEEXP_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json",
            "GRAFANA_NODES_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodes.json",
            "GRAFANA_WORKLOADS_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/workloads.json",
            "GRAFANA_ADOTHEALTH_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/adot/adothealth.json"
        },
        "kustomizations": [
            {
                "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/infrastructure"
            },
            {
                "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/adot"
            }
        ]
    },
    "adotcollectormetrics.pattern.enabled": true
}
```
## Visualization

The OpenTelemetry collector produces metrics to monitor the entire pipeline.

Log in to your Grafana workspace and navigate to the Dashboards panel. You should see a new dashboard named `OpenTelemetry Health Collector` under `Observability Accelerator Dashboards`.

This dashboard shows useful telemetry about the ADOT collector itself, which can help when you want to troubleshoot issues with the collector or understand how many resources it is consuming.

The diagram below shows an example data flow and the components in an ADOT collector:

![ADOTCollectorComponents](../images/ADOTCollectorComponents.png)

In this dashboard, there are five sections. Each section has [metrics](https://aws-observability.github.io/observability-best-practices/guides/operational/adot-at-scale/operating-adot-collector/#collecting-health-metrics-from-the-collector) relevant to the various [components](https://opentelemetry.io/docs/demo/collector-data-flow-dashboard/#data-flow-overview) of the AWS Distro for OpenTelemetry (ADOT) collector:
### Receivers

Shows the receiver's accepted and refused rate/count of spans and metric points that are pushed into the telemetry pipeline.

### Processors

Shows the accepted and refused rate/count of spans and metric points pushed into the next component in the pipeline. The batch metrics can help you understand how often metrics are sent to the exporter and how large the batches are.

![receivers_processors](../images/ADOTReceiversProcessors.png)

### Exporters

Shows the exporter's accepted and refused rate/count of spans and metric points that are pushed to any of the destinations. It also shows the size and capacity of the retry queue. These metrics can be used to understand whether the collector is having issues sending trace or metric data to the configured destination.

![exporters](../images/ADOTExporters.png)

### Collectors

Shows the collector's operational metrics (memory, CPU, uptime). These can be used to understand how many resources the collector is consuming.

![collectors](../images/ADOTCollectors.png)

### Data Flow

Shows how metrics and spans flow through the collector's components.

![dataflow](../images/ADOTDataflow.png)

!!! note
    To read more about the metrics and the dashboard used, visit the upstream documentation [here](https://opentelemetry.io/docs/demo/collector-data-flow-dashboard/).
## Disable ADOT health monitoring

Update the context in the `cdk.json` file located in the `cdk-eks-blueprints-patterns` directory:

```json
"context": {
    "adotcollectormetrics.pattern.enabled": false
}
```

## Teardown

You can tear down the whole CDK stack with the following command:

```bash
make pattern single-new-eks-opensource-observability destroy
```

lib/common/resources/otel-collector-config.yml

Lines changed: 12 additions & 0 deletions

```diff
@@ -28,6 +28,12 @@ spec:
       external_labels:
         cluster: "{{clusterName}}"
     scrape_configs:
+    {{ start enableAdotMetricsCollectionJob}}
+    - job_name: otel-collector-metrics
+      scrape_interval: 10s
+      static_configs:
+      - targets: ['localhost:8888']
+    {{ stop enableAdotMetricsCollectionJob }}
     - job_name: 'kubernetes-kubelet'
       scheme: https
       tls_config:
@@ -1653,3 +1659,9 @@ spec:
     metrics:
       receivers: [prometheus]
       exporters: [logging, prometheusremotewrite]
+    {{ start enableAdotMetricsCollectionTelemetry }}
+    telemetry:
+      metrics:
+        address: 0.0.0.0:8888
+        level: basic
+    {{ stop enableAdotMetricsCollectionTelemetry }}
```
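One detail worth noting in this config: the new `otel-collector-metrics` scrape job targets `localhost:8888`, which is the same port the `telemetry` section uses to expose the collector's own metrics (`0.0.0.0:8888`) — the two must agree for the self-monitoring pipeline to work. A tiny illustrative check of that port agreement (hypothetical helper, not from the repository):

```typescript
// Extract the port from a host:port string. Illustrative helper only.
function port(hostPort: string): string {
    return hostPort.split(":").pop() ?? "";
}

const telemetryAddress = "0.0.0.0:8888"; // from telemetry.metrics.address
const scrapeTarget = "localhost:8888";   // from the otel-collector-metrics job
console.log(port(telemetryAddress) === port(scrapeTarget)); // both use port 8888
```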

lib/existing-eks-opensource-observability-pattern/index.ts

Lines changed: 12 additions & 0 deletions

```diff
@@ -66,6 +66,18 @@ export default class ExistingEksOpenSourceobservabilityPattern {
         "{{ end }}",
         jsonStringnew.context["apiserver.pattern.enabled"]
     );
+    doc = utils.changeTextBetweenTokens(
+        doc,
+        "{{ start enableAdotMetricsCollectionJob}}",
+        "{{ stop enableAdotMetricsCollectionJob }}",
+        jsonStringnew.context["adotcollectormetrics.pattern.enabled"]
+    );
+    doc = utils.changeTextBetweenTokens(
+        doc,
+        "{{ start enableAdotMetricsCollectionTelemetry }}",
+        "{{ stop enableAdotMetricsCollectionTelemetry }}",
+        jsonStringnew.context["adotcollectormetrics.pattern.enabled"]
+    );
     console.log(doc);
     fs.writeFileSync(__dirname + '/../common/resources/otel-collector-config-new.yml', doc);
```
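The diff above toggles the token-delimited blocks in the collector config based on the context flag. The implementation of `utils.changeTextBetweenTokens` is not shown in this commit; a minimal sketch of what such a token-toggling helper might look like (an assumption for illustration — behavior and signature are inferred from its call sites, where the body between tokens is kept or dropped according to the boolean):

```typescript
// Hypothetical sketch of a helper like utils.changeTextBetweenTokens.
// When `enable` is false, lines between the start and stop tokens are dropped;
// in both cases the token marker lines themselves are removed.
function changeTextBetweenTokens(doc: string, startToken: string, stopToken: string, enable: boolean): string {
    const out: string[] = [];
    let inside = false;
    for (const line of doc.split("\n")) {
        if (line.includes(startToken)) { inside = true; continue; }  // drop marker line
        if (line.includes(stopToken)) { inside = false; continue; }  // drop marker line
        if (!inside || enable) out.push(line);                       // keep body only when enabled
    }
    return out.join("\n");
}
```

For example, with `enable` set to `false` the `otel-collector-metrics` scrape job would disappear from the rendered `otel-collector-config-new.yml`.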

lib/single-new-eks-fargate-opensource-observability-pattern/index.ts

Lines changed: 14 additions & 1 deletion

```diff
@@ -35,14 +35,27 @@ export default class SingleNewEksFargateOpenSourceObservabilityConstruct {
             ]
         }
     };
-
+    const jsonString = fs.readFileSync(__dirname + '/../../cdk.json', 'utf-8');
+    const jsonStringnew = JSON.parse(jsonString);
     let doc = utils.readYamlDocument(__dirname + '/../common/resources/otel-collector-config.yml');
     doc = utils.changeTextBetweenTokens(
         doc,
         "{{ if enableAPIserverJob }}",
         "{{ end }}",
         true
     );
+    doc = utils.changeTextBetweenTokens(
+        doc,
+        "{{ start enableAdotMetricsCollectionJob}}",
+        "{{ stop enableAdotMetricsCollectionJob }}",
+        jsonStringnew.context["adotcollectormetrics.pattern.enabled"]
+    );
+    doc = utils.changeTextBetweenTokens(
+        doc,
+        "{{ start enableAdotMetricsCollectionTelemetry }}",
+        "{{ stop enableAdotMetricsCollectionTelemetry }}",
+        true
+    );
     console.log(doc);
     fs.writeFileSync(__dirname + '/../common/resources/otel-collector-config-new.yml', doc);
```
