GPUs play an integral part in data-intensive workloads. The eks-monitoring module of the Observability Accelerator can deploy the NVIDIA DCGM Exporter Dashboard.
The dashboard uses metrics scraped from the `/metrics` endpoint, which is exposed when running the NVIDIA GPU operator with the [DCGM exporter](https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/) and the NVSMI binary.
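To confirm that the exporter is serving data before opening the dashboard, the endpoint can be queried directly. The following is a minimal sketch, not part of the module: it assumes the exporter is reachable on local port 9400 (the DCGM exporter's default port) through a port-forward, and the service name `nvidia-dcgm-exporter` and namespace `gpu-operator` are assumptions that may differ in your cluster. `filter_metric` is a hypothetical helper name; `DCGM_FI_DEV_GPU_UTIL` is one of the exporter's standard metric names.

```shell
# Print only the sample lines for one metric from Prometheus text exposition
# format read on stdin. HELP/TYPE comment lines start with '#', so they never
# match the anchored pattern.
filter_metric() {
  metric="$1"
  # Match the metric name followed by a label set or a space; a missing
  # metric yields empty output instead of a non-zero exit status.
  grep -E "^${metric}(\{| )" || true
}

# Usage (assumed service name and namespace; adjust for your cluster):
#   kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
#   curl -s http://localhost:9400/metrics | filter_metric DCGM_FI_DEV_GPU_UTIL
```

If the filter prints nothing, the exporter is either unreachable or not reporting that metric, and the dashboard panels that depend on it will stay empty.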
!!!note
    To use this dashboard, you need a GPU-backed EKS cluster with the [GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html) deployed.
    The recommended way of deploying the GPU operator is the [Data on EKS Blueprint](https://github.com/aws-ia/terraform-aws-eks-data-addons/blob/main/nvidia-gpu-operator.tf).
## Deployment
This is enabled by default in the [eks-monitoring module](https://aws-observability.github.io/terraform-aws-observability-accelerator/eks/).
## Dashboards
To start producing diagnostic metrics, first deploy a pod that runs the nvidia-smi binary. nvidia-smi (also known as NVSMI) provides monitoring and management capabilities for NVIDIA devices from the Fermi architecture family onward. The following manifest runs nvidia-smi, which shows diagnostic information about all GPUs visible to the container:
```
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```
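To check that the pod actually ran and saw a GPU, its output can be read back once it completes. This is a minimal sketch under stated assumptions: `show_nvidia_smi_output` is a hypothetical helper name, the 120s timeout is an arbitrary default, and the jsonpath form of `kubectl wait` requires kubectl v1.23 or newer.

```shell
# Wait for the nvidia-smi pod from the manifest above to finish, then print
# the nvidia-smi table it wrote to its logs.
show_nvidia_smi_output() {
  pod="${1:-nvidia-smi}"    # pod name used in the manifest above
  timeout="${2:-120s}"      # assumed default; tune for slow image pulls
  # The pod runs to completion, so wait for phase Succeeded rather than Ready.
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded "pod/${pod}" --timeout="${timeout}" &&
    kubectl logs "${pod}"
}

# Usage:
#   show_nvidia_smi_output
#   kubectl delete pod nvidia-smi   # optional cleanup afterwards
```

If the wait times out, `kubectl describe pod nvidia-smi` usually shows whether the pod is stuck pending because no node advertises the `nvidia.com/gpu` resource.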
Once metrics are being produced, they should populate the DCGM exporter dashboard.
| [aws_caller_identity.current](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/caller_identity) | data source |
| [aws_eks_cluster.eks_cluster](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/eks_cluster) | data source |
| [aws_partition.current](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/partition) | data source |
@@ -93,6 +94,7 @@ See examples using this Terraform modules in the **Amazon EKS** section of [this
| <a name="input_enable_managed_prometheus"></a> [enable\_managed\_prometheus](#input\_enable\_managed\_prometheus) | Creates a new Amazon Managed Service for Prometheus Workspace | `bool` | `true` | no |
| <a name="input_enable_nginx"></a> [enable\_nginx](#input\_enable\_nginx) | Enable NGINX workloads monitoring, alerting and default dashboards | `bool` | `false` | no |
| <a name="input_enable_node_exporter"></a> [enable\_node\_exporter](#input\_enable\_node\_exporter) | Enables or disables Node exporter. Disabling this might affect some data in the dashboards | `bool` | `true` | no |
| <a name="input_enable_nvidia_monitoring"></a> [enable\_nvidia\_monitoring](#input\_enable\_nvidia\_monitoring) | Enables monitoring of nvidia metrics | `bool` | `true` | no |
| <a name="input_enable_recording_rules"></a> [enable\_recording\_rules](#input\_enable\_recording\_rules) | Enables or disables Managed Prometheus recording rules | `bool` | `true` | no |
| <a name="input_enable_tracing"></a> [enable\_tracing](#input\_enable\_tracing) | Enables tracing with OTLP traces receiver to X-Ray | `bool` | `true` | no |
| <a name="input_prometheus_config"></a> [prometheus\_config](#input\_prometheus\_config) | Controls default values such as scrape interval, timeouts and ports globally | <pre>object({<br> global_scrape_interval = string<br> global_scrape_timeout = string<br> })</pre> | <pre>{<br> "global_scrape_interval": "120s",<br> "global_scrape_timeout": "15s"<br>}</pre> | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Additional tags (e.g. `map('BusinessUnit`,`XYZ`) | `map(string)` | `{}` | no |
| <a name="input_target_secret_name"></a> [target\_secret\_name](#input\_target\_secret\_name) | Target secret in Kubernetes to store the Grafana API Key Secret | `string` | `"grafana-admin-credentials"` | no |