Commit 21ee82b

docs(cpt): add gpu instanceXCockpit tuto (#3865)
1 parent 340b8be commit 21ee82b

2 files changed

+219
-0
lines changed

observability/cockpit/index.mdx

Lines changed: 5 additions & 0 deletions
@@ -61,6 +61,11 @@ meta:
6161
url="/observability/cockpit/how-to/send-metrics-with-grafana-alloy/"
6262
label="Read more"
6363
/>
64+
<DefaultCard
65+
title="Monitor GPU Instances using Cockpit and the NVIDIA Data Center GPU Manager (DCGM) Exporter"
66+
url="/tutorials/monitor-gpu-instance-with-cockpit/"
67+
label="Read more"
68+
/>
6469
</Grid>
6570

6671

Lines changed: 214 additions & 0 deletions
@@ -0,0 +1,214 @@
1+
---
2+
meta:
3+
title: Monitor GPU Instances using Cockpit and the NVIDIA Data Center GPU Manager (DCGM) Exporter
4+
description: This page explains how to visualize metrics and logs from GPU Instances using Cockpit and the NVIDIA Data Center GPU Manager (DCGM) Exporter
5+
content:
6+
h1: Monitor GPU Instances using Cockpit and the NVIDIA Data Center GPU Manager (DCGM) Exporter
7+
paragraph: This page explains how to visualize metrics and logs from GPU Instances using Cockpit and the NVIDIA Data Center GPU Manager (DCGM) Exporter
8+
dates:
9+
validation: 2024-10-21
10+
posted: 2024-10-21
11+
---
12+
13+
This tutorial guides you through the process of monitoring your [GPU Instances](/compute/instances/concepts/#gpu-instance) using Cockpit and the [NVIDIA Data Center GPU Manager (DCGM) Exporter](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html). Visualize your GPU Instances' metrics and ensure optimal performance and usage of your resources.
14+
15+
<Macro id="requirements" />
16+
17+
- A Scaleway account logged into the [console](https://console.scaleway.com)
18+
- [Owner](/identity-and-access-management/iam/concepts/#owner) status or [IAM permissions](/identity-and-access-management/iam/concepts/#permission) allowing you to perform actions in the intended Organization
19+
- Created a [GPU Instance](/compute/gpu/how-to/create-manage-gpu-instance/)
20+
- [Connected to your Instance via SSH](/compute/gpu/how-to/create-manage-gpu-instance/#how-to-connect-to-a-gpu-instance)
21+
- Installed [Docker Engine](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/linux/#install-using-the-repository) on your GPU Instance.
22+
23+
## Create a Cockpit data source and credentials
24+
25+
### Create a Cockpit data source
26+
27+
Your GPU Instance's metrics are stored in a Cockpit data source, and the exporter agent needs that data source's configuration details (its URL and push path) to send the metrics there.
28+
29+
1. Create a metrics [custom data source in Cockpit](/observability/cockpit/how-to/create-external-data-sources/). For the sake of this tutorial, we will name it `gpu-instance-metrics`.
30+
31+
<Message type="important">
32+
- To fill in the cost estimator, you can assume that **1 metric sent without [specific cardinality](https://grafana.com/docs/tempo/latest/metrics-generator/cardinality/)** (i.e. without labels or value duplication for the same metric) **every minute will generate around 50 000 samples per month** (60 minutes x 730 hours per month = 43 800 samples). By default, DCGM and node exporter send multiple metrics and add labels to them, which leads to a higher number of samples.
33+
- **We recommend that you complete this tutorial first** to visualize your data, and **then review your configuration to optimize the number of metrics or labels sent**.
34+
</Message>
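The estimate above can be reproduced with a few lines of shell. This is only a sketch: `estimate_monthly_samples` is a hypothetical helper, and the figure of 200 series is an arbitrary illustration, not a measured count for DCGM or node exporter.

```sh
# Hypothetical helper: estimate the number of samples sent per month.
# $1 = number of distinct series (one per metric and label combination)
# $2 = scrape interval in seconds
estimate_monthly_samples() {
  series=$1
  interval_seconds=$2
  # 3600 seconds per hour x 730 hours per month, divided by the interval
  echo $(( series * 3600 * 730 / interval_seconds ))
}

estimate_monthly_samples 1 60     # one series scraped every minute
estimate_monthly_samples 200 60   # assumed 200 series, for illustration only
```

Running the same function with your real series count, once you know it, gives a closer figure for the cost estimator.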
35+
2. Click your metrics data source to view information such as its **URL** and **push path**.
36+
37+
### Create a token
38+
39+
1. Create a [Cockpit token](/observability/cockpit/how-to/create-token/) from the [Scaleway console](https://console.scaleway.com/cockpit/tokens).
40+
2. Select a region for the data source.
41+
3. Tick the **Push Metrics** box and click **Create token** to confirm.
42+
43+
<Message type="important">
44+
Copy and store your token securely. You will use it to allow the Grafana Alloy agent to push your metrics to the metrics data source you created earlier.
45+
</Message>
46+
47+
## Collect metrics from your GPU Instance
48+
49+
### Install the NVIDIA DCGM Exporter, node exporter and Grafana Alloy agent on your GPU Instance
50+
51+
1. [Connect to your GPU Instance through SSH](/compute/gpu/how-to/create-manage-gpu-instance/#how-to-connect-to-a-gpu-instance).
52+
2. Copy and paste the following command to create a configuration file named `config.alloy` on your Instance:
```sh
touch config.alloy
```
56+
3. Copy and paste the following template inside `config.alloy`:
57+
```json
// Where the collected metrics are pushed: your Cockpit data source.
prometheus.remote_write "cockpit" {
  endpoint {
    url = "https://example-afc6-4d02-a2fd-bc020bbaa7d0.metrics.cockpit.fr-par.scw.cloud/api/v1/push"
    headers = {
      "X-TOKEN" = "example_bKNpXZZP6BSKiYzV8fiQL1yR_kP_VLB-h0tpYAkaNoVTHVm8q",
    }
  }
}

// Scrape GPU metrics from the DCGM exporter container every 60 seconds.
prometheus.scrape "dcgm_exporter" {
  scrape_interval = "60s"
  targets = [{__address__ = "dcgm_exporter:9400"}]
  forward_to = [prometheus.remote_write.cockpit.receiver]
}

// Built-in node exporter, limited to the collectors listed below.
prometheus.exporter.unix "node_exporter" {
  set_collectors = [
    "uname",
    "cpu",
    "cpufreq",
    "loadavg",
    "meminfo",
    "filesystem",
    "netdev",
  ]
}

// Scrape the node exporter metrics every 60 seconds.
prometheus.scrape "node_exporter" {
  scrape_interval = "60s"
  targets = prometheus.exporter.unix.node_exporter.targets
  forward_to = [prometheus.remote_write.cockpit.receiver]
}
```
91+
4. Replace the value of `cockpit.endpoint.url` (`https://example-afc6-4d02-a2fd-bc020bbaa7d0.metrics.cockpit.fr-par.scw.cloud/api/v1/push`) with the URL and push path of your `gpu-instance-metrics` [Cockpit data source](https://console.scaleway.com/cockpit/dataSource), and the value of `cockpit.endpoint.headers.X-TOKEN` (`example_bKNpXZZP6BSKiYzV8fiQL1yR_kP_VLB-h0tpYAkaNoVTHVm8q`) with the token you created earlier.
92+
93+
This configuration allows you to:
94+
- collect performance data (using `dcgm_exporter`) from your GPU Instance. This includes information like GPU load (how much of the GPU's processing power is being used), temperature, and other relevant metrics.
95+
- collect standard Instance metrics (CPU load, disk size, etc.) using `node_exporter`.
96+
- push the collected data to your Cockpit data source (using `cockpit`).
97+
98+
<Message type="note">
99+
- The current configuration is set to send only a limited number of metrics from `node_exporter` (the tool collecting CPU, disk, memory, etc. data). Because of this, some data might not show up on your Cockpit dashboards in Grafana when you import them.
100+
- If you want to send all available data from `node_exporter`, you need to edit its configuration. Specifically, you need to remove the `set_collectors` list from the configuration. This list defines which metrics are being collected, and removing it will allow all metrics to be sent.
101+
- While removing the `set_collectors` list will provide more detailed metrics, it may come with **higher resource usage and associated costs**, especially if you are using a paid service for data monitoring or storage.
102+
</Message>
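If you prefer not to edit the file by hand, the two example values from the template can be substituted with `sed`. The following is a minimal sketch: `fill_alloy_config` is a hypothetical helper, and it assumes that neither your URL nor your token contains the `|` or `&` characters, which `sed` would interpret.

```sh
# Hypothetical helper: replace the template's example URL and token.
# $1 = path to config.alloy, $2 = your push URL, $3 = your Cockpit token
fill_alloy_config() {
  config_file=$1
  push_url=$2
  token=$3
  # Write to a temporary file first, then replace the original file.
  sed \
    -e "s|https://example-afc6-4d02-a2fd-bc020bbaa7d0.metrics.cockpit.fr-par.scw.cloud/api/v1/push|${push_url}|" \
    -e "s|example_bKNpXZZP6BSKiYzV8fiQL1yR_kP_VLB-h0tpYAkaNoVTHVm8q|${token}|" \
    "$config_file" > "${config_file}.tmp" && mv "${config_file}.tmp" "$config_file"
}

# Example call (placeholder values, replace them with your own):
# fill_alloy_config config.alloy "<your data source URL>/api/v1/push" "<your token>"
```

Passing the token as an argument leaves it in your shell history, so you may prefer to read it from an environment variable instead.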
103+
104+
5. Copy and paste the following command to create a `docker-compose.yaml` file on your Instance:
```sh
touch docker-compose.yaml
```
108+
6. Copy and paste the following configuration inside `docker-compose.yaml`, save it and exit the file.
109+
```yaml
services:
  dcgm_exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
    cap_add:
      - SYS_ADMIN
    ports:
      - "9400:9400"

  agent:
    image: grafana/alloy:latest
    ports:
      - "12345:12345"
    volumes:
      - "./config.alloy:/etc/alloy/config.alloy"
    command: [
      "run",
      "--server.http.listen-addr=0.0.0.0:12345",
      "/etc/alloy/config.alloy",
    ]
```
137+
This configuration will:
138+
- deploy the DCGM exporter
139+
- deploy the Grafana Alloy agent
140+
141+
7. Start the Docker services using the following command:
```sh
docker compose up
```
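Once the services are running, you may want to confirm that the DCGM exporter is serving GPU metrics before building dashboards. The sketch below counts the sample lines whose name starts with `DCGM_` in Prometheus text output read from standard input. `count_dcgm_samples` is a hypothetical helper name.

```sh
# Hypothetical helper: count DCGM sample lines in Prometheus text format.
# Comment lines (# HELP, # TYPE) and other exporters' metrics are ignored.
count_dcgm_samples() {
  grep -c '^DCGM_'
}

# Example, run from a second SSH session on the GPU Instance:
# curl -s http://localhost:9400/metrics | count_dcgm_samples
```

A result of `0` suggests that the exporter is not exposing GPU metrics, which is worth checking before you look at the Alloy configuration.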
145+
146+
## Create Cockpit dashboards in Grafana
147+
148+
### Create a GPU metrics dashboard
149+
150+
1. Access the **Overview** tab of your [Cockpit](https://console.scaleway.com/cockpit/overview) and click **Open dashboards** to open your Cockpit dashboards in Grafana.
151+
152+
2. Click the **+** icon in the top-right-hand corner, then click **Import dashboard**.
153+
154+
3. Copy the ID (`12219`) of the [Grafana NVIDIA DCGM Exporter dashboard](https://grafana.com/grafana/dashboards/12219-nvidia-dcgm-exporter-dashboard/) and paste it in the **Import via grafana.com** field.
155+
156+
4. Click **Load**.
157+
158+
5. Select your Prometheus data source named `gpu-instance-metrics`, then click **Import**.
159+
160+
You should see your dashboard with data such as **GPU Temperature** or **GPU Power Usage**.
161+
162+
<Message type="tip">
163+
If you see only an empty dashboard with the "Dashboard not Found" and "Access denied to this dashboard" errors, wait a few seconds and refresh the page. Your dashboard should then display.
164+
Alternatively, you can also click the **Menu** icon on the left, then on **Dashboards** and search through your dashboards. You should see your newly created dashboard.
165+
</Message>
166+
167+
### Create a CPU and disk metrics Cockpit dashboard in Grafana
168+
169+
1. Access the **Overview** tab of your [Cockpit](https://console.scaleway.com/cockpit/overview) and click **Open dashboards** to open your Cockpit dashboards in Grafana.
170+
171+
2. Click the **+** icon in the top-right-hand corner, then click **Import dashboard**.
172+
173+
3. Copy the ID (`1860`) of the [Node Exporter Full dashboard](https://grafana.com/grafana/dashboards/1860-node-exporter-full/) and paste it in the **Import via grafana.com** field.
174+
175+
4. Click **Load**.
176+
177+
5. Select your Prometheus data source named `gpu-instance-metrics`, then click **Import**.
178+
179+
You should now see your dashboard with data such as **CPU usage** and **Memory Usage**.
180+
181+
<Message type="tip">
182+
If you see only an empty dashboard with the "Dashboard not Found" and "Access denied to this dashboard" errors, wait a few seconds and refresh the page. Your dashboard should then display.
183+
If you still do not see any data, make sure that you select the `gpu-instance-metrics` in the **Datasource** dropdown list located in the top-left-hand corner.
184+
</Message>
185+
186+
<Message type="note">
187+
The current configuration of the Node Exporter agent does not include certain metrics, such as:
188+
- Swap used: How much swap space (virtual memory) is currently being used by the system.
189+
- Root FS used: How much of the root file system (main storage partition) is being used.
190+
</Message>
191+
192+
You can now find your newly created dashboards in your list of Cockpit dashboards in Grafana. Use them to monitor your GPU Instances' data and optimize your resources.
193+
194+
### Going further
195+
196+
- **Add more metrics to your dashboards**
197+
- Connect to your GPU Instance via SSH
198+
- Edit the `config.alloy` file and restart the agents using the `docker compose up` command
199+
- Update your Cockpit dashboards in Grafana
200+
201+
- **Create custom dashboards**
202+
- In Grafana, explore the metrics you have sent by clicking the **Menu** icon on the left, then **Explore**.
203+
- Select your custom data source named `gpu-instance-metrics` in the **Datasource** dropdown list located in the top-left-hand corner.
204+
- Click **Metrics browser**. You should see a list of metrics appear (for example, `DCGM_FI_DEV_GPU_TEMP` or `node_cpu_seconds_total`).
205+
- Write the desired query, click **Run query** to visualize data, and then **Add to dashboard** to add it to a new or existing dashboard.
206+
207+
## Troubleshooting
208+
209+
If you encounter any issues, make sure that you meet all the requirements listed at the beginning of this tutorial.
210+
211+
You can run `docker -v` in your terminal to check your Docker version. You should see an output similar to the following:
```
Docker version 24.0.6, build ed223bc820
```
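The same kind of check can be extended to the other tools this tutorial relies on. The following is a small sketch, assuming a POSIX shell. `check_command` is a hypothetical helper that only reports whether a command is available on your `PATH`.

```sh
# Hypothetical helper: report whether a command is installed.
check_command() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Example checks on the GPU Instance:
# check_command docker
# check_command nvidia-smi
```

To confirm that the Compose plugin is installed as well, you can run `docker compose version`.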
