gpu/README.md: 54 additions & 6 deletions
--- a/gpu/README.md
+++ b/gpu/README.md
@@ -200,13 +200,63 @@ spec:
           env:
             # add this env var, if using operator version 1.14.x
             - name: DD_ENABLE_NVML_DETECTION
-              value: "true"
+              value: "true"
             # add this env var, if using operator versions 1.14.x or 1.15.x
             - name: DD_COLLECT_GPU_TAGS
-              value: "true"
+              value: "true"
 ```
 
-For **mixed environments**, use the [DatadogAgentProfiles feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. In this case, it is not necessary to modify the DatadogAgent manifest. Instead, create a profile that enables the configuration on GPU nodes only:
+For **mixed environments**, use the [DatadogAgentProfiles (DAP) feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. Note that this feature is disabled by default, so it needs to be enabled. For more information, see [Enabling DatadogAgentProfiles](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md#enabling-datadogagentprofiles).
+
+Modifying the DatadogAgent manifest is necessary to enable certain features that are not supported by the DAP yet:
+- In the existing configuration, enable the `system-probe` container in the datadog-agent pods. Because the DAP feature does not yet support conditionally enabling containers, a feature that uses `system-probe` needs to be enabled for all Agent pods.
+  - You can check this by looking at the list of containers when running `kubectl describe pod <datadog-agent-pod-name> -n <namespace>`.
+  - Datadog recommends enabling the `oomKill` integration, as it is lightweight and does not require any additional configuration or cost.
+- Configure the Agent so that the NVIDIA container runtime exposes GPUs to the Agent.
+  - You can do this using environment variables or volume mounts, depending on whether the `accept-nvidia-visible-devices-as-volume-mounts` parameter is set to `true` or `false` in the NVIDIA container runtime configuration.
+  - Datadog recommends configuring the Agent both ways, as it reduces the chance of misconfiguration. There are no side effects to having both.
+- Expose the PodResources socket to the Agent to integrate with the Kubernetes Device Plugin.
+  - This needs to be done globally, as the DAP does not yet support conditional volume mounts.
+
+In summary, the changes that need to be applied to the DatadogAgent manifest are the following:
+
+```yaml
+spec:
+  features:
+    oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all Agent pods
+      enabled: true
+
+  override:
+    nodeAgent:
+      volumes:
+        - name: nvidia-devices
+          hostPath:
+            path: /dev/null
+        - name: pod-resources
+          hostPath:
+            path: /var/lib/kubelet/pod-resources
+      containers:
+        agent:
+          env:
+            - name: NVIDIA_VISIBLE_DEVICES
+              value: "all"
+          volumeMounts:
+            - name: nvidia-devices
+              mountPath: /dev/nvidia-visible-devices
+            - name: pod-resources
+              mountPath: /var/lib/kubelet/pod-resources
+        system-probe:
+          env:
+            - name: NVIDIA_VISIBLE_DEVICES
+              value: "all"
+          volumeMounts:
+            - name: nvidia-devices
+              mountPath: /dev/nvidia-visible-devices
+            - name: pod-resources
+              mountPath: /var/lib/kubelet/pod-resources
+```
+
+Once the DatadogAgent configuration is changed, create a profile that enables the GPU feature configuration on GPU nodes only:
 
 ```yaml
 apiVersion: datadoghq.com/v1alpha1
@@ -229,12 +279,10 @@ spec:
             env:
               - name: DD_GPU_MONITORING_ENABLED
                 value: "true"
-              # add this env var, if using operator version 1.14.x
           agent:
             env:
               - name: DD_ENABLE_NVML_DETECTION
-                value: "true"
-              # add this env var, if using operator versions 1.14.x or 1.15.x
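
The README text added in this diff suggests confirming that `system-probe` is running by reading the container list from `kubectl describe pod`. As a rough sketch of the same check in script form, the helper below tests whether a space-separated list of container names includes `system-probe`. The function name `has_system_probe` and the `POD` and `NAMESPACE` variables are hypothetical and do not come from the README. The commented usage assumes the standard `kubectl get pod -o jsonpath` output, which prints container names separated by spaces.

```shell
# Hypothetical helper (not part of the README): succeeds when the
# space-separated container list in $1 contains an entry named
# exactly "system-probe".
has_system_probe() {
  case " $1 " in
    *" system-probe "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage against a live cluster. POD and NAMESPACE are placeholders
# for the Agent pod name and its namespace:
#
#   containers=$(kubectl get pod "$POD" -n "$NAMESPACE" \
#     -o jsonpath='{.spec.containers[*].name}')
#   if has_system_probe "$containers"; then
#     echo "system-probe container present"
#   else
#     echo "system-probe container missing"
#   fi
```

Padding the list with spaces before matching keeps a name such as `system-probe-extra` from being counted as a match.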