Commit 42cf786

gjulianm and janine-c authored
gpu: fix operator deployment instructions (#20552)
* Update operator instructions

* Fix read-only paths

* Update gpu/README.md

Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>

* Update gpu/README.md

Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>

* Update gpu/README.md

Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>

---------

Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>
1 parent 5d74a46 commit 42cf786

1 file changed, +54 -6 lines changed

gpu/README.md

Lines changed: 54 additions & 6 deletions
@@ -200,13 +200,63 @@ spec:
           env:
             # add this env var, if using operator version 1.14.x
             - name: DD_ENABLE_NVML_DETECTION
-              value: "true"
+              value: "true"
             # add this env var, if using operator versions 1.14.x or 1.15.x
             - name: DD_COLLECT_GPU_TAGS
-              value: "true"
+              value: "true"
 ```
 
-For **mixed environments**, use the [DatadogAgentProfiles feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. In this case, it is not necessary to modify the DatadogAgent manifest. Instead, create a profile that enables the configuration on GPU nodes only:
+For **mixed environments**, use the [DatadogAgentProfiles (DAP) feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. Note that this feature is disabled by default, so it needs to be enabled. For more information, see [Enabling DatadogAgentProfiles](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md#enabling-datadogagentprofiles).
+
+Modifying the DatadogAgent manifest is necessary to enable certain features that are not supported by the DAP yet:
+- In the existing configuration, enable the `system-probe` container in the datadog-agent pods. Because the DAP feature does not yet support conditionally enabling containers, a feature that uses `system-probe` needs to be enabled for all Agent pods.
+  - You can check this by looking at the list of containers when running `kubectl describe pod <datadog-agent-pod-name> -n <namespace>`.
+  - Datadog recommends enabling the `oomKill` integration, as it is lightweight and does not require any additional configuration or cost.
+- Configure the Agent so that the NVIDIA container runtime exposes GPUs to the Agent.
+  - You can do this using environment variables or volume mounts, depending on whether the `accept-nvidia-visible-devices-as-volume-mounts` parameter is set to `true` or `false` in the NVIDIA container runtime configuration.
+  - Datadog recommends configuring the Agent both ways, as it reduces the chance of misconfiguration. There are no side effects to having both.
+- Expose the PodResources socket to the Agent to integrate with the Kubernetes Device Plugin.
+  - This needs to be done globally, as the DAP does not yet support conditional volume mounts.
+
+In summary, the changes that need to be applied to the DatadogAgent manifest are the following:
+
+```yaml
+spec:
+  features:
+    oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all Agent pods
+      enabled: true
+
+  override:
+    nodeAgent:
+      volumes:
+        - name: nvidia-devices
+          hostPath:
+            path: /dev/null
+        - name: pod-resources
+          hostPath:
+            path: /var/lib/kubelet/pod-resources
+      containers:
+        agent:
+          env:
+            - name: NVIDIA_VISIBLE_DEVICES
+              value: "all"
+          volumeMounts:
+            - name: nvidia-devices
+              mountPath: /dev/nvidia-visible-devices
+            - name: pod-resources
+              mountPath: /var/lib/kubelet/pod-resources
+        system-probe:
+          env:
+            - name: NVIDIA_VISIBLE_DEVICES
+              value: "all"
+          volumeMounts:
+            - name: nvidia-devices
+              mountPath: /dev/nvidia-visible-devices
+            - name: pod-resources
+              mountPath: /var/lib/kubelet/pod-resources
+```
+
+Once the DatadogAgent configuration is changed, create a profile that enables the GPU feature configuration on GPU nodes only:
 
 ```yaml
 apiVersion: datadoghq.com/v1alpha1
@@ -229,12 +279,10 @@ spec:
             env:
               - name: DD_GPU_MONITORING_ENABLED
                 value: "true"
-          # add this env var, if using operator version 1.14.x
           agent:
             env:
               - name: DD_ENABLE_NVML_DETECTION
-                value: "true"
-              # add this env var, if using operator versions 1.14.x or 1.15.x
+                value: "true"
               - name: DD_COLLECT_GPU_TAGS
                 value: "true"
 ```
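
The README text added in this commit suggests confirming that `system-probe` runs in the Agent pods by inspecting the output of `kubectl describe pod`. A scripted variant of that check might look like the sketch below. The `has_container` helper is hypothetical (it is not part of the README or of any Datadog tooling), and the pod name and namespace remain placeholders that must be filled in for a real cluster.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the README): succeeds when
# the target container name appears in a space-separated list of names.
has_container() {
  names="$1"
  target="$2"
  for name in $names; do
    if [ "$name" = "$target" ]; then
      return 0
    fi
  done
  return 1
}

# Example usage against a live cluster (replace the placeholders first):
#   names=$(kubectl get pod <datadog-agent-pod-name> -n <namespace> \
#     -o jsonpath='{.spec.containers[*].name}')
#   has_container "$names" system-probe && echo "system-probe is present"
```

The `jsonpath` query prints only the container names, which is easier to test in a script than scanning the full `kubectl describe pod` output.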
