
Commit 442e337

Merge pull request #3006 from replicatedhq/rm-containerd-svc-nvidia-operator
nvidia gpu operator extension: add caveat that any existing containerd svcs running on the host need to be removed
2 parents cad71f2 + ec19ca6 commit 442e337

1 file changed (+30 -9 lines)

docs/vendor/embedded-using.mdx

````diff
@@ -235,18 +235,39 @@ This section outlines some additional use cases for Embedded Cluster. These are
 
 ### NVIDIA GPU Operator
 
-The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. For more information about this operator, see the [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) documentation. You can include the operator in your release as an additional Helm chart, or using the Embedded Cluster Helm extensions. For information about Helm extensions, see [extensions](/reference/embedded-config#extensions) in _Embedded Cluster Config_.
+The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. For more information about this operator, see the [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) documentation.
 
-Using this operator with Embedded Cluster requires configuring the containerd options in the operator as follows:
+You can include the NVIDIA GPU Operator in your release as an additional Helm chart, or by using Embedded Cluster Helm extensions. For information about adding Helm extensions, see [extensions](/reference/embedded-config#extensions) in _Embedded Cluster Config_.
+
+Using the NVIDIA GPU Operator with Embedded Cluster requires configuring the containerd options in the operator as follows:
 
 ```yaml
-toolkit:
-  env:
-    - name: CONTAINERD_CONFIG
-      value: /etc/k0s/containerd.d/nvidia.toml
-    - name: CONTAINERD_SOCKET
-      value: /run/k0s/containerd.sock
-```
+# Embedded Cluster Config
+
+extensions:
+  helm:
+    repositories:
+      - name: nvidia
+        url: https://nvidia.github.io/gpu-operator
+    charts:
+      - name: gpu-operator
+        chartname: nvidia/gpu-operator
+        namespace: gpu-operator
+        version: "v24.9.1"
+        values: |
+          # configure the containerd options
+          toolkit:
+            env:
+              - name: CONTAINERD_CONFIG
+                value: /etc/k0s/containerd.d/nvidia.toml
+              - name: CONTAINERD_SOCKET
+                value: /run/k0s/containerd.sock
+```
+When the containerd options are configured as shown above, the NVIDIA GPU Operator automatically creates the required configurations in the `/etc/k0s/containerd.d/nvidia.toml` file. It is not necessary to create this file manually, or to modify any other configuration on the hosts.
+
+:::note
+If you include the NVIDIA GPU Operator as a Helm extension, remove any existing containerd services that are running on the host (such as those deployed by Docker) before attempting to install the release with Embedded Cluster. If there are any containerd services on the host, the NVIDIA GPU Operator will generate an invalid containerd config, causing the installation to fail.
+:::
 
 ## Troubleshoot with Support Bundles
 
````
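The note added above warns that pre-existing containerd services must be removed before installing. A minimal pre-install check along those lines, assuming a systemd host where the conflicting containerd service was deployed by Docker (unit names vary by distribution and install method):

```bash
# List containerd-related services currently running on the host.
systemctl list-units --type=service --state=running | grep -i containerd

# If Docker deployed containerd, stop and disable Docker and its
# containerd unit before running the Embedded Cluster installer.
sudo systemctl disable --now docker.socket docker.service containerd.service
```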

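After installation, one way to confirm that the operator generated the drop-in itself and that its workloads came up, assuming shell access to the host and a kubeconfig pointed at the cluster (`gpu-operator` is the namespace set in the chart values above):

```bash
# The toolkit writes this file automatically; do not create it by hand.
sudo cat /etc/k0s/containerd.d/nvidia.toml

# Check that the operator's pods are running in the configured namespace.
kubectl get pods --namespace gpu-operator
```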