
Commit c9442f1 (parent: 1ace8b2)

changes in response to kevin's last review

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

2 files changed: 17 additions, 7 deletions

gpu-operator/dra-cds.rst

Lines changed: 2 additions & 2 deletions
@@ -18,13 +18,13 @@ Motivation
 NVIDIA's `GB200 NVL72 <https://www.nvidia.com/en-us/data-center/gb200-nvl72/>`_ and comparable systems are designed specifically around Multi-Node NVLink (`MNNVL <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html>`_) to turn a rack of GPU machines -- each with a small number of GPUs -- into a supercomputer with a large number of GPUs communicating at high bandwidth (1.8 TB/s chip-to-chip, and over `130 TB/s cumulative bandwidth <https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/overview.html#fifth-generation-nvlink>`_ on a GB200 NVL72).

 NVIDIA's DRA Driver for GPUs enables MNNVL for Kubernetes workloads by introducing a new concept -- the **ComputeDomain**:
-when workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs performs all the heavy lifting required for sharing GPU memory **securely** via NVLink among all pods that comprise the workload.
+when a workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs performs all the heavy lifting required for sharing GPU memory **securely** via NVLink among all pods that comprise the workload.

 .. note::

    Users may appreciate to know that -- under the hood -- NVIDIA Internode Memory Exchange (`IMEX <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html#internode-memory-exchange-service>`_) primitives need to be orchestrated for mapping GPU memory over NVLink *securely*: IMEX provides an access control system to lock down GPU memory even between GPUs on the same NVLink partition.

-   A design goal of this DRA driver is to make IMEX, as much as possible, an implementation detail that workload authors and cluster operators do not need to be concerned with: the driver launches and/or reconfigures IMEX daemons and establishes and injects IMEX channels into containers as needed.
+   A design goal of this DRA driver is to make IMEX, as much as possible, an implementation detail that workload authors and cluster operators do not need to be concerned with: the driver launches and/or reconfigures IMEX daemons and establishes and injects `IMEX channels <https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/imexchannels.html>`_ into containers as needed.


 .. _dra-docs-cd-guarantees:
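
The changed paragraphs above describe the ComputeDomain workflow only in prose. As a purely illustrative sketch (resource names are placeholders, and the ``resource.nvidia.com/v1beta1`` API group and the ``numNodes``/``channel`` fields are assumed from the driver's upstream 25.3.0 examples, so they may differ in other releases), requesting a ComputeDomain for a two-node workload could look like this:

    $ kubectl apply -f - <<EOF
    apiVersion: resource.nvidia.com/v1beta1
    kind: ComputeDomain
    metadata:
      name: demo-compute-domain
    spec:
      # Number of nodes expected to join this ComputeDomain.
      numNodes: 2
      channel:
        resourceClaimTemplate:
          # Name of the ResourceClaimTemplate the driver generates; workload
          # pods reference it to get an IMEX channel injected.
          name: demo-compute-domain-channel
    EOF

Pods that reference the generated resource claim template are then wired up by the driver with the IMEX daemons and channels described in the note above.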

gpu-operator/dra-intro-install.rst

Lines changed: 15 additions & 5 deletions
@@ -49,8 +49,8 @@ Prerequisites

 - Kubernetes v1.32 or newer.
 - DRA and corresponding API groups must be enabled (`see Kubernetes docs <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#enabling-dynamic-resource-allocation>`_).
-- GPU Driver 565 or later.
-- NVIDIA's GPU Operator v25.3.0 or later, installed with CDI enabled (use the ``--set cdi.enabled=true`` commandline argument during ``helm install``). For reference, please refer to the GPU Operator `installation documentation <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options>`__.
+- NVIDIA GPU Driver 565 or later.
+- While not strictly required, we recommend using NVIDIA's GPU Operator v25.3.0 or later, installed with CDI enabled (use the ``--set cdi.enabled=true`` commandline argument during ``helm install``). For reference, please refer to the GPU Operator `installation documentation <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options>`__.

 ..
    For convenience, the following example shows how to enable CDI upon GPU Operator installation:
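
For the CDI prerequisite above, enabling CDI at GPU Operator install time could look roughly like the following (a sketch only: the release name, namespace, and chart selection are assumptions, and the linked installation documentation remains authoritative):

    $ helm install gpu-operator nvidia/gpu-operator \
        --create-namespace --namespace gpu-operator \
        --set cdi.enabled=true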
@@ -80,15 +80,25 @@ Configure and Helm-install the driver
       $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
           && helm repo update

-#. Install the driver, providing install-time configuration parameters. Example:
+#. Install the DRA driver, providing install-time configuration parameters.
+
+   Example for *Operator-provided* GPU driver:

    .. code-block:: console

       $ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
         --version="25.3.0-rc.4" \
-        --create-namespace \
-        --namespace nvidia-dra-driver-gpu \
+        --create-namespace --namespace nvidia-dra-driver-gpu \
+        --set resources.gpus.enabled=false \
         --set nvidiaDriverRoot=/run/nvidia/driver \
+
+   Example for *host-provided* GPU driver:
+
+   .. code-block:: console
+
+      $ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
+        --version="25.3.0-rc.4" \
+        --create-namespace --namespace nvidia-dra-driver-gpu \
         --set resources.gpus.enabled=false

 All install-time configuration parameters can be listed by running ``helm show values nvidia/nvidia-dra-driver-gpu``.
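
After running either ``helm install`` variant above, a quick sanity check is to confirm that the release exists and that the driver's pods come up in the chosen namespace (pod names and counts vary by cluster; shown only as an example check):

    $ helm list --namespace nvidia-dra-driver-gpu
    $ kubectl get pods --namespace nvidia-dra-driver-gpu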
