Commit d2766f5

Merge pull request #162 from a-mccarthy/dra-driver
adding dra docs
2 parents 834b5e7 + b9922fd commit d2766f5

5 files changed: +391 -1 lines changed

gpu-operator/dra-cds.rst

Lines changed: 232 additions & 0 deletions
.. license-header
  SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  SPDX-License-Identifier: Apache-2.0

##########################
NVIDIA DRA Driver for GPUs
##########################

.. _dra_docs_compute_domains:

********************************************
ComputeDomains: Multi-Node NVLink simplified
********************************************

Motivation
==========

NVIDIA's `GB200 NVL72 <https://www.nvidia.com/en-us/data-center/gb200-nvl72/>`_ and comparable systems are designed specifically around Multi-Node NVLink (`MNNVL <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html>`_) to turn a rack of GPU machines -- each with a small number of GPUs -- into a supercomputer with a large number of GPUs communicating at high bandwidth (1.8 TB/s chip-to-chip, and over `130 TB/s cumulative bandwidth <https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/overview.html#fifth-generation-nvlink>`_ on a GB200 NVL72).

NVIDIA's DRA Driver for GPUs enables MNNVL for Kubernetes workloads by introducing a new concept -- the **ComputeDomain**:
when a workload requests a ComputeDomain, the driver performs all the heavy lifting required for sharing GPU memory **securely** via NVLink among all pods that comprise the workload.

.. note::

   Users may appreciate knowing that -- under the hood -- NVIDIA Internode Memory Exchange (`IMEX <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html#internode-memory-exchange-service>`_) primitives need to be orchestrated to map GPU memory over NVLink *securely*: IMEX provides an access control system to lock down GPU memory even between GPUs on the same NVLink partition.

   A design goal of this DRA driver is to make IMEX, as much as possible, an implementation detail that workload authors and cluster operators do not need to be concerned with: the driver launches and/or reconfigures IMEX daemons and establishes and injects `IMEX channels <https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/imexchannels.html>`_ into containers as needed.

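For a concrete picture of what requesting a ComputeDomain looks like, below is a minimal, hypothetical sketch. The ``resource.nvidia.com`` API group and the ``spec.numNodes`` field appear elsewhere on this page; the exact API versions, the ``channel``/``resourceClaimTemplate`` field names, and the pod-side claim wiring are assumptions -- consult the driver's release notes and API reference for the authoritative schema.

.. code-block:: yaml

   # Hypothetical sketch: a two-node ComputeDomain and a pod that joins it.
   # Field names other than spec.numNodes are assumptions.
   apiVersion: resource.nvidia.com/v1beta1        # assumed API version
   kind: ComputeDomain
   metadata:
     name: example-compute-domain
   spec:
     numNodes: 2
     channel:
       resourceClaimTemplate:
         name: example-compute-domain-channel     # claim template assumed to be created by the driver
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: example-workload-pod
   spec:
     resourceClaims:
     - name: compute-domain-channel
       resourceClaimTemplateName: example-compute-domain-channel
     containers:
     - name: cuda-app
       image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04   # placeholder image
       command: ["sleep", "infinity"]
       resources:
         claims:
         - name: compute-domain-channel

Once pods referencing the claim are scheduled, the driver forms the domain around the chosen nodes and injects the required IMEX channel into the containers, as described in the note above.
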
.. _dra-docs-cd-guarantees:

Guarantees
==========

By design, an individual ComputeDomain guarantees:

#. **MNNVL-reachability** between pods that are in the domain.
#. **Secure isolation** from pods that are not in the domain and live in a different Kubernetes namespace.

In terms of lifetime, a ComputeDomain is ephemeral: its lifetime is bound to the lifetime of the consuming workload.
In terms of placement, our design choice is that a ComputeDomain follows the workload.

That means that once workload pods requesting a ComputeDomain are scheduled onto specific nodes, the domain automatically forms around them.
Upon workload completion, all ComputeDomain-associated resources are torn down automatically.

For more detail on the security properties of a ComputeDomain, see `Security <dra-docs-cd-security_>`__.

A deeper dive: related resources
================================

For more background on how ComputeDomains facilitate orchestrating MNNVL workloads on Kubernetes, see `this doc <https://docs.google.com/document/d/1PrdDofsPFVJuZvcv-vtlI9n2eAh-YVf_fRQLIVmDwVY/edit?tab=t.0#heading=h.qkogm924v5so>`_ and `this slide deck <https://docs.google.com/presentation/d/1Xupr8IZVAjs5bNFKJnYaK0LE7QWETnJjkz6KOfLu87E/edit?pli=1&slide=id.g28ac369118f_0_1647#slide=id.g28ac369118f_0_1647>`_.
For an outlook on planned improvements to the ComputeDomain concept, please refer to `this document <https://github.com/NVIDIA/k8s-dra-driver-gpu/releases/tag/v25.3.0-rc.3>`_.

Details about IMEX and its relationship to NVLink can be found in `NVIDIA's IMEX guide <https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html>`_ and in `NVIDIA's NVLink guide <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html#internode-memory-exchange-service>`_.
The CUDA API documentation for `cuMemCreate <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html#group__CUDA__VA_1g899d69a862bba36449789c64b430dc7c>`_ provides a starting point for learning how to share GPU memory via IMEX/NVLink.
If you are looking for a higher-level GPU communication library, versions of `NVIDIA's NCCL <https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/nccl.html>`_ newer than 2.25 support MNNVL.

Usage example: a multi-node nvbandwidth test
============================================

This example demonstrates how to run an MNNVL workload across multiple nodes using a ComputeDomain (CD).
As an example CUDA workload that performs MNNVL communication, we have picked `nvbandwidth <https://github.com/NVIDIA/nvbandwidth>`_.
Since nvbandwidth requires MPI, we also install the `Kubeflow MPI Operator <https://github.com/kubeflow/mpi-operator>`_ below.

**Steps:**

#. Install the MPI Operator.

   .. code-block:: console

      $ kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.6.0/mpi-operator.yaml

#. Create a test job file called ``nvbandwidth-test-job.yaml``.
   To do that, follow `this part of the CD validation instructions <https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-ComputeDomain-allocation#create-the-spec-file>`_.
   This example is configured to run across two nodes, using four GPUs per node.
   If you want to use different numbers, adjust the parameters in the spec according to the table below (an abridged sketch of the spec follows the table):

   .. list-table::
      :header-rows: 1

      * - Parameter
        - Value (in example)

      * - ``ComputeDomain.spec.numNodes``
        - Total number of nodes to use in the test (2).

      * - ``MPIJob.spec.slotsPerWorker``
        - Number of GPUs per node to use -- this must match the ``ppr`` number below (4).

      * - ``MPIJob.spec.mpiReplicaSpecs.Worker.replicas``
        - Also set this to the number of nodes (2).

      * - ``mpirun`` command argument ``-ppr:4:node``
        - Set this to the number of GPUs to use per node (4).

      * - ``mpirun`` command argument ``-np`` value
        - Set this to the total number of GPUs in the test (8).

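   For orientation, the abridged and partly hypothetical sketch below shows where these parameters live in the spec. The authoritative spec is the one from the validation instructions linked above; API versions and field names not listed in the table (for example the ``channel`` block) are assumptions.

   .. code-block:: yaml

      # Abridged, partly hypothetical sketch of nvbandwidth-test-job.yaml.
      # Only the parameters from the table above are shown in context.
      apiVersion: resource.nvidia.com/v1beta1   # assumed API version
      kind: ComputeDomain
      metadata:
        name: nvbandwidth-test-compute-domain
      spec:
        numNodes: 2                             # total number of nodes in the test
        channel:
          resourceClaimTemplate:
            name: nvbandwidth-test-compute-domain-channel
      ---
      apiVersion: kubeflow.org/v2beta1          # assumed MPI Operator API version
      kind: MPIJob
      metadata:
        name: nvbandwidth-test
      spec:
        slotsPerWorker: 4                       # GPUs per node; must match -ppr:4:node
        mpiReplicaSpecs:
          Launcher:
            replicas: 1
            # The launcher container runs mpirun with -np 8 (total GPUs)
            # and -ppr:4:node (GPUs per node).
          Worker:
            replicas: 2                         # one worker pod per node
            # Worker pods request four GPUs each and reference the ComputeDomain
            # channel claim generated from the template named above.
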
#. Apply the manifest.

   .. code-block:: console

      $ kubectl apply -f nvbandwidth-test-job.yaml

   *Example Output*

   .. code-block:: output

      computedomain.resource.nvidia.com/nvbandwidth-test-compute-domain configured
      mpijob.kubeflow.org/nvbandwidth-test configured

#. Verify that the nvbandwidth pods were created.

   .. code-block:: console

      $ kubectl get pods

   *Example Output*

   .. code-block:: output

      NAME                              READY   STATUS    RESTARTS   AGE
      nvbandwidth-test-launcher-lzv84   1/1     Running   0          8s
      nvbandwidth-test-worker-0         1/1     Running   0          15s
      nvbandwidth-test-worker-1         1/1     Running   0          15s

#. Verify that the ComputeDomain pods were created for each node.

   .. code-block:: console

      $ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain

   *Example Output*

   .. code-block:: output

      NAME                                          READY   STATUS    RESTARTS   AGE
      nvbandwidth-test-compute-domain-ht24d-9jhmj   1/1     Running   0          20s
      nvbandwidth-test-compute-domain-ht24d-rcn2c   1/1     Running   0          20s

#. Verify the nvbandwidth test output.

   .. code-block:: console

      $ kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher

   *Example Output*

   .. code-block:: output

      Warning: Permanently added '[nvbandwidth-test-worker-0.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
      Warning: Permanently added '[nvbandwidth-test-worker-1.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
      [nvbandwidth-test-worker-0:00025] MCW rank 0 bound to socket 0[core 0[hwt 0]]:

      [...]

      [nvbandwidth-test-worker-1:00025] MCW rank 7 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
      nvbandwidth Version: v0.7
      Built from Git version: v0.7

      MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
      CUDA Runtime Version: 12080
      CUDA Driver Version: 12080
      Driver Version: 570.124.06

      Process 0 (nvbandwidth-test-worker-0): device 0: HGX GB200 (00000008:01:00)
      Process 1 (nvbandwidth-test-worker-0): device 1: HGX GB200 (00000009:01:00)
      Process 2 (nvbandwidth-test-worker-0): device 2: HGX GB200 (00000018:01:00)
      Process 3 (nvbandwidth-test-worker-0): device 3: HGX GB200 (00000019:01:00)
      Process 4 (nvbandwidth-test-worker-1): device 0: HGX GB200 (00000008:01:00)
      Process 5 (nvbandwidth-test-worker-1): device 1: HGX GB200 (00000009:01:00)
      Process 6 (nvbandwidth-test-worker-1): device 2: HGX GB200 (00000018:01:00)
      Process 7 (nvbandwidth-test-worker-1): device 3: HGX GB200 (00000019:01:00)

      Running multinode_device_to_device_memcpy_read_ce.
      memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
                0       1       2       3       4       5       6       7
       0      N/A  798.02  798.25  798.02  798.02  797.88  797.73  797.95
       1   798.10     N/A  797.80  798.02  798.02  798.25  797.88  798.02
       2   797.95  797.95     N/A  797.73  797.80  797.95  797.95  797.65
       3   798.10  798.02  797.95     N/A  798.02  798.10  797.88  797.73
       4   797.80  798.02  798.02  798.02     N/A  797.95  797.80  798.02
       5   797.80  797.95  798.10  798.10  797.95     N/A  797.95  797.88
       6   797.73  797.95  798.10  798.02  797.95  797.88     N/A  797.80
       7   797.88  798.02  797.95  798.02  797.88  797.95  798.02     N/A

      SUM multinode_device_to_device_memcpy_read_ce 44685.29

      NOTE: The reported results may not reflect the full capabilities of the platform.

#. Clean up.

   .. code-block:: console

      $ kubectl delete -f nvbandwidth-test-job.yaml

.. _dra-docs-cd-security:

Security
========

As indicated in `Guarantees <dra-docs-cd-guarantees_>`__, the ComputeDomain primitive provides a *security boundary.* This section helps clarify why that boundary is needed and how it works.

NVLink enables mapping a remote GPU's memory into a local GPU's address space (so that it can be read from and written to with regular CUDA API calls).
From a security point of view, that raises the question: can a process running on a GPU in a certain NVLink partition freely read and mutate the memory of other GPUs in the same NVLink partition -- or is there an access control layer in between?

IMEX has been introduced specifically as that layer of access control.
It is a means of providing secure isolation between GPUs that are in the same NVLink partition.
With IMEX, every individual GPU memory export/import operation is subject to fine-grained access control.

To understand ComputeDomains, we additionally need to know:

- The ComputeDomain security boundary is implemented with IMEX.
- A job submitted to Kubernetes namespace `A` cannot be part of a ComputeDomain created for namespace `B`.

That is, ComputeDomains (only) promise robust IMEX-based isolation between jobs that are **not** part of the same Kubernetes namespace.
If a bad actor has access to a Kubernetes namespace, they may be able to mutate ComputeDomains (and, as such, IMEX primitives) in that Kubernetes namespace.
That, in turn, may allow them to disable or trivially work around IMEX access control.

With ComputeDomains, the overall ambition is that the security isolation between jobs in different Kubernetes namespaces is strong enough to responsibly allow for multi-tenant environments in which compute jobs that conceptually cannot trust each other are "only" separated by the Kubernetes namespace boundary.

Additional remarks
==================

We are planning to extend the documentation for ComputeDomains, with a focus on API reference documentation, known limitations, best practices, and security.

As we iterate on design and implementation, we are particularly interested in your feedback -- please reach out via the issue tracker or discussion forum in the `GitHub repository <https://github.com/NVIDIA/k8s-dra-driver-gpu>`_.

gpu-operator/dra-gpus.rst

Lines changed: 32 additions & 0 deletions
.. license-header
  SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  SPDX-License-Identifier: Apache-2.0

##########################
NVIDIA DRA Driver for GPUs
##########################

.. _dra_docs_gpus:

**************
GPU allocation
**************

Compared to `traditional GPU allocation <https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins/>`_ using coarse-grained count-based requests, the GPU allocation side of this driver enables fine-grained control and powerful features long desired by the community, such as:

#. Controlled sharing of individual GPUs between multiple pods and/or containers.
#. GPU selection via complex constraints expressed in `CEL <https://kubernetes.io/docs/reference/using-api/cel/>`_, as sketched below.
#. Dynamic partitioning.

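For illustration, the following is a minimal, hypothetical sketch of a ``ResourceClaimTemplate`` that selects an NVIDIA GPU via a CEL expression. The ``gpu.nvidia.com`` device class name is the one commonly used by this driver, but the DRA API version and the attribute name in the expression are assumptions -- check the driver's README and your cluster's DRA API version before relying on them.

.. code-block:: yaml

   # Hypothetical sketch: request one GPU whose advertised attributes satisfy
   # a CEL constraint. The attribute name below is illustrative only.
   apiVersion: resource.k8s.io/v1beta1            # assumed DRA API version
   kind: ResourceClaimTemplate
   metadata:
     name: single-gpu
   spec:
     spec:
       devices:
         requests:
         - name: gpu
           deviceClassName: gpu.nvidia.com
           selectors:
           - cel:
               expression: "device.attributes['gpu.nvidia.com'].productName.startsWith('NVIDIA A100')"

A pod would then consume this template through the standard ``spec.resourceClaims`` and ``resources.claims`` fields.
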
To learn more about this part of the driver and about what we are planning to build in the future, have a look at `these release notes <https://github.com/NVIDIA/k8s-dra-driver-gpu/releases/tag/v25.3.0-rc.3>`_.

While the GPU allocation features of this driver can be tried out, they are not yet officially supported.
Hence, the GPU kubelet plugin is currently disabled by default in the Helm chart installation.

For documentation on how to use and test the current set of GPU allocation features, please head over to the `demo section <https://github.com/NVIDIA/k8s-dra-driver-gpu?tab=readme-ov-file#a-kind-demo>`_ of the driver's README and to its `quickstart directory <https://github.com/NVIDIA/k8s-dra-driver-gpu/tree/main/demo/specs/quickstart>`_.

.. note::

   This part of the NVIDIA DRA Driver for GPUs is in **Technology Preview**.
   It is not yet supported in production environments and not yet functionally complete.
   Generally speaking, Technology Preview features provide early access to upcoming product features, enabling users to test functionality and provide feedback during the development process.
   Technology Preview releases may not have full documentation, and testing is limited.
