.. license-header
  SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  SPDX-License-Identifier: Apache-2.0

##########################
NVIDIA DRA Driver for GPUs
##########################

.. _dra_docs_compute_domains:

********************************************
ComputeDomains: Multi-Node NVLink simplified
********************************************

Motivation
==========

NVIDIA's `GB200 NVL72 <https://www.nvidia.com/en-us/data-center/gb200-nvl72/>`_ and comparable systems are designed specifically around Multi-Node NVLink (`MNNVL <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html>`_) to turn a rack of GPU machines -- each with a small number of GPUs -- into a supercomputer with a large number of GPUs communicating at high bandwidth (1.8 TB/s chip-to-chip, and over `130 TB/s cumulative bandwidth <https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/overview.html#fifth-generation-nvlink>`_ on a GB200 NVL72).

NVIDIA's DRA Driver for GPUs enables MNNVL for Kubernetes workloads by introducing a new concept -- the **ComputeDomain**:
when a workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs performs all the heavy lifting required for sharing GPU memory **securely** via NVLink among all pods that comprise the workload.
| 22 | + |
| 23 | +.. note:: |
| 24 | + |
| 25 | + Users may appreciate to know that -- under the hood -- NVIDIA Internode Memory Exchange (`IMEX <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html#internode-memory-exchange-service>`_) primitives need to be orchestrated for mapping GPU memory over NVLink *securely*: IMEX provides an access control system to lock down GPU memory even between GPUs on the same NVLink partition. |
| 26 | + |
| 27 | + A design goal of this DRA driver is to make IMEX, as much as possible, an implementation detail that workload authors and cluster operators do not need to be concerned with: the driver launches and/or reconfigures IMEX daemons and establishes and injects `IMEX channels <https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/imexchannels.html>`_ into containers as needed. |
| 28 | + |
| 29 | + |
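To make that injection tangible: per NVIDIA's IMEX guide, an IMEX channel surfaces inside a container as a device node under ``/dev/nvidia-caps-imex-channels``. The following is an illustrative, hypothetical session -- the pod name is a placeholder, and the channel number depends on the allocation:

.. code-block:: console

   $ kubectl exec <workload-pod> -- ls /dev/nvidia-caps-imex-channels
   channel0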
.. _dra-docs-cd-guarantees:

Guarantees
==========

By design, an individual ComputeDomain guarantees

#. **MNNVL-reachability** between pods that are in the domain.
#. **secure isolation** from pods that are not in the domain and reside in a different Kubernetes namespace.

In terms of lifetime, a ComputeDomain is ephemeral: its lifetime is bound to the lifetime of the consuming workload.
In terms of placement, our design choice is that a ComputeDomain follows the workload.

That means: once workload pods that request a ComputeDomain get scheduled onto specific nodes, the domain automatically forms around them.
Upon workload completion, all ComputeDomain-associated resources get torn down automatically.

For more detail on the security properties of a ComputeDomain, see `Security <dra-docs-cd-security_>`__.

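For orientation, requesting a ComputeDomain looks roughly like the following sketch: a ``ComputeDomain`` object declares the domain, and workload pods join it by claiming the resource claim template named in its spec. This is a minimal, hypothetical fragment -- the names are placeholders and the API version is an assumption, so treat it as an illustration rather than a drop-in manifest:

.. code-block:: yaml

   apiVersion: resource.nvidia.com/v1beta1
   kind: ComputeDomain
   metadata:
     name: example-compute-domain      # hypothetical name
   spec:
     numNodes: 2                       # number of nodes the domain spans
     channel:
       resourceClaimTemplate:
         name: example-cd-channel      # pods reference a claim from this template

Workload pods that claim ``example-cd-channel`` then become part of the domain once scheduled; the usage example below walks through a complete, working manifest.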

A deeper dive: related resources
================================

For more background on how ComputeDomains facilitate orchestrating MNNVL workloads on Kubernetes, see `this doc <https://docs.google.com/document/d/1PrdDofsPFVJuZvcv-vtlI9n2eAh-YVf_fRQLIVmDwVY/edit?tab=t.0#heading=h.qkogm924v5so>`_ and `this slide deck <https://docs.google.com/presentation/d/1Xupr8IZVAjs5bNFKJnYaK0LE7QWETnJjkz6KOfLu87E/edit?pli=1&slide=id.g28ac369118f_0_1647#slide=id.g28ac369118f_0_1647>`_.
For an outlook on planned improvements to the ComputeDomain concept, please refer to `this document <https://github.com/NVIDIA/k8s-dra-driver-gpu/releases/tag/v25.3.0-rc.3>`_.

Details about IMEX and its relationship to NVLink can be found in `NVIDIA's IMEX guide <https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html>`_ and in `NVIDIA's NVLink guide <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html#internode-memory-exchange-service>`_.
The CUDA API documentation for `cuMemCreate <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html#group__CUDA__VA_1g899d69a862bba36449789c64b430dc7c>`_ provides a starting point for learning how to share GPU memory via IMEX/NVLink.
If you are looking for a higher-level GPU communication library, versions of `NVIDIA's NCCL <https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/nccl.html>`_ newer than 2.25 support MNNVL.


Usage example: a multi-node nvbandwidth test
============================================

This example demonstrates how to run an MNNVL workload across multiple nodes using a ComputeDomain (CD).
As an example CUDA workload that performs MNNVL communication, we have picked `nvbandwidth <https://github.com/NVIDIA/nvbandwidth>`_.
Since nvbandwidth requires MPI, below we also install the `Kubeflow MPI Operator <https://github.com/kubeflow/mpi-operator>`_.

**Steps:**

#. Install the MPI Operator.

   .. code-block:: console

      $ kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.6.0/mpi-operator.yaml

#. Create a test job file called ``nvbandwidth-test-job.yaml``.
   To do that, follow `this part of the CD validation instructions <https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-ComputeDomain-allocation#create-the-spec-file>`_.
   This example is configured to run across two nodes, using four GPUs per node.
   If you want to use different numbers, adjust the parameters in the spec according to the table below:

   .. list-table::
      :header-rows: 1

      * - Parameter
        - Value (in example)

      * - ``ComputeDomain.spec.numNodes``
        - Total number of nodes to use in the test (2).

      * - ``MPIJob.spec.slotsPerWorker``
        - Number of GPUs per node to use -- this must match the ``ppr`` number below (4).

      * - ``MPIJob.spec.mpiReplicaSpecs.Worker.replicas``
        - Also set this to the number of nodes (2).

      * - ``mpirun`` command argument ``-ppr:4:node``
        - Set this to the number of GPUs to use per node (4).

      * - ``mpirun`` command argument ``-np`` value
        - Set this to the total number of GPUs in the test (8).

#. Apply the manifest.

   .. code-block:: console

      $ kubectl apply -f nvbandwidth-test-job.yaml

   *Example Output*

   .. code-block:: output

      computedomain.resource.nvidia.com/nvbandwidth-test-compute-domain configured
      mpijob.kubeflow.org/nvbandwidth-test configured

#. Verify that the nvbandwidth pods were created.

   .. code-block:: console

      $ kubectl get pods

   *Example Output*

   .. code-block:: output

      NAME                              READY   STATUS    RESTARTS   AGE
      nvbandwidth-test-launcher-lzv84   1/1     Running   0          8s
      nvbandwidth-test-worker-0         1/1     Running   0          15s
      nvbandwidth-test-worker-1         1/1     Running   0          15s

#. Verify that the ComputeDomain pods were created for each node.

   .. code-block:: console

      $ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain

   *Example Output*

   .. code-block:: output

      NAME                                          READY   STATUS    RESTARTS   AGE
      nvbandwidth-test-compute-domain-ht24d-9jhmj   1/1     Running   0          20s
      nvbandwidth-test-compute-domain-ht24d-rcn2c   1/1     Running   0          20s

#. Verify the nvbandwidth test output.

   .. code-block:: console

      $ kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher

   *Example Output*

   .. code-block:: output

      Warning: Permanently added '[nvbandwidth-test-worker-0.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
      Warning: Permanently added '[nvbandwidth-test-worker-1.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
      [nvbandwidth-test-worker-0:00025] MCW rank 0 bound to socket 0[core 0[hwt 0]]:

      [...]

      [nvbandwidth-test-worker-1:00025] MCW rank 7 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
      nvbandwidth Version: v0.7
      Built from Git version: v0.7

      MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
      CUDA Runtime Version: 12080
      CUDA Driver Version: 12080
      Driver Version: 570.124.06

      Process 0 (nvbandwidth-test-worker-0): device 0: HGX GB200 (00000008:01:00)
      Process 1 (nvbandwidth-test-worker-0): device 1: HGX GB200 (00000009:01:00)
      Process 2 (nvbandwidth-test-worker-0): device 2: HGX GB200 (00000018:01:00)
      Process 3 (nvbandwidth-test-worker-0): device 3: HGX GB200 (00000019:01:00)
      Process 4 (nvbandwidth-test-worker-1): device 0: HGX GB200 (00000008:01:00)
      Process 5 (nvbandwidth-test-worker-1): device 1: HGX GB200 (00000009:01:00)
      Process 6 (nvbandwidth-test-worker-1): device 2: HGX GB200 (00000018:01:00)
      Process 7 (nvbandwidth-test-worker-1): device 3: HGX GB200 (00000019:01:00)

      Running multinode_device_to_device_memcpy_read_ce.
      memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
                0         1         2         3         4         5         6         7
      0       N/A    798.02    798.25    798.02    798.02    797.88    797.73    797.95
      1    798.10       N/A    797.80    798.02    798.02    798.25    797.88    798.02
      2    797.95    797.95       N/A    797.73    797.80    797.95    797.95    797.65
      3    798.10    798.02    797.95       N/A    798.02    798.10    797.88    797.73
      4    797.80    798.02    798.02    798.02       N/A    797.95    797.80    798.02
      5    797.80    797.95    798.10    798.10    797.95       N/A    797.95    797.88
      6    797.73    797.95    798.10    798.02    797.95    797.88       N/A    797.80
      7    797.88    798.02    797.95    798.02    797.88    797.95    798.02       N/A

      SUM multinode_device_to_device_memcpy_read_ce 44685.29

      NOTE: The reported results may not reflect the full capabilities of the platform.

#. Clean up.

   .. code-block:: console

      $ kubectl delete -f nvbandwidth-test-job.yaml

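The parameters from the table above map onto the two objects in the spec file roughly as follows. This is a hypothetical excerpt, not the complete manifest from the validation instructions -- only the table's fields are shown, the API versions are assumptions, and all other required fields are omitted:

.. code-block:: yaml

   apiVersion: resource.nvidia.com/v1beta1
   kind: ComputeDomain
   metadata:
     name: nvbandwidth-test-compute-domain
   spec:
     numNodes: 2                 # total number of nodes in the test
   ---
   apiVersion: kubeflow.org/v2beta1
   kind: MPIJob
   metadata:
     name: nvbandwidth-test
   spec:
     slotsPerWorker: 4           # GPUs per node; must match the ppr value
     mpiReplicaSpecs:
       Worker:
         replicas: 2             # set to the number of nodes
   # In the launcher's mpirun command (not shown here):
   #   -np 8          total number of GPUs
   #   -ppr:4:node    GPUs per node

To scale the test, the values must be changed consistently: ``numNodes`` and ``replicas`` together, and ``slotsPerWorker``, ``ppr``, and ``-np`` together (with ``-np`` equal to nodes times GPUs per node).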
.. _dra-docs-cd-security:

Security
========

As indicated in `Guarantees <dra-docs-cd-guarantees_>`__, the ComputeDomain primitive provides a *security boundary*. This section helps clarify why that boundary is needed and how it works.

NVLink enables mapping a remote GPU's memory into a "local" GPU's memory (so that it can be read from and written to with regular CUDA API calls).
From a security point of view, that raises the question: can a process running on a GPU in a certain NVLink partition freely read and mutate the memory of other GPUs in the same NVLink partition -- or is there some access control layer in between?

IMEX has been introduced specifically as that layer of access control.
It is a means for providing secure isolation between GPUs that are in the same NVLink partition.
With IMEX, every individual GPU memory export/import operation is subject to fine-grained access control.

To understand ComputeDomains, we additionally need to know:

- The ComputeDomain security boundary is implemented with IMEX.
- A job submitted to Kubernetes namespace `A` cannot be part of a ComputeDomain created for namespace `B`.

That is, ComputeDomains (only) promise robust IMEX-based isolation between jobs that are **not** part of the same Kubernetes namespace.
If a bad actor has access to a Kubernetes namespace, they may be able to mutate ComputeDomains (and, as such, IMEX primitives) in that namespace.
That, in turn, may allow them to disable or trivially work around IMEX access control.

With ComputeDomains, the overall ambition is that the security isolation between jobs in different Kubernetes namespaces is strong enough to responsibly allow for multi-tenant environments in which compute jobs that conceptually cannot trust each other are "only" separated by the Kubernetes namespace boundary.


Additional remarks
==================

We are planning to extend the documentation for ComputeDomains, with a focus on API reference documentation, known limitations, best practices, and security.

As we iterate on design and implementation, we are particularly interested in receiving your feedback -- please reach out via the issue tracker or discussion forum in the `GitHub repository <https://github.com/NVIDIA/k8s-dra-driver-gpu>`_.