Clarify GPU Operator requirements for OpenShift Virtualization (#213)
- Clarify that VFIO Manager, Sandbox Device Plugin, and Sandbox Validator are required
for GPU Operator approach but not needed when using Red Hat's alternative procedures
- Add links to Red Hat's PCI passthrough and vGPU configuration procedures
- Clarify cluster policy configuration steps for GPU passthrough and vGPU workflows
- Update CLI and web console configuration sections with all required parameters
- Fix typo: sandboxWorloads -> sandboxWorkloads
Signed-off-by: Vitaliy Emporopulo <[email protected]>
openshift/openshift-virtualization.rst
@@ -15,15 +15,15 @@ Introduction

 There is a growing demand among Red Hat customers to use virtual GPUs (NVIDIA vGPU)
 with Red Hat OpenShift Virtualization.
 Red Hat OpenShift Virtualization is based on KubeVirt, a virtual machine (VM) management add-on to Kubernetes that allows you to run and manage VMs in a Kubernetes cluster.
 It eliminates the need to manage separate clusters for VM and container workloads, as both can now coexist in a single Kubernetes cluster.

 Red Hat OpenShift Virtualization is an OpenShift feature to run virtual machines (VMs) orchestrated by OpenShift (Kubernetes).

 In addition to the GPU Operator being able to provision worker nodes for running GPU-accelerated containers, the GPU Operator can also be used to provision worker nodes for running GPU-accelerated virtual machines.

 There are different prerequisites for running virtual machines with GPU(s) than for running containers with GPU(s).
 The primary difference is the drivers required.
 For example, the datacenter driver is needed for containers, the vfio-pci driver is needed for GPU passthrough, and the `NVIDIA vGPU Manager <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#installing-configuring-grid-vgpu>`_ is needed for creating vGPU devices.
@@ -48,18 +48,38 @@ Node A receives the following software components:

 * ``NVIDIA Kubernetes Device Plugin`` - To discover and advertise GPU resources to the kubelet.
 * ``NVIDIA DCGM and DCGM Exporter`` - To monitor the GPU(s).

-Node B receives the following software components:
-
-* ``VFIO Manager`` - Optional. To load vfio-pci and bind it to all GPUs on the node.
-* ``Sandbox Device Plugin`` - Optional. To discover and advertise the passthrough GPUs to the kubelet.
-* ``Sandbox Validator`` -Optional. Validates that Sandbox Device Plugin is working.
-
-Node C receives the following software components:
+There are two approaches to configuring GPU passthrough and vGPU for virtual machines:
+
+1. **NVIDIA GPU Operator approach** - Uses the GPU Operator to deploy and manage GPU software components.
+2. **Red Hat OpenShift Virtualization approach** - Uses Red Hat OpenShift Virtualization's native procedures, which are tested and supported by Red Hat.
+
+Node B (GPU Passthrough) receives the following software components:
+
+**NVIDIA GPU Operator approach:**
+
+* ``VFIO Manager`` - To load vfio-pci and bind it to all GPUs on the node.
+* ``Sandbox Device Plugin`` - To discover and advertise the passthrough GPUs to the kubelet.
+* ``Sandbox Validator`` - Validates that the Sandbox Device Plugin is working.
+
+**Red Hat OpenShift Virtualization approach:**
+
+* Uses Red Hat OpenShift Virtualization's `PCI passthrough configuration <https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html-single/virtualization/index#virt-configuring-pci-passthrough>`_.
+* When using this approach, the NVIDIA GPU Operator's operands must be disabled on the node to avoid conflicts.
+
+Node C (vGPU) receives the following software components:
+
+**NVIDIA GPU Operator approach:**

 * ``NVIDIA vGPU Manager`` - To install the driver.
 * ``NVIDIA vGPU Device Manager`` - To create vGPU devices on the node.
-* ``Sandbox Device Plugin`` -Optional. To discover and advertise the vGPU devices to kubelet.
-* ``Sandbox Validator`` -Optional. Validates that Sandbox Device Plugin is working.
+* ``Sandbox Device Plugin`` - To discover and advertise the vGPU devices to the kubelet.
+* ``Sandbox Validator`` - Validates that the Sandbox Device Plugin is working.
+
+**Red Hat OpenShift Virtualization approach:**
+
+* Uses Red Hat OpenShift Virtualization's `vGPU configuration <https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html-single/virtualization/index#virt-configuring-virtual-gpus>`_.
+* Relies on OpenShift Virtualization's capabilities to configure mediated devices.
+* The NVIDIA GPU Operator is only used for installing drivers with the NVIDIA vGPU Manager. The GPU Operator does not configure mediated devices.
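With the GPU Operator approach, the per-node provisioning described above is selected through the ``nvidia.com/gpu.workload.config`` node label (discussed later in this file). As a rough sketch of how nodes A, B, and C might be labeled — the node names are hypothetical, and the ``container``/``vm-vgpu`` values are assumed from the GPU Operator's workload naming (only ``vm-passthrough`` appears explicitly in this diff):

```shell
# Hypothetical node names; each label value selects which software stack
# the GPU Operator provisions on that node.
oc label node worker-0 --overwrite nvidia.com/gpu.workload.config=container       # Node A: containers
oc label node worker-1 --overwrite nvidia.com/gpu.workload.config=vm-passthrough  # Node B: GPU passthrough
oc label node worker-2 --overwrite nvidia.com/gpu.workload.config=vm-vgpu         # Node C: vGPU
```

These commands require a live cluster and an authenticated ``oc`` session, so they are illustrative rather than copy-paste ready.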
Assumptions, constraints, and dependencies
@@ -246,7 +266,7 @@ Use the following steps to build the vGPU Manager container and push it to a private registry
@@ -330,15 +350,18 @@ Create the cluster policy using the CLI:

 #. Modify the ``clusterpolicy.json`` file as follows:

-   * sandboxWorloads.enabled=true
-   * vgpuManager.enabled=true
-   * vgpuManager.repository=<path to private repository>
-   * vgpuManager.image=vgpu-manager
-   * vgpuManager.version=<driver version>
-   * vgpuManager.imagePullSecrets={<name of image pull secret>}
-
-   The ``vgpuManager`` options are only required if you want to use the NVIDIA vGPU. If you are only using GPU passthrough, these options should not be set.
+   * sandboxWorkloads.enabled=true
+   * sandboxDevicePlugin.enabled=true
+   * For GPU passthrough:
+
+     * vfioManager.enabled=true
+     * Optionally, sandboxWorkloads.defaultWorkload=vm-passthrough (if you want passthrough to be the default mode)
+
+   * For vGPU:
+
+     * vgpuManager.enabled=true
+     * vgpuManager.repository=<path to private repository>
+     * vgpuManager.image=vgpu-manager
+     * vgpuManager.version=<driver version>
+     * vgpuManager.imagePullSecrets={<name of image pull secret>}
+     * vgpuDeviceManager.enabled=true
In general, the flag ``sandboxWorkloads.enabled`` in ``ClusterPolicy`` controls whether the GPU Operator can provision GPU worker nodes for virtual machine workloads, in addition to container workloads. This flag is disabled by default, meaning all nodes get provisioned with the same software which enables container workloads, and the ``nvidia.com/gpu.workload.config`` node label is not used.
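Applied to ``clusterpolicy.json``, the CLI settings above correspond to a spec fragment along these lines. This is a sketch, not the full ClusterPolicy: the placeholder values are kept verbatim, and it shows the passthrough and vGPU fields side by side even though you would normally set only the group for your use case:

```json
{
  "spec": {
    "sandboxWorkloads": { "enabled": true, "defaultWorkload": "vm-passthrough" },
    "sandboxDevicePlugin": { "enabled": true },
    "vfioManager": { "enabled": true },
    "vgpuManager": {
      "enabled": true,
      "repository": "<path to private repository>",
      "image": "vgpu-manager",
      "version": "<driver version>",
      "imagePullSecrets": ["<name of image pull secret>"]
    },
    "vgpuDeviceManager": { "enabled": true }
  }
}
```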
@@ -365,7 +388,7 @@ Creating a ClusterPolicy for the GPU Operator using the OpenShift Container Platform web console

 As a cluster administrator, you can create a ClusterPolicy using the OpenShift Container Platform web console.

 #. Navigate to **Operators** > **Installed Operators** and find your installed NVIDIA GPU Operator.

 #. Under *Provided APIs*, click **ClusterPolicy**.
@@ -388,14 +411,21 @@ As a cluster administrator, you can create a ClusterPolicy using the OpenShift Container Platform web console

-#. If you are planning to use NVIDIA vGPU, expand the **NVIDIA vGPU Manager config** section and fill in your desired configuration settings, including:
-
-   * Select the **enabled** checkbox to enable the NVIDIA vGPU Manager.
-   * Add your **imagePullSecrets**.
-   * Under *driverManager*, fill in **repository** with the path to your private repository.
-   * Under *env*, fill in **image** with ``vgpu-manager`` and the **version** with your driver version.
-
-   If you are only using GPU passthrough, you dont need to fill this section out.
+#. Expand the **Sandbox Device Plugin config** section and make sure that the **enabled** checkbox is checked.
+
+#. If you are planning to use NVIDIA vGPU:
+
+   * Expand the **NVIDIA vGPU Manager config** section and fill in your desired configuration settings, including:
+
+     * Select the **enabled** checkbox to enable the NVIDIA vGPU Manager.
+     * Add your **imagePullSecrets**.
+     * Under *driverManager*, fill in **repository** with the path to your private repository.
+     * Under *env*, fill in **image** with ``vgpu-manager`` and the **version** with your driver version.
+
+   * Expand the **NVIDIA vGPU Device Manager config** section and make sure that the **enabled** checkbox is checked.
+
+   If you are only using GPU passthrough, you don't need to fill these sections out.
+
+* Expand the **VFIO Manager config** section and select the **enabled** checkbox.
+* Optionally, in the **Sandbox Workloads config** section, set **defaultWorkload** to ``vm-passthrough`` if you want passthrough to be the default mode.
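Whichever path you follow (CLI or web console), a quick sanity check after the ClusterPolicy is created is to look at the operand pods and node labels. A sketch — the ``nvidia-gpu-operator`` namespace is the usual install target for the GPU Operator on OpenShift, but may differ in your cluster:

```shell
# List GPU Operator operand pods and confirm the sandbox components you
# enabled (e.g. VFIO Manager, Sandbox Device Plugin, vGPU Manager) are running.
oc get pods -n nvidia-gpu-operator

# Show which workload type each GPU node was provisioned for.
oc get nodes -L nvidia.com/gpu.workload.config
```

As with the labeling example, these commands assume a live, authenticated cluster session.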