
Commit 0ec4b10

Clarify GPU Operator requirements for OpenShift Virtualization (#213)
- Clarify that VFIO Manager, Sandbox Device Plugin, and Sandbox Validator are required for the GPU Operator approach but not needed when using Red Hat's alternative procedures
- Add links to Red Hat's PCI passthrough and vGPU configuration procedures
- Clarify cluster policy configuration steps for GPU passthrough and vGPU workflows
- Update CLI and web console configuration sections with all required parameters
- Fix typo: sandboxWorloads -> sandboxWorkloads

Signed-off-by: Vitaliy Emporopulo <[email protected]>
1 parent ebe250a commit 0ec4b10

1 file changed: openshift/openshift-virtualization.rst
Lines changed: 59 additions & 29 deletions
@@ -15,15 +15,15 @@ Introduction
 
 
 There is a growing demand among Red Hat customers to use virtual GPUs (NVIDIA vGPU)
-with Red Hat OpenShift Virtualization.
-Red Hat OpenShift Virtualization is based on KubeVirt, a virtual machine (VM) management add-on to Kubernetes that allows you to run and manage VMs in a Kubernetes cluster.
-It eliminates the need to manage separate clusters for VM and container workloads, as both can now coexist in a single Kubernetes cluster.
+with Red Hat OpenShift Virtualization.
+Red Hat OpenShift Virtualization is based on KubeVirt, a virtual machine (VM) management add-on to Kubernetes that allows you to run and manage VMs in a Kubernetes cluster.
+It eliminates the need to manage separate clusters for VM and container workloads, as both can now coexist in a single Kubernetes cluster.
 Red Hat OpenShift Virtualization is an OpenShift feature to run virtual machines (VMs) orchestrated by OpenShift (Kubernetes).
 
 In addition to the GPU Operator being able to provision worker nodes for running GPU-accelerated containers, the GPU Operator can also be used to provision worker nodes for running GPU-accelerated virtual machines.
 
 There are some different prerequisites required for running virtual machines with GPU(s) than for running containers with GPU(s).
-The primary difference is the drivers required.
+The primary difference is the drivers required.
 For example, the datacenter driver is needed for containers, the vfio-pci driver is needed for GPU passthrough, and the `NVIDIA vGPU Manager <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#installing-configuring-grid-vgpu>`_ is needed for creating vGPU devices.
 
 .. _configure-worker-nodes-for-gpu-operator-components:
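
For reference only (an illustration added in editing, not part of this commit): one quick way to check which of these drivers is currently bound to a node's GPU is to list the NVIDIA PCI devices from a debug shell on that worker node and look at the "Kernel driver in use" field.

    # For example: `oc debug node/<node-name>` then `chroot /host` (node name is a placeholder).
    # 10de is the NVIDIA PCI vendor ID; vfio-pci indicates the GPU is set up for passthrough,
    # while the datacenter driver shows up as nvidia.
    $ lspci -nnk -d 10de:
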
@@ -48,18 +48,38 @@ Node A receives the following software components:
 * ``NVIDIA Kubernetes Device Plugin`` - To discover and advertise GPU resources to the kubelet.
 * ``NVIDIA DCGM and DCGM Exporter`` - To monitor the GPU(s).
 
-Node B receives the following software components:
+There are two approaches to configuring GPU passthrough and vGPU for virtual machines:
 
-* ``VFIO Manager`` - Optional. To load vfio-pci and bind it to all GPUs on the node.
-* ``Sandbox Device Plugin`` - Optional. To discover and advertise the passthrough GPUs to the kubelet.
-* ``Sandbox Validator`` -Optional. Validates that Sandbox Device Plugin is working.
+1. **NVIDIA GPU Operator approach** - Uses the GPU Operator to deploy and manage GPU software components.
+2. **Red Hat OpenShift Virtualization approach** - Uses Red Hat OpenShift Virtualization native procedures, which are tested and supported by Red Hat.
 
-Node C receives the following software components:
+Node B (GPU Passthrough) receives the following software components:
+
+**NVIDIA GPU Operator approach:**
+
+* ``VFIO Manager`` - To load vfio-pci and bind it to all GPUs on the node.
+* ``Sandbox Device Plugin`` - To discover and advertise the passthrough GPUs to the kubelet.
+* ``Sandbox Validator`` - Validates that Sandbox Device Plugin is working.
+
+**Red Hat OpenShift Virtualization approach:**
+
+* Uses Red Hat OpenShift Virtualization's `PCI passthrough configuration <https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html-single/virtualization/index#virt-configuring-pci-passthrough>`_.
+* When using this approach, NVIDIA GPU Operator's operands must be disabled on the node to avoid conflicts.
+
+Node C (vGPU) receives the following software components:
+
+**NVIDIA GPU Operator approach:**
 
 * ``NVIDIA vGPU Manager`` - To install the driver.
 * ``NVIDIA vGPU Device Manager`` - To create vGPU devices on the node.
-* ``Sandbox Device Plugin`` -Optional. To discover and advertise the vGPU devices to kubelet.
-* ``Sandbox Validator`` -Optional. Validates that Sandbox Device Plugin is working.
+* ``Sandbox Device Plugin`` - To discover and advertise the vGPU devices to kubelet.
+* ``Sandbox Validator`` - Validates that Sandbox Device Plugin is working.
+
+**Red Hat OpenShift Virtualization approach:**
+
+* Uses Red Hat OpenShift Virtualization's `vGPU configuration <https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html-single/virtualization/index#virt-configuring-virtual-gpus>`_.
+* Relies on OpenShift Virtualization’s capabilities to configure mediated devices.
+* The NVIDIA GPU Operator is only used for installing drivers with the NVIDIA vGPU Manager. The GPU Operator does not configure mediated devices.
 
 
 Assumptions, constraints, and dependencies
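
As background for the per-node lists above (an illustration added in editing, not part of this commit): which set of components the GPU Operator deploys on a given node is selected with the ``nvidia.com/gpu.workload.config`` node label discussed later in this file. A minimal sketch, with placeholder node names:

    # vm-passthrough corresponds to Node B, vm-vgpu to Node C;
    # the container value selects ordinary container workloads (Node A).
    $ oc label node <node-name> --overwrite nvidia.com/gpu.workload.config=vm-passthrough
    $ oc label node <node-name> --overwrite nvidia.com/gpu.workload.config=vm-vgpu
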
@@ -246,7 +266,7 @@ Use the following steps to build the vGPU Manager container and push it to a private registry
 
 .. code-block:: console
 
-    $ export PRIVATE_REGISTRY=my/private/registry VERSION=510.73.06 OS_TAG=rhcos4.11
+    $ export PRIVATE_REGISTRY=my/private/registry VERSION=510.73.06 OS_TAG=rhcos4.11
 
 .. note::
 
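For context (a sketch added in editing, not part of the commit): these variables are typically consumed when the image is tagged and pushed, roughly as below. The build arguments required by the vGPU Manager container build are omitted here, and the <version>-<os-tag> tag layout is an assumption based on the ``vgpuManager`` image and version settings used later in this file.

    $ podman build -t ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} .
    $ podman push ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG}
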
@@ -330,15 +350,18 @@ Create the cluster policy using the CLI:
 
 #. Modify the ``clusterpolicy.json`` file as follows:
 
-* sandboxWorloads.enabled=true
-* vgpuManager.enabled=true
-* vgpuManager.repository=<path to private repository>
-* vgpuManager.image=vgpu-manager
-* vgpuManager.version=<driver version>
-* vgpuManager.imagePullSecrets={<name of image pull secret>}
-
-
-The ``vgpuManager`` options are only required if you want to use the NVIDIA vGPU. If you are only using GPU passthrough, these options should not be set.
+* sandboxWorkloads.enabled=true
+* sandboxDevicePlugin.enabled=true
+* For GPU passthrough:
+  * vfioManager.enabled=true
+  * Optionally, sandboxWorkloads.defaultWorkload=vm-passthrough (if you want passthrough to be the default mode)
+* For vGPU:
+  * vgpuManager.enabled=true
+  * vgpuManager.repository=<path to private repository>
+  * vgpuManager.image=vgpu-manager
+  * vgpuManager.version=<driver version>
+  * vgpuManager.imagePullSecrets={<name of image pull secret>}
+  * vgpuDeviceManager.enabled=true
 
 In general, the flag ``sandboxWorkloads.enabled`` in ``ClusterPolicy`` controls whether the GPU Operator can provision GPU worker nodes for virtual machine workloads, in addition to container workloads. This flag is disabled by default, meaning all nodes get provisioned with the same software which enables container workloads, and the ``nvidia.com/gpu.workload.config`` node label is not used.
 
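For illustration only (added in editing, not part of the commit): the options listed in the hunk above map onto the ClusterPolicy spec roughly as follows, shown here as YAML for readability. Angle-bracket values are placeholders, and the vGPU entries apply only to the vGPU workflow.

    spec:
      sandboxWorkloads:
        enabled: true
        defaultWorkload: vm-passthrough   # optional, only if passthrough should be the default
      sandboxDevicePlugin:
        enabled: true
      vfioManager:
        enabled: true                     # GPU passthrough
      vgpuManager:
        enabled: true                     # vGPU only
        repository: <path to private repository>
        image: vgpu-manager
        version: <driver version>
        imagePullSecrets:
          - <name of image pull secret>
      vgpuDeviceManager:
        enabled: true                     # vGPU only
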
@@ -365,7 +388,7 @@ Creating a ClusterPolicy for the GPU Operator using the OpenShift Container Platform web console
 
 As a cluster administrator, you can create a ClusterPolicy using the OpenShift Container Platform web console.
 
-#. Navigate to **Operators** > **Installed Operators** and find your installed NVIDIA GPU Operator.
+#. Navigate to **Operators** > **Installed Operators** and find your installed NVIDIA GPU Operator.
 
 #. Under *Provided APIs*, click **ClusterPolicy**.
 
@@ -388,14 +411,21 @@ As a cluster administrator, you can create a ClusterPolicy using the OpenShift Container Platform web console
 
 .. image:: graphics/cluster_policy_enable_sandbox_workloads.png
 
-#. If you are planning to use NVIDIA vGPU, expand the **NVIDIA vGPU Manager config** section and fill in your desired configuration settings, including:
+#. Expand the **Sandbox Device Plugin config** section and make sure that the **enabled** checkbox is checked.
 
-* Select the **enabled** checkbox to enable the NVIDIA vGPU Manager.
-* Add your **imagePullSecrets**.
-* Under *driverManager*, fill in **repository** with the path to your private repository.
-* Under *env*, fill in **image** with ``vgpu-manager`` and the **version** with your driver version.
+#. If you are planning to use NVIDIA vGPU
 
-If you are only using GPU passthrough, you dont need to fill this section out.
+* Expand the **NVIDIA vGPU Manager config** section and fill in your desired configuration settings, including:
+  * Select the **enabled** checkbox to enable the NVIDIA vGPU Manager.
+  * Add your **imagePullSecrets**.
+  * Under *driverManager*, fill in **repository** with the path to your private repository.
+  * Under *env*, fill in **image** with ``vgpu-manager`` and the **version** with your driver version.
+* Expand the **NVIDIA vGPU Device Manager config** section and make sure that the **enabled** checkbox is checked.
+
+If you are only using GPU passthrough, you don't need to fill these sections out.
+
+* Expand the **VFIO Manager config** section and select the **enabled** checkbox.
+* Optionally, in the **Sandbox Workloads config** section, set **defaultWorkload** to ``vm-passthrough`` if you want passthrough to be the default mode.
 
 .. image:: graphics/cluster_policy_configure_vgpu.png
 
@@ -582,7 +612,7 @@ Procedure
 Example for vGPU:
 
 .. code-block:: yaml
-
+
     apiVersion: kubevirt.io/v1alpha3
     kind: VirtualMachineInstance
     ...
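
The YAML example above is truncated in this view. As a rough sketch (added in editing, not the file's actual content): a KubeVirt VirtualMachineInstance requests a passthrough GPU or vGPU through ``spec.domain.devices.gpus``, where ``deviceName`` is the resource advertised on the node; the resource name below is a placeholder.

    apiVersion: kubevirt.io/v1alpha3
    kind: VirtualMachineInstance
    spec:
      domain:
        devices:
          gpus:
            - deviceName: nvidia.com/<resource-name>   # e.g. a vGPU type exposed by the device plugin
              name: gpu1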
