
Commit 0ec4b10

Clarify GPU Operator requirements for OpenShift Virtualization (#213)
- Clarify that VFIO Manager, Sandbox Device Plugin, and Sandbox Validator are required for the GPU Operator approach but not needed when using Red Hat's alternative procedures
- Add links to Red Hat's PCI passthrough and vGPU configuration procedures
- Clarify cluster policy configuration steps for GPU passthrough and vGPU workflows
- Update CLI and web console configuration sections with all required parameters
- Fix typo: sandboxWorloads -> sandboxWorkloads

Signed-off-by: Vitaliy Emporopulo <[email protected]>
1 parent ebe250a commit 0ec4b10

1 file changed: openshift/openshift-virtualization.rst
Lines changed: 59 additions & 29 deletions
@@ -15,15 +15,15 @@ Introduction
 
 
 There is a growing demand among Red Hat customers to use virtual GPUs (NVIDIA vGPU)
-with Red Hat OpenShift Virtualization.
-Red Hat OpenShift Virtualization is based on KubeVirt, a virtual machine (VM) management add-on to Kubernetes that allows you to run and manage VMs in a Kubernetes cluster.
-It eliminates the need to manage separate clusters for VM and container workloads, as both can now coexist in a single Kubernetes cluster.
+with Red Hat OpenShift Virtualization.
+Red Hat OpenShift Virtualization is based on KubeVirt, a virtual machine (VM) management add-on to Kubernetes that allows you to run and manage VMs in a Kubernetes cluster.
+It eliminates the need to manage separate clusters for VM and container workloads, as both can now coexist in a single Kubernetes cluster.
 Red Hat OpenShift Virtualization is an OpenShift feature to run virtual machines (VMs) orchestrated by OpenShift (Kubernetes).
 
 In addition to the GPU Operator being able to provision worker nodes for running GPU-accelerated containers, the GPU Operator can also be used to provision worker nodes for running GPU-accelerated virtual machines.
 
 There are some different prerequisites required for running virtual machines with GPU(s) than for running containers with GPU(s).
-The primary difference is the drivers required.
+The primary difference is the drivers required.
 For example, the datacenter driver is needed for containers, the vfio-pci driver is needed for GPU passthrough, and the `NVIDIA vGPU Manager <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#installing-configuring-grid-vgpu>`_ is needed for creating vGPU devices.
 
 .. _configure-worker-nodes-for-gpu-operator-components:
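
For reference only (an illustration added in editing, not part of this commit): one quick way to check which of these drivers is currently bound to a node's GPU is to list the NVIDIA PCI devices from a debug shell on that worker node and look at the "Kernel driver in use" field.

    # For example: `oc debug node/<node-name>` then `chroot /host` (node name is a placeholder).
    # 10de is the NVIDIA PCI vendor ID; vfio-pci indicates the GPU is set up for passthrough,
    # while the datacenter driver shows up as nvidia.
    $ lspci -nnk -d 10de:
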
@@ -48,18 +48,38 @@ Node A receives the following software components:
 * ``NVIDIA Kubernetes Device Plugin`` - To discover and advertise GPU resources to the kubelet.
 * ``NVIDIA DCGM and DCGM Exporter`` - To monitor the GPU(s).
 
-Node B receives the following software components:
+There are two approaches to configuring GPU passthrough and vGPU for virtual machines:
 
-* ``VFIO Manager`` - Optional. To load vfio-pci and bind it to all GPUs on the node.
-* ``Sandbox Device Plugin`` - Optional. To discover and advertise the passthrough GPUs to the kubelet.
-* ``Sandbox Validator`` -Optional. Validates that Sandbox Device Plugin is working.
+1. **NVIDIA GPU Operator approach** - Uses the GPU Operator to deploy and manage GPU software components.
+2. **Red Hat OpenShift Virtualization approach** - Uses Red Hat OpenShift Virtualization native procedures, which are tested and supported by Red Hat.
 
-Node C receives the following software components:
+Node B (GPU Passthrough) receives the following software components:
+
+**NVIDIA GPU Operator approach:**
+
+* ``VFIO Manager`` - To load vfio-pci and bind it to all GPUs on the node.
+* ``Sandbox Device Plugin`` - To discover and advertise the passthrough GPUs to the kubelet.
+* ``Sandbox Validator`` - Validates that Sandbox Device Plugin is working.
+
+**Red Hat OpenShift Virtualization approach:**
+
+* Uses Red Hat OpenShift Virtualization's `PCI passthrough configuration <https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html-single/virtualization/index#virt-configuring-pci-passthrough>`_.
+* When using this approach, NVIDIA GPU Operator's operands must be disabled on the node to avoid conflicts.
+
+Node C (vGPU) receives the following software components:
+
+**NVIDIA GPU Operator approach:**
 
 * ``NVIDIA vGPU Manager`` - To install the driver.
 * ``NVIDIA vGPU Device Manager`` - To create vGPU devices on the node.
-* ``Sandbox Device Plugin`` -Optional. To discover and advertise the vGPU devices to kubelet.
-* ``Sandbox Validator`` -Optional. Validates that Sandbox Device Plugin is working.
+* ``Sandbox Device Plugin`` - To discover and advertise the vGPU devices to kubelet.
+* ``Sandbox Validator`` - Validates that Sandbox Device Plugin is working.
+
+**Red Hat OpenShift Virtualization approach:**
+
+* Uses Red Hat OpenShift Virtualization's `vGPU configuration <https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html-single/virtualization/index#virt-configuring-virtual-gpus>`_.
+* Relies on OpenShift Virtualization’s capabilities to configure mediated devices.
+* The NVIDIA GPU Operator is only used for installing drivers with the NVIDIA vGPU Manager. The GPU Operator does not configure mediated devices.
 
 
 Assumptions, constraints, and dependencies
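
As background for the per-node lists above (an illustration added in editing, not part of this commit): which set of components the GPU Operator deploys on a given node is selected with the ``nvidia.com/gpu.workload.config`` node label discussed later in this file. A minimal sketch, with placeholder node names:

    # vm-passthrough corresponds to Node B, vm-vgpu to Node C;
    # the container value selects ordinary container workloads (Node A).
    $ oc label node <node-name> --overwrite nvidia.com/gpu.workload.config=vm-passthrough
    $ oc label node <node-name> --overwrite nvidia.com/gpu.workload.config=vm-vgpu
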
@@ -246,7 +266,7 @@ Use the following steps to build the vGPU Manager container and push it to a private registry
 
 .. code-block:: console
 
-    $ export PRIVATE_REGISTRY=my/private/registry VERSION=510.73.06 OS_TAG=rhcos4.11
+    $ export PRIVATE_REGISTRY=my/private/registry VERSION=510.73.06 OS_TAG=rhcos4.11
 
 .. note::
 
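For context (a sketch added in editing, not part of the commit): these variables are typically consumed when the image is tagged and pushed, roughly as below. The build arguments required by the vGPU Manager container build are omitted here, and the <version>-<os-tag> tag layout is an assumption based on the ``vgpuManager`` image and version settings used later in this file.

    $ podman build -t ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} .
    $ podman push ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG}
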
@@ -330,15 +350,18 @@ Create the cluster policy using the CLI:
 
 #. Modify the ``clusterpolicy.json`` file as follows:
 
-* sandboxWorloads.enabled=true
-* vgpuManager.enabled=true
-* vgpuManager.repository=<path to private repository>
-* vgpuManager.image=vgpu-manager
-* vgpuManager.version=<driver version>
-* vgpuManager.imagePullSecrets={<name of image pull secret>}
-
-
-The ``vgpuManager`` options are only required if you want to use the NVIDIA vGPU. If you are only using GPU passthrough, these options should not be set.
+* sandboxWorkloads.enabled=true
+* sandboxDevicePlugin.enabled=true
+* For GPU passthrough:
+  * vfioManager.enabled=true
+  * Optionally, sandboxWorkloads.defaultWorkload=vm-passthrough (if you want passthrough to be the default mode)
+* For vGPU:
+  * vgpuManager.enabled=true
+  * vgpuManager.repository=<path to private repository>
+  * vgpuManager.image=vgpu-manager
+  * vgpuManager.version=<driver version>
+  * vgpuManager.imagePullSecrets={<name of image pull secret>}
+  * vgpuDeviceManager.enabled=true
 
 In general, the flag ``sandboxWorkloads.enabled`` in ``ClusterPolicy`` controls whether the GPU Operator can provision GPU worker nodes for virtual machine workloads, in addition to container workloads. This flag is disabled by default, meaning all nodes get provisioned with the same software which enables container workloads, and the ``nvidia.com/gpu.workload.config`` node label is not used.
 
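For illustration only (added in editing, not part of the commit): the options listed in the hunk above map onto the ClusterPolicy spec roughly as follows, shown here as YAML for readability. Angle-bracket values are placeholders, and the vGPU entries apply only to the vGPU workflow.

    spec:
      sandboxWorkloads:
        enabled: true
        defaultWorkload: vm-passthrough   # optional, only if passthrough should be the default
      sandboxDevicePlugin:
        enabled: true
      vfioManager:
        enabled: true                     # GPU passthrough
      vgpuManager:
        enabled: true                     # vGPU only
        repository: <path to private repository>
        image: vgpu-manager
        version: <driver version>
        imagePullSecrets:
          - <name of image pull secret>
      vgpuDeviceManager:
        enabled: true                     # vGPU only
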
@@ -365,7 +388,7 @@ Creating a ClusterPolicy for the GPU Operator using the OpenShift Container Platform web console
 
 As a cluster administrator, you can create a ClusterPolicy using the OpenShift Container Platform web console.
 
-#. Navigate to **Operators** > **Installed Operators** and find your installed NVIDIA GPU Operator.
+#. Navigate to **Operators** > **Installed Operators** and find your installed NVIDIA GPU Operator.
 
 #. Under *Provided APIs*, click **ClusterPolicy**.
 
@@ -388,14 +411,21 @@ As a cluster administrator, you can create a ClusterPolicy using the OpenShift Container Platform web console
 
 .. image:: graphics/cluster_policy_enable_sandbox_workloads.png
 
-#. If you are planning to use NVIDIA vGPU, expand the **NVIDIA vGPU Manager config** section and fill in your desired configuration settings, including:
+#. Expand the **Sandbox Device Plugin config** section and make sure that the **enabled** checkbox is checked.
 
-* Select the **enabled** checkbox to enable the NVIDIA vGPU Manager.
-* Add your **imagePullSecrets**.
-* Under *driverManager*, fill in **repository** with the path to your private repository.
-* Under *env*, fill in **image** with ``vgpu-manager`` and the **version** with your driver version.
+#. If you are planning to use NVIDIA vGPU
 
-If you are only using GPU passthrough, you dont need to fill this section out.
+* Expand the **NVIDIA vGPU Manager config** section and fill in your desired configuration settings, including:
+  * Select the **enabled** checkbox to enable the NVIDIA vGPU Manager.
+  * Add your **imagePullSecrets**.
+  * Under *driverManager*, fill in **repository** with the path to your private repository.
+  * Under *env*, fill in **image** with ``vgpu-manager`` and the **version** with your driver version.
+* Expand the **NVIDIA vGPU Device Manager config** section and make sure that the **enabled** checkbox is checked.
+
+If you are only using GPU passthrough, you don't need to fill these sections out.
+
+* Expand the **VFIO Manager config** section and select the **enabled** checkbox.
+* Optionally, in the **Sandbox Workloads config** section, set **defaultWorkload** to ``vm-passthrough`` if you want passthrough to be the default mode.
 
 .. image:: graphics/cluster_policy_configure_vgpu.png
 
@@ -582,7 +612,7 @@ Procedure
 Example for vGPU:
 
 .. code-block:: yaml
-
+
     apiVersion: kubevirt.io/v1alpha3
     kind: VirtualMachineInstance
     ...
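
The YAML example above is truncated in this view. As a rough sketch (added in editing, not the file's actual content): a KubeVirt VirtualMachineInstance requests a passthrough GPU or vGPU through ``spec.domain.devices.gpus``, where ``deviceName`` is the resource advertised on the node; the resource name below is a placeholder.

    apiVersion: kubevirt.io/v1alpha3
    kind: VirtualMachineInstance
    spec:
      domain:
        devices:
          gpus:
            - deviceName: nvidia.com/<resource-name>   # e.g. a vGPU type exposed by the device plugin
              name: gpu1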
