
Commit ff0df0d

add different vGPU Q profiles example (#243)
* edits to match style guide
* add different vGPU Q profiles example
* accept suggestion
* fix sandboxing italics
* accept suggestions
* fix indentation
* fix YAML indentation formatting

Signed-off-by: Andrew Chen <[email protected]>
1 parent cc01e96 commit ff0df0d

2 files changed: +57 additions, -31 deletions

gpu-operator/gpu-operator-kubevirt.rst

Lines changed: 44 additions & 31 deletions
@@ -14,11 +14,11 @@ About the Operator with KubeVirt
  ================================

  `KubeVirt <https://kubevirt.io/>`_ is a virtual machine management add-on to Kubernetes that allows you to run and manage virtual machines in a Kubernetes cluster.
- It eliminates the need to manage separate clusters for virtual machine and container workloads, as both can now coexist in a single Kubernetes cluster.
+ It eliminates the need to manage separate clusters for virtual machine and container workloads because both can now coexist in a single Kubernetes cluster.

  In addition to the GPU Operator being able to provision worker nodes for running GPU-accelerated containers, the GPU Operator can also be used to provision worker nodes for running GPU-accelerated virtual machines with KubeVirt.

- There are some different prerequisites required when running virtual machines with GPU(s) than running containers with GPU(s).
+ There are some different prerequisites required when running virtual machines with GPUs compared to running containers with GPUs.
  The primary difference is the drivers required.
  For example, the datacenter driver is needed for containers, the vfio-pci driver is needed for GPU passthrough, and the `NVIDIA vGPU Manager <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#installing-configuring-grid-vgpu>`_ is needed for creating vGPU devices.

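The driver split described above is controlled per node through the ``nvidia.com/gpu.workload.config`` label that this page discusses later. As a minimal sketch, assuming a worker node named ``worker-1`` and the documented label values ``container``, ``vm-passthrough``, and ``vm-vgpu``, dedicating a node to vGPU-backed virtual machines could look like:

.. code-block:: console

   # Provision worker-1 for vGPU virtual machines, so the GPU Operator deploys
   # the NVIDIA vGPU Manager on it instead of the datacenter driver.
   $ kubectl label node worker-1 --overwrite nvidia.com/gpu.workload.config=vm-vgpu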
@@ -62,15 +62,15 @@ To override the default GPU workload configuration, set the following value in `
  Assumptions, constraints, and dependencies
  ------------------------------------------

- * A GPU worker node can run GPU workloads of a particular type - containers, virtual machines with GPU Passthrough, or virtual machines with vGPU - but not a combination of any of them.
+ * A GPU worker node can run GPU workloads of a particular type, such as containers, virtual machines with GPU Passthrough, or virtual machines with vGPU, but not a combination of any of them.

- * The cluster admin or developer has knowledge about their cluster ahead of time, and can properly label nodes to indicate what types of GPU workloads they will run.
+ * The cluster admin or developer has knowledge about their cluster ahead of time and can properly label nodes to indicate what types of GPU workloads they will run.

  * Worker nodes running GPU accelerated virtual machines (with GPU passthrough or vGPU) are assumed to be bare metal.

  * The GPU Operator will not automate the installation of NVIDIA drivers inside KubeVirt virtual machines with GPUs/vGPUs attached.

- * Users must manually add all passthrough GPU and vGPU resources to the ``permittedDevices`` list in the KubeVirt CR before assigning them to KubeVirt virtual machines. See the `KubeVirt documentation <https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices>`_ for more information.
+ * Users must manually add all passthrough GPU and vGPU resources to the ``permittedDevices`` list in the KubeVirt CR before assigning them to KubeVirt virtual machines. Refer to the `KubeVirt documentation <https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices>`_ for more information.

  * MIG-backed vGPUs are not supported.

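For orientation, a sketch of the corresponding KubeVirt CR change follows. The field is ``permittedHostDevices`` under ``spec.configuration``; the PCI selector and resource names are illustrative values for an A10 and must match what the device plugins on your nodes actually advertise.

.. code-block:: yaml

   apiVersion: kubevirt.io/v1
   kind: KubeVirt
   metadata:
     name: kubevirt
     namespace: kubevirt
   spec:
     configuration:
       permittedHostDevices:
         pciHostDevices:                        # GPU passthrough resources
           - pciVendorSelector: "10DE:2236"     # illustrative vendor:device ID
             resourceName: nvidia.com/GA102GL_A10
             externalResourceProvider: true
         mediatedDevices:                       # vGPU resources
           - mdevNameSelector: "NVIDIA A10-12Q"
             resourceName: nvidia.com/NVIDIA_A10-12Q
             externalResourceProvider: true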
@@ -83,7 +83,7 @@ Before using KubeVirt with the GPU Operator, ensure the following prerequisites

  * The host is booted with ``intel_iommu=on`` or ``amd_iommu=on`` on the kernel command line.

- * If planning to use NVIDIA vGPU, SR-IOV must be enabled in the BIOS if your GPUs are based on the NVIDIA Ampere architecture or later. Refer to the `NVIDIA vGPU Documentation <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#prereqs-vgpu>`_ to ensure you have met all of the prerequisites for using NVIDIA vGPU.
+ * If planning to use NVIDIA vGPU, SR-IOV must be enabled in the BIOS if your GPUs are based on the NVIDIA Ampere architecture or later. Refer to the `NVIDIA vGPU Documentation <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#prereqs-vgpu>`_ to ensure you have met all the prerequisites for using NVIDIA vGPU.

  * KubeVirt is installed in the cluster.

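A quick sanity check for the IOMMU prerequisite on a worker host is sketched below; how the flag gets added to the bootloader configuration depends on your distribution.

.. code-block:: console

   # Confirm the flag is present on the running kernel command line.
   $ grep -oE 'intel_iommu=on|amd_iommu=on' /proc/cmdline
   intel_iommu=on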
@@ -110,14 +110,16 @@ After configuring the :ref:`prerequisites<prerequisites>`, the high level workfl
  * :ref:`Install the GPU Operator <install-the-gpu-operator>` and set ``sandboxWorkloads.enabled=true``

  If you are planning to deploy VMs with vGPU, the workflow is as follows:
- * :ref:`Build the NVIDIA vGPU Manager image <build-vgpu-manager-image>`
- * :ref:`Label the node for the vGPU configuration <vgpu-device-configuration>`
- * :ref:`Add vGPU resources to KubeVirt CR <add-vgpu-resources-to-kubevirt-cr>`
- * :ref:`Create a virtual machine with vGPU <create-a-virtual-machine-with-gpu>`
+
+ * :ref:`Build the NVIDIA vGPU Manager image <build-vgpu-manager-image>`
+ * :ref:`Label the node for the vGPU configuration <vgpu-device-configuration>`
+ * :ref:`Add vGPU resources to KubeVirt CR <add-vgpu-resources-to-kubevirt-cr>`
+ * :ref:`Create a virtual machine with vGPU <create-a-virtual-machine-with-gpu>`

  If you are planning to deploy VMs with GPU passthrough, the workflow is as follows:
- * :ref:`Add GPU passthrough resources to KubeVirt CR <add-gpu-passthrough-resources-to-kubevirt-cr>`
- * :ref:`Create a virtual machine with GPU passthrough <create-a-virtual-machine-with-gpu>`
+
+ * :ref:`Add GPU passthrough resources to KubeVirt CR <add-gpu-passthrough-resources-to-kubevirt-cr>`
+ * :ref:`Create a virtual machine with GPU passthrough <create-a-virtual-machine-with-gpu>`

  .. _label-worker-nodes:

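For reference, enabling sandbox workloads at install time can look like the following sketch; the chart reference, namespace, and any additional ``--set`` options depend on your environment.

.. code-block:: console

   $ helm install gpu-operator nvidia/gpu-operator \
       -n gpu-operator --create-namespace \
       --set sandboxWorkloads.enabled=true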
@@ -150,11 +152,11 @@ Follow one of the below subsections for installing the GPU Operator, depending o

  .. note::

-    The following commnds set the``sandboxWorkloads.enabled`` flag.
+    The following commands set the ``sandboxWorkloads.enabled`` flag.
     This ``ClusterPolicy`` flag controls whether the GPU Operator can provision GPU worker nodes for virtual machine workloads, in addition to container workloads.
     This flag is disabled by default, meaning all nodes get provisioned with the same software to enable container workloads, and the ``nvidia.com/gpu.workload.config`` node label is not used.

-    The term ``sandboxing`` refers to running software in a separate isolated environment, typically for added security (i.e. a virtual machine).
+    The term *sandboxing* refers to running software in a separate isolated environment, typically for added security (that is, a virtual machine).
     We use the term ``sandbox workloads`` to signify workloads that run in a virtual machine, irrespective of the virtualization technology used.

  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -208,9 +210,9 @@ Follow the steps provided in :ref:`this section<build-vgpu-manager-image>`.
       --set vgpuManager.version=<driver version> \
       --set vgpuManager.imagePullSecrets={${REGISTRY_SECRET_NAME}}

- The vGPU Device Manager, deployed by the GPU Operator, automatically creates vGPU devices which can be assigned to KubeVirt virtual machines.
+ The vGPU Device Manager, deployed by the GPU Operator, automatically creates vGPU devices that can be assigned to KubeVirt virtual machines.
  Without additional configuration, the GPU Operator creates a default set of devices on all GPUs.
- To learn more about how the vGPU Device Manager and configure which types of vGPU devices get created in your cluster, refer to :ref:`vGPU Device Configuration<vgpu-device-configuration>`.
+ To learn more about the vGPU Device Manager and configure which types of vGPU devices get created in your cluster, refer to :ref:`vGPU Device Configuration<vgpu-device-configuration>`.

  Add GPU resources to KubeVirt CR
  -------------------------------------
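Before editing the KubeVirt CR, it helps to confirm which GPU resources a node actually advertises, since those resource names are what get listed in the CR. A minimal check, with ``<node-name>`` as a placeholder:

.. code-block:: console

   # List the nvidia.com/* resources in the node's capacity and allocatable sections.
   $ kubectl describe node <node-name> | grep -i 'nvidia.com/'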
@@ -410,17 +412,29 @@ At runtime, adminstrators then point the vGPU Device Manager at one of these con
  The configuration file is created as a ConfigMap, and is shared across all worker nodes.
  At runtime, a node label, ``nvidia.com/vgpu.config``, can be used to decide which of these configurations to actually apply to a node at any given time.
  If the node is not labeled, then the ``default`` configuration will be used.
- For more information on this component and how it is configured, refer to the project `README <https://github.com/NVIDIA/vgpu-device-manager>`_.
+ For more information on this component and how it is configured, refer to the `NVIDIA vGPU Device Manager README <https://github.com/NVIDIA/vgpu-device-manager>`_.

- By default, the GPU Operator deploys a ConfigMap for the vGPU Device Manager, containing named configurations for all `vGPU types <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#supported-gpus-grid-vgpu>`_ supported by NVIDIA vGPU.
+ By default, the GPU Operator deploys a ConfigMap for the vGPU Device Manager, containing named configurations for all `vGPU types supported by NVIDIA vGPU <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#supported-gpus-grid-vgpu>`_.
  Users can select a specific configuration for a worker node by applying the ``nvidia.com/vgpu.config`` node label.
- For example, labeling a node with ``nvidia.com/vgpu.config=A10-8Q`` would create 3 vGPU devices of type **A10-8Q** on all **A10** GPUs on the node (note: 3 is the maximum number of **A10-8Q** devices that can be created per GPU).
+ For example, labeling a node with ``nvidia.com/vgpu.config=A10-8Q`` would create three vGPU devices of type **A10-8Q** on all **A10** GPUs on the node. Note that three is the maximum number of **A10-8Q** devices that can be created per GPU.
  If the node is not labeled, the ``default`` configuration will be applied.
- The ``default`` configuration will create Q-series vGPU devices on all GPUs, where the amount of framebuffer memory per vGPU device
- is half the total GPU memory.
- For example, the ``default`` configuration will create two **A10-12Q** devices on all **A10** GPUs, two **V100-8Q** devices on all **V100** GPUs, and two **T4-8Q** devices on all **T4** GPUs.
+ The ``default`` configuration will create Q-series vGPU devices on all GPUs, where the amount of framebuffer memory per vGPU device is half the total GPU memory.
+ For example, the ``default`` configuration will create two **A10-12Q** devices on all **A10** GPUs, two **V100-8Q** devices on all **V100** GPUs, and two **T4-8Q** devices on all **T4** GPUs.
+
+ You can also create different vGPU Q profiles on the same GPU using vGPU Device Manager configuration.
+ For example, you can create a **A10-4Q** and a **A10-6Q** device on same GPU by creating a vGPU Device Manager configuration with the following content:
+
+ .. code-block:: yaml
+
+    version: v1
+    vgpu-configs:
+      custom-A10-config:
+        - devices: all
+          vgpu-devices:
+            "A10-4Q": 3
+            "A10-6Q": 2

- If custom vGPU device configuration is desired, more than the default ConfigMap provides, you can create your own ConfigMap:
+ If custom vGPU device configuration is desired, more than the default config map provides, you can create your own config map:

  .. code-block:: console

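As a sketch, once a config map containing the custom configuration above exists in the cluster, the ``nvidia.com/vgpu.config`` label described earlier selects it by name; ``custom-A10-config`` is the name defined in the example and the node name is illustrative.

.. code-block:: console

   $ kubectl label node worker-1 --overwrite nvidia.com/vgpu.config=custom-A10-config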
@@ -476,7 +490,7 @@ After the vGPU Device Manager finishes applying the new configuration, all GPU O
     nvidia-vgpu-device-manager-8mgg8 1/1 Running 0 30m
     nvidia-vgpu-manager-daemonset-fpplc 1/1 Running 0 31m

- You can now see 12 **A10-4Q** devices on the node, as 6 **A10-4Q** devices can be created per **A10** GPU.
+ You can now see 12 **A10-4Q** devices on the node, as six **A10-4Q** devices can be created per **A10** GPU.

  .. code-block:: console
@@ -500,10 +514,9 @@ This section covers building the NVIDIA vGPU Manager container image and pushing

  Download the vGPU Software from the `NVIDIA Licensing Portal <https://nvid.nvidia.com/dashboard/#/dashboard>`_.

- * Login to the NVIDIA Licensing Portal and navigate to the `Software Downloads` section.
- * The NVIDIA vGPU Software is located in the Software Downloads section of the NVIDIA Licensing Portal.
- * The vGPU Software bundle is packaged as a zip file.
- Download and unzip the bundle to obtain the NVIDIA vGPU Manager for Linux file, ``NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run``.
+ * Login to the NVIDIA Licensing Portal and navigate to the **Software Downloads** section.
+ * The NVIDIA vGPU Software is located in the **Software Downloads** section of the NVIDIA Licensing Portal.
+ * The vGPU Software bundle is packaged as a zip file. Download and unzip the bundle to obtain the NVIDIA vGPU Manager for Linux file, ``NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run``.

  .. start-nvaie-run-file

@@ -512,7 +525,7 @@ Download the vGPU Software from the `NVIDIA Licensing Portal <https://nvid.nvidi
  NVIDIA AI Enterprise customers must use the ``aie`` .run file for building the NVIDIA vGPU Manager image.
  Download the ``NVIDIA-Linux-x86_64-<version>-vgpu-kvm-aie.run`` file instead, and rename it to
  ``NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run`` before proceeding with the rest of the procedure.
- Refer to the ``Infrastructure Support Matrix`` under section under the `NVIDIA AI Enterprise Infra Release Branches <https://docs.nvidia.com/ai-enterprise/index.html#infrastructure-software>`_ for details on supported version number to use.
+ Refer to the **Infrastructure Support Matrix** section under the `NVIDIA AI Enterprise Infrastructure Release Branches <https://docs.nvidia.com/ai-enterprise/index.html#infrastructure-software>`_ for details on supported version number to use.
  .. end-nvaie-run-file
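The rename mentioned in the note is a plain file rename, for example (keeping the ``<version>`` placeholder as-is):

.. code-block:: console

   $ mv NVIDIA-Linux-x86_64-<version>-vgpu-kvm-aie.run NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run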

  Next, clone the driver container repository and build the driver image with the following steps.
@@ -532,7 +545,7 @@ Change to the vgpu-manager directory for your OS. We use Ubuntu 20.04 as an exam

  .. note::

-    For RedHat OpenShift, run ``cd vgpu-manager/rhel8`` to use the ``rhel8`` folder instead.
+    For Red Hat OpenShift, run ``cd vgpu-manager/rhel8`` to use the ``rhel8`` folder instead.

  Copy the NVIDIA vGPU Manager from your extracted zip file

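A sketch of the copy step, assuming the ``.run`` file was extracted to your downloads directory and your shell is inside the ``vgpu-manager/ubuntu20.04`` folder:

.. code-block:: console

   $ cp ~/Downloads/NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run .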
@@ -543,7 +556,7 @@ Copy the NVIDIA vGPU Manager from your extracted zip file
  | Set the following environment variables:
  | ``PRIVATE_REGISTRY`` - name of private registry used to store driver image
  | ``VERSION`` - NVIDIA vGPU Manager version downloaded from NVIDIA Software Portal
- | ``OS_TAG`` - this must match the Guest OS version. In the below example ``ubuntu20.04`` is used. For RedHat OpenShift this should be set to ``rhcos4.x`` where x is the supported minor OCP version.
+ | ``OS_TAG`` - this must match the Guest OS version. In the following example ``ubuntu20.04`` is used. For Red Hat OpenShift this should be set to ``rhcos4.x`` where x is the supported minor OCP version.
  | ``CUDA_VERSION`` - CUDA base image version to build the driver image with.

  .. code-block:: console

openshift/openshift-virtualization.rst

Lines changed: 13 additions & 0 deletions
@@ -656,6 +656,19 @@ If the node is not labeled, the ``default`` configuration will be applied.
  The ``default`` configuration will create Q-series vGPU devices on all GPUs, where the amount of framebuffer memory per vGPU device is half the total GPU memory.
  For example, the ``default`` configuration will create two **A10-12Q** devices on all **A10** GPUs, two **V100-8Q** devices on all **V100** GPUs, and two **T4-8Q** devices on all **T4** GPUs.

+ You can also create different vGPU Q profiles on the same GPU using vGPU Device Manager configuration.
+ For example, you can create a **A10-4Q** and a **A10-6Q** device on same GPU by creating a vGPU Device Manager configuration with the following content:
+
+ .. code-block:: yaml
+
+    version: v1
+    vgpu-configs:
+      custom-A10-config:
+        - devices: all
+          vgpu-devices:
+            "A10-4Q": 3
+            "A10-6Q": 2
+
  If custom vGPU device configuration is desired, more than the default ConfigMap provides, you can create your own ConfigMap:

  .. code-block:: console
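On OpenShift the same selection is typically made with ``oc`` rather than ``kubectl``; a sketch, assuming the custom config map above has been created and using an illustrative node name:

.. code-block:: console

   $ oc label node worker-1 --overwrite nvidia.com/vgpu.config=custom-A10-config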
