gpu-operator/gpu-operator-kubevirt.rst
44 additions & 31 deletions
@@ -14,11 +14,11 @@ About the Operator with KubeVirt
================================

`KubeVirt <https://kubevirt.io/>`_ is a virtual machine management add-on to Kubernetes that allows you to run and manage virtual machines in a Kubernetes cluster.
- It eliminates the need to manage separate clusters for virtual machine and container workloads, as both can now coexist in a single Kubernetes cluster.
+ It eliminates the need to manage separate clusters for virtual machine and container workloads because both can now coexist in a single Kubernetes cluster.

In addition to the GPU Operator being able to provision worker nodes for running GPU-accelerated containers, the GPU Operator can also be used to provision worker nodes for running GPU-accelerated virtual machines with KubeVirt.

- There are some different prerequisites required when running virtual machines with GPU(s) than running containers with GPU(s).
+ There are some different prerequisites required when running virtual machines with GPUs compared to running containers with GPUs.
The primary difference is the drivers required.
For example, the datacenter driver is needed for containers, the vfio-pci driver is needed for GPU passthrough, and the `NVIDIA vGPU Manager <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#installing-configuring-grid-vgpu>`_ is needed for creating vGPU devices.
@@ -62,15 +62,15 @@ To override the default GPU workload configuration, set the following value in `
Assumptions, constraints, and dependencies
------------------------------------------

- * A GPU worker node can run GPU workloads of a particular type - containers, virtual machines with GPU Passthrough, or virtual machines with vGPU - but not a combination of any of them.
+ * A GPU worker node can run GPU workloads of a particular type, such as containers, virtual machines with GPU Passthrough, or virtual machines with vGPU, but not a combination of any of them.

- * The cluster admin or developer has knowledge about their cluster ahead of time, and can properly label nodes to indicate what types of GPU workloads they will run.
+ * The cluster admin or developer has knowledge about their cluster ahead of time and can properly label nodes to indicate what types of GPU workloads they will run.

* Worker nodes running GPU accelerated virtual machines (with GPU passthrough or vGPU) are assumed to be bare metal.

* The GPU Operator will not automate the installation of NVIDIA drivers inside KubeVirt virtual machines with GPUs/vGPUs attached.

- * Users must manually add all passthrough GPU and vGPU resources to the ``permittedDevices`` list in the KubeVirt CR before assigning them to KubeVirt virtual machines. See the `KubeVirt documentation <https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices>`_ for more information.
+ * Users must manually add all passthrough GPU and vGPU resources to the ``permittedDevices`` list in the KubeVirt CR before assigning them to KubeVirt virtual machines. Refer to the `KubeVirt documentation <https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices>`_ for more information.

* MIG-backed vGPUs are not supported.
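For illustration only (not introduced by this change): once a passthrough GPU or vGPU resource is permitted in the KubeVirt CR, a virtual machine requests it through ``spec.template.spec.domain.devices.gpus``. The resource name below is a placeholder; use the name actually advertised on your nodes.

.. code-block:: yaml

    # Excerpt of a KubeVirt VirtualMachine spec (illustrative resource name)
    spec:
      template:
        spec:
          domain:
            devices:
              gpus:
              - deviceName: nvidia.com/GA102GL_A10   # assumed resource name for an A10 passthrough GPU
                name: gpu1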
@@ -83,7 +83,7 @@ Before using KubeVirt with the GPU Operator, ensure the following prerequisites

* The host is booted with ``intel_iommu=on`` or ``amd_iommu=on`` on the kernel command line.

- * If planning to use NVIDIA vGPU, SR-IOV must be enabled in the BIOS if your GPUs are based on the NVIDIA Ampere architecture or later. Refer to the `NVIDIA vGPU Documentation <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#prereqs-vgpu>`_ to ensure you have met all of the prerequisites for using NVIDIA vGPU.
+ * If you plan to use NVIDIA vGPU, SR-IOV must be enabled in the BIOS if your GPUs are based on the NVIDIA Ampere architecture or later. Refer to the `NVIDIA vGPU Documentation <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#prereqs-vgpu>`_ to ensure you have met all the prerequisites for using NVIDIA vGPU.

* KubeVirt is installed in the cluster.
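As a quick check (not part of this change), you can confirm the IOMMU settings on a worker node before proceeding; the commands below are a sketch and assume shell access to the host.

.. code-block:: console

    # Verify the kernel command line contains the IOMMU flag
    $ grep -E 'intel_iommu=on|amd_iommu=on' /proc/cmdline

    # Confirm the IOMMU was enabled at boot
    $ sudo dmesg | grep -i -e DMAR -e IOMMU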
@@ -110,14 +110,16 @@ After configuring the :ref:`prerequisites<prerequisites>`, the high level workflow
* :ref:`Install the GPU Operator <install-the-gpu-operator>` and set ``sandboxWorkloads.enabled=true``

If you are planning to deploy VMs with vGPU, the workflow is as follows:
- * :ref:`Build the NVIDIA vGPU Manager image <build-vgpu-manager-image>`
- * :ref:`Label the node for the vGPU configuration <vgpu-device-configuration>`
- * :ref:`Add vGPU resources to KubeVirt CR <add-vgpu-resources-to-kubevirt-cr>`
- * :ref:`Create a virtual machine with vGPU <create-a-virtual-machine-with-gpu>`
+
+ * :ref:`Build the NVIDIA vGPU Manager image <build-vgpu-manager-image>`
+ * :ref:`Label the node for the vGPU configuration <vgpu-device-configuration>`
+ * :ref:`Add vGPU resources to KubeVirt CR <add-vgpu-resources-to-kubevirt-cr>`
+ * :ref:`Create a virtual machine with vGPU <create-a-virtual-machine-with-gpu>`

If you are planning to deploy VMs with GPU passthrough, the workflow is as follows:
- * :ref:`Add GPU passthrough resources to KubeVirt CR <add-gpu-passthrough-resources-to-kubevirt-cr>`
- * :ref:`Create a virtual machine with GPU passthrough <create-a-virtual-machine-with-gpu>`
+
+ * :ref:`Add GPU passthrough resources to KubeVirt CR <add-gpu-passthrough-resources-to-kubevirt-cr>`
+ * :ref:`Create a virtual machine with GPU passthrough <create-a-virtual-machine-with-gpu>`

.. _label-worker-nodes:
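For illustration (not part of this change), labeling a worker node for a given workload type is a single command. The node name below is a placeholder, and the label value ``vm-vgpu`` is one of the workload types this guide describes (containers, passthrough VMs, or vGPU VMs).

.. code-block:: console

    # Mark a node for vGPU-backed virtual machine workloads
    $ kubectl label node <node-name> --overwrite nvidia.com/gpu.workload.config=vm-vgpu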
@@ -150,11 +152,11 @@ Follow one of the below subsections for installing the GPU Operator, depending on

.. note::

-    The following commnds set the``sandboxWorkloads.enabled`` flag.
+    The following commands set the ``sandboxWorkloads.enabled`` flag.
    This ``ClusterPolicy`` flag controls whether the GPU Operator can provision GPU worker nodes for virtual machine workloads, in addition to container workloads.
    This flag is disabled by default, meaning all nodes get provisioned with the same software to enable container workloads, and the ``nvidia.com/gpu.workload.config`` node label is not used.

-    The term ``sandboxing`` refers to running software in a separate isolated environment, typically for added security (i.e. a virtual machine).
+    The term *sandboxing* refers to running software in a separate isolated environment, typically for added security (that is, a virtual machine).
    We use the term ``sandbox workloads`` to signify workloads that run in a virtual machine, irrespective of the virtualization technology used.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
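As a sketch of what such an installation command can look like (not part of this change, and assuming the NVIDIA Helm repository has already been added), enabling sandbox workloads at install time is a single ``--set`` flag:

.. code-block:: console

    # Install the GPU Operator with support for sandbox (virtual machine) workloads
    $ helm install --wait gpu-operator nvidia/gpu-operator \
        --namespace gpu-operator --create-namespace \
        --set sandboxWorkloads.enabled=true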
@@ -208,9 +210,9 @@ Follow the steps provided in :ref:`this section<build-vgpu-manager-image>`.
- The vGPU Device Manager, deployed by the GPU Operator, automatically creates vGPU devices which can be assigned to KubeVirt virtual machines.
+ The vGPU Device Manager, deployed by the GPU Operator, automatically creates vGPU devices that can be assigned to KubeVirt virtual machines.
Without additional configuration, the GPU Operator creates a default set of devices on all GPUs.
- To learn more about how the vGPU Device Manager and configure which types of vGPU devices get created in your cluster, refer to :ref:`vGPU Device Configuration<vgpu-device-configuration>`.
+ To learn more about the vGPU Device Manager and how to configure which types of vGPU devices get created in your cluster, refer to :ref:`vGPU Device Configuration<vgpu-device-configuration>`.

Add GPU resources to KubeVirt CR
-------------------------------------
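For reference only (not introduced by this change), the ``permittedHostDevices`` section of the KubeVirt CR is where passthrough GPUs and vGPU devices are allowed. The PCI vendor:device ID and resource names below are illustrative values for an A10 and would differ for other GPUs and vGPU types.

.. code-block:: yaml

    apiVersion: kubevirt.io/v1
    kind: KubeVirt
    metadata:
      name: kubevirt
      namespace: kubevirt
    spec:
      configuration:
        permittedHostDevices:
          pciHostDevices:               # GPU passthrough devices
          - pciVendorSelector: "10DE:2236"
            resourceName: "nvidia.com/GA102GL_A10"
            externalResourceProvider: true
          mediatedDevices:              # vGPU devices
          - mdevNameSelector: "NVIDIA A10-8Q"
            resourceName: "nvidia.com/NVIDIA_A10-8Q"
            externalResourceProvider: true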
@@ -410,17 +412,29 @@ At runtime, administrators then point the vGPU Device Manager at one of these configurations
The configuration file is created as a ConfigMap, and is shared across all worker nodes.
At runtime, a node label, ``nvidia.com/vgpu.config``, can be used to decide which of these configurations to actually apply to a node at any given time.
If the node is not labeled, then the ``default`` configuration will be used.
- For more information on this component and how it is configured, refer to the project `README <https://github.com/NVIDIA/vgpu-device-manager>`_.
+ For more information on this component and how it is configured, refer to the `NVIDIA vGPU Device Manager README <https://github.com/NVIDIA/vgpu-device-manager>`_.

- By default, the GPU Operator deploys a ConfigMap for the vGPU Device Manager, containing named configurations for all `vGPU types <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#supported-gpus-grid-vgpu>`_ supported by NVIDIA vGPU.
+ By default, the GPU Operator deploys a ConfigMap for the vGPU Device Manager, containing named configurations for all `vGPU types supported by NVIDIA vGPU <https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#supported-gpus-grid-vgpu>`_.
Users can select a specific configuration for a worker node by applying the ``nvidia.com/vgpu.config`` node label.
- For example, labeling a node with ``nvidia.com/vgpu.config=A10-8Q`` would create 3 vGPU devices of type **A10-8Q** on all **A10** GPUs on the node (note: 3 is the maximum number of **A10-8Q** devices that can be created per GPU).
+ For example, labeling a node with ``nvidia.com/vgpu.config=A10-8Q`` would create three vGPU devices of type **A10-8Q** on all **A10** GPUs on the node. Note that three is the maximum number of **A10-8Q** devices that can be created per GPU.
If the node is not labeled, the ``default`` configuration will be applied.
- The ``default`` configuration will create Q-series vGPU devices on all GPUs, where the amount of framebuffer memory per vGPU device
- is half the total GPU memory.
- For example, the ``default`` configuration will create two **A10-12Q** devices on all **A10** GPUs, two **V100-8Q** devices on all **V100** GPUs, and two **T4-8Q** devices on all **T4** GPUs.
+ The ``default`` configuration will create Q-series vGPU devices on all GPUs, where the amount of framebuffer memory per vGPU device is half the total GPU memory.
+ For example, the ``default`` configuration will create two **A10-12Q** devices on all **A10** GPUs, two **V100-8Q** devices on all **V100** GPUs, and two **T4-8Q** devices on all **T4** GPUs.
+
+ You can also create different vGPU Q profiles on the same GPU using vGPU Device Manager configuration.
+ For example, you can create an **A10-4Q** and an **A10-6Q** device on the same GPU by creating a vGPU Device Manager configuration with the following content:
+
+ .. code-block:: yaml
+
+     version: v1
+     vgpu-configs:
+       custom-A10-config:
+         - devices: all
+           vgpu-devices:
+             "A10-4Q": 3
+             "A10-6Q": 2

- If custom vGPU device configuration is desired, more than the default ConfigMap provides, you can create your own ConfigMap:
+ If you want a custom vGPU device configuration beyond what the default config map provides, you can create your own config map:

.. code-block:: console
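As a sketch of how such a custom configuration could be wired up (not part of this change; the ConfigMap name, file name, and ClusterPolicy field shown here are assumptions), you would create the config map, point the GPU Operator at it, and then label nodes with the named configuration to apply:

.. code-block:: console

    # Create a ConfigMap from the custom vGPU Device Manager configuration file
    $ kubectl create configmap custom-vgpu-config -n gpu-operator --from-file=config.yaml

    # Point the ClusterPolicy at the custom ConfigMap (field name assumed)
    $ kubectl patch clusterpolicy/cluster-policy --type merge \
        -p '{"spec": {"vgpuDeviceManager": {"config": {"name": "custom-vgpu-config"}}}}'

    # Apply the named configuration to a node
    $ kubectl label node <node-name> --overwrite nvidia.com/vgpu.config=custom-A10-config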
@@ -476,7 +490,7 @@ After the vGPU Device Manager finishes applying the new configuration, all GPU Operator
- You can now see 12 **A10-4Q** devices on the node, as 6 **A10-4Q** devices can be created per **A10** GPU.
+ You can now see 12 **A10-4Q** devices on the node, as six **A10-4Q** devices can be created per **A10** GPU.

.. code-block:: console
@@ -500,10 +514,9 @@ This section covers building the NVIDIA vGPU Manager container image and pushing

Download the vGPU Software from the `NVIDIA Licensing Portal <https://nvid.nvidia.com/dashboard/#/dashboard>`_.

- * Login to the NVIDIA Licensing Portal and navigate to the `Software Downloads` section.
- * The NVIDIA vGPU Software is located in the Software Downloads section of the NVIDIA Licensing Portal.
- * The vGPU Software bundle is packaged as a zip file.
- Download and unzip the bundle to obtain the NVIDIA vGPU Manager for Linux file, ``NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run``.
+ * Log in to the NVIDIA Licensing Portal and navigate to the **Software Downloads** section.
+ * The NVIDIA vGPU Software is located in the **Software Downloads** section of the NVIDIA Licensing Portal.
+ * The vGPU Software bundle is packaged as a zip file. Download and unzip the bundle to obtain the NVIDIA vGPU Manager for Linux file, ``NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run``.

.. start-nvaie-run-file
@@ -512,7 +525,7 @@ Download the vGPU Software from the `NVIDIA Licensing Portal <https://nvid.nvidi
NVIDIA AI Enterprise customers must use the ``aie`` .run file for building the NVIDIA vGPU Manager image.
Download the ``NVIDIA-Linux-x86_64-<version>-vgpu-kvm-aie.run`` file instead, and rename it to
``NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run`` before proceeding with the rest of the procedure.
- Refer to the ``Infrastructure Support Matrix`` under section under the `NVIDIA AI Enterprise Infra Release Branches <https://docs.nvidia.com/ai-enterprise/index.html#infrastructure-software>`_ for details on supported version number to use.
+ Refer to the **Infrastructure Support Matrix** section under the `NVIDIA AI Enterprise Infrastructure Release Branches <https://docs.nvidia.com/ai-enterprise/index.html#infrastructure-software>`_ for details on the supported version to use.
.. end-nvaie-run-file

Next, clone the driver container repository and build the driver image with the following steps.
@@ -532,7 +545,7 @@ Change to the vgpu-manager directory for your OS. We use Ubuntu 20.04 as an example

.. note::

-    For RedHat OpenShift, run ``cd vgpu-manager/rhel8`` to use the ``rhel8`` folder instead.
+    For Red Hat OpenShift, run ``cd vgpu-manager/rhel8`` to use the ``rhel8`` folder instead.

Copy the NVIDIA vGPU Manager from your extracted zip file
@@ -543,7 +556,7 @@ Copy the NVIDIA vGPU Manager from your extracted zip file
|Set the following environment variables:
|``PRIVATE_REGISTRY`` - name of private registry used to store driver image
|``VERSION`` - NVIDIA vGPU Manager version downloaded from NVIDIA Software Portal
- |``OS_TAG`` - this must match the Guest OS version. In the below example ``ubuntu20.04`` is used. For RedHat OpenShift this should be set to ``rhcos4.x`` where x is the supported minor OCP version.
+ |``OS_TAG`` - this must match the Guest OS version. In the following example, ``ubuntu20.04`` is used. For Red Hat OpenShift, set this to ``rhcos4.x``, where ``x`` is the supported minor OCP version.
|``CUDA_VERSION`` - CUDA base image version to build the driver image with.
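For illustration only (the registry name, driver version, and build arguments below are placeholders, and the actual Dockerfile arguments may differ), setting the variables and building the image could look like this:

.. code-block:: console

    $ export PRIVATE_REGISTRY=registry.example.com/nvidia \
             VERSION=550.90.05 \
             OS_TAG=ubuntu20.04 \
             CUDA_VERSION=12.4.1

    # Build and push the vGPU Manager image (build-arg names assumed)
    $ docker build \
        --build-arg DRIVER_VERSION=${VERSION} \
        --build-arg CUDA_VERSION=${CUDA_VERSION} \
        -t ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} .
    $ docker push ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG}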
openshift/openshift-virtualization.rst
13 additions & 0 deletions
@@ -656,6 +656,19 @@ If the node is not labeled, the ``default`` configuration will be applied.
The ``default`` configuration will create Q-series vGPU devices on all GPUs, where the amount of framebuffer memory per vGPU device is half the total GPU memory.
For example, the ``default`` configuration will create two **A10-12Q** devices on all **A10** GPUs, two **V100-8Q** devices on all **V100** GPUs, and two **T4-8Q** devices on all **T4** GPUs.

+ You can also create different vGPU Q profiles on the same GPU using vGPU Device Manager configuration.
+ For example, you can create an **A10-4Q** and an **A10-6Q** device on the same GPU by creating a vGPU Device Manager configuration with the following content:
+
+ .. code-block:: yaml
+
+     version: v1
+     vgpu-configs:
+       custom-A10-config:
+         - devices: all
+           vgpu-devices:
+             "A10-4Q": 3
+             "A10-6Q": 2
+

If custom vGPU device configuration is desired, more than the default ConfigMap provides, you can create your own ConfigMap:
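As with the KubeVirt guide, a named configuration like the one above would then be applied per node by labeling it (a sketch; the node name is a placeholder):

.. code-block:: console

    $ oc label node <node-name> --overwrite nvidia.com/vgpu.config=custom-A10-config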