
Commit effb9a1

[WIP] CNV-13531: mdev/vGPU configuration
1 parent 6eee559 commit effb9a1

6 files changed: 343 additions, 0 deletions

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -3090,6 +3090,8 @@ Topics:
   File: virt-configuring-pci-passthrough
 - Name: Configuring vGPU passthrough
   File: virt-configuring-vgpu-passthrough
+- Name: Configuring mediated devices
+  File: virt-configuring-mediated-devices
 - Name: Configuring a watchdog device
   File: virt-configuring-a-watchdog
 - Name: Enabling descheduler evictions on virtual machines
Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="virt-about-using-virtual-gpus_{context}"]
7+
= About using virtual GPUs with {VirtProductName}
8+
9+
Some graphics processing unit (GPU) cards support the creation of virtual GPUs (vGPUs). {VirtProductName} can automatically create vGPUs and other mediated devices if an administrator provides configuration details in the `HyperConverged` custom resource (CR). This automation is especially useful for large clusters.
10+
11+
[NOTE]
12+
====
13+
Refer to your hardware vendor's documentation for functionality and support details.
14+
====
15+
16+
Mediated device:: A physical device that is divided into one or more virtual devices. A vGPU is a type of mediated device (mdev); the performance of the physical GPU is divided among the virtual devices. You can assign mediated devices to one or more virtual machines (VMs), but the number of guests must be compatible with your GPU. Some GPUs do not support multiple guests.
17+
18+
[id="configuration-overview_{context}"]
19+
== Configuration overview
20+
21+
When configuring mediated devices, an administrator must:
22+
23+
* Create the mediated devices.
24+
* Expose the mediated devices to the cluster.
25+
26+
The `HyperConverged` CR includes APIs that accomplish both tasks:
27+
28+
.Creating mediated devices
29+
30+
[source,yaml]
31+
----
...
spec:
  mediatedDevicesConfiguration:
    mediatedDevicesTypes: <.>
    - <device_type>
    nodeMediatedDeviceTypes: <.>
    - mediatedDevicesTypes: <.>
      - <device_type>
      nodeSelector: <.>
        <node_selector_key>: <node_selector_value>
...
----
44+
<.> Required: Configures global settings for the cluster.
45+
<.> Optional: Overrides the global configuration for a specific node or group of nodes. Must be used with the global `mediatedDevicesTypes` configuration.
46+
<.> Required if you use `nodeMediatedDeviceTypes`. Overrides the global `mediatedDevicesTypes` configuration for select nodes.
47+
<.> Required if you use `nodeMediatedDeviceTypes`. Must include a `key:value` pair.
48+
49+
.Exposing mediated devices to the cluster
50+
51+
[source,yaml]
52+
----
...
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: GRID T4-2Q <.>
      resourceName: nvidia.com/GRID_T4-2Q
...
----
60+
<.> Exposes the mediated devices that map to this value on the host.
61+
+
62+
[NOTE]
63+
====
64+
To see the mediated device types that your device supports, view the contents of `/sys/bus/pci/devices/<domain>:<bus>:<slot>.<function>/mdev_supported_types/<type>/name`, substituting the correct values for your system.
65+
66+
For example, the name file for the `nvidia-231` type contains the selector string `GRID T4-2Q`. Using `GRID T4-2Q` as the `mdevNameSelector` value allows nodes to use the `nvidia-231` type.
67+
====
68+
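The lookup described in the note above can be scripted. The following is a minimal sketch, assuming a shell on the node itself and the standard sysfs layout named in the note. The function name `list_mdev_types` and the PCI address in the usage comment are hypothetical and are not part of {VirtProductName}.

```shell
# Print each mdev type that a PCI device supports, followed by the
# selector string stored in that type's name file.
# Usage (hypothetical address): list_mdev_types /sys/bus/pci/devices/0000:65:00.0
list_mdev_types() {
  local dev_dir="$1" type_dir
  for type_dir in "$dev_dir"/mdev_supported_types/*/; do
    # Skip when the glob matched nothing or the name file is unreadable.
    [ -r "${type_dir}name" ] || continue
    printf '%s\t%s\n' "$(basename "$type_dir")" "$(cat "${type_dir}name")"
  done
}
```

The second column of each output line is the value to use for `mdevNameSelector`.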
69+
[id="how-vgpus-are-assigned-to-nodes_{context}"]
70+
== How vGPUs are assigned to nodes
71+
72+
For each physical device, {VirtProductName} configures:
73+
74+
* A single mdev type.
75+
* The maximum number of instances of the selected mdev type.
76+
77+
The cluster architecture affects how devices are created and assigned to nodes.
78+
79+
Large cluster with multiple cards per node:: On nodes with multiple cards that can support similar vGPU types, the relevant device types are created in a round-robin manner.
80+
For example:
81+
+
82+
[source,yaml]
83+
----
...
mediatedDevicesConfiguration:
  mediatedDevicesTypes:
  - nvidia-222
  - nvidia-228
  - nvidia-105
  - nvidia-108
...
----
93+
+
94+
In this scenario, each node has two cards, both of which support the following vGPU types:
95+
+
96+
[source,text]
97+
----
nvidia-105
...
nvidia-108
nvidia-217
nvidia-299
...
----
105+
+
106+
On each node, {VirtProductName} creates:
107+
108+
* 16 vGPUs of type nvidia-105 on the first card.
109+
* 2 vGPUs of type nvidia-108 on the second card.
110+
111+
One node has a single card that supports more than one requested vGPU type:: {VirtProductName} uses the supported type that comes first on the `mediatedDevicesTypes` list.
112+
+
113+
For example, a node's card supports `nvidia-223` and `nvidia-224`. The following `mediatedDevicesTypes` list is configured:
114+
+
115+
[source,yaml]
116+
----
...
mediatedDevicesConfiguration:
  mediatedDevicesTypes:
  - nvidia-22
  - nvidia-223
  - nvidia-224
...
----
125+
+
126+
In this example, {VirtProductName} uses the `nvidia-223` type.
127+
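The selection rule for a single card can be sketched in shell as follows. This is an illustrative sketch only: the function name `pick_first_supported` is hypothetical, and the logic is inferred from the example above rather than taken from the {VirtProductName} source.

```shell
# pick_first_supported "<requested types>" "<types the card supports>"
# Prints the first requested type that the card supports.
# Returns 1 if the card supports none of the requested types.
pick_first_supported() {
  local requested="$1" supported="$2" want have
  for want in $requested; do
    for have in $supported; do
      if [ "$want" = "$have" ]; then
        printf '%s\n' "$want"
        return 0
      fi
    done
  done
  return 1
}

pick_first_supported "nvidia-22 nvidia-223 nvidia-224" "nvidia-223 nvidia-224"
# nvidia-223
```

Because `nvidia-22` is not supported by the card, the sketch moves on to the next entry in the list, as in the example.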
128+
[id="about-changing-removing-mediated-devices_{context}"]
129+
== About changing and removing mediated devices
130+
131+
{VirtProductName} updates the cluster's mediated device configuration if:
132+
133+
* You edit the `HyperConverged` CR and change the contents of the `mediatedDevicesTypes` stanza.
134+
135+
* You change the node labels that match the `nodeMediatedDeviceTypes` node selector.
136+
137+
* You remove the device information from the `spec.mediatedDevicesConfiguration` and `spec.permittedHostDevices` stanzas of the `HyperConverged` CR.
138+
+
139+
[NOTE]
140+
====
141+
If you remove the device information from the `spec.permittedHostDevices` stanza without also removing it from the `spec.mediatedDevicesConfiguration` stanza, you cannot create a new mediated device type on the same node. To properly remove mediated devices, remove the device information from both stanzas.
142+
====
143+
144+
Depending on the specific changes, these actions cause {VirtProductName} to reconfigure mediated devices or remove them from the cluster nodes.
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: PROCEDURE
6+
[id="virt-assigning-mediated-device-virtual-machine_{context}"]
7+
= Assigning a mediated device to a virtual machine
8+
9+
Assign mediated devices such as virtual GPUs (vGPUs) to virtual machines.
10+
11+
.Prerequisites
12+
13+
* The mediated device is configured in the `HyperConverged` custom resource.
14+
15+
.Procedure
16+
17+
* Assign the mediated device to a virtual machine (VM) by editing the `spec.domain.devices.gpus` stanza of the `VirtualMachine` manifest:
18+
+
19+
.Example virtual machine manifest
20+
[source,yaml]
21+
----
apiVersion: kubevirt.io/v1
kind: VirtualMachine
spec:
  domain:
    devices:
      gpus:
      - deviceName: nvidia.com/TU104GL_Tesla_T4 <1>
        name: gpu1 <2>
      - deviceName: nvidia.com/GRID_T4-1Q
        name: gpu2
----
33+
<1> The resource name associated with the mediated device.
34+
<2> A name to identify the device on the VM.
35+
36+
.Verification
37+
38+
* To verify that the device is available from the virtual machine, run the following command, substituting `<device_name>` with the `deviceName` value from the `VirtualMachine` manifest:
39+
+
40+
[source,terminal]
41+
----
$ lspci -nnk | grep <device_name>
----
Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: PROCEDURE
6+
[id="virt-creating-and-exposing-mediated-devices_{context}"]
7+
= Creating and exposing mediated devices
8+
9+
You can create and expose mediated devices such as virtual GPUs (vGPUs) by editing the `HyperConverged` custom resource (CR).
10+
11+
.Prerequisites
12+
13+
* You enabled the IOMMU (Input-Output Memory Management Unit) driver.
14+
15+
.Procedure
16+
17+
. Edit the `HyperConverged` CR in your default editor by running the following command:
18+
+
19+
[source,terminal]
20+
----
$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
----
23+
24+
. Add the mediated device information to the `HyperConverged` CR `spec`, ensuring that you include the `mediatedDevicesConfiguration` and `permittedHostDevices` stanzas. For example:
25+
+
26+
.Example configuration file
27+
[source,yaml]
28+
----
apiVersion: hco.kubevirt.io/v1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  mediatedDevicesConfiguration: <.>
    mediatedDevicesTypes: <.>
    - nvidia-231
    nodeMediatedDeviceTypes: <.>
    - mediatedDevicesTypes: <.>
      - nvidia-233
      nodeSelector:
        kubernetes.io/hostname: node-11.redhat.com
  permittedHostDevices: <.>
    mediatedDevices:
    - mdevNameSelector: GRID T4-2Q
      resourceName: nvidia.com/GRID_T4-2Q
    - mdevNameSelector: GRID T4-8Q
      resourceName: nvidia.com/GRID_T4-8Q
...
----
51+
<.> Creates mediated devices.
52+
<.> Required: Global `mediatedDevicesTypes` configuration.
53+
<.> Optional: Overrides the global configuration for specific nodes.
54+
<.> Required if you use `nodeMediatedDeviceTypes`.
55+
<.> Exposes mediated devices to the cluster.
56+
57+
. Save your changes and exit the editor.
58+
59+
.Verification
60+
61+
* You can verify that a device was added to a specific node by running the following command:
62+
+
63+
[source,terminal]
64+
----
$ oc describe node <node_name>
----
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: PROCEDURE
6+
[id="virt-removing-mediated-device-from-cluster-cli_{context}"]
7+
= Removing mediated devices from the cluster using the CLI
8+
9+
To remove a mediated device from the cluster, delete the information for that device from the `HyperConverged` custom resource (CR).
10+
11+
.Procedure
12+
13+
. Edit the `HyperConverged` CR in your default editor by running the following command:
14+
+
15+
[source,terminal]
16+
----
$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
----
19+
20+
. Remove the device information from the `spec.mediatedDevicesConfiguration` and `spec.permittedHostDevices` stanzas of the `HyperConverged` CR. Removing both entries ensures that you can later create a new mediated device type on the same node. For example:
21+
+
22+
.Example configuration file
23+
[source,yaml]
24+
----
apiVersion: hco.kubevirt.io/v1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  mediatedDevicesConfiguration:
    mediatedDevicesTypes: <1>
    - nvidia-231
  permittedHostDevices:
    mediatedDevices: <2>
    - mdevNameSelector: GRID T4-2Q
      resourceName: nvidia.com/GRID_T4-2Q
----
39+
<1> To remove the `nvidia-231` device type, delete it from the `mediatedDevicesTypes` array.
40+
<2> To remove the `GRID T4-2Q` device, delete the `mdevNameSelector` field and its corresponding `resourceName` field.
41+
42+
. Save your changes and exit the editor.
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
1+
:_content-type: ASSEMBLY
2+
[id="virt-configuring-mediated-devices"]
3+
= Configuring mediated devices
4+
include::_attributes/virt-document-attributes.adoc[]
5+
include::_attributes/common-attributes.adoc[]
6+
:context: virt-configuring-mediated-devices
7+
8+
toc::[]
9+
10+
{VirtProductName} automatically creates mediated devices, such as virtual GPUs (vGPUs), if you provide a list of devices in the `HyperConverged` custom resource (CR).
11+
12+
ifdef::openshift-enterprise[]
13+
:FeatureName: Declarative configuration of mediated devices
14+
include::snippets/technology-preview.adoc[]
15+
endif::[]
16+
17+
[id="prerequisites_virt-configuring-mediated-devices"]
18+
== Prerequisites
19+
20+
* If your hardware vendor provides drivers, you installed them on the nodes where you want to create mediated devices.
21+
** If you use NVIDIA cards, you link:https://access.redhat.com/solutions/6738411[installed the NVIDIA GRID driver].
22+
23+
include::modules/virt-about-using-virtual-gpus.adoc[leveloffset=+1]
24+
25+
[id="virt-preparing-host-for-mdevs"]
26+
== Preparing hosts for mediated devices
27+
28+
You must enable the IOMMU (Input-Output Memory Management Unit) driver before you can configure mediated devices.
29+
30+
include::modules/virt-adding-kernel-arguments-enable-iommu.adoc[leveloffset=+2]
31+
32+
[id="virt-adding-and-removing-mediated-devices"]
33+
== Adding and removing mediated devices
34+
35+
include::modules/virt-creating-and-exposing-mediated-devices.adoc[leveloffset=+2]
36+
37+
include::modules/virt-removing-mediated-device-from-cluster-cli.adoc[leveloffset=+2]
38+
39+
// VM owner task:
40+
41+
include::modules/virt-assigning-mediated-device-virtual-machine.adoc[leveloffset=+1]
42+
43+
[role="_additional-resources"]
44+
[id="additional-resources_virt-configuring-mediated-devices"]
45+
== Additional resources
46+
* link:https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-troubleshooting-enabling_intel_vt_x_and_amd_v_virtualization_hardware_extensions_in_bios[Enabling Intel VT-X and AMD-V Virtualization Hardware Extensions in BIOS]
