
Commit 460a199

Merge pull request #53509 from kquinn1204/TELCODOCS-969
2 parents 4f29e1c + cfcda33 commit 460a199

10 files changed

+227
-141
lines changed

modules/about-using-gpu-operator.adoc

Lines changed: 23 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,23 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="about-using-nvidia-gpu_{context}"]
7+
= About using the NVIDIA GPU Operator
8+
9+
The NVIDIA GPU Operator manages NVIDIA GPU resources in an {product-title} cluster and automates tasks related to bootstrapping GPU nodes.
10+
Since the GPU is a special resource in the cluster, you must install some components before deploying application workloads onto the GPU.
11+
These components include the NVIDIA drivers, which enable Compute Unified Device Architecture (CUDA), the Kubernetes device plugin, the container runtime, and other features such as automatic node labeling and monitoring.
12+
[NOTE]
13+
====
14+
The NVIDIA GPU Operator is supported only by NVIDIA. For more information about obtaining support from NVIDIA, see link:https://access.redhat.com/solutions/5174941[Obtaining Support from NVIDIA].
15+
====
16+
17+
There are two ways to enable GPUs with {product-title} {VirtProductName}: the {product-title}-native way described here, or the NVIDIA GPU Operator.
18+
19+
The NVIDIA GPU Operator is a Kubernetes Operator that enables {product-title} {VirtProductName} to expose GPUs to virtualized workloads running on {product-title}.
20+
It allows users to easily provision and manage GPU-enabled virtual machines, providing them with the ability to run complex artificial intelligence/machine learning (AI/ML) workloads on the same platform as their other workloads.
21+
It also provides an easy way to scale the GPU capacity of their infrastructure, allowing for rapid growth of GPU-based workloads.
22+
23+
For more information about using the NVIDIA GPU Operator to provision worker nodes for running GPU-accelerated VMs, see link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/openshift-virtualization.html[NVIDIA GPU Operator with OpenShift Virtualization].

modules/using-mediated-devices.adoc

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="virt-using-mediated-devices_{context}"]
7+
= Using mediated devices
8+
9+
A vGPU is a type of mediated device; the performance of the physical GPU is divided among the virtual devices. You can assign mediated devices to one or more virtual machines.
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: CONCEPT
6+
7+
[id="about-changing-removing-mediated-devices_{context}"]
8+
= About changing and removing mediated devices
9+
10+
The cluster's mediated device configuration can be updated with {VirtProductName} by:
11+
12+
* Editing the `HyperConverged` CR and changing the contents of the `mediatedDevicesTypes` stanza.
13+
14+
* Changing the node labels that match the `nodeMediatedDeviceTypes` node selector.
15+
16+
* Removing the device information from the `spec.mediatedDevicesConfiguration` and `spec.permittedHostDevices` stanzas of the `HyperConverged` CR.
17+
+
18+
[NOTE]
19+
====
20+
If you remove the device information from the `spec.permittedHostDevices` stanza without also removing it from the `spec.mediatedDevicesConfiguration` stanza, you cannot create a new mediated device type on the same node. To properly remove mediated devices, remove the device information from both stanzas.
21+
====
22+
23+
Depending on the specific changes, these actions cause {VirtProductName} to reconfigure mediated devices or remove them from the cluster nodes.
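
The note above says that a clean removal touches both stanzas. As a minimal sketch, the helper below builds a JSON Patch that drops both stanzas in one request. The function name is hypothetical, and the sketch assumes that the stanzas hold nothing you want to keep (for example, no PCI host devices under `spec.permittedHostDevices`); otherwise, edit the CR by hand.

```shell
# Hypothetical helper: prints a JSON Patch that removes both stanzas
# from the HyperConverged CR in a single request.
# Assumption: the whole stanzas can be dropped. A JSON Patch "remove"
# operation fails if the target path does not exist.
build_mdev_removal_patch() {
  printf '%s' '[{"op":"remove","path":"/spec/mediatedDevicesConfiguration"},{"op":"remove","path":"/spec/permittedHostDevices"}]'
}

# Show the patch for review.
build_mdev_removal_patch
echo

# To apply it, assuming the default CR name and namespace:
# oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv \
#   --type=json -p "$(build_mdev_removal_patch)"
```

Sending both removals in one patch avoids the intermediate state in which only `spec.permittedHostDevices` has been cleared.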

modules/virt-about-using-virtual-gpus.adoc

Lines changed: 0 additions & 127 deletions
Original file line numberDiff line numberDiff line change
@@ -15,130 +15,3 @@ Refer to your hardware vendor's documentation for functionality and support deta
1515

1616
Mediated device:: A physical device that is divided into one or more virtual devices. A vGPU is a type of mediated device (mdev); the performance of the physical GPU is divided among the virtual devices. You can assign mediated devices to one or more virtual machines (VMs), but the number of guests must be compatible with your GPU. Some GPUs do not support multiple guests.
1717

18-
[id="configuration-overview_{context}"]
19-
== Configuration overview
20-
21-
When configuring mediated devices, an administrator must:
22-
23-
* Create the mediated devices.
24-
* Expose the mediated devices to the cluster.
25-
26-
The `HyperConverged` CR includes APIs that accomplish both tasks:
27-
28-
.Creating mediated devices
29-
30-
[source,yaml]
31-
----
32-
...
33-
spec:
34-
mediatedDevicesConfiguration:
35-
mediatedDevicesTypes: <.>
36-
- <device_type>
37-
nodeMediatedDeviceTypes: <.>
38-
- mediatedDevicesTypes: <.>
39-
- <device_type>
40-
nodeSelector: <.>
41-
<node_selector_key>: <node_selector_value>
42-
...
43-
----
44-
<.> Required: Configures global settings for the cluster.
45-
<.> Optional: Overrides the global configuration for a specific node or group of nodes. Must be used with the global `mediatedDevicesTypes` configuration.
46-
<.> Required if you use `nodeMediatedDeviceTypes`. Overrides the global `mediatedDevicesTypes` configuration for select nodes.
47-
<.> Required if you use `nodeMediatedDeviceTypes`. Must include a `key:value` pair.
48-
49-
.Exposing mediated devices to the cluster
50-
51-
[source,yaml]
52-
----
53-
...
54-
permittedHostDevices:
55-
mediatedDevices:
56-
- mdevNameSelector: GRID T4-2Q <.>
57-
resourceName: nvidia.com/GRID_T4-2Q
58-
...
59-
----
60-
<.> Exposes the mediated devices that map to this value on the host.
61-
+
62-
[NOTE]
63-
====
64-
You can see the mediated device types that your device supports by viewing the contents of `/sys/bus/pci/devices/<slot>:<bus>:<domain>.<function>/mdev_supported_types/<type>/name`, substituting the correct values for your system.
65-
66-
For example, the name file for the `nvidia-231` type contains the selector string `GRID T4-2Q`. Using `GRID T4-2Q` as the `mdevNameSelector` value allows nodes to use the `nvidia-231` type.
67-
====
68-
69-
[id="how-vgpus-are-assigned-to-nodes_{context}"]
70-
== How vGPUs are assigned to nodes
71-
72-
For each physical device, {VirtProductName} configures:
73-
74-
* A single mdev type.
75-
* The maximum number of instances of the selected mdev type.
76-
77-
The cluster architecture affects how devices are created and assigned to nodes.
78-
79-
Large cluster with multiple cards per node:: On nodes with multiple cards that can support similar vGPU types, the relevant device types are created in a round-robin manner.
80-
For example:
81-
+
82-
[source,yaml]
83-
----
84-
...
85-
mediatedDevicesConfiguration:
86-
mediatedDevicesTypes:
87-
- nvidia-222
88-
- nvidia-228
89-
- nvidia-105
90-
- nvidia-108
91-
...
92-
----
93-
+
94-
In this scenario, each node has two cards, both of which support the following vGPU types:
95-
+
96-
[source,text]
97-
----
98-
nvidia-105
99-
...
100-
nvidia-108
101-
nvidia-217
102-
nvidia-299
103-
...
104-
----
105-
+
106-
On each node, {VirtProductName} creates:
107-
108-
* 16 vGPUs of type nvidia-105 on the first card.
109-
* 2 vGPUs of type nvidia-108 on the second card.
110-
111-
One node has a single card that supports more than one requested vGPU type:: {VirtProductName} uses the supported type that comes first on the `mediatedDevicesTypes` list.
112-
+
113-
For example, a node's card supports `nvidia-223` and `nvidia-224`. The following `mediatedDevicesTypes` list is configured:
114-
+
115-
[source,yaml]
116-
----
117-
...
118-
mediatedDevicesConfiguration:
119-
mediatedDevicesTypes:
120-
- nvidia-22
121-
- nvidia-223
122-
- nvidia-224
123-
...
124-
----
125-
+
126-
In this example, {VirtProductName} uses the `nvidia-223` type.
127-
128-
[id="about-changing-removing-mediated-devices_{context}"]
129-
== About changing and removing mediated devices
130-
131-
{VirtProductName} updates the cluster's mediated device configuration if:
132-
133-
* You edit the `HyperConverged` CR and change the contents of the `mediatedDevicesTypes` stanza.
134-
135-
* You change the node labels that match the `nodeMediatedDeviceTypes` node selector.
136-
137-
* You remove the device information from the `spec.mediatedDevicesConfiguration` and `spec.permittedHostDevices` stanzas of the `HyperConverged` CR.
138-
+
139-
[NOTE]
140-
====
141-
If you remove the device information from the `spec.permittedHostDevices` stanza without also removing it from the `spec.mediatedDevicesConfiguration` stanza, you cannot create a new mediated device type on the same node. To properly remove mediated devices, remove the device information from both stanzas.
142-
====
143-
144-
Depending on the specific changes, these actions cause {VirtProductName} to reconfigure mediated devices or remove them from the cluster nodes.
Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="virt-adding-and-removing-mediated-devices_{context}"]
7+
= Adding and removing mediated devices
8+
9+
You can add or remove mediated devices.
Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: REFERENCE
6+
[id="how-vgpus-are-assigned-to-nodes_{context}"]
7+
= How vGPUs are assigned to nodes
8+
9+
For each physical device, {VirtProductName} configures the following values:
10+
11+
* A single `mdev` type.
12+
* The maximum number of instances of the selected `mdev` type.
13+
14+
The cluster architecture affects how devices are created and assigned to nodes.
15+
16+
Large cluster with multiple cards per node:: On nodes with multiple cards that can support similar vGPU types, the relevant device types are created in a round-robin manner.
17+
For example:
18+
+
19+
[source,yaml]
20+
----
21+
...
22+
mediatedDevicesConfiguration:
23+
mediatedDevicesTypes:
24+
- nvidia-222
25+
- nvidia-228
26+
- nvidia-105
27+
- nvidia-108
28+
...
29+
----
30+
+
31+
In this scenario, each node has two cards, both of which support the following vGPU types:
32+
+
33+
[source,text]
34+
----
35+
nvidia-105
36+
...
37+
nvidia-108
38+
nvidia-217
39+
nvidia-299
40+
...
41+
----
42+
+
43+
On each node, {VirtProductName} creates the following vGPUs:
44+
45+
* 16 vGPUs of type nvidia-105 on the first card.
46+
* 2 vGPUs of type nvidia-108 on the second card.
47+
48+
One node has a single card that supports more than one requested vGPU type:: {VirtProductName} uses the supported type that comes first on the `mediatedDevicesTypes` list.
49+
+
50+
For example, the card on a node supports `nvidia-223` and `nvidia-224`. The following `mediatedDevicesTypes` list is configured:
51+
+
52+
[source,yaml]
53+
----
54+
...
55+
mediatedDevicesConfiguration:
56+
mediatedDevicesTypes:
57+
- nvidia-22
58+
- nvidia-223
59+
- nvidia-224
60+
...
61+
----
62+
+
63+
In this example, {VirtProductName} uses the `nvidia-223` type.
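
The two scenarios above can be reproduced with a small model. This is a simplified sketch, not the {VirtProductName} implementation: it assumes that the requested types a card supports are kept in request order and then handed out to the cards in turn. The function name `assign_mdev_types` is hypothetical, and the model does not compute how many instances of each type are created.

```shell
# Simplified model (an assumption, not the actual implementation):
# keep the requested types that the cards support, in request order,
# then assign them to the cards in a round-robin manner.
# Usage: assign_mdev_types "<requested types>" "<supported types>" <card_count>
# The body runs in a subshell so that its variables do not leak.
assign_mdev_types() (
  requested=$1 supported=$2 cards=$3
  usable=""
  for want in $requested; do
    for have in $supported; do
      if [ "$want" = "$have" ]; then
        usable="$usable $want"
        break
      fi
    done
  done

  # Load the usable types into the positional parameters.
  set -- $usable
  # Nothing can be created if no requested type is supported.
  [ "$#" -gt 0 ] || exit 1

  card=1
  while [ "$card" -le "$cards" ]; do
    printf 'card %d: %s\n' "$card" "$1"
    # Rotate the list so that the next card gets the next type.
    first=$1
    shift
    set -- "$@" "$first"
    card=$((card + 1))
  done
)

# Two cards per node, as in the first scenario.
assign_mdev_types "nvidia-222 nvidia-228 nvidia-105 nvidia-108" \
                  "nvidia-105 nvidia-108 nvidia-217 nvidia-299" 2

# One card that supports two of the requested types, as in the second scenario.
assign_mdev_types "nvidia-22 nvidia-223 nvidia-224" "nvidia-223 nvidia-224" 1
```

Under these assumptions, the first call assigns `nvidia-105` to the first card and `nvidia-108` to the second card, and the second call assigns `nvidia-223` to the single card.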
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: CONCEPT
6+
7+
[id="virt-preparing-host-for-mdevs_{context}"]
8+
= Preparing hosts for mediated devices
9+
10+
You must enable the Input-Output Memory Management Unit (IOMMU) driver before you can configure mediated devices.
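
One common way to enable the IOMMU driver on {product-title} nodes is a `MachineConfig` object that adds a kernel argument. The following sketch writes such an object to a file. The helper name, the object name, the output file name, and the Ignition version are illustrative assumptions; use `intel_iommu=on` for Intel CPUs or `amd_iommu=on` for AMD CPUs.

```shell
# Hypothetical helper: writes a MachineConfig that adds an IOMMU kernel
# argument to the worker nodes.
# $1: kernel argument, for example intel_iommu=on or amd_iommu=on
# $2: output file
write_iommu_machineconfig() {
  cat > "$2" <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 100-worker-iommu
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - $1
EOF
}

write_iommu_machineconfig intel_iommu=on 100-worker-kernel-arg-iommu.yaml

# Review the file, then create the object on the cluster:
# oc create -f 100-worker-kernel-arg-iommu.yaml
```

Applying a `MachineConfig` that changes kernel arguments typically causes the nodes in the affected pool to reboot, so plan the change accordingly.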
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="prerequisites_{context}"]
7+
== Prerequisites
8+
9+
* If your hardware vendor provides drivers, you installed them on the nodes where you want to create mediated devices.
10+
** If you use NVIDIA cards, you link:https://access.redhat.com/solutions/6738411[installed the NVIDIA GRID driver].
Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,64 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
4+
5+
:_content-type: REFERENCE
6+
[id="configuration-overview_{context}"]
7+
= Configuration overview
8+
9+
When configuring mediated devices, an administrator must complete the following tasks:
10+
11+
* Create the mediated devices.
12+
* Expose the mediated devices to the cluster.
13+
14+
The `HyperConverged` CR includes APIs that accomplish both tasks.
15+
16+
.Creating mediated devices
17+
18+
[source,yaml]
19+
----
20+
...
21+
spec:
22+
mediatedDevicesConfiguration:
23+
mediatedDevicesTypes: <1>
24+
- <device_type>
25+
nodeMediatedDeviceTypes: <2>
26+
- mediatedDevicesTypes: <3>
27+
- <device_type>
28+
nodeSelector: <4>
29+
<node_selector_key>: <node_selector_value>
30+
...
31+
----
32+
<1> Required: Configures global settings for the cluster.
33+
<2> Optional: Overrides the global configuration for a specific node or group of nodes. Must be used with the global `mediatedDevicesTypes` configuration.
34+
<3> Required if you use `nodeMediatedDeviceTypes`. Overrides the global `mediatedDevicesTypes` configuration for the specified nodes.
35+
<4> Required if you use `nodeMediatedDeviceTypes`. Must include a `key:value` pair.
36+
37+
.Exposing mediated devices to the cluster
38+
39+
[source,yaml]
40+
----
41+
...
42+
permittedHostDevices:
43+
mediatedDevices:
44+
- mdevNameSelector: GRID T4-2Q <1>
45+
resourceName: nvidia.com/GRID_T4-2Q <2>
46+
...
47+
----
48+
<1> Exposes the mediated devices that map to this value on the host.
49+
+
50+
[NOTE]
51+
====
52+
You can see the mediated device types that your device supports by viewing the contents of `/sys/bus/pci/devices/<domain>:<bus>:<slot>.<function>/mdev_supported_types/<type>/name`, substituting the correct values for your system.
53+
54+
For example, the name file for the `nvidia-231` type contains the selector string `GRID T4-2Q`. Using `GRID T4-2Q` as the `mdevNameSelector` value allows nodes to use the `nvidia-231` type.
55+
====
56+
<2> The `resourceName` value must match the resource name that is allocated on the node. Find the `resourceName` by using the following command:
57+
+
58+
[source,terminal]
59+
----
60+
$ oc get $NODE -o json \
61+
| jq '.status.allocatable |
62+
with_entries(select(.key | startswith("nvidia.com/"))) |
63+
with_entries(select(.value != "0"))'
64+
----
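
To find a value for `mdevNameSelector`, you can read the `name` files that the note above describes. The following sketch, meant to run on the host, lists every mediated device type together with its selector string. The function name `list_mdev_types` is hypothetical, and the optional argument exists only so that the function can be pointed at a different directory tree.

```shell
# Hypothetical helper: prints "<PCI address> <mdev type> <selector string>"
# (tab-separated) for every mediated device type found under the given root.
# $1: directory to scan (defaults to /sys/bus/pci/devices)
# The body runs in a subshell so that its variables do not leak.
list_mdev_types() (
  root=${1:-/sys/bus/pci/devices}
  for name_file in "$root"/*/mdev_supported_types/*/name; do
    # Skip the literal pattern when the glob matches nothing.
    [ -r "$name_file" ] || continue
    type_dir=${name_file%/name}
    dev_dir=${type_dir%/mdev_supported_types/*}
    printf '%s\t%s\t%s\n' "${dev_dir##*/}" "${type_dir##*/}" "$(cat "$name_file")"
  done
)

list_mdev_types
```

On a host where the `nvidia-231` type is available, one of the output lines would end with `GRID T4-2Q`, which is the string to use as the `mdevNameSelector` value.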
