
Commit 34a87a8

TELCODOCS-776: First draft
1 parent 5b65ae8 commit 34a87a8

13 files changed: +284 −6 lines changed

_topic_maps/_topic_map.yml

Lines changed: 3 additions & 0 deletions
@@ -87,6 +87,9 @@ Topics:
 - Name: Control plane architecture
   File: control-plane
   Distros: openshift-enterprise,openshift-origin,openshift-online
+- Name: NVIDIA GPU architecture overview
+  File: nvidia-gpu-architecture-overview
+  Distros: openshift-enterprise
 - Name: Understanding OpenShift development
   File: understanding-development
   Distros: openshift-enterprise
architecture/nvidia-gpu-architecture-overview.adoc

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
:_content-type: ASSEMBLY
[id="nvidia-gpu-architecture-overview"]
= NVIDIA GPU architecture overview
include::_attributes/common-attributes.adoc[]
:context: nvidia-gpu-architecture-overview

toc::[]

NVIDIA supports the use of graphics processing unit (GPU) resources on {product-title}. {product-title} is a security-focused and hardened Kubernetes platform developed and supported by Red Hat for deploying and managing Kubernetes clusters at scale. {product-title} includes enhancements to Kubernetes so that users can easily configure and use NVIDIA GPU resources to accelerate workloads.

The NVIDIA GPU Operator uses the Operator framework within {product-title} to manage the full lifecycle of NVIDIA software components required to run GPU-accelerated workloads.

These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, automatic node labeling using GPU Feature Discovery (GFD), DCGM-based monitoring, and others.

[NOTE]
====
The NVIDIA GPU Operator is supported only by NVIDIA. For more information about obtaining support from NVIDIA, see link:https://access.redhat.com/solutions/5174941[Obtaining Support from NVIDIA].
====
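
After you install the NVIDIA GPU Operator, you create a single `ClusterPolicy` custom resource (CR) that drives the deployment of these components. The following is a minimal sketch, not a complete configuration; in practice you start from the default CR that the Operator provides, and the field values shown are illustrative:

[source,yaml]
----
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true        # NVIDIA drivers (CUDA enablement)
  toolkit:
    enabled: true        # NVIDIA Container Toolkit
  devicePlugin:
    enabled: true        # Kubernetes device plugin for GPUs
  gfd:
    enabled: true        # GPU Feature Discovery node labels
  dcgmExporter:
    enabled: true        # DCGM-based monitoring
----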

include::modules/nvidia-gpu-prerequisites.adoc[leveloffset=+1]
// New enablement modules
include::modules/nvidia-gpu-enablement.adoc[leveloffset=+1]

include::modules/nvidia-gpu-bare-metal.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* link:https://docs.nvidia.com/ai-enterprise/deployment-guide-openshift-on-bare-metal/0.1.0/on-bare-metal.html[Red Hat OpenShift on Bare Metal Stack]

include::modules/nvidia-gpu-virtualization.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/openshift-virtualization.html[NVIDIA GPU Operator with OpenShift Virtualization]

include::modules/nvidia-gpu-vsphere.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/nvaie-with-ocp.html#openshift-container-platform-on-vmware-vsphere-with-nvidia-vgpus[OpenShift Container Platform on VMware vSphere with NVIDIA vGPUs]

include::modules/nvidia-gpu-kvm.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* link:https://computingforgeeks.com/how-to-deploy-openshift-container-platform-on-kvm/[How To Deploy OpenShift Container Platform 4.13 on KVM]

include::modules/nvidia-gpu-csps.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* link:https://docs.nvidia.com/ai-enterprise/deployment-guide-cloud/0.1.0/aws-redhat-openshift.html[Red Hat OpenShift in the Cloud]

include::modules/nvidia-gpu-red-hat-device-edge.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* link:https://cloud.redhat.com/blog/how-to-accelerate-workloads-with-nvidia-gpus-on-red-hat-device-edge[How to accelerate workloads with NVIDIA GPUs on Red Hat Device Edge]

include::modules/nvidia-gpu-features.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* link:https://docs.nvidia.com/ngc/ngc-deploy-on-premises/nvidia-certified-systems/index.html[NVIDIA-Certified Systems]
* link:https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/html/monitoring/nvidia-gpu-admin-dashboard#doc-wrapper[The NVIDIA GPU administration dashboard]
* link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/nvaie-with-ocp.html[NVIDIA AI Enterprise with OpenShift]
* link:https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html#[NVIDIA Container Toolkit]
* link:https://developer.nvidia.com/dcgm[NVIDIA DCGM]
* link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/mig-ocp.html[MIG Support in OpenShift Container Platform]
* link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/time-slicing-gpus-in-openshift.html[Time-slicing NVIDIA GPUs in OpenShift]
* link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/mirror-gpu-ocp-disconnected.html[Deploy GPU Operators in a disconnected or airgapped environment]
* link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-nfd.html[Installing the Node Feature Discovery (NFD) Operator]
* link:https://docs.openshift.com/container-platform/4.13/hardware_enablement/psap-node-feature-discovery-operator.html#installing-the-node-feature-discovery-operator_node-feature-discovery-operator[{product-title} Installing the Node Feature Discovery Operator]
349_OpenShift_NVIDIA_GPU_arch_0723.png (binary image file added, 116 KB)

modules/about-using-gpu-operator.adoc

Lines changed: 4 additions & 6 deletions
@@ -2,22 +2,20 @@
 //
 // * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
 
+
 :_content-type: CONCEPT
 [id="about-using-nvidia-gpu_{context}"]
 = About using the NVIDIA GPU Operator
 
-The NVIDIA GPU Operator manages NVIDIA GPU resources in an {product-title} cluster and automates tasks related to bootstrapping GPU nodes.
-Since the GPU is a special resource in the cluster, you must install some components before deploying application workloads onto the GPU.
-These components include the NVIDIA drivers which enables compute unified device architecture (CUDA), Kubernetes device plugin, container runtime and others such as automatic node labelling, monitoring and more.
+The NVIDIA GPU Operator manages NVIDIA GPU resources in a {product-title} cluster and automates tasks related to bootstrapping GPU nodes. Because the GPU is a special resource in the cluster, you must install some components before you can deploy application workloads to the GPU. These components include the NVIDIA drivers that enable the compute unified device architecture (CUDA), the Kubernetes device plugin, the container runtime, and other features such as automatic node labeling, monitoring, and more.
+
 [NOTE]
 ====
 The NVIDIA GPU Operator is supported only by NVIDIA. For more information about obtaining support from NVIDIA, see link:https://access.redhat.com/solutions/5174941[Obtaining Support from NVIDIA].
 ====
 
 There are two ways to enable GPUs with {product-title} {VirtProductName}: the {product-title}-native method described here, or by using the NVIDIA GPU Operator.
 
-The NVIDIA GPU Operator is a Kubernetes Operator that enables {product-title} {VirtProductName} to expose GPUs to virtualized workloads running on {product-title}.
-It allows users to easily provision and manage GPU-enabled virtual machines, providing them with the ability to run complex artificial intelligence/machine learning (AI/ML) workloads on the same platform as their other workloads.
-It also provides an easy way to scale the GPU capacity of their infrastructure, allowing for rapid growth of GPU-based workloads.
+The NVIDIA GPU Operator is a Kubernetes Operator that uses {product-title} {VirtProductName} to provision GPUs for virtualized workloads running on {product-title}. With the Operator, you can easily provision and manage GPU-enabled virtual machines to run complex artificial intelligence/machine learning (AI/ML) workloads on the same platform as your other workloads. The Operator also provides an easy way to scale the GPU capacity of your infrastructure, enabling rapid growth of GPU-based workloads.
 
 For more information about using the NVIDIA GPU Operator to provision worker nodes for running GPU-accelerated VMs, see link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/openshift-virtualization.html[NVIDIA GPU Operator with OpenShift Virtualization].

modules/nvidia-gpu-bare-metal.adoc

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
// Module included in the following assemblies:
//
// * architecture/nvidia-gpu-architecture-overview.adoc

:_content-type: CONCEPT
[id="nvidia-gpu-bare-metal_{context}"]
= GPUs and bare metal

You can deploy {product-title} on an NVIDIA-certified bare metal server, but with some limitations:

* Control plane nodes can be CPU nodes.

* Worker nodes must be GPU nodes, provided that AI/ML workloads are executed on these worker nodes.
+
In addition, the worker nodes can host one or more GPUs, but the GPUs must be of the same type. For example, a node can have two NVIDIA A100 GPUs, but a node with one A100 GPU and one T4 GPU is not supported. The NVIDIA Device Plugin for Kubernetes does not support mixing different GPU models on the same node.

* When you use {product-title}, one server or three or more servers are required. Clusters with two servers are not supported. The single-server deployment is called single-node OpenShift (SNO), and using this configuration results in a non-high-availability {product-title} environment.

You can choose one of the following methods to access the containerized GPUs:

* GPU passthrough
* Multi-Instance GPU (MIG)
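
In both cases, a containerized workload consumes GPUs through the `nvidia.com/gpu` extended resource that the device plugin advertises. The following is a minimal smoke-test sketch; the pod name and CUDA image tag are illustrative:

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test               # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8  # example CUDA base image
    command: ["nvidia-smi"]           # prints the GPU visible inside the container
    resources:
      limits:
        nvidia.com/gpu: 1             # one GPU (or one MIG instance under the single MIG strategy)
----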

modules/nvidia-gpu-csps.adoc

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
// Module included in the following assemblies:
//
// * architecture/nvidia-gpu-architecture-overview.adoc

:_content-type: CONCEPT
[id="nvidia-gpu-csps_{context}"]
= GPUs and CSPs

You can deploy {product-title} to one of the major cloud service providers (CSPs): Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure.

Two modes of operation are available: a fully managed deployment and a self-managed deployment.

* In a fully managed deployment, everything is automated by Red Hat in collaboration with the CSP. You can request an OpenShift instance through the CSP web console, and the cluster is automatically created and fully managed by Red Hat. You do not have to worry about node failures or errors in the environment. Red Hat is fully responsible for maintaining the uptime of the cluster. The fully managed services are available on AWS and Azure. For AWS, the OpenShift service is called ROSA (Red Hat OpenShift Service on AWS). For Azure, the service is called Azure Red Hat OpenShift.

* In a self-managed deployment, you are responsible for instantiating and maintaining the OpenShift cluster. Red Hat provides the `openshift-install` utility to support the deployment of the OpenShift cluster in this case. The self-managed services are available globally to all CSPs.

It is important that the compute instances that host the worker nodes are GPU-accelerated compute instances and that the GPU type matches the list of supported GPUs from NVIDIA AI Enterprise. For example, the T4, V100, and A100 are part of this list.
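
For example, on AWS you can scale GPU capacity with a compute machine set that requests a GPU-accelerated instance type. The following is a partial sketch; the machine set name is illustrative and most required fields are omitted for brevity:

[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: gpu-worker-us-east-1a         # illustrative name
  namespace: openshift-machine-api
spec:
  template:
    spec:
      providerSpec:
        value:
          # g4dn.xlarge provides one NVIDIA T4 GPU, which is on the
          # NVIDIA AI Enterprise supported list.
          instanceType: g4dn.xlarge
----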

You can choose one of the following methods to access the containerized GPUs:

* GPU passthrough to access and use GPU hardware within a virtual machine (VM).

* GPU (vGPU) time-slicing when the entire GPU is not required.

modules/nvidia-gpu-enablement.adoc

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
// Module included in the following assemblies:
//
// * architecture/nvidia-gpu-architecture-overview.adoc

:_content-type: CONCEPT
[id="nvidia-gpu-enablement_{context}"]
= NVIDIA GPU enablement

The following diagram shows how the GPU architecture is enabled for OpenShift:

.NVIDIA GPU enablement
image::349_OpenShift_NVIDIA_GPU_arch_0723.png[NVIDIA GPU enablement]

modules/nvidia-gpu-features.adoc

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
// Module included in the following assemblies:
//
// * architecture/nvidia-gpu-architecture-overview.adoc


:_content-type: CONCEPT
[id="nvidia-gpu-features_{context}"]
= NVIDIA GPU features for {product-title}

// NVIDIA GPU Operator::
// The NVIDIA GPU Operator is a Kubernetes Operator that enables {product-title} {VirtProductName} to expose GPUs to virtualized workloads running on {product-title}.
// It allows users to easily provision and manage GPU-enabled virtual machines, providing them with the ability to run complex artificial intelligence/machine learning (AI/ML) workloads on the same platform as their other workloads.
// It also provides an easy way to scale the GPU capacity of their infrastructure, allowing for rapid growth of GPU-based workloads.

NVIDIA Container Toolkit::
The NVIDIA Container Toolkit enables you to create and run GPU-accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to use NVIDIA GPUs.

NVIDIA AI Enterprise::
NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software that is optimized, certified, and supported on NVIDIA-Certified systems.
+
NVIDIA AI Enterprise includes support for Red Hat {product-title}. The following installation methods are supported:
+
* {product-title} on bare metal or VMware vSphere with GPU passthrough.

* {product-title} on VMware vSphere with NVIDIA vGPU.


Multi-Instance GPU (MIG) Support in {product-title}::
MIG is useful whenever you have an application that does not require the full power of an entire GPU. The MIG feature of the NVIDIA Ampere architecture enables you to split the hardware resources of a GPU into multiple GPU instances, each of which is available to the operating system as an independent CUDA-enabled GPU. The NVIDIA GPU Operator version 1.7.0 and higher provides MIG support for the A100 and A30 Ampere cards. These GPU instances are designed to support multiple independent CUDA applications (up to seven) that operate in complete isolation from each other, with dedicated hardware resources.
+
The compute units of the GPU, in addition to its memory, can be split into multiple MIG instances. Each of these instances represents a standalone GPU device from a system perspective, and can be connected to any application, container, or virtual machine running on the node.
+
From the perspective of the software that uses the GPU, each of these MIG instances looks like its own individual GPU.
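+
For example, to apply a MIG geometry you can label a GPU node and let the MIG manager that the GPU Operator deploys reconfigure the card. The following is a minimal sketch; the node name is illustrative:
+
[source,yaml]
----
apiVersion: v1
kind: Node
metadata:
  name: gpu-worker-0                  # illustrative node name
  labels:
    # The MIG manager watches this label and partitions an A100 into
    # seven 1g.5gb instances, each exposed as a schedulable GPU.
    nvidia.com/mig.config: all-1g.5gb
----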

Time-slicing NVIDIA GPUs in OpenShift::
GPU time-slicing enables workloads scheduled on overloaded GPUs to be interleaved.
+
This mechanism for enabling time-slicing of GPUs in Kubernetes enables a system administrator to define a set of replicas for a GPU, each of which can be independently distributed to a pod to run workloads on. Unlike Multi-Instance GPU (MIG), there is no memory or fault isolation between replicas, but for some workloads this is better than not sharing at all. Internally, GPU time-slicing is used to multiplex workloads from replicas of the same underlying GPU.
+
You can apply a cluster-wide default configuration for time slicing. You can also apply node-specific configurations. For example, you can apply a time-slicing configuration only to nodes with Tesla T4 GPUs and not modify nodes with other GPU models.
+
You can combine these two approaches by applying a cluster-wide default configuration and then labeling nodes so that those nodes receive a node-specific configuration.
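+
The time-slicing configuration is supplied as a config map that the device plugin consumes. The following is a minimal sketch, assuming the GPU Operator runs in the `nvidia-gpu-operator` namespace; the config map name and the `Tesla-T4` key are illustrative:
+
[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config          # illustrative name
  namespace: nvidia-gpu-operator
data:
  Tesla-T4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4                 # advertise four schedulable replicas per physical GPU
----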

GPU Feature Discovery::
NVIDIA GPU Feature Discovery for Kubernetes is a software component that enables you to automatically generate labels for the GPUs available on a node. GPU Feature Discovery uses Node Feature Discovery (NFD) to perform this labeling.
+
The NFD Operator manages the discovery of hardware features and configurations in an {product-title} cluster by labeling nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, operating system version, and so on.
+
You can find the NFD Operator in the Operator Hub by searching for "Node Feature Discovery".
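+
For example, after GPU Feature Discovery runs, a GPU node carries labels similar to the following sketch; the values depend on the actual hardware:
+
[source,yaml]
----
# Excerpt of node labels generated by GPU Feature Discovery (values are illustrative)
nvidia.com/gpu.product: Tesla-T4
nvidia.com/gpu.count: "1"
nvidia.com/gpu.memory: "15360"
nvidia.com/cuda.driver.major: "535"
----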

NVIDIA GPU Operator with OpenShift Virtualization::
Until now, the GPU Operator provisioned worker nodes only to run GPU-accelerated containers. Now, you can also use the GPU Operator to provision worker nodes for running GPU-accelerated virtual machines (VMs).
+
You can configure the GPU Operator to deploy different software components to worker nodes depending on which GPU workload is configured to run on those nodes.

GPU Operator dashboard::
You can install a console plugin to display GPU usage information on the cluster utilization screen in the {product-title} web console. GPU utilization information includes the number of available GPUs, power consumption (in watts) for each GPU, and the percentage of the GPU workload that is used for video encoding and decoding.

modules/nvidia-gpu-kvm.adoc

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
// Module included in the following assemblies:
//
// * architecture/nvidia-gpu-architecture-overview.adoc

:_content-type: CONCEPT
[id="nvidia-gpu-kvm_{context}"]
= GPUs and Red Hat KVM

You can use {product-title} on an NVIDIA-certified kernel-based virtual machine (KVM) server.

Similar to bare-metal deployments, one server or three or more servers are required. Clusters with two servers are not supported.

However, unlike bare-metal deployments, you can use different types of GPUs in the server. This is because you can assign the GPUs to different VMs that act as Kubernetes nodes. The only restriction is that each Kubernetes node must host GPUs of the same type.

You can choose one of the following methods to access the containerized GPUs:

* GPU passthrough for accessing and using GPU hardware within a virtual machine (VM)

* GPU (vGPU) time-slicing when not all of the GPU is needed

To enable the vGPU capability, you must install a special driver at the host level. This driver is delivered as an RPM package. The host driver is not required for GPU passthrough allocation.

modules/nvidia-gpu-prerequisites.adoc

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
// Module included in the following assemblies:
//
// * architecture/nvidia-gpu-architecture-overview.adoc


:_content-type: CONCEPT
[id="nvidia-gpu-prerequisites_{context}"]
= NVIDIA GPU prerequisites

* A working OpenShift cluster with at least one GPU worker node.

* Access to the OpenShift cluster as a `cluster-admin` to perform the required steps.

* The OpenShift CLI (`oc`) is installed.

* The Node Feature Discovery (NFD) Operator is installed, and a `nodefeaturediscovery` instance is created. A minimal sketch of such an instance follows this list.
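
A minimal sketch of a `NodeFeatureDiscovery` instance, assuming the NFD Operator is installed in its default `openshift-nfd` namespace; the instance name and operand image tag are illustrative:

[source,yaml]
----
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance                  # illustrative name
  namespace: openshift-nfd
spec:
  operand:
    # Operand image shipped with the NFD Operator; pin to your cluster version.
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:latest
----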
