# Future GPU support in Gardener

## Executive Summary (AI generated, human-edited)

This document outlines the current state and future plans for GPU support in Gardener, focusing primarily on NVIDIA GPUs.

### Current State

- GPU support is possible but requires significant manual effort
- Requires three DaemonSets to be configured and deployed:
  1. NVIDIA driver installer
  2. GKE Device Plugin
  3. DCGM Exporter

- Current implementation has some limitations:
  - Requires building images for each Garden Linux & NVIDIA driver version combination
  - Relies on Google's GKE Device Plugin
  - Requires manual node labeling
  - Limited to basic GPU features

- The above effort and limitations can be avoided by enabling Gardener to use the NVIDIA GPU Operator, but this requires some development work. This work would also enable the use of AMD and Intel GPUs in future.

### Future Roadmap

#### Step 1: Add Garden Linux Support to NVIDIA GPU Operator
- Add Garden Linux support to NVIDIA GPU Driver Container
- Implement NVIDIA Container Toolkit installation for Garden Linux
- Integrate Garden Linux support into NVIDIA GPU Operator

#### Step 2: S3 Storage Support
- Add S3 bucket support for storing pre-built kernel modules
- Enable sharing compiled modules across clusters
- Reduce compilation overhead

#### Step 3: NFS PV Storage Integration
- Implement NFS-based Persistent Volume storage
- Enable first-time compilation with cached results for subsequent nodes
- Leverage the hyperscaler's NFS CSI driver

#### Step 4: Gardener UI and Shoot Specification Enhancement
- Add a GPU support checkbox in the Gardener UI
- Automate deployment of the NVIDIA GPU Operator (with the Node Feature Discovery operator)
- Enable NVIDIA Container Runtime options for worker pools
- Implement GPU configuration through `shoot.yaml` specifications

#### Step 5: Multi-vendor Support (two possible approaches)
1. Extend the NVIDIA GPU Operator
   - Add support for AMD & Intel GPUs
   - Align with projects like HAMi
2. Extend the Gardener GPU Extension
   - Support multiple vendor operators
   - Enable configuration for different GPU vendors through custom resources

The ultimate goal is to simplify GPU deployment in Gardener clusters while providing flexible options for different GPU vendors and use cases.

## Introduction: What do we want?

We want easy-to-consume support for using GPUs in a Gardener cluster, beginning with NVIDIA GPUs.

Using GPUs in Gardener is possible right now, but involves a lot of work. What we want is to create a worker pool of GPU nodes, and to be able to then schedule GPU-using Pods to those nodes with as little effort as possible.

In a perfect world a user would create a worker pool of GPU instances, and everything "just works".

In an almost-perfect world a user would select an NVIDIA GPU option as "Additional OCI Runtime" (dropdown) in the Gardener UI / `containerRuntime` in the shoot spec.

## How we do it now

We deploy DaemonSets for the following 3 features:

- NVIDIA driver installer

  - Installs the Linux kernel module that creates /dev/nvidia\* devices along with NVIDIA-related /bin and /lib folders in the host OS filesystem.

  - For A100, H100 and similar GPUs, runs the NVIDIA Fabric Manager, which enables inter-GPU communication.

- GKE Device Plugin

  - This image (used by GKE) makes Kubernetes aware of the GPU devices on the node, and takes care of inserting into GPU-using Pods the /dev, /bin and /lib files from the NVIDIA driver installer.

- DCGM Exporter

  - Exposes Prometheus-compatible metrics for the GPUs on each node.

The GKE Device Plugin and DCGM Exporter use images created by Google and NVIDIA, respectively.

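For context, once the device plugin is running, a workload consumes a GPU by requesting the extended resource that the plugin advertises (`nvidia.com/gpu`). A minimal sketch - the Pod name and image tag are illustrative, not part of our setup:

```yaml
# Illustrative GPU smoke-test Pod; the image tag is only an example.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # extended resource exposed by the device plugin
```
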
The NVIDIA Driver Installer uses images that are built by AI Core. [These are technically from the Garden Linux team](https://github.com/gardenlinux/gardenlinux-nvidia-installer), but in reality are 95% maintained by AI Core.

For each version of Garden Linux and each version of the NVIDIA driver we want to support, we have to build an image. This image contains the specified NVIDIA driver compiled for the kernel of that Garden Linux version.

The [Garden Linux repo](https://github.com/gardenlinux/gardenlinux-nvidia-installer) mentioned above tells a user how to build an image for a given driver & kernel version. AI Core embeds this repo into [its own build process](https://github.wdf.sap.corp/ICN-ML/aicore/blob/main/system-services/nvidia-installer/component.yaml#L18-L45) in order to generate the set of images required to support AI Core, which are hosted in an AI Core registry for use only by AI Core.

### Pros & Cons

#### Pro: It works

For AI Core at least, the current way works fairly well. Every few months we update the AI Core build list to add support for a new version of Garden Linux or a new NVIDIA driver, and then configure AI Core to use the new versions. This is achieved with a few lines of configuration in our config-as-code repos and takes just a day or two (build, deploy, test, etc.).

#### Con: It requires building an image for every version combination of Garden Linux & NVIDIA driver

For other users of Gardener, all they see is the Garden Linux repo. This is fine for doing a proof-of-concept to build a driver for a cluster with a given version, but day-2 operations require the user to create a build pipeline and deployment system parallel to the one used by AI Core in order to have the images required for future versions of Garden Linux & NVIDIA driver. (AI Core's build/deploy system is not easily usable outside of the AI Core context.) Because such images contain proprietary NVIDIA code (the driver is not open source), it is legally difficult to put such images into a publicly-accessible registry for use by all.

#### Con: It's not ideal that we use the GKE Device Plugin

The [GKE Device Plugin](https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/cmd/nvidia_gpu/README.md) works well, but is used by only one other organisation (Google) and we do not have explicit permission to use it - although [it is Apache open-source licensed](https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/LICENSE), so the risk is low. Nevertheless, we are tying ourselves to a specific vendor other than NVIDIA.

#### Con: It requires Gardener users to label GPU nodes

Because the NVIDIA driver installer image is specific to each Garden Linux version, each GPU node requires a label identifying this version, for example **os-version: 1592.4.0**. Gardener does not take care of adding such labels, so this becomes a chore for the operations team.

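To illustrate the coupling, each version-specific installer DaemonSet has to pin itself to matching nodes via a node selector. A hedged sketch - the label key mirrors the example above, while the DaemonSet name, registry and image tag are hypothetical:

```yaml
# Illustrative fragment of a version-specific driver installer DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer-1592-4-0   # hypothetical name
spec:
  selector:
    matchLabels:
      app: nvidia-driver-installer
  template:
    metadata:
      labels:
        app: nvidia-driver-installer
    spec:
      nodeSelector:
        os-version: "1592.4.0"   # only nodes labeled with the matching Garden Linux version
      containers:
        - name: installer
          image: registry.example.com/nvidia-installer:1592.4.0-535   # hypothetical image tag
          securityContext:
            privileged: true   # needs host access to load kernel modules
```
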
Note: This can be automated by deploying the [Node Feature Discovery](https://kubernetes-sigs.github.io/node-feature-discovery/v0.17/get-started/index.html) operator and creating the following rule:

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: gardenlinux-version
spec:
  rules:
    - name: "Garden Linux version"
      labels:
        "node.gardener.cloud/gardenlinux-version": "@system.osrelease.GARDENLINUX_VERSION"
      matchFeatures:
        - feature: system.osrelease
          matchExpressions:
            GARDENLINUX_VERSION: {op: Exists}
```

This rule will result in a label similar to this: **`node.gardener.cloud/gardenlinux-version: '1592.9'`**

## NVIDIA GPU Operator

### Pros & Cons

#### Pro: It is the official method supported by NVIDIA and does everything for you

The [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) takes care of installing the following:

- NVIDIA GPU Driver

- NVIDIA Container Toolkit (a collection of libraries and utilities enabling users to build and run GPU-accelerated containers)

- NVIDIA Device Plugin

- DCGM Exporter

- vGPU Manager

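For reference, when the operator is installed via its Helm chart, each of these components is toggled through chart values. A minimal values sketch - the key names follow the upstream `gpu-operator` chart at the time of writing, so verify them against the chart version actually deployed:

```yaml
# Illustrative Helm values for the NVIDIA GPU Operator chart;
# verify key names against the deployed chart version.
driver:
  enabled: true        # install the GPU driver DaemonSet
toolkit:
  enabled: true        # install the NVIDIA Container Toolkit
devicePlugin:
  enabled: true        # advertise nvidia.com/gpu to the scheduler
dcgmExporter:
  enabled: true        # expose Prometheus-compatible GPU metrics
```
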
#### Pro: Enables advanced GPU features

Features such as Multi-Instance GPU (MIG), vGPU, GPU time-slicing and GPUDirect RDMA are supported by the NVIDIA GPU Operator. These features are not supported by the current AI Core implementation.

#### Con: The driver installer by default downloads and compiles at runtime

The default configuration runs a container image that downloads and installs OS packages and then downloads, compiles and installs the NVIDIA driver kernel modules - this is all done by the DaemonSet's Pod when it starts on the GPU node. We used to do something similar for AI Core, but found the approach to be somewhat fragile, as well as adding a significant amount of time to the node startup phase.

It is possible to tell the operator to use "precompiled" images instead, which results in a similar approach to how AI Core installs the NVIDIA driver. Of course, a build pipeline must be set up to create these images.

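For illustration, recent operator releases expose this choice in the driver configuration. A hedged sketch - the field names follow NVIDIA's precompiled-driver documentation and should be verified against the operator version in use; the registry is hypothetical:

```yaml
# Sketch of an NVIDIADriver resource requesting precompiled driver images.
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: precompiled
spec:
  usePrecompiled: true                      # pull a prebuilt kernel-module image instead of compiling on the node
  version: "550"                            # driver branch; the image tag also encodes the kernel version
  repository: registry.example.com/nvidia   # hypothetical registry holding the prebuilt images
  image: driver
```
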
Both types of image (download & compile; precompiled) are built from the [NVIDIA GPU Driver Container](https://github.com/NVIDIA/gpu-driver-container) repo. The root of this repo contains folders for various operating systems.

Only Ubuntu 22.04 and Ubuntu 24.04 are officially supported for precompiled images, although the repo also contains the required files and instructions to build precompiled images for RHEL 8 and RHEL 9.

#### Con: Garden Linux is not a supported platform

The NVIDIA GPU Operator supports only Ubuntu and Red Hat operating systems. In principle, support for Garden Linux could be added reasonably easily - however, NVIDIA might not accept PRs for Garden Linux support, and therefore we might need to use and maintain a fork.

#### Con: NVIDIA Container Runtime requires host OS configuration

See [Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-containerd-for-kubernetes).

The NVIDIA Container Toolkit requires a functioning package manager on the host OS, but the Garden Linux read-only filesystem prevents new packages from being installed. This is probably the biggest barrier to getting things working.

## Roadmap for the future

### Step 1 - Add Garden Linux support to the NVIDIA GPU Operator

There are several sub-steps here:

1. Add support for Garden Linux in the [NVIDIA GPU Driver Container](https://github.com/NVIDIA/gpu-driver-container) repo

   It should be possible to use the Ubuntu examples from the [NVIDIA GPU Driver Container](https://github.com/NVIDIA/gpu-driver-container) repo in combination with the existing Garden Linux scripts to synthesise Garden Linux support for both default and precompiled images.

2. Figure out how to install the NVIDIA Container Toolkit/Runtime on Garden Linux

   The toolkit itself is [open-source on GitHub](https://github.com/NVIDIA/nvidia-container-toolkit), so we might be able to figure out an alternative way to install it. In the worst case we would need to build a specific Garden Linux image to support NVIDIA GPUs.

3. Add support for Garden Linux in the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator)

   Not a great deal needs to be done here - mostly adding a few lines of
   config, a few lines of code, and a few tests. The GPU Operator is mostly concerned with deploying the results of the previous sub-steps.

### Step 2 - Add support for S3 storage to the NVIDIA GPU Operator

The project that served as the basis for the Garden Linux NVIDIA installer is [squat/modulus](https://github.com/squat/modulus), which was designed to do something very similar for Flatcar Linux / CoreOS. This project supports having an S3 bucket, such that kernel modules are still downloaded & compiled at runtime, but only once - the resulting files are stored in the S3 bucket and the installer checks this bucket for pre-built kernel modules. This has the advantages of the
default GPU Operator behaviour (no need to build a container image for each kernel & driver version) along with the advantages of the precompiled images approach (no need to download & compile for every node in the cluster). All of a user's clusters could share the same S3 bucket, such that the initial compilation is done in a preproduction cluster and then production clusters would always have access to prebuilt kernel modules.

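As a rough illustration of how this might eventually be configured - every cache-related field below is hypothetical, since no such API exists in the GPU Operator or the driver container today:

```yaml
# Hypothetical sketch only - the GPU Operator currently has no S3-backed module cache.
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: s3-cached
spec:
  version: "550"
  kernelModuleCache:                  # hypothetical field
    type: s3
    bucket: my-org-gpu-module-cache   # hypothetical bucket shared by all clusters
    region: eu-central-1
    secretRef:
      name: s3-cache-credentials      # Secret holding the bucket credentials
```
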
### Step 3 - Add support for NFS PV storage to the NVIDIA GPU Operator

The previous option is almost ideal, but requires the user to set up an S3 bucket and configure the operator to use it. Another option is for the operator to use an NFS-based PV in which compiled modules are stored (the hyperscaler's NFS CSI driver would be deployed and take care of PV provisioning, or alternatively Gardener could take care of setting up an NFS volume on the hyperscaler). This would mean that the first node using a particular kernel/driver combination would trigger module download & compilation, but all future nodes could just get the required files from the PV. This would deliver exactly the required user experience, subject to Gardener deploying the required components in response to the user enabling GPU functionality in the cluster (see the next step).

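A hedged sketch of the kind of shared volume this would rely on - the storage class name is illustrative and depends on the hyperscaler's file/NFS CSI driver (for example EFS on AWS or Azure Files on Azure):

```yaml
# Illustrative shared module cache; the storage class depends on the
# NFS/file CSI driver deployed in the cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvidia-module-cache
  namespace: gpu-operator
spec:
  accessModes:
    - ReadWriteMany          # all GPU nodes read the same prebuilt modules
  storageClassName: nfs-csi  # illustrative class name
  resources:
    requests:
      storage: 10Gi
```
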
### Step 4 - Enable GPU support in the Gardener UI and Shoot specification

Up until this point GPU support is made easier, but is still not automatic - the user needs to take care of configuring and deploying the GPU Operator and the `NodeFeatureRule` that enables the node label for the Garden Linux version. The next step is to add a checkbox to the Gardener UI to enable GPU support in a cluster. This would automatically deploy the NVIDIA GPU Operator and the Node Feature Discovery operator (and the associated rule to label nodes with the Garden Linux version) and would enable the NVIDIA Container Runtime as an option for worker pools.

With the NVIDIA GPU Operator deployed to the cluster, its configuration would be maintained by editing the deployed `NVIDIADriver` custom resource - see [here](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-configuration.html#about-the-nvidia-driver-custom-resource) for details. This custom resource would be embedded in the Gardener `shoot.yaml` in a `spec.extensions.providerConfig` for an extension of type `gpu-support` (to be developed). One or more such CRs could be included in the `providerConfig`; each `NVIDIADriver` CR can contain a `nodeSelector` and a `version`, and this would allow different driver versions to be deployed to different worker pools based on a node label.

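A rough sketch of what such a shoot fragment might look like - the `gpu-support` extension does not exist yet, so the wrapper API group, kind and field layout are hypothetical; only the embedded `NVIDIADriver` fields follow the upstream CRD:

```yaml
# Hypothetical shoot.yaml fragment - the gpu-support extension is yet to be developed.
spec:
  extensions:
    - type: gpu-support
      providerConfig:
        apiVersion: gpu.extensions.gardener.cloud/v1alpha1   # hypothetical API group
        kind: GPUConfig                                      # hypothetical kind
        drivers:
          - apiVersion: nvidia.com/v1alpha1
            kind: NVIDIADriver
            metadata:
              name: train-pool-driver
            spec:
              version: "550"                             # driver branch for this worker pool
              nodeSelector:
                worker.gardener.cloud/pool: gpu-train    # illustrative pool label
```
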
### Step 5a - Extend the NVIDIA GPU Operator to support AMD & Intel GPUs

NVIDIA, Intel and AMD are all currently maintaining operators for supporting their GPUs on Kubernetes. This is not a competitive differentiator for any of them. Bringing them together would reduce the overhead for all, and improve the user experience. Something similar is already happening with [Project HAMi](https://github.com/Project-HAMi/HAMi), which supports the GPUs of multiple vendors. With that said, such unification may be unlikely due to political/marketing reasons.

### Step 5b - Extend the Gardener GPU extension to support multiple vendors

An alternative to Step 5a above is for the Gardener GPU extension to support operators from multiple vendors; the extension `providerConfig` could then include CRs of type `nvidia.com/v1alpha1/NVIDIADriver`, `amd.com/v1alpha1/DeviceConfig`, `deviceplugin.intel.com/v1/GpuDevicePlugin` and others.

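As with Step 4, a multi-vendor `providerConfig` is sketched below purely for illustration - the wrapper API group and kind are hypothetical, the vendor CR types are the upstream ones named above, and the per-vendor spec fields shown are only indicative:

```yaml
# Hypothetical multi-vendor providerConfig for the proposed gpu-support extension.
apiVersion: gpu.extensions.gardener.cloud/v1alpha1   # hypothetical API group
kind: GPUConfig                                      # hypothetical kind
drivers:
  - apiVersion: nvidia.com/v1alpha1
    kind: NVIDIADriver
    metadata:
      name: nvidia-pool-driver
    spec:
      version: "550"                                 # indicative field
      nodeSelector:
        worker.gardener.cloud/pool: nvidia-pool      # illustrative pool label
  - apiVersion: amd.com/v1alpha1
    kind: DeviceConfig
    metadata:
      name: amd-pool-config
    spec:
      selector:
        worker.gardener.cloud/pool: amd-pool         # illustrative pool label
```
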