`docs/roadmap.md` (346 additions, 0 deletions)
# Future GPU support in Gardener

## Executive Summary (AI generated, human-edited)
This document outlines the current state and future plans for GPU support in Gardener, focusing primarily on NVIDIA GPUs.
### Current State
- GPU support is possible but requires significant manual effort
- Requires three DaemonSets to be configured and deployed:
1. NVIDIA driver installer
2. GKE Device Plugin
3. DCGM Exporter

- Current implementation has some limitations:
- Requires building images for each Garden Linux & NVIDIA driver version combination
- Relies on Google's GKE Device Plugin
- Requires manual node labeling
- Limited to basic GPU features

- The above effort and limitations can be avoided by enabling Gardener to use the
NVIDIA GPU Operator, but this requires some development work. This work would also
enable the use of AMD and Intel GPUs in future.

### Future Roadmap
#### Step 1: Add Garden Linux Support to NVIDIA GPU Operator
- Add Garden Linux support to NVIDIA GPU Driver Container
- Implement NVIDIA Container Toolkit installation for Garden Linux
- Integrate Garden Linux support into NVIDIA GPU Operator

#### Step 2: S3 Storage Support
- Add S3 bucket support for storing pre-built kernel modules
- Enable sharing compiled modules across clusters
- Reduce compilation overhead

#### Step 3: NFS PV Storage Integration
- Implement NFS-based Persistent Volume storage
- Enable first-time compilation with cached results for subsequent nodes
- Leverage hyperscaler's NFS CSI driver

#### Step 4: Gardener UI and Shoot Specification Enhancement
- Add GPU support checkbox in Gardener UI
- Automate deployment of NVIDIA GPU Operator (with Node Feature Discovery operator)
- Enable NVIDIA Container runtime options for worker pools
- Implement GPU configuration through `shoot.yaml` specifications

#### Step 5: Multi-vendor Support (Two possible approaches)
1. Extend NVIDIA GPU Operator
- Add support for AMD & Intel GPUs
- Align with projects like HAMi

2. Extend Gardener GPU Extension
- Support multiple vendor operators
- Enable configuration for different GPU vendors through custom resources

The ultimate goal is to simplify GPU deployment in Gardener clusters while providing flexible options for different GPU vendors and use cases.


## Introduction: What do we want?

We want easy-to-consume support for using GPUs in a Gardener cluster,
beginning with NVIDIA GPUs.

Using GPUs in Gardener is possible right now, but involves a lot
of work. What we want is to create a worker pool of GPU nodes, and to be
able to then schedule GPU-using Pods to those nodes with as
little effort as possible.

In a perfect world a user would create a worker pool of GPU instances,
and everything "just works".

In an almost-perfect world a user would select an NVIDIA GPU option
as "Additional OCI Runtime" (dropdown) in the Gardener UI / `containerRuntime` in the shoot spec.

## How we do it now

We deploy DaemonSets for the following 3 features:

- NVIDIA driver installer

- Installs the Linux kernel module that creates /dev/nvidia\*
devices along with NVIDIA-related /bin and /lib folders in the
host OS filesystem.

- For A100, H100 and similar GPUs, runs the NVIDIA Fabric Manager
which enables inter-GPU communication.


- GKE Device Plugin

- This image (used by GKE) makes Kubernetes aware of the GPU
devices on the node, and takes care of inserting into GPU-using
pods the /dev, /bin and /lib files from the NVIDIA driver
installer.


- DCGM Exporter

- Exposes Prometheus-compatible metrics for the GPUs on each node

The GKE Device Plugin and DCGM Exporter use images created by Google and
NVIDIA, respectively.
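With these three DaemonSets in place, GPU workloads simply request the extended resource that the device plugin advertises. A minimal sketch (image and command are placeholders):

```yaml
# Minimal Pod requesting one GPU via the device plugin's extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
      command: ["sleep", "3600"]                            # placeholder workload
      resources:
        limits:
          nvidia.com/gpu: 1   # extended resource advertised by the device plugin
```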

The NVIDIA Driver Installer uses images that are built by AI Core.
[These are technically from the Garden Linux
team](https://github.com/gardenlinux/gardenlinux-nvidia-installer), but
in reality are 95% maintained by AI Core.

For each version of Garden Linux and each version of the NVIDIA driver
we want to support, we have to build an image. This image contains the
specified NVIDIA driver compiled for the kernel of that Garden Linux
version.

The [Garden Linux
repo](https://github.com/gardenlinux/gardenlinux-nvidia-installer)
mentioned above tells a user how to build an image for a given driver &
kernel version. AI Core embeds this repo into [its own build
process](https://github.wdf.sap.corp/ICN-ML/aicore/blob/main/system-services/nvidia-installer/component.yaml#L18-L45)
in order to generate the set of images required to support AI Core,
which are hosted in an AI Core registry for use only by AI Core.

### Pros & Cons

#### Pro: It works

For AI Core at least, the current way works fairly well. Every few
months we update the AI Core build list to add support for a new version
of Garden Linux or a new NVIDIA driver, and then configure AI Core to
use the new versions. This is achieved with a few lines of configuration
in our config-as-code repos and takes just a day or two (build, deploy,
test, etc.).

#### Con: It requires building an image for every version combination of Garden Linux & NVIDIA driver

For other users of Gardener, all they see is the Garden Linux repo. This
is fine for doing a proof-of-concept to build a driver for a cluster
with a given version, but day 2 operations require the user to create a
build pipeline and deployment system parallel to the one used by AI Core
in order to have the images required for future versions of Garden Linux
& NVIDIA driver. (AI Core's build/deploy system is not easily usable
outside of the AI Core context.) Because such images contain proprietary
NVIDIA code (the driver is not open source), it is legally difficult to
put such images into a publicly-accessible registry for use by all.

#### Con: It's not ideal that we use the GKE Device Plugin

The [GKE Device
Plugin](https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/cmd/nvidia_gpu/README.md)
works well, but is used by only one other organisation (Google) and we
do not have explicit permission to use it - although [it is Apache
open-source
licensed](https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/LICENSE)
so the risk is low. Nevertheless we are tying ourselves to a specific
vendor other than NVIDIA.

#### Con: It requires Gardener users to label GPU nodes

Because the NVIDIA driver installer image is specific to each Garden
Linux version, each GPU node requires a label identifying this version,
for example **os-version: 1592.4.0**. Gardener does not take care of
adding such labels, so this becomes a chore for the operations team.
> **Review comment (on lines +155 to +158):** OK, this sounds like something rather trivial to quickly/easily add (modulo in cases of conflict).

Note: This can be automated by deploying the [Node Feature
Discovery](https://kubernetes-sigs.github.io/node-feature-discovery/v0.17/get-started/index.html)
operator and creating the following rule:


```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: gardenlinux-version
spec:
  rules:
    - name: "Garden Linux version"
      labels:
        "node.gardener.cloud/gardenlinux-version": "@system.osrelease.GARDENLINUX_VERSION"
      matchFeatures:
        - feature: system.osrelease
          matchExpressions:
            GARDENLINUX_VERSION: {op: Exists}
```


This rule will result in a label similar to this:
**`node.gardener.cloud/gardenlinux-version: '1592.9'`**
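For illustration, a version-specific driver-installer DaemonSet could then pin itself to matching nodes with a node affinity on this label - a sketch only, with the version value as an example:

```yaml
# Sketch: schedule a version-specific driver installer only onto nodes whose
# Garden Linux version label (set by the NFD rule above) matches.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.gardener.cloud/gardenlinux-version
                    operator: In
                    values: ["1592.9"]
```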

## NVIDIA GPU Operator

### Pros & Cons

#### Pro: It is the official method supported by NVIDIA and does everything for you

The [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) takes
care of installing the following:

- NVIDIA GPU Driver

- NVIDIA Container Toolkit (a collection of libraries and utilities
enabling users to build and run GPU-accelerated containers)

- NVIDIA Device Plugin

- DCGM Exporter

- vGPU manager

#### Pro: Enables advanced GPU features

Features such as multi-instance GPU, vGPU, GPU time slicing and GPUDirect RDMA are
supported by the NVIDIA GPU Operator. These features are not
supported by the current AI Core implementation.
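As one example, GPU time slicing is driven by the operator's device-plugin configuration; a minimal sketch, with the ConfigMap name and replica count chosen arbitrarily:

```yaml
# Sketch of a time-slicing config consumed by the GPU Operator's device
# plugin; each physical GPU would be advertised as 4 schedulable GPUs.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # arbitrary name
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

The operator would then be pointed at such a ConfigMap via its device-plugin configuration options.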

#### Con: Driver installer by default downloads and compiles at runtime

The default configuration runs a container image that downloads and
installs OS packages and then downloads, compiles and installs the
NVIDIA driver kernel modules - this is all done by the DaemonSet's Pod
when it starts on the GPU node. We used to do something similar for AI
Core, but found the approach to be somewhat fragile as well as adding a
significant amount of time to the node startup phase.

It is possible to tell the operator to use "precompiled" images instead,
which results in an approach similar to how AI Core currently installs the
NVIDIA driver. Of course, a build pipeline must be set up to create
these images.
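In Helm-values terms, switching to precompiled images is roughly the following (registry path and driver branch are illustrative placeholders):

```yaml
# Illustrative GPU Operator Helm values for precompiled driver images
# built in our own pipeline; the repository is a placeholder.
driver:
  enabled: true
  usePrecompiled: true
  repository: registry.example.com/nvidia   # placeholder registry
  version: "535"                            # driver branch, example only
```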

Both types of image (download & compile; precompiled) are built from the
[NVIDIA GPU Driver
Container](https://github.com/NVIDIA/gpu-driver-container) repo. The
root of this repo contains folders for various operating systems.

Only Ubuntu 22.04 and Ubuntu 24.04 are officially supported for
precompiled images, although the repo also contains the required files
and instructions to build precompiled images for RHEL 8 and RHEL 9.

#### Con: Garden Linux is not a supported platform

The NVIDIA GPU Operator supports only Ubuntu and Red Hat operating
systems. In principle, support for Garden Linux could be added
reasonably easily - however NVIDIA might not accept PRs for Garden Linux
support and therefore we might need to use and maintain a fork.

#### Con: NVIDIA Container Runtime requires host OS configuration

See [Installing the NVIDIA Container Toolkit — NVIDIA Container
Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-containerd-for-kubernetes)

The NVIDIA Container Toolkit requires a functioning package manager on
the host OS, but the Garden Linux read-only filesystem prevents new
packages from being installed. This is probably the biggest barrier to
getting things working.

## Roadmap for the future

### Step 1 - Add Garden Linux support to the NVIDIA GPU Operator
> **Review comment:** How does that help after you said:
> - Either:
>   - Fragile and slow if at runtime, or:
>   - License issue/violation if pre-built
> - Requires package manager/installation at runtime
> - As well as requiring an approval by its gatekeeper, NVIDIA itself

There are several sub-steps here:

1. Add support for Garden Linux in the [NVIDIA GPU Driver
Container](https://github.com/NVIDIA/gpu-driver-container) repo

It should be possible to use the Ubuntu examples from the [NVIDIA GPU
Driver Container](https://github.com/NVIDIA/gpu-driver-container) repo
in combination with the existing Garden Linux scripts to synthesise
Garden Linux support for both default and precompiled images.


2. Figure out how to install the NVIDIA Container Toolkit/Runtime on
Garden Linux

The toolkit itself is [open-source on
GitHub](https://github.com/NVIDIA/nvidia-container-toolkit) so we
might be able to figure out an alternative way to install it. In the
worst case we would need to build a specific Garden Linux image to
support NVIDIA GPUs.
> **Review comment (on lines +270 to +271):** Ah, so pre-installed images then. Is that driver-version-independent? What kind of compatibility matrix/issues are to be expected?
>
> **Author reply:** Regarding your first comment: the configuration aspect is addressed in another comment above. Regarding your second comment: the specific Garden Linux images we are talking about here contain only the NVIDIA Container Runtime files (not the kernel modules for the driver) - these are Go binaries stored in /usr/bin and /usr/lib and are independent of the kernel and driver versions, so there is no compatibility matrix to consider (other than keeping the binaries more-or-less up to date).



3. Add support for Garden Linux in the [NVIDIA GPU
Operator](https://github.com/NVIDIA/gpu-operator)

   Not a great deal needs to be done here - mostly adding a few lines of
   config, a few lines of code, and a few tests. The GPU Operator is
   mostly concerned with deploying the results of the previous sub-steps.

> **Review comment:** But, isn't NVIDIA the gatekeeper and probably blocking that? I haven't seen support for anything but Ubuntu and Red Hat until now. Can you please share a detail link where more operating systems are supported?
>
> **Member:** hey @vlerenc, there has been a shift of direction from NVIDIA. We will meet up with @dhague & @gehoern to align on that, likely today. Will keep you posted.
>
> **Review comment:** I am somewhat sceptical, but it would be good to see that @pnpavlov.
>
> **Author reply:** In terms of the container images for installing the driver, the NVIDIA repo has top-level folders for Azure Linux, Photon Linux and SLES15 in addition to Ubuntu and the various RedHat OSes. For the GPU Operator, the only place I could find OS-specific code (outside of test data & test code) is the following few lines from driver_volumes.go:
>
> ```go
> // RepoConfigPathMap indicates standard OS specific paths for repository configuration files
> var RepoConfigPathMap = map[string]string{
> 	"centos": "/etc/yum.repos.d",
> 	"ubuntu": "/etc/apt/sources.list.d",
> 	"rhcos":  "/etc/yum.repos.d",
> 	"rhel":   "/etc/yum.repos.d",
> }
>
> // CertConfigPathMap indicates standard OS specific paths for ssl keys/certificates.
> // Where Go looks for certs: https://golang.org/src/crypto/x509/root_linux.go
> // Where OCP mounts proxy certs on RHCOS nodes:
> // https://access.redhat.com/documentation/en-us/openshift_container_platform/4.3/html/authentication/ocp-certificates#proxy-certificates_ocp-certificates
> var CertConfigPathMap = map[string]string{
> 	"centos": "/etc/pki/ca-trust/extracted/pem",
> 	"ubuntu": "/usr/local/share/ca-certificates",
> 	"rhcos":  "/etc/pki/ca-trust/extracted/pem",
> 	"rhel":   "/etc/pki/ca-trust/extracted/pem",
> }
> ```
>
> **Author reply (@dhague, May 15, 2025):** Update: there is also some OS-specific code further down in the same file, which includes some SLES-specific code.

### Step 2 - Add support for S3 storage to the NVIDIA GPU Operator

The project that served as the basis for the Garden Linux NVIDIA
installer is [squat/modulus](https://github.com/squat/modulus), which
was designed to do something very similar for Flatcar Linux / CoreOS.
This project supports using an S3 bucket, such that kernel modules are
still downloaded & compiled at runtime, but only once - the resulting
files are stored in the S3 bucket and the installer checks this bucket
for pre-built kernel modules. This has the advantages of the
default GPU operator behaviour (no need to build a container image for each
kernel & driver version) along with the advantages of the precompiled
images approach (no need to download & compile for every node in the
cluster). All of a user's clusters could share the same S3 bucket,
such that the initial compilation is done in a preproduction cluster
and production clusters would then always have access to prebuilt
kernel modules.

> **Review comment (on lines +286 to +289):** Ah, so that's the way around the slow+fragile vs. license issue. But what about the NVIDIA gatekeeper issue? The link above (Kinvolk/Flatcar Linux) only helps with the kernel modules, not the integration into the operator, does it?
>
> **Author reply:** As mentioned earlier, we might need to use and maintain forks of the NVIDIA GPU Operator repo and the NVIDIA GPU Driver Container repo. Given the structures of these repos it would not be too difficult to keep them in sync; that said, we are reasonably hopeful that NVIDIA will accept our PRs, as it will help them to sell GPUs to cloud providers using Gardener for their Kubernetes offering.
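A sketch of how such a cache could be surfaced to the driver container - the environment variable names below are purely hypothetical, loosely inspired by squat/modulus, and do not exist in any current image:

```yaml
# Hypothetical env configuration on the driver DaemonSet container; the
# variable names are invented and only illustrate the idea of checking a
# shared bucket for prebuilt modules before compiling.
env:
  - name: MODULE_CACHE_BUCKET   # hypothetical: bucket holding prebuilt modules
    value: s3://example-gpu-module-cache
  - name: MODULE_CACHE_KEY      # hypothetical: one object per kernel/driver pair
    value: "$(KERNEL_VERSION)-$(DRIVER_VERSION).tar.gz"
```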

### Step 3 - Add support for NFS PV storage to the NVIDIA GPU Operator

The previous option is almost ideal, but requires the user to set up an
S3 bucket and configure the operator to use it. Another option is for
the operator to use an NFS-based PV in which compiled kernel modules are stored
(the hyperscaler's NFS CSI driver would be deployed and take care of PV
provisioning, or alternatively Gardener could take care of setting up an
NFS volume on the hyperscaler). This would mean that the first node
using a particular kernel/driver combination would trigger module
download & compilation, but all future nodes could just get the required
files from the PV. This would deliver exactly the required user
experience, subject to Gardener deploying the required components in
response to the user enabling GPU functionality in the cluster (see next step).
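A sketch of the shared cache volume, assuming a ReadWriteMany storage class backed by the hyperscaler's NFS/file CSI driver (all names are placeholders):

```yaml
# Placeholder PVC for a cluster-wide module cache shared by driver pods on
# all GPU nodes; the storage class depends on the deployed file/NFS CSI driver.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvidia-module-cache
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: nfs-csi   # placeholder
  resources:
    requests:
      storage: 10Gi
```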

### Step 4 - Enable GPU support in the Gardener UI and Shoot specification

Up until this point GPU support is made easier, but is still not automatic - the
user needs to take care of configuring and deploying the GPU operator and the
`NodeFeatureRule` that enables the node label for the Garden Linux version.
The next step is to add a checkbox to the Gardener UI
to enable GPU support in a cluster. This would automatically deploy the NVIDIA GPU Operator and
the Node Feature Discovery operator (and associated rule to label nodes with the
Garden Linux version) and would enable the NVIDIA Container runtime as an option
for worker pools.

With the NVIDIA GPU Operator deployed to the cluster, its configuration would be
maintained by editing the deployed `NVIDIADriver` custom resource - see
[here](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-configuration.html#about-the-nvidia-driver-custom-resource)
for details.
This custom resource would be embedded in the Gardener `shoot.yaml` in a `spec.extensions.providerConfig`
for an extension of type `gpu-support` (to be developed). One or more such CRs could be included in the
`providerConfig`; each `NVIDIADriver` CR can contain a `nodeSelector` and a `version`, and this would allow different
driver versions to be deployed to different worker pools based on a node label.
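A sketch of what this could look like in `shoot.yaml`, assuming the yet-to-be-developed `gpu-support` extension simply passes embedded `NVIDIADriver` resources through (wrapper field, names and versions are illustrative):

```yaml
# Sketch only - the gpu-support extension type does not exist yet.
spec:
  extensions:
    - type: gpu-support
      providerConfig:
        drivers:                        # hypothetical wrapper field
          - apiVersion: nvidia.com/v1alpha1
            kind: NVIDIADriver
            metadata:
              name: gpu-pool-a
            spec:
              version: "550.54.15"      # example driver version
              nodeSelector:
                worker.gardener.cloud/pool: gpu-pool-a
```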

### Step 5a - Extend the NVIDIA GPU Operator to support AMD & Intel GPUs

NVIDIA, Intel and AMD are all currently maintaining operators for supporting
their GPUs on Kubernetes. This is not a competitive differentiator for any of them.
Bringing them together would reduce the overhead for all, and improve the user experience.
Something similar is already happening with [Project HAMi](https://github.com/Project-HAMi/HAMi),
which supports the GPUs of multiple vendors. With that said, such unification may be unlikely
due to political/marketing reasons.

### Step 5b - Extend the Gardener GPU extension to support multiple vendors

An alternative to Step 5a above is for the Gardener GPU extension to support operators from
multiple vendors; the extension `providerConfig` could then include CRs of type
`nvidia.com/v1alpha1/NVIDIADriver`, `amd.com/v1alpha1/DeviceConfig`,
`deviceplugin.intel.com/v1/GpuDevicePlugin` and others.
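Continuing the sketch from Step 4, a multi-vendor `providerConfig` might then carry one CR per vendor, each targeting its own worker pool (wrapper and spec fields are illustrative only):

```yaml
# Illustrative only: vendor CR stubs side by side; the exact spec fields
# are defined by the respective vendor operators.
drivers:
  - apiVersion: nvidia.com/v1alpha1
    kind: NVIDIADriver
    metadata: {name: nvidia-pool}
    spec:
      nodeSelector: {worker.gardener.cloud/pool: nvidia-pool}
  - apiVersion: amd.com/v1alpha1
    kind: DeviceConfig
    metadata: {name: amd-pool}
    spec:
      selector: {worker.gardener.cloud/pool: amd-pool}
  - apiVersion: deviceplugin.intel.com/v1
    kind: GpuDevicePlugin
    metadata: {name: intel-pool}
    spec:
      nodeSelector: {worker.gardener.cloud/pool: intel-pool}
```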