
Commit 34f9b25

pbochynski and mmitoraj authored
GPU guide (#576)
* GPU guide
* Refactor GPU workload README for clarity and updates: updated formatting and improved clarity in the README for running GPU workloads in a Kyma cluster; enhanced instructions and added notes for better understanding.
* Lg fix: corrected formatting and improved clarity in the README.
* Update prerequisites: updated prerequisites and procedure sections in README.
* Add GPU example to index

---------

Co-authored-by: Małgorzata Świeca <[email protected]>
1 parent 502d83e commit 34f9b25

File tree

5 files changed: +279 −0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -107,6 +107,7 @@ Running various samples requires access to the Kyma runtime. There are also othe
| [HandsOn DSAG Technology Days 2022](./dsagtt22/) | This sample gives a walk-through setting up a scenario combining on-premise systems with Kyma Functions and the Event Mesh | - |
| [Query LDAP Users on on-premise](./sample-ldap/README.md) | This sample queries the LDAP users from an on-premise LDAP Server via SAP Connectivity proxy | - |
| [Deploy Highly Available Workloads](./multi-zone-ha-deployment/README.md) | This sample demonstrates deploying highly available workloads in Kyma runtime | - |
+| [Running GPU Workload in a Kyma Cluster](./gpu/README.md) | This sample demonstrates how to set up and run GPU-accelerated workloads on Kyma runtime, including AI image generation | - |
| [Power of serverless with SAP BTP, Kyma runtime.](./kyma-serverless/README.md) | This sample demonstrates how to leverage latest features of kyma functions with SAP HANA Cloud and SAP libraries | [Post](https://blogs.sap.com/2023/02/06/power-of-serverless-with-sap-btp-kyma-runtime.-secrets-mounted-as-volumes./) |
| [Using the on-premise Docker registry with Kyma runtime](./on-premise-docker-registry/README.md) | This sample demonstrates how to pull images from the on-premise Docker registry for applications deployed on Kyma runtime | - |
| [KEDA Cron based scaler](./keda-cron-scaler/README.md) | This sample demonstrates how to leverage KEDA Cron scaler for efficient scaling strategies. | - |

gpu/README.md

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
# Running GPU Workload in a Kyma Cluster

> [!Note]
> This sample is based on the [NVIDIA GPU Operator Installation Guide for Gardener](https://github.com/gardener/gardener-ai-conformance/blob/main/v1.33/NVIDIA-GPU-Operator.md).

## Prerequisites

- Helm 3.x is installed. For more information, see the [Kubernetes](https://github.com/SAP-samples/kyma-runtime-samples/tree/main/prerequisites#kubernetes) section.
- kubectl is installed and configured to access your Kyma cluster. For more information, see the [Kubernetes](https://github.com/SAP-samples/kyma-runtime-samples/tree/main/prerequisites#kubernetes) section.
- You have an SAP BTP, Kyma runtime instance.
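A minimal sanity check of the prerequisites, assuming your kubeconfig already points at the Kyma cluster (these commands are a sketch, not part of the official procedure):

```bash
# Verify the Helm and kubectl clients and the cluster connection
helm version --short
kubectl config current-context
kubectl get nodes
```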
## Procedure
### Setting Up a GPU Worker Pool

Follow these steps to set up a worker pool with GPU nodes in your Kyma cluster. For more information, see [Additional Worker Node Pools](https://help.sap.com/docs/btp/sap-business-technology-platform/provisioning-and-update-parameters-in-kyma-environment?version=Cloud#additional-worker-node-pools).

1. Go to the SAP BTP cockpit and update your Kyma instance by adding a new worker pool named `gpu`.
2. Add nodes with GPU support, for example, the `g6.xlarge` machine type.
3. Set the auto-scaling minimum to `0` nodes and the maximum to the desired number, for example, `2`. This way, when no GPU workloads are running, the cluster scales down to zero GPU nodes, saving costs.
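To confirm that the new pool is recognized, you can list the nodes together with their worker pool. This is a sketch that assumes the standard Gardener node label `worker.gardener.cloud/pool`; with the minimum set to `0`, the `gpu` pool shows no nodes until a GPU workload is scheduled:

```bash
# Show each node and the worker pool it belongs to
kubectl get nodes -L worker.gardener.cloud/pool
```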
### Installation
1. Add the NVIDIA Helm repository.

    ```bash
    # Add the NVIDIA Helm repository
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia

    # Update repository information
    helm repo update

    # Verify repository is added
    helm search repo nvidia/gpu-operator
    ```
2. Install the GPU operator with the Garden Linux configuration.

    The key to a successful installation on Garden Linux is the specialized values file, which handles the Garden Linux-specific requirements.

    ```bash
    # Install GPU Operator with Garden Linux optimized values
    helm upgrade --install --create-namespace -n gpu-operator gpu-operator nvidia/gpu-operator --values \
      https://raw.githubusercontent.com/SAP-samples/kyma-runtime-samples/refs/heads/main/gpu/gpu-operator-values.yaml

    # Wait for installation to complete
    helm status gpu-operator -n gpu-operator
    ```

    > [!Note]
    > The [gpu-operator-values.yaml](gpu-operator-values.yaml) file is configured for driver version 570, which is compatible with the current Garden Linux kernel versions in Kyma clusters. If you need a different driver version, download the file, adjust the `driver.version` field accordingly, and use your local copy during installation. To check the kernel version of your nodes, see the optional commands after this list.
3. The GPU operator deploys several components as DaemonSets and Deployments. Monitor the installation.

    ```bash
    # Watch all pods in gpu-operator namespace
    kubectl get pods -n gpu-operator -w

    # Check deployment status
    kubectl get all -n gpu-operator
    ```
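Two optional checks, sketched here rather than taken from the official guide: inspecting the OS image and kernel version of your nodes (useful when choosing a `driver.version` for the values file), and querying the ClusterPolicy resource that the GPU operator is expected to create, which should eventually report the state `ready`:

```bash
# Inspect the OS image and kernel version of the worker nodes
kubectl get nodes -o wide

# Check the overall GPU operator state reported by its ClusterPolicy resource
kubectl get clusterpolicies.nvidia.com
```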
### Installation Verification
1. Deploy a simple GPU test workload.

    ```bash
    # Create test GPU workload
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      containers:
      - name: gpu-test
        image: nvcr.io/nvidia/cuda:13.0.1-runtime-ubuntu24.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never
    EOF
    ```

    If your cluster does not have GPU resources available, the Pod remains in the `Pending` state for a while until a GPU node is provisioned.

    Once the node is up, the NVIDIA GPU Operator deploys the device plugin DaemonSet, which then advertises `nvidia.com/gpu` resources on that node.
2. Check the autoscaler config to see if GPU nodes are being considered.

    ```bash
    kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml
    ```

    This is an example section from the ConfigMap showing a GPU worker pool with one node started:

    ```yaml
    - name: shoot--kyma--c-1f226cf-gpu-z1
      health:
        status: Healthy
        nodeCounts:
          registered:
            total: 1
            ready: 1
            notStarted: 0
          longUnregistered: 0
          unregistered: 0
        cloudProviderTarget: 1
        minSize: 0
        maxSize: 3
        lastProbeTime: "2025-12-11T14:40:54.65491129Z"
        lastTransitionTime: "2025-12-11T02:13:08.790764467Z"
      scaleUp:
        status: NoActivity
        lastProbeTime: "2025-12-11T14:40:54.65491129Z"
        lastTransitionTime: "2025-12-11T12:12:13.016415154Z"
      scaleDown:
        status: NoCandidates
        lastProbeTime: "2025-12-11T14:40:54.65491129Z"
        lastTransitionTime: "2025-12-11T13:10:13.472558018Z"
    ```
3. Observe this ConfigMap to see if the GPU worker pool is recognized and nodes are provisioned as needed. When the GPU node is ready, the `nvidia.com/gpu` resource should be available for scheduling, and the test Pod should complete successfully. Additional checks are sketched after this list.

    You can run these commands to monitor the test Pod, check its logs, and clean up afterward:

    ```bash
    # Wait for pod to complete and check output
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=300s
    kubectl logs gpu-test

    # Clean up test pod
    kubectl delete pod gpu-test
    ```
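As the additional checks referenced above, you can watch the test Pod leave the `Pending` state and confirm that the GPU node advertises `nvidia.com/gpu` in its allocatable resources. This is a sketch; the `jq` filter is an assumption about your local tooling, not part of the sample:

```bash
# Watch the test Pod; it stays Pending until a GPU node joins the cluster
kubectl get pod gpu-test -w

# List allocatable GPUs per node (requires jq)
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.allocatable["nvidia.com/gpu"]}'
```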
### A More Spectacular GPU Demo: AI Image Generation

For a more impressive demonstration that showcases real GPU acceleration, follow these steps:

1. Deploy an AI image generation workload using Fooocus and the Stable Diffusion XL model.

    ```bash
    kubectl apply -f https://raw.githubusercontent.com/SAP-samples/kyma-runtime-samples/main/gpu/fooocus.yaml
    ```

    The web UI is exposed using an APIRule, so you can access it in a browser using your cluster domain with the `fooocus` subdomain, for example, `https://fooocus.xxxxxxxx.kyma.ondemand.com/`. See the sketch after this list for one way to look up the domain.

    ![Fooocus UI](./piglet.png)
2. To delete the demo app, run:

    ```bash
    kubectl delete -f https://raw.githubusercontent.com/SAP-samples/kyma-runtime-samples/main/gpu/fooocus.yaml
    ```
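If you do not know your cluster domain, the following sketch shows one way to look it up. It assumes the default `kyma-system/kyma-gateway` gateway referenced by the APIRule in [fooocus.yaml](fooocus.yaml) and that the APIRule CRD is installed in your cluster:

```bash
# Read the wildcard host configured on the default Kyma gateway
kubectl get gateways.networking.istio.io kyma-gateway -n kyma-system -o jsonpath='{.spec.servers[0].hosts[0]}'

# Check that the APIRule is ready and see the exposed host
kubectl get apirules.gateway.kyma-project.io -n fooocus
```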
### Cleanup
If you delete all the Pods that require a GPU, the worker pool should scale down to zero nodes again, saving costs. You can check whether the cluster autoscaler recognizes that no GPU nodes are needed by inspecting the cluster-autoscaler-status ConfigMap.

```bash
kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml
```

You should see candidates for scaling down in the GPU worker pool section. Bear in mind that scaling down takes 60 minutes (this is the Kyma cluster default setting).
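To observe the scale-down, you can watch the GPU nodes until they are removed. A minimal sketch, assuming the Gardener pool label `worker.gardener.cloud/pool` and the pool name `gpu` from the setup above:

```bash
# The list should become empty once the scale-down delay has passed
kubectl get nodes -l worker.gardener.cloud/pool=gpu -w
```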

gpu/fooocus.yaml

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
apiVersion: v1
kind: Namespace
metadata:
  name: fooocus
  labels:
    istio-injection: enabled

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fooocus
  namespace: fooocus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fooocus
  template:
    metadata:
      labels:
        app: fooocus
    spec:
      # Ensure it runs on GPU node
      nodeSelector:
        nvidia.com/gpu.present: "true"

      containers:
      - name: fooocus
        image: ghcr.io/lllyasviel/fooocus:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 7865
        resources:
          limits:
            nvidia.com/gpu: 1 # request 1 GPU
            cpu: "2"
          requests:
            nvidia.com/gpu: 1
            cpu: "1"
            memory: "8Gi"
        volumeMounts:
        - name: models
          mountPath: /app/models
        - name: outputs
          mountPath: /app/outputs

      volumes:
      - name: models
        emptyDir: {} # change to PVC if you want persistent models
      - name: outputs
        emptyDir: {} # change to PVC if you want persistent outputs

---
apiVersion: v1
kind: Service
metadata:
  name: fooocus
  namespace: fooocus
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 7865
    protocol: TCP
    name: http
  selector:
    app: fooocus

---
apiVersion: gateway.kyma-project.io/v2
kind: APIRule
metadata:
  name: fooocus
  namespace: fooocus
spec:
  gateway: kyma-system/kyma-gateway
  hosts:
  - fooocus
  service:
    name: fooocus
    port: 80
  rules:
  - path: /*
    methods: ["GET", "POST", "PUT", "DELETE", "PATCH"]
    noAuth: true
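After applying this manifest as described in the README, one way to follow the rollout is sketched below; the first start can take a while because the model files are downloaded into the `models` volume:

```bash
# Wait for the Deployment to become available and follow the container logs
kubectl -n fooocus rollout status deployment/fooocus --timeout=15m
kubectl -n fooocus logs deploy/fooocus -f
```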

gpu/gpu-operator-values.yaml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
cdi:
  enabled: true
  default: true
toolkit:
  installDir: /opt/nvidia
driver:
  imagePullPolicy: Always
  usePrecompiled: true
  # Use a driver version that has a published image for your kernel
  # If 580 is required, ensure the corresponding image exists in ghcr.io and that you have access.
  # version: 580
  # repository: ghcr.io/gardenlinux/gardenlinux-nvidia-installer/proprietary
  version: 570
  repository: ghcr.io/gardenlinux/gardenlinux-nvidia-installer
node-feature-discovery:
  worker:
    config:
      sources:
        custom:
        - name: "gardenlinux-version"
          labelsTemplate: |
            {{ range .system.osrelease }}feature.node.kubernetes.io/system-os_release.VERSION_ID=0
            {{ end }}
          matchFeatures:
          - feature: system.osrelease
            matchExpressions:
              GARDENLINUX_VERSION: {op: Exists}

gpu/piglet.png

1.33 MB
