Commit 57cfe35

Merge pull request #159 from run-ai/erez/compute-domain-docs-1
docs: add Compute Domain DRA usage instructions to README
2 parents 149732b + b283c13 commit 57cfe35

1 file changed

+78
-0
lines changed

README.md

Lines changed: 78 additions & 0 deletions
@@ -206,6 +206,84 @@ spec:
See [test/integration/manifests/](test/integration/manifests/) for more examples.

## 🔐 Compute Domain DRA (Secure Workload Isolation)

The Fake GPU Operator can simulate [NVIDIA Compute Domains](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-cds.html), which provide secure workload isolation, without requiring actual NVIDIA hardware. The simulation includes the IMEX channels that Compute Domains provide to multi-node GPU workloads.

### Prerequisites

- Kubernetes 1.31+ with the `DynamicResourceAllocation` feature gate enabled
- DRA plugin enabled in the Fake GPU Operator
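
One way to confirm the first prerequisite is to ask the API server whether it serves the DRA API group. This is a minimal sketch, assuming `kubectl` is already pointed at the target cluster; the helper name `check_dra_api` is hypothetical and not part of the operator.

```shell
# Hypothetical helper (not shipped with the Fake GPU Operator).
# Succeeds if the cluster serves resourceclaims under the resource.k8s.io
# API group, which indicates that the DRA APIs are enabled.
check_dra_api() {
  kubectl api-resources --api-group=resource.k8s.io -o name | grep -q '^resourceclaims'
}

# Usage:
#   check_dra_api && echo "DRA API available" || echo "DRA API not served"
```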

### Enable Compute Domain in Helm chart

```yaml
# values.yaml
computeDomainController:
  enabled: true
computeDomainDraPlugin:
  enabled: true
draPlugin:
  enabled: true
devicePlugin:
  enabled: false  # Disable legacy plugin when using DRA
```
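
These values could be applied with a standard Helm upgrade. The following is a sketch only: the release name, chart reference, and namespace are passed as arguments because they depend on how the operator was installed, and the helper name `apply_compute_domain_values` is hypothetical.

```shell
# Hypothetical helper (not shipped with the Fake GPU Operator).
# Arguments: <release> <chart-ref> <namespace>
# All three are placeholders; use the values from your own installation.
# Expects values.yaml (as shown above) in the current directory.
apply_compute_domain_values() {
  helm upgrade --install "$1" "$2" \
    --namespace "$3" --create-namespace \
    -f values.yaml
}

# Usage:
#   apply_compute_domain_values <release> <chart-ref> <namespace>
```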

### Deploy with Compute Domain

First, create a ComputeDomain resource:

```yaml
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: my-compute-domain
  namespace: default
spec:
  numNodes: 1
  channel:
    allocationMode: Single  # or "All" for all channels
    resourceClaimTemplate:
      name: my-compute-domain
```

The compute-domain-controller automatically creates a ResourceClaimTemplate for the ComputeDomain, using the name set in `spec.channel.resourceClaimTemplate.name`.
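
Before deploying a pod, it can help to confirm that the template exists. A minimal sketch, assuming the names from the example above; the helper name `get_claim_template` is hypothetical.

```shell
# Hypothetical helper (not shipped with the Fake GPU Operator).
# Arguments: <template-name> <namespace>
# Prints the ResourceClaimTemplate's resource name if it exists and
# returns non-zero if kubectl cannot find it.
get_claim_template() {
  kubectl get resourceclaimtemplate "$1" --namespace "$2" -o name
}

# Usage:
#   get_claim_template my-compute-domain default
```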

Then, deploy a pod that uses the compute domain:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: compute-domain-pod
  namespace: default
spec:
  containers:
  - name: main
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      claims:
      - name: compute-domain
  resourceClaims:
  - name: compute-domain
    resourceClaimTemplateName: my-compute-domain
```
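
Once the pod is scheduled, Kubernetes records the ResourceClaim generated from the template in the pod status. The sketch below reads that field; the helper name `pod_claim_names` is hypothetical.

```shell
# Hypothetical helper (not shipped with the Fake GPU Operator).
# Arguments: <pod-name> <namespace>
# Prints the names of the ResourceClaims that Kubernetes generated for
# the pod from its resourceClaimTemplateName entries.
pod_claim_names() {
  kubectl get pod "$1" --namespace "$2" \
    -o jsonpath='{.status.resourceClaimStatuses[*].resourceClaimName}'
}

# Usage:
#   pod_claim_names compute-domain-pod default
```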

### Verify Compute Domain Status

```bash
# Check ComputeDomain status
kubectl get computedomain my-compute-domain -o yaml

# Verify status shows Ready and allocated nodes
# status:
#   status: Ready
#   nodes:
#   - name: <node-name>
#     status: Ready
```
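
For scripted checks, the same condition can be awaited instead of inspected by hand. This is a sketch that assumes the `.status.status` field shown in the output above; the helper name `wait_for_compute_domain` and the 120s default timeout are hypothetical choices.

```shell
# Hypothetical helper (not shipped with the Fake GPU Operator).
# Arguments: <computedomain-name> <namespace> [timeout, default 120s]
# Blocks until .status.status equals Ready or the timeout expires.
wait_for_compute_domain() {
  kubectl wait "computedomain/$1" --namespace "$2" \
    --for=jsonpath='{.status.status}'=Ready --timeout="${3:-120s}"
}

# Usage:
#   wait_for_compute_domain my-compute-domain default
```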

## 🎭 KWOK Integration (Simulated Nodes)

[KWOK](https://kwok.sigs.k8s.io/) (Kubernetes WithOut Kubelet) is a toolkit that lets you simulate thousands of Kubernetes nodes without running actual kubelet processes. Combined with the Fake GPU Operator, it lets you build large-scale GPU cluster simulations entirely without hardware, which is well suited to testing schedulers, autoscalers, and resource management at scale.
