Commit 72c584b

Update using-rdma-network-locality-when-running-workloads-on-oke.md
1 parent: 9c260a7

1 file changed: +179 −89 lines

docs/using-rdma-network-locality-when-running-workloads-on-oke.md

@@ -27,9 +27,9 @@ curl -H 'Authorization: Bearer Oracle' http://169.254.169.254/opc/v2/host/rdmaTo

## Which shapes are supported?
**H100, H200, B200, MI300x**
- Kueue
- Kubernetes Node Affinity
- Kubernetes Pod Affinity
- Node Ordering script as Init Container

**A100**
@@ -54,6 +54,184 @@ oci.oraclecloud.com/rdma.local_block_id=4tjxbt4s6ua
oci.oraclecloud.com/rdma.network_block_id=7xmzl4p4wba
```

> [!NOTE]
> We recommend using Kueue.

### Using Kueue with Topology Aware Scheduling

Kueue supports **Topology Aware Scheduling (TAS)**, which allows you to define a hierarchy of topology domains for your nodes based on node labels.

Topology Aware Scheduling is a beta feature and is enabled by default starting with **v0.14**.

You can deploy Kueue using Helm with:

```bash
helm install kueue oci://registry.k8s.io/kueue/charts/kueue --version="0.14.1" --create-namespace --namespace=kueue-system
```
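
Before creating any TAS objects, you can check that the controller is up. This is a quick sanity check; the deployment name assumes the chart defaults with the release name `kueue` used above:

```bash
# Wait for the Kueue controller to become available
kubectl -n kueue-system rollout status deployment/kueue-controller-manager

# The Topology, ResourceFlavor, ClusterQueue, and LocalQueue CRDs should now be installed
kubectl get crd | grep kueue.x-k8s.io
```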

This example shows how to:

- Define a **Topology** for OCI RDMA domains
- Create a **ResourceFlavor** for H100 GPU nodes
- Configure a **ClusterQueue** and **LocalQueue**
- Run a **Job** that uses Topology Aware Scheduling

In this setup, we use the node label:

```yaml
node.kubernetes.io/instance-type: "BM.GPU.H100.8"
```

This label targets OCI bare metal H100 GPU nodes. You can replace it with any label that exists on the target nodes in your environment.
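
To see which nodes this targets in your cluster, list the nodes that carry the label:

```bash
# List nodes whose instance type matches the label used by the ResourceFlavor below
kubectl get nodes -l node.kubernetes.io/instance-type=BM.GPU.H100.8
```
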
---

#### 1. Create a Topology

Define how nodes are grouped at different hierarchy levels. Levels are ordered from the widest domain (HPC island) down to the individual node.

Save the following as `topology.yaml`:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: "oci-topology"
spec:
  levels:
    - nodeLabel: "oci.oraclecloud.com/rdma.hpc_island_id"
    - nodeLabel: "oci.oraclecloud.com/rdma.network_block_id"
    - nodeLabel: "oci.oraclecloud.com/rdma.local_block_id"
    - nodeLabel: "kubernetes.io/hostname"
```

Apply it:

```bash
kubectl apply -f topology.yaml
```
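
Because the topology levels map directly onto node labels, you can preview how your nodes are grouped before queueing any workloads:

```bash
# Show each node's position in the RDMA hierarchy referenced by the Topology levels
kubectl get nodes -L oci.oraclecloud.com/rdma.hpc_island_id,oci.oraclecloud.com/rdma.network_block_id,oci.oraclecloud.com/rdma.local_block_id
```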

---

#### 2. Create a ResourceFlavor

Define a flavor for your node type and reference the topology.

Save the following as `resourceflavor.yaml`:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "tas-flavor"
spec:
  nodeLabels:
    node.kubernetes.io/instance-type: "BM.GPU.H100.8"
  topologyName: "oci-topology"
```

Apply it:

```bash
kubectl apply -f resourceflavor.yaml
```
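
If you want to confirm that the flavor is bound to the topology, describing it should show both the node label selector and the referenced topology name:

```bash
# The output should include the nodeLabels selector and topologyName: oci-topology
kubectl describe resourceflavor tas-flavor
```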

---

#### 3. Create a ClusterQueue

Define a shared queue of resources available to all namespaces.

Save the following as `clusterqueue.yaml`:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "tas-cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: "tas-flavor"
          resources:
            - name: "cpu"
              nominalQuota: 100
            - name: "memory"
              nominalQuota: 100Gi
```

Apply it:

```bash
kubectl apply -f clusterqueue.yaml
```
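
The quota above covers only CPU and memory. If your pods also request GPUs, the GPU resource generally needs to be covered by the ClusterQueue as well, or Kueue will not admit the workload (unless the resource is excluded in the Kueue configuration). A sketch of the same queue extended with GPU quota, assuming the standard `nvidia.com/gpu` resource name exposed by the NVIDIA device plugin and an illustrative quota of 64 GPUs:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "tas-cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: "tas-flavor"
          resources:
            - name: "cpu"
              nominalQuota: 100
            - name: "memory"
              nominalQuota: 100Gi
            # Illustrative quota: set this to the total number of GPUs behind the flavor,
            # e.g. 8 GPUs per BM.GPU.H100.8 node x number of nodes
            - name: "nvidia.com/gpu"
              nominalQuota: 64
```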

---

#### 4. Create a LocalQueue

Create a namespace-specific queue linked to the cluster queue. `LocalQueue` is a namespaced resource; the manifest below does not set a namespace, so it is created in your current namespace, and jobs must be submitted in that same namespace.

Save the following as `localqueue.yaml`:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: tas-user-queue
spec:
  clusterQueue: tas-cluster-queue
```

Apply it:

```bash
kubectl apply -f localqueue.yaml
```
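
Before submitting jobs, you can check that both queues are active (the ClusterQueue only becomes active once its referenced flavor exists and the quotas are valid):

```bash
# ClusterQueue is cluster-scoped; LocalQueue lives in the namespace it was created in
kubectl get clusterqueue tas-cluster-queue
kubectl get localqueue tas-user-queue
```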

---

#### 5. Run an Example Job

The annotation `kueue.x-k8s.io/podset-preferred-topology` tells Kueue to **prefer placing all pods within the same topology domain**. If that is not possible, Kueue will progressively move up the hierarchy until it finds a level where the job fits. If no level can contain all pods, they are distributed across multiple topology domains.
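
If co-location is a hard requirement rather than a preference, Kueue also supports the `kueue.x-k8s.io/podset-required-topology` annotation: the job is only admitted once all of its pods fit within a single domain at the requested level. To use it, swap the annotation in the Job's pod template, for example:

```yaml
      annotations:
        # Hard requirement: only admit the PodSet when it fits inside one local block
        kueue.x-k8s.io/podset-required-topology: "oci.oraclecloud.com/rdma.local_block_id"
```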

Save the following as `job.yaml`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: tas-sample-preferred-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
spec:
  parallelism: 2
  completions: 2
  completionMode: Indexed
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-preferred-topology: "oci.oraclecloud.com/rdma.local_block_id"
    spec:
      containers:
        - name: dummy-job
          image: registry.k8s.io/e2e-test-images/agnhost:2.53
          args: ["pause"]
          resources:
            requests:
              cpu: "1"
              memory: "200Mi"
      restartPolicy: Never
```

Create it (the Job uses `generateName`, so use `kubectl create` rather than `kubectl apply`):

```bash
kubectl create -f job.yaml
```
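
After the Job is created, Kueue creates a Workload object for it. Inspecting the Workload shows whether the job was admitted and, with TAS, which topology assignment it received. The job and workload names below are placeholders, since `generateName` produces a random suffix:

```bash
# Find the generated Job and its Kueue Workload
kubectl get jobs
kubectl get workloads

# The Workload status shows the admission state (and, for TAS, the topology assignment)
kubectl describe workload <workload-name>

# Once running, the pods of one job should share the same local block
kubectl get pods -o wide -l job-name=<job-name>
```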

### Using Kubernetes node affinity
You can use the labels explained above to create affinity rules for your workloads. Visit [this link](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) if you want to learn more about using affinity rules on Kubernetes.

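As a minimal sketch of the idea, a pod affinity rule can use one of the RDMA labels as its `topologyKey`, so that pods sharing a label are co-scheduled within the same local block. The pod name, `app` label, and image below are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-worker            # illustrative name
  labels:
    app: rdma-worker           # matched by the affinity rule below
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: rdma-worker
          # Treat each RDMA local block as one topology domain
          topologyKey: oci.oraclecloud.com/rdma.local_block_id
  containers:
    - name: worker
      image: registry.k8s.io/e2e-test-images/agnhost:2.53
      args: ["pause"]
```
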
@@ -189,95 +367,7 @@ spec:

### Using Node Ordering script as an Init Container
If your workload can use an ordered hostfile or rankfile (e.g. with `mpirun`), you can use the [Node Ordering script](../docker/node-ordering/node_ordering.py) in an init container to generate the ordered hostfile/rankfile, and then use the generated file in your job.
