
Commit 294cd71

Merge pull request #54 from oracle-quickstart/topology
Add Using network locality when running workloads on OKE doc
2 parents 076df17 + 0a562d5 commit 294cd71

3 files changed: +269 -0 lines changed


README.md

Lines changed: 3 additions & 0 deletions
@@ -286,3 +286,6 @@ You can deploy the health check script with Node Problem Detector by following t

### Can I autoscale my RDMA enabled nodes in a Cluster Network?
You can setup autoscaling for your nodes in a Cluster Network using the instructions [here.](./docs/using-cluster-autoscaler-with-cluster-networks.md)

### How do I use network locality information when running workloads on OKE?
You can follow the instructions [here.](./docs/using-rdma-network-locality-when-running-workloads-on-oke.md)

docs/tiers.png

523 KB

docs/using-rdma-network-locality-when-running-workloads-on-oke.md

Lines changed: 266 additions & 0 deletions
@@ -0,0 +1,266 @@
# Using network locality when running workloads on OKE

> [!IMPORTANT]
> To use the instructions in this guide, you must have a dedicated capacity pool and you must create a capacity topology. Otherwise, `rdmaTopologyData` in the instance metadata service and the related node labels in OKE will not be available.

## What is network locality?

Generative AI workloads drive a different set of engineering tradeoffs than traditional cloud workloads, so we designed a purpose-built GenAI network tailored to the needs of best-of-breed Generative AI workloads.

When possible, running a job on nodes in the same Local Block will provide the best performance. Because the number of nodes in a Local Block is limited, you might need to use nodes from another Local Block in the same Network Block, or from another Network Block, depending on how many nodes you have, how many concurrent jobs you run, and the size of your jobs.

Local Block is the first latency band (Tier-0), Network Block is the second latency band (Tier-1), and HPC Island is the third latency band (Tier-2) in OCI's RDMA networks. You can read [this blog post](https://blogs.oracle.com/cloud-infrastructure/post/first-principles-zettascale-oci-superclusters) and watch the [YouTube video](https://www.youtube.com/watch?v=cZy22n5Ih78) to learn more about OCI's RDMA network design.

![OCI Cluster Network Fabric](./tiers.png)

## What type of network tier information will I have?

When you have a dedicated capacity pool and a capacity topology created for the availability domain, the following information will be available in the instance metadata service for bare metal GPU shapes:

```
curl -H 'Authorization: Bearer Oracle' http://169.254.169.254/opc/v2/host/rdmaTopologyData

{
  "customerHPCIslandId": "ocid1.hpcisland.oc1.iad.anuwcljrg5pyaeycajoqlss...",
  "customerHostId": "ocid1.computebaremetalhost.oc1.iad.anuwcljrg5pyaeycu...",
  "customerLocalBlock": "ocid1.computelocalblock.oc1.iad.anuwcljrg5pyaeyc...",
  "customerNetworkBlock": "ocid1.computenetworkblock.oc1.iad.anuwclddsdef..."
}
```

## How do I use network locality information when running workloads on OKE?

When the locality information is available in the instance metadata service, OKE will add the following labels to your nodes during bootstrapping:

```
oci.oraclecloud.com/rdma.host_id
oci.oraclecloud.com/rdma.hpc_island_id
oci.oraclecloud.com/rdma.local_block_id
oci.oraclecloud.com/rdma.network_block_id
```

The values of the labels are hashes of the information available in the instance metadata service, so they will be different from the OCIDs above.

Example:
```
oci.oraclecloud.com/rdma.host_id=ab3zs7y7v7q
oci.oraclecloud.com/rdma.hpc_island_id=af7ubvouuyq
oci.oraclecloud.com/rdma.local_block_id=4tjxbt4s6ua
oci.oraclecloud.com/rdma.network_block_id=7xmzl4p4wba
```

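You can confirm that the labels are present before creating any affinity rules; a minimal `kubectl` sketch (the label values in your cluster will be different hashes):

```
# Show each node's RDMA locality labels as extra columns
kubectl get nodes \
  -L oci.oraclecloud.com/rdma.local_block_id \
  -L oci.oraclecloud.com/rdma.network_block_id \
  -L oci.oraclecloud.com/rdma.hpc_island_id
```
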
You can use these labels to create affinity rules for your workloads. Visit [this link](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) if you want to learn more about using affinity rules on Kubernetes.

Note that because the examples below use soft rules (`preferredDuringSchedulingIgnoredDuringExecution`), the scheduler will try to find a node that meets the rules, but if a matching node is not available, it will still schedule the pod.

You can use hard rules instead (`requiredDuringSchedulingIgnoredDuringExecution`), but that means the scheduler can't schedule the pod unless the rules are met, so depending on node availability your jobs might not start.

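For reference, a hard rule pinning pods to a specific Local Block would look like the following fragment (a sketch only; the label value is an illustrative hash):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: oci.oraclecloud.com/rdma.local_block_id
          operator: In
          values:
          - 5tjxbt5s6ua   # replace with the value from your own nodes
```
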
### Using node affinity

When using node affinity, you will need to provide the values of the `oci.oraclecloud.com/rdma.local_block_id`, `oci.oraclecloud.com/rdma.network_block_id`, and `oci.oraclecloud.com/rdma.hpc_island_id` labels. Instead of hardcoding them, you can use tools like `sed` or `yq` to substitute them when you're scheduling jobs (see the sketch after the example below), or if you're using Helm, you can templatize those values.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-affinity-example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: node-affinity-app
  template:
    metadata:
      labels:
        app: node-affinity-app
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: oci.oraclecloud.com/rdma.local_block_id
                operator: In
                values:
                - 5tjxbt5s6ua
          - weight: 50
            preference:
              matchExpressions:
              - key: oci.oraclecloud.com/rdma.network_block_id
                operator: In
                values:
                - 7xmzl5p5wba
          - weight: 25
            preference:
              matchExpressions:
              - key: oci.oraclecloud.com/rdma.hpc_island_id
                operator: In
                values:
                - af7ubvouuyq
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"
```

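As noted above, you can substitute the label values at submission time instead of hardcoding them. A minimal sketch using `kubectl` and `sed`, assuming the manifest above is saved as `node-affinity-example.yaml` (the file name and node name are illustrative):

```
# Read the Local Block label from one of your nodes (dots in the label key are escaped for jsonpath)
LOCAL_BLOCK=$(kubectl get node <node-name> \
  -o jsonpath='{.metadata.labels.oci\.oraclecloud\.com/rdma\.local_block_id}')

# Swap the placeholder value in the manifest and apply it
sed "s/5tjxbt5s6ua/${LOCAL_BLOCK}/" node-affinity-example.yaml | kubectl apply -f -
```
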
### Using pod affinity

When using pod affinity, you rely on the `topologyKey` instead of node label values, so you don't need to provide the values of the `oci.oraclecloud.com/rdma.local_block_id`, `oci.oraclecloud.com/rdma.network_block_id`, and `oci.oraclecloud.com/rdma.hpc_island_id` labels.

> [!NOTE]
> Inter-pod affinity and anti-affinity require substantial amounts of processing, which can slow down scheduling in large clusters significantly. We do not recommend using them in clusters larger than several hundred nodes.
> Pod anti-affinity also requires nodes to be consistently labeled; in other words, every node in the cluster must have an appropriate label matching the `topologyKey`. If some or all nodes are missing the specified `topologyKey` label, it can lead to unintended behavior.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-affinity-example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pod-affinity-app
  template:
    metadata:
      labels:
        app: pod-affinity-app
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - pod-affinity-app
              topologyKey: oci.oraclecloud.com/rdma.local_block_id
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - pod-affinity-app
              topologyKey: oci.oraclecloud.com/rdma.network_block_id
          - weight: 25
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - pod-affinity-app
              topologyKey: oci.oraclecloud.com/rdma.hpc_island_id
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"
```

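After applying the Deployment, you can check how the replicas were placed relative to the locality labels (the manifest file name is illustrative):

```
kubectl apply -f pod-affinity-example.yaml

# See which nodes the pods landed on, then compare those nodes' Local Block labels
kubectl get pods -l app=pod-affinity-app -o wide
kubectl get nodes -L oci.oraclecloud.com/rdma.local_block_id
```
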
### Using `kueue`

You will need to [enable the feature gate](https://kueue.sigs.k8s.io/docs/installation/#change-the-feature-gates-configuration) for [Topology Aware Scheduling (TAS)](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling) in Kueue. Topology Aware Scheduling has been available as an alpha feature since Kueue v0.9.

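If you installed Kueue with the default manifests, one way to enable the gate is to add the flag to the controller manager's arguments; the namespace and Deployment name below assume a default installation (follow the linked installation docs for Helm or other setups):

```
kubectl -n kueue-system edit deployment kueue-controller-manager
# then add the following to the manager container's args:
#   --feature-gates=TopologyAwareScheduling=true
```
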
The following example uses `node.kubernetes.io/instance-type: "BM.GPU.H100.8"` to select H100 nodes, but you can use any label that exists on all of the nodes you're targeting with the Resource Flavor.

#### Create a Topology
```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: "oci-topology"
spec:
  levels:
  - nodeLabel: "oci.oraclecloud.com/rdma.hpc_island_id"
  - nodeLabel: "oci.oraclecloud.com/rdma.network_block_id"
  - nodeLabel: "oci.oraclecloud.com/rdma.local_block_id"
  - nodeLabel: "kubernetes.io/hostname"
```

#### Create a Resource Flavor
```yaml
kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta1
metadata:
  name: "tas-flavor"
spec:
  nodeLabels:
    node.kubernetes.io/instance-type: "BM.GPU.H100.8"
  topologyName: "oci-topology"
```

#### Create a Cluster Queue
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "tas-cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "tas-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 100
      - name: "memory"
        nominalQuota: 100Gi
```

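Since the nodes targeted here are GPU shapes, you will likely also want the queue's quota to cover the GPU resource so that GPU requests are counted against it; a sketch of the same resource group extended with `nvidia.com/gpu` (the quota values are illustrative):

```yaml
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "tas-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 100
      - name: "memory"
        nominalQuota: 100Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 64
```
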
#### Create a Local Queue
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: tas-user-queue
spec:
  clusterQueue: tas-cluster-queue
```

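Note that `LocalQueue` is a namespaced object, so create it in the namespace where your jobs will run. A sketch applying the objects above in order (the file names and the `default` namespace are illustrative):

```
kubectl apply -f topology.yaml
kubectl apply -f resource-flavor.yaml
kubectl apply -f cluster-queue.yaml
kubectl apply -f local-queue.yaml -n default
```
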
#### Run example job

The `kueue.x-k8s.io/podset-preferred-topology` annotation indicates that a PodSet requires Topology Aware Scheduling, but scheduling all of its pods on nodes within the same topology domain is a preference rather than a requirement. The levels are evaluated one by one, going up from the level indicated by the annotation. If the PodSet cannot fit within a given topology domain, the next topology level up is considered. If the PodSet cannot fit at the highest topology level, it gets admitted distributed among multiple topology domains.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: tas-sample-preferred
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
spec:
  parallelism: 2
  completions: 2
  completionMode: Indexed
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-preferred-topology: "oci.oraclecloud.com/rdma.local_block_id"
    spec:
      containers:
      - name: dummy-job
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["pause"]
        resources:
          requests:
            cpu: "1"
            memory: "200Mi"
      restartPolicy: Never
```

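Because the Job uses `generateName`, submit it with `kubectl create` rather than `kubectl apply`, then check that Kueue admitted the workload (the file name is illustrative):

```
kubectl create -f tas-sample-preferred-job.yaml

# The Job is created in a suspended state and admitted by Kueue with a topology assignment
kubectl get workloads
kubectl get jobs
```
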
