Commit 3ac3419

test: import remaining topology e2e tests (TAS8-TAS16)

Import 9 topology tests from test/tas-e2e-advanced branch:
- TAS8: Full hierarchy with cascading constraints
- TAS9: PCS + PCLQ constraint
- TAS10: PCSG scaling with topology constraints
- TAS11: PCSG + PCLQ, no parent constraint
- TAS12: Large scaling ratio
- TAS13: Insufficient nodes (error case)
- TAS14: Multi-replica with rack constraint
- TAS15: Disaggregated inference with multiple PCSGs
- TAS16: Multi-replica PCS with 3-level hierarchy

Refactored to match current test conventions:
- Use DeployWorkloadAndGetPods helper
- Use createTopologyTestContext for setup
- Replace hardcoded strings with constants
- Use single-step GetPodGroupForBasePodGangReplica
- Follow sequential test numbering (TAS8-TAS16)

Also imported 9 YAML manifest files for the new tests.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
1 parent 5fd20af commit 3ac3419

File tree

10 files changed: +1556 −0 lines changed


operator/e2e/tests/topology_test.go

Lines changed: 822 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 179 additions & 0 deletions
# Workload: Disaggregated Inference - Multi-replica PCS with 3-level topology hierarchy
# Test scenario: PCS (block) with 2 replicas, 2 PCSGs (rack), and PCLQ-level constraint (host)
---
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: tas-disagg-inference
  labels:
    app: tas-disagg-inference
spec:
  replicas: 2
  template:
    topologyConstraint:
      packDomain: block
    podCliqueScalingGroups:
      - name: decoder
        replicas: 2
        minAvailable: 1
        topologyConstraint:
          packDomain: rack
        cliqueNames:
          - dworker
          - dleader
      - name: prefill
        replicas: 2
        minAvailable: 1
        topologyConstraint:
          packDomain: rack
        cliqueNames:
          - pworker
          - pleader
    cliques:
      - name: dworker
        labels:
          kai.scheduler/queue: test
        spec:
          roleName: dworker
          replicas: 1
          minAvailable: 1
          podSpec:
            schedulerName: kai-scheduler
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node_role.e2e.grove.nvidia.com
                          operator: In
                          values:
                            - agent
            tolerations:
              - key: node_role.e2e.grove.nvidia.com
                operator: Equal
                value: agent
                effect: NoSchedule
            containers:
              - name: worker
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
      - name: dleader
        labels:
          kai.scheduler/queue: test
        spec:
          roleName: dleader
          replicas: 1
          minAvailable: 1
          podSpec:
            schedulerName: kai-scheduler
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node_role.e2e.grove.nvidia.com
                          operator: In
                          values:
                            - agent
            tolerations:
              - key: node_role.e2e.grove.nvidia.com
                operator: Equal
                value: agent
                effect: NoSchedule
            containers:
              - name: leader
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
      - name: pworker
        topologyConstraint:
          packDomain: host
        labels:
          kai.scheduler/queue: test
        spec:
          roleName: pworker
          replicas: 1
          minAvailable: 1
          podSpec:
            schedulerName: kai-scheduler
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node_role.e2e.grove.nvidia.com
                          operator: In
                          values:
                            - agent
            tolerations:
              - key: node_role.e2e.grove.nvidia.com
                operator: Equal
                value: agent
                effect: NoSchedule
            containers:
              - name: worker
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
      - name: pleader
        labels:
          kai.scheduler/queue: test
        spec:
          roleName: pleader
          replicas: 1
          minAvailable: 1
          podSpec:
            schedulerName: kai-scheduler
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node_role.e2e.grove.nvidia.com
                          operator: In
                          values:
                            - agent
            tolerations:
              - key: node_role.e2e.grove.nvidia.com
                operator: Equal
                value: agent
                effect: NoSchedule
            containers:
              - name: leader
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
      - name: router
        labels:
          kai.scheduler/queue: test
        spec:
          roleName: router
          replicas: 2
          minAvailable: 2
          podSpec:
            schedulerName: kai-scheduler
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node_role.e2e.grove.nvidia.com
                          operator: In
                          values:
                            - agent
            tolerations:
              - key: node_role.e2e.grove.nvidia.com
                operator: Equal
                value: agent
                effect: NoSchedule
            containers:
              - name: router
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
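As a sanity check on the scale of this workload, the expected pod count follows from the replica fields in the manifest above. The sketch below hand-transcribes those numbers (it does not parse the YAML) and assumes one pod per clique replica:

```python
# Pod-count arithmetic for the multi-replica manifest above.
# Values are hand-transcribed from the spec, not parsed from the file.
pcs_replicas = 2  # spec.replicas

# PCSG name -> (PCSG replicas, pods per PCSG replica = sum of member clique replicas)
scaling_groups = {
    "decoder": (2, 2),  # dworker(1) + dleader(1)
    "prefill": (2, 2),  # pworker(1) + pleader(1)
}
standalone_cliques = {"router": 2}  # cliques outside any PCSG

pods_per_pcs_replica = (
    sum(reps * per for reps, per in scaling_groups.values())
    + sum(standalone_cliques.values())
)
total_pods = pcs_replicas * pods_per_pcs_replica
print(pods_per_pcs_replica, total_pods)  # prints: 10 20
```

So each PCS replica carries 10 pods, and the two-replica test must find room for 20 pods under the block/rack/host constraints.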
Lines changed: 177 additions & 0 deletions
# Workload: Disaggregated Inference - PCS with PCSG and multiple cliques
# Test scenario: PCS (block) with 2 PCSGs (rack) containing disaggregated inference components
---
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: tas-disagg-inference
  labels:
    app: tas-disagg-inference
spec:
  replicas: 1
  template:
    topologyConstraint:
      packDomain: block
    podCliqueScalingGroups:
      - name: decoder
        replicas: 2
        minAvailable: 1
        topologyConstraint:
          packDomain: rack
        cliqueNames:
          - dworker
          - dleader
      - name: prefill
        replicas: 2
        minAvailable: 1
        topologyConstraint:
          packDomain: rack
        cliqueNames:
          - pworker
          - pleader
    cliques:
      - name: dworker
        labels:
          kai.scheduler/queue: test
        spec:
          roleName: dworker
          replicas: 1
          minAvailable: 1
          podSpec:
            schedulerName: kai-scheduler
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node_role.e2e.grove.nvidia.com
                          operator: In
                          values:
                            - agent
            tolerations:
              - key: node_role.e2e.grove.nvidia.com
                operator: Equal
                value: agent
                effect: NoSchedule
            containers:
              - name: worker
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
      - name: dleader
        labels:
          kai.scheduler/queue: test
        spec:
          roleName: dleader
          replicas: 1
          minAvailable: 1
          podSpec:
            schedulerName: kai-scheduler
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node_role.e2e.grove.nvidia.com
                          operator: In
                          values:
                            - agent
            tolerations:
              - key: node_role.e2e.grove.nvidia.com
                operator: Equal
                value: agent
                effect: NoSchedule
            containers:
              - name: leader
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
      - name: pworker
        labels:
          kai.scheduler/queue: test
        spec:
          roleName: pworker
          replicas: 1
          minAvailable: 1
          podSpec:
            schedulerName: kai-scheduler
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node_role.e2e.grove.nvidia.com
                          operator: In
                          values:
                            - agent
            tolerations:
              - key: node_role.e2e.grove.nvidia.com
                operator: Equal
                value: agent
                effect: NoSchedule
            containers:
              - name: worker
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
      - name: pleader
        labels:
          kai.scheduler/queue: test
        spec:
          roleName: pleader
          replicas: 1
          minAvailable: 1
          podSpec:
            schedulerName: kai-scheduler
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node_role.e2e.grove.nvidia.com
                          operator: In
                          values:
                            - agent
            tolerations:
              - key: node_role.e2e.grove.nvidia.com
                operator: Equal
                value: agent
                effect: NoSchedule
            containers:
              - name: leader
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
      - name: router
        labels:
          kai.scheduler/queue: test
        spec:
          roleName: router
          replicas: 2
          minAvailable: 2
          podSpec:
            schedulerName: kai-scheduler
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node_role.e2e.grove.nvidia.com
                          operator: In
                          values:
                            - agent
            tolerations:
              - key: node_role.e2e.grove.nvidia.com
                operator: Equal
                value: agent
                effect: NoSchedule
            containers:
              - name: router
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
