
test: remaining topology e2e tests #383

Merged
Ronkahn21 merged 27 commits into ai-dynamo:main from Ronkahn21:test/import-remaining-topology-tests
Feb 8, 2026

Conversation


@Ronkahn21 Ronkahn21 commented Feb 1, 2026

What type of PR is this?

/kind feature
/kind cleanup

What this PR does / why we need it:

This PR imports 9 additional topology-aware scheduling (TAS) end-to-end tests.
New Tests Added:

  • TAS8: Full 3-level hierarchy with cascading constraints (PCS→PCSG→PCLQ)
  • TAS9: PCS + PCLQ combined constraints
  • TAS10: PCSG scaling with topology constraints (3 replicas)
  • TAS11: PCSG + PCLQ constraints without parent PCS constraint
  • TAS12: Large scaling ratio (replicas=10, minAvailable=3)
  • TAS13: Error case - insufficient nodes for constraint (all pods pending)
  • TAS14: Multi-replica PCS with rack constraints
  • TAS15: Disaggregated inference with multiple PCSGs (decoder + prefill + router)
  • TAS16: Multi-replica PCS with full 3-level topology hierarchy

Refactoring Applied:

  • Used `DeployWorkloadAndGetPods` helper for consistency (see the usage sketch after this list)
  • Used `createTopologyTestContext` for test setup
  • Replaced hardcoded strings with constants (`setup.TopologyLabel*`, `LabelPodClique*`)
  • Updated to single-step KAI PodGroup retrieval pattern
  • Followed sequential test numbering (TAS8-TAS16)
  • Added missing import: `k8s.io/utils/ptr`
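
A rough sketch of how these helpers compose, for reviewers unfamiliar with the test setup; both signatures here are assumptions for illustration, not this repository's actual API:

```go
package e2e

import "testing"

// Sketch only: createTopologyTestContext and DeployWorkloadAndGetPods are
// assumed helper signatures, not taken from this repository.
func TestTopologyHelperUsage(t *testing.T) {
	// Assumed: builds a namespaced test context with the topology labels
	// (setup.TopologyLabel*) pre-configured.
	tc := createTopologyTestContext(t)

	// Assumed: applies the manifest, waits for the workload's pods to be
	// scheduled, and returns them.
	pods := DeployWorkloadAndGetPods(tc, "tas-hierarchy.yaml")

	for _, pod := range pods {
		if pod.Spec.NodeName == "" {
			t.Fatalf("pod %s was not scheduled", pod.Name)
		}
	}
}
```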

Test Coverage:
All 16 TAS tests pass, validating topology-aware scheduling across scenarios that range from simple constraints to complex multi-replica disaggregated inference workloads.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  • All tests follow existing conventions from TAS1-TAS7
  • Tests are self-contained with their own YAML manifests
  • No API changes - only test additions
  • Lint checks pass with 0 issues
  • All 16 tests verified on 30-node test cluster

Does this PR introduce an API change?

```release-note
NONE
```

Additional documentation e.g., enhancement proposals, usage docs, etc.:

```docs
NONE
```

@Ronkahn21 Ronkahn21 changed the title from "test: import remaining topology e2e tests (TAS8-TAS16)" to "test: remaining topology e2e tests" on Feb 1, 2026
@Ronkahn21 Ronkahn21 force-pushed the test/import-remaining-topology-tests branch from 456c89d to 3ac3419 on February 2, 2026 12:44

@gflarity gflarity left a comment


I think the numbering issues predate this PR, might as well just clean them all up at once. Also, across all the tests, `verifyLabeledPodsInTopologyDomain` will make things a bit DRYer.

gflarity
gflarity previously approved these changes Feb 4, 2026
shayasoolin
shayasoolin previously approved these changes Feb 5, 2026
danbar2
danbar2 previously approved these changes Feb 5, 2026
@Ronkahn21 Ronkahn21 dismissed stale reviews from danbar2 and shayasoolin via 495b515 February 5, 2026 13:27
danbar2
danbar2 previously approved these changes Feb 5, 2026
shayasoolin
shayasoolin previously approved these changes Feb 5, 2026
@Ronkahn21 Ronkahn21 dismissed stale reviews from shayasoolin and danbar2 via b3cc590 February 5, 2026 13:49
Ronkahn21 and others added 11 commits February 5, 2026 16:04
Import 9 topology tests from test/tas-e2e-advanced branch:
- TAS8: Full hierarchy with cascading constraints
- TAS9: PCS + PCLQ constraint
- TAS10: PCSG scaling with topology constraints
- TAS11: PCSG + PCLQ, no parent constraint
- TAS12: Large scaling ratio
- TAS13: Insufficient nodes (error case)
- TAS14: Multi-replica with rack constraint
- TAS15: Disaggregated inference with multiple PCSGs
- TAS16: Multi-replica PCS with 3-level hierarchy

Refactored to match current test conventions:
- Use DeployWorkloadAndGetPods helper
- Use createTopologyTestContext for setup
- Replace hardcoded strings with constants
- Use single-step GetPodGroupForBasePodGangReplica
- Follow sequential test numbering (TAS8-TAS16)

Also imported 9 YAML manifest files for the new tests.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Co-authored-by: Geoff Flarity <geoff.flarity@gmail.com>
Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Eliminate 50 lines of duplicated code by replacing separate PCS replica
0 and replica 1 verification blocks with a loop that iterates over both
replicas. Maintains identical functionality while improving maintainability.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ubgroup creation

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Replace duplicated replica verification blocks with loops in 6 tests:
- TAS8: nested loops for PCLQ (2x2) and PCSG (2) verifications
- TAS10: loop for 3 PCSG replicas
- TAS11: loop for 2 PCSG replicas
- TAS14: loops for PCS replica pods and PodGroups
- TAS15: nested loops for 2 PCSG types × 2 replicas
- TAS16: nested loops for PCSG replicas within PCS loop

Total: 121 lines eliminated (49% reduction in verification code)
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
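
The pattern behind this refactor is plain nested iteration; a minimal sketch (`verifyReplicaPodsInDomain`, `tc`, and `TestContext` are hypothetical stand-ins for the repo's helpers):

```go
// Nested-loop verification as in TAS8: 2 PCSG replicas x 2 PCLQs each.
func verifyHierarchy(tc *TestContext) {
	for pcsgReplica := 0; pcsgReplica < 2; pcsgReplica++ {
		for pclqReplica := 0; pclqReplica < 2; pclqReplica++ {
			verifyReplicaPodsInDomain(tc, pcsgReplica, pclqReplica)
		}
	}
}
```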
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ation

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Reduce code duplication in topology_test.go by introducing reusable
helper functions in utils package.

Changes:
- Add VerifyPCSGReplicasInTopologyDomain helper for PCSG verification
- Add VerifyMultiTypePCSGReplicas helper for multi-type PCSG verification
- Add GetPodGroupOrFail test helper for cleaner PodGroup retrieval
- Update 15 tests (TAS2-TAS16) to use new helpers

Impact:
- Removes 133 lines of duplicated code
- Improves maintainability and consistency
- All tests passing

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
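
The fail-fast retrieval pattern generalizes to any object; a self-contained sketch using controller-runtime (the repo's `GetPodGroupOrFail` is presumably a PodGroup-typed wrapper around something like this):

```go
package utils

import (
	"context"
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getOrFail fetches any cluster object by key and fails the test
// immediately on error, so callers can verify the result inline.
func getOrFail[T client.Object](t *testing.T, ctx context.Context, c client.Client, key client.ObjectKey, obj T) T {
	t.Helper()
	if err := c.Get(ctx, key, obj); err != nil {
		t.Fatalf("failed to get %T %s: %v", obj, key, err)
	}
	return obj
}
```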
- Broaden PCS constraint from rack to block in tas-pcsg-scale.yaml
- Rename disagg-inference files to use generic PCS-PCSG naming
- Update test references for renamed YAML files

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Replace nginx:alpine-slim with busybox:latest + sleep infinity
- Reduce memory: 30Mi → 10Mi, 80Mi → 20Mi
- Add CPU limit: 10m per container
- Fixes TAS16 CI timeout due to node instability
- Reduces total resource pressure by ~66%
- Applies to all 14 TAS tests except tas-insuffic.yaml

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 force-pushed the test/import-remaining-topology-tests branch from b3cc590 to 1673428 Compare February 5, 2026 14:11
Add helper function VerifyScaledPCSGReplicaTopology to verify topology
constraints on scaled PCSG replicas. The function implements the key rule:
a scaled PodGroup uses the PCS constraint only if the PCSG has no
constraint of its own.

Changes:
- Add PCSGCliqueConfig and ScaledPCSGConfig types to utils
- Add VerifyScaledPCSGReplicaTopology helper function
- Update TAS10 to verify scaled replicas 1-2 using lo.ForEach
- Update TAS15 to verify decoder and prefill scaled replicas
- Use proper types throughout, no inline structs

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
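
The rule reduces to a simple fallback; a sketch with illustrative names and string-typed constraints (the repo's actual types will differ):

```go
// expectedConstraint returns the topology constraint a scaled PCSG
// replica's PodGroup should carry: the PCSG's own constraint when set,
// otherwise the parent PCS constraint.
func expectedConstraint(pcsgConstraint, pcsConstraint string) string {
	if pcsgConstraint != "" {
		return pcsgConstraint
	}
	return pcsConstraint
}
```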
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Use PCSG constraint when present, otherwise PCS constraint

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add podAntiAffinity to prevent same-type pods from colocating:
- tas-pcsg-scale.yaml: worker clique in PCSG
- tas-host-level.yaml: standalone worker clique
- tas-pcs-multi-pcsg.yaml: 5 cliques (decoder, prefill, router)

Each clique's pods spread across different hosts using
grove.io/podclique label selector with kubernetes.io/hostname
topology key.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
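
Expressed with the standard Kubernetes Go API, the anti-affinity term described above looks roughly like this (the manifests carry the equivalent YAML stanza; the clique name is illustrative):

```go
package e2e

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// antiAffinityForClique forbids two pods carrying the same
// grove.io/podclique label value from landing on the same host.
func antiAffinityForClique(clique string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"grove.io/podclique": clique},
				},
				TopologyKey: "kubernetes.io/hostname",
			}},
		},
	}
}
```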
Add podAntiAffinity to 11 remaining TAS test files to prevent
same-type pods from colocating on same host:

- tas-hierarchy.yaml (2 cliques in PCSG)
- tas-indep-clq.yaml (2 standalone cliques)
- tas-large-scale.yaml (1 clique in PCSG)
- tas-multirep.yaml (1 standalone clique)
- tas-no-constraint.yaml (1 clique in PCSG)
- tas-pcs-multi-pcsg-multi-replica.yaml (5 cliques)
- tas-pcs-pclq.yaml (1 standalone clique)
- tas-pcsg-pclq.yaml (1 clique in PCSG)
- tas-sl-pcs-only.yaml (2 cliques)
- tas-sl-pcsg-only.yaml (2 cliques)
- tas-standalone-pclq-only-pcs-zone.yaml (1 standalone clique)

Each clique uses grove.io/podclique label selector with
kubernetes.io/hostname topology key to spread pods across
different hosts. Skipped tas-insuffic.yaml (insufficient nodes test).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove podAntiAffinity constraints that conflict with host-level
topology packing requirements. Add cpu: 500m requests to naturally
limit pod density per node without blocking required pod colocation.

Fixed tests: TAS5, TAS8, TAS9, TAS11, TAS12, TAS16

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Update CPU requests from 500m to 1000m across multiple TAS YAML files to better align resource allocation with test requirements.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Lowered CPU requests for multiple TAS YAML files to optimize resource usage during testing.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove CPU requests and use memory: 51m to limit pod density.
With 150m per node, this ensures max 2 pods per node
(2×51m=102m fits, 3×51m=153m exceeds capacity).

This avoids CPU-related instability on GitHub runners while
maintaining node pressure limits through memory constraints.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Adjust memory from 51m to 40m to account for system overhead.
With 150m nodes and ~50m system overhead, 40m per pod allows
2 pods (80m) while preventing a 3rd (120m > 100m available).

Fixes TAS5 scheduling issues where 51m was too tight.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Change invalid memory unit "40m" to valid "40Mi". The "m" suffix
is for CPU millicores, not memory. Memory requires Mi/M/Gi units.

With 40Mi per pod on 150MB nodes (~95Mi usable after overhead),
this ensures max 2 pods per node while preventing scheduling errors.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
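
In Go form, the final request these commits converge on; the comment restates the commit's arithmetic:

```go
package e2e

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// With ~95Mi usable per 150MB node after system overhead, a 40Mi request
// admits two pods per node (80Mi) but not a third (120Mi).
var tasPodResources = corev1.ResourceRequirements{
	Requests: corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("40Mi"),
	},
}
```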

@gflarity gflarity left a comment


Another pass focusing on bugs/logic issues I spotted.

Fix critical bugs and incorrect comments:
- Fix memory indentation in tas-standalone (change to 40Mi)
- Fix kai_topology.go formula: use minAvailable instead of -1
- Add MinAvailable field to ScaledPCSGConfig
- Update TAS10/TAS15 callers to pass MinAvailable
- Fix TAS10 block constraint verification (was checking rack)
- Fix TAS10 per-replica block verification
- Fix tas-large-scale.yaml comment (remove incorrect rack constraint)

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove conflicting podAntiAffinity and change memory from 20Mi to
40Mi to match other TAS tests. This ensures max 2 pods per node
through memory limits without anti-affinity conflicts.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 merged commit 34d129e into ai-dynamo:main Feb 8, 2026
17 of 18 checks passed
