
test: remaining topology e2e tests #383

Merged
Ronkahn21 merged 27 commits into ai-dynamo:main from Ronkahn21:test/import-remaining-topology-tests
Feb 8, 2026

Conversation


@Ronkahn21 Ronkahn21 commented Feb 1, 2026

What type of PR is this?

/kind feature
/kind cleanup

What this PR does / why we need it:

This PR imports 9 additional topology-aware scheduling (TAS) end-to-end tests.
New Tests Added:

  • TAS8: Full 3-level hierarchy with cascading constraints (PCS→PCSG→PCLQ)
  • TAS9: PCS + PCLQ combined constraints
  • TAS10: PCSG scaling with topology constraints (3 replicas)
  • TAS11: PCSG + PCLQ constraints without parent PCS constraint
  • TAS12: Large scaling ratio (replicas=10, minAvailable=3)
  • TAS13: Error case - insufficient nodes for constraint (all pods pending)
  • TAS14: Multi-replica PCS with rack constraints
  • TAS15: Disaggregated inference with multiple PCSGs (decoder + prefill + router)
  • TAS16: Multi-replica PCS with full 3-level topology hierarchy

Refactoring Applied:

  • Used `DeployWorkloadAndGetPods` helper for consistency (see the usage sketch after this list)
  • Used `createTopologyTestContext` for test setup
  • Replaced hardcoded strings with constants (`setup.TopologyLabel*`, `LabelPodClique*`)
  • Updated to single-step KAI PodGroup retrieval pattern
  • Followed sequential test numbering (TAS8-TAS16)
  • Added missing import: `k8s.io/utils/ptr`
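
A rough sketch of how these helpers compose, for reviewers unfamiliar with the test setup; both signatures here are assumptions for illustration, not this repository's actual API:

```go
package e2e

import "testing"

// Sketch only: createTopologyTestContext and DeployWorkloadAndGetPods are
// assumed helper signatures, not taken from this repository.
func TestTopologyHelperUsage(t *testing.T) {
	// Assumed: builds a namespaced test context with the topology labels
	// (setup.TopologyLabel*) pre-configured.
	tc := createTopologyTestContext(t)

	// Assumed: applies the manifest, waits for the workload's pods to be
	// scheduled, and returns them.
	pods := DeployWorkloadAndGetPods(tc, "tas-hierarchy.yaml")

	for _, pod := range pods {
		if pod.Spec.NodeName == "" {
			t.Fatalf("pod %s was not scheduled", pod.Name)
		}
	}
}
```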

Test Coverage:
All 16 TAS tests pass, validating topology-aware scheduling across scenarios that range from simple constraints to complex multi-replica disaggregated inference workloads.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  • All tests follow existing conventions from TAS1-TAS7
  • Tests are self-contained with their own YAML manifests
  • No API changes - only test additions
  • Lint checks pass with 0 issues
  • All 16 tests verified on 30-node test cluster

Does this PR introduce an API change?

```release-note
NONE
```

Additional documentation e.g., enhancement proposals, usage docs, etc.:

```docs
NONE
```

@Ronkahn21 Ronkahn21 changed the title from "test: import remaining topology e2e tests (TAS8-TAS16)" to "test: remaining topology e2e tests" on Feb 1, 2026
@Ronkahn21 Ronkahn21 force-pushed the test/import-remaining-topology-tests branch from 456c89d to 3ac3419 on February 2, 2026 12:44

@gflarity gflarity left a comment


I think the numbering issues predate this PR, might as well just clean them all up at once. Also, across all the tests, `verifyLabeledPodsInTopologyDomain` will make things a bit DRYer.

gflarity
gflarity previously approved these changes Feb 4, 2026
shayasoolin
shayasoolin previously approved these changes Feb 5, 2026
danbar2
danbar2 previously approved these changes Feb 5, 2026
@Ronkahn21 Ronkahn21 dismissed stale reviews from danbar2 and shayasoolin via 495b515 February 5, 2026 13:27
danbar2
danbar2 previously approved these changes Feb 5, 2026
shayasoolin
shayasoolin previously approved these changes Feb 5, 2026
@Ronkahn21 Ronkahn21 dismissed stale reviews from shayasoolin and danbar2 via b3cc590 February 5, 2026 13:49
Ronkahn21 and others added 11 commits February 5, 2026 16:04
Import 9 topology tests from test/tas-e2e-advanced branch:
- TAS8: Full hierarchy with cascading constraints
- TAS9: PCS + PCLQ constraint
- TAS10: PCSG scaling with topology constraints
- TAS11: PCSG + PCLQ, no parent constraint
- TAS12: Large scaling ratio
- TAS13: Insufficient nodes (error case)
- TAS14: Multi-replica with rack constraint
- TAS15: Disaggregated inference with multiple PCSGs
- TAS16: Multi-replica PCS with 3-level hierarchy

Refactored to match current test conventions:
- Use DeployWorkloadAndGetPods helper
- Use createTopologyTestContext for setup
- Replace hardcoded strings with constants
- Use single-step GetPodGroupForBasePodGangReplica
- Follow sequential test numbering (TAS8-TAS16)

Also imported 9 YAML manifest files for the new tests.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Co-authored-by: Geoff Flarity <geoff.flarity@gmail.com>
Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Eliminate 50 lines of duplicated code by replacing separate PCS replica
0 and replica 1 verification blocks with a loop that iterates over both
replicas. Maintains identical functionality while improving maintainability.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ubgroup creation

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Replace duplicated replica verification blocks with loops in 6 tests:
- TAS8: nested loops for PCLQ (2x2) and PCSG (2) verifications
- TAS10: loop for 3 PCSG replicas
- TAS11: loop for 2 PCSG replicas
- TAS14: loops for PCS replica pods and PodGroups
- TAS15: nested loops for 2 PCSG types × 2 replicas
- TAS16: nested loops for PCSG replicas within PCS loop

Total: 121 lines eliminated (49% reduction in verification code)
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
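
The pattern behind this refactor is plain nested iteration; a minimal sketch (`verifyReplicaPodsInDomain`, `tc`, and `TestContext` are hypothetical stand-ins for the repo's helpers):

```go
// Nested-loop verification as in TAS8: 2 PCSG replicas x 2 PCLQs each.
func verifyHierarchy(tc *TestContext) {
	for pcsgReplica := 0; pcsgReplica < 2; pcsgReplica++ {
		for pclqReplica := 0; pclqReplica < 2; pclqReplica++ {
			verifyReplicaPodsInDomain(tc, pcsgReplica, pclqReplica)
		}
	}
}
```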
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ation

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Reduce code duplication in topology_test.go by introducing reusable
helper functions in utils package.

Changes:
- Add VerifyPCSGReplicasInTopologyDomain helper for PCSG verification
- Add VerifyMultiTypePCSGReplicas helper for multi-type PCSG verification
- Add GetPodGroupOrFail test helper for cleaner PodGroup retrieval
- Update 15 tests (TAS2-TAS16) to use new helpers

Impact:
- Removes 133 lines of duplicated code
- Improves maintainability and consistency
- All tests passing

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
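
The fail-fast retrieval pattern generalizes to any object; a self-contained sketch using controller-runtime (the repo's `GetPodGroupOrFail` is presumably a PodGroup-typed wrapper around something like this):

```go
package utils

import (
	"context"
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getOrFail fetches any cluster object by key and fails the test
// immediately on error, so callers can verify the result inline.
func getOrFail[T client.Object](t *testing.T, ctx context.Context, c client.Client, key client.ObjectKey, obj T) T {
	t.Helper()
	if err := c.Get(ctx, key, obj); err != nil {
		t.Fatalf("failed to get %T %s: %v", obj, key, err)
	}
	return obj
}
```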
- Broaden PCS constraint from rack to block in tas-pcsg-scale.yaml
- Rename disagg-inference files to use generic PCS-PCSG naming
- Update test references for renamed YAML files

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Replace nginx:alpine-slim with busybox:latest + sleep infinity
- Reduce memory: 30Mi → 10Mi, 80Mi → 20Mi
- Add CPU limit: 10m per container
- Fixes TAS16 CI timeout due to node instability
- Reduces total resource pressure by ~66%
- Applies to all 14 TAS tests except tas-insuffic.yaml

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 force-pushed the test/import-remaining-topology-tests branch from b3cc590 to 1673428 Compare February 5, 2026 14:11
Add helper function VerifyScaledPCSGReplicaTopology to verify topology
constraints on scaled PCSG replicas. The function implements the key rule:
a scaled PodGroup uses the PCS constraint only if the PCSG has no
constraint of its own.

Changes:
- Add PCSGCliqueConfig and ScaledPCSGConfig types to utils
- Add VerifyScaledPCSGReplicaTopology helper function
- Update TAS10 to verify scaled replicas 1-2 using lo.ForEach
- Update TAS15 to verify decoder and prefill scaled replicas
- Use proper types throughout, no inline structs

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
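
The rule reduces to a simple fallback; a sketch with illustrative names and string-typed constraints (the repo's actual types will differ):

```go
// expectedConstraint returns the topology constraint a scaled PCSG
// replica's PodGroup should carry: the PCSG's own constraint when set,
// otherwise the parent PCS constraint.
func expectedConstraint(pcsgConstraint, pcsConstraint string) string {
	if pcsgConstraint != "" {
		return pcsgConstraint
	}
	return pcsConstraint
}
```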
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Use PCSG constraint when present, otherwise PCS constraint

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add podAntiAffinity to prevent same-type pods from colocating:
- tas-pcsg-scale.yaml: worker clique in PCSG
- tas-host-level.yaml: standalone worker clique
- tas-pcs-multi-pcsg.yaml: 5 cliques (decoder, prefill, router)

Each clique's pods spread across different hosts using
grove.io/podclique label selector with kubernetes.io/hostname
topology key.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
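
Expressed with the standard Kubernetes Go API, the anti-affinity term described above looks roughly like this (the manifests carry the equivalent YAML stanza; the clique name is illustrative):

```go
package e2e

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// antiAffinityForClique forbids two pods carrying the same
// grove.io/podclique label value from landing on the same host.
func antiAffinityForClique(clique string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"grove.io/podclique": clique},
				},
				TopologyKey: "kubernetes.io/hostname",
			}},
		},
	}
}
```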
Add podAntiAffinity to 11 remaining TAS test files to prevent
same-type pods from colocating on same host:

- tas-hierarchy.yaml (2 cliques in PCSG)
- tas-indep-clq.yaml (2 standalone cliques)
- tas-large-scale.yaml (1 clique in PCSG)
- tas-multirep.yaml (1 standalone clique)
- tas-no-constraint.yaml (1 clique in PCSG)
- tas-pcs-multi-pcsg-multi-replica.yaml (5 cliques)
- tas-pcs-pclq.yaml (1 standalone clique)
- tas-pcsg-pclq.yaml (1 clique in PCSG)
- tas-sl-pcs-only.yaml (2 cliques)
- tas-sl-pcsg-only.yaml (2 cliques)
- tas-standalone-pclq-only-pcs-zone.yaml (1 standalone clique)

Each clique uses grove.io/podclique label selector with
kubernetes.io/hostname topology key to spread pods across
different hosts. Skipped tas-insuffic.yaml (insufficient nodes test).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove podAntiAffinity constraints that conflict with host-level
topology packing requirements. Add cpu: 500m requests to naturally
limit pod density per node without blocking required pod colocation.

Fixed tests: TAS5, TAS8, TAS9, TAS11, TAS12, TAS16

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Update CPU requests from 500m to 1000m across multiple TAS YAML files to better align resource allocation with test requirements.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Lowered CPU requests for multiple TAS YAML files to optimize resource usage during testing.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove CPU requests and use memory: 51m to limit pod density.
With 150m per node, this ensures max 2 pods per node
(2×51m=102m fits, 3×51m=153m exceeds capacity).

This avoids CPU-related instability on GitHub runners while
maintaining node pressure limits through memory constraints.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Adjust memory from 51m to 40m to account for system overhead.
With 150m nodes and ~50m system overhead, 40m per pod allows
2 pods (80m) while preventing a 3rd (120m > 100m available).

Fixes TAS5 scheduling issues where 51m was too tight.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Change invalid memory unit "40m" to valid "40Mi". The "m" suffix
is for CPU millicores, not memory. Memory requires Mi/M/Gi units.

With 40Mi per pod on 150MB nodes (~95Mi usable after overhead),
this ensures max 2 pods per node while preventing scheduling errors.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
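
In Go form, the final request these commits converge on; the comment restates the commit's arithmetic:

```go
package e2e

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// With ~95Mi usable per 150MB node after system overhead, a 40Mi request
// admits two pods per node (80Mi) but not a third (120Mi).
var tasPodResources = corev1.ResourceRequirements{
	Requests: corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("40Mi"),
	},
}
```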

@gflarity gflarity left a comment


Another pass focusing on bugs/logic issues I spotted.

Fix critical bugs and incorrect comments:
- Fix memory indentation in tas-standalone (change to 40Mi)
- Fix kai_topology.go formula: use minAvailable instead of -1
- Add MinAvailable field to ScaledPCSGConfig
- Update TAS10/TAS15 callers to pass MinAvailable
- Fix TAS10 block constraint verification (was checking rack)
- Fix TAS10 per-replica block verification
- Fix tas-large-scale.yaml comment (remove incorrect rack constraint)

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove conflicting podAntiAffinity and change memory from 20Mi to
40Mi to match other TAS tests. This ensures max 2 pods per node
through memory limits without anti-affinity conflicts.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 merged commit 34d129e into ai-dynamo:main Feb 8, 2026
17 of 18 checks passed
