test: remaining topology e2e tests #383
Merged
Ronkahn21 merged 27 commits into ai-dynamo:main on Feb 8, 2026
Conversation
456c89d to 3ac3419 (compare)
gflarity reviewed Feb 2, 2026
gflarity reviewed Feb 3, 2026
gflarity (Contributor) left a comment:
I think the numbering issues predate this PR; might as well just clean them all up at once. Also, using verifyLabeledPodsInTopologyDomain across all the tests will make things a bit DRYer.
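For context, here is a minimal sketch of the kind of helper being suggested, written in a plain error-returning style; the signature and the node-label lookup are illustrative assumptions, not the repo's actual API:

```go
package utils

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// verifyLabeledPodsInTopologyDomain checks that every pod carrying labelKey=labelValue
// landed on a node that shares a single value for the given topology label
// (e.g. "topology.kubernetes.io/rack" or "kubernetes.io/hostname").
func verifyLabeledPodsInTopologyDomain(pods []corev1.Pod, nodes map[string]corev1.Node,
	labelKey, labelValue, topologyKey string) error {
	domain := ""
	for _, pod := range pods {
		if pod.Labels[labelKey] != labelValue {
			continue
		}
		node, ok := nodes[pod.Spec.NodeName]
		if !ok {
			return fmt.Errorf("pod %s is not scheduled to a known node", pod.Name)
		}
		nodeDomain := node.Labels[topologyKey]
		if domain == "" {
			domain = nodeDomain
			continue
		}
		if nodeDomain != domain {
			return fmt.Errorf("pod %s is in %s=%q, expected %q",
				pod.Name, topologyKey, nodeDomain, domain)
		}
	}
	return nil
}
```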
gflarity previously approved these changes Feb 4, 2026
shayasoolin reviewed Feb 5, 2026
shayasoolin previously approved these changes Feb 5, 2026
danbar2 previously approved these changes Feb 5, 2026
danbar2 previously approved these changes Feb 5, 2026
shayasoolin previously approved these changes Feb 5, 2026
Import 9 topology tests from test/tas-e2e-advanced branch:
- TAS8: Full hierarchy with cascading constraints
- TAS9: PCS + PCLQ constraint
- TAS10: PCSG scaling with topology constraints
- TAS11: PCSG + PCLQ, no parent constraint
- TAS12: Large scaling ratio
- TAS13: Insufficient nodes (error case)
- TAS14: Multi-replica with rack constraint
- TAS15: Disaggregated inference with multiple PCSGs
- TAS16: Multi-replica PCS with 3-level hierarchy

Refactored to match current test conventions:
- Use DeployWorkloadAndGetPods helper
- Use createTopologyTestContext for setup
- Replace hardcoded strings with constants
- Use single-step GetPodGroupForBasePodGangReplica
- Follow sequential test numbering (TAS8-TAS16)

Also imported 9 YAML manifest files for the new tests.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Co-authored-by: Geoff Flarity <geoff.flarity@gmail.com>
Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Eliminate 50 lines of duplicated code by replacing separate PCS replica 0 and replica 1 verification blocks with a loop that iterates over both replicas. Maintains identical functionality while improving maintainability.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
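For reference, a minimal sketch of the loop-over-replicas pattern this commit describes; the helper name, the `%s-%d` PodGroup naming convention, and the error-returning verify callback are assumptions for illustration, not the test's actual code:

```go
package utils

import "fmt"

// verifyAllReplicas replaces copy-pasted per-replica verification blocks with a loop.
// verify stands in for whatever assertions the test already performs per replica.
func verifyAllReplicas(baseName string, replicas int, verify func(podGroupName string) error) error {
	for i := 0; i < replicas; i++ {
		name := fmt.Sprintf("%s-%d", baseName, i) // assumed replica naming scheme
		if err := verify(name); err != nil {
			return fmt.Errorf("replica %d (%s): %w", i, name, err)
		}
	}
	return nil
}
```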
…ubgroup creation Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Replace duplicated replica verification blocks with loops in 6 tests:
- TAS8: nested loops for PCLQ (2x2) and PCSG (2) verifications
- TAS10: loop for 3 PCSG replicas
- TAS11: loop for 2 PCSG replicas
- TAS14: loops for PCS replica pods and PodGroups
- TAS15: nested loops for 2 PCSG types × 2 replicas
- TAS16: nested loops for PCSG replicas within PCS loop

Total: 121 lines eliminated (49% reduction in verification code)

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ation Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Reduce code duplication in topology_test.go by introducing reusable helper functions in the utils package.

Changes:
- Add VerifyPCSGReplicasInTopologyDomain helper for PCSG verification
- Add VerifyMultiTypePCSGReplicas helper for multi-type PCSG verification
- Add GetPodGroupOrFail test helper for cleaner PodGroup retrieval
- Update 15 tests (TAS2-TAS16) to use new helpers

Impact:
- Removes 133 lines of duplicated code
- Improves maintainability and consistency
- All tests passing

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
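As a rough illustration of the *-OrFail helper shape described here (fetch, fail the test on error, return the object so call sites stay one-liners); the generic signature and use of *testing.T are assumptions, since the concrete PodGroup client and test framework aren't shown in this excerpt:

```go
package utils

import "testing"

// getOrFail fetches an object by name via the supplied getter and fails the test
// immediately on error, mirroring the intent of helpers like GetPodGroupOrFail.
func getOrFail[T any](t *testing.T, name string, get func(name string) (T, error)) T {
	t.Helper()
	obj, err := get(name)
	if err != nil {
		t.Fatalf("failed to get %q: %v", name, err)
	}
	return obj
}
```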
- Broaden PCS constraint from rack to block in tas-pcsg-scale.yaml
- Rename disagg-inference files to use generic PCS-PCSG naming
- Update test references for renamed YAML files

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Replace nginx:alpine-slim with busybox:latest + sleep infinity
- Reduce memory: 30Mi → 10Mi, 80Mi → 20Mi
- Add CPU limit: 10m per container
- Fixes TAS16 CI timeout due to node instability
- Reduces total resource pressure by ~66%
- Applies to all 14 TAS tests except tas-insuffic.yaml

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
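For reference, the container change expressed with the core/v1 Go types (the actual manifests are YAML); the container name is a placeholder and the fields shown are a sketch of the described change, not a copy of the manifests:

```go
package manifests // illustrative only

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// workerContainer mirrors the busybox + "sleep infinity" container described above,
// with the reduced footprint: 10Mi memory request, 20Mi memory / 10m CPU limits.
var workerContainer = corev1.Container{
	Name:    "worker", // placeholder name
	Image:   "busybox:latest",
	Command: []string{"sleep", "infinity"},
	Resources: corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("10Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("10m"),
			corev1.ResourceMemory: resource.MustParse("20Mi"),
		},
	},
}
```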
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
b3cc590 to 1673428 (compare)
Add helper function VerifyScaledPCSGReplicaTopology to verify topology constraints on scaled PCSG replicas. The function implements the key rule: a scaled PodGroup uses the PCS constraint only if the PCSG has no constraint of its own.

Changes:
- Add PCSGCliqueConfig and ScaledPCSGConfig types to utils
- Add VerifyScaledPCSGReplicaTopology helper function
- Update TAS10 to verify scaled replicas 1-2 using lo.ForEach
- Update TAS15 to verify decoder and prefill scaled replicas
- Use proper types throughout, no inline structs

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Use PCSG constraint when present, otherwise PCS constraint

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
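A minimal sketch of this selection rule; the type and field names below are hypothetical stand-ins, not the actual ScaledPCSGConfig fields in utils:

```go
package utils

// TopologyConstraint is a hypothetical stand-in for however the tests describe a
// required topology level (e.g. block, rack, host).
type TopologyConstraint struct {
	Key string // e.g. "topology.kubernetes.io/rack"
}

type scaledPCSG struct {
	PCSConstraint  *TopologyConstraint // constraint declared on the parent PodCliqueSet
	PCSGConstraint *TopologyConstraint // constraint declared on the scaling group itself
}

// effectiveConstraint returns the constraint a scaled PodGroup is expected to honor:
// the PCSG's own constraint when it declares one, otherwise the parent PCS constraint.
func (s scaledPCSG) effectiveConstraint() *TopologyConstraint {
	if s.PCSGConstraint != nil {
		return s.PCSGConstraint
	}
	return s.PCSConstraint
}
```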
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add podAntiAffinity to prevent same-type pods from colocating:
- tas-pcsg-scale.yaml: worker clique in PCSG
- tas-host-level.yaml: standalone worker clique
- tas-pcs-multi-pcsg.yaml: 5 cliques (decoder, prefill, router)

Each clique's pods spread across different hosts using a grove.io/podclique label selector with the kubernetes.io/hostname topology key.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add podAntiAffinity to 11 remaining TAS test files to prevent same-type pods from colocating on the same host:
- tas-hierarchy.yaml (2 cliques in PCSG)
- tas-indep-clq.yaml (2 standalone cliques)
- tas-large-scale.yaml (1 clique in PCSG)
- tas-multirep.yaml (1 standalone clique)
- tas-no-constraint.yaml (1 clique in PCSG)
- tas-pcs-multi-pcsg-multi-replica.yaml (5 cliques)
- tas-pcs-pclq.yaml (1 standalone clique)
- tas-pcsg-pclq.yaml (1 clique in PCSG)
- tas-sl-pcs-only.yaml (2 cliques)
- tas-sl-pcsg-only.yaml (2 cliques)
- tas-standalone-pclq-only-pcs-zone.yaml (1 standalone clique)

Each clique uses a grove.io/podclique label selector with the kubernetes.io/hostname topology key to spread pods across different hosts. Skipped tas-insuffic.yaml (insufficient nodes test).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
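For reference, the anti-affinity block being added here, expressed with the core/v1 Go types (the manifests themselves are YAML); the clique name "worker" is a placeholder, and whether the manifests use the required or the preferred form isn't shown in this excerpt, so this sketch uses the required form:

```go
package manifests // illustrative only

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cliqueAntiAffinity keeps pods that share the same grove.io/podclique value off the
// same node by using kubernetes.io/hostname as the topology key.
var cliqueAntiAffinity = &corev1.Affinity{
	PodAntiAffinity: &corev1.PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"grove.io/podclique": "worker"}, // placeholder clique name
			},
			TopologyKey: "kubernetes.io/hostname",
		}},
	},
}
```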
Remove podAntiAffinity constraints that conflict with host-level topology packing requirements. Add cpu: 500m requests to naturally limit pod density per node without blocking required pod colocation.

Fixed tests: TAS5, TAS8, TAS9, TAS11, TAS12, TAS16

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Update CPU requests from 500m to 1000m across multiple TAS YAML files to better align resource allocation with test requirements.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Lowered CPU requests for multiple TAS YAML files to optimize resource usage during testing.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove CPU requests and use memory: 51m to limit pod density. With 150m per node, this ensures max 2 pods per node (2×51m=102m fits, 3×51m=153m exceeds capacity). This avoids CPU-related instability on GitHub runners while maintaining node pressure limits through memory constraints.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Adjust memory from 51m to 40m to account for system overhead. With 150m nodes and ~50m system overhead, 40m per pod allows 2 pods (80m) while preventing a 3rd (120m > 100m available). Fixes TAS5 scheduling issues where 51m was too tight.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Change invalid memory unit "40m" to valid "40Mi". The "m" suffix is for CPU millicores, not memory. Memory requires Mi/M/Gi units. With 40Mi per pod on 150MB nodes (~95Mi usable after overhead), this ensures max 2 pods per node while preventing scheduling errors.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
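As a quick sanity check on the unit distinction above (quantity parsing does accept "40m", but it means 40 milli-units, i.e. 0.04 bytes when used as memory); this is plain apimachinery usage, not code from this PR:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	milli := resource.MustParse("40m") // milli suffix: 0.04 of a unit
	mebi := resource.MustParse("40Mi") // mebibytes
	fmt.Println(milli.MilliValue())    // 40 (i.e. 0.04 bytes as a memory quantity)
	fmt.Println(mebi.Value())          // 41943040 (40 * 1024 * 1024 bytes)
}
```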
gflarity reviewed Feb 6, 2026
gflarity (Contributor) left a comment:
Another pass focusing on bugs/logic issues I spotted.
Fix critical bugs and incorrect comments:
- Fix memory indentation in tas-standalone (change to 40Mi)
- Fix kai_topology.go formula: use minAvailable instead of -1
- Add MinAvailable field to ScaledPCSGConfig
- Update TAS10/TAS15 callers to pass MinAvailable
- Fix TAS10 block constraint verification (was checking rack)
- Fix TAS10 per-replica block verification
- Fix tas-large-scale.yaml comment (remove incorrect rack constraint)

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Remove conflicting podAntiAffinity and change memory from 20Mi to 40Mi to match other TAS tests. This ensures max 2 pods per node through memory limits without anti-affinity conflicts.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
danbar2 approved these changes Feb 8, 2026
shayasoolin approved these changes Feb 8, 2026
What type of PR is this?
/kind feature
/kind cleanup
What this PR does / why we need it:
This PR imports 9 additional topology-aware scheduling (TAS) end-to-end tests.
New Tests Added: TAS8-TAS16, covering a full hierarchy with cascading constraints, PCS + PCLQ, PCSG scaling, PCSG + PCLQ without a parent constraint, a large scaling ratio, an insufficient-nodes error case, multi-replica rack constraints, disaggregated inference with multiple PCSGs, and a multi-replica PCS with a 3-level hierarchy.
Refactoring Applied: the imported tests use the DeployWorkloadAndGetPods and createTopologyTestContext helpers, single-step GetPodGroupForBasePodGangReplica, constants instead of hardcoded strings, and sequential TAS8-TAS16 numbering.
Test Coverage:
All 16 TAS tests pass, validating topology-aware scheduling across scenarios ranging from simple constraints to complex multi-replica disaggregated inference workloads.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce an API change?
```release-note
NONE
```
Additional documentation e.g., enhancement proposals, usage docs, etc.:
```docs
NONE
```