KEP-766: DisaggregatedSet implementation#773
KEP-766: DisaggregatedSet implementation#773hasB4K wants to merge 17 commits intokubernetes-sigs:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: hasB4K The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @hasB4K. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
✅ Deploy Preview for kubernetes-sigs-lws canceled.
|
df7d25d to
7620d60
Compare
|
/ok-to-test |
| // +kubebuilder:validation:Minimum=0 | ||
| // +kubebuilder:default=1 | ||
| // +optional | ||
| Replicas *int32 `json:"replicas,omitempty"` |
There was a problem hiding this comment.
I am sure you thought you about this, but what is the reason for not using a LeaderWorkerSetSpec instead of explicitly listing the individual parameters?
There was a problem hiding this comment.
I remember why I took the decision of not using LeaderWorkerSetSpec.
DisaggregatedPhaseSpec does not include rolloutStrategy from LeaderWorkerSetSpec because the DisaggregatedSet operator handles rolling updates itself:
Rollout type: We don't want users to pick a rollout strategy (RollingUpdate, OnDelete, etc.). The operator has its own logic to roll all phases together.
Partition: LWS uses partition to roll pods one by one, but this only works inside a single LWS. We can't sync partition across prefill and decode, so pods would end up on different versions. (this I didn't know)
Instead, the operator creates new LWS resources for each revision and scales them up/down together. Services only point to revisions where all phases are ready.
I think the API should be opinionated, and this is typically one of this case when I don't want to expose types (like RollouType or partition) that doesn't makes sense in the DisaggregatedSet operator
There was a problem hiding this comment.
Here is one option: we can use the LeaderWorkerSetSpec and validate that only the supported fields are set. The advantage is that this way the apis are always in sync (we don't need to copy every field), but at the same time we control which fields the new api should support. What do you think?
There was a problem hiding this comment.
Okay fair. I can try to do that, I'll let you know.
7c66541 to
ddb0070
Compare
|
Here what has been achieved since our last sync:
Here what i still need to do:
(I added a global TODO in the head message) cc @ahg-g, @Edwinhr716 |
ddb0070 to
358e48c
Compare
This adds the reference implementation for KEP-766 DisaggregatedSet. Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…Service Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…ss portless Services This change simplifies the DisaggregatedSet API by removing the user-facing ServiceTemplate field and automatically creating headless portless Services for pod discovery via EndpointSlices. API Changes: - Remove ServiceTemplate from DisaggregatedSet spec - Services are now created automatically without user configuration Service Behavior: - Headless Services (clusterIP: None) are created for each side (prefill/decode) - Services are portless to enable EndpointSlice-based discovery - Services are only created when both sides have ReadyReplicas >= 1 - Old services are cleaned up only after workloads are fully drained Testing: - Add e2e test verifying automatic Service and EndpointSlice creation - Add Kind cluster cleanup in AfterSuite to prevent image caching issues - Fix .dockerignore to use explicit exclusions Also fixes typo in Makefile helm target path. Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…ut codebase This is a breaking change that standardizes terminology: - Label: disaggregatedset.x-k8s.io/side → disaggregatedset.x-k8s.io/phase - Type: DisaggSideConfig → DisaggregatedPhaseSpec - Constants: SidePrefill/SideDecode → PhasePrefill/PhaseDecode - Struct: SideReplicaState → PhaseReplicaState - Map field: Sides → Phases YAML field names (spec.prefill, spec.decode) remain unchanged. Updated files: - API types and CRD manifests - All controller code and tests - E2E tests with updated label selectors - CLI tool (plan-steps) - README documentation - Helm chart CRD Signed-off-by: Mathis Felardos <mathis@mistral.ai>
… flexible phases array
This commit introduces a flexible N-phase architecture for DisaggregatedSet,
replacing the hardcoded prefill/decode fields with a dynamic phases array.
API Changes:
- Replace Prefill/Decode pointer fields with Phases []DisaggregatedPhaseSpec
- Add CEL validation requiring >= 2 phases with unique names
- Update PhaseReplicaState to use dynamic slices instead of fixed arrays
Controller Changes:
- Refactor executor to use dynamic slices for N-phase support
- Update planner to work with arbitrary phase counts
- Update service manager for N-phase workloads
CLI Changes (plan-steps):
- Accept JSON maps for phase config: --source '{"prefill": 6, "decode": 2}'
- Dynamically generate table headers from phase names
- Fully N-phase ready (planner still has 2-phase limitation)
Testing:
- Add 3-phase rolling update e2e test
- Add phase rename e2e test
- Update all existing tests for new API
This is a breaking API change for v1alpha1.
Signed-off-by: Mathis Felardos <mathis@mistral.ai>
… for phase changes Introduces a phasePolicy field (Strict/Flexible) to control how DisaggregatedSet handles phase additions, removals, and renames during updates. API Changes: - Add PhasePolicy type with Strict (default) and Flexible values - Strict: rejects adding, removing, or renaming phases - Flexible: allows phase changes with progressive rollout Controller Changes: - Detect phase changes by comparing spec vs old workload phases - Pass removed phases to Planner with target=0 for progressive drain - Compute allPhaseNames as union of spec phases + old workload phases - Clean up ALL phases belonging to drained revisions, not just spec phases - Refactor executor.go into logical sections with focused helper functions - Move verbose investigation logs to V(1) debug level Testing: - Add unit tests for phasePolicy enforcement (Strict blocks, Flexible allows) - Update e2e tests for progressive drain behavior - Add 3-phase sample manifests File reduced from 658 lines to 507 lines while maintaining all functionality. Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…r and fixtures packages Introduces reusable test utilities to simplify e2e test code and improve maintainability. New Packages: - test/utils/kubectl: Fluent builder API for kubectl commands - kubectl.go: Chainable methods (Get, Delete, Apply, Label, JSONPath, etc.) - queries.go: Higher-level helpers (LWSByPhase, CountPods, GetTotalReplicas) - waiters.go: Eventually-based wait helpers (ForPodCount, ForRevisionDrained) - test/utils/fixtures: YAML builders for DisaggregatedSet manifests - fixtures.Phase and fixtures.Config for flexible YAML generation - fixtures.PrefillDecode helper for common 2-phase configs Test Refactoring: - Replace inline helpers with kubectl package imports - Use fluent builder for all kubectl operations - Consolidate YAML builders into fixtures package - Add TrackProgressiveRollout helper for rollout tests Code Reduction: - executor.go: 507 → 456 lines - e2e_test.go: 1669 → 1004 lines (~40% reduction) Signed-off-by: Mathis Felardos <mathis@mistral.ai>
… and maxUnavailable floor Port fixes from commit 7c66541 to the N-phase planner: 1. Source-aware surge baseline: - Change surge constraint from old + new <= target + surge to old + new <= max(source, target) + surge - Ensures scale-down scenarios don't block scale-up since the system already runs source replicas 2. maxUnavailable floor enforcement: - Add minOld constraint: old + new >= target - maxUnavailable - Only enforce when source >= target (scale-down scenario) - Applied to both proportional drain and fallback drain paths Test updates: - Update expected sequences for scale-down and mixed-scale scenarios - Add asymmetric_5_3_surge2 test case All tests pass with 90.2% coverage. Signed-off-by: Mathis Felardos <mathis@mistral.ai>
358e48c to
233b6ad
Compare
| // --surge '{"prefill": 2, "decode": 2}' | ||
| // | ||
| // This helps understand what will happen during a specific rollout in advance. | ||
| // Supports arbitrary phase names and will support N phases when the planner does. |
There was a problem hiding this comment.
So this is a tool that one runs locally, not a controller?
There was a problem hiding this comment.
So this is a tool that one runs locally, not a controller?
cmd/plan-steps/main.go is just a util script that I used for debug. it can also be used to know what will happen before one does a deployment. BUT the rest of the code in disaggregatedset/internal/ is the controller/operator. 😉
There was a problem hiding this comment.
I moved this to hack/plan-steps/ - I think it's clearer that way.
The existing test only verified that labels and annotations were set on the LWS workerTemplate. This extends the test to also verify that they are actually propagated to the running pods by LWS. Adds PodsByPhase helper to kubectl queries for querying pods by phase. Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…gy support Adds a metadata field to DisaggregatedPhaseSpec that allows users to set labels and annotations on the LWS CR's ObjectMeta. This enables: - Kueue queue assignment (kueue.x-k8s.io/queue-name label) - LWS exclusive-topology scheduling User-provided labels/annotations are merged onto the LWS CR, with system labels taking precedence over user labels. Includes e2e test verifying metadata propagation to LWS CR. Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…est-e2e targets for prow CI Signed-off-by: Mathis Felardos <mathis@mistral.ai>
BREAKING CHANGE: The phasePolicy field has been removed from DisaggregatedSetSpec. The controller now always uses Flexible behavior, allowing phase additions, removals, and renames during rollouts. - Remove PhasePolicy type, constants, and spec field - Remove getPhasePolicy() and rejectPhaseChanges() controller logic - Remove all PhasePolicy unit tests - Update fixtures and sample YAMLs - Remove PhasePolicy from e2e tests Phase changes are now always allowed (Flexible behavior is the default). Signed-off-by: Mathis Felardos <mathis@mistral.ai>
- Add KIND ?= kind default in disaggregatedset Makefile - Pass KIND=$(KIND) from parent Makefile to disaggregatedset test-e2e This fixes the "Kind is not installed" error when running disaggregatedset e2e tests from the parent LWS Makefile. Signed-off-by: Mathis Felardos <mathis@mistral.ai>
7bc4172 to
394f4b8
Compare
Add a dedicated e2e test script that mirrors the LWS pattern for running e2e tests in Prow CI. The script handles the full e2e lifecycle: - Kind cluster creation/deletion with cleanup trap - Docker image building and loading to Kind - LWS controller installation from release manifests - DisaggregatedSet operator deployment via kustomize - Running ginkgo tests with junit output - Log collection on cleanup for debugging Update Makefile to call the hack script with all required environment variables (KIND, KUBECTL, KUSTOMIZE, GINKGO, etc.) and add ginkgo tool. Update e2e tests to skip redundant operations when run via hack script (detected via LWS_INSTALL_SKIP=true), allowing tests to work both standalone and in Prow CI.
394f4b8 to
89cc3e9
Compare
…eNextStep - Change RollingUpdateConfig from slice fields to per-phase structs ([]RollingUpdateConfig instead of RollingUpdateConfig with []int fields) - Simplify extractRollingUpdateConfig by removing unused specPhaseSet param - Split ComputeNextStep into focused helpers: isComplete, isNewAtTarget, canScaleUp, computeMinOld, tryScaleUp, tryProportionalDrain, tryForceDrain - Remove config() test helper in favor of direct struct literals
5a2367a to
e1be1ea
Compare
ahg-g
left a comment
There was a problem hiding this comment.
The API looks good to me. I will review the code when it is merged with the lws binary (Can we create an issue to track this btw?). I will leave the final approval to Edwin.
/lgtm
|
I took a look at the code, nothing major sticks out. Will approve once the last comment is addressed. /lgtm |
|
|
||
| log.Info("Reconciling DisaggregatedSet", "name", disaggregatedSet.Name, "namespace", disaggregatedSet.Namespace) | ||
|
|
||
| // Validate phases are configured (API validation ensures at least 2) |
There was a problem hiding this comment.
Can we move all validation to the disaggregatedset_webhook?
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it
This PR implements KEP-766 by introducing the DisaggregatedSet controller, a Kubernetes operator for managing disaggregated inference deployments. Disaggregated serving separates the prefill and decode phases
of LLM inference onto different infrastructure, and this controller orchestrates multiple LeaderWorkerSets with coordinated lifecycle management.
Key features:
Which issue(s) this PR fixes
Fixes #766
Special notes for your reviewer
The two-dimensional rollout algorithm uses linear interpolation to maintain the prefill/decode ratio throughout updates. A plan-steps CLI tool is included to visualize rollout plans:
Does this PR introduce a user-facing change?
Introduce
DisaggregatedSetAPI for managing disaggregated LLM inference deployments with coordinated prefill/decode rolling updates.Current TODO: