Skip to content

KEP-766: DisaggregatedSet implementation#773

Open
hasB4K wants to merge 17 commits intokubernetes-sigs:mainfrom
hasB4K:mistral-ai/disaggregatedset
Open

KEP-766: DisaggregatedSet implementation#773
hasB4K wants to merge 17 commits intokubernetes-sigs:mainfrom
hasB4K:mistral-ai/disaggregatedset

Conversation

@hasB4K
Copy link

@hasB4K hasB4K commented Mar 9, 2026

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it

This PR implements KEP-766 by introducing the DisaggregatedSet controller, a Kubernetes operator for managing disaggregated inference deployments. Disaggregated serving separates the prefill and decode phases
of LLM inference onto different infrastructure, and this controller orchestrates multiple LeaderWorkerSets with coordinated lifecycle management.

Key features:

  • Unified Management: Manage prefill and decode LeaderWorkerSets as a single resource
  • Two-Dimensional Rolling Updates: Linear interpolation algorithm updates both sides in lockstep while preserving the prefill-to-decode ratio
  • Service Orchestration: Automatically create Services when both sides are ready
  • Stateless Operator: Safe to restart at any point during operations

Which issue(s) this PR fixes

Fixes #766

Special notes for your reviewer

The two-dimensional rollout algorithm uses linear interpolation to maintain the prefill/decode ratio throughout updates. A plan-steps CLI tool is included to visualize rollout plans:

$ go run ./cmd/plan-steps --source '{"prefill": 10, "decode": 2}' --target '{"prefill": 6, "decode": 8}' --surge '{"prefill": 2, "decode": 2}'
Phases: [decode prefill]
Source: decode=2, prefill=10
Target: decode=8, prefill=6
Config: decode(surge=2, unavail=0), prefill(surge=2, unavail=0)

┌──────┬────────────┬─────────────┬────────────┬─────────────┬───────┬───────────────────────────────┐
│ STEP │ OLD DECODE │ OLD PREFILL │ NEW DECODE │ NEW PREFILL │ TOTAL │            ACTION             │
├──────┼────────────┼─────────────┼────────────┼─────────────┼───────┼───────────────────────────────┤
│ 0    │ 2          │ 10          │ 0          │ 0           │ 12    │ initial                       │
│ 1    │ 2          │ 10          │ 2          │ 2           │ 16    │ new decode +2, new prefill +2 │
│ 2    │ 2          │ 8           │ 2          │ 2           │ 14    │ old prefill -2                │
│ 3    │ 2          │ 8           │ 4          │ 3           │ 17    │ new decode +2, new prefill +1 │
│ 4    │ 2          │ 8           │ 5          │ 4           │ 19    │ new decode +1, new prefill +1 │
│ 5    │ 2          │ 6           │ 5          │ 4           │ 17    │ old prefill -2                │
│ 6    │ 2          │ 6           │ 7          │ 5           │ 20    │ new decode +2, new prefill +1 │
│ 7    │ 2          │ 6           │ 8          │ 6           │ 22    │ new decode +1, new prefill +1 │
│ 8    │ 0          │ 0           │ 8          │ 6           │ 14    │ old decode -2, old prefill -6 │
└──────┴────────────┴─────────────┴────────────┴─────────────┴───────┴───────────────────────────────┘

Does this PR introduce a user-facing change?

Introduce DisaggregatedSet API for managing disaggregated LLM inference deployments with coordinated prefill/decode rolling updates.

Current TODO:

  • List based instead of map based phases in the YAML config
  • (that was optional, but realized it was easy with the previous point done): support multiple phases (ie. more than 2) to be able to support encode phase (among other)
  • Remove the Service in the YAML API. We now have a Headless Service that exposes a EndpointSlice (for llm-d EndpointPicker in the future)
  • Massive refactor of the e2e tests.
  • Propagate the annotations to the LWS level to support Kueue topology
  • See if I can use LeaderWorkerSetSpec in the YAML instead of LeaderWorkerTemplate as I do now.
  • Add the CI
  • Update the KEP and the doc.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Mar 9, 2026
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hasB4K
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 9, 2026
@k8s-ci-robot
Copy link
Contributor

Hi @hasB4K. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 9, 2026
@netlify
Copy link

netlify bot commented Mar 9, 2026

Deploy Preview for kubernetes-sigs-lws canceled.

Name Link
🔨 Latest commit e1be1ea
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-lws/deploys/69bb47ed8969d20009384365

@hasB4K hasB4K force-pushed the mistral-ai/disaggregatedset branch from df7d25d to 7620d60 Compare March 9, 2026 12:24
@yankay
Copy link
Member

yankay commented Mar 9, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 9, 2026
@hasB4K hasB4K changed the title KEP-766: Add DisaggDeployment proposal KEP-766: Add DisaggregatedSet proposal Mar 9, 2026
@hasB4K hasB4K changed the title KEP-766: Add DisaggregatedSet proposal KEP-766: DisaggregatedSet implementation Mar 9, 2026
Copy link
Contributor

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments on the api

// +kubebuilder:validation:Minimum=0
// +kubebuilder:default=1
// +optional
Replicas *int32 `json:"replicas,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sure you thought you about this, but what is the reason for not using a LeaderWorkerSetSpec instead of explicitly listing the individual parameters?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahg-g, @Edwinhr716 :

I remember why I took the decision of not using LeaderWorkerSetSpec.
DisaggregatedPhaseSpec does not include rolloutStrategy from LeaderWorkerSetSpec because the DisaggregatedSet operator handles rolling updates itself:

Rollout type: We don't want users to pick a rollout strategy (RollingUpdate, OnDelete, etc.). The operator has its own logic to roll all phases together.
Partition: LWS uses partition to roll pods one by one, but this only works inside a single LWS. We can't sync partition across prefill and decode, so pods would end up on different versions. (this I didn't know)

Instead, the operator creates new LWS resources for each revision and scales them up/down together. Services only point to revisions where all phases are ready.
I think the API should be opinionated, and this is typically one of this case when I don't want to expose types (like RollouType or partition) that doesn't makes sense in the DisaggregatedSet operator

Copy link
Contributor

@ahg-g ahg-g Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is one option: we can use the LeaderWorkerSetSpec and validate that only the supported fields are set. The advantage is that this way the apis are always in sync (we don't need to copy every field), but at the same time we control which fields the new api should support. What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay fair. I can try to do that, I'll let you know.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's done!

@hasB4K hasB4K force-pushed the mistral-ai/disaggregatedset branch from 7c66541 to ddb0070 Compare March 16, 2026 14:46
@hasB4K
Copy link
Author

hasB4K commented Mar 16, 2026

Here what has been achieved since our last sync:

  • List based instead of map based phases in the YAML config
  • (that was optional, but realized it was easy with the previous point done): support multiple phases (ie. more than 2) to be able to support encode phase (among other)
  • Remove the Service in the YAML API. We now have a Headless Service that exposes a EndpointSlice (for llm-d EndpointPicker in the future)
  • Massive refactor of the e2e tests.
  • Patched a bug in the planner.

Here what i still need to do:

  • Propagate the annotations to the LWS level to support Kueue topology
  • See if I can use LeaderWorkerSetSpec in the YAML instead of LeaderWorkerTemplate as I do now.
  • Add the CI
  • Update the KEP and the doc.

(I added a global TODO in the head message)

cc @ahg-g, @Edwinhr716

@hasB4K hasB4K force-pushed the mistral-ai/disaggregatedset branch from ddb0070 to 358e48c Compare March 16, 2026 14:58
hasB4K added 8 commits March 16, 2026 16:25
This adds the reference implementation for KEP-766 DisaggregatedSet.

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…Service

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…ss portless Services

This change simplifies the DisaggregatedSet API by removing the user-facing
ServiceTemplate field and automatically creating headless portless Services
for pod discovery via EndpointSlices.

API Changes:
- Remove ServiceTemplate from DisaggregatedSet spec
- Services are now created automatically without user configuration

Service Behavior:
- Headless Services (clusterIP: None) are created for each side (prefill/decode)
- Services are portless to enable EndpointSlice-based discovery
- Services are only created when both sides have ReadyReplicas >= 1
- Old services are cleaned up only after workloads are fully drained

Testing:
- Add e2e test verifying automatic Service and EndpointSlice creation
- Add Kind cluster cleanup in AfterSuite to prevent image caching issues
- Fix .dockerignore to use explicit exclusions

Also fixes typo in Makefile helm target path.

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…ut codebase

This is a breaking change that standardizes terminology:
- Label: disaggregatedset.x-k8s.io/side → disaggregatedset.x-k8s.io/phase
- Type: DisaggSideConfig → DisaggregatedPhaseSpec
- Constants: SidePrefill/SideDecode → PhasePrefill/PhaseDecode
- Struct: SideReplicaState → PhaseReplicaState
- Map field: Sides → Phases

YAML field names (spec.prefill, spec.decode) remain unchanged.

Updated files:
- API types and CRD manifests
- All controller code and tests
- E2E tests with updated label selectors
- CLI tool (plan-steps)
- README documentation
- Helm chart CRD

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
… flexible phases array

This commit introduces a flexible N-phase architecture for DisaggregatedSet,
replacing the hardcoded prefill/decode fields with a dynamic phases array.

API Changes:
- Replace Prefill/Decode pointer fields with Phases []DisaggregatedPhaseSpec
- Add CEL validation requiring >= 2 phases with unique names
- Update PhaseReplicaState to use dynamic slices instead of fixed arrays

Controller Changes:
- Refactor executor to use dynamic slices for N-phase support
- Update planner to work with arbitrary phase counts
- Update service manager for N-phase workloads

CLI Changes (plan-steps):
- Accept JSON maps for phase config: --source '{"prefill": 6, "decode": 2}'
- Dynamically generate table headers from phase names
- Fully N-phase ready (planner still has 2-phase limitation)

Testing:
- Add 3-phase rolling update e2e test
- Add phase rename e2e test
- Update all existing tests for new API

This is a breaking API change for v1alpha1.

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
… for phase changes

Introduces a phasePolicy field (Strict/Flexible) to control how DisaggregatedSet
handles phase additions, removals, and renames during updates.

API Changes:
- Add PhasePolicy type with Strict (default) and Flexible values
- Strict: rejects adding, removing, or renaming phases
- Flexible: allows phase changes with progressive rollout

Controller Changes:
- Detect phase changes by comparing spec vs old workload phases
- Pass removed phases to Planner with target=0 for progressive drain
- Compute allPhaseNames as union of spec phases + old workload phases
- Clean up ALL phases belonging to drained revisions, not just spec phases
- Refactor executor.go into logical sections with focused helper functions
- Move verbose investigation logs to V(1) debug level

Testing:
- Add unit tests for phasePolicy enforcement (Strict blocks, Flexible allows)
- Update e2e tests for progressive drain behavior
- Add 3-phase sample manifests

File reduced from 658 lines to 507 lines while maintaining all functionality.

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…r and fixtures packages

Introduces reusable test utilities to simplify e2e test code and improve maintainability.

New Packages:
- test/utils/kubectl: Fluent builder API for kubectl commands
  - kubectl.go: Chainable methods (Get, Delete, Apply, Label, JSONPath, etc.)
  - queries.go: Higher-level helpers (LWSByPhase, CountPods, GetTotalReplicas)
  - waiters.go: Eventually-based wait helpers (ForPodCount, ForRevisionDrained)
- test/utils/fixtures: YAML builders for DisaggregatedSet manifests
  - fixtures.Phase and fixtures.Config for flexible YAML generation
  - fixtures.PrefillDecode helper for common 2-phase configs

Test Refactoring:
- Replace inline helpers with kubectl package imports
- Use fluent builder for all kubectl operations
- Consolidate YAML builders into fixtures package
- Add TrackProgressiveRollout helper for rollout tests

Code Reduction:
- executor.go: 507 → 456 lines
- e2e_test.go: 1669 → 1004 lines (~40% reduction)

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
… and maxUnavailable floor

Port fixes from commit 7c66541 to the N-phase planner:

1. Source-aware surge baseline:
   - Change surge constraint from old + new <= target + surge to
     old + new <= max(source, target) + surge
   - Ensures scale-down scenarios don't block scale-up since the system
     already runs source replicas

2. maxUnavailable floor enforcement:
   - Add minOld constraint: old + new >= target - maxUnavailable
   - Only enforce when source >= target (scale-down scenario)
   - Applied to both proportional drain and fallback drain paths

Test updates:
- Update expected sequences for scale-down and mixed-scale scenarios
- Add asymmetric_5_3_surge2 test case

All tests pass with 90.2% coverage.

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
@hasB4K hasB4K force-pushed the mistral-ai/disaggregatedset branch from 358e48c to 233b6ad Compare March 16, 2026 15:25
@hasB4K hasB4K requested review from Edwinhr716 and ahg-g March 16, 2026 19:03
// --surge '{"prefill": 2, "decode": 2}'
//
// This helps understand what will happen during a specific rollout in advance.
// Supports arbitrary phase names and will support N phases when the planner does.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a tool that one runs locally, not a controller?

Copy link
Author

@hasB4K hasB4K Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a tool that one runs locally, not a controller?

cmd/plan-steps/main.go is just a util script that I used for debug. it can also be used to know what will happen before one does a deployment. BUT the rest of the code in disaggregatedset/internal/ is the controller/operator. 😉

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this to hack/plan-steps/ - I think it's clearer that way.

hasB4K added 5 commits March 17, 2026 10:52
The existing test only verified that labels and annotations were set on the
LWS workerTemplate. This extends the test to also verify that they are
actually propagated to the running pods by LWS.

Adds PodsByPhase helper to kubectl queries for querying pods by phase.

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…gy support

Adds a metadata field to DisaggregatedPhaseSpec that allows users to set
labels and annotations on the LWS CR's ObjectMeta. This enables:
- Kueue queue assignment (kueue.x-k8s.io/queue-name label)
- LWS exclusive-topology scheduling

User-provided labels/annotations are merged onto the LWS CR, with system
labels taking precedence over user labels.

Includes e2e test verifying metadata propagation to LWS CR.

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
…est-e2e targets for prow CI

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
BREAKING CHANGE: The phasePolicy field has been removed from
DisaggregatedSetSpec. The controller now always uses Flexible behavior,
allowing phase additions, removals, and renames during rollouts.

- Remove PhasePolicy type, constants, and spec field
- Remove getPhasePolicy() and rejectPhaseChanges() controller logic
- Remove all PhasePolicy unit tests
- Update fixtures and sample YAMLs
- Remove PhasePolicy from e2e tests

Phase changes are now always allowed (Flexible behavior is the default).

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
- Add KIND ?= kind default in disaggregatedset Makefile
- Pass KIND=$(KIND) from parent Makefile to disaggregatedset test-e2e

This fixes the "Kind is not installed" error when running
disaggregatedset e2e tests from the parent LWS Makefile.

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
@hasB4K hasB4K force-pushed the mistral-ai/disaggregatedset branch 2 times, most recently from 7bc4172 to 394f4b8 Compare March 17, 2026 10:56
Add a dedicated e2e test script that mirrors the LWS pattern for running
e2e tests in Prow CI. The script handles the full e2e lifecycle:

- Kind cluster creation/deletion with cleanup trap
- Docker image building and loading to Kind
- LWS controller installation from release manifests
- DisaggregatedSet operator deployment via kustomize
- Running ginkgo tests with junit output
- Log collection on cleanup for debugging

Update Makefile to call the hack script with all required environment
variables (KIND, KUBECTL, KUSTOMIZE, GINKGO, etc.) and add ginkgo tool.

Update e2e tests to skip redundant operations when run via hack script
(detected via LWS_INSTALL_SKIP=true), allowing tests to work both
standalone and in Prow CI.
@hasB4K hasB4K force-pushed the mistral-ai/disaggregatedset branch from 394f4b8 to 89cc3e9 Compare March 17, 2026 11:13
@hasB4K hasB4K requested a review from ahg-g March 17, 2026 13:18
hasB4K added 3 commits March 19, 2026 01:47
…eNextStep

- Change RollingUpdateConfig from slice fields to per-phase structs
  ([]RollingUpdateConfig instead of RollingUpdateConfig with []int fields)
- Simplify extractRollingUpdateConfig by removing unused specPhaseSet param
- Split ComputeNextStep into focused helpers: isComplete, isNewAtTarget,
  canScaleUp, computeMinOld, tryScaleUp, tryProportionalDrain, tryForceDrain
- Remove config() test helper in favor of direct struct literals
@hasB4K hasB4K force-pushed the mistral-ai/disaggregatedset branch from 5a2367a to e1be1ea Compare March 19, 2026 00:48
Copy link
Contributor

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API looks good to me. I will review the code when it is merged with the lws binary (Can we create an issue to track this btw?). I will leave the final approval to Edwin.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 19, 2026
@Edwinhr716
Copy link
Contributor

I took a look at the code, nothing major sticks out. Will approve once the last comment is addressed.

/lgtm


log.Info("Reconciling DisaggregatedSet", "name", disaggregatedSet.Name, "namespace", disaggregatedSet.Namespace)

// Validate phases are configured (API validation ensures at least 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move all validation to the disaggregatedset_webhook?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add DisaggDeployment CRD for coordinated prefill/decode workload management

5 participants