Commit 5eb8a46

KEP-766: Add Alternative 4 comparing partition-based approach
Address reviewer feedback explaining why we use multiple LWS per revision instead of the LWS partition field:

- Revision-aware traffic routing for LLM-d Endpoint Picker
- LWS as read-only resource (like Deployment/ReplicaSet pattern)
- Simpler ops observability during rolling updates
1 parent 95941ed commit 5eb8a46

1 file changed: +29 -0 lines changed

keps/766-DisaggDeployment/README.md

Lines changed: 29 additions & 0 deletions
@@ -33,6 +33,7 @@ workload primitive.
- [Alternative 1: Extend LeaderWorkerSet with Multi-Template Support](#alternative-1-extend-leaderworkerset-with-multi-template-support)
- [Alternative 2: Helm Chart or Kustomize Overlay](#alternative-2-helm-chart-or-kustomize-overlay)
- [Alternative 3: External Controller Without CRD](#alternative-3-external-controller-without-crd)
- [Alternative 4: Use LWS Partition Field Instead of Multiple LWS per Revision](#alternative-4-use-lws-partition-field-instead-of-multiple-lws-per-revision)
<!-- /toc -->

## Summary
@@ -387,3 +388,31 @@ Build an external controller that watches LeaderWorkerSets with specific labels
- Poor user experience (no single resource to manage)
- Harder to discover and use
- State management would be complex without a CRD

### Alternative 4: Use LWS Partition Field Instead of Multiple LWS per Revision

Instead of creating a separate LeaderWorkerSet (LWS) per revision, which results in up to four LWS during updates (old-prefill, old-decode, new-prefill, new-decode), use the LWS `partition` field to perform in-place updates within a single LWS per side (prefill and decode).

**How it would work**:
- DisaggDeployment creates exactly 2 LWS: `{name}-prefill` and `{name}-decode`
- Rolling updates manipulate the `partition` field on both LWS to progressively update groups
- Groups with ordinal `>= partition` get the new template; groups with ordinal `< partition` remain on the old template
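
The partition rule above can be illustrated with a small sketch. This is not LWS controller code; the function names `template_for_group` and `rollout_states` are hypothetical, and the sketch assumes the rollout lowers `partition` by one group at a time from the replica count down to 0.

```python
def template_for_group(ordinal: int, partition: int) -> str:
    """Return which template a group runs under a partition-based update.

    Groups with ordinal >= partition get the new template; the rest stay old.
    """
    return "new" if ordinal >= partition else "old"


def rollout_states(replicas: int) -> list[list[str]]:
    """List the per-group templates as partition is lowered from `replicas` to 0.

    Assumption: the partition is decremented one group at a time.
    """
    return [
        [template_for_group(ordinal, partition) for ordinal in range(replicas)]
        for partition in range(replicas, -1, -1)
    ]


if __name__ == "__main__":
    for partition, state in zip(range(3, -1, -1), rollout_states(3)):
        print(f"partition={partition}: {state}")
```

In this model the highest ordinals are updated first, and at any intermediate step a single LWS contains groups on both templates, which is the situation the reasons below are concerned with.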

**Why we chose multiple LWS per revision instead**:

1. **Revision-aware traffic routing**: DisaggDeployment is designed for disaggregated inference, where a load balancer (e.g., a modified LLM-d Endpoint Picker) must route prefill requests to backends whose decode counterparts are on the **same revision**. With a separate LWS (and Service) per revision, each pod's revision is explicit via a label (`disaggdeployment.x-k8s.io/revision`), so the load balancer can count backends per revision across both pools and distribute traffic proportionally. With partition-based updates, pods within the same LWS run different templates depending on their group ordinal, which makes revision-aware routing significantly more complex. The goal is to keep a prefill pod from talking to an incompatible decode pod: the two revisions must be treated as totally incompatible. This routing behavior also requires work on the LLM-d side.

2. **LWS as a read-only resource**: Treating LWS as a read-only resource (similar to how Deployment treats ReplicaSet) fits this use case better. During a coordinated rollout, prefill and decode need to be updated at different paces depending on the current step, so the update is tied across two dimensions. This level of control is difficult to achieve with `partition`, which operates on a single LWS independently.

3. **Ops observability**: A separate LWS per revision is easier to observe. The stage of an update is visible directly from the replica counts of each LWS (e.g., "old-prefill: 2 replicas, new-prefill: 3 replicas"), rather than by inspecting partition boundaries within a single LWS.
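
The per-revision counting described in point 1 could look roughly like the sketch below. It is an illustration only, not the Endpoint Picker's logic: the role label `example.com/role` and the function `revision_weights` are hypothetical, and weighting each revision by the smaller of its prefill and decode backend counts is an assumption (the text only says traffic is distributed proportionally).

```python
from collections import Counter

REVISION_LABEL = "disaggdeployment.x-k8s.io/revision"
ROLE_LABEL = "example.com/role"  # hypothetical label carrying "prefill" or "decode"


def revision_weights(pods: list[dict[str, str]]) -> dict[str, float]:
    """Compute a traffic share per revision from pod labels.

    Assumption: a revision's capacity is min(prefill backends, decode backends),
    so a revision missing either side receives no traffic.
    """
    counts: Counter = Counter()
    for labels in pods:
        counts[(labels[REVISION_LABEL], labels[ROLE_LABEL])] += 1

    revisions = {revision for revision, _ in counts}
    capacity = {
        revision: min(counts[(revision, "prefill")], counts[(revision, "decode")])
        for revision in revisions
    }
    total = sum(capacity.values())
    if total == 0:
        return {}
    return {rev: cap / total for rev, cap in capacity.items() if cap > 0}


def make_pods(revision: str, role: str, n: int) -> list[dict[str, str]]:
    """Build n label sets for one (revision, role) pair, for demonstration."""
    return [{REVISION_LABEL: revision, ROLE_LABEL: role} for _ in range(n)]


if __name__ == "__main__":
    pods = (
        make_pods("old", "prefill", 2) + make_pods("old", "decode", 2)
        + make_pods("new", "prefill", 3) + make_pods("new", "decode", 1)
    )
    print(revision_weights(pods))
```

Because the revision is read from a pod label, the same counting works no matter which LWS a pod belongs to; with one LWS per revision, the label value is uniform across each LWS.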

**Trade-offs acknowledged**:
- **Resource overhead**: Up to 4 LWS exist during updates vs. 2. However, LWS is a lightweight coordination resource; the actual pod count remains the same.
- **Complexity**: The two-dimensional rolling update algorithm is more complex than coordinating two partition values. However, this complexity is encapsulated in the DisaggDeployment controller.
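
To give a feel for what a two-dimensional rollout involves, here is a deliberately simplified sketch. It is not the DisaggDeployment controller's algorithm: `RolloutState` and `plan_rollout` are hypothetical names, and the policy (move one replica at a time from old to new, advance whichever side has migrated the smaller fraction, decode first on ties, no surge) is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutState:
    """Replica counts of the four LWS that can exist during an update."""
    old_prefill: int
    new_prefill: int
    old_decode: int
    new_decode: int


def plan_rollout(prefill: int, decode: int) -> list[RolloutState]:
    """Return the sequence of states for a hypothetical coordinated rollout."""
    states = [RolloutState(prefill, 0, decode, 0)]
    new_p = new_d = 0
    while new_p < prefill or new_d < decode:
        # Compare migrated fractions new_d/decode and new_p/prefill by
        # cross-multiplying, which avoids division by zero.
        decode_behind = new_d * prefill <= new_p * decode
        if new_d < decode and (new_p == prefill or decode_behind):
            new_d += 1
        else:
            new_p += 1
        states.append(
            RolloutState(prefill - new_p, new_p, decode - new_d, new_d)
        )
    return states


if __name__ == "__main__":
    for step, state in enumerate(plan_rollout(2, 2)):
        print(step, state)
```

Each state maps directly onto the replica counts of up to four LWS, which is also what makes the progress of an update easy to read off during operations.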

**Potential LWS improvements that could enable a partition-based approach**:
- Pod-level revision labels (independent of LWS name) would help with traffic routing
- Revision-aware service selectors at the LWS level
- See also: [#710](https://github.com/kubernetes-sigs/lws/issues/710) for related discussion on revision tracking

**Conclusion**: The multiple-LWS-per-revision approach was chosen primarily to enable revision-aware traffic routing, which is critical for disaggregated inference correctness. We are open to revisiting this if LWS gains features that make partition-based coordination viable for this use case.
