
Commit 34abd36

Update PodReplacementPolicy for Deployments KEP
- add a Deployment Scaling Changes and a New Annotation for ReplicaSets section
1 parent e9ce75a commit 34abd36

2 files changed: +139 −3 lines changed

keps/sig-apps/3973-consider-terminating-pods-deployment/README.md

Lines changed: 137 additions & 1 deletion
@@ -14,9 +14,13 @@
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Consideration for Other Controllers](#consideration-for-other-controllers)
- [Risks and Mitigations](#risks-and-mitigations)
- [Feature Impact](#feature-impact)
- [kubectl Skew](#kubectl-skew)
- [Design Details](#design-details)
- [Deployment Behavior Changes](#deployment-behavior-changes)
- [ReplicaSet Status and Deployment Status Changes](#replicaset-status-and-deployment-status-changes)
- [Deployment Scaling Changes and a New Annotation for ReplicaSets](#deployment-scaling-changes-and-a-new-annotation-for-replicasets)
- [kubectl Changes](#kubectl-changes)
- [API](#api)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
@@ -210,12 +214,30 @@ This feature is already implemented for Jobs ([KEP-3939](https://github.com/kube

### Risks and Mitigations

#### Feature Impact

Deployment rollouts might be slower when using the `TerminationComplete` PodReplacementPolicy.

Deployment rollouts might consume excessive resources when using the `TerminationStarted` PodReplacementPolicy.

This is mitigated by making this feature opt-in.

#### kubectl Skew

The `deployment.kubernetes.io/replicaset-replicas-before-scale` annotation should be removed during
deployment rollback when annotations are copied from the ReplicaSet to the Deployment. Support for
this removal will be added to kubectl in the same release as this feature. Therefore, rollback using
an older kubectl will not be supported until one minor release after the feature first reaches
alpha. The documentation for Deployments will include a notice about this.

If an older kubectl version is used, the impact should be minimal. The Deployment may end up with an
unnecessary `deployment.kubernetes.io/replicaset-replicas-before-scale` annotation. The deployment
controller then synchronizes Deployment annotations back to the ReplicaSet, and it will ignore this
new annotation if the feature gate is on.

The bug should be mainly visual (an extra annotation on the Deployment), unless the feature is
turned on and off in quick succession. In that case, incorrect annotations could end up on a
ReplicaSet, which would affect the scaling proportions during a rollout.

## Design Details

### Deployment Behavior Changes
@@ -242,7 +264,8 @@ Scaling logic:
ReplicaSets as well and not spawn replicas that would be higher than the Deployment's
`.spec.replicas + .spec.strategy.rollingUpdate.maxSurge`. This will be implemented by
checking the ReplicaSet's `.spec.replicas`, `.status.replicas` and `.status.terminatingReplicas`
to determine the number of pods. See [Deployment Scaling Changes and a New Annotation for ReplicaSets](#deployment-scaling-changes-and-a-new-annotation-for-replicasets)
for more details. A rough sketch of this check is shown after this list.
- Changing the scale-down logic is not necessary, and we can scale down as many pods as we want,
  because the policy does not affect this since we are not replacing the pods.
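
As a rough illustration of the check above, here is a hedged sketch of how the counting could look. This is an assumption about the eventual implementation, not actual controller code, and all names (`replicaSetCounts`, `remainingSurgeBudget`) are made up for the example:

```golang
package main

import "fmt"

// replicaSetCounts carries the three per-ReplicaSet values the KEP says the
// controller will look at.
type replicaSetCounts struct {
	specReplicas        int32 // .spec.replicas
	statusReplicas      int32 // .status.replicas
	terminatingReplicas int32 // proposed .status.terminatingReplicas
}

// remainingSurgeBudget returns how many new pods may still be created without
// exceeding the Deployment's .spec.replicas + maxSurge. With the
// TerminationComplete policy, pods that are still terminating keep occupying
// the budget (the exact counting rule here is an assumption).
func remainingSurgeBudget(deploymentReplicas, maxSurge int32, replicaSets []replicaSetCounts) int32 {
	budget := deploymentReplicas + maxSurge
	for _, rs := range replicaSets {
		occupied := rs.specReplicas
		if rs.statusReplicas > occupied {
			occupied = rs.statusReplicas
		}
		budget -= occupied + rs.terminatingReplicas
	}
	if budget < 0 {
		return 0
	}
	return budget
}

func main() {
	// Deployment with 10 replicas and maxSurge 2: an old ReplicaSet is still
	// draining 3 pods and the new ReplicaSet already asks for 9, so no
	// additional pods may be created yet (prints 0).
	fmt.Println(remainingSurgeBudget(10, 2, []replicaSetCounts{
		{specReplicas: 0, statusReplicas: 0, terminatingReplicas: 3},
		{specReplicas: 9, statusReplicas: 9, terminatingReplicas: 0},
	}))
}
```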

@@ -256,6 +279,101 @@ To satisfy the requirement for tracking terminating pods, and for implementation
we propose adding a new field `.status.terminatingReplicas` to the ReplicaSet's and Deployment's
status.

### Deployment Scaling Changes and a New Annotation for ReplicaSets

Currently, scaling is done proportionally over all ReplicaSets to mitigate the risk of losing
availability during a rolling update.

To calculate the new ReplicaSet size, we need to know:
- `replicasBeforeScale`: The `.spec.replicas` of the ReplicaSet before the scaling began.
- `deploymentMaxReplicas`: Equal to `.spec.replicas + .spec.strategy.rollingUpdate.maxSurge` of
  the current Deployment.
- `deploymentMaxReplicasBeforeScale`: Equal to
  `.spec.replicas + .spec.strategy.rollingUpdate.maxSurge` of the old Deployment. This information
  is stored in the `deployment.kubernetes.io/max-replicas` annotation in each ReplicaSet.

Then we can calculate a new size for each ReplicaSet proportionally as follows:

$$
newReplicaSetReplicas = replicasBeforeScale * \frac{deploymentMaxReplicas}{deploymentMaxReplicasBeforeScale}
$$

This is currently done in the [getReplicaSetFraction](https://github.com/kubernetes/kubernetes/blob/1cfaa95cab0f69ecc62ad9923eec2ba15f01fc2a/pkg/controller/deployment/util/deployment_util.go#L492-L512)
function. The leftover pods are added to the newest ReplicaSet.
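
To make the proportional calculation concrete, here is a minimal Go sketch. The function name, signature, and rounding details are illustrative; the authoritative logic is the linked `getReplicaSetFraction`:

```golang
package main

import (
	"fmt"
	"math"
)

// proportionalSize applies
// newReplicaSetReplicas = replicasBeforeScale * deploymentMaxReplicas / deploymentMaxReplicasBeforeScale.
func proportionalSize(replicasBeforeScale, deploymentMaxReplicas, deploymentMaxReplicasBeforeScale int32) int32 {
	if deploymentMaxReplicasBeforeScale == 0 {
		// The real controller handles this edge case separately.
		return replicasBeforeScale
	}
	ratio := float64(deploymentMaxReplicas) / float64(deploymentMaxReplicasBeforeScale)
	return int32(math.Round(float64(replicasBeforeScale) * ratio))
}

func main() {
	// T2 row of the first table below: Deployment max replicas grow from 110 to 130.
	fmt.Println(proportionalSize(30, 130, 110)) // RS2: 35
	fmt.Println(proportionalSize(20, 130, 110)) // RS3: 24
	// The leftover pods needed to reach the Deployment's new .spec.replicas
	// are then assigned as described above, which brings RS1 to 61.
}
```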

This results in the following scaling behavior.

The first scale operation occurs at T2 and the second scale at T3.

| Time | Terminating Pods | RS1 Replicas | RS2 Replicas | RS3 Replicas | All RS Total | Deployment .spec.replicas | Deployment .spec.replicas + MaxSurge | Scale ratio |
|------|------------------|--------------|--------------|--------------|--------------|---------------------------|--------------------------------------|-------------|
| T1 | any amount | 50 | 30 | 20 | 100 | 100 | 110 | - |
| T2 | any amount | 61 | 35 | 24 | 120 | 120 | 130 | 1.182 |
| T3 | any amount | 66 | 38 | 26 | 130 | 130 | 140 | 1.077 |

With the `TerminationComplete` PodReplacementPolicy, scaling cannot proceed immediately if there
are terminating pods present, in order to adhere to the Deployment constraints. We need to scale
some ReplicaSets fully and some only partially, and postpone the remaining scaling until the
terminating pods disappear.

A single scale operation occurs at T2.

| Time | Terminating Pods | RS1 Replicas | RS2 Replicas | RS3 Replicas | All RS Total | Deployment .spec.replicas | Deployment .spec.replicas + MaxSurge | Scale ratio |
|------|------------------|--------------|--------------|--------------|--------------|---------------------------|--------------------------------------|-------------|
| T1 | 15 | 50 | 30 | 20 | 100 | 100 | 110 | - |
| T2 | 15 | 61 | 34 | 20 | 115 | 120 | 130 | 1.182 |
| T3 | 5 | 61 | 35 | 24 | 120 | 120 | 130 | - |
| T4 | 0 | 61 | 35 | 24 | 120 | 120 | 130 | - |

To proceed with the scaling in the future (T3), we need to remember both `replicasBeforeScale` and
`deploymentMaxReplicasBeforeScale` to calculate the original scale ratio. The terminating pods can
take a long time to terminate, and there can be many steps and ReplicaSet updates between T2 and T3.
If we were to use the current number of ReplicaSet or Deployment replicas in any of these steps
(including T3), we would calculate an incorrect scale ratio.

- `deploymentMaxReplicasBeforeScale` is already stored in the `deployment.kubernetes.io/max-replicas`
  ReplicaSet annotation. The main change is that we need to keep the old Deployment max replicas
  value in the annotation until the partial scale for a ReplicaSet is complete.
- To remember `replicasBeforeScale`, we will introduce a new annotation called
  `deployment.kubernetes.io/replicaset-replicas-before-scale`, which will be added to the
  Deployment's ReplicaSets that are being partially scaled and removed once the partial scaling is
  complete. The annotation will be added and managed by the deployment controller.

These two ReplicaSet annotations will be used to calculate the original scale ratio for the partial
scaling.
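
A sketch of how the two annotations could be read to finish a postponed partial scale. The annotation keys are the ones discussed in this KEP, but the helper (`desiredSizeAfterScale`) and its signature are illustrative rather than actual controller code:

```golang
package main

import (
	"fmt"
	"math"
	"strconv"
)

const (
	maxReplicasAnnotation         = "deployment.kubernetes.io/max-replicas"
	replicasBeforeScaleAnnotation = "deployment.kubernetes.io/replicaset-replicas-before-scale"
)

// desiredSizeAfterScale recomputes a ReplicaSet's target size from the state
// captured before the (possibly still unfinished) scale started.
func desiredSizeAfterScale(rsAnnotations map[string]string, deploymentMaxReplicas int32) (int32, error) {
	before, err := strconv.ParseInt(rsAnnotations[replicasBeforeScaleAnnotation], 10, 32)
	if err != nil {
		return 0, fmt.Errorf("missing or invalid %s: %w", replicasBeforeScaleAnnotation, err)
	}
	maxBefore, err := strconv.ParseInt(rsAnnotations[maxReplicasAnnotation], 10, 32)
	if err != nil {
		return 0, fmt.Errorf("missing or invalid %s: %w", maxReplicasAnnotation, err)
	}
	ratio := float64(deploymentMaxReplicas) / float64(maxBefore)
	return int32(math.Round(float64(before) * ratio)), nil
}

func main() {
	// RS3 at T3 in the table below: its annotations still describe the T1 state
	// (20 replicas, Deployment max replicas 110), while the Deployment has since
	// been scaled twice, up to a max of 140.
	size, _ := desiredSizeAfterScale(map[string]string{
		maxReplicasAnnotation:         "110",
		replicasBeforeScaleAnnotation: "20",
	}, 140)
	fmt.Println(size) // 25 -> matches the T4 row once terminating pods allow it
}
```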

The following example shows a first scale at T2 and a second scale at T3.

| Time | Terminating Pods | RS1 Replicas | RS2 Replicas | RS3 Replicas | All RS Total | Deployment .spec.replicas | Deployment .spec.replicas + MaxSurge | Scale ratio |
|------|------------------|--------------|--------------|--------------|--------------|---------------------------|--------------------------------------|-----------------------|
| T1 | 15 | 50 | 30 | 20 | 100 | 100 | 110 | - |
| T2 | 15 | 61 | 34 | 20 | 115 | 120 | 130 | 1.182 |
| T3 | 15 | 66 | 38 | 21 | 125 | 130 | 140 | 1.077 (1.273 from T1) |
| T4 | 5 | 67 | 38 | 25 | 130 | 130 | 140 | - |
| T5 | 0 | 67 | 38 | 25 | 130 | 130 | 140 | - |

- At T2, a full scale was done for RS1 with a ratio of 1.182. RS1 can then use the new scale ratio
  at T3 with a value of 1.077.
- RS2 has been partially scaled (1.182 ratio) and RS3 has not been scaled at all at T2 due to the
  terminating pods. When a new scale occurs at T3, RS2 and RS3 have not yet completed the first
  scale, so their annotations still point to the T1 state. A new ratio of 1.273 (the T3 maximum of
  140 divided by the T1 value of 110 preserved in the annotations) is calculated and used for the
  second scale.

As we can see, we will get a slightly different result when compared to the first table. This is
due to the consecutive scales and the fact that the last scale is not yet fully completed.

The consecutive partial scaling behavior is a best effort. We still adhere to all deployment
constraints and have a bias toward scaling the newest ReplicaSet. To implement this properly, we
would have to introduce a full scaling history, which is probably not worth the added complexity.

### kubectl Changes

Similar to `deployment.kubernetes.io/max-replicas`, the
`deployment.kubernetes.io/replicaset-replicas-before-scale` annotation has to be added to
[annotationsToSkip](https://github.com/kubernetes/kubernetes/blob/9e2075b3c87061d25759b0ad112266f03601afd8/staging/src/k8s.io/kubectl/pkg/polymorphichelpers/rollback.go#L184)
so that rollbacks do not copy it from the ReplicaSet to the Deployment.
See [kubectl Skew](#kubectl-skew) for more details.
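
For illustration, a rough sketch of the kubectl-side change. This is simplified and not verbatim upstream code; the real list in rollback.go uses named constants rather than string literals:

```golang
package main

import "fmt"

// annotationsToSkip lists annotations that are preserved on the Deployment and
// not copied from the ReplicaSet during a rollback (simplified sketch).
var annotationsToSkip = map[string]bool{
	"kubectl.kubernetes.io/last-applied-configuration": true,
	"deployment.kubernetes.io/revision":                true,
	"deployment.kubernetes.io/revision-history":        true,
	"deployment.kubernetes.io/desired-replicas":        true,
	"deployment.kubernetes.io/max-replicas":            true,
	"deprecated.deployment.rollback.to":                true,
	// Added by this KEP so a rollback does not copy stale scaling state
	// from the ReplicaSet back onto the Deployment.
	"deployment.kubernetes.io/replicaset-replicas-before-scale": true,
}

func main() {
	fmt.Println(annotationsToSkip["deployment.kubernetes.io/replicaset-replicas-before-scale"]) // true
}
```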

### API

```golang
@@ -493,6 +611,10 @@ test/e2e/storage/utils/deployment.go
- Test that a Deployment with `RollingUpdate` strategy and a `TerminationComplete` PodReplacementPolicy does not
  exceed the number of pods specified by `spec.replicas + .spec.strategy.rollingUpdate.maxSurge` when
  rolling out new revisions and/or scaling the deployment at any point in time.
- Test scaling of Deployments that are in the middle of a rollout (even with more than 2 revisions).
  Verify that scaling is done proportionally across all ReplicaSets when terminating pods are
  present. Scale these deployments in succession, even when the previous scale has not yet
  completed.

<!--
- <test>: <link to test coverage>
@@ -504,6 +626,7 @@ test/e2e/storage/utils/deployment.go

- Feature gate disabled by default.
- Unit, enablement/disablement, e2e, and integration tests implemented and passing.
- Document the [kubectl Skew](#kubectl-skew) limitation for alpha.


#### Beta
@@ -512,6 +635,7 @@ test/e2e/storage/utils/deployment.go
- `.spec.podReplacementPolicy` is nil by default and preserves the original behavior.
- E2e and integration tests are in Testgrid and linked in the KEP.
- Add new metrics to `kube-state-metrics`.
- Remove the [kubectl Skew](#kubectl-skew) documentation notice that was introduced in alpha.


#### GA
@@ -535,6 +659,7 @@ If the feature is not enabled on the apiserver, and it is enabled in the kube-co
- The feature cannot be used for new workloads.
- Workloads that have the `.spec.podReplacementPolicy` field set will use the new behavior.

Also, as mentioned in [kubectl Skew](#kubectl-skew), kubectl skew is not supported in the alpha version.

## Production Readiness Review Questionnaire

@@ -585,11 +710,21 @@ By disabling the feature:
- Actors reading `.status.TerminatingReplicas` for ReplicaSets and Deployments will see the field
  omitted (observe 0 pods), once the status is reconciled by the controllers.

As mentioned in [kubectl Skew](#kubectl-skew), kubectl skew is not supported in alpha. If an older,
unsupported version of kubectl was used, it is important to remove the
`deployment.kubernetes.io/replicaset-replicas-before-scale` annotation from all Deployments and
ReplicaSets after disabling this feature. This should prevent any unexpected behavior on the next
enablement.
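
As an example of how the cleanup could be done (the annotation key is the one proposed by this KEP), run `kubectl annotate deployments --all deployment.kubernetes.io/replicaset-replicas-before-scale-` and the same for `replicasets` in each affected namespace; the trailing `-` removes the annotation.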

###### What happens if we reenable the feature if it was previously rolled back?

The ReplicaSet and Deployment controllers will start reconciling the `.status.TerminatingReplicas`
and behave according to the `.spec.podReplacementPolicy`.

Similar to the section above, it is important to make sure that the
`deployment.kubernetes.io/replicaset-replicas-before-scale` annotation is removed from all
Deployments and ReplicaSets before the re-enablement.

###### Are there any tests for feature enablement/disablement?

<!--
@@ -793,6 +928,7 @@ deployment and replicaset controllers.

- 2023-05-01: First version of the KEP opened (https://github.com/kubernetes/enhancements/pull/3974).
- 2023-12-12: Second version of the KEP opened (https://github.com/kubernetes/enhancements/pull/4357).
- 2024-05-29: Added the "Deployment Scaling Changes and a New Annotation for ReplicaSets" section (https://github.com/kubernetes/enhancements/pull/4670).

## Drawbacks

keps/sig-apps/3973-consider-terminating-pods-deployment/kep.yaml

Lines changed: 2 additions & 2 deletions
@@ -22,11 +22,11 @@ stage: alpha
2222
# The most recent milestone for which work toward delivery of this KEP has been
2323
# done. This can be the current (upcoming) milestone, if it is being actively
2424
# worked on.
25-
latest-milestone: "v1.30"
25+
latest-milestone: "v1.31"
2626

2727
# The milestone at which this feature was, or is targeted to be, at each stage.
2828
milestone:
29-
alpha: "v1.30"
29+
alpha: "v1.31"
3030
beta: ""
3131
stable: ""
3232
