
Commit 922efbc

Merge pull request kubernetes#2688 from krmayankk/maxun
Add PRR information for maxUnavailable
2 parents a9c9f99 + 11b2f1f commit 922efbc

File tree

3 files changed

+152
-19
lines changed

keps/prod-readiness/sig-apps/961.yaml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
kep-number: 961
2+
alpha:
3+
approver: "@wojtek-t"

keps/sig-apps/961-maxunavailable-for-statefulset/README.md

Lines changed: 141 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -12,12 +12,19 @@
1212
- [Story 1](#story-1)
1313
- [Implementation Details](#implementation-details)
1414
- [API Changes](#api-changes)
15-
- [Recommended Choice](#recommended-choice)
1615
- [Implementation](#implementation)
1716
- [Risks and Mitigations](#risks-and-mitigations)
1817
- [Upgrades/Downgrades](#upgradesdowngrades)
1918
- [Tests](#tests)
19+
- [Test Plan](#test-plan)
2020
- [Graduation Criteria](#graduation-criteria)
21+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
22+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
23+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
24+
- [Monitoring Requirements](#monitoring-requirements)
25+
- [Dependencies](#dependencies)
26+
- [Scalability](#scalability)
27+
- [Troubleshooting](#troubleshooting)
2128
- [Implementation History](#implementation-history)
2229
- [Drawbacks](#drawbacks)
2330
- [Alternatives](#alternatives)
@@ -150,15 +157,13 @@ complicated.
150157
Choice 4 provides a choice to the users and hence takes the guessing out of the picture on what they
151158
will expect. Implementing Choice 4 using PMP would be the easiest.
152159

153-
##### Recommended Choice
154-
155-
I recommend Choice 4, using PMP=Parallel for the first Alpha Phase. This would give the users fast
156-
rollouts without having them to second guess what the behavior should be. This choice also allows for
157-
easily extending the behavior with PMP=OrderedReady in future to choose either behavior 1 or 3.
158-
159160
#### Implementation
160161

161-
TBD: Will be updated after we have agreed on the semantics being discussed above.
162+
For the alpha release, we are going with Choice 4, with support for both PMP=Parallel and PMP=OrderedReady.
163+
For PMP=Parallel, we will use Choice 2.
164+
For PMP=OrderedReady, we will use Choice 3 to ensure we can support ordering guarantees while also
165+
making sure the rolling updates are fast.
166+
162167

163168
https://github.com/kubernetes/kubernetes/blob/v1.13.0/pkg/controller/statefulset/stateful_set_control.go#L504
164169
```go
@@ -234,13 +239,17 @@ tried this feature in Alpha, we would have time to fix issues.
234239

235240
### Upgrades/Downgrades
236241

237-
- Upgrades
238-
When upgrading from a release without this feature, to a release with maxUnavailable, we will set maxUnavailable to 1. This would give users the same default
239-
behavior they have to come to expect of in previous releases
240-
- Downgrades
241-
When downgrading from a release with this feature, to a release without maxUnavailable, there are two cases
242-
-- if maxUnavailable is greater than 1 -- in this case user can see unexpected behavior(Find out what is the recommendation here(Warning, disable upgrade, drop field, etc? )
243-
-- if maxUnavailable is less than equal to 1 -- in this case user wont see any difference in behavior
242+
The maxUnavailable field in StatefulSet will default to 1 for backward compatibility.
243+
244+
Downgrades
245+
246+
When downgrading from a release with this feature, to a release without maxUnavailable, there are two cases
247+
- If maxUnavailable is greater than 1, there are two more cases:
248+
- If you're rolling back to a release that doesn't have this field, there is no way to even discover the field.
249+
- If you're just disabling the feature (either together with a downgrade to a release that has the field, or without a downgrade), the field should remain set
250+
(unless someone explicitly deletes it later), but the controller should ignore it (and there shouldn't be a way to set it while the feature gate
251+
is switched off).
252+
- If maxUnavailable is less than or equal to 1, the user won't see any difference in behavior.
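The downgrade cases above can be read as a small decision rule. The following is a minimal sketch, assuming a hypothetical helper named `effectiveMaxUnavailable`; it is not the StatefulSet controller's actual code and only illustrates that a disabled feature gate or an unset field both fall back to the old one-pod-at-a-time behavior.

```go
package main

import "fmt"

// effectiveMaxUnavailable is a hypothetical helper (not part of Kubernetes).
// It returns how many pods may be unavailable during a rolling update:
//   - gate disabled: the stored field is ignored and the old behavior (1) applies
//   - field unset:   the default of 1 applies
//   - otherwise:     the value from the StatefulSet spec applies
func effectiveMaxUnavailable(gateEnabled bool, maxUnavailable *int) int {
	if !gateEnabled || maxUnavailable == nil {
		return 1
	}
	return *maxUnavailable
}

func main() {
	three := 3
	fmt.Println(effectiveMaxUnavailable(true, &three))  // gate on, field set: 3
	fmt.Println(effectiveMaxUnavailable(false, &three)) // gate off, field ignored: 1
	fmt.Println(effectiveMaxUnavailable(true, nil))     // field unset, default: 1
}
```

Because the field stays stored while the gate is off, re-enabling the gate in this sketch would make the stored value take effect again.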
244253

245254
### Tests
246255

@@ -254,11 +263,126 @@ tried this feature in Alpha, we would have time to fix issues.
254263
- maxUnavailable greater than 1 with partition and staged pods greater than maxUnavailable
255264
- maxUnavailable greater than 1 with partition and maxUnavailable greater than replicas
256265

266+
## Test Plan
267+
For `Alpha`, unit tests and e2e tests will be added to test functionality both
268+
with the feature flag enabled and disabled. Defaults will be verified so that users
269+
who do not set this flag are not surprised at all.
270+
271+
257272
## Graduation Criteria
258273

259-
- Alpha: Initial support for maxUnavailable in StatefulSets added. Disabled by default.
260-
- Beta: Enabled by default with default value of 1.
274+
- Alpha: Initial support for maxUnavailable in StatefulSets added. Disabled by default with default value of 1.
275+
- Beta: Enabled by default with default value of 1, with upgrade and downgrade tested at least manually.
276+
277+
278+
## Production Readiness Review Questionnaire
279+
280+
### Feature Enablement and Rollback
281+
282+
###### How can this feature be enabled / disabled in a live cluster?
283+
284+
- [x] Feature gate (also fill in values in `kep.yaml`)
285+
- Feature gate name: MaxUnavailableStatefulSet
286+
- Components depending on the feature gate: kube-apiserver and kube-controller-manager
287+
288+
###### Does enabling the feature change any default behavior?
289+
290+
No, the default behavior remains the same.
291+
292+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
293+
294+
Yes, this feature can be disabled. Once disabled, all existing StatefulSets will
295+
revert to the old behavior, where a rolling update proceeds one pod at a time.
296+
297+
###### What happens if we reenable the feature if it was previously rolled back?
298+
299+
We will restore the desired behavior for StatefulSets for which the maxUnavailable field wasn't deleted after
300+
the feature gate was disabled.
301+
302+
###### Are there any tests for feature enablement/disablement?
303+
Yes, there are unit tests which make sure the field is correctly dropped
304+
on feature enablement and disablement.
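As an illustration of what "dropping the field" could look like, here is a minimal sketch. It assumes the common pattern of clearing a gated field on incoming writes unless the stored object already uses it; the type `rollingUpdate` and the function `dropDisabledMaxUnavailable` are hypothetical stand-ins, not the real API server code.

```go
package main

import "fmt"

// rollingUpdate is a hypothetical, simplified stand-in for the
// rollingUpdate section of a StatefulSet update strategy.
type rollingUpdate struct {
	MaxUnavailable *int
}

// dropDisabledMaxUnavailable clears MaxUnavailable on an incoming object when
// the feature gate is off, unless the stored object already has the field set.
// Preserving an in-use field keeps it intact across a disable/re-enable cycle.
func dropDisabledMaxUnavailable(gateEnabled bool, incoming, stored *rollingUpdate) {
	if gateEnabled || incoming == nil {
		return
	}
	if stored != nil && stored.MaxUnavailable != nil {
		return // field already in use; do not drop it
	}
	incoming.MaxUnavailable = nil
}

func main() {
	three := 3

	// Gate off, new object: the field cannot be set.
	created := &rollingUpdate{MaxUnavailable: &three}
	dropDisabledMaxUnavailable(false, created, nil)
	fmt.Println(created.MaxUnavailable == nil)

	// Gate off, update of an object that already uses the field: it is kept.
	updated := &rollingUpdate{MaxUnavailable: &three}
	dropDisabledMaxUnavailable(false, updated, &rollingUpdate{MaxUnavailable: &three})
	fmt.Println(updated.MaxUnavailable != nil)
}
```

A unit test for enablement and disablement would exercise each combination of gate state and stored value in the same way.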
305+
306+
### Rollout, Upgrade and Rollback Planning
307+
308+
###### How can a rollout or rollback fail? Can it impact already running workloads?
309+
310+
A rollout or rollback of this feature can fail if there is a bug which causes the kube-apiserver or
311+
the kube-controller-manager to start crashing when the feature flag is enabled.
312+
313+
314+
Yes, it can impact already running workloads.
315+
316+
If a rolling update is in progress for a StatefulSet, while this feature is being enabled in kube-apiserver
317+
and kube-controller-manager, the StatefulSet controller can run into corner cases where it will take longer
318+
for the controller to converge. This will only happen if after enabling the feature, the customer also sets
319+
maxUnavailable to a number greater than 1, but the invariants and the logic will ensure that there are never more than
320+
maxUnavailable pods with the same identity and never more than maxUnavailable being deleted.
321+
322+
###### What specific metrics should inform a rollback?
323+
TODO when we reach Beta
261324

325+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
326+
Will be tested when graduating to Beta.
327+
328+
329+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
330+
No
331+
332+
### Monitoring Requirements
333+
334+
###### How can an operator determine if the feature is in use by workloads?
335+
If their StatefulSet rollingUpdate section has the field maxUnavailable specified with
336+
a value other than 1.
337+
The command below should show the maxUnavailable value:
338+
```
339+
kubectl get statefulsets -o yaml | grep maxUnavailable
340+
```
341+
342+
###### How can someone using this feature know that it is working for their instance?
343+
TODO when we reach Beta
344+
345+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
346+
347+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
348+
349+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
350+
351+
### Dependencies
352+
353+
###### Does this feature depend on any specific services running in the cluster?
354+
NA
355+
356+
### Scalability
357+
358+
###### Will enabling / using this feature result in any new API calls?
359+
It doesn't make any extra API calls.
360+
361+
###### Will enabling / using this feature result in introducing new API types?
362+
No
363+
364+
###### Will enabling / using this feature result in any new calls to the cloud provider?
365+
No
366+
367+
368+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
369+
A struct with three fields gets added to every StatefulSet object: a type discriminator, one 32-bit integer, and one field of type string.
370+
The struct in question is IntOrString.
371+
372+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
373+
No
374+
375+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
376+
The controller-manager will see a negligible, almost unnoticeable increase in CPU usage.
377+
378+
### Troubleshooting
379+
380+
###### How does this feature react if the API server and/or etcd is unavailable?
381+
The RollingUpdate will fail or will not be able to proceed if etcd or the API server is unavailable, and
382+
hence this feature cannot be used either.
383+
384+
###### What are other known failure modes?
385+
NA
262386

263387
## Implementation History
264388

keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -13,13 +13,19 @@ approvers:
1313
- "@kow3ns"
1414
editor: TBD
1515
creation-date: 2018-12-29
16-
last-updated: 2019-08-10
16+
last-updated: 2021-05-06
1717
status: implementable
1818
see-also:
1919
- n/a
2020
replaces:
2121
- n/a
2222
superseded-by:
2323
- n/a
24-
latest-milestone: "1.20"
24+
25+
# The milestone at which this feature was, or is targeted to be, at each stage.
26+
milestone:
27+
alpha: "v1.22"
28+
beta: "v1.23"
29+
stable: "v1.24"
30+
latest-milestone: "v1.22"
2531
stage: "alpha"
