Skip to content

Commit da9573a

Browse files
committed
review update
1 parent cf6306d commit da9573a

File tree

2 files changed

+42
-58
lines changed

2 files changed

+42
-58
lines changed

keps/sig-apps/2804-consolidate-workload-controllers-status/README.md

Lines changed: 36 additions & 51 deletions
Original file line numberDiff line numberDiff line change
@@ -58,7 +58,7 @@ If none of those approvers are still appropriate, then changes to that list
5858
should be approved by the remaining approvers and/or the owning SIG (or
5959
SIG Architecture for cross-cutting KEPs).
6060
-->
61-
# KEP-2804: Consolidate Workload controllers life cycle state
61+
# KEP-2804: Consolidate Workload controllers life cycle status
6262

6363
<!--
6464
This is the title of your KEP. Keep it short, simple, and descriptive. A good
@@ -87,13 +87,13 @@ tags, and then generate with `hack/update-toc.sh`.
8787
- [Story 1](#story-1)
8888
- [Story 2](#story-2)
8989
- [Overview of all conditions](#overview-of-all-conditions)
90+
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
9091
- [Proposed Conditions](#proposed-conditions)
9192
- [Progressing](#progressing)
9293
- [Complete](#complete)
9394
- [Failed](#failed)
9495
- [Available](#available)
9596
- [Batch Workloads Conditions: Waiting &amp; Running](#batch-workloads-conditions-waiting--running)
96-
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
9797
- [Risks and Mitigations](#risks-and-mitigations)
9898
- [Design Details](#design-details)
9999
- [Test Plan](#test-plan)
@@ -209,14 +209,14 @@ This enhancement extends the scope of those and other conditions to other contro
209209
### Goals
210210

211211
The current status of a workload can be depicted via its conditions. It can be a subset of:
212-
- Progressing
213-
- Available
214-
- ReplicaFailure
215-
- Suspended
216-
- Complete
217-
- Failed (to progress)
218-
- Waiting (Job only)
219-
- Running (Job only)
212+
- Progressing: designates the state of the latest rollout.
213+
- Available: designates if the required number of available replicas is `available`.
214+
- ReplicaFailure: ReplicaSet failed to create/delete a Pod.
215+
- Suspended: execution of a Job is suspended.
216+
- Complete: Job run to a completion, or rollout completed (via Progressing condition).
217+
- Failed: Job failed to complete, or rollout failed to progress (via Progressing condition).
218+
- Waiting (Job only): waiting for at least one Pod to run.
219+
- Running (Job only): at least one Pod of a Job is running.
220220

221221
Workload controllers should have above conditions (when applicable) to reflect their states.
222222

@@ -228,10 +228,11 @@ and make progress.
228228
-->
229229

230230
- Modifying the existing states of deployment controller
231-
- Changing the definition of States
231+
- Changing the definition of statuses
232232
- Introduce new api for existing conditions
233-
- To properly implement Progressing condition, `.spec.progressDeadlineSeconds` field has to be introduced in
233+
- To properly implement Progressing condition. `.spec.progressDeadlineSeconds` field has to be introduced as part of an additional KEP in
234234
DaemonSet and StatefulSet to describe the time when the controllers should declare the workload as `failed`.
235+
- consider adding Conditions field to CronJob
235236

236237
## Proposal
237238

@@ -256,11 +257,11 @@ bogged down.
256257
-->
257258

258259
#### Story 1
259-
As an end-user of Kubernetes, I'd like all my workload controllers to have consistent states
260+
As an end-user of Kubernetes, I'd like all my workload controllers to have consistent statuses.
260261

261262
#### Story 2
262263
As an developer building Kubernetes Operators, I'd like all my operators deployed to have
263-
consistent states.
264+
consistent statuses.
264265

265266

266267
### Overview of all conditions
@@ -280,6 +281,14 @@ The following table gives an overview on what conditions each of the workload re
280281

281282
**\*\* CronJob does not even have Conditions field in its Status**
282283

284+
### Notes/Constraints/Caveats (Optional)
285+
286+
As observed in some issues (https://github.com/kubernetes/website/pull/31226) and talking to the users, there is a misunderstanding about the meaning of the `Progressing` condition. These include:
287+
- Thinking that the `Progressing` condition reflects the state of the current Deployment instead of the last rollout. Which includes expectation that the `Progressing` condition will keep checking availability of replicas and revert to `progressing`/`failed` state even after the `complete` state is reached. And that the progressing condition will thus also reflect any changes in scaling.
288+
- Confusion that ProgressDeadlineExceeded does not occur after the Deployment rollout completes when the availability of pods degrades before the `.spec.progressDeadlineSeconds` times out.
289+
290+
To summarize, the name of the `Progressing` condition doesn't indicate its true meaning that its main responsibility is the indication of rollouts, and it confuses the users.
291+
283292
### Proposed Conditions
284293

285294
<!--
@@ -322,7 +331,7 @@ Kubernetes marks a DaemonSet, ReplicaSet or Stateful as `complete` when it has t
322331
- No old or mischeduled replicas/pods for the DaemonSet, ReplicaSet or Stateful are running.
323332

324333
#### Failed
325-
In order to introduce this condition we need to come up with a new field called `.spec.progressDeadlineSeconds` which denotes the number of seconds the controller waits before indicating(in the workload controller status) that the controller progress has stalled.
334+
In order to introduce this condition we need to come up with a new field called `.spec.progressDeadlineSeconds` (additional KEP) which denotes the number of seconds the controller waits before indicating(in the workload controller status) that the controller progress has stalled.
326335

327336
There are many factors that can cause failure to happen like:
328337
- Insufficient quota
@@ -332,7 +341,7 @@ There are many factors that can cause failure to happen like:
332341

333342
See the [Kubernetes API Conventions](https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties) for more information on status conditions
334343

335-
Because of the number of changes that are involved as part of this effort, we are thinking of a phased approach where we introduce these conditions to DaemonSet controller behind a featuregate flag first in one release and then make similar changes to ReplicaSet and StatefulSet controller.
344+
Because of the number of changes that are involved as part of this effort, we are thinking of a phased approach where we introduce these conditions to DaemonSet controller first and then make similar changes to ReplicaSet and StatefulSet controller. We would graduate ExtraWorkloadConditions to beta once all the features and `progressDeadlineSeconds` KEP are implemented.
336345
This also needs some code refactoring of existing conditions for Deployment controller.
337346

338347
#### Available
@@ -353,21 +362,12 @@ Batch workloads behaviour does not properly map to the other workloads that are
353362
Kubernetes marks a Job as `waiting` if one of the following conditions is true:
354363

355364
- There are no Pods with phase `Running` and there is at least one Pod with phase `Pending`.
356-
- The Job is suspended.
357365

358366
Kubernetes marks a Job as `running` if there is at least one Pod with phase `Running`.
359367

360368
This KEP does not introduce CronJob conditions as it is difficult to define conditions that would describe CronJob behaviour in useful manner.
361369
In case the user is interested if there are any running Jobs, `.status.active` field should be sufficient.
362370

363-
### Notes/Constraints/Caveats (Optional)
364-
365-
As observed in some issues (https://github.com/kubernetes/website/pull/31226) and talking to the users, there is a misunderstanding about the meaning of the `Progressing` condition. These include:
366-
- Thinking that the `Progressing` condition reflects the state of the current Deployment instead of the last rollout. Which includes expectation that the `Progressing` condition will keep checking availability of replicas and revert to `progressing`/`failed` state even after the `complete` state is reached. And that the progressing condition will thus also reflect any changes in scaling.
367-
- Confusion that ProgressDeadlineExceeded does not occur after the Deployment rollout completes when the availability of pods degrades before the `.spec.progressDeadlineSeconds` times out.
368-
369-
To summarize, the name of the `Progressing` condition doesn't indicate its true meaning that its main responsibility is the indication of rollouts, and it confuses the users.
370-
371371
### Risks and Mitigations
372372

373373
<!--
@@ -381,7 +381,7 @@ How will UX be reviewed, and by whom?
381381
382382
Consider including folks who also work outside the SIG or subproject.
383383
-->
384-
We are proposing a new field called `.spec.progressDeadlineSeconds` to DaemonSet and StatefulSet whose default value will be set to the max value of int32 (i.e. 2147483647) by default, which means "no deadline".
384+
We are proposing a new field called `.spec.progressDeadlineSeconds` to DaemonSet and StatefulSet as part of a additional KEP whose default value will be set to the max value of int32 (i.e. 2147483647) by default, which means "no deadline".
385385
In this mode, DaemonSet and StatefulSet controllers will behave exactly like its current behavior but with no `Failed` mode as they're either `Progressing` or `Complete`.
386386
This is to ensure backward compatibility with current behavior. We will default to reasonable values for the controllers in a future release.
387387
Since a DaemonSet can make progress no faster than "healthy but not ready nodes", the default value for `progressDeadlineSeconds` can be set to 30 minutes but this value can vary depending on the node count in the cluster.
@@ -391,12 +391,9 @@ It is possible that we introduce a bug in the implementation. The bug can cause:
391391
- DaemonSet and StatefulSet controllers can be marked `Failed` even though rollout is in progress
392392
- The states could be misrepresented, for example a DaemonSet or StatefulSet can be marked `Progressing` when actually it is complete
393393

394-
The mitigation currently is that these features will be in Alpha stage behind featuregates for people to try out and give feedback. In Beta phase when
394+
The mitigation currently is that these features will be in Alpha stage behind `ExtraWorkloadConditions` featuregate for people to try out and give feedback. In Beta phase when
395395
these are enabled by default, people will only see issues or bugs when `progressDeadlineSeconds` is set to something greater than 0 and we choose default values for that field.
396396
Since people would have tried this feature in Alpha, we would have had time to fix issues.
397-
The featuregates we use are `DaemonSetConditions` for DaemonSet controller, `StatefulSetConditions` for StatefulSet controller.
398-
399-
400397

401398

402399
## Design Details
@@ -410,7 +407,7 @@ proposal will be implemented, this is the place to discuss them.
410407

411408
### Test Plan
412409
Unit and E2E tests will be added to cover the
413-
API validation, behavioral change of DaemonSet and StatefulSet with feature gate enabled and disabled.
410+
API validation, behavioral change of controllers with feature gate enabled and disabled.
414411

415412
- Validating all possible states of old and new conditions. Checking that the changes in underlying Pod statuses correspond to the conditions.
416413
- Testing `progressDeadlineSeconds` and feature gates.
@@ -442,6 +439,7 @@ when drafting this test plan.
442439
#### Alpha -> Beta Graduation
443440
- Gather feedback from end users
444441
- Tests are in Testgrid and linked in KEP
442+
- all new features in the following controllers should be implemented: ReplicaSet & ReplicationController, StatefulSet, DaemonSet and Job. To fully support `failed` state of a progressing condition, `progressDeadlineSeconds` KEP should be also fully implemented.
445443

446444
#### Beta -> GA Graduation
447445
- 2 examples of end users using this field
@@ -519,20 +517,11 @@ enhancement:
519517
cluster required to make on upgrade, in order to make use of the enhancement?
520518
-->
521519
- Upgrades
522-
When upgrading from a release without this feature, to a release with
523-
`progressDeadlineSeconds`, we will set `progressDeadlineSeconds` to max value of int32 (i.e. 2147483647). This would give users
524-
the same default behavior.
520+
When upgrading from a release without this feature, to a release with `ExtraWorkloadConditions` feature,
521+
we will set new conditions on the mentioned workloads.
525522
- Downgrades
526-
When downgrading from a release with this feature, to a release without
527-
`progressDeadlineSeconds`, there are two cases
528-
- If `progressDeadlineSeconds` is greater than 0 -- in this case kube-apiserver
529-
clears the `progressDeadlineSeconds` and the existing StatefulSet and DaemonSet controllers wouldn't honor
530-
`progressDeadlineSeconds` which is expected.
531-
- If `progressDeadlineSeconds` is equal to 0 -- in this case user wont see any
532-
difference in behavior.
533-
534-
We will ensure that the `progressDeadlineSeconds` field is properly validated
535-
before persisting. The validation includes checking for positive number for the `progressDeadlineSeconds` field.
523+
When downgrading from a release with this feature, to a release without,
524+
we will remove the extra conditions from workload objects.
536525

537526
### Version Skew Strategy
538527

@@ -588,15 +577,12 @@ Pick one of these and delete the rest.
588577
-->
589578

590579
- [x] Feature gate (also fill in values in `kep.yaml`)
591-
- Feature gate name: DaemonSetConditions, StatefulSetConditions
580+
- Feature gate name: ExtraWorkloadConditions
592581
- Components depending on the feature gate:
593582
- kube-controller-manager
594-
- kube-apiserver
595583

596584
###### Does enabling the feature change any default behavior?
597-
No. The default behavior won't change.
598-
The default values for newly introduced `progressDeadlineSeconds` fields might change when graduating it after getting feedback from users.
599-
Only new conditions will be added with no effect on existing conditions.
585+
No. The default behavior won't change. Only new conditions will be added with no effect on existing conditions.
600586

601587
<!--
602588
Any change of default behavior may be surprising to users or break existing
@@ -615,7 +601,7 @@ NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
615601

616602
###### What happens if we reenable the feature if it was previously rolled back?
617603

618-
The DaemonSet, StatefulSet controller starts respecting the `progressDeadlineSeconds` again.
604+
The controllers start updating the new workload conditions again.
619605

620606
###### Are there any tests for feature enablement/disablement?
621607

@@ -812,7 +798,6 @@ Describe them, providing:
812798
Yes.
813799
API type(s): DaemonSet, Deployment, Job, ReplicaSet, ReplicationController, StatefulSet
814800
- Estimated increase in size:
815-
- New field in DaemonSet and StatefulSet spec about 4 bytes
816801
- Add new conditions in Deployment, DaemonSet, Job, ReplicaSet, ReplicationController, StatefulSet status
817802
<!--
818803
Describe them, providing:

keps/sig-apps/2804-consolidate-workload-controllers-status/kep.yaml

Lines changed: 6 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
title: Consolidate Workload controllers life cycle state
1+
title: Consolidate Workload controllers life cycle status
22
kep-number: 2804
33
authors:
44
- "@ravisantoshgudimetla"
@@ -23,21 +23,20 @@ stage: alpha
2323
# The most recent milestone for which work toward delivery of this KEP has been
2424
# done. This can be the current (upcoming) milestone, if it is being actively
2525
# worked on.
26-
latest-milestone: "v1.23"
26+
latest-milestone: "v1.24"
2727

2828
# The milestone at which this feature was, or is targeted to be, at each stage.
2929
milestone:
30-
alpha: "v1.23"
31-
beta: "v1.24"
32-
stable: "v1.25"
30+
alpha: "v1.24"
31+
beta: "v1.26"
32+
stable: "v1.27"
3333

3434
# The following PRR answers are required at alpha release
3535
# List the feature gate name and the components for which it must be enabled
3636
feature-gates:
37-
- name: DaemonSetConditions, StatefulSetConditions
37+
- name: ExtraWorkloadConditions
3838
components:
3939
- kube-controller-manager
40-
- kube-apiserver
4140
disable-supported: true
4241

4342
# The following PRR answers are required at beta release

0 commit comments

Comments
 (0)