Today, only the Deployment controller has a [status](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#deployment-status) that fully reflects its state during its lifecycle.
This enhancement extends the scope of those and other conditions to other controllers (DaemonSet, Job, ReplicaSet & ReplicationController, StatefulSet).
### Goals
The current status of a workload can be depicted via its conditions. It can be a subset of:
- Progressing
- Available
- ReplicaFailure
- Suspended
- Complete
- Failed (to progress)
- Waiting (Job only)
- Running (Job only)
Workload controllers should have the above conditions (when applicable) to reflect their states.
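For orientation, the sketch below (in Go, using the existing `k8s.io/api/apps/v1` types) shows the shape such a condition already has on a Deployment today; the reason string and message are illustrative values, not output mandated by this KEP.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A Progressing condition as the Deployment controller reports it today.
	// This KEP proposes surfacing the same kind of conditions on the other
	// workload controllers.
	cond := appsv1.DeploymentCondition{
		Type:               appsv1.DeploymentProgressing,
		Status:             corev1.ConditionTrue,
		Reason:             "NewReplicaSetAvailable",
		Message:            "ReplicaSet \"example-7d4b9c\" has successfully progressed.", // illustrative
		LastUpdateTime:     metav1.Now(),
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("type=%s status=%s reason=%s\n", cond.Type, cond.Status, cond.Reason)
}
```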
### Non-Goals
### User Stories

As an end-user of Kubernetes, I'd like all my workload controllers to have consistent states.
As a developer building Kubernetes Operators, I'd like all my deployed operators to have consistent states.
### Overview of all conditions
The following table gives an overview of which conditions each of the workload resources supports.
| Resource | Progressing | Available | ReplicaFailure | Suspended | Complete | Failed |
|---|---|---|---|---|---|---|
| Deployment | True when scaling replicas / creating-updating new ReplicaSet / successfully finished progressing (Pods ready or available for MinReadySeconds). False when failed creating ReplicaSet / reached progressDeadlineSeconds. Unknown when rollout paused | True if required number of replicas is available (takes MaxSurge and MaxUnavailable into account) | Failure propagated from new or old ReplicaSet | - | -* | -* |
**\* Success of the rollout is instead represented by a Progressing condition (status and reason)**
**\*\* CronJob does not even have a Conditions field in its Status**
### Notes/Constraints/Caveats (Optional)
<!--
What are some important details that didn't come across above?
Go into as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->
We are proposing 4 new conditions to be added to the following controllers:
- Available (DaemonSet, ReplicaSet & ReplicationController, StatefulSet)
- Progressing (DaemonSet, StatefulSet)
- Waiting (Job)
- Running (Job)
The definitions for the Progressing condition (including Failed/Complete) are similar to what we have for the [Deployment controller](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#deployment-status).
The following table indicates which conditions are currently available (`X`) and which conditions will be added (`A`).
#### Progressing

Kubernetes marks a DaemonSet or StatefulSet as `progressing` when:
- New DaemonSet or StatefulSet pods become Ready or available
#### Complete
Kubernetes marks a DaemonSet, ReplicaSet or StatefulSet as `complete` when it has the following characteristics (see the sketch after this list):
- All of the replicas/pods associated with the DaemonSet or StatefulSet have been updated to the latest version you've specified, meaning any updates you've requested have been completed.
- All of the replicas/pods associated with the DaemonSet, ReplicaSet or StatefulSet are available.
- No old or misscheduled replicas/pods for the DaemonSet, ReplicaSet or StatefulSet are running.
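As a rough illustration only, here is a minimal Go sketch of such a completeness check for a DaemonSet, built on the status fields that already exist; the helper name and exact comparisons are assumptions of this sketch, not the controller's actual logic.

```go
package workloadstatus

import appsv1 "k8s.io/api/apps/v1"

// daemonSetComplete sketches the "complete" characteristics listed above for a
// DaemonSet. The helper name and field comparisons are illustrative assumptions.
func daemonSetComplete(ds *appsv1.DaemonSet) bool {
	st := ds.Status
	return st.ObservedGeneration >= ds.Generation && // status reflects the latest spec
		st.UpdatedNumberScheduled == st.DesiredNumberScheduled && // all pods run the latest template
		st.NumberAvailable == st.DesiredNumberScheduled && // all pods are available
		st.NumberMisscheduled == 0 // no misscheduled pods remain
}
```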
#### Failed
In order to introduce this condition, we need to come up with a new field called `.spec.progressDeadlineSeconds`, which denotes the number of seconds the controller waits before indicating (in the workload controller status) that its progress has stalled.
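For illustration, here is a sketch of how the proposed field could be shaped, mirroring the `ProgressDeadlineSeconds` field that already exists on `DeploymentSpec`; the surrounding struct name is hypothetical and the final API shape would be settled during API review.

```go
package workloadstatus

// ProgressDeadlineSpec is a hypothetical container used only for this sketch;
// in practice the field would be added to DaemonSetSpec and StatefulSetSpec.
type ProgressDeadlineSpec struct {
	// The maximum time in seconds for the workload to make progress before the
	// controller reports, via a condition with reason ProgressDeadlineExceeded,
	// that progress has stalled. Defaults to max int32 (i.e. "no deadline").
	// +optional
	ProgressDeadlineSeconds *int32 `json:"progressDeadlineSeconds,omitempty"`
}
```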
There are many factors that can cause failure to happen, such as:
- Insufficient quota
- Readiness probe failures
- Image pull errors
- Failed to create/delete pod
See the [Kubernetes API Conventions](https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties) for more information on status conditions.
Because of the number of changes involved in this effort, we are taking a phased approach: we introduce these conditions to the DaemonSet controller behind a feature gate first, in one release, and then make similar changes to the ReplicaSet and StatefulSet controllers.
This also needs some refactoring of the existing condition handling in the Deployment controller.
#### Available
Kubernetes marks a ReplicaSet or StatefulSet as `available` when the number of available replicas reaches the number of desired replicas.
- This could be a bit confusing for ReplicaSets, since a Deployment can become available sooner than its ReplicaSet due to `maxUnavailable`.
- Available replicas is an alpha feature guarded by the `StatefulSetMinReadySeconds` gate in StatefulSets, but the value defaults to `ReadyReplicas` when the feature gate is disabled, so using it shouldn't be an issue.
Kubernetes marks a DaemonSet as `available` when `numberUnavailable` (i.e. `desiredNumberScheduled - numberAvailable`) is zero.
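A minimal Go sketch of the two availability checks described above, using the existing status fields; the helper names are assumptions for illustration, not the controllers' actual code.

```go
package workloadstatus

import appsv1 "k8s.io/api/apps/v1"

// statefulSetAvailable: the number of available replicas has reached the
// number of desired replicas.
func statefulSetAvailable(sts *appsv1.StatefulSet) bool {
	desired := int32(1) // the API default when .spec.replicas is unset
	if sts.Spec.Replicas != nil {
		desired = *sts.Spec.Replicas
	}
	return sts.Status.AvailableReplicas >= desired
}

// daemonSetAvailable: no scheduled pod is unavailable.
func daemonSetAvailable(ds *appsv1.DaemonSet) bool {
	return ds.Status.DesiredNumberScheduled-ds.Status.NumberAvailable == 0
}
```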
#### Waiting and Running

Batch workload behaviour does not map cleanly to the other workloads, which are expected to be always running (e.g. the `Progressing` condition and its behaviour).
- Jobs indicate a `Failed`/`Complete` state in standalone conditions, compared to the `Progressing` condition in other workloads.
- The `.spec.activeDeadlineSeconds` field is similar to `progressDeadlineSeconds`, but does not have a default value. It also resets on suspension, so its behaviour is a bit different.
Kubernetes marks a Job as `waiting` if one of the following conditions is true:
- There are no Pods with phase `Running` and there is at least one Pod with phase `Pending`.
- The Job is suspended.
Kubernetes marks a Job as `running` if there is at least one Pod with phase `Running`.
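A small Go sketch of the Waiting/Running determination described above; the helper names are assumptions, and `pendingPods`/`runningPods` stand for counts of the Job's Pods by phase that the controller would derive from the Pods it already tracks.

```go
package workloadstatus

import batchv1 "k8s.io/api/batch/v1"

// jobWaiting: no Running pods and at least one Pending pod, or the Job is suspended.
func jobWaiting(job *batchv1.Job, pendingPods, runningPods int32) bool {
	suspended := job.Spec.Suspend != nil && *job.Spec.Suspend
	return suspended || (runningPods == 0 && pendingPods > 0)
}

// jobRunning: at least one pod of the Job is in the Running phase.
func jobRunning(runningPods int32) bool {
	return runningPods > 0
}
```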
This KEP does not introduce CronJob conditions, as it is difficult to define conditions that would describe CronJob behaviour in a useful manner.
If the user is interested in whether there are any running Jobs, the `.status.active` field should be sufficient.
### Risks and Mitigations
<!--
How will UX be reviewed, and by whom?
Consider including folks who also work outside the SIG or subproject.
-->
We are proposing a new field called `.spec.progressDeadlineSeconds` for DaemonSet and StatefulSet, whose value will be set to the max value of int32 (i.e. 2147483647) by default, which means "no deadline".
In this mode, the DaemonSet and StatefulSet controllers will behave exactly as they do today, but with no `Failed` state, as they're either `Progressing` or `Complete`.
This is to ensure backward compatibility with current behavior. We will default to reasonable values for the controllers in a future release.
Since a DaemonSet can make progress no faster than "healthy but not ready nodes", the default value for `progressDeadlineSeconds` can be set to 30 minutes, but this value can vary depending on the node count in the cluster.
The value for StatefulSet can be longer than 10 minutes since it also involves provisioning storage and binding; this default value can be set to 15 minutes in the case of StatefulSet.
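A rough sketch of the backward-compatible defaulting described above; the helper name and the per-controller values are illustrative assumptions taken from the discussion in this section.

```go
package workloadstatus

import "math"

// defaultProgressDeadlineSeconds sketches the proposed defaulting: an unset
// field defaults to max int32 ("no deadline") so existing behavior is preserved.
func defaultProgressDeadlineSeconds(current *int32) *int32 {
	if current != nil {
		return current
	}
	noDeadline := int32(math.MaxInt32) // 2147483647, i.e. "no deadline"
	return &noDeadline
}

// Possible future defaults discussed above (illustrative only).
const (
	daemonSetProgressDeadlineSeconds   = int32(30 * 60) // 30 minutes
	statefulSetProgressDeadlineSeconds = int32(15 * 60) // 15 minutes
)
```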
It is possible that we introduce a bug in the implementation. The bug can cause:
- DaemonSets and StatefulSets can be marked `Failed` even though the rollout is in progress
- The states could be misrepresented; for example, a DaemonSet or StatefulSet could be marked `Progressing` when it is actually complete
The mitigation currently is that these features will be in the Alpha stage behind feature gates for people to try out and give feedback. In the Beta phase, when these are enabled by default, people will only see issues or bugs when `progressDeadlineSeconds` is set to something greater than 0 and we choose default values for that field.
Since people would have tried this feature in Alpha, we would have had time to fix issues.
The feature gates we use are `DaemonSetConditions` for the DaemonSet controller and `StatefulSetConditions` for the StatefulSet controller.
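For reference, a sketch of how these two gates could be declared using the standard `k8s.io/component-base/featuregate` helpers; the file location and spec values shown here are assumptions of this sketch.

```go
package features

import "k8s.io/component-base/featuregate"

// Gate names taken from this KEP; declaration style mirrors how Kubernetes
// components register feature gates.
const (
	DaemonSetConditions   featuregate.Feature = "DaemonSetConditions"
	StatefulSetConditions featuregate.Feature = "StatefulSetConditions"
)

// workloadConditionGates sketches the Alpha registration of the gates.
var workloadConditionGates = map[featuregate.Feature]featuregate.FeatureSpec{
	DaemonSetConditions:   {Default: false, PreRelease: featuregate.Alpha},
	StatefulSetConditions: {Default: false, PreRelease: featuregate.Alpha},
}
```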
<!--
CRI or CNI may require updating that component before the kubelet.
-->
None identified yet.
## Production Readiness Review Questionnaire
###### What happens if we reenable the feature if it was previously rolled back?

The DaemonSet and StatefulSet controllers start respecting `progressDeadlineSeconds` again.
###### Are there any tests for feature enablement/disablement?
Yes, unit, integration and e2e tests for the feature enabled and disabled.
###### Will enabling / using this feature result in any new API calls?
Yes
- Update of DaemonSet, Job, ReplicaSet, ReplicationController, StatefulSet status
- Controllers could potentially make additional update calls when syncing the resources
<!--
Describe them, providing:
- API call type (e.g. PATCH pods)
-->
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes.
API type(s): DaemonSet, Deployment, Job, ReplicaSet, ReplicationController, StatefulSet
- Estimated increase in size:
- New field in DaemonSet and StatefulSet spec: about 4 bytes
- Add new conditions in Deployment, DaemonSet, Job, ReplicaSet, ReplicationController, StatefulSet status
<!--
Describe them, providing:
- API type(s):
-->
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
TBD.
<!--
Look at the [existing SLIs/SLOs].
-->
## Drawbacks
Adds more complexity to Deployment, DaemonSet, Job, ReplicaSet, ReplicationController, StatefulSet controllers in terms of checking conditions and updating the conditions continuously.
<!--
Why should this KEP _not_ be implemented?
-->
## Alternatives
Continue to check AvailableReplicas, Replicas and other fields instead of having explicit conditions. This is not always foolproof and can cause problems.
<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->