- [Kubeflow Training Operator](https://github.com/kubeflow/training-operator/blob/da11d1116c29322c481d0b8f174df8d6f05004aa/pkg/apis/kubeflow.org/v1/common_types.go#L238-L239).
215
+
216
+
For now, these projects follow the decision taken in core k8s to make the
217
+
field immutable, avoiding the complexity of supporting mutability.
218
+
219
+
Altogether, we decided to keep the field immutable.
212
220
213
221
#### Use for MultiKueue
214
222
@@ -358,17 +366,15 @@ We skip synchronization of the Jobs with the "managedBy" field, if it has any
358
366
different value than `kubernetes.io/job-controller`. When the synchronization is skipped,
359
367
the name of the controller managing the Job object is logged.
360
368
361
-
We leave the particular place at which the synchronization is skipped as
362
-
implementation detail which can be determined during the implementation phase,
363
-
however, two candidate places are:
364
-
1. inside `syncJob` function
365
-
2. inside `enqueueSyncJobInternal` function
369
+
We skip the reconciliation inside the `syncJob` function
370
+
(see [here](https://github.com/kubernetes/kubernetes/blob/15d08bf7c8813b0533dc147a03d9f42aae735ecd/pkg/controller/job/job_controller.go#L819-L822)).
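The skip described above can be sketched roughly as follows. This is a minimal illustration, not the actual Job controller code: the `Job` struct, the `managedByExternalController` helper, and the boolean return value of `syncJob` are hypothetical simplifications, and the feature gate is modeled as a plain boolean parameter.

```go
package main

import "fmt"

// JobControllerName is the reserved value that denotes the built-in Job controller.
const JobControllerName = "kubernetes.io/job-controller"

// Job is a pared-down, hypothetical stand-in for the batch/v1 Job object.
type Job struct {
	Name      string
	ManagedBy *string // nil means the "managedBy" field is unset
}

// managedByExternalController returns the name of the external controller
// when the Job should be skipped by the built-in controller, or nil otherwise.
// With the feature gate disabled, every Job is reconciled as before.
func managedByExternalController(job *Job, jobManagedByEnabled bool) *string {
	if !jobManagedByEnabled {
		return nil
	}
	if job.ManagedBy != nil && *job.ManagedBy != JobControllerName {
		return job.ManagedBy
	}
	return nil
}

// syncJob reports whether the Job was reconciled (true) or skipped (false).
// When the Job is skipped, the name of the managing controller is logged.
func syncJob(job *Job, jobManagedByEnabled bool) bool {
	if name := managedByExternalController(job, jobManagedByEnabled); name != nil {
		fmt.Printf("Skip syncing Job %q: managed by %q\n", job.Name, *name)
		return false
	}
	// The regular reconciliation logic would run here.
	return true
}

func main() {
	custom := "example.com/custom-controller" // hypothetical external controller
	fmt.Println(syncJob(&Job{Name: "job-managed", ManagedBy: &custom}, true))
	fmt.Println(syncJob(&Job{Name: "job-regular"}, true))
}
```

Placing the check at the top of `syncJob` keeps the event handlers unchanged, at the cost of still enqueueing keys for Jobs that will be skipped.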
366
371
367
-
Note that, if we skip inside `enqueueSyncJobInternal` we may save on some memory
368
-
needed to needlessly enqueue the Job keys.
372
+
We will re-evaluate for [GA](#ga) whether to also skip the reconciliation within the
373
+
`enqueueSyncJobInternal` function for better performance. See the discussion in
374
+
[Skip reconciliation in the event handler](#skip-reconciliation-in-the-event-handler).
369
375
370
-
There is no validation for the values of the field beyond that of standard
371
-
permitted field values.
376
+
There is no validation for a value of the field beyond its format as described
377
+
in the [API](#API) comment above.
372
378
373
379
#### Job status validation
374
380
@@ -393,7 +399,7 @@ For that we plan to follow the approach described [below](#terminating-pods-and-
393
399
which extend the scope of the interim `FailureTarget` and `SuccessCriteriaMet`
394
400
conditions. We will also validate that the transition to `Failed` or `Complete`
395
401
condition is preceded by adding the `FailureTarget` or `SuccessCriteriaMet`
396
-
condition, respecively.
402
+
condition, respectively.
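As an illustration of the ordering rule, the check could look something like the sketch below. The `Condition` type and the `validateTerminalConditions` function are hypothetical simplifications, and the sketch assumes the rule is enforced by requiring the interim condition to be present in the new status whenever a terminal condition is newly added.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a Job status condition.
type Condition struct {
	Type   string
	Status bool // true corresponds to status "True"
}

// isTrue reports whether a condition of the given type is present and true.
func isTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType && c.Status {
			return true
		}
	}
	return false
}

// terminalPrerequisites pairs each terminal condition with the interim
// condition that must precede it.
var terminalPrerequisites = []struct{ terminal, interim string }{
	{"Failed", "FailureTarget"},
	{"Complete", "SuccessCriteriaMet"},
}

// validateTerminalConditions rejects a status update that newly adds a
// terminal condition without the matching interim condition being set.
func validateTerminalConditions(oldConds, newConds []Condition) error {
	for _, p := range terminalPrerequisites {
		added := !isTrue(oldConds, p.terminal) && isTrue(newConds, p.terminal)
		if added && !isTrue(newConds, p.interim) {
			return fmt.Errorf("cannot add %s=True without %s=True", p.terminal, p.interim)
		}
	}
	return nil
}

func main() {
	withoutInterim := []Condition{{Type: "Failed", Status: true}}
	fmt.Println(validateTerminalConditions(nil, withoutInterim))

	withInterim := []Condition{
		{Type: "FailureTarget", Status: true},
		{Type: "Failed", Status: true},
	}
	fmt.Println(validateTerminalConditions(nil, withInterim))
}
```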
397
403
398
404
Additionally, we are going to introduce a validation rule that the count of
399
405
ready `status.ready` pods is lower than or equal to the number of active `status.active`
@@ -480,21 +486,21 @@ The following scenarios related to [Terminating pods and terminal Job conditions
480
486
##### Integration tests
481
487
482
488
The following scenarios are covered:
483
-
- the Job controller reconciles jobs with the "managedBy" field equal to `kubernetes.io/job-controller`
484
-
- the Job controller reconciles jobs without the "managedBy" field
489
+
- the Job controller reconciles jobs with the "managedBy" field equal to `kubernetes.io/job-controller` ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2016))
490
+
- the Job controller reconciles jobs without the "managedBy" field ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2000))
485
491
- the Job controller does not reconcile a job with any other value of the "managedBy" field. In particular:
486
-
- it does not reset the status for a Job with `.spec.suspend=false`,
487
-
- it does not add the Suspended condition for a Job with `.spec.suspend=true`.
488
-
- the Job controller reconciles jobs with custom "managedBy" field when the feature gate is disabled
492
+
- it does not reset the status for a Job with `.spec.suspend=false` ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2044)),
493
+
- it does not add the Suspended condition for a Job with `.spec.suspend=true` ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2059)).
494
+
- the Job controller reconciles jobs with custom "managedBy" field when the feature gate is disabled ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2030))
489
495
- the Job controller correctly handles re-enablement of the feature gate ([link](https://github.com/kubernetes/kubernetes/blob/169a952720ebd75fcbcb4f3f5cc64e82fdd3ec45/test/integration/job/job_test.go#L1691))
490
-
- the `job_by_external_controller_total` metric is incremented when a new Job with custom "managedBy" is created
491
-
- the `job_by_external_controller_total` metric is not incremented for a new Job without "managedBy" or with default value
492
-
- the `job_by_external_controller_total` metric is not incremented for Job updates (regardless of the "managedBy")
496
+
- the `job_by_external_controller_total` metric is incremented when a new Job with custom "managedBy" is created ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2044-L2058))
497
+
- the `job_by_external_controller_total` metric is not incremented for a new Job without "managedBy" or with default value ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2000-L2029))
498
+
- the `job_by_external_controller_total` metric is not incremented for Job updates (regardless of the "managedBy") (tested indirectly as [here](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2000-L2029) the Job controller updates the Job status)
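The metric scenarios listed above boil down to counting only on creation, and only for a non-default "managedBy" value. A minimal sketch of that behavior, assuming a plain in-memory counter in place of the real metrics machinery (the `externalJobCounter` type and its methods are hypothetical):

```go
package main

import "fmt"

// JobControllerName is the reserved value that denotes the built-in Job controller.
const JobControllerName = "kubernetes.io/job-controller"

// externalJobCounter is a hypothetical in-memory stand-in for the
// job_by_external_controller_total metric.
type externalJobCounter struct {
	total int
}

// onJobCreate increments the counter only for a new Job whose "managedBy"
// field is set to something other than the built-in controller.
func (c *externalJobCounter) onJobCreate(managedBy *string) {
	if managedBy != nil && *managedBy != JobControllerName {
		c.total++
	}
}

// onJobUpdate intentionally leaves the counter untouched, regardless of the
// "managedBy" value, so that updates are never counted.
func (c *externalJobCounter) onJobUpdate(managedBy *string) {}

func main() {
	custom := "example.com/custom-controller" // hypothetical external controller
	builtin := JobControllerName

	counter := &externalJobCounter{}
	counter.onJobCreate(&custom)  // counted
	counter.onJobCreate(&builtin) // not counted: default value
	counter.onJobCreate(nil)      // not counted: field unset
	counter.onJobUpdate(&custom)  // not counted: update
	fmt.Println(counter.total)
}
```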
493
499
494
500
The following scenarios related to [Terminating pods and terminal Job conditions](#terminating-pods-and-terminal-job-conditions) are covered:
495
-
- `Failed` or `Complete` conditions are not added while there are still terminating pods
496
-
- `FailureTarget` is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded
497
-
- `SuccessCriteriaMet` is added when the `completions` are satisfied
501
+
- `Failed` or `Complete` conditions are not added while there are still terminating pods ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L1183))
502
+
- `FailureTarget` is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded ([link](https://github.com/kubernetes/kubernetes/blob/master/test/integration/job/job_test.go#L1253))
503
+
- `SuccessCriteriaMet` is added when the `completions` are satisfied ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L1355))
498
504
499
505
During the implementation more scenarios might be covered.
500
506
@@ -541,8 +547,8 @@ Second Alpha (1.31):
541
547
542
548
- Address reviews and bug reports from Beta users
543
549
- Re-evaluate the ideas of improving debuggability (like [extended `kubectl`](#debuggability), [dedicated condition](#condition-to-indicated-job-is-skipped), or [events](#event-indicating-the-job-is-skipped))
544
-
- Re-evaluate the support for mutability of the field
545
-
- Asses the fragmentation of the ecosystem. Look for other implementations of a job controller and asses their conformance with k8s.
550
+
- Re-evaluate the need to skip reconciliation in the event handlers to optimize performance
551
+
- Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and assess their conformance with k8s.
546
552
- Lock the feature gate
547
553
548
554
#### Deprecation
@@ -775,10 +781,10 @@ Describe manual testing that was done and the outcomes.
775
781
Longer term, we may want to require automated upgrade/rollback tests, but we
776
782
are missing a bunch of machinery and tooling and can't do that now.
777
783
-->
778
-
The Upgrade->downgrade->upgrade testing will be done manually prior to release
779
-
as Beta, with the following steps:
784
+
The upgrade->downgrade->upgrade path was tested manually using the 1.31 release
785
+
(Alpha), with the following steps:
780
786
781
-
1. Start the cluster with the `JobManagedBy` enabled for api server and control-plane.
787
+
1. Start the cluster with the `JobManagedBy` enabled for kube-apiserver and kube-controller-manager.
782
788
783
789
Then, create two long-running Jobs:
784
790
- `job-managed` with custom value of the "managedBy" field
@@ -788,13 +794,13 @@ Then, verify that:
788
794
- the `job-managed` does not get status updates from built-in controller. Update the status manually and observe it is not reset by the built-in controller.
789
795
- the `job-regular` starts making progress (creates pods and updates the status accordingly by the built-in controller)
790
796
791
-
2. Simulate downgrade by disabling the feature for api server and control-plane.
797
+
2. Simulate downgrade by disabling the feature for kube-apiserver and kube-controller-manager.
792
798
793
799
Then, verify that:
794
800
- the `job-managed` starts to make progress, the status is reset, and updated to some new values
795
801
- the `job-regular` continues making progress
796
802
797
-
3. Simulate upgrade by re-enabling the feature for api server and control-plane.
803
+
3. Simulate upgrade by re-enabling the feature for kube-apiserver and kube-controller-manager.
798
804
799
805
Then, verify that:
800
806
- the `job-managed` stops getting status updates from the built-in controller. Update the status manually and observe it is not reset by the built-in controller.
@@ -1080,6 +1086,10 @@ N/A.
1080
1086
- 2024-03-08 - Merged [Follow up fix to the job status update test](https://github.com/kubernetes/kubernetes/pull/123815)
1081
1087
- 2024-03-11 - Merged [Adjust the Job field API comments and validation to the current state](https://github.com/kubernetes/kubernetes/pull/123792)
1082
1088
- 2024-05-16 - Merged [Fix the comment for the Job managedBy field](https://github.com/kubernetes/kubernetes/pull/124793)
1089
+
- 2024-06-11 - Merged [Count terminating pods when deleting active pods for failed jobs](https://github.com/kubernetes/kubernetes/pull/125175)
1090
+
- 2024-06-21 - Merged [Update the count of ready pods when deleting pods](https://github.com/kubernetes/kubernetes/pull/125546)
1091
+
- 2024-07-12 - Merged [Delay setting terminal Job conditions until all pods are terminal](https://github.com/kubernetes/kubernetes/pull/125510)
1092
+
- 2024-07-30 - Merged [Update the docs for JobManagedBy and JobPodReplacementPolicy related to pod termination](https://github.com/kubernetes/website/pull/46808)
1083
1093
1084
1094
<!--
1085
1095
Major milestones in the lifecycle of a KEP should be tracked in this section.
@@ -1100,6 +1110,26 @@ Why should this KEP _not_ be implemented?
1100
1110
1101
1111
## Alternatives
1102
1112
1113
+
### Skip reconciliation in the event handler
1114
+
1115
+
We discussed to skip the reconciliation only within the