Commit f308272

carlory, alculquicondor, and macsko committed
promote KEP-3902 to stable
Co-authored-by: Aldo Culquicondor
Co-authored-by: Maciej Skoczeń
Signed-off-by: carlory
1 parent 334d04c commit f308272

3 files changed: 56 additions, 25 deletions
Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 3902
 beta:
   approver: "@wojtek-t"
+stable:
+  approver: "@wojtek-t"

keps/sig-scheduling/3902-decoupled-taint-manager/README.md

Lines changed: 47 additions & 22 deletions
@@ -45,17 +45,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://git.k8s.io/enhancements) (not the initial KEP PR)
 - [X] (R) KEP approvers have approved the KEP status as `implementable`
 - [X] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-- [ ] e2e Tests for all Beta API Operations (endpoints)
-- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
-- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website](https://git.k8s.io/website), for publication to [kubernetes.io](https://kubernetes.io/)
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [X] e2e Tests for all Beta API Operations (endpoints)
+- [X] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [X] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [X] (R) Graduation criteria is in place
+- [X] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [X] (R) Production readiness review completed
+- [X] (R) Production readiness review approved
+- [X] "Implementation History" section is up-to-date for milestone
+- [X] User-facing documentation has been created in [kubernetes/website](https://git.k8s.io/website), for publication to [kubernetes.io](https://kubernetes.io/)
+- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 ## Summary

@@ -140,24 +140,27 @@ The creation and startup of the default taint-manager is performed by `kube-cont
 Kubernetes already has a good coverage of node `No-Execute` eviction. We will add tests only for the changed code.

 ##### Unit tests
-- `pkg/controller/taintmanager`: 2023-06-03 - 0% (to add)
+- `pkg/controller/tainteviction`: 2024-09-18 - 81.8%
 - Enable and disable the new `TaintEvictionController`
 - The new `TaintEvictionController` acts on node taints properly
-- `pkg/controller/apis/config/v1alpha1`: 2023-06-03 - 0%
-- Test new configuration
-- `pkg/controller/nodelifecycle`: 2023-06-03 - 74.8%
+- `pkg/controller/nodelifecycle`: 2024-09-18 - 68% (target 80% before GA)
 - Test the combined controller
-- `pkg/controller/nodelifecycle/scheduler`: 2023-06-03 - 90%
+- `pkg/controller/nodelifecycle/scheduler`: 2024-09-18 - 85.3%
 - Test the combined controller

 ##### Integration tests
 - Verify the ability to enable and disable the default `TaintEvictionController`.
 - Verify the new `TaintEvictionController`to act on node taints properly.

+- `test/integration/node/lifecycle_test.go#TestEvictionForNoExecuteTaintAddedByUser`: [k8s-triage](https://storage.googleapis.com/k8s-triage/index.html?test=TestEvictionForNoExecuteTaintAddedByUser)
+- `test/integration/node/lifecycle_test.go#TestTaintBasedEvictions`: [k8s-triage](https://storage.googleapis.com/k8s-triage/index.html?test=TestTaintBasedEvictions)
+
 ##### E2E tests
 - Verify the new controllers to pass the existing E2E and conformance tests using the split `TaintEvictionController`.
 - Manually test rollback and the `upgrade->downgrade->upgrade` path to verify it can pass the existing e2e tests.

+- `test/e2e/node/taints.go`: [k8s-triage](https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling&test=NoExecuteTaintManager)
+
 ### Graduation Criteria

 #### Alpha
@@ -168,19 +171,29 @@ Since there are no new API changes, we can skip Alpha and go to Beta directly.

 * Support `kube-controller-manager` flag `--controllers=-taint-eviction-controller` to disable the default `TaintEvictionController`.
 * Unit and e2e tests completed and passed.
-* Performance and scalability tests to verify there is non-negligible performance degradation in node taint-based eviction.
 * Update documents to reflect the changes.
+* In the past, the taint-manager was running as an internal controller in the `NodeLifecycleManager`, so moving and renaming it as a new top-level controller has a low chance of degrading performance. And no bugs are reported in the current implementation.

 #### GA

-* Fix all reported bugs if any.
+* No negative feedback.

 ### Upgrade / Downgrade Strategy

 A feature gate `SeparateTaintEvictionController` enables and disables the new feature. When the feature is turned on, a user can use `kube-controller-manager`'s flag to disable the default `TaintEvictionController`.

+Regarding taint-based eviction, all versions of kube-controller-manager have this feature. The key difference is which controller implements the feature. The feature is implemented in the `NodeLifecycleController` before v1.29; the feature is implemented in a separate controller since v1.29.
+
+For upgrade/downgrade scenarios, only one instance of the controller is leader and does the taint-based eviction.
+* If the newer instance becomes a leader, the taint-based eviction will work as usual.
+* If the older instance becomes a leader, the taint-based eviction will work as usual.
+
+But if the cluster admin disables the default controller via the `kube-controller-manager`'s flag during upgrading, the taint-based eviction won't work once the newer instance becomes a leader.
+
 ### Version Skew Strategy

+This feature affects only the kube-controller-manager, so there is no issue with version skew with other components.
+
 ## Production Readiness Review Questionnaire

 ### Feature Enablement and Rollback
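
The hunk above relies on `--controllers=-taint-eviction-controller` to switch off the default `TaintEvictionController`. As a rough illustration of how such a selection list can be evaluated, here is a minimal Python sketch. It is a simplified model of the selection rules described in the `kube-controller-manager` help text (`*` enables all on-by-default controllers, `foo` enables `foo`, `-foo` disables `foo`), not the actual implementation. The function names and the `node-lifecycle-controller` entry used in the example are assumptions chosen for illustration.

```python
def parse_controllers_flag(value):
    """Split a --controllers value such as "*,-foo" into its entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def is_controller_enabled(name, controllers, disabled_by_default=frozenset()):
    """Decide whether `name` runs under a --controllers style selection.

    Assumed rules (simplified model, not the real code):
      - an exact entry "name" enables it,
      - an entry "-name" disables it,
      - otherwise it runs only if "*" is present and it is on by default.
    """
    has_star = False
    for entry in controllers:
        if entry == name:
            return True
        if entry == "-" + name:
            return False
        if entry == "*":
            has_star = True
    if not has_star:
        # Without "*", nothing is enabled implicitly.
        return False
    return name not in disabled_by_default


selection = parse_controllers_flag("*,-taint-eviction-controller")
print(is_controller_enabled("taint-eviction-controller", selection))  # False
print(is_controller_enabled("node-lifecycle-controller", selection))  # True
```

Under this model, a selection that lists only `-taint-eviction-controller` without `*` would leave every other controller disabled too, which is why the example pairs the exclusion with `*`.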
@@ -213,11 +226,22 @@ This section must be completed when targeting beta to a release.
 This is an opt-in feature, and it does not change any default behavior. Unless there is a bug in the implementation, a rollout can not fail. If a rollout does fail, running workloads will not be evicted properly on tainted nodes. We don't except a rollback can fail.

 ###### What specific metrics should inform a rollback?
-A significantly changing number of pod evictions (`taint_eviction_controller_pod_evictions_total`) and/or a substantial increase in pod eviction latency (`taint_eviction_controller_pod_deletion_duration_seconds`) in Kubernetes.
+A significantly changing number of pod evictions (`taint_eviction_controller_pod_deletions_total`) and/or a substantial increase in pod eviction latency (`taint_eviction_controller_pod_deletion_duration_seconds`) in Kubernetes.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
-The upgrade will be tested in the planned unit and e2e tests. The rollback and upgrade-downgrade-upgrade path will
-be tested manually (see [Test Plan](#test-plan) section).
+Manually tested. No issues were found when we enabled the feature gate -> disabled it ->
+re-enabled the feature gate.
+
+Upgrade->downgrade->upgrade testing was done manually using the following steps:
+
+1. Create a cluster in v1.29, the `SeparateTaintEvictionController` enabled by default.
+2. Run the existing e2e tests to verify the new controllers work as expected. Above mentioned metrics found.
+3. Simulate downgrade by modifying control-plane manifests to disable `SeparateTaintEvictionController`.
+4. Run the existing e2e tests to verify the current behavior without the new controllers.
+5. Simulate upgrade by modifying control-plane manifests to enable `SeparateTaintEvictionController`.
+6. Repeat step 2.
+
+The e2e tests are mentioned in the [Test Plan](#test-plan) section.

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 No
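
Steps 3 and 5 of the manual upgrade->downgrade->upgrade test above toggle `SeparateTaintEvictionController` by editing control-plane manifests. The KEP does not show the exact edit, so the following is only a sketch under the assumption that the gate is carried in the `--feature-gates` argument of the `kube-controller-manager` command line. The helper name `set_feature_gate` and the `Foo` gate are hypothetical.

```python
def set_feature_gate(args, gate, enabled):
    """Return a copy of `args` with `gate` set inside --feature-gates.

    Assumption: the manifest passes gates as a single
    "--feature-gates=Name=true,Other=false" argument.
    """
    prefix = "--feature-gates="
    wanted = "true" if enabled else "false"
    result = []
    found = False
    for arg in args:
        if not arg.startswith(prefix):
            result.append(arg)
            continue
        found = True
        gates = {}
        for pair in arg[len(prefix):].split(","):
            if pair:
                key, _, val = pair.partition("=")
                gates[key] = val
        gates[gate] = wanted  # overwrite in place or append at the end
        result.append(prefix + ",".join(f"{k}={v}" for k, v in gates.items()))
    if not found:
        result.append(f"{prefix}{gate}={wanted}")
    return result


# Hypothetical command line taken from a control-plane manifest.
args = ["kube-controller-manager", "--leader-elect=true", "--feature-gates=Foo=true"]
downgraded = set_feature_gate(args, "SeparateTaintEvictionController", False)  # step 3
upgraded = set_feature_gate(downgraded, "SeparateTaintEvictionController", True)  # step 5
print(downgraded[-1])
print(upgraded[-1])
```

The function returns a new list rather than mutating its input, so the original arguments stay available for comparison between the simulated downgrade and upgrade.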
@@ -238,7 +262,7 @@ The performance of node taint-based eviction should remain the same level as bef
 The metrics for both `NodeLifecycleController` and `TaintEvictionController`'s queues should stay the same levels as before, the number of pod evictions and pod eviction latency.

 - [X] Metrics
-- Metric name: `taint_eviction_controller_pod_evictions_total`, `taint_eviction_controller_pod_deletion_duration_seconds`
+- Metric name: `taint_eviction_controller_pod_deletions_total`, `taint_eviction_controller_pod_deletion_duration_seconds`
 - Components exposing the metric: `kube-controller-manager`

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
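
The two metrics named in the hunk above are the rollback signals the KEP points to. One way to watch them is to compare two scrapes of the `kube-controller-manager` metrics endpoint. The sketch below parses Prometheus text-format output, assuming samples carry no trailing timestamps and that `taint_eviction_controller_pod_deletion_duration_seconds` is exposed as a histogram with `_sum` and `_count` series. The sample values are invented for illustration and are not measurements.

```python
def read_metric(exposition_text, metric_name):
    """Sum all samples of `metric_name` in Prometheus text-format output.

    Simplification: assumes each sample line ends with its value
    (no trailing timestamp).
    """
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
    return total


def relative_change(before, after):
    """Fractional change from `before` to `after`."""
    if before == 0:
        return float("inf") if after > 0 else 0.0
    return (after - before) / before


# Invented sample scrape, for illustration only.
SAMPLE = """\
# TYPE taint_eviction_controller_pod_deletions_total counter
taint_eviction_controller_pod_deletions_total 40
taint_eviction_controller_pod_deletion_duration_seconds_sum 12.5
taint_eviction_controller_pod_deletion_duration_seconds_count 40
"""

deletions = read_metric(SAMPLE, "taint_eviction_controller_pod_deletions_total")
latency_sum = read_metric(SAMPLE, "taint_eviction_controller_pod_deletion_duration_seconds_sum")
latency_count = read_metric(SAMPLE, "taint_eviction_controller_pod_deletion_duration_seconds_count")
mean_latency = latency_sum / latency_count
print(deletions, mean_latency)
```

What counts as a "significant" change is left to the operator; the KEP does not define a threshold, so `relative_change` only reports the ratio.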
@@ -287,6 +311,7 @@ If the pod eviction latency increases significantly, validate if the communicati

 * 2023-03-06: Initial KEP published.
 * 2023-10-23: KEP updated to update name of the `SeparateTaintEvictionController` feature-flag and exposed metrics.
+* 2025-04-24: KEP promoted to Stable.

 ## Drawbacks

keps/sig-scheduling/3902-decoupled-taint-manager/kep.yaml

Lines changed: 7 additions & 3 deletions
@@ -16,26 +16,29 @@ reviewers:
   - "@alculquicondor"
   - "@SergeyKanzhelev"
   - "@soltysh"
+  - "@macsko"
 approvers:
   - "@Huang-Wei"
   - "@alculquicondor"
   - "@SergeyKanzhelev"
   - "@soltysh"
+  - "@macsko"

 see-also:
 replaces:

 # The target maturity stage in the current dev cycle for this KEP.
-stage: beta
+stage: stable

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.29"
+latest-milestone: "v1.34"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   beta: "v1.29"
+  stable: "v1.34"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
@@ -47,4 +50,5 @@ disable-supported: true

 # The following PRR answers are required at beta release
 metrics:
-  - TBD
+  - taint_eviction_controller_pod_deletions_total
+  - taint_eviction_controller_pod_deletion_duration_seconds
