Skip to content

Commit bdccc9c

Browse files
authored
Merge pull request kubernetes#3901 from yuanchen8911/master
KEP-3902: Decouple TaintManager from NodeLifeCycleController
2 parents 56e3e45 + 4438354 commit bdccc9c

File tree

5 files changed

+353
-0
lines changed

5 files changed

+353
-0
lines changed
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
kep-number: 3902
2+
beta:
3+
approver: "@wojtek-t"
Lines changed: 300 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,300 @@
1+
# KEP-3902: Decouple Taint-based Pod Eviction from Node Lifecycle Controller
2+
<!-- toc -->
3+
- [Release Signoff Checklist](#release-signoff-checklist)
4+
- [Summary](#summary)
5+
- [Motivation](#motivation)
6+
- [Goals](#goals)
7+
- [Non-Goals](#non-goals)
8+
- [Proposal](#proposal)
9+
- [User Stories (Optional)](#user-stories-optional)
10+
- [Story](#story)
11+
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
12+
- [Risks and Mitigations](#risks-and-mitigations)
13+
- [Design Details](#design-details)
14+
- [Proposed Controllers](#proposed-controllers)
15+
- [Implementation](#implementation)
16+
- [Test Plan](#test-plan)
17+
- [Prerequisite testing updates](#prerequisite-testing-updates)
18+
- [Unit tests](#unit-tests)
19+
- [Integration tests](#integration-tests)
20+
- [E2E tests](#e2e-tests)
21+
- [Graduation Criteria](#graduation-criteria)
22+
- [Alpha](#alpha)
23+
- [Beta](#beta)
24+
- [GA](#ga)
25+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
26+
- [Version Skew Strategy](#version-skew-strategy)
27+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
28+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
29+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
30+
- [Monitoring Requirements](#monitoring-requirements)
31+
- [Dependencies](#dependencies)
32+
- [Scalability](#scalability)
33+
- [Troubleshooting](#troubleshooting)
34+
- [Implementation History](#implementation-history)
35+
- [Drawbacks](#drawbacks)
36+
- [Alternatives](#alternatives)
37+
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
38+
- [Note](#note)
39+
<!-- /toc -->
40+
41+
## Release Signoff Checklist
42+
43+
Items marked with (R) are required *prior to targeting to a milestone / release*.
44+
45+
- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://git.k8s.io/enhancements) (not the initial KEP PR)
46+
- [X] (R) KEP approvers have approved the KEP status as `implementable`
47+
- [X] (R) Design details are appropriately documented
48+
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
49+
- [ ] e2e Tests for all Beta API Operations (endpoints)
50+
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
51+
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
52+
- [ ] (R) Graduation criteria is in place
53+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
54+
- [ ] (R) Production readiness review completed
55+
- [ ] (R) Production readiness review approved
56+
- [ ] "Implementation History" section is up-to-date for milestone
57+
- [ ] User-facing documentation has been created in [kubernetes/website](https://git.k8s.io/website), for publication to [kubernetes.io](https://kubernetes.io/)
58+
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
59+
60+
## Summary
61+
62+
In Kubernetes, `NodeLifecycleController` applies predefined `NoExecute` taints (e.g., `Unreachable`, `NotReady`) when the nodes are determined to be unhealthy. After the nodes get tainted, `TaintManager` does its due diligence to start deleting running pods on those nodes based on `NoExecute` taints, which can be added by anyone.
63+
64+
In this KEP, we propose to decouple `TaintManager` that performs taint-based pod eviction from `NodeLifecycleController` and make them two separate controllers: `NodeLifecycleController` to add taints to unhealthy nodes and `TaintManager` to perform pod deletion on nodes tainted with NoExecute effect.
65+
66+
This separation not only improves code organization but also makes it easier to improve `TaintManager` or build custom `TaintManager`.
67+
68+
## Motivation
69+
`NodeLifecycleController` combines two independent functions:
70+
* Adding a pre-defined set of `NoExecute` taints to nodes based on the node conditions
71+
* Performing pod eviction on `NoExecute` taints
72+
73+
Splitting the `NodeLifecycleController` based on the above functionalities will help to disentangle code and make future extensions to either component manageable.
74+
75+
### Goals
76+
77+
* Move taint-based eviction implementation out of `NodeLifecycleController` into a separate and independent taint-manager to enhance separation of concerns, maintainability.
78+
79+
### Non-Goals
80+
81+
The main focus of the KEP is on the separation of concerns between `NodeLifecycleController` and `TaintManager`, which will improve code organization and enable future extensions to `TaintManager`. While it is true that the separation allows cluster operators to build custom `TaintManager` and disable the default `TaintManager`, this is a side-effect of the change rather than the main goal. the KEP emphasizes the benefits of the separation of concerns and briefly mention the option to disable the default TaintManager as a potential side-effect.
82+
83+
The non-goals are the following.
84+
* Introduce extra maintenance burden or regression after the code reorganization
85+
* Introduce incompatible behavior of `NodeLifecycleController` or `TaintManager`
86+
* Extend and enhance the current built-in APIs for customization of node taint-based eviction, including additional fields, more flexible communication protocols, webhooks and CRD references, etc.
87+
* Actual extension and enhancement of the current `TaintManager`.
88+
89+
## Proposal
90+
91+
### User Stories (Optional)
92+
93+
While this split is for improving the code maintainability, an effect of this change is that it allows cluster-admins to extend and enhance the default `TaintManager` and even replace the default `TaintManager` with a custom implementation. However, discussing such use-cases for extending `TaintManager` or writing custom `TaintManager` is beyond the scope of this KEP.
94+
95+
#### Story
96+
97+
### Notes/Constraints/Caveats (Optional)
98+
99+
### Risks and Mitigations
100+
101+
Compared with a single combined controller, the risks of using two separate controllers include the following.
102+
* Slightly increase the communication overhead from applying node taints to performing pod eviction.
103+
* Make cancellation of pod eviction and un-tainting nodes harder.
104+
105+
However, the performance overhead of the split controllers is expected to be small, and pod eviction cancellation and un-tainting nodes are not common use cases. Splitting `NodeLifecycleController` is more beneficial by promoting a cleaner design and enhancing the overall maintainability and extensibility in the long term.
106+
107+
## Design Details
108+
109+
In Kubernetes version 1.27 and earlier, the `NoExecuteTaintManager` component is included within the `NodeLifecycleController`.
110+
111+
![Current Controller](current-nodelifecycle-controller.png)
112+
113+
### Proposed Controllers
114+
115+
The proposed design refactors `NodeLifecycleController` into two independent controllers, which are managed by `kube-controller-manager`.
116+
117+
1. `NodeLifecycleController` monitors node health and adds `NotReady` and `Unreachable` taints to nodes
118+
2. `NoExecuteTaintManager` watches for node and pod updates, and performs `NoExecute` taint based pod eviction
119+
120+
The existing kube-controller-manager flag `--controllers` can be used to optionally disable the `NoExecuteTaintManager`.
121+
122+
![New Controllers](new-controllers.png)
123+
124+
### Implementation
125+
126+
A new `NodeLifecycleManager` is implemented by removing taint-manager related code from `controller/nodelifecycle/node_lifecycle_controller.go`.
127+
128+
A new `NoExecuteTaintManager` is created as a top-level controller managed by` kube-controller-manager`. Its implementation is based on the current taint-manager from `controller/nodelifecycle/taint-manager.go`.
129+
130+
The creation and startup of the default taint-manager is performed by `kube-controller-manager`. A feature gate `SeparateTaintManager` controls whether to use the split `TaintManager` or roll back to the old `NodeLifecycleController`.
131+
132+
### Test Plan
133+
134+
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
135+
136+
##### Prerequisite testing updates
137+
138+
Kubernetes already has a good coverage of node `No-Execute` eviction. We will add tests only for the changed code.
139+
140+
##### Unit tests
141+
- `pkg/controller/taintmanager`: 2023-06-03 - 0% (to add)
142+
- Enable and disable the new `TaintManager`
143+
- The new `TaintManager` acts on node taints properly
144+
- `pkg/controller/apis/config/v1alpha1`: 2023-06-03 - 0%
145+
- Test new configuration
146+
- `pkg/controller/nodelifecycle`: 2023-06-03 - 74.8%
147+
- Test the combined controller
148+
- `pkg/controller/nodelifecycle/scheduler`: 2023-06-03 - 90%
149+
- Test the combined controller
150+
151+
##### Integration tests
152+
- Verify the ability to enable and disable the default `TaintManager`.
153+
- Verify the new `TaintManager`to act on node taints properly.
154+
155+
##### E2E tests
156+
- Verify the new controllers to pass the existing E2E and conformance tests using the split `TaintManager`.
157+
- Manually test rollback and the `upgrade->downgrade->upgrade` path to verify it can pass the existing e2e tests.
158+
159+
### Graduation Criteria
160+
161+
#### Alpha
162+
163+
Since there are no new API changes, we can skip Alpha and go to Beta directly.
164+
165+
#### Beta
166+
167+
* Support `kube-controller-manager` flag `--controllers=-taint-manager` to disable the default taint-manager.
168+
* Unit and e2e tests completed and passed.
169+
* Performance and scalability tests to verify there is non-negligible performance degradation in node taint-based eviction.
170+
* Update documents to reflect the changes.
171+
172+
#### GA
173+
174+
* Fix all reported bugs if any.
175+
176+
### Upgrade / Downgrade Strategy
177+
178+
A feature gate `SeparateTaintManager` enables and disables the new feature. When the feature is turned on, a user can use `kube-controller-manager`'s flag to disable the default `TaintManager`.
179+
180+
### Version Skew Strategy
181+
182+
## Production Readiness Review Questionnaire
183+
184+
### Feature Enablement and Rollback
185+
186+
###### How can this feature be enabled / disabled in a live cluster?
187+
188+
- [X] Feature gate (also fill in values in `kep.yaml`)
189+
- Feature gate name: SeparateTaintManager
190+
- Components depending on the feature gate: kube-controller-manager
191+
192+
###### Does enabling the feature change any default behavior?
193+
No, the default `NodeLifecycleManager` and `TaintManager` behavior will stay the same.
194+
195+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
196+
Yes. Once rolled back, `TaintManager` is enabled along with `NodeLifecycleController`.
197+
198+
###### What happens if we reenable the feature if it was previously rolled back?
199+
The feature will work as usual.
200+
201+
###### Are there any tests for feature enablement/disablement?
202+
No enablement/disablement tests are needed since this is in-memory feature and just regular tests with the feature being enabled/disabled do the job.
203+
204+
### Rollout, Upgrade and Rollback Planning
205+
206+
<!--
207+
This section must be completed when targeting beta to a release.
208+
-->
209+
210+
###### How can a rollout or rollback fail? Can it impact already running workloads?
211+
This is an opt-in feature, and it does not change any default behavior. Unless there is a bug in the implementation, a rollout can not fail. If a rollout does fail, running workloads will not be evicted properly on tainted nodes. We don't except a rollback can fail.
212+
213+
###### What specific metrics should inform a rollback?
214+
A significantly changing number of pod evictions (`taint_manager_pod_evictions_total`) and/or a substantial increase in pod eviction latency (`taint_manager_pod_eviction_latency`) in Kubernetes.
215+
216+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
217+
The upgrade will be tested in the planned unit and e2e tests. The rollback and upgrade-downgrade-upgrade path will
218+
be tested manually (see [Test Plan](#test-plan) section).
219+
220+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
221+
No
222+
223+
### Monitoring Requirements
224+
225+
###### How can an operator determine if the feature is in use by workloads?
226+
It can be determined by if the feature gate `SeparateTaintManager` is used or not.
227+
228+
###### How can someone using this feature know that it is working for their instance?
229+
- [x] Other (treat as last resort)
230+
- Details: Node taints and taint-based pod eviction should work as usual and there is no significant change in the number pod evictions and pod eviction latency.
231+
232+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
233+
The performance of node taint-based eviction should remain the same level as before.
234+
235+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
236+
The metrics for both `NodeLifecycleController` and `TaintManager`'s queues should stay the same levels as before, the number of pod evictions and pod eviction latency.
237+
238+
- [X] Metrics
239+
- Metric name: `taint_manager_pod_evictions_total`, `taint_manager_pod_eviction_latency`
240+
- Components exposing the metric: `kube-controller-manager`
241+
242+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
243+
No
244+
245+
### Dependencies
246+
247+
###### Does this feature depend on any specific services running in the cluster?
248+
No
249+
250+
### Scalability
251+
252+
###### Will enabling / using this feature result in any new API calls?
253+
No
254+
255+
###### Will enabling / using this feature result in introducing new API types?
256+
No
257+
258+
###### Will enabling / using this feature result in any new calls to the cloud provider?
259+
No
260+
261+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
262+
No
263+
264+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
265+
It may slightly increase the time of pod eviction.
266+
267+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
268+
No
269+
270+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
271+
No
272+
273+
### Troubleshooting
274+
275+
###### How does this feature react if the API server and/or etcd is unavailable?
276+
The same behavior as the current version: fail to add taint and/or evict pods on tainted nodes.
277+
278+
###### What are other known failure modes?
279+
No
280+
281+
###### What steps should be taken if SLOs are not being met to determine the problem?
282+
If the pod eviction latency increases significantly, validate if the communication between `NodeLifecycleController` and `TaintManager` works. If the number of pod evictions is abnormal, run tests to verify the `TaintManager` works properly.
283+
284+
## Implementation History
285+
286+
* 2023-03-06: Initial KEP published.
287+
288+
## Drawbacks
289+
290+
## Alternatives
291+
292+
Keeping the code as-is is an option, but it will make future extension of `TaintManager` harder and less flexible.
293+
294+
295+
## Infrastructure Needed (Optional)
296+
297+
N/A
298+
299+
## Note
300+
`taint-manager`, `TaintManager` and `NoExecuteTaintManager` are used interchangeably in this doc.
81.9 KB
Loading
Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
title: Decouple TaintManager from NodeLifeCycleController
2+
kep-number: 3902
3+
authors:
4+
- "@yuanchen8911"
5+
- "@atosatto"
6+
- "@ddebroy"
7+
- "@ravisantoshgudimetla"
8+
owning-sig: sig-scheduling
9+
participating-sigs:
10+
- sig-apps
11+
- sig-node
12+
status: implementable
13+
creation-date: 2023-03-07
14+
reviewers:
15+
- "@Huang-Wei"
16+
- "@alculquicondor"
17+
- "@SergeyKanzhelev"
18+
- "@soltysh"
19+
approvers:
20+
- "@Huang-Wei"
21+
- "@alculquicondor"
22+
- "@SergeyKanzhelev"
23+
- "@soltysh"
24+
25+
see-also:
26+
replaces:
27+
28+
# The target maturity stage in the current dev cycle for this KEP.
29+
stage: beta
30+
31+
# The most recent milestone for which work toward delivery of this KEP has been
32+
# done. This can be the current (upcoming) milestone, if it is being actively
33+
# worked on.
34+
latest-milestone: "v1.28"
35+
36+
# The milestone at which this feature was, or is targeted to be, at each stage.
37+
milestone:
38+
beta: "v1.28"
39+
40+
# The following PRR answers are required at alpha release
41+
# List the feature gate name and the components for which it must be enabled
42+
feature-gates:
43+
- name: SeparateTaintManager
44+
components:
45+
- kube-controller-manager
46+
disable-supported: true
47+
48+
# The following PRR answers are required at beta release
49+
metrics:
50+
- TBD
80.1 KB
Loading

0 commit comments

Comments
 (0)