Skip to content

Commit 68d3c13

Browse files
committed
KEP-3902: update design to reflect name of the feature-flag and metrics used by the implementation
1 parent 8f94a4c commit 68d3c13

File tree

1 file changed

+42
-39
lines changed
  • keps/sig-scheduling/3902-decoupled-taint-manager

1 file changed

+42
-39
lines changed

keps/sig-scheduling/3902-decoupled-taint-manager/README.md

Lines changed: 42 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# KEP-3902: Decouple Taint-based Pod Eviction from Node Lifecycle Controller
1+
# KEP-3902: Decouple Taint-based Pod Eviction from Node Lifecycle Controller
22
<!-- toc -->
33
- [Release Signoff Checklist](#release-signoff-checklist)
44
- [Summary](#summary)
@@ -61,36 +61,38 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
6161

6262
In Kubernetes, `NodeLifecycleController` applies predefined `NoExecute` taints (e.g., `Unreachable`, `NotReady`) when the nodes are determined to be unhealthy. After the nodes get tainted, `TaintManager` does its due diligence to start deleting running pods on those nodes based on `NoExecute` taints, which can be added by anyone.
6363

64-
In this KEP, we propose to decouple `TaintManager` that performs taint-based pod eviction from `NodeLifecycleController` and make them two separate controllers: `NodeLifecycleController` to add taints to unhealthy nodes and `TaintManager` to perform pod deletion on nodes tainted with NoExecute effect.
64+
In this KEP, we propose to decouple `TaintManager` that performs taint-based pod eviction from `NodeLifecycleController` and make them two separate controllers: `NodeLifecycleController` to add taints to unhealthy nodes and `TaintEvictionController` to perform pod deletion on nodes tainted with NoExecute effect.
6565

66-
This separation not only improves code organization but also makes it easier to improve `TaintManager` or build custom `TaintManager`.
66+
This separation not only improves code organization but also makes it easier to improve `TaintEvictionController` or build custom implementations of the taint based eviction.
6767

6868
## Motivation
69-
`NodeLifecycleController` combines two independent functions:
70-
* Adding a pre-defined set of `NoExecute` taints to nodes based on the node conditions
69+
`NodeLifecycleController` combines two independent functions:
70+
* Adding a pre-defined set of `NoExecute` taints to nodes based on the node conditions
7171
* Performing pod eviction on `NoExecute` taints
7272

7373
Splitting the `NodeLifecycleController` based on the above functionalities will help to disentangle code and make future extensions to either component manageable.
7474

7575
### Goals
7676

77-
* Move taint-based eviction implementation out of `NodeLifecycleController` into a separate and independent taint-manager to enhance separation of concerns, maintainability.
77+
* Move taint-based eviction implementation out of `NodeLifecycleController` into a separate and independent taint-manager to enhance separation of concerns, maintainability.
7878

7979
### Non-Goals
8080

81-
The main focus of the KEP is on the separation of concerns between `NodeLifecycleController` and `TaintManager`, which will improve code organization and enable future extensions to `TaintManager`. While it is true that the separation allows cluster operators to build custom `TaintManager` and disable the default `TaintManager`, this is a side-effect of the change rather than the main goal. the KEP emphasizes the benefits of the separation of concerns and briefly mention the option to disable the default TaintManager as a potential side-effect.
81+
The main focus of the KEP is on the separation of concerns between `NodeLifecycleController` and `TaintEvictionController`, which will improve code organization and enable future extensions to `TaintEvictionController`. While it is true that the separation allows cluster operators to build custom `TaintEvictionController` and disable the default `TaintEvictionController`, this is a side-effect of the change rather than the main goal. the KEP emphasizes the benefits of the separation of concerns and briefly mention the option to disable the default TaintEvictionController as a potential side-effect.
82+
83+
8284

8385
The non-goals are the following.
8486
* Introduce extra maintenance burden or regression after the code reorganization
85-
* Introduce incompatible behavior of `NodeLifecycleController` or `TaintManager`
87+
* Introduce incompatible behavior of `NodeLifecycleController` or `TaintEvictionController`
8688
* Extend and enhance the current built-in APIs for customization of node taint-based eviction, including additional fields, more flexible communication protocols, webhooks and CRD references, etc.
87-
* Actual extension and enhancement of the current `TaintManager`.
89+
* Actual extension and enhancement of the current `TaintEvictionController`.
8890

8991
## Proposal
9092

9193
### User Stories (Optional)
9294

93-
While this split is for improving the code maintainability, an effect of this change is that it allows cluster-admins to extend and enhance the default `TaintManager` and even replace the default `TaintManager` with a custom implementation. However, discussing such use-cases for extending `TaintManager` or writing custom `TaintManager` is beyond the scope of this KEP.
95+
While this split is for improving the code maintainability, an effect of this change is that it allows cluster-admins to extend and enhance the default `TaintEvictionController` and even replace the default `TaintEvictionController` with a custom implementation. However, discussing such use-cases for extending `TaintEvictionController` or writing custom `TaintEvictionController` is beyond the scope of this KEP.
9496

9597
#### Story
9698

@@ -100,7 +102,7 @@ While this split is for improving the code maintainability, an effect of this ch
100102

101103
Compared with a single combined controller, the risks of using two separate controllers include the following.
102104
* Slightly increase the communication overhead from applying node taints to performing pod eviction.
103-
* Make cancellation of pod eviction and un-tainting nodes harder.
105+
* Make cancellation of pod eviction and un-tainting nodes harder.
104106

105107
However, the performance overhead of the split controllers is expected to be small, and pod eviction cancellation and un-tainting nodes are not common use cases. Splitting `NodeLifecycleController` is more beneficial by promoting a cleaner design and enhancing the overall maintainability and extensibility in the long term.
106108

@@ -114,20 +116,20 @@ In Kubernetes version 1.27 and earlier, the `NoExecuteTaintManager` component is
114116

115117
The proposed design refactors `NodeLifecycleController` into two independent controllers, which are managed by `kube-controller-manager`.
116118

117-
1. `NodeLifecycleController` monitors node health and adds `NotReady` and `Unreachable` taints to nodes
118-
2. `NoExecuteTaintManager` watches for node and pod updates, and performs `NoExecute` taint based pod eviction
119+
1. `NodeLifecycleController` monitors node health and adds `NotReady` and `Unreachable` taints to nodes
120+
2. `TaintEvictionController` watches for node and pod updates, and performs `NoExecute` taint based pod eviction
119121

120-
The existing kube-controller-manager flag `--controllers` can be used to optionally disable the `NoExecuteTaintManager`.
122+
The existing kube-controller-manager flag `--controllers` can be used to optionally disable the `TaintEvictionController`.
121123

122124
![New Controllers](new-controllers.png)
123125

124126
### Implementation
125127

126128
A new `NodeLifecycleManager` is implemented by removing taint-manager related code from `controller/nodelifecycle/node_lifecycle_controller.go`.
127129

128-
A new `NoExecuteTaintManager` is created as a top-level controller managed by` kube-controller-manager`. Its implementation is based on the current taint-manager from `controller/nodelifecycle/taint-manager.go`.
130+
A new `TaintEvictionController` is created as a top-level controller managed by` kube-controller-manager`. Its implementation is based on the current taint-manager from `controller/nodelifecycle/taint-manager.go`.
129131

130-
The creation and startup of the default taint-manager is performed by `kube-controller-manager`. A feature gate `SeparateTaintManager` controls whether to use the split `TaintManager` or roll back to the old `NodeLifecycleController`.
132+
The creation and startup of the default taint-manager is performed by `kube-controller-manager`. A feature gate `SeparateTaintEvictionController` controls whether to use the split `TaintEvictionController` or roll back to the old `NodeLifecycleController`.
131133

132134
### Test Plan
133135

@@ -139,22 +141,22 @@ Kubernetes already has a good coverage of node `No-Execute` eviction. We will ad
139141

140142
##### Unit tests
141143
- `pkg/controller/taintmanager`: 2023-06-03 - 0% (to add)
142-
- Enable and disable the new `TaintManager`
143-
- The new `TaintManager` acts on node taints properly
144-
- `pkg/controller/apis/config/v1alpha1`: 2023-06-03 - 0%
145-
- Test new configuration
144+
- Enable and disable the new `TaintEvictionController`
145+
- The new `TaintEvictionController` acts on node taints properly
146+
- `pkg/controller/apis/config/v1alpha1`: 2023-06-03 - 0%
147+
- Test new configuration
146148
- `pkg/controller/nodelifecycle`: 2023-06-03 - 74.8%
147149
- Test the combined controller
148150
- `pkg/controller/nodelifecycle/scheduler`: 2023-06-03 - 90%
149151
- Test the combined controller
150152

151153
##### Integration tests
152-
- Verify the ability to enable and disable the default `TaintManager`.
153-
- Verify the new `TaintManager`to act on node taints properly.
154+
- Verify the ability to enable and disable the default `TaintEvictionController`.
155+
- Verify the new `TaintEvictionController`to act on node taints properly.
154156

155157
##### E2E tests
156-
- Verify the new controllers to pass the existing E2E and conformance tests using the split `TaintManager`.
157-
- Manually test rollback and the `upgrade->downgrade->upgrade` path to verify it can pass the existing e2e tests.
158+
- Verify the new controllers to pass the existing E2E and conformance tests using the split `TaintEvictionController`.
159+
- Manually test rollback and the `upgrade->downgrade->upgrade` path to verify it can pass the existing e2e tests.
158160

159161
### Graduation Criteria
160162

@@ -164,7 +166,7 @@ Since there are no new API changes, we can skip Alpha and go to Beta directly.
164166

165167
#### Beta
166168

167-
* Support `kube-controller-manager` flag `--controllers=-taint-manager` to disable the default taint-manager.
169+
* Support `kube-controller-manager` flag `--controllers=-taint-eviction-controller` to disable the default `TaintEvictionController`.
168170
* Unit and e2e tests completed and passed.
169171
* Performance and scalability tests to verify there is non-negligible performance degradation in node taint-based eviction.
170172
* Update documents to reflect the changes.
@@ -175,7 +177,7 @@ Since there are no new API changes, we can skip Alpha and go to Beta directly.
175177

176178
### Upgrade / Downgrade Strategy
177179

178-
A feature gate `SeparateTaintManager` enables and disables the new feature. When the feature is turned on, a user can use `kube-controller-manager`'s flag to disable the default `TaintManager`.
180+
A feature gate `SeparateTaintEvictionController` enables and disables the new feature. When the feature is turned on, a user can use `kube-controller-manager`'s flag to disable the default `TaintEvictionController`.
179181

180182
### Version Skew Strategy
181183

@@ -186,14 +188,14 @@ A feature gate `SeparateTaintManager` enables and disables the new feature. Whe
186188
###### How can this feature be enabled / disabled in a live cluster?
187189

188190
- [X] Feature gate (also fill in values in `kep.yaml`)
189-
- Feature gate name: SeparateTaintManager
191+
- Feature gate name: SeparateTaintEvictionController
190192
- Components depending on the feature gate: kube-controller-manager
191193

192194
###### Does enabling the feature change any default behavior?
193-
No, the default `NodeLifecycleManager` and `TaintManager` behavior will stay the same.
195+
No, the default `NodeLifecycleManager` and `TaintEvictionController` behavior will stay the same.
194196

195197
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
196-
Yes. Once rolled back, `TaintManager` is enabled along with `NodeLifecycleController`.
198+
Yes. Once rolled back, `TaintEvictionController` is enabled along with `NodeLifecycleController`.
197199

198200
###### What happens if we reenable the feature if it was previously rolled back?
199201
The feature will work as usual.
@@ -215,28 +217,28 @@ A significantly changing number of pod evictions (`taint_manager_pod_evictions_t
215217

216218
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
217219
The upgrade will be tested in the planned unit and e2e tests. The rollback and upgrade-downgrade-upgrade path will
218-
be tested manually (see [Test Plan](#test-plan) section).
220+
be tested manually (see [Test Plan](#test-plan) section).
219221

220222
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
221223
No
222224

223225
### Monitoring Requirements
224226

225227
###### How can an operator determine if the feature is in use by workloads?
226-
It can be determined by if the feature gate `SeparateTaintManager` is used or not.
228+
It can be determined by if the feature gate `SeparateTaintEvictionController` is used or not.
227229

228230
###### How can someone using this feature know that it is working for their instance?
229231
- [x] Other (treat as last resort)
230-
- Details: Node taints and taint-based pod eviction should work as usual and there is no significant change in the number pod evictions and pod eviction latency.
232+
- Details: Node taints and taint-based pod eviction should work as usual and there is no significant change in the number pod evictions and pod eviction latency.
231233

232234
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
233235
The performance of node taint-based eviction should remain the same level as before.
234236

235237
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
236-
The metrics for both `NodeLifecycleController` and `TaintManager`'s queues should stay the same levels as before, the number of pod evictions and pod eviction latency.
238+
The metrics for both `NodeLifecycleController` and `TaintEvictionController`'s queues should stay the same levels as before, the number of pod evictions and pod eviction latency.
237239

238240
- [X] Metrics
239-
- Metric name: `taint_manager_pod_evictions_total`, `taint_manager_pod_eviction_latency`
241+
- Metric name: `taint_eviction_controller_pod_evictions_total`, `taint_eviction_controller_pod_eviction_latency`
240242
- Components exposing the metric: `kube-controller-manager`
241243

242244
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -262,7 +264,7 @@ No
262264
No
263265

264266
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
265-
It may slightly increase the time of pod eviction.
267+
It may slightly increase the time of pod eviction.
266268

267269
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
268270
No
@@ -279,22 +281,23 @@ The same behavior as the current version: fail to add taint and/or evict pods on
279281
No
280282

281283
###### What steps should be taken if SLOs are not being met to determine the problem?
282-
If the pod eviction latency increases significantly, validate if the communication between `NodeLifecycleController` and `TaintManager` works. If the number of pod evictions is abnormal, run tests to verify the `TaintManager` works properly.
284+
If the pod eviction latency increases significantly, validate if the communication between `NodeLifecycleController` and `TaintEvictionController` works. If the number of pod evictions is abnormal, run tests to verify the `TaintEvictionController` works properly.
283285

284286
## Implementation History
285287

286-
* 2023-03-06: Initial KEP published.
288+
* 2023-03-06: Initial KEP published.
289+
* 2023-10-23: KEP updated to update name of the `SeparateTaintEvictionController` feature-flag and exposed metrics.
287290

288291
## Drawbacks
289292

290293
## Alternatives
291294

292-
Keeping the code as-is is an option, but it will make future extension of `TaintManager` harder and less flexible.
295+
Keeping the code as-is is an option, but it will make future extension of `TaintEvictionController` harder and less flexible.
293296

294297

295298
## Infrastructure Needed (Optional)
296299

297300
N/A
298301

299302
## Note
300-
`taint-manager`, `TaintManager` and `NoExecuteTaintManager` are used interchangeably in this doc.
303+
`taint-manager`, `TaintManager` and `TaintEvictionController` are used interchangeably in this doc.

0 commit comments

Comments
 (0)