Commit e50a22f — Add node resource hot-unplug KEP (parent 562dd75)
3 files changed: +424 −0
PRR approval file (3 additions):

```yaml
kep-number: 5578
alpha:
  approver: "@deads2k"
```
# KEP-5578: Node Resource Hot-Unplug

<!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.

Ensure the TOC is wrapped with
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) within one minor version of promotion to GA
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

<!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Glossary

Hotplug: Dynamically add compute resources (CPU, memory, swap capacity and hugepages) to the node, either via software (bringing offlined resources online) or via hardware (physical addition while the system is running).

Hotunplug: Dynamically remove compute resources (CPU, memory, swap capacity and hugepages) from the node, either via software (taking resources offline) or via hardware (physical removal while the system is running).

Node compute resource: CPU, memory, swap capacity and hugepages.

Node Resource Hotplug KEP: https://github.com/Karthik-K-N/enhancements/tree/node-resize/keps/sig-node/3953-node-resource-hot-plug

## Summary

This KEP builds on top of the Node Resource Hotplug [KEP](https://github.com/Karthik-K-N/enhancements/tree/node-resize/keps/sig-node/3953-node-resource-hot-plug) to facilitate hot-unplug of node compute resources.

## Motivation

Node Resource Hotplug provides the ability to increase the resources of a cluster on demand, without any downtime, during a surge of resource usage by workloads.
Node resource hot-unplug complements it: the motivation is to remove resources when they are no longer needed, again without downtime, for cost optimisation.

### Goals

* Achieve seamless node capacity reduction through resource hotunplug.

### Non-Goals

* Dynamically adjust system reserved and kube reserved values.
* Update the autoscaler to utilize resource hotplug.
* Re-balance workloads across the nodes.
* Update runtime/NRI plugins with host resource changes.

## Proposal

### User Stories (Optional)

#### Story 1

As a cluster administrator, I want to resize a Kubernetes node dynamically, so that I can quickly hot-unplug resources without waiting for nodes to be removed from the cluster.

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

- Workloads that depend on the initial node configuration, such as:
  - Workloads that spawn per-CPU processes (threads, worker pools, etc.)
  - Workloads that depend on CPU-memory relationships (e.g., processes that rely on NUMA alignment)
- Dependency on external libraries/device drivers to support CPU hot-unplug as a supported feature.
- Kubelet failure while re-running pod admission, although this is very unlikely to occur.

With hot-unplug of resources, the following will also be affected, though most of this is handled in the Node Resource Hotplug KEP:

- Change in swap limit
- Change in OOMScoreAdjust value

## Design Details

The diagram below shows the interaction between the kubelet, the node and cAdvisor.

```mermaid
sequenceDiagram
participant node
participant kubelet
participant cAdvisor-cache
participant machine-info
kubelet->>cAdvisor-cache: fetch
cAdvisor-cache->>machine-info: fetch
machine-info->>cAdvisor-cache: update
cAdvisor-cache->>kubelet: update
alt decrease in resource
kubelet->>node: recalculate and update OOMScoreAdj <br> and Swap limit of containers
kubelet->>node: re-initialize resource managers
kubelet->>node: node status update with new capacity
kubelet->>node: re-run pod admission
end
```

PoC implementation is available here: https://github.com/marquiz/kubernetes/commits/devel/resource-discovery-hot-unplug

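The "decrease in resource" branch above can be sketched as follows. This is a minimal, self-contained Go sketch, not the PoC code: the names (`Pod`, `oomScoreAdj`, `onCapacityDecrease`) are hypothetical, the readmission is a naive first-fit by memory request, and the OOM score formula is a simplified version of the kubelet's burstable-pod heuristic (`1000 - 1000*request/capacity`, clamped).

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a running pod.
type Pod struct {
	Name          string
	MemoryRequest int64 // bytes
}

// oomScoreAdj is a simplified version of the kubelet's burstable-pod
// heuristic: the larger the request relative to node capacity, the
// lower (more protected) the score. Clamped to [2, 999].
func oomScoreAdj(memoryRequest, memoryCapacity int64) int {
	adj := 1000 - (1000*memoryRequest)/memoryCapacity
	if adj < 2 {
		adj = 2
	}
	if adj > 999 {
		adj = 999
	}
	return int(adj)
}

// onCapacityDecrease mimics the sequence in the diagram: recompute
// per-pod OOM scores for the new capacity, then re-run a naive
// admission pass and report which pods no longer fit.
func onCapacityDecrease(pods []Pod, newCapacity int64) (scores map[string]int, rejected []string) {
	scores = map[string]int{}
	var used int64
	for _, p := range pods {
		scores[p.Name] = oomScoreAdj(p.MemoryRequest, newCapacity)
		if used+p.MemoryRequest <= newCapacity {
			used += p.MemoryRequest
		} else {
			rejected = append(rejected, p.Name)
		}
	}
	return scores, rejected
}

func main() {
	gib := int64(1 << 30)
	pods := []Pod{{"db", 4 * gib}, {"web", 2 * gib}, {"batch", 3 * gib}}
	// Node capacity shrank to 8Gi: scores are recomputed, "batch" no longer fits.
	scores, rejected := onCapacityDecrease(pods, 8*gib)
	fmt.Println(scores["db"], rejected) // → 500 [batch]
}
```

In the real kubelet this branch would also re-initialize the resource managers (CPU, memory, topology) against the updated cAdvisor machine info before readmitting pods.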
### Test Plan

- [x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

- Add necessary tests in kubelet_node_status_test.go to check the node status behaviour with resource hot-unplug.
- Add necessary tests in kubelet_pods_test.go to check the pod cleanup and pod re-addition workflow.
- Add necessary tests in eventhandlers_test.go to check the scheduler behaviour with dynamic node capacity change.
- Add necessary tests in the resource managers to check the managers' behaviour adapting to dynamic node capacity change.

##### Integration tests

Necessary integration tests will be added.

##### e2e tests

The following scenarios need to be covered:

- Node resource information before and after resource hot-unplug, for the following sequences:
  - downsize -> upsize
  - downsize -> upsize -> downsize
  - upsize -> downsize
- State of running pods after hot-unplug of resources.

### Graduation Criteria

Phase 1: Alpha (target 1.36)

- Feature is disabled by default. It is opt-in and can be enabled via the NodeResourceHotUnPlug feature gate.
- Unit test coverage.
- E2E tests.
- Documentation describing the high-level design.

### Upgrade / Downgrade Strategy

Upgrade

To upgrade the cluster to use this feature, the kubelet should be updated with the feature gate enabled. Existing clusters are not impacted, as their node resources have already been established during cluster creation.

Downgrade

It's always possible to trivially downgrade to the previous kubelet.

### Version Skew Strategy

Not relevant, as this is a kubelet-specific feature and does not impact other components.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: NodeResourceHotUnPlug
  - Components depending on the feature gate: kubelet
- [ ] Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control
    plane?
  - Will enabling / disabling the feature require downtime or reprovisioning
    of a node?

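For illustration, enabling the gate through the kubelet configuration file could look like the fragment below. The `featureGates` field is the standard KubeletConfiguration mechanism; the gate name comes from this KEP, and the rest of the file is elided.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  NodeResourceHotUnPlug: true
```

The kubelet must be restarted for the change to take effect.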
###### Does enabling the feature change any default behavior?

No. This feature is guarded by a feature gate, and existing default behavior does not change if the feature is not used. Even with the feature gate enabled, the system continues to work the same way as long as the node configuration does not change.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. The feature can be disabled by restarting the kubelet with the feature gate off.

###### What happens if we reenable the feature if it was previously rolled back?

To reenable the feature, turn the feature gate on and restart the kubelet. With the feature reenabled, node resources can be hot-unplugged again, and the cluster is automatically updated with the new resource information.

###### Are there any tests for feature enablement/disablement?

Yes, the tests will be added along with the alpha implementation:

- Validate that a hot-unplug of resources from the machine is reflected at the node resource level.
- Validate that a hot-unplug of resources moves running pods into Pending when there is a lack of resources.

### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?

Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

A rollout may fail if the kubelet fails to re-run pod admission due to programmatic errors. In case of rollout failures, running workloads are generally not affected, though a workload might get OOM-killed and end up in a Pending state due to lack of resources.
A rollback failure should not affect running workloads.

###### What specific metrics should inform a rollback?

A significant increase in the node_hot_unplug_errors_total metric means the feature is not working as expected.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

It will be tested manually as part of the implementation, and there will also be automated tests to cover the scenarios.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

### Monitoring Requirements

Monitor the metrics:

- node_hot_unplug_request_total
- node_hot_unplug_errors_total

###### How can an operator determine if the feature is in use by workloads?

This feature is built into the kubelet and guarded by a feature gate. Examining the kubelet feature gates will help determine whether the feature is in use. The enablement of the kubelet feature gate can be determined from the kubernetes_feature_enabled metric.

In addition, the newly added metrics node_hot_unplug_request_total and node_hot_unplug_errors_total are incremented on hot-unplug of a resource and on failure to re-run pod admission, respectively.

###### How can someone using this feature know that it is working for their instance?

End users can hot-unplug a resource and verify that the change is reflected at the node resource level.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

For each node, the value of the metric node_hot_unplug_request_total is expected to match the number of times the node was hot-unplugged. For each node, the value of the metric node_hot_unplug_errors_total is expected to be zero.

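As an illustration, an operator-side check of these SLOs could scrape the kubelet's Prometheus text exposition and compare the two counters. A minimal sketch using only the Go standard library; the metric names come from this KEP, while `parseCounter`, `sloViolated` and the sample input are hypothetical:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCounter extracts the value of a counter from Prometheus
// text-exposition output, returning 0 if the metric is absent.
func parseCounter(exposition, name string) float64 {
	for _, line := range strings.Split(exposition, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "#") || !strings.HasPrefix(line, name) {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) == 2 {
			if v, err := strconv.ParseFloat(fields[1], 64); err == nil {
				return v
			}
		}
	}
	return 0
}

// sloViolated reports whether the node violates the SLOs stated above:
// errors must be zero and requests must match the expected unplug count.
func sloViolated(exposition string, expectedUnplugs float64) bool {
	errs := parseCounter(exposition, "node_hot_unplug_errors_total")
	reqs := parseCounter(exposition, "node_hot_unplug_request_total")
	return errs != 0 || reqs != expectedUnplugs
}

func main() {
	sample := "# TYPE node_hot_unplug_request_total counter\n" +
		"node_hot_unplug_request_total 3\n" +
		"node_hot_unplug_errors_total 0\n"
	fmt.Println(sloViolated(sample, 3)) // → false: SLOs are met
}
```

In practice the exposition would come from the kubelet's metrics endpoint rather than a string literal.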
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [x] Metrics
  - Metric names:
    - node_hot_unplug_request_total
    - node_hot_unplug_errors_total
  - Components exposing the metric: kubelet

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

- node_hot_unplug_request_total
- node_hot_unplug_errors_total

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

No, it does not depend on any service running in the cluster.

### Scalability

###### Will enabling / using this feature result in any new API calls?

No, it won't add or modify any user-facing APIs. Internally, the kubelet re-runs pod admission.

###### Will enabling / using this feature result in introducing new API types?

No

###### Will enabling / using this feature result in any new calls to the cloud provider?

No

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Negligible. In the case of resource hot-unplug, pod admission is re-run, which might take some time.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Negligible computational overhead might be introduced into the kubelet, as it needs to re-run pod admission after resource hot-unplug.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No. Since the feature only removes node resources, it won't result in resource exhaustion, as no new pods are added.

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

This feature is node-local and handled mainly in the kubelet; it has no dependency on etcd.
If there are running pods using the resources and resources are hot-unplugged, the kubelet will re-run pod admission, and it relies on the API server to fetch node information. Without access to the API server it cannot make accurate decisions, as the node resources are not updated.
Pending pods would remain in the same condition.

###### What are other known failure modes?

The main logic is the pod readmission during hot-unplug of resources. Failure scenarios can occur during the readmission.

###### What steps should be taken if SLOs are not being met to determine the problem?

If the SLOs are not being met, one can examine the kubelet logs; it is also advised not to hot-unplug node resources further until the problem is understood.

## Implementation History

## Drawbacks

Pod readmission is run during hot-unplug of resources. While the chance of failure is very low, it might cause workload disruption in case of an OOM kill.

## Alternatives

Scale down the cluster by removing compute nodes.

## Infrastructure Needed (Optional)

The VMs backing the nodes of the cluster should support hot-unplug of compute resources for e2e tests.
