Commit e06aa34

Merge pull request #5380 from zylxjtu/master
[ 4802] Graduate Windows node graceful shutdown from alpha to beta
2 parents 97ebcc9 + b2bef95 commit e06aa34

3 files changed: +68 -64 lines changed

keps/prod-readiness/sig-windows/4802.yaml

Lines changed: 2 additions & 0 deletions
@@ -4,3 +4,5 @@
 kep-number: 4802
 alpha:
   approver: "@deads2k"
+beta:
+  approver: "@deads2k"

keps/sig-windows/4802-windows-node-shutdown/README.md

Lines changed: 63 additions & 61 deletions
@@ -265,8 +265,9 @@ Until then, we will cover all the scenerios with e2e tests

 #### Alpha -> Beta Graduation

-* Addresses feedback from alpha testers
 * Sufficient E2E and unit testing
+* Adding [Windows node level test](https://github.com/kubernetes/kubernetes/pull/129938) , which will include the gracefulshutdown case.
+* [Enabling the test in CAPZ cluster](https://github.com/kubernetes-sigs/windows-testing/pull/506)

 #### Beta -> GA Graduation

@@ -292,7 +293,7 @@ n/a
 This section must be completed when targeting alpha to a release.
 -->

-###### How can this feature be enabled / disabled in a live cluster?
+* **How can this feature be enabled / disabled in a live cluster?**

 - [X] Feature gate (also fill in values in `kep.yaml`)
 - Feature gate name: `WindowsGracefulNodeShutdown`
@@ -301,58 +302,55 @@ This section must be completed when targeting alpha to a release.
 - Describe the mechanism:
 - Will enabling / disabling the feature require downtime of the control
 plane?
-No
+- No
 - Will enabling / disabling the feature require downtime or reprovisioning
 of a node?
-yes (will require restart of kubelet)
+- yes (will require restart of kubelet)

-###### Does enabling the feature change any default behavior?
+* **Does enabling the feature change any default behavior?**

-The main behavior change is that during a node shutdown, pods running on the
+* The main behavior change is that during a node shutdown, pods running on the
 node will be terminated gracefully.

-###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+* **Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?**

-Yes, the feature can be disabled by either disabling the feature gate, or
+* Yes, the feature can be disabled by either disabling the feature gate, or
 setting `kubeletConfig.ShutdownGracePeriod` to 0 seconds.

-###### What happens if we reenable the feature if it was previously rolled back?
+* **What happens if we reenable the feature if it was previously rolled back?**

-Kubelet will attempt to perform graceful termination of pods during a
-node shutdown.
+* Kubelet will attempt to perform graceful termination of pods during a
+node shutdown.

-###### Are there any tests for feature enablement/disablement?
+* **Are there any tests for feature enablement/disablement?**

-The e2e framework does not currently support enabling or disabling feature
-gates.
-We have e2e tests to cover the feature when it is enabled and some predefined
-setting.
-Will add node level integration tests when the node level test framework is available for Windows node
+* The e2e framework does not currently support enabling or disabling feature
+gates. We have e2e tests to cover the feature when it is enabled and some predefined
+setting. Will add node level integration tests when the node level test framework is
+available for Windows node

 ### Rollout, Upgrade and Rollback Planning

 <!--
 This section must be completed when targeting beta to a release.
 -->

-###### How can a rollout or rollback fail? Can it impact already running workloads?
+* **How can a rollout or rollback fail? Can it impact already running workloads?**

-It wil not impact running workloads during rollout/rollback.
+* It wil not impact running workloads during rollout/rollback.

-###### What specific metrics should inform a rollback?
+* **What specific metrics should inform a rollback?**

-n/a
-
-The failure of the roll out will behave like disbling this feature, operators can check the kubelet log to get more specific info.
+* The failure of the roll out will behave like disbling this feature, operators can check the kubelet log to get more specific info.
 ex: `The windows node graceful shutdown has not been enabled, the reasons are xxx`

-###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**

-This is basically how all features work so upgrade and downgrade apply as normal.
+* The feature is part of kubelet config so updating kubelet config should enable/disable the feature; upgrade/downgrade is N/A

-###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?**

-No
+* No

 ### Monitoring Requirements

@@ -363,11 +361,11 @@ For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->

-###### How can an operator determine if the feature is in use by workloads?
+* **How can an operator determine if the feature is in use by workloads?**

-Check if the feature gate and kubelet config settings are enabled on a node.
+* Check if the feature gate and kubelet config settings are enabled on a node.

-###### How can someone using this feature know that it is working for their instance?
+* **How can someone using this feature know that it is working for their instance?**

 - [ ] Events
 - Event Reason:
@@ -377,36 +375,36 @@ Check if the feature gate and kubelet config settings are enabled on a node.
 - [X] Other (treat as last resort)
 - Details: Pod.Status.Message, Pod.Status.Reason

-###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+* **What are the reasonable SLOs (Service Level Objectives) for the enhancement?**

-n/a
+* n/a

-###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**

 <!--
 Pick one more of these and delete the rest.
 -->

-- [ ] Metrics
-- Metric name:
+- [x] Metrics
+- Metric name: GracefulShutdownStartTime, GracefulShutdownEndTime
 - [Optional] Aggregation method:
-- Components exposing the metric:
-- [X] Other (treat as last resort)
+- Components exposing the metric: Kubelet
+- [x] Other (treat as last resort)
 - Details: The operator can get the service health information from the logs

-###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+* **Are there any missing metrics that would be useful to have to improve observability of this feature?**

-n/a
+* n/a

 ### Dependencies

 <!--
 This section must be completed when targeting beta to a release.
 -->

-###### Does this feature depend on any specific services running in the cluster?
+* **Does this feature depend on any specific services running in the cluster?**

-No, this feature doesn't depend on any specific services running the cluster.
+* No, this feature doesn't depend on any specific services running the cluster.

 ### Scalability

@@ -420,33 +418,33 @@ For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->

-###### Will enabling / using this feature result in any new API calls?
+* **Will enabling / using this feature result in any new API calls?**

-No
+* No

-###### Will enabling / using this feature result in introducing new API types?
+* **Will enabling / using this feature result in introducing new API types?**

-No
+* No

-###### Will enabling / using this feature result in any new calls to the cloud provider?
+* **Will enabling / using this feature result in any new calls to the cloud provider?**

-No
+* No

-###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+* **Will enabling / using this feature result in increasing size or count of the existing API objects?**

-No
+* No

-###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+* **Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?**

-No
+* No

-###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?**

-No
+* No

-###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+* **Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?**

-No
+* No

 ### Troubleshooting

@@ -461,17 +459,21 @@ splitting it into a dedicated `Playbook` document (potentially with some monitor
 details). For now, we leave it here.
 -->

-###### How does this feature react if the API server and/or etcd is unavailable?
+* **How does this feature react if the API server and/or etcd is unavailable?**

-The feature does not depend on the API server / etcd.
+* The feature does not depend on the API server / etcd.

-###### What are other known failure modes?
+* **What are other known failure modes?**

-n/a
+- Kubelet does not detect the shutdown e.g. due to kubelet is not started as a Windows service.
+- Detection: Kubelet logs
+- Mitigations: Workloads will not be affected, graceful node shutdown will not be enabled
+- Diagnostics: At default (v2) logging verbosity, kubelet will log if it is [running as a windows service](https://github.com/kubernetes/kubernetes/blob/b4e17418b340e161b8c6cc7f85a6e716abcb561a/pkg/windows/service/service.go#L130)
+- Testing: Working on adding SIG-Windows node level E2E tests check for graceful node shutdown including priority based shutdown

-###### What steps should be taken if SLOs are not being met to determine the problem?
+* **What steps should be taken if SLOs are not being met to determine the problem?**

-n/a
+* n/a

 ## Implementation History

keps/sig-windows/4802-windows-node-shutdown/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -16,17 +16,17 @@ approvers:
 see-also:
   - "/keps/sig-node/2000-graceful-node-shutdown"
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.32"
+latest-milestone: "v1.34"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.32"
-  beta: "v1.33"
+  beta: "v1.34"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
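
Reviewer note (illustrative, not part of the commit): the README answers in this commit say the feature is controlled by the `WindowsGracefulNodeShutdown` feature gate together with the kubelet's `ShutdownGracePeriod` setting, and that setting the grace period to 0 seconds disables it. A minimal sketch of a kubelet configuration file that would enable it might look like the following. The duration values are assumptions chosen for the example, and the use of `shutdownGracePeriodCriticalPods` on Windows is assumed from the Linux graceful node shutdown KEP listed under `see-also`.

```yaml
# Sketch only: durations below are hypothetical example values.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  WindowsGracefulNodeShutdown: true    # gate name taken from the KEP README
shutdownGracePeriod: 30s               # assumed value; 0s would disable graceful shutdown
shutdownGracePeriodCriticalPods: 10s   # assumed value; portion of the period reserved for critical pods
```

Per the README, changing these settings requires a kubelet restart, and the kubelet has to run as a Windows service for the shutdown to be detected.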
