
Commit cd4cf82

Merge pull request kubernetes#2472 from bobbypage/graceful-shutdown-beta-kep
KEP 2000: Update graceful shutdown KEP for beta
2 parents 8f19839 + 27f1a4e commit cd4cf82

3 files changed, +50 -3 lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+kep-number: 2000
+beta:
+  approver: "@johnbelamaric"

keps/sig-node/2000-graceful-node-shutdown/README.md

Lines changed: 36 additions & 1 deletion
@@ -646,7 +646,7 @@ you need any help or guidance.
 _This section must be completed when targeting alpha to a release._
 
 * **How can this feature be enabled / disabled in a live cluster?**
-- [ ] Feature gate (also fill in values in `kep.yaml`)
+- [X] Feature gate (also fill in values in `kep.yaml`)
 - Feature gate name: `GracefulNodeShutdown`
 - Components depending on the feature gate:
 - `kubelet`
@@ -696,17 +696,26 @@ _This section must be completed when targeting beta graduation to a release._
 Try to be as paranoid as possible - e.g., what if some components will restart
 mid-rollout?
 
+This feature should not impact rollouts.
+
 * **What specific metrics should inform a rollback?**
 
+N/A.
+
 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
 Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 
+The feature is part of kubelet config so updating kubelet config should
+enable/disable the feature; upgrade/downgrade is N/A.
+
 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
 fields of API types, flags, etc.?**
 Even if applying deprecation policies, they may still surprise some users.
 
+No.
+
 ### Monitoring Requirements
 
 _This section must be completed when targeting beta graduation to a release._
@@ -716,6 +725,8 @@ _This section must be completed when targeting beta graduation to a release._
 checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 
+Check if the feature gate and kubelet config settings are enabled on a node.
+
 * **What are the SLIs (Service Level Indicators) an operator can use to determine
 the health of the service?**
 - [ ] Metrics
@@ -725,6 +736,8 @@ the health of the service?**
 - [ ] Other (treat as last resort)
 - Details:
 
+N/A
+
 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
 At a high level, this usually will be in the form of "high percentile of SLI
 per day <= X". It's impossible to provide comprehensive guidance, but at the very
@@ -734,11 +747,15 @@ the health of the service?**
 job creation time) for cron job <= 10%
 - 99,9% of /health requests per day finish with 200 code
 
+N/A.
+
 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 
+N/A.
+
 ### Dependencies
 
 _This section must be completed when targeting beta graduation to a release._
@@ -757,6 +774,8 @@ _This section must be completed when targeting beta graduation to a release._
 - Impact of its outage on the feature:
 - Impact of its degraded performance or high-error rates on the feature:
 
+No, this feature doesn't depend on any specific services running the cluster.
+It only depends on systemd running on the node itself.
 
 ### Scalability
 
@@ -780,27 +799,37 @@ previous answers based on experience in the field._
 - periodic API calls to reconcile state (e.g. periodic fetching state,
 heartbeats, leader election, etc.)
 
+No.
+
 * **Will enabling / using this feature result in introducing new API types?**
 Describe them, providing:
 - API type
 - Supported number of objects per cluster
 - Supported number of objects per namespace (for namespace-scoped objects)
 
+No.
+
 * **Will enabling / using this feature result in any new calls to the cloud
 provider?**
 
+No.
+
 * **Will enabling / using this feature result in increasing size or count of
 the existing API objects?**
 Describe them, providing:
 - API type(s):
 - Estimated increase in size: (e.g., new annotation of size 32B)
 - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
 
+No.
+
 * **Will enabling / using this feature result in increasing time taken by any
 operations covered by [existing SLIs/SLOs]?**
 Think about adding additional work or introducing new steps in between
 (e.g. need to do X to start a container), etc. Please describe the details.
 
+No.
+
 * **Will enabling / using this feature result in non-negligible increase of
 resource usage (CPU, RAM, disk, IO, ...) in any components?**
 Things to keep in mind include: additional in-memory state, additional
@@ -809,6 +838,8 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
 This through this both in small and large cases, again with respect to the
 [supported limits].
 
+No.
+
 ### Troubleshooting
 
 The Troubleshooting section currently serves the `Playbook` role. We may consider
@@ -819,6 +850,8 @@ _This section must be completed when targeting beta graduation to a release._
 
 * **How does this feature react if the API server and/or etcd is unavailable?**
 
+The feature does not depend on the API server / etcd.
+
 * **What are other known failure modes?**
 For each of them, fill in the following information by copying the below template:
 - [Failure mode brief description]
@@ -836,6 +869,8 @@ _This section must be completed when targeting beta graduation to a release._
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 
+N/A.
+
 ## Implementation History
 
 <!--

keps/sig-node/2000-graceful-node-shutdown/kep.yaml

Lines changed: 11 additions & 2 deletions
@@ -6,18 +6,27 @@ authors:
 owning-sig: sig-node
 status: implementable
 creation-date: 2020-09-21
+reviewers:
+- "@SergeyKanzhelev"
+- "@karan"
+approvers:
+- "@dchen1107"
+- "@derekwaynecarr"
+prr-approvers:
+- "@johnbelamaric"
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.20"
+latest-milestone: "v1.21"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.20"
+  beta: "v1.21"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
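
Note on the README answers above: they say the feature is controlled through the `GracefulNodeShutdown` feature gate and kubelet config. As a minimal sketch of what that could look like on a node, the fragment below sets the gate and two shutdown settings. The field names `shutdownGracePeriod` and `shutdownGracePeriodCriticalPods` come from the kubelet's `KubeletConfiguration` API, not from this diff, and the durations are placeholder values chosen only for illustration.

```yaml
# Illustrative kubelet configuration fragment (not part of this commit).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  GracefulNodeShutdown: true          # feature gate named in the KEP
# Assumed example values; choose durations appropriate for the node.
shutdownGracePeriod: "30s"            # total time the node delays shutdown
shutdownGracePeriodCriticalPods: "10s" # portion reserved for critical pods
```

Under these assumptions, an operator could confirm the feature is active on a node by inspecting that node's kubelet configuration for the gate and for non-zero shutdown durations.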
