
Commit b26c25d

Merge pull request kubernetes#3194 from mrunalp/pod_priority_shutdown_beta
KEP-2712: Update Pod Priority Based Graceful Shutdown kep to target beta
2 parents: a40b53b + ca1c1f3

3 files changed (+29 / −16 lines)
Lines changed: 2 additions & 0 deletions

@@ -1,3 +1,5 @@
 kep-number: 2712
 alpha:
   approver: "@deads2k"
+beta:
+  approver: "@deads2k"

keps/sig-node/2712-pod-priority-based-graceful-node-shutdown/README.md

Lines changed: 24 additions & 14 deletions
@@ -374,17 +374,23 @@ _This section must be completed when targeting beta graduation to a release._
 logs or events for this purpose.
 
 Check if the feature gate and kubelet config settings are enabled on a node.
+Kubelet will also expose the metrics described below.
 
 * **What are the SLIs (Service Level Indicators) an operator can use to determine
   the health of the service?**
-  - [ ] Metrics
-    - Metric name:
+  - [x] Metrics
+    - Metric name: GracefulShutdownStartTime, GracefulShutdownEndTime
     - [Optional] Aggregation method:
-    - Components exposing the metric:
+    - Components exposing the metric: Kubelet
   - [ ] Other (treat as last resort)
     - Details:
 
-N/A
+Kubelet records the start and end time of a graceful shutdown to local
+storage and exposes them as metrics for scraping. Depending on the scrape
+interval, a shutdown's metrics could be missed, but they are served again
+once the kubelet comes back online after the reboot. Operators can use
+these metrics to troubleshoot the feature across the nodes in a cluster.
+
 
 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
   At a high level, this usually will be in the form of "high percentile of SLI
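For illustration, once these metrics are served from the kubelet's Prometheus endpoint, a scrape taken after the node comes back up might look roughly like the sample below. The KEP names the metrics GracefulShutdownStartTime and GracefulShutdownEndTime; the exposition-style names, HELP text, and timestamp values here are hypothetical renderings, not taken from this diff.

# HELP graceful_shutdown_start_time_seconds Last graceful shutdown start time, in seconds since the Unix epoch
# TYPE graceful_shutdown_start_time_seconds gauge
graceful_shutdown_start_time_seconds 1.6475432e+09
# HELP graceful_shutdown_end_time_seconds Last graceful shutdown end time, in seconds since the Unix epoch
# TYPE graceful_shutdown_end_time_seconds gauge
graceful_shutdown_end_time_seconds 1.6475433e+09

An operator could, for example, alert on a start time with no matching end time, which would indicate a shutdown that began but did not complete within its grace period.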
@@ -395,7 +401,8 @@ N/A
     job creation time) for cron job <= 10%
   - 99,9% of /health requests per day finish with 200 code
 
-N/A.
+The graceful shutdown feature should function reliably, excluding cases of
+power failure or hardware failure on the nodes.
 
 * **Are there any missing metrics that would be useful to have to improve observability
   of this feature?**
@@ -502,15 +509,18 @@ The feature does not depend on the API server / etcd.
 
 * **What are other known failure modes?**
   For each of them, fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-      Not required until feature graduated to beta.
-    - Testing: Are there any tests for failure mode? If not, describe why.
+
+  - Kubelet does not detect the shutdown, e.g. because the systemd inhibitor fails to register.
+    - Detection: Kubelet logs
+    - Mitigations: Workloads are not affected; graceful node shutdown simply remains disabled
+    - Diagnostics: At the default (v2) logging verbosity, kubelet logs whether the inhibitor was registered
+    - Testing: Existing OSS SIG-Node E2E tests check graceful node shutdown, including priority-based shutdown
+
+  - Priority-based graceful node shutdown settings are not respected
+    - Detection: Kubelet logs contain the time allocated to each pod's shutdown
+    - Mitigations: Change the priority-based graceful node shutdown config, or revert to the existing beta graceful node shutdown config
+    - Diagnostics: Kubelet logging at the v2 level
+    - Testing: Existing OSS SIG-Node E2E tests check that pod shutdown time is respected depending on the config
 
 * **What steps should be taken if SLOs are not being met to determine the problem?**
 
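For reference, the kubelet configuration that the failure modes above refer to might look like the following sketch. The featureGates keys and the shutdownGracePeriodByPodPriority field come from the upstream kubelet documentation for graceful node shutdown; the priority bands and grace periods are illustrative values only.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  GracefulNodeShutdown: true
  GracefulNodeShutdownBasedOnPodPriority: true
# Each entry sets the shutdown budget for pods whose priority class value
# falls into the band starting at "priority" (values illustrative).
shutdownGracePeriodByPodPriority:
  - priority: 100000
    shutdownGracePeriodSeconds: 10
  - priority: 10000
    shutdownGracePeriodSeconds: 180
  - priority: 1000
    shutdownGracePeriodSeconds: 120
  - priority: 0
    shutdownGracePeriodSeconds: 60

The second mitigation above, reverting to the existing beta behavior, would amount to removing shutdownGracePeriodByPodPriority and configuring the older shutdownGracePeriod and shutdownGracePeriodCriticalPods fields instead.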
keps/sig-node/2712-pod-priority-based-graceful-node-shutdown/kep.yaml

Lines changed: 3 additions & 2 deletions
@@ -15,16 +15,17 @@ prr-approvers:
 - "@deads2k"
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.23"
+latest-milestone: "v1.24"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.23"
+  beta: "v1.24"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
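The template comment at the end of this hunk introduces the feature-gate listing, which this commit leaves unchanged. Assuming the upstream feature gate name GracefulNodeShutdownBasedOnPodPriority, that stanza would plausibly read like the sketch below (not part of this diff):

feature-gates:
  - name: GracefulNodeShutdownBasedOnPodPriority
    components:
      - kubelet
disable-supported: true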
