@@ -374,17 +374,23 @@ _This section must be completed when targeting beta graduation to a release._
logs or events for this purpose.

Check if the feature gate and kubelet config settings are enabled on a node.
+ Kubelet will expose the metrics described below.
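+
+ For illustration, here is a minimal Go sketch of such a check against parsed
+ kubelet settings. The types and field names are simplified stand-ins rather
+ than the actual kubelet configuration API; the point is only that the
+ GracefulNodeShutdown feature gate must be enabled and shutdownGracePeriod must
+ be non-zero for the feature to take effect.
+
+ ```go
+ // Illustrative only: simplified stand-in types, not the real kubelet config API.
+ package main
+
+ import (
+     "fmt"
+     "time"
+ )
+
+ // shutdownSettings loosely mirrors the graceful-shutdown related settings an
+ // operator sets in the kubelet configuration file.
+ type shutdownSettings struct {
+     FeatureGateEnabled          bool          // GracefulNodeShutdown feature gate
+     ShutdownGracePeriod         time.Duration // total time the node gets to shut down
+     ShutdownGracePeriodCritical time.Duration // portion reserved for critical pods
+ }
+
+ // gracefulShutdownActive reports whether graceful node shutdown would take
+ // effect on a node with these settings.
+ func gracefulShutdownActive(s shutdownSettings) bool {
+     return s.FeatureGateEnabled && s.ShutdownGracePeriod > 0
+ }
+
+ func main() {
+     cfg := shutdownSettings{
+         FeatureGateEnabled:          true,
+         ShutdownGracePeriod:         30 * time.Second,
+         ShutdownGracePeriodCritical: 10 * time.Second,
+     }
+     fmt.Println("graceful node shutdown active:", gracefulShutdownActive(cfg))
+ }
+ ```
+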
* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- - [ ] Metrics
- - Metric name:
+ - [x] Metrics
+ - Metric name: GracefulShutdownStartTime, GracefulShutdownEndTime
- [Optional] Aggregation method:
- - Components exposing the metric:
+ - Components exposing the metric: Kubelet
- [ ] Other (treat as last resort)
- Details:
- N/A
+ Kubelet can record the start and end time of a graceful shutdown to local
+ storage and expose them as metrics for scraping. Depending on the scraping
+ interval, the metrics could be missed during the shutdown itself, but they can
+ be served again once the kubelet comes back online after a reboot. Operators
+ can then use these metrics to troubleshoot issues with the feature across the
+ nodes in their cluster.
+
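+ As an illustration of how such timestamps could be surfaced, here is a minimal
+ Go sketch using the Prometheus client library. The metric names and the
+ persistence step are placeholders, not the kubelet's actual implementation.
+
+ ```go
+ // Illustrative sketch only; not the kubelet's real metrics code.
+ package main
+
+ import (
+     "log"
+     "net/http"
+
+     "github.com/prometheus/client_golang/prometheus"
+     "github.com/prometheus/client_golang/prometheus/promhttp"
+ )
+
+ var (
+     // Gauges holding Unix timestamps; the names below are illustrative.
+     shutdownStart = prometheus.NewGauge(prometheus.GaugeOpts{
+         Name: "graceful_shutdown_start_time_seconds",
+         Help: "Unix timestamp at which the last graceful node shutdown started.",
+     })
+     shutdownEnd = prometheus.NewGauge(prometheus.GaugeOpts{
+         Name: "graceful_shutdown_end_time_seconds",
+         Help: "Unix timestamp at which the last graceful node shutdown finished.",
+     })
+ )
+
+ func main() {
+     prometheus.MustRegister(shutdownStart, shutdownEnd)
+
+     // On a real node these would be set when a shutdown begins and ends, and the
+     // values persisted to local storage so they can be served again after reboot.
+     shutdownStart.SetToCurrentTime()
+     shutdownEnd.SetToCurrentTime()
+
+     http.Handle("/metrics", promhttp.Handler())
+     log.Fatal(http.ListenAndServe(":9090", nil))
+ }
+ ```
+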
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
At a high level, this usually will be in the form of "high percentile of SLI
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
- N/A.
+ The graceful shutdown feature should function for all node shutdowns other than
+ those caused by power failures or hardware failures on the nodes.

* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**
@@ -502,15 +509,18 @@ The feature does not depend on the API server / etcd.
* **What are other known failure modes?**
For each of them, fill in the following information by copying the below template:
- - [Failure mode brief description]
- - Detection: How can it be detected via metrics? Stated another way:
- how can an operator troubleshoot without logging into a master or worker node?
- - Mitigations: What can be done to stop the bleeding, especially for already
- running user workloads?
- - Diagnostics: What are the useful log messages and their required logging
- levels that could help debug the issue?
- Not required until feature graduated to beta.
- - Testing: Are there any tests for failure mode? If not, describe why.
+
+ - Kubelet does not detect the shutdown, e.g. because the systemd inhibitor lock was not registered.
+ - Detection: Kubelet logs
+ - Mitigations: Workloads will not be affected; graceful node shutdown simply will not be enabled
+ - Diagnostics: At default (v2) logging verbosity, kubelet will log whether the inhibitor lock was registered
+ - Testing: Existing OSS SIG-Node E2E tests check for graceful node shutdown, including priority-based shutdown
+
+ - Priority-based graceful node shutdown setting is not respected (see the sketch after this list)
+ - Detection: Kubelet logs contain the time allocated to each pod's shutdown
+ - Mitigations: Change the priority-based graceful node shutdown config or revert to the existing beta graceful node shutdown config
+ - Diagnostics: Kubelet logging at v2 level
+ - Testing: Existing OSS SIG-Node E2E tests check that pod shutdown time is respected depending on the config
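+
+ To clarify what "respected" means for the priority-based setting, here is a
+ minimal Go sketch of one plausible per-priority grace period lookup. The
+ grouping rule shown (a pod receives the grace period of the highest configured
+ priority threshold that does not exceed its own priority) and the type names
+ are assumptions for illustration, not the kubelet's exact algorithm.
+
+ ```go
+ // Illustrative only: simplified stand-in for shutdownGracePeriodByPodPriority.
+ package main
+
+ import (
+     "fmt"
+     "sort"
+ )
+
+ // priorityGracePeriod mirrors one entry of a priority-to-grace-period table.
+ type priorityGracePeriod struct {
+     Priority                   int32
+     ShutdownGracePeriodSeconds int64
+ }
+
+ // gracePeriodFor returns the grace period for a pod of the given priority,
+ // assuming the pod gets the entry with the highest priority threshold that
+ // does not exceed the pod's priority.
+ func gracePeriodFor(entries []priorityGracePeriod, podPriority int32) int64 {
+     sort.Slice(entries, func(i, j int) bool { return entries[i].Priority < entries[j].Priority })
+     var period int64
+     for _, e := range entries {
+         if podPriority >= e.Priority {
+             period = e.ShutdownGracePeriodSeconds
+         }
+     }
+     return period
+ }
+
+ func main() {
+     cfg := []priorityGracePeriod{
+         {Priority: 0, ShutdownGracePeriodSeconds: 60},
+         {Priority: 10000, ShutdownGracePeriodSeconds: 180},
+         {Priority: 100000, ShutdownGracePeriodSeconds: 10},
+     }
+     for _, p := range []int32{0, 20000, 100001} {
+         fmt.Printf("pod priority %d -> %ds to shut down\n", p, gracePeriodFor(cfg, p))
+     }
+ }
+ ```
+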
* **What steps should be taken if SLOs are not being met to determine the problem?**