Skip to content

Commit f83edd3

Browse files
Updated PRR approver, addressed comments from PRR review
1 parent 7eadd9f commit f83edd3

File tree

2 files changed

+15
-8
lines changed

2 files changed

+15
-8
lines changed
Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,3 @@
11
kep-number: 4960
22
alpha:
3-
approver: "@jpbetz"
3+
approver: "@deads2k"

keps/sig-node/4960-container-stop-signals/README.md

Lines changed: 14 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -358,6 +358,8 @@ No, enabling the feature gate does not change existing behaviour.
358358

359359
Yes, the feature gate can be turned off to disable the feature once it has been enabled.
360360

361+
The feature gate is present in kube-apiserver and kubelet. Both enabling and disabling the feature gate once it has been turned on would involve restarting the kube-apiserver and the kubelet. The update strategy of how to roll out the feature gate flips for both kube-apiserver and kubelet are described in the [Version Skew Strategy](#version-skew-strategy) section.
362+
361363
###### What happens if we reenable the feature if it was previously rolled back?
362364

363365
If you reenable the feature, you'll be able to create Pods with StopSignal lifecycle hooks for their containers. Once the gate is disabled, if you try to create new Pods with StopSignal, those would be invalid and wouldn't pass validation. Existing worklods using StopSignal should still continue to function.
@@ -374,11 +376,13 @@ Yes, unit tests are planned for alpha for testing the disabling and reenabling o
374376

375377
###### How can a rollout or rollback fail? Can it impact already running workloads?
376378

377-
The change is opt-in, it doesn't impact already running workloads.
379+
The change is opt-in, it doesn't impact already running workloads. The only change to the existing workloads would be the stop signal showing up in the container statuses for existing Pods once the change is rolled out in the kubelet.
378380

379381
###### What specific metrics should inform a rollback?
380382

381-
Pods/Containers not getting terminated properly might indicate that something is wrong, although we will aim to handle all such cases gracefully and show proper error messages if something is missing.
383+
Pods/Containers not getting terminated properly might indicate that something is wrong, although we will aim to handle all such cases gracefully and show proper error messages if something is missing.
384+
385+
You can also look at the newly proposed metric `kubelet_pod_termination_grace_period_exceeded_total` which gives you the number of Pods which are killed forcefully after the timeout for graceful termination exceeded. A high value for this metric could mean that Pods are not getting killed gracefully. This could mean that Pods might have a misconfigured stop signals and might need a rollback.
382386

383387
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
384388

@@ -402,24 +406,27 @@ Inspect the workloads' Container spec for Stop lifecycle hook and also check if
402406
- Condition name:
403407
- Other field:
404408
- [x] Other (treat as last resort)
405-
- Details: Check if the containers with custom stop signals are being killed with the stop signal provided. For example your container might want to take SIGUSR1 to be exited. You can achieve this by defining it in the Container spec and have to bake it into your container image.
409+
- Check if the containers with custom stop signals are being killed with the stop signal provided. For example your container might want to take SIGUSR1 to be exited. You can achieve this by defining it in the Container spec and have to bake it into your container image.
410+
- Since we're showing the effective stop signal in the container status, irrespective of whether a custom signal is used or not, users can check whether Pods scheduled to every node has a StopSignal field in their statuses to confirm whether the feature is enabled and working in that particular instance of kubelet.
406411

407412
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
408413

409414
N/A
410415

411416
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
412417

413-
- [ ] Metrics
418+
- [x] Metrics
414419
- Metric name:
420+
- kubelet_pod_stop_signals_count (Guage measuring number of pods configured with each stop signal)
421+
- kubelet_pod_termination_grace_period_exceeded_total (Counter counting the number of Pods that doesn't get terminated gracefully with the duration of terminationGracePeriodSeconds)
415422
- [Optional] Aggregation method:
416423
- Components exposing the metric:
417424
- [x] Other (treat as last resort)
418425
- Details: Check if the containers with custom stop signals are being killed with the stop signal provided.
419426

420427
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
421428

422-
No.
429+
Metrics related to pods termination could be useful to improve the observability of stop signal usage. There aren't any metrics related to the terminationGracePeriodSeconds for Pods and one such metric has been proposed above, `kubelet_pod_termination_grace_period_exceeded_total`, which counts the number of Pods that gets terminated forcefully after exceeding terminationGracePeriodSeconds.
423430

424431
### Dependencies
425432

@@ -465,7 +472,7 @@ The same way any write to kube-apiserver/etcd would behave. This feature doesn't
465472

466473
###### What are other known failure modes?
467474

468-
N/A
475+
Pods can fail and hang if the user configures a stop signal that is not handled in the container. This is a new failure mode that is introduced by this KEP. The KEPs would hang until they're forcecully killed with SIGKILL after the terminationGracePeriodSeconds.
469476

470477
###### What steps should be taken if SLOs are not being met to determine the problem?
471478

@@ -475,7 +482,7 @@ Disable the ContainerStopSignal feature gate, and restart the kube-apiserver and
475482

476483
## Drawbacks
477484

478-
There aren't any drawbacks to why this KEP shouldn't be implemented since it does not change the default behaviour.
485+
One of the drawbacks of introducing stop signal to the container spec is that this introduces the scope of users misconfiguring the stop signal leading to unexpected behaviour such as the hanging pods as mentioned in the [Risks and Mitigations](#risks-and-mitigations) section.
479486

480487
## Alternatives
481488

0 commit comments

Comments
 (0)