| 1 | +# KEP-1972: Kubelet Exec Probe Timeouts |
| 2 | + |
| 3 | +<!-- toc --> |
| 4 | +- [Release Signoff Checklist](#release-signoff-checklist) |
| 5 | +- [Summary](#summary) |
| 6 | +- [Motivation](#motivation) |
| 7 | + - [Goals](#goals) |
| 8 | + - [Non-Goals](#non-goals) |
| 9 | +- [Proposal](#proposal) |
| 10 | + - [Risks and Mitigations](#risks-and-mitigations) |
| 11 | +- [Design Details](#design-details) |
| 12 | + - [Test Plan](#test-plan) |
| 13 | + - [Graduation Criteria](#graduation-criteria) |
| 14 | + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) |
| 15 | + - [Version Skew Strategy](#version-skew-strategy) |
| 16 | +- [Implementation History](#implementation-history) |
| 17 | +- [Drawbacks](#drawbacks) |
| 18 | +- [Alternatives](#alternatives) |
| 19 | +<!-- /toc --> |
| 20 | + |
| 21 | +## Release Signoff Checklist |
| 22 | + |
| 23 | +Items marked with (R) are required *prior to targeting to a milestone / release*. |
| 24 | + |
| 25 | +- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) |
| 26 | +- [X] (R) KEP approvers have approved the KEP status as `implementable` |
| 27 | +- [X] (R) Design details are appropriately documented |
| 28 | +- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input |
| 29 | +- [X] (R) Graduation criteria is in place |
| 30 | +- [ ] (R) Production readiness review completed |
| 31 | +- [ ] Production readiness review approved |
| 32 | +- [ ] "Implementation History" section is up-to-date for milestone |
| 33 | +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
| 34 | +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
| 35 | + |
| 36 | +[kubernetes.io]: https://kubernetes.io/ |
| 37 | +[kubernetes/enhancements]: https://git.k8s.io/enhancements |
| 38 | +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes |
| 39 | +[kubernetes/website]: https://git.k8s.io/website |
| 40 | + |
| 41 | +## Summary |
| 42 | + |
| 43 | +Kubelet today does not respect exec probe timeouts. This is a bug that should be fixed, since
| 44 | +the timeout value is part of the Container Probe API. Because kubelet has never respected
| 45 | +exec probe timeouts, the fix is guarded by a new feature gate, `ExecProbeTimeouts`.
| 46 | +With this gate, nodes can be configured to preserve the current behavior, while exec probe
| 47 | +timeouts are enforced by default.
| 48 | + |
| 49 | +## Motivation |
| 50 | + |
| 51 | +Kubelet not respecting the probe timeout is a bug and should be fixed. |
| 52 | + |
| 53 | +### Goals |
| 54 | + |
| 55 | +* treat exec probe timeouts as probe failures in kubelet |
| 56 | + |
| 57 | +### Non-Goals |
| 58 | + |
| 59 | +* ensuring exec processes that timed out have been killed by kubelet. |
| 60 | +* introducing CRI errors for handling scenarios such as timeouts.
| 61 | + |
| 62 | +## Proposal |
| 63 | + |
| 64 | +### Risks and Mitigations |
| 65 | + |
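| 66 | +* existing workloads on Kubernetes that relied on this bug may unexpectedly see their probes time out. As a mitigation, the `ExecProbeTimeouts` feature gate can be disabled on a node to preserve the previous behavior.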
| 67 | + |
| 68 | +## Design Details |
| 69 | + |
| 70 | +Changes to kubelet: |
| 71 | +* Ensure kubelet handles timeout errors and registers them as failing probes. |
| 72 | +* Add feature gate `ExecProbeTimeouts` that is GA and on by default. |
| 73 | +* If the feature gate `ExecProbeTimeouts` is disabled and an exec probe timeout is reached, add warning logs to inform users that exec probes are timing out. |
| 74 | +* Re-enable existing exec liveness probe e2e test. |
| 75 | +* Add new exec readiness probe e2e test. |
| 76 | + |
| 77 | +### Test Plan |
| 78 | + |
| 79 | +E2E tests: |
| 80 | +* re-enable [existing exec liveness probe e2e test](https://github.com/kubernetes/kubernetes/blob/ea1458550077bdf3b26ac34551a3591d280fe1f5/test/e2e/common/container_probe.go#L210-L227) that is currently being skipped |
| 81 | +* add new exec readiness probe e2e test. |
| 82 | + |
| 83 | +### Graduation Criteria |
| 84 | + |
| 85 | +This is a bug fix, so the feature gate will be GA and on by default from the start.
| 86 | + |
| 87 | +### Upgrade / Downgrade Strategy |
| 88 | + |
| 89 | +N/A |
| 90 | + |
| 91 | +### Version Skew Strategy |
| 92 | + |
| 93 | +N/A |
| 94 | + |
| 95 | +## Implementation History |
| 96 | + |
| 97 | +* 2020-09-08 - the KEP was merged as implementable for v1.20 |
| 98 | + |
| 99 | +## Drawbacks |
| 100 | + |
| 101 | +* Existing workloads may depend on the fact that exec probe timeouts were never respected. Enforcing
| 102 | +the timeout now may result in unexpected behavior for some workloads.
| 103 | + |
| 104 | +## Alternatives |
| 105 | + |
| 106 | +Some alternatives that were considered: |
| 107 | +1. Increasing the default timeout for exec probes |
| 108 | +2. Continuing to ignore the exec probe timeout |
| 109 | + |