|
26 | 26 | - [Alternatives](#alternatives)
|
27 | 27 | - [Introduce pod level resource requirements](#introduce-pod-level-resource-requirements)
|
28 | 28 | - [Leaving the PodSpec unchanged](#leaving-the-podspec-unchanged)
|
| 29 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 30 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 31 | + - [Monitoring Requirements](#monitoring-requirements) |
| 32 | + - [Dependencies](#dependencies) |
| 33 | + - [Scalability](#scalability) |
| 34 | + - [Troubleshooting](#troubleshooting) |
29 | 35 | - [Implementation History](#implementation-history)
|
30 | 36 | - [Version 1.16](#version-116)
|
31 | 37 | - [Version 1.18](#version-118)
|
| 38 | + - [Version 1.24](#version-124) |
32 | 39 | <!-- /toc -->
|
33 | 40 |
|
34 | 41 | ## Release Signoff Checklist
|
35 | 42 |
|
36 |
| -- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR) |
37 |
| -- [ ] KEP approvers have set the KEP status to `implementable` |
38 |
| -- [ ] Design details are appropriately documented |
39 |
| -- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input |
40 |
| -- [ ] Graduation criteria is in place |
41 |
| -- [ ] "Implementation History" section is up-to-date for milestone |
42 |
| -- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
43 |
| -- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
| 43 | +- [X] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR) |
| 44 | +- [X] KEP approvers have set the KEP status to `implementable` |
| 45 | +- [X] Design details are appropriately documented |
| 46 | +- [X] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input |
| 47 | +- [X] Graduation criteria is in place |
| 48 | +- [X] "Implementation History" section is up-to-date for milestone |
| 49 | +- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
| 50 | +- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
44 | 51 |
|
45 | 52 | **Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those
|
46 | 53 | approvers is no longer appropriate then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross cutting KEPs).
|
@@ -311,12 +318,140 @@ Cons:
|
311 | 318 | * Not user perceptible from a workload perspective.
|
312 | 319 | * very complicated if the runtimeClass policy changes after workloads are running
|
313 | 320 |
|
| 321 | +## Production Readiness Review Questionnaire |
| 322 | + |
| 323 | +### Feature Enablement and Rollback |
| 324 | + |
| 325 | +<!-- |
| 326 | +This section must be completed when targeting alpha to a release. |
| 327 | +--> |
| 328 | + |
| 329 | +Skipping this section as the feature was already rolled out to all supported k8s versions. |
| 330 | + |
| 331 | +### Monitoring Requirements |
| 332 | + |
| 333 | +<!-- |
| 334 | +This section must be completed when targeting beta to a release. |
| 335 | +--> |
| 336 | + |
| 337 | +###### How can an operator determine if the feature is in use by workloads? |
| 338 | + |
| 339 | +Using metrics mentioned in documentation https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/#observability: |
| 340 | + |
| 341 | +- `kube_pod_overhead_cpu_cores` |
| 342 | +- `kube_pod_overhead_memory_bytes` |
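As a rough illustration of how an operator might use these two metrics, the sketch below scans Prometheus text-format output and lists the Pods that report a non-zero overhead. The sample text, the `namespace`/`pod` label names, and the helper `pods_with_overhead` are assumptions made for this example, not something the KEP or the documentation specifies.

```python
import re

# Hypothetical sample of Prometheus text exposition output. The label set
# (namespace, pod) is an assumption made for illustration only.
SAMPLE = '''\
# TYPE kube_pod_overhead_cpu_cores gauge
kube_pod_overhead_cpu_cores{namespace="default",pod="test-pod"} 0.25
# TYPE kube_pod_overhead_memory_bytes gauge
kube_pod_overhead_memory_bytes{namespace="default",pod="test-pod"} 125829120
'''

OVERHEAD_METRICS = ("kube_pod_overhead_cpu_cores", "kube_pod_overhead_memory_bytes")
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def pods_with_overhead(text):
    """Return {(namespace, pod): {metric_name: value}} for non-zero overhead samples."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group("name") not in OVERHEAD_METRICS:
            continue
        value = float(match.group("value"))
        if value <= 0:
            continue  # a zero sample means no overhead is accounted for this Pod
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("namespace", ""), labels.get("pod", ""))
        result.setdefault(key, {})[match.group("name")] = value
    return result
```

Calling `pods_with_overhead(SAMPLE)` yields one entry keyed by `("default", "test-pod")`. An empty result would suggest that no scraped Pod currently carries an overhead.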
| 343 | + |
| 344 | +###### How can someone using this feature know that it is working for their instance? |
| 345 | + |
| 346 | +Using metrics mentioned in documentation https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/#observability: |
| 347 | + |
| 348 | +- `kube_pod_overhead_cpu_cores` |
| 349 | +- `kube_pod_overhead_memory_bytes` |
| 350 | + |
| 351 | +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? |
| 352 | + |
| 353 | +The ultimate SLO is that a Pod staying within its set limits is not evicted |
| 354 | +because of overhead introduced by the runtime. Because estimating the resources |
| 355 | +that a Pod and its runtime use is complex, this is hard to measure. |
| 356 | + |
| 357 | +The closest approximation to the intended SLO is that the Pod's `Overhead` is |
| 358 | +set on admission and cgroups are adjusted as needed. |
| 359 | + |
| 360 | +Since the RuntimeClass admission controller logic is straightforward (a single |
| 361 | +value assignment, with no new API calls), Pod scheduling |
| 362 | +latency is not affected by this feature. |
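The "one value assignment" can be pictured with a deliberately simplified model of the admission step. This is a sketch using plain dictionaries, not the actual controller code: the function `admit_pod`, the `AdmissionError` type, and the rule that a conflicting pre-set overhead is rejected are assumptions made for illustration.

```python
class AdmissionError(Exception):
    """Raised when the simplified admission check rejects a Pod."""


def admit_pod(pod, runtime_classes):
    """Copy the RuntimeClass overhead onto the Pod spec, if one is defined.

    `pod` and `runtime_classes` are plain dicts shaped like the API objects.
    This is a simplified model and does not cover every validation rule.
    """
    spec = pod["spec"]
    name = spec.get("runtimeClassName")
    if name is None:
        return pod  # no RuntimeClass referenced, nothing to assign

    runtime_class = runtime_classes.get(name)
    if runtime_class is None:
        raise AdmissionError(f"RuntimeClass {name!r} not found")

    pod_fixed = runtime_class.get("overhead", {}).get("podFixed")
    if pod_fixed is None:
        return pod  # the RuntimeClass defines no overhead

    existing = spec.get("overhead")
    if existing is not None and existing != pod_fixed:
        # Assumption: a Pod whose overhead conflicts with the RuntimeClass is rejected.
        raise AdmissionError("Pod overhead does not match RuntimeClass overhead")

    # The single assignment: no extra API calls are needed at this point.
    spec["overhead"] = dict(pod_fixed)
    return pod
```

For example, admitting a Pod with `runtimeClassName: kata-fc` against a RuntimeClass whose `overhead.podFixed` is `{"cpu": "250m", "memory": "120Mi"}` leaves that same mapping in `pod["spec"]["overhead"]`.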
| 363 | + |
| 364 | +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? |
| 365 | + |
| 366 | +Excessive Pod evictions on a specific runtime that specifies an Overhead may |
| 367 | +indicate that the feature is not working. However, this is a very unreliable |
| 368 | +proxy: there is a good chance that the evictions are caused by Pod or |
| 369 | +Runtime behavior. |
| 370 | + |
| 371 | +Checking the Pod object and cgroup settings, as described in the [Usage Example](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/#usage-example) |
| 372 | +section of the documentation, is a good proxy for checking that the |
| 373 | +feature is functional. |
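The cgroup check can be made concrete by computing the memory limit one would expect to see on the Pod-level cgroup: the sum of the container memory limits plus the Pod's memory overhead. The sketch below assumes every container sets a memory limit and handles only the binary suffixes `Ki`, `Mi`, and `Gi`. The helper names and the sample values are illustrative.

```python
MEMORY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity):
    """Convert a memory quantity such as "100Mi" to bytes (binary suffixes only)."""
    for suffix, factor in MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


def expected_pod_memory_limit(pod_spec):
    """Sum the container memory limits and add the Pod's memory overhead."""
    containers = sum(
        parse_memory(c["resources"]["limits"]["memory"])
        for c in pod_spec["containers"]
    )
    overhead = parse_memory(pod_spec.get("overhead", {}).get("memory", "0"))
    return containers + overhead


# Illustrative Pod spec: two containers limited to 100Mi each, 120Mi overhead.
example_spec = {
    "overhead": {"cpu": "250m", "memory": "120Mi"},
    "containers": [
        {"name": "a", "resources": {"limits": {"memory": "100Mi"}}},
        {"name": "b", "resources": {"limits": {"memory": "100Mi"}}},
    ],
}

print(expected_pod_memory_limit(example_spec))  # 335544320 (320Mi)
```

Comparing this number with the memory limit read from the Pod's cgroup on the node gives a quick indication of whether the overhead was applied.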
| 374 | + |
| 375 | +Finally, increased pod scheduling latency may indicate an issue with the |
| 376 | +RuntimeClass admission controller. |
| 377 | + |
| 378 | +###### Are there any missing metrics that would be useful to have to improve observability of this feature? |
| 379 | + |
| 380 | +No |
| 381 | + |
| 382 | +### Dependencies |
| 383 | + |
| 384 | +No |
| 385 | + |
| 386 | +###### Does this feature depend on any specific services running in the cluster? |
| 387 | + |
| 388 | +The feature depends on the RuntimeClass admission controller being enabled. |
| 389 | + |
| 390 | +### Scalability |
| 391 | + |
| 392 | + |
| 393 | +###### Will enabling / using this feature result in any new API calls? |
| 394 | + |
| 395 | +No. The RuntimeClass admission controller already checks the RuntimeClass for |
| 396 | +every Pod, and the PodOverhead assignment doesn't introduce any new API |
| 397 | +calls. The same holds for the kubelet. |
| 398 | + |
| 399 | +###### Will enabling / using this feature result in introducing new API types? |
| 400 | + |
| 401 | +No |
| 402 | + |
| 403 | +###### Will enabling / using this feature result in any new calls to the cloud provider? |
| 404 | + |
| 405 | +No |
| 406 | + |
| 407 | +###### Will enabling / using this feature result in increasing size or count of the existing API objects? |
| 408 | + |
| 409 | +Every Pod that is scheduled for the RuntimeClass with the Overhead specified |
| 410 | +will carry two additional values for the `Overhead` structure. |
| 411 | + |
| 412 | +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? |
| 413 | + |
| 414 | +No |
| 415 | + |
| 416 | +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? |
| 417 | + |
| 418 | +N/A. |
| 419 | + |
| 420 | +Note, specifying PodOverhead will increase the allocated resources for pods by design. |
| 421 | + |
| 422 | +### Troubleshooting |
| 423 | + |
| 424 | +Documentation has troubleshooting steps: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/ |
| 425 | + |
| 426 | +###### How does this feature react if the API server and/or etcd is unavailable? |
| 427 | + |
| 428 | +No dependency on etcd availability. |
| 429 | + |
| 430 | +###### What are other known failure modes? |
| 431 | + |
| 432 | +No |
| 433 | + |
| 434 | +###### What steps should be taken if SLOs are not being met to determine the problem? |
| 435 | + |
| 436 | +- Validate the RuntimeClass Admission controller is functional |
| 437 | +- Validate that Pod objects are updated correctly |
| 438 | +- Validate that cgroups are updated correctly |
| 439 | + |
314 | 440 | ## Implementation History
|
315 | 441 |
|
316 |
| -2019-04-04: Initial KEP published. |
| 442 | +- 2019-04-04: Initial KEP published. |
317 | 443 |
|
318 | 444 | ### Version 1.16
|
| 445 | + |
319 | 446 | - Implemented as Alpha.
|
320 | 447 |
|
321 | 448 | ### Version 1.18
|
322 |
| -- Promoted to Beta. |
| 449 | + |
| 450 | +- Promoted to Beta. |
| 451 | + |
| 452 | +### Version 1.24 |
| 453 | + |
| 454 | +1. Production usage: https://github.com/openshift/sandboxed-containers-operator/blob/0edbfbf353945dec4066a6d127bf9d88fbbc80a7/controllers/openshift_controller.go#L342 |
| 455 | +2. Documentation is in place: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/ |
| 456 | + |
| 457 | +- Promoted to Stable. |