|
21 | 21 | - [PVC API Change](#pvc-api-change) |
22 | 22 | - [StorageClass API change](#storageclass-api-change) |
23 | 23 | - [Other API changes](#other-api-changes) |
| 24 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 25 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 26 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 27 | + - [Monitoring Requirements](#monitoring-requirements) |
| 28 | + - [Dependencies](#dependencies) |
| 29 | + - [Scalability](#scalability) |
| 30 | + - [Troubleshooting](#troubleshooting) |
| 31 | +- [Implementation History](#implementation-history) |
| 32 | +- [Drawbacks](#drawbacks) |
| 33 | +- [Alternatives](#alternatives) |
| 34 | +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) |
24 | 35 | <!-- /toc --> |
25 | 36 | |
26 | 37 | ## Release Signoff Checklist |
@@ -344,3 +355,348 @@ type StorageClass struct { |
344 | 355 | |
345 | 356 | This proposal relies on the ability to update PVC status from kubelet. To update a PVC's |
346 | 357 | status, kubelet must make a PATCH request. |
| 358 | + |
| 359 | +## Production Readiness Review Questionnaire |
| 360 | + |
| 361 | +<!-- |
| 362 | + |
| 363 | +Production readiness reviews are intended to ensure that features merging into |
| 364 | +Kubernetes are observable, scalable and supportable; can be safely operated in |
| 365 | +production environments, and can be disabled or rolled back in the event they |
| 366 | +cause increased failures in production. See more in the PRR KEP at |
| 367 | +https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness. |
| 368 | + |
| 369 | +The production readiness review questionnaire must be completed and approved |
| 370 | +for the KEP to move to `implementable` status and be included in the release. |
| 371 | + |
| 372 | +In some cases, the questions below should also have answers in `kep.yaml`. This |
| 373 | +is to enable automation to verify the presence of the review, and to reduce review |
| 374 | +burden and latency. |
| 375 | + |
| 376 | +The KEP must have an approver from the |
| 377 | +[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES) |
| 378 | +team. Please reach out on the |
| 379 | +[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if |
| 380 | +you need any help or guidance. |
| 381 | +--> |
| 382 | + |
| 383 | +### Feature Enablement and Rollback |
| 384 | + |
| 385 | +<!-- |
| 386 | +This section must be completed when targeting alpha to a release. |
| 387 | +--> |
| 388 | + |
| 389 | +###### How can this feature be enabled / disabled in a live cluster? |
| 390 | + |
| 391 | +<!-- |
| 392 | +Pick one of these and delete the rest. |
| 393 | +--> |
| 394 | + |
| 395 | +- [ ] Feature gate (also fill in values in `kep.yaml`) |
| 396 | + - Feature gate name: |
| 397 | + - Components depending on the feature gate: |
| 398 | +- [ ] Other |
| 399 | + - Describe the mechanism: |
| 400 | + - Will enabling / disabling the feature require downtime of the control |
| 401 | + plane? |
| 402 | + - Will enabling / disabling the feature require downtime or reprovisioning |
| 403 | + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). |
| 404 | + |
| 405 | +###### Does enabling the feature change any default behavior? |
| 406 | + |
| 407 | +<!-- |
| 408 | +Any change of default behavior may be surprising to users or break existing |
| 409 | +automations, so be extremely careful here. |
| 410 | +--> |
| 411 | + |
| 412 | +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? |
| 413 | + |
| 414 | +<!-- |
| 415 | +Describe the consequences on existing workloads (e.g., if this is a runtime |
| 416 | +feature, can it break the existing applications?). |
| 417 | + |
| 418 | +NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`. |
| 419 | +--> |
| 420 | + |
| 421 | +###### What happens if we reenable the feature if it was previously rolled back? |
| 422 | + |
| 423 | +###### Are there any tests for feature enablement/disablement? |
| 424 | + |
| 425 | +<!-- |
| 426 | +The e2e framework does not currently support enabling or disabling feature |
| 427 | +gates. However, unit tests in each component dealing with managing data, created |
| 428 | +with and without the feature, are necessary. At the very least, think about |
| 429 | +conversion tests if API types are being modified. |
| 430 | +--> |
| 431 | + |
| 432 | +### Rollout, Upgrade and Rollback Planning |
| 433 | + |
| 434 | +<!-- |
| 435 | +This section must be completed when targeting beta to a release. |
| 436 | +--> |
| 437 | + |
| 438 | +###### How can a rollout or rollback fail? Can it impact already running workloads? |
| 439 | + |
| 440 | +<!-- |
| 441 | +Try to be as paranoid as possible - e.g., what if some components will restart |
| 442 | +mid-rollout? |
| 443 | + |
| 444 | +Be sure to consider highly-available clusters, where, for example, |
| 445 | +feature flags will be enabled on some API servers and not others during the |
| 446 | +rollout. Similarly, consider large clusters and how enablement/disablement |
| 447 | +will roll out across nodes. |
| 448 | +--> |
| 449 | + |
| 450 | +###### What specific metrics should inform a rollback? |
| 451 | + |
| 452 | +<!-- |
| 453 | +What signals should users be paying attention to when the feature is young |
| 454 | +that might indicate a serious problem? |
| 455 | +--> |
| 456 | + |
| 457 | +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? |
| 458 | + |
| 459 | +<!-- |
| 460 | +Describe manual testing that was done and the outcomes. |
| 461 | +Longer term, we may want to require automated upgrade/rollback tests, but we |
| 462 | +are missing a bunch of machinery and tooling and can't do that now. |
| 463 | +--> |
| 464 | + |
| 465 | +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? |
| 466 | + |
| 467 | +<!-- |
| 468 | +Even if applying deprecation policies, they may still surprise some users. |
| 469 | +--> |
| 470 | + |
| 471 | +### Monitoring Requirements |
| 472 | + |
| 473 | +<!-- |
| 474 | +This section must be completed when targeting beta to a release. |
| 475 | +--> |
| 476 | + |
| 477 | +###### How can an operator determine if the feature is in use by workloads? |
| 478 | + |
| 479 | +<!-- |
| 480 | +Ideally, this should be a metric. Operations against the Kubernetes API (e.g., |
| 481 | +checking if there are objects with field X set) may be a last resort. Avoid |
| 482 | +logs or events for this purpose. |
| 483 | +--> |
| 484 | + |
| 485 | +###### How can someone using this feature know that it is working for their instance? |
| 486 | + |
| 487 | +<!-- |
| 488 | +For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly |
| 489 | +for each individual pod. |
| 490 | +Pick one or more of these and delete the rest. |
| 491 | +Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement |
| 492 | +and operation of this feature. |
| 493 | +Recall that end users cannot usually observe component logs or access metrics. |
| 494 | +--> |
| 495 | + |
| 496 | +- [ ] Events |
| 497 | + - Event Reason: |
| 498 | +- [ ] API .status |
| 499 | + - Condition name: |
| 500 | + - Other field: |
| 501 | +- [ ] Other (treat as last resort) |
| 502 | + - Details: |
| 503 | + |
| 504 | +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? |
| 505 | + |
| 506 | +<!-- |
| 507 | +This is your opportunity to define what "normal" quality of service looks like |
| 508 | +for a feature. |
| 509 | + |
| 510 | +It's impossible to provide comprehensive guidance, but at the very |
| 511 | +high level (needs more precise definitions) those may be things like: |
| 512 | + - per-day percentage of API calls finishing with 5XX errors <= 1% |
| 513 | + - 99% percentile over day of absolute value from (job creation time minus expected |
| 514 | + job creation time) for cron job <= 10% |
| 515 | + - 99.9% of /health requests per day finish with 200 code |
| 516 | + |
| 517 | +These goals will help you determine what you need to measure (SLIs) in the next |
| 518 | +question. |
| 519 | +--> |
| 520 | + |
| 521 | +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? |
| 522 | + |
| 523 | +<!-- |
| 524 | +Pick one or more of these and delete the rest. |
| 525 | +--> |
| 526 | + |
| 527 | +- [ ] Metrics |
| 528 | + - Metric name: |
| 529 | + - [Optional] Aggregation method: |
| 530 | + - Components exposing the metric: |
| 531 | +- [ ] Other (treat as last resort) |
| 532 | + - Details: |
| 533 | + |
| 534 | +###### Are there any missing metrics that would be useful to have to improve observability of this feature? |
| 535 | + |
| 536 | +<!-- |
| 537 | +Describe the metrics themselves and the reasons why they weren't added (e.g., cost, |
| 538 | +implementation difficulties, etc.). |
| 539 | +--> |
| 540 | + |
| 541 | +### Dependencies |
| 542 | + |
| 543 | +<!-- |
| 544 | +This section must be completed when targeting beta to a release. |
| 545 | +--> |
| 546 | + |
| 547 | +###### Does this feature depend on any specific services running in the cluster? |
| 548 | + |
| 549 | +<!-- |
| 550 | +Think about both cluster-level services (e.g. metrics-server) as well |
| 551 | +as node-level agents (e.g. specific version of CRI). Focus on external or |
| 552 | +optional services that are needed. For example, if this feature depends on |
| 553 | +a cloud provider API, or upon an external software-defined storage or network |
| 554 | +control plane. |
| 555 | + |
| 556 | +For each of these, fill in the following—thinking about running existing user workloads |
| 557 | +and creating new ones, as well as about cluster-level services (e.g. DNS): |
| 558 | + - [Dependency name] |
| 559 | + - Usage description: |
| 560 | + - Impact of its outage on the feature: |
| 561 | + - Impact of its degraded performance or high-error rates on the feature: |
| 562 | +--> |
| 563 | + |
| 564 | +### Scalability |
| 565 | + |
| 566 | +<!-- |
| 567 | +For alpha, this section is encouraged: reviewers should consider these questions |
| 568 | +and attempt to answer them. |
| 569 | + |
| 570 | +For beta, this section is required: reviewers must answer these questions. |
| 571 | + |
| 572 | +For GA, this section is required: approvers should be able to confirm the |
| 573 | +previous answers based on experience in the field. |
| 574 | +--> |
| 575 | + |
| 576 | +###### Will enabling / using this feature result in any new API calls? |
| 577 | + |
| 578 | +<!-- |
| 579 | +Describe them, providing: |
| 580 | + - API call type (e.g. PATCH pods) |
| 581 | + - estimated throughput |
| 582 | + - originating component(s) (e.g. Kubelet, Feature-X-controller) |
| 583 | +Focusing mostly on: |
| 584 | + - components listing and/or watching resources they didn't before |
| 585 | + - API calls that may be triggered by changes of some Kubernetes resources |
| 586 | + (e.g. update of object X triggers new updates of object Y) |
| 587 | + - periodic API calls to reconcile state (e.g. periodic fetching state, |
| 588 | + heartbeats, leader election, etc.) |
| 589 | +--> |
| 590 | + |
| 591 | +###### Will enabling / using this feature result in introducing new API types? |
| 592 | + |
| 593 | +<!-- |
| 594 | +Describe them, providing: |
| 595 | + - API type |
| 596 | + - Supported number of objects per cluster |
| 597 | + - Supported number of objects per namespace (for namespace-scoped objects) |
| 598 | +--> |
| 599 | + |
| 600 | +###### Will enabling / using this feature result in any new calls to the cloud provider? |
| 601 | + |
| 602 | +<!-- |
| 603 | +Describe them, providing: |
| 604 | + - Which API(s): |
| 605 | + - Estimated increase: |
| 606 | +--> |
| 607 | + |
| 608 | +###### Will enabling / using this feature result in increasing size or count of the existing API objects? |
| 609 | + |
| 610 | +<!-- |
| 611 | +Describe them, providing: |
| 612 | + - API type(s): |
| 613 | + - Estimated increase in size: (e.g., new annotation of size 32B) |
| 614 | + - Estimated amount of new objects: (e.g., new Object X for every existing Pod) |
| 615 | +--> |
| 616 | + |
| 617 | +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? |
| 618 | + |
| 619 | +<!-- |
| 620 | +Look at the [existing SLIs/SLOs]. |
| 621 | + |
| 622 | +Think about adding additional work or introducing new steps in between |
| 623 | +(e.g. need to do X to start a container), etc. Please describe the details. |
| 624 | + |
| 625 | +[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos |
| 626 | +--> |
| 627 | + |
| 628 | +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? |
| 629 | + |
| 630 | +<!-- |
| 631 | +Things to keep in mind include: additional in-memory state, additional |
| 632 | +non-trivial computations, excessive access to disks (including increased log |
| 633 | +volume), significant amount of data sent and/or received over network, etc. |
| 634 | +Think through this in both small and large cases, again with respect to the |
| 635 | +[supported limits]. |
| 636 | + |
| 637 | +[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md |
| 638 | +--> |
| 639 | + |
| 640 | +### Troubleshooting |
| 641 | + |
| 642 | +<!-- |
| 643 | +This section must be completed when targeting beta to a release. |
| 644 | + |
| 645 | +The Troubleshooting section currently serves the `Playbook` role. We may consider |
| 646 | +splitting it into a dedicated `Playbook` document (potentially with some monitoring |
| 647 | +details). For now, we leave it here. |
| 648 | +--> |
| 649 | + |
| 650 | +###### How does this feature react if the API server and/or etcd is unavailable? |
| 651 | + |
| 652 | +###### What are other known failure modes? |
| 653 | + |
| 654 | +<!-- |
| 655 | +For each of them, fill in the following information by copying the below template: |
| 656 | + - [Failure mode brief description] |
| 657 | + - Detection: How can it be detected via metrics? Stated another way: |
| 658 | + how can an operator troubleshoot without logging into a master or worker node? |
| 659 | + - Mitigations: What can be done to stop the bleeding, especially for already |
| 660 | + running user workloads? |
| 661 | + - Diagnostics: What are the useful log messages and their required logging |
| 662 | + levels that could help debug the issue? |
| 663 | + Not required until the feature graduates to beta. |
| 664 | + - Testing: Are there any tests for failure mode? If not, describe why. |
| 665 | +--> |
| 666 | + |
| 667 | +###### What steps should be taken if SLOs are not being met to determine the problem? |
| 668 | + |
| 669 | +## Implementation History |
| 670 | + |
| 671 | +<!-- |
| 672 | +Major milestones in the lifecycle of a KEP should be tracked in this section. |
| 673 | +Major milestones might include: |
| 674 | +- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance |
| 675 | +- the `Proposal` section being merged, signaling agreement on a proposed design |
| 676 | +- the date implementation started |
| 677 | +- the first Kubernetes release where an initial version of the KEP was available |
| 678 | +- the version of Kubernetes where the KEP graduated to general availability |
| 679 | +- when the KEP was retired or superseded |
| 680 | +--> |
| 681 | + |
| 682 | +## Drawbacks |
| 683 | + |
| 684 | +<!-- |
| 685 | +Why should this KEP _not_ be implemented? |
| 686 | +--> |
| 687 | + |
| 688 | +## Alternatives |
| 689 | + |
| 690 | +<!-- |
| 691 | +What other approaches did you consider, and why did you rule them out? These do |
| 692 | +not need to be as detailed as the proposal, but should include enough |
| 693 | +information to express the idea and why it was not acceptable. |
| 694 | +--> |
| 695 | + |
| 696 | +## Infrastructure Needed (Optional) |
| 697 | + |
| 698 | +<!-- |
| 699 | +Use this section if you need things from the project/SIG. Examples include a |
| 700 | +new subproject, repos requested, or GitHub details. Listing these here allows a |
| 701 | +SIG to get the process for these resources started right away. |
| 702 | +--> |