|
13 | 13 | - [Design Details](#design-details)
|
14 | 14 | - [Test Plan](#test-plan)
|
15 | 15 | - [Graduation Criteria](#graduation-criteria)
|
| 16 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 17 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 18 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 19 | + - [Monitoring Requirements](#monitoring-requirements) |
| 20 | + - [Dependencies](#dependencies) |
| 21 | + - [Scalability](#scalability) |
| 22 | + - [Troubleshooting](#troubleshooting) |
16 | 23 | - [Implementation History](#implementation-history)
|
17 | 24 | <!-- /toc -->
|
18 | 25 |
|
@@ -296,11 +303,208 @@ of versioning. However, we can still treat graduation in terms of
|
296 | 303 | [maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
|
297 | 304 | [deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
|
298 | 305 |
|
| 306 | +## Production Readiness Review Questionnaire |
| 307 | + |
| 308 | +<!-- |
| 309 | +
|
| 310 | +Production readiness reviews are intended to ensure that features merging into |
| 311 | +Kubernetes are observable, scalable and supportable; can be safely operated in |
| 312 | +production environments, and can be disabled or rolled back in the event they |
| 313 | +cause increased failures in production. See more in the PRR KEP at |
| 314 | +https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness. |
| 315 | +
|
| 316 | +The production readiness review questionnaire must be completed and approved |
| 317 | +for the KEP to move to `implementable` status and be included in the release. |
| 318 | +
|
| 319 | +In some cases, the questions below should also have answers in `kep.yaml`. This |
| 320 | +is to enable automation to verify the presence of the review, and to reduce review |
| 321 | +burden and latency. |
| 322 | +
|
| 323 | +The KEP must have an approver from the |
| 324 | +[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES) |
| 325 | +team. Please reach out on the |
| 326 | +[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if |
| 327 | +you need any help or guidance. |
| 328 | +
|
| 329 | +--> |
| 330 | + |
| 331 | +### Feature Enablement and Rollback |
| 332 | + |
| 333 | +_This section must be completed when targeting alpha to a release._ |
| 334 | + |
| 335 | +* **How can this feature be enabled / disabled in a live cluster?** |
| 336 | + |
| 337 | + Post-GA, it cannot be disabled: the feature is always enabled. |
| 338 | + Pre-GA, it could be disabled with the feature gate. |
| 339 | + |
| 340 | + - [x] Feature gate (also fill in values in `kep.yaml`) |
| 341 | + - Feature gate name: ServiceAccountIssuerDiscovery |
| 342 | + - Components depending on the feature gate: kube-apiserver |
| 343 | + - Note: This feature is targeted to GA in 1.21, at which point feature gates |
| 344 | + lock to enabled. This means it will not be possible to disable after the |
| 345 | + current dev cycle. |
| 346 | + |
| 347 | +* **Does enabling the feature change any default behavior?** |
| 348 | + No. It adds an entirely new non-resource-url that can be used to discover |
| 349 | + metadata related to the cluster's service account issuer. |
| 350 | + |
| 351 | +* **Can the feature be disabled once it has been enabled (i.e. can we roll back |
| 352 | + the enablement)?** |
| 353 | + |
| 354 | + No. This feature is always enabled post-GA. |
| 355 | + The only way to roll back is to return to an older K8s version. |
| 356 | + |
| 357 | + **Describe the consequences on existing workloads (e.g., if this is a runtime |
| 358 | + feature, can it break the existing applications?).** |
| 359 | + |
| 360 | + Existing applications would have to take a dependency on this feature to |
| 361 | + be broken by it. Thus, enabling the feature for the first time is not a risk |
| 362 | + to existing applications, but disabling it later could be. |
| 363 | + |
| 364 | +* **What happens if we reenable the feature if it was previously rolled back?** |
| 365 | + |
| 366 | + The feature should continue to work just fine. |
| 367 | + |
| 368 | +* **Are there any tests for feature enablement/disablement?** |
| 369 | + |
| 370 | + No. |
| 371 | + |
| 372 | +### Rollout, Upgrade and Rollback Planning |
| 373 | + |
| 374 | +_This section must be completed when targeting beta graduation to a release._ |
| 375 | + |
| 376 | +* **How can a rollout fail? Can it impact already running workloads?** |
| 377 | + Enablement shouldn't affect any existing workloads. If we broke the feature in |
| 378 | + the future, we would _possibly_ see failures of workloads to authenticate to |
| 379 | + Relying Parties _outside_ the cluster, but in-cluster workload to |
| 380 | + kube-apiserver authentication would still work, since it doesn't rely |
| 381 | + on this path. |
| 382 | + |
| 383 | +* **What specific metrics should inform a rollback?** |
| 384 | + N/A |
| 385 | + |
| 386 | +* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?** |
| 387 | + The standard upgrade tests would have covered this between alpha and beta, |
| 388 | + when the feature was enabled by default. |
| 389 | + |
| 390 | +* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, |
| 391 | +fields of API types, flags, etc.?** |
| 392 | + |
| 393 | + No. |
| 394 | + |
| 395 | +### Monitoring Requirements |
| 396 | + |
| 397 | +_This section must be completed when targeting beta graduation to a release._ |
| 398 | + |
| 399 | +* **How can an operator determine if the feature is in use by workloads?** |
| 400 | + Ideally, there would just be usage metrics for all API server endpoints. |
| 401 | + Since we don't currently have that, the next best option would be to examine |
| 402 | + API server logs. |
| 403 | + |
| 404 | +* **What are the SLIs (Service Level Indicators) an operator can use to determine |
| 405 | +the health of the service?** |
| 406 | + - [x] Other (treat as last resort) |
| 407 | + - Details: API server logs, or ability of workloads to authenticate to |
| 408 | + Relying Parties. |
| 409 | + |
| 410 | +* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** |
| 411 | + We expect the endpoints to maintain high reliability, with reliability |
| 412 | + matching that of kube-apiserver. |
| 413 | + |
| 414 | +* **Are there any missing metrics that would be useful to have to improve observability |
| 415 | +of this feature?** |
| 416 | + It would be nice to have usage metrics for this endpoint. We haven't added |
| 417 | + them so far because non-resource URLs don't have them by default. This could |
| 418 | + be worth solving in general but a general solution is out of scope for this |
| 419 | + KEP. |
| 420 | + |
| 421 | +### Dependencies |
| 422 | + |
| 423 | +_This section must be completed when targeting beta graduation to a release._ |
| 424 | + |
| 425 | +* **Does this feature depend on any specific services running in the cluster?** |
| 426 | + It only depends on kube-apiserver being up. If, for example, the issuer is |
| 427 | + configured as https://kubernetes.default.svc, then the corresponding Service |
| 428 | + needs to exist in the cluster as well. |
| 429 | + |
| 430 | + |
| 431 | +### Scalability |
| 432 | + |
| 433 | +_For alpha, this section is encouraged: reviewers should consider these questions |
| 434 | +and attempt to answer them._ |
| 435 | + |
| 436 | +_For beta, this section is required: reviewers must answer these questions._ |
| 437 | + |
| 438 | +_For GA, this section is required: approvers should be able to confirm the |
| 439 | +previous answers based on experience in the field._ |
| 440 | + |
| 441 | +* **Will enabling / using this feature result in any new API calls?** |
| 442 | + Yes. |
| 443 | + - GET `${API_SERVER}/.well-known/openid-configuration` |
| 444 | + - GET `${API_SERVER}/openid/v1/jwks` |
| 445 | + - Note each endpoint serves a response that is pre-rendered when |
| 446 | + kube-apiserver starts up. |
| 447 | + - Originating components: Could be arbitrary. For example: |
| 448 | + - A cluster installer reads these once when configuring identity federation |
| 449 | + with a cloud provider (Low throughput). |
| 450 | + - In-cluster components use this to perform an OIDC discovery flow to |
| 451 | + validate tokens (Medium to High throughput). Note TokenReview is the |
| 452 | + preferred approach in this case. |
| 453 | + - A cluster admin adds additional RBAC to make these endpoints public, and |
| 454 | + points Relying Parties directly at these endpoints (High throughput, |
| 455 | + though RPs _should_ do some caching instead of making calls on every |
| 456 | + token validation). |
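The call patterns above can be sketched as a small client-side helper. This is only an illustration, assuming a Relying Party written in Python: `discovery_urls` and `CachedFetcher` are hypothetical names that do not come from the KEP, the 300-second TTL is an arbitrary choice, and the fetch function is injected so no particular HTTP library is implied. It shows how a caller might derive the two endpoint URLs and avoid hitting kube-apiserver on every token validation.

```python
import time
from typing import Callable, Dict, Tuple

# Paths of the two non-resource URLs named in this KEP.
DISCOVERY_PATH = "/.well-known/openid-configuration"
JWKS_PATH = "/openid/v1/jwks"


def discovery_urls(api_server: str) -> Dict[str, str]:
    """Build both endpoint URLs from an API server base URL (hypothetical helper)."""
    base = api_server.rstrip("/")
    return {"configuration": base + DISCOVERY_PATH, "jwks": base + JWKS_PATH}


class CachedFetcher:
    """Cache fetched bodies per URL for a fixed TTL (hypothetical helper).

    `fetch` is any callable that returns the response body for a URL;
    `clock` is injectable so expiry can be exercised without sleeping.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,  # assumed value, not taken from the KEP
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        hit = self._cache.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no call to kube-apiserver
        body = self._fetch(url)
        self._cache[url] = (now, body)
        return body
```

With a wrapper like this, a Relying Party would call `get` on the JWKS URL for each validation, while the underlying request is made at most once per TTL window.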
| 457 | + |
| 458 | +* **Will enabling / using this feature result in introducing new API types?** |
| 459 | + No new types, just two new non-resource URLs that implement this KEP, as |
| 460 | + described above. There is no new state stored in etcd. |
| 461 | + |
| 462 | +* **Will enabling / using this feature result in any new calls to the cloud |
| 463 | +provider?** |
| 464 | + No. |
| 465 | + |
| 466 | +* **Will enabling / using this feature result in increasing size or count of |
| 467 | +the existing API objects?** |
| 468 | + No. |
| 469 | + |
| 470 | +* **Will enabling / using this feature result in increasing time taken by any |
| 471 | +operations covered by [existing SLIs/SLOs]?** |
| 472 | + No. |
| 473 | + |
| 474 | +* **Will enabling / using this feature result in non-negligible increase of |
| 475 | +resource usage (CPU, RAM, disk, IO, ...) in any components?** |
| 476 | + This isn't expected, given it's just copying a pre-rendered string into |
| 477 | + the response. |
| 478 | + |
| 479 | +### Troubleshooting |
| 480 | + |
| 481 | +The Troubleshooting section currently serves the `Playbook` role. We may consider |
| 482 | +splitting it into a dedicated `Playbook` document (potentially with some monitoring |
| 483 | +details). For now, we leave it here. |
| 484 | + |
| 485 | +_This section must be completed when targeting beta graduation to a release._ |
| 486 | + |
| 487 | +* **How does this feature react if the API server and/or etcd is unavailable?** |
| 488 | + If kube-apiserver is unavailable, this feature is also unavailable. This |
| 489 | + feature is not affected by etcd availability. |
| 490 | + |
| 491 | +* **What are other known failure modes?** |
| 492 | + N/A |
| 493 | + |
| 494 | +* **What steps should be taken if SLOs are not being met to determine the problem?** |
| 495 | + - Examine the responses from the discovery endpoints |
| 495 | + (`/.well-known/openid-configuration` and `/openid/v1/jwks`). |
| 496 | + - Examine kube-apiserver logs. |
| 497 | + - Examine kube-apiserver configuration related to this KEP. |
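As a rough aid for the first step, the responses can be checked for internal consistency once they have been fetched and parsed as JSON. The sketch below is an assumption-laden illustration, not part of the KEP: `check_discovery` is a hypothetical helper, and it only relies on the standard OIDC discovery fields `issuer` and `jwks_uri` and on the `keys` array of a JWK Set.

```python
from typing import Any, Dict, List


def check_discovery(
    config: Dict[str, Any],
    jwks: Dict[str, Any],
    expected_issuer: str,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found.

    `config` is the parsed body of /.well-known/openid-configuration,
    `jwks` is the parsed body of /openid/v1/jwks, and `expected_issuer`
    is the issuer the operator believes kube-apiserver is configured with.
    """
    problems: List[str] = []

    issuer = config.get("issuer")
    if issuer != expected_issuer:
        problems.append(
            "issuer mismatch: got %r, expected %r" % (issuer, expected_issuer)
        )

    # OIDC discovery expects jwks_uri to be an https URL.
    jwks_uri = config.get("jwks_uri")
    if not isinstance(jwks_uri, str) or not jwks_uri.startswith("https://"):
        problems.append("jwks_uri is missing or is not an https URL")

    keys = jwks.get("keys")
    if not isinstance(keys, list) or not keys:
        problems.append("JWKS document contains no keys")

    return problems
```

An issuer mismatch reported here would point toward the kube-apiserver configuration step, while an empty key set would point toward the signing key setup.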
| 498 | + |
| 499 | +[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md |
| 500 | +[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos |
| 501 | + |
299 | 502 | ## Implementation History
|
300 | 503 |
|
301 | 504 | - 2018-06-26: Proposed in https://github.com/kubernetes/community/pull/2314
|
302 | 505 | - 2018, 2019: Various comments on pull request
|
303 | 506 | - 2019-07-30: Moved to a KEP (with no edits from the original proposal)
|
304 | 507 | - 2019-08-05: Updated KEP with more details.
|
305 | 508 | - 2019-10-18: Updated KEP with more RBAC details.
|
306 |
| -- 2020-1-25: Updated KEP and marked as implementable. |
| 509 | +- 2020-01-25: Updated KEP and marked as implementable. |
| 510 | +- 2021-01-28: Added PRR questionnaire. |