|
13 | 13 | - [Phase 1 - Moving Cloud Provider Code to Staging](#phase-1---moving-cloud-provider-code-to-staging)
|
14 | 14 | - [Phase 2 - Building CCM from Provider Repos](#phase-2---building-ccm-from-provider-repos)
|
15 | 15 | - [Phase 3 - Migrating Provider Code to Provider Repos](#phase-3---migrating-provider-code-to-provider-repos)
|
| 16 | + - [Phase 4 - Disabling In-Tree Providers](#phase-4---disabling-in-tree-providers) |
16 | 17 | - [Staging Directory](#staging-directory)
|
17 | 18 | - [Cloud Provider Instances](#cloud-provider-instances)
|
| 19 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 20 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 21 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 22 | + - [Monitoring Requirements](#monitoring-requirements) |
| 23 | + - [Dependencies](#dependencies) |
| 24 | + - [Scalability](#scalability) |
| 25 | + - [Troubleshooting](#troubleshooting) |
18 | 26 | - [Alternatives](#alternatives)
|
19 | 27 | - [Staging Alternatives](#staging-alternatives)
|
20 | 28 | - [Git Filter-Branch](#git-filter-branch)
|
@@ -99,10 +107,27 @@ The kube-controller-manager will still import the cloud provider implementations
|
99 | 107 |
|
100 | 108 | #### Phase 3 - Migrating Provider Code to Provider Repos
|
101 | 109 |
|
102 |
| -In Phase 3, all code in `k8s.io/kubernetes/staging/src/k8s.io/legacy-cloud-providers/<provider>` will be removed and development of each cloud provider should be done in their respective external repos. It's important that by this phase, both in-tree and out-of-tree cloud providers are tested and production ready. Ideally most Kubernetes clusters in production should be using the out-of-tree provider before in-tree support is removed. A plan to migrate existing clusters from using the `kube-controller-manager` to the `cloud-controller-manager` is currently being developed. More details soon. |
| 110 | +In Phase 3, feature development is no longer accepted in `k8s.io/kubernetes/staging/src/k8s.io/legacy-cloud-providers/<provider>`, and development of each cloud provider should be done in its respective external repo. Only bug and security fixes are accepted in-tree during this phase. It's important that by this phase, both in-tree and out-of-tree cloud providers are tested and production ready. Ideally, most Kubernetes clusters in production should be using the out-of-tree provider before in-tree support is removed. A plan to migrate existing clusters from using the `kube-controller-manager` to the `cloud-controller-manager` is currently being developed. More details soon. |
103 | 111 |
|
104 | 112 | External cloud providers can optionally still import providers from `k8s.io/legacy-cloud-providers` but no core components in `k8s.io/kubernetes` will import the legacy provider and the respective staging directory will be removed along with all its dependencies.
|
105 | 113 |
|
| 114 | +#### Phase 4 - Disabling In-Tree Providers |
| 115 | + |
| 116 | +In Phase 4, two feature gates will be introduced to gradually disable and remove in-tree cloud providers: |
| 117 | +1. `DisableCloudProviders` - this feature gate will disable any functionality in kube-apiserver, kube-controller-manager and kubelet related to the `--cloud-provider` component flag. |
| 118 | +2. `DisableKubeletCloudCredentialProvider` - this feature gate will disable in-tree functionality in the kubelet to authenticate to the AWS, Azure and GCP container registries for image pull credentials. |
| 119 | + |
| 120 | +Both of these feature gates only impact functionality tied to the `--cloud-provider` flag; in-tree volume plugins, specifically, are not covered. Users should refer to CSI migration efforts for these. |
| 121 | + |
| 122 | +For alpha, the feature gates will be used for testing purposes. When the gates are enabled, tests will ensure that clusters with in-tree cloud providers disabled behave as expected. This is targeted for v1.21 and will be |
| 123 | +disabled by default. |
| 124 | + |
| 125 | +For beta, the feature gates will be on by default, meaning core components will disallow use of in-tree cloud providers. This will act as a warning for users to migrate to external components. Users may |
| 126 | +choose to continue using the in-tree provider by explicitly disabling the feature gates. Beta is targeted for v1.23 or v1.24. |
| 127 | + |
| 128 | +For GA, the feature gates will be enabled by default and locked. Users at this point MUST migrate to external components and use of the in-tree cloud providers will be disallowed. One release after GA, |
| 129 | +the in-tree cloud providers can be safely removed. GA is targeted for v1.25 or v1.26. |
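The graduation stages above can be sketched as a small Go model. This is only an illustration of the described defaults (alpha off, beta on with opt-out, GA on and locked); the `Stage` type and `inTreeDisabled` function are hypothetical names and are not the `k8s.io/component-base/featuregate` implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// Stage is a hypothetical model of a feature gate's maturity level.
type Stage int

const (
	Alpha Stage = iota // gate is off by default; opt-in for testing
	Beta               // gate is on by default; users may opt out
	GA                 // gate is on by default and locked
)

var stageNames = []string{"alpha", "beta", "GA"}

// inTreeDisabled reports whether in-tree cloud providers are disabled at the
// given stage. override is nil when the user did not set the gate explicitly.
func inTreeDisabled(stage Stage, override *bool) (bool, error) {
	switch stage {
	case Alpha, Beta:
		if override != nil {
			return *override, nil // an explicit setting wins before GA
		}
		return stage == Beta, nil // default: off in alpha, on in beta
	case GA:
		if override != nil && !*override {
			// Locked: the gate stays on and the opt-out is rejected.
			return true, errors.New("gate is locked on and cannot be disabled")
		}
		return true, nil
	}
	return false, fmt.Errorf("unknown stage %d", stage)
}

func main() {
	optOut := false
	for _, s := range []Stage{Alpha, Beta, GA} {
		byDefault, _ := inTreeDisabled(s, nil)
		withOptOut, err := inTreeDisabled(s, &optOut)
		fmt.Printf("%s: default=%t, with gate set to false=%t, err=%v\n",
			stageNames[s], byDefault, withOptOut, err)
	}
}
```

The model treats both gates identically, since the text above describes the same graduation path for `DisableCloudProviders` and `DisableKubeletCloudCredentialProvider`.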
| 130 | + |
106 | 131 | ### Staging Directory
|
107 | 132 |
|
108 | 133 | There are several sections of code which need to be shared between the K8s/K8s repo and the K8s/Cloud-provider repos.
|
@@ -169,6 +194,154 @@ import (
|
169 | 194 | )
|
170 | 195 | ```
|
171 | 196 |
|
| 197 | +## Production Readiness Review Questionnaire |
| 198 | + |
| 199 | +### Feature Enablement and Rollback |
| 200 | + |
| 201 | +_This section must be completed when targeting alpha to a release._ |
| 202 | + |
| 203 | +* **How can this feature be enabled / disabled in a live cluster?** |
| 204 | + - [X] Feature gate (also fill in values in `kep.yaml`) |
| 205 | + - Feature gate name: DisableCloudProviders |
| 206 | + - Components depending on the feature gate: kubelet, kube-apiserver, kube-controller-manager |
| 207 | + - [X] Feature gate (also fill in values in `kep.yaml`) |
| 208 | + - Feature gate name: DisableKubeletCloudCredentialProvider |
| 209 | + - Components depending on the feature gate: kubelet |
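As a rough illustration of the enablement described above, the sketch below parses a `Name=bool,Name=bool` value of the kind passed to the `--feature-gates` component flag and reports which components would need it. The helper `parseFeatureGates` and the `gateComponents` table are hypothetical and written for this example only; the component lists are taken from the answers above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gateComponents lists, per gate, the components named in this KEP as
// depending on it. Illustrative table, not generated from Kubernetes source.
var gateComponents = map[string][]string{
	"DisableCloudProviders":                 {"kubelet", "kube-apiserver", "kube-controller-manager"},
	"DisableKubeletCloudCredentialProvider": {"kubelet"},
}

// parseFeatureGates is a hypothetical helper that parses a comma-separated
// list of Name=bool pairs into a map.
func parseFeatureGates(value string) (map[string]bool, error) {
	gates := map[string]bool{}
	for _, pair := range strings.Split(value, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue // tolerate empty entries such as a trailing comma
		}
		name, raw, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("missing '=' in %q", pair)
		}
		enabled, err := strconv.ParseBool(strings.TrimSpace(raw))
		if err != nil {
			return nil, fmt.Errorf("invalid value for %q: %w", name, err)
		}
		gates[strings.TrimSpace(name)] = enabled
	}
	return gates, nil
}

func main() {
	value := "DisableCloudProviders=true,DisableKubeletCloudCredentialProvider=true"
	gates, err := parseFeatureGates(value)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Iterate in a fixed order so the report is stable across runs.
	for _, name := range []string{"DisableCloudProviders", "DisableKubeletCloudCredentialProvider"} {
		fmt.Printf("%s=%t must be set on: %s\n",
			name, gates[name], strings.Join(gateComponents[name], ", "))
	}
}
```

Rolling back would correspond to setting the same gates to `false` on the same components.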
| 210 | + |
| 211 | +* **Does enabling the feature change any default behavior?** |
| 212 | + Yes, enabling this feature will disable all capabilities enabled when `--cloud-provider` is set in core components. |
| 213 | + Users need to ensure they have migrated to out-of-tree components prior to enabling this feature gate. |
| 214 | + If appropriate extensions (CCM, credential provider, apiserver-network-proxy, etc) are in use, cloud provider capabilities |
| 215 | + should remain the same at the very least. |
| 216 | + |
| 217 | +* **Can the feature be disabled once it has been enabled (i.e. can we roll back |
| 218 | + the enablement)?** |
| 219 | + Yes, the feature can be disabled once it is enabled. If disabled, users must ensure |
| 220 | + that the CCM is no longer running in the cluster. Credential provider plugins and the |
| 221 | + apiserver network proxy do not have to be stopped on rollback. |
| 222 | + |
| 223 | +* **What happens if we reenable the feature if it was previously rolled back?** |
| 224 | + |
| 225 | + All capabilities from in-tree cloud providers will be re-disabled. |
| 226 | + |
| 227 | +* **Are there any tests for feature enablement/disablement?** |
| 228 | + Adequate unit tests, component integration tests and e2e tests will be added for this feature before |
| 229 | + it goes to beta and is on by default. |
| 230 | + |
| 231 | +### Rollout, Upgrade and Rollback Planning |
| 232 | + |
| 233 | +_This section must be completed when targeting beta graduation to a release._ |
| 234 | + |
| 235 | +* **How can a rollout fail? Can it impact already running workloads?** |
| 236 | + |
| 237 | + TBD for beta. |
| 238 | + |
| 239 | +* **What specific metrics should inform a rollback?** |
| 240 | + |
| 241 | + TBD for beta. |
| 242 | + |
| 243 | +* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?** |
| 244 | + |
| 245 | + TBD for beta. |
| 246 | + |
| 247 | +* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, |
| 248 | +fields of API types, flags, etc.?** |
| 249 | + |
| 250 | + TBD for beta. |
| 251 | + |
| 252 | +### Monitoring Requirements |
| 253 | + |
| 254 | +_This section must be completed when targeting beta graduation to a release._ |
| 255 | + |
| 256 | +* **How can an operator determine if the feature is in use by workloads?** |
| 257 | + |
| 258 | + TBD for beta. |
| 259 | + |
| 260 | +* **What are the SLIs (Service Level Indicators) an operator can use to determine |
| 261 | +the health of the service?** |
| 262 | + |
| 263 | + TBD for beta. |
| 264 | + |
| 265 | +* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** |
| 266 | + |
| 267 | + TBD for beta. |
| 268 | + |
| 269 | +* **Are there any missing metrics that would be useful to have to improve observability |
| 270 | +of this feature?** |
| 271 | + |
| 272 | + TBD for beta. |
| 273 | + |
| 274 | +### Dependencies |
| 275 | + |
| 276 | +_This section must be completed when targeting beta graduation to a release._ |
| 277 | + |
| 278 | +* **Does this feature depend on any specific services running in the cluster?** |
| 279 | + |
| 280 | + TBD for beta. |
| 281 | + |
| 282 | + |
| 283 | +### Scalability |
| 284 | + |
| 285 | +_For alpha, this section is encouraged: reviewers should consider these questions |
| 286 | +and attempt to answer them._ |
| 287 | + |
| 288 | +_For beta, this section is required: reviewers must answer these questions._ |
| 289 | + |
| 290 | +_For GA, this section is required: approvers should be able to confirm the |
| 291 | +previous answers based on experience in the field._ |
| 292 | + |
| 293 | +* **Will enabling / using this feature result in any new API calls?** |
| 294 | + |
| 295 | + No, if anything it will result in reduced API calls in core components. |
| 296 | + |
| 297 | +* **Will enabling / using this feature result in introducing new API types?** |
| 298 | + |
| 299 | + No. |
| 300 | + |
| 301 | +* **Will enabling / using this feature result in any new calls to the cloud |
| 302 | +provider?** |
| 303 | + |
| 304 | + No, it will actually remove calls to the cloud provider in all core components. |
| 305 | + |
| 306 | +* **Will enabling / using this feature result in increasing size or count of |
| 307 | +the existing API objects?** |
| 308 | + |
| 309 | + No. |
| 310 | + |
| 311 | +* **Will enabling / using this feature result in increasing time taken by any |
| 312 | +operations covered by [existing SLIs/SLOs]?** |
| 313 | + |
| 314 | + No. |
| 315 | + |
| 316 | +* **Will enabling / using this feature result in non-negligible increase of |
| 317 | +resource usage (CPU, RAM, disk, IO, ...) in any components?** |
| 318 | + |
| 319 | + No. In fact, it should reduce resource usage. |
| 320 | + |
| 321 | +### Troubleshooting |
| 322 | + |
| 323 | +The Troubleshooting section currently serves the `Playbook` role. We may consider |
| 324 | +splitting it into a dedicated `Playbook` document (potentially with some monitoring |
| 325 | +details). For now, we leave it here. |
| 326 | + |
| 327 | +_This section must be completed when targeting beta graduation to a release._ |
| 328 | + |
| 329 | +* **How does this feature react if the API server and/or etcd is unavailable?** |
| 330 | + |
| 331 | +TBD for beta. |
| 332 | + |
| 333 | +* **What are other known failure modes?** |
| 334 | + |
| 335 | +TBD for beta. |
| 336 | + |
| 337 | +* **What steps should be taken if SLOs are not being met to determine the problem?** |
| 338 | + |
| 339 | +TBD for beta. |
| 340 | + |
| 341 | +[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md |
| 342 | +[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos |
| 343 | + |
| 344 | + |
172 | 345 | ## Alternatives
|
173 | 346 |
|
174 | 347 | ### Staging Alternatives
|