Commit ea4b9cc

Merge pull request kubernetes#2028 from feiskyer/update-az-kep
Update Out-of-Tree Azure Cloud Provider KEP
2 parents 1274d44 + 9c7e16e commit ea4b9cc

1 file changed

+171
-14
lines changed

keps/sig-cloud-provider/azure/20190125-out-of-tree-azure.md

Lines changed: 171 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -17,7 +17,7 @@ approvers:
1717
- "@jagosan"
1818
editor: "@feiskyer"
1919
creation-date: 2019-01-29
20-
last-updated: 2020-01-18
20+
last-updated: 2020-09-29
2121
status: implementable
2222
---
2323

@@ -40,8 +40,17 @@ status: implementable
4040
- [Design Details](#design-details)
4141
- [Test Plan](#test-plan)
4242
- [Graduation Criteria](#graduation-criteria)
43+
- [Alpha -> Beta Graduation](#alpha---beta-graduation)
44+
- [Beta -> GA Graduation](#beta---ga-graduation)
4345
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
4446
- [Version Skew Strategy](#version-skew-strategy)
47+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
48+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
49+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
50+
- [Monitoring Requirements](#monitoring-requirements)
51+
- [Dependencies](#dependencies)
52+
- [Scalability](#scalability)
53+
- [Troubleshooting](#troubleshooting)
4554
- [Implementation History](#implementation-history)
4655
- [Technical Leads are members of the Kubernetes Organization](#technical-leads-are-members-of-the-kubernetes-organization)
4756
- [Subproject Leads](#subproject-leads)
@@ -50,18 +59,22 @@ status: implementable
5059

5160
## Release Signoff Checklist
5261

53-
- [X] k/enhancements issue in release milestone and linked to KEP (https://github.com/kubernetes/enhancements/issues/667)
54-
- [X] KEP approvers have set the KEP status to `implementable`
55-
- [X] Design details are appropriately documented
56-
- [X] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
57-
- [X] Graduation criteria is in place
58-
- [X] "Implementation History" section is up-to-date for milestone
59-
- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
62+
Items marked with (R) are required *prior to targeting to a milestone / release*.
63+
64+
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://github.com/kubernetes/enhancements/issues/667).
65+
- [x] (R) KEP approvers have approved the KEP status as `implementable`
66+
- [x] (R) Design details are appropriately documented
67+
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
68+
- [x] (R) Graduation criteria is in place
69+
- [x] (R) Production readiness review completed
70+
- [x] Production readiness review approved
71+
- [x] "Implementation History" section is up-to-date for milestone
72+
- [x] User-facing documentation has been created in [kubernetes-sigs/cloud-provider-azure](https://kubernetes-sigs.github.io/cloud-provider-azure/)
73+
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
6074

6175
## Summary
6276

63-
Build support for the out-of-tree Azure cloud provider. This involves a well-tested version of the cloud-controller-manager
64-
that has feature parity to the kube-controller-manager.
77+
Build support for the out-of-tree Azure cloud provider. This involves a well-tested version of the cloud-controller-manager that has feature parity to the kube-controller-manager.
6578

6679
## Motivation
6780

@@ -124,7 +137,7 @@ cloud-provider-azure/
124137

125138
- The core of Azure cloud provider would be moved to [kubernetes-sigs/cloud-provider-azure](https://github.com/kubernetes-sigs/cloud-provider-azure).
126139
- The storage drivers would be moved to [kubernetes-sigs/azuredisk-csi-driver](https://github.com/kubernetes-sigs/azuredisk-csi-driver) and [kubernetes-sigs/azurefile-csi-driver](https://github.com/kubernetes-sigs/azurefile-csi-driver).
127-
- The credential provider is still under discussion on [kubernetes/cloud-provider#13](https://github.com/kubernetes/cloud-provider/issues/13).
140+
- The credential provider is tracked by the out-of-tree credential provider [KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-cloud-provider/20191004-out-of-tree-credential-providers.md) and won't block the progress of this feature.
128141

129142
### Risks and Mitigation
130143

@@ -162,13 +175,27 @@ See [report](https://testgrid.k8s.io/provider-azure-cloud-provider-azure) for mo
162175

163176
- Azure cloud controller manager is moving to GA
164177
- Feature compatible with KCM
165-
- Conformance tests are passed and published to testgrid
178+
- Conformance tests are passed and published to [testgrid](https://testgrid.k8s.io/provider-azure-cloud-provider-azure)
166179
- CSI drivers for AzureDisk/AzureFile are moving to GA
167180
- Feature compatible with KCM
168-
- Conformance tests are passed and published to testgrid
181+
- Features implemented from CSI API SPEC
182+
- Conformance tests are passed and published to [testgrid](https://testgrid.k8s.io/provider-azure-azuredisk-csi-driver)
169183
- Azure credential provider is still supported in Kubelet
170184
- Feature compatible with KCM
171-
- Conformance tests are passed and published to testgrid
185+
- Features implemented from CSI API SPEC
186+
- Conformance tests are passed and published to [testgrid](https://testgrid.k8s.io/provider-azure-cloud-provider-azure)
187+
188+
#### Alpha -> Beta Graduation
189+
190+
- E2E tests have been added in [testgrid](https://testgrid.k8s.io/provider-azure-cloud-provider-azure)
191+
- The same set of tests passes with the out-of-tree projects
192+
- All the features from in-tree implementations are still supported
193+
194+
#### Beta -> GA Graduation
195+
196+
- Code changes are decoupled from the in-tree cloud provider (e.g. it shouldn't vendor in-tree implementations directly)
197+
- E2E tests have been run stably (e.g. no flaky tests)
198+
- Upgrade tests and scalability tests have passed
172199

173200
### Upgrade / Downgrade Strategy
174201

@@ -181,6 +208,136 @@ For each Kubernetes minor releases (e.g. v1.15.x), dedicated Azure cloud control
181208
- The version matrix for Azure cloud controller manager would be documented on [kubernetes/cloud-provider-azure](https://github.com/kubernetes/cloud-provider-azure/blob/master/README.md#current-status).
182209
- The version matrix for CSI drivers would be documented on [kubernetes-sigs/azuredisk-csi-driver](https://github.com/kubernetes-sigs/azuredisk-csi-driver#container-images--csi-compatibility) and [kubernetes-sigs/azurefile-csi-driver](https://github.com/kubernetes-sigs/azurefile-csi-driver#container-images--csi-compatibility).
183210

211+
## Production Readiness Review Questionnaire
212+
213+
### Feature Enablement and Rollback
214+
215+
_This section must be completed when targeting alpha to a release._
216+
217+
* **How can this feature be enabled / disabled in a live cluster?**
218+
- [x] Feature gate (also fill in values in `kep.yaml`)
219+
- Feature gate name: CSIMigrationAzureDisk and CSIMigrationAzureFile
220+
- Components depending on the feature gate: kube-controller-manager and kubelet
221+
- [x] Other
222+
- Describe the mechanism: deploy cloud-controller-manager, cloud-node-manager and CSI drivers in the cluster.
223+
- Will enabling / disabling the feature require downtime of the control
224+
plane? `--cloud-provider=external` should be set for kube-controller-manager.
225+
- Will enabling / disabling the feature require downtime or reprovisioning
226+
of a node? `--cloud-provider=external` should be set for kubelet.
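
As a minimal sketch of the flag requirement above, the snippet below checks whether a component's argument list sets `--cloud-provider=external`. The helper names `cloud_provider_flag` and `uses_out_of_tree_provider` are hypothetical and not part of any Kubernetes tooling; the sketch assumes the arguments are already available as a list of strings.

```python
from typing import Optional, Sequence


def cloud_provider_flag(args: Sequence[str]) -> Optional[str]:
    """Return the value of --cloud-provider in args, or None if it is absent.

    Handles both "--cloud-provider=value" and "--cloud-provider value".
    If the flag appears more than once, the last occurrence wins.
    """
    value = None
    for i, arg in enumerate(args):
        if arg.startswith("--cloud-provider="):
            value = arg.split("=", 1)[1]
        elif arg == "--cloud-provider" and i + 1 < len(args):
            value = args[i + 1]
    return value


def uses_out_of_tree_provider(args: Sequence[str]) -> bool:
    """True when the component is configured for an external cloud provider."""
    return cloud_provider_flag(args) == "external"
```

The same check applies to both kube-controller-manager and kubelet; a value of `azure` corresponds to the in-tree provider described in the rollback answer.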
227+
228+
* **Does enabling the feature change any default behavior?**
229+
230+
The default behaviors remain the same as before.
231+
232+
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
233+
the enablement)?**
234+
235+
Yes. Deleting the cloud-controller-manager and cloud-node-manager, then changing the `--cloud-provider`
235+
option back to `azure`, would still work. CSI drivers should be kept to ensure CSI-provisioned PVCs keep working.
237+
238+
* **What happens if we reenable the feature if it was previously rolled back?**
239+
240+
It would still work as expected.
241+
242+
* **Are there any tests for feature enablement/disablement?**
243+
244+
E2E tests have already been added and results are published on testgrid.
245+
246+
### Rollout, Upgrade and Rollback Planning
247+
248+
_This section must be completed when targeting beta graduation to a release._
249+
250+
* **How can a rollout fail? Can it impact already running workloads?**
251+
252+
Wrong component configurations may cause the rollout to fail; running workloads won't be impacted.
253+
254+
* **What specific metrics should inform a rollback?**
255+
256+
Failure to create a LoadBalancer-typed service or an AzureDisk PVC indicates that the rollout needs to be rolled back.
257+
258+
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
259+
260+
Manually changing the `--cloud-provider` option has been verified. For upgrade->downgrade,
261+
the volumes provisioned by CSI drivers should continue to be managed by CSI drivers. They
262+
cannot be migrated to in-tree drivers.
263+
264+
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
265+
fields of API types, flags, etc.?**
266+
267+
In-tree AzureDisk/AzureFile drivers would be migrated to CSI drivers automatically.
268+
269+
### Monitoring Requirements
270+
271+
_This section must be completed when targeting beta graduation to a release._
272+
273+
* **How can an operator determine if the feature is in use by workloads?**
274+
275+
Operation specific metrics (e.g. LoadBalancer creation and route table update) have been added.
276+
277+
* **What are the SLIs (Service Level Indicators) an operator can use to determine
278+
the health of the service?**
279+
- [x] Metrics
280+
- Metric names:
281+
- cloudprovider_azure_op_duration_seconds
282+
- cloudprovider_azure_api_request_errors
283+
- cloudprovider_azure_api_request_throttled_count
284+
- cloudprovider_azure_op_duration_seconds_bucket
285+
- Components exposing the metric: cloud-controller-manager and CSI drivers
286+
287+
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
288+
289+
- 99.5% of read and write ARM requests in the last 5 minutes were successful
290+
- LoadBalancer service requests in the last 5 minutes are served in 60 seconds @99th percentile
291+
- Routes for new nodes in the last 5 minutes are served in 90 seconds @99th percentile
292+
- Disk PVC attach requests in the last 5 minutes are served in 60 seconds @99th percentile
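
The SLOs above can be sketched as a simple check over one 5-minute window of samples. This is an illustrative sketch only: the function names and the input shape (request counts plus lists of latencies in seconds) are assumptions, and the percentile uses the nearest-rank method rather than whatever a monitoring system would compute from `cloudprovider_azure_op_duration_seconds_bucket`. The thresholds (99.5%, 60 s, 90 s, 60 s) are taken from the list above.

```python
from typing import Dict, Sequence


def success_ratio(total: int, failed: int) -> float:
    """Fraction of successful requests; an empty window counts as fully successful."""
    if total == 0:
        return 1.0
    return (total - failed) / total


def percentile(samples: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not samples:
        raise ValueError("no samples in window")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100 avoids floating-point rounding surprises.
    rank = (q * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def check_slos(
    arm_total: int,
    arm_failed: int,
    lb_seconds: Sequence[float],
    route_seconds: Sequence[float],
    attach_seconds: Sequence[float],
) -> Dict[str, bool]:
    """Evaluate each SLO for one 5-minute window of observations."""
    return {
        "arm_success_99_5_percent": success_ratio(arm_total, arm_failed) >= 0.995,
        "loadbalancer_p99_within_60s": percentile(lb_seconds, 99) <= 60,
        "route_p99_within_90s": percentile(route_seconds, 99) <= 90,
        "disk_attach_p99_within_60s": percentile(attach_seconds, 99) <= 60,
    }
```

A window in which any entry is `False` is one where the corresponding SLO was missed and the troubleshooting steps apply.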
293+
294+
### Dependencies
295+
296+
_This section must be completed when targeting beta graduation to a release._
297+
298+
* **Does this feature depend on any specific services running in the cluster?**
299+
300+
CSI drivers for AzureDisk/AzureFile are required for out-of-tree cloud provider,
301+
and their plans have already been covered in the design above.
302+
303+
### Scalability
304+
305+
_For alpha, this section is encouraged: reviewers should consider these questions
306+
and attempt to answer them._
307+
308+
_For beta, this section is required: reviewers must answer these questions._
309+
310+
_For GA, this section is required: approvers should be able to confirm the
311+
previous answers based on experience in the field._
312+
313+
* **Will enabling / using this feature result in any new API calls?**
314+
315+
Yes, CSI drivers for AzureDisk/AzureFile would be introduced.
316+
317+
* **Will enabling / using this feature result in introducing new API types?**
318+
319+
Yes, CSI drivers for AzureDisk/AzureFile would be introduced.
320+
321+
### Troubleshooting
322+
323+
The Troubleshooting section currently serves the `Playbook` role. We may consider
324+
splitting it into a dedicated `Playbook` document (potentially with some monitoring
325+
details). For now, we leave it here.
326+
327+
_This section must be completed when targeting beta graduation to a release._
328+
329+
* **How does this feature react if the API server and/or etcd is unavailable?**
330+
331+
Same as before.
332+
333+
* **What are other known failure modes?**
334+
335+
Refer to <https://kubernetes-sigs.github.io/cloud-provider-azure/faq>.
336+
337+
* **What steps should be taken if SLOs are not being met to determine the problem?**
338+
339+
Check the debug logs of cloud-provider-azure, since detailed steps are logged at debug level.
340+
184341
## Implementation History
185342

186343
See [kubernetes/cloud-provider-azure#pulls](https://github.com/kubernetes/cloud-provider-azure/pulls?utf8=%E2%9C%93&q=+is%3Apr+), [kubernetes-sigs/azuredisk-csi-driver#pulls](https://github.com/kubernetes-sigs/azuredisk-csi-driver/pulls?utf8=%E2%9C%93&q=is%3Apr++) and [kubernetes-sigs/azurefile-csi-driver#pulls](https://github.com/kubernetes-sigs/azurefile-csi-driver/pulls?utf8=%E2%9C%93&q=is%3Apr++).
