|
3 | 3 | ## Table of Contents
|
4 | 4 |
|
5 | 5 | <!-- toc -->
|
| 6 | + |
6 | 7 | - [Summary](#summary)
|
7 | 8 | - [Motivation](#motivation)
|
8 | 9 | - [User stories](#user-stories)
|
|
19 | 20 | - [Beta->GA](#beta-ga)
|
20 | 21 | - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
|
21 | 22 | - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
|
| 23 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 24 | + - [Monitoring Requirements](#monitoring-requirements) |
| 25 | + - [Dependencies](#dependencies) |
22 | 26 | - [Scalability](#scalability)
|
| 27 | + - [Troubleshooting](#troubleshooting) |
23 | 28 | - [Alternatives](#alternatives)
|
24 | 29 | - [Implementation History](#implementation-history)
|
25 | 30 | <!-- /toc -->
|
@@ -215,6 +220,53 @@ Option 1 is adopted. See discussion
|
215 | 220 | - **Are there any tests for feature enablement/disablement?** yes, unit tests
|
216 | 221 | will cover this.
|
217 | 222 |
|
| 223 | +### Rollout, Upgrade and Rollback Planning |
| 224 | + |
| 225 | +- **How can a rollout fail? Can it impact already running workloads?** |
| 226 | + Rollout will not fail because this change only exposes an extra field in CSIDriverSpec. |
| 227 | + |
| 228 | +* **What specific metrics should inform a rollback?** |
| 229 | + |
| 230 | + - `storage_operation_duration_seconds`: a high error rate for the |
| 231 | + corresponding CSI plugin, seen when aggregating on `status`. |
| 232 | + |
| 233 | +* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?** |
| 234 | + No. After a downgrade to a kube-apiserver that does not have the added fields, |
| 235 | + existing volumes will continue to work as long as they do not rely on the |
| 236 | + acquired token being valid. |
| 237 | + |
| 238 | +* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, |
| 239 | + fields of API types, flags, etc.?** |
| 240 | + No. |
| 241 | + |
| 242 | +### Monitoring Requirements |
| 243 | + |
| 244 | +- **How can an operator determine if the feature is in use by workloads?** |
| 245 | + Run `kubectl get CSIDriver` and check whether `tokenRequests` or `requiresRepublish` |
| 246 | + is specified. |
| 247 | + |
| 248 | +- **What are the SLIs (Service Level Indicators) an operator can use to determine |
| 249 | + the health of the service?** |
| 250 | + |
| 251 | + - [x] Metrics |
| 252 | + - Metric name: `storage_operation_duration_seconds` |
| 253 | + - Aggregation method: `volume_plugin`, `operation_name`, `status` |
| 254 | + - Components exposing the metric: kubelet |
| 255 | + |
| 256 | +* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** |
| 257 | + For a given CSI plugin, the per-day percentage of failed storage operations |
| 258 | + should be <= 1%. |
| 259 | + |
| 260 | +- **Are there any missing metrics that would be useful to have to improve observability |
| 261 | + of this feature?** |
| 262 | + None |
| 263 | + |
| 264 | +### Dependencies |
| 265 | + |
| 266 | +- **Does this feature depend on any specific services running in the cluster?** |
| 267 | + |
| 268 | + No new components are required, but the feature requires kubelets >= 1.12. |
| 269 | + |
218 | 270 | ### Scalability
|
219 | 271 |
|
220 | 272 | - **Will enabling / using this feature result in any new API calls?**
|
@@ -245,6 +297,25 @@ Option 1 is adopted. See discussion
|
245 | 297 | - **Will enabling / using this feature result in non-negligible increase of
|
246 | 298 | resource usage (CPU, RAM, disk, IO, ...) in any components?** no.
|
247 | 299 |
|
| 300 | +### Troubleshooting |
| 301 | + |
| 302 | +- **How does this feature react if the API server and/or etcd is unavailable?** |
| 303 | + `RequiresRepublish` will continue to function but `TokenRequests` will fail. |
| 304 | + |
| 305 | +- **What are other known failure modes?** |
| 306 | + |
| 307 | + - Failed to fetch token |
| 308 | + |
| 309 | + - Detection: Check mount failure in Pod events or kubelet log. |
| 310 | + - Mitigations: Set `TokenRequests=[]`; subsequent `NodePublishVolume` calls will |
| 311 | + not have tokens in volume attributes. Tokens retrieved earlier will |
| 312 | + eventually expire. |
| 313 | + - Diagnostics: Search "mounter.SetUpAt failed to get service accoount token attributes" |
| 314 | + - Testing: E2E test |
| 315 | + |
| 316 | +- **What steps should be taken if SLOs are not being met to determine the problem?** |
| 317 | + None. |
| 318 | + |
248 | 319 | ## Alternatives
|
249 | 320 |
|
250 | 321 | 1. Instead of fetching tokens in kubelet, CSI drivers will be granted
|
|