Commit 9ffa073

Merge pull request kubernetes#3188 from gnufied/add-recover-resizer-prr

KEP 1790: Update recover resize failure KEP for going beta.

2 parents ba726bf + fde8cea

File tree: 3 files changed, +69 −64 lines
Lines changed: 2 additions & 0 deletions

@@ -1,3 +1,5 @@
 kep-number: 1790
 alpha:
   approver: "@deads2k"
+beta:
+  approver: "@deads2k"

keps/sig-storage/1790-recover-resize-failure/README.md

Lines changed: 63 additions & 61 deletions
@@ -224,8 +224,7 @@ The complete expansion and recovery flow of both control-plane and kubelet is do
 ### Risks and Mitigations
 
 - Once expansion is initiated, the lowering of the requested size is only allowed to a value *greater* than `pvc.Status`. It is not possible to go entirely back to the previously requested size. This should not be a problem in practice, however, because the user can retry expansion with a slightly higher value than `pvc.Status` and still recover from a previously failing expansion request.
-
-
+
 ## Graduation Criteria
 
 * *Alpha* in 1.23 behind the `RecoverExpansionFailure` feature gate, set to a default of `false`.
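The lowering rule in the hunk above can be sketched as follows. `can_lower_request` is a hypothetical illustration, not the actual apiserver validation:

```python
def can_lower_request(new_request_gib, status_capacity_gib):
    """Sketch of the rule described above: a PVC request may only be
    lowered to a value strictly greater than the size already recorded
    in pvc.Status; going all the way back is rejected."""
    return new_request_gib > status_capacity_gib

# A user who asked for 100Gi against a 10Gi volume and hit a failure can
# retry with e.g. 11Gi, but cannot shrink back to exactly 10Gi.
assert can_lower_request(11, 10)      # recovery retry allowed
assert not can_lower_request(10, 10)  # full rollback rejected
```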
@@ -258,12 +257,12 @@ The complete expansion and recovery flow of both control-plane and kubelet is do
 of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
 
 * **Does enabling the feature change any default behavior?**
-  Allow users to reduce size of pvc in `pvc.spec.resources`. In general this was not permitted before
+  Allow users to reduce size of pvc in `pvc.spec.resources`. In general this was not permitted before,
   so it should not break any existing automation. This means that if `pvc.Status.AllocatedResources` is available it will be
   used for calculating quota.
 
-  To facilitate older kubelets - external resize controller will set `pvc.Status.ResizeStatus` to "''" after entire expansion process is complete. This will ensure that `ResizeStatus` is updated
-  after expansion is complete even with older kubelets. No recovery from expansion failure will be possible in this case and the workaround will be removed once feature goes GA.
+  To facilitate an older kubelet, the external resize controller will set `pvc.Status.ResizeStatus` to "''" after the entire expansion process is complete. This ensures that `ResizeStatus` is updated
+  after expansion is complete even with an older kubelet. No recovery from expansion failure will be possible in this case, and the workaround will be removed once the feature goes GA.
 
 One more thing to keep in mind is that enabling this feature in kubelet while keeping it disabled in external-resizer will cause
 all volume expansion operations to get stuck (a similar thing will happen when the feature moves to beta and kubelet is newer but the external-resizer sidecar is older).
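The quota interaction mentioned above can be sketched as a toy rule. Reading `pvc.Status.AllocatedResources` as a floor for quota is an assumption drawn from the text, and `quota_usage_gib` is a hypothetical helper, not the real quota controller:

```python
def quota_usage_gib(spec_request_gib, allocated_gib=None):
    """Illustrative sketch: when pvc.Status.AllocatedResources is
    populated, charge quota from the larger of the user's request and
    what was already allocated, so lowering pvc.spec.resources cannot
    be used to dodge quota."""
    if allocated_gib is None:  # field absent (e.g. feature disabled)
        return spec_request_gib
    return max(spec_request_gib, allocated_gib)

assert quota_usage_gib(10) == 10        # no AllocatedResources: use spec
assert quota_usage_gib(10, 100) == 100  # lowered request still charged at 100
```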
@@ -288,73 +287,70 @@ after expansion is complete even with older kubelets. No recovery from expansion
 
 ### Rollout, Upgrade and Rollback Planning
 
-_This section must be completed when targeting beta graduation to a release._
-
 * **How can a rollout fail? Can it impact already running workloads?**
   This change should not impact existing workloads, and it requires user interaction via reducing pvc capacity.
 
 * **What specific metrics should inform a rollback?**
+  No specific metric, but if expansion of PVCs is getting stuck (which can be verified from `pvc.Status.Conditions`)
+  then the user should plan a rollback.
 
 * **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**
-  Describe manual testing that was done and the outcomes.
-  Longer term, we may want to require automated upgrade/rollback tests, but we
-  are missing a bunch of machinery and tooling and do that now.
+  We have not fully tested upgrade and rollback, but we will have it tested as part of the beta process.
 
 * **Is the rollout accompanied by any deprecations and/or removals of features,
   APIs, fields of API types, flags, etc.?**
-  Even if applying deprecation policies, they may still surprise some users.
+  This feature deprecates no existing functionality.
 
 ### Monitoring requirements
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **How can an operator determine if the feature is in use by workloads?**
-  Ideally, this should be a metric. Operations against the Kubernetes API (e.g.
-  checking if there are objects with field X set) may be a last resort. Avoid
-  logs or events for this purpose.
-
+  Any volume that has been recovered will emit a metric: `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
+
 * **What are the SLIs (Service Level Indicators) an operator can use to
   determine the health of the service?**
   - [ ] Metrics
-    - Metric name:
-    - [Optional] Aggregation method:
-    - Components exposing the metric:
+    - controller expansion operation duration:
+      - Metric name: `storage_operation_duration_seconds{operation_name=expand_volume}`
+      - [Optional] Aggregation method: percentile
+      - Components exposing the metric: kube-controller-manager
+    - controller expansion operation errors:
+      - Metric name: `storage_operation_errors_total{operation_name=expand_volume}`
+      - [Optional] Aggregation method: cumulative counter
+      - Components exposing the metric: kube-controller-manager
+    - node expansion operation duration:
+      - Metric name: `storage_operation_duration_seconds{operation_name=volume_fs_resize}`
+      - [Optional] Aggregation method: percentile
+      - Components exposing the metric: kubelet
+    - node expansion operation errors:
+      - Metric name: `storage_operation_errors_total{operation_name=volume_fs_resize}`
+      - [Optional] Aggregation method: cumulative counter
+      - Components exposing the metric: kubelet
   - [ ] Other (treat as last resort)
     - Details:
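The percentile aggregation named above would normally be computed by the monitoring stack (e.g. a Prometheus `histogram_quantile` over `storage_operation_duration_seconds` buckets). A minimal nearest-rank sketch over hypothetical raw samples illustrates the comparison an operator would make:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over raw duration samples; a stand-in
    for what a monitoring stack computes from histogram buckets."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical expand_volume durations (seconds) before and after rollout:
before = [1.0, 1.1, 1.2, 1.3, 9.0]
after = [1.0, 1.1, 1.2, 1.4, 9.1]
# The expectation: p95 should not regress materially once the feature is on.
assert percentile(before, 95) == 9.0
assert percentile(after, 95) == 9.1
```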
 
 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-  At the high-level this usually will be in the form of "high percentile of SLI
-  per day <= X". It's impossible to provide comprehensive guidance, but at the very
-  high level (they need more precise definitions) those may be things like:
-  - per-day percentage of API calls finishing with 5XX errors <= 1%
-  - 99th percentile over day of absolute value from (job creation time minus expected
-    job creation time) for cron job <= 10%
-  - 99.9% of /health requests per day finish with 200 code
+  After this feature is rolled out, there should not be any increase in the 95th-99th percentiles of
+  both `expand_volume` and `volume_fs_resize` durations. The error rate should also not increase for
+  the `storage_operation_errors_total` metric.
 
 * **Are there any missing metrics that would be useful to have to improve
   observability of this feature?**
-  Describe the metrics themselves and the reason they weren't added (e.g. cost,
-  implementation difficulties, etc.).
-
-### Dependencies
+  We are planning to add new counter metrics that will record success and failure of recovery operations.
+  In cases where recovery fails, the counter will keep increasing until an admin action resolves the error.
 
-_This section must be completed when targeting beta graduation to a release._
-
-* **Does this feature depend on any specific services running in the cluster?**
-  Think about both cluster-level services (e.g. metrics-server) as well
-  as node-level agents (e.g. specific version of CRI). Focus on external or
-  optional services that are needed. For example, if this feature depends on
-  a cloud provider API, or upon an external software-defined storage or network
-  control plane.
+  The tentative name of the metric is `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
 
-  For each of these fill in the following, thinking both about running user workloads
-  and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
-    - Usage description:
-      - Impact of its outage on the feature:
-      - Impact of its degraded performance or high error rates on the feature:
+  The reason for using the PV name as a label is that we do not expect this feature to be used in a cluster very often,
+  and hence it should be okay to use the names of PVs that were recovered this way.
 
+### Dependencies
 
+* **Does this feature depend on any specific services running in the cluster?**
+  For CSI volumes this feature requires users to run the latest version of the `external-resizer`
+  sidecar, which should also have the `RecoverExpansionFailure` feature enabled.
+
 ### Scalability
 
 _For beta, this section is required: reviewers must answer these questions._
@@ -363,14 +359,19 @@ _For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field._
 
 * **Will enabling / using this feature result in any new API calls?**
-  None
+  Potentially yes. If a user expands a PVC and the expansion fails on the node, then the
+  expansion controller in the control-plane must intervene and verify whether it is safe
+  to retry expansion on the kubelet. This requires round-trips between the kubelet
+  and the control-plane, and hence more API calls.
 
 * **Will enabling / using this feature result in introducing new API types?**
   None
 
 * **Will enabling / using this feature result in any new calls to cloud
   provider?**
-  None
+  Potentially yes. Since the expansion operation is idempotent and the expansion controller
+  must verify whether it is safe to retry expansion with lower values, there could be
+  additional calls to CSI drivers (and hence cloud providers).
 
 * **Will enabling / using this feature result in increasing size or count
   of the existing API objects?**
@@ -381,16 +382,12 @@ previous answers based on experience in the field._
 
 * **Will enabling / using this feature result in increasing time taken by any
   operations covered by [existing SLIs/SLOs][]?**
-  Think about adding additional work or introducing new steps in between
-  (e.g. need to do X to start a container), etc. Please describe the details.
+  In some cases, if expansion fails on the node, then since it now requires correction
+  from the control-plane, the entire expansion may take longer.
 
 * **Will enabling / using this feature result in non-negligible increase of
   resource usage (CPU, RAM, disk, IO, ...) in any components?**
-  Things to keep in mind include: additional in-memory state, additional
-  non-trivial computations, excessive access to disks (including increased log
-  volume), significant amounts of data sent and/or received over network, etc.
-  Think through this both in small and large cases, again with respect to the
-  [supported limits][].
+  It should not result in an increase of resource usage.
 
 ### Troubleshooting
 
@@ -401,20 +398,25 @@ details). For now we leave it here though.
 _This section must be completed when targeting beta graduation to a release._
 
 * **How does this feature react if the API server and/or etcd is unavailable?**
+  If the API server or etcd is not available midway through a request, then some of the
+  modifications to API objects cannot be saved. But this should be a recoverable state, because
+  the expansion operation is idempotent and the controller will retry the operation and succeed
+  whenever the API server comes back.
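The idempotent-retry behavior described above can be sketched generically. `retry_until_saved` and `flaky_save` are hypothetical stand-ins for the controller's status write, not actual Kubernetes client APIs:

```python
import time

def retry_until_saved(save_fn, max_attempts=5, backoff_s=0.0):
    """Because expansion is idempotent, a status write that failed while
    the API server was down can simply be retried until it is reachable."""
    for attempt in range(max_attempts):
        try:
            return save_fn()
        except ConnectionError:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError("API server still unavailable; giving up for now")

# Simulated API server that is down for the first two calls:
calls = {"n": 0}
def flaky_save():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("apiserver unavailable")
    return "saved"

assert retry_until_saved(flaky_save) == "saved"
```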
 
 * **What are other known failure modes?**
   For each of them fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debugging the issue?
-      Not required until feature graduated to Beta.
-    - Testing: Are there any tests for failure mode? If not describe why.
+  - No recovery is possible if the volume has been expanded in the control-plane and is only failing on the node.
+    - Detection: Expansion is stuck with a `ResizeStatus` of `NodeExpansionPending` or `NodeExpansionFailed`.
+    - Mitigations: This should not affect any existing PVCs, but this was already broken in some sense, and if the volume has been
+      expanded in the control-plane then we can't allow users to shrink their PVCs, because that would violate the quota.
+    - Diagnostics: Expansion is stuck with a `ResizeStatus` of `NodeExpansionPending` or `NodeExpansionFailed`.
+    - Testing: There are some unit tests for this failure mode.
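The detection signal above can be sketched as a small triage helper. The `ResizeStatus` values come from this KEP; `diagnose` itself is hypothetical:

```python
def diagnose(resize_status):
    """Map a pvc.Status.ResizeStatus value to the failure mode above."""
    if resize_status in ("NodeExpansionPending", "NodeExpansionFailed"):
        return ("stuck on node: volume was expanded in the control-plane "
                "but node expansion has not completed")
    if not resize_status:
        return "no expansion in progress"
    return "expansion in progress: " + resize_status

assert "stuck on node" in diagnose("NodeExpansionFailed")
assert diagnose("") == "no expansion in progress"
```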
 
 * **What steps should be taken if SLOs are not being met to determine the problem?**
+  If the admin notices an increase in failed expansion operations via the aforementioned metrics or
+  by observing `pvc.Status.ResizeStatus`, then:
+  - Check events on the PVC and observe why PVC expansion is failing.
+  - Gather logs from kubelet and external-resizer and check for problems.
 
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

keps/sig-storage/1790-recover-resize-failure/kep.yaml

Lines changed: 4 additions & 3 deletions

@@ -17,17 +17,18 @@ prr-approvers:
   - "@deads2k"
 editor: TBD
 creation-date: 2020-01-27
-last-updated: 2021-09-07
+last-updated: 2022-01-26
 status: implementable
 see-also:
 replaces:
 superseded-by:
 
-latest-milestone: "v1.23"
-stage: "alpha"
+latest-milestone: "v1.24"
+stage: "beta"
 
 milestone:
   alpha: "v1.23"
+  beta: "v1.24"
 
 
 feature-gates:
