Commit d7c7e9f

Start adding more details to PRR for recovery KEP
1 parent 24ef1ae commit d7c7e9f

File tree

  • keps/sig-storage/1790-recover-resize-failure

1 file changed: +59 -60 lines


keps/sig-storage/1790-recover-resize-failure/README.md

Lines changed: 59 additions & 60 deletions
@@ -224,8 +224,7 @@ The complete expansion and recovery flow of both control-plane and kubelet is do
 ### Risks and Mitigations
 
 - Once expansion is initiated, the lowering of requested size is only allowed upto a value *greater* than `pvc.Status`. It is not possible to entirely go back to previously requested size. This should not be a problem however in-practice because user can retry expansion with slightly higher value than `pvc.Status` and still recover from previously failing expansion request.
-
-
+
 ## Graduation Criteria
 
 * *Alpha* in 1.23 behind `RecoverExpansionFailure` feature gate with set to a default of `false`.
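The recovery path described in this hunk boils down to re-requesting a smaller, but still larger-than-status, size on the same PVC. A rough sketch of what that could look like with kubectl, assuming the `RecoverExpansionFailure` gate is enabled and using a hypothetical PVC named `pvc-a` with illustrative sizes (not taken from the KEP):

```sh
# Current size recorded in the PVC status (expansion to a larger size is stuck).
kubectl get pvc pvc-a -o jsonpath='{.status.capacity.storage}'

# Lower the request again, but only to a value still greater than the size
# recorded in the PVC status (e.g. 20Gi when status reports 10Gi); going all the
# way back to the original size is not allowed once expansion has been initiated.
kubectl patch pvc pvc-a --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```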
@@ -258,12 +257,12 @@ The complete expansion and recovery flow of both control-plane and kubelet is do
 of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
 
 * **Does enabling the feature change any default behavior?**
-Allow users to reduce size of pvc in `pvc.spec.resources`. In general this was not permitted before
+Allow users to reduce size of pvc in `pvc.spec.resources`. In general this was not permitted before,
 so it should not break any of existing automation. This means that if `pvc.Status.AllocatedResources` is available it will be
 used for calculating quota.
 
-To facilitate older kubelets - external resize controller will set `pvc.Status.ResizeStatus` to "''" after entire expansion process is complete. This will ensure that `ResizeStatus` is updated
-after expansion is complete even with older kubelets. No recovery from expansion failure will be possible in this case and the workaround will be removed once feature goes GA.
+To facilitate older kubelet - external resize controller will set `pvc.Status.ResizeStatus` to "''" after entire expansion process is complete. This will ensure that `ResizeStatus` is updated
+after expansion is complete even with older kubelet. No recovery from expansion failure will be possible in this case and the workaround will be removed once feature goes GA.
 
 One more thing to keep in mind is - enabling this feature in kubelet while keeping it disabled in external-resizer will cause
 all volume expansions operations to get stuck(similar thing will happen when feature moves to beta and kubelet is newer but external-resizer sidecar is older).
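Since quota is computed from `pvc.Status.AllocatedResources` and the external resize controller clears `pvc.Status.ResizeStatus` once expansion finishes, both fields can be watched directly on the PVC. A minimal sketch, assuming the fields serialize under `status` as `allocatedResources` and `resizeStatus` (the exact serialized names are an assumption) and using a hypothetical PVC `pvc-a`:

```sh
# Size used for quota calculation while an expansion is in flight.
kubectl get pvc pvc-a -o jsonpath='{.status.allocatedResources.storage}'

# Resize status; expected to be reset to "" by the external resize controller
# once the whole expansion process is complete (the older-kubelet workaround).
kubectl get pvc pvc-a -o jsonpath='{.status.resizeStatus}'
```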
@@ -288,73 +287,67 @@ after expansion is complete even with older kubelets. No recovery from expansion
 
 ### Rollout, Upgrade and Rollback Planning
 
-_This section must be completed when targeting beta graduation to a release._
-
 * **How can a rollout fail? Can it impact already running workloads?**
 This change should not impact existing workloads and requires user interaction via reducing pvc capacity.
 
 * **What specific metrics should inform a rollback?**
+No specific metric, but if expansions of PVCs are stuck (can be verified from `pvc.Status.Conditions`)
+then the user should plan a rollback.
 
 * **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**
-Describe manual testing that was done and the outcomes.
-Longer term, we may want to require automated upgrade/rollback tests, but we
-are missing a bunch of machinery and tooling and do that now.
+We have not fully tested upgrade and rollback, but we will have it tested as part of the beta process.
 
 * **Is the rollout accompanied by any deprecations and/or removals of features,
 APIs, fields of API types, flags, etc.?**
-Even if applying deprecation policies, they may still surprise some users.
+This feature deprecates no existing functionality.
 
 ### Monitoring requirements
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **How can an operator determine if the feature is in use by workloads?**
-Ideally, this should be a metrics. Operations against Kubernetes API (e.g.
-checking if there are objects with field X set) may be last resort. Avoid
-logs or events for this purpose.
+For a PVC that has undergone recovery from expansion failure successfully, it is not possible
+to identify that the PVC used this feature. But for PVCs for which
+recovery failed even after reducing size, an operator can determine the feature is in use
+by looking at the newly introduced `pvc.Status.ResizeStatus` field.
 
 * **What are the SLIs (Service Level Indicators) an operator can use to
 determine the health of the service?**
 - [ ] Metrics
-  - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
+  - controller expansion operation duration:
+    - Metric name: storage_operation_duration_seconds{operation_name=expand_volume}
+    - [Optional] Aggregation method: percentile
+    - Components exposing the metric: kube-controller-manager
+  - controller expansion operation errors:
+    - Metric name: storage_operation_errors_total{operation_name=expand_volume}
+    - [Optional] Aggregation method: cumulative counter
+    - Components exposing the metric: kube-controller-manager
+  - node expansion operation duration:
+    - Metric name: storage_operation_duration_seconds{operation_name=volume_fs_resize}
+    - [Optional] Aggregation method: percentile
+    - Components exposing the metric: kubelet
+  - node expansion operation errors:
+    - Metric name: storage_operation_errors_total{operation_name=volume_fs_resize}
+    - [Optional] Aggregation method: cumulative counter
+    - Components exposing the metric: kubelet
 - [ ] Other (treat as last resort)
   - Details:
 
 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-At the high-level this usually will be in the form of "high percentile of SLI
-per day <= X". It's impossible to provide a comprehensive guidance, but at the very
-high level (they needs more precise definitions) those may be things like:
-  - per-day percentage of API calls finishing with 5XX errors <= 1%
-  - 99% percentile over day of absolute value from (job creation time minus expected
-    job creation time) for cron job <= 10%
-  - 99,9% of /health requests per day finish with 200 code
+After this feature is rolled out, there should not be any increase in the 95-99th percentile of
+both `expand_volume` and `volume_fs_resize` durations. Also, the error rate should not increase for
+the `storage_operation_errors_total` metric.
 
 * **Are there any missing metrics that would be useful to have to improve
 observability if this feature?**
-Describe the metrics themselves and the reason they weren't added (e.g. cost,
-implementation difficulties, etc.).
+Not applicable.
 
 ### Dependencies
 
-_This section must be completed when targeting beta graduation to a release._
-
 * **Does this feature depend on any specific services running in the cluster?**
-Think about both cluster-level services (e.g. metrics-server) as well
-as node-level agents (e.g. specific version of CRI). Focus on external or
-optional services that are needed. For example, if this feature depends on
-a cloud provider API, or upon an external software-defined storage or network
-control plane.
-
-For each of the fill in the following, thinking both about running user workloads
-and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
-    - Usage description:
-      - Impact of its outage on the feature:
-      - Impact of its degraded performance or high error rates on the feature:
-
-
+For CSI volumes this feature requires users to run the latest version of the `external-resizer`
+sidecar, which should also have the `RecoverExpansionFailure` feature enabled.
+
 ### Scalability
 
 _For beta, this section is required: reviewers must answer these questions._
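To act on the rollback signal mentioned above (expansions stuck, visible via `pvc.Status.Conditions` or the new `ResizeStatus` field), an operator could scan the cluster for stuck PVCs. A sketch, assuming the field serializes as `status.resizeStatus` and using a hypothetical PVC name as a placeholder:

```sh
# List PVCs whose expansion appears stuck; empty output means nothing is
# pending or failed on the node side.
kubectl get pvc --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,RESIZE:.status.resizeStatus' \
  | grep -E 'NodeExpansionPending|NodeExpansionFailed'

# PVC conditions such as Resizing or FileSystemResizePending give a similar
# signal without relying on the new field.
kubectl describe pvc <pvc-name> | grep -A3 Conditions
```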
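For the SLIs listed above, equivalent Prometheus queries might look roughly like the following, assuming `storage_operation_duration_seconds` is exposed as a histogram, the metric names are exactly as written in the text above, and a Prometheus at a placeholder URL scrapes kube-controller-manager and kubelet:

```sh
PROM=http://prometheus.example:9090   # placeholder Prometheus endpoint

# 99th percentile of controller-side expansion duration over the last day.
curl -G "$PROM/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.99, sum(rate(storage_operation_duration_seconds_bucket{operation_name="expand_volume"}[1d])) by (le))'

# Error rate for node-side filesystem resize over the last day.
curl -G "$PROM/api/v1/query" --data-urlencode \
  'query=sum(rate(storage_operation_errors_total{operation_name="volume_fs_resize"}[1d]))'
```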
@@ -363,14 +356,19 @@ _For GA, this section is required: approvers should be able to confirms the
 previous answers based on experience in the field._
 
 * **Will enabling / using this feature result in any new API calls?**
-None
+Potentially yes. If a user expands a PVC and expansion fails on the node, then the
+expansion controller in the control-plane must intervene and verify if it is safe
+to retry expansion on the kubelet. This requires round-trips between the kubelet
+and the control-plane and hence more API calls.
 
 * **Will enabling / using this feature result in introducing new API types?**
 None
 
 * **Will enabling / using this feature result in any new calls to cloud
 provider?**
-None
+Potentially yes. Since the expansion operation is idempotent and the expansion controller
+must verify if it is safe to retry expansion with lower values, there could be
+additional calls to CSI drivers (and hence cloud providers).
 
 * **Will enabling / using this feature result in increasing size or count
 of the existing API objects?**
@@ -381,16 +379,12 @@ previous answers based on experience in the field._
 
 * **Will enabling / using this feature result in increasing time taken by any
 operations covered by [existing SLIs/SLOs][]?**
-Think about adding additional work or introducing new steps in between
-(e.g. need to do X to start a container), etc. Please describe the details.
+In some cases, if expansion fails on the node, it now requires correction
+from the control-plane, so the entire expansion may take longer.
 
 * **Will enabling / using this feature result in non-negligible increase of
 resource usage (CPU, RAM, disk, IO, ...) in any components?**
-Things to keep in mind include: additional in-memory state, additional
-non-trivial computations, excessive access to disks (including increased log
-volume), significant amount of data send and/or received over network, etc.
-This through this both in small and large cases, again with respect to the
-[supported limits][].
+It should not result in an increase of resource usage.
 
 ### Troubleshooting
 
@@ -401,20 +395,25 @@ details). For now we leave it here though.
 _This section must be completed when targeting beta graduation to a release._
 
 * **How does this feature react if the API server and/or etcd is unavailable?**
+If the API server or etcd is not available midway through a request, then some of the
+modifications to API objects cannot be saved. But this should be a recoverable state because
+the expansion operation is idempotent and the controller will retry the operation and succeed
+whenever the API server comes back.
 
 * **What are other known failure modes?**
 For each of them fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without loogging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debugging the issue?
-      Not required until feature graduated to Beta.
-    - Testing: Are there any tests for failure mode? If not describe why.
+  - No recovery is possible if volume has been expanded on control-plane and is only failing on the node.
+    - Detection: Expansion is stuck with `ResizeStatus` - `NodeExpansionPending` or `NodeExpansionFailed`.
+    - Mitigations: This should not affect any existing PVCs, but this was already broken in some sense and if the volume has been
+      expanded in the control-plane then we can't allow users to shrink their PVCs because that would violate the quota.
+    - Diagnostics: Expansion is stuck with `ResizeStatus` - `NodeExpansionPending` or `NodeExpansionFailed`.
+    - Testing: There are some unit tests for this failure mode.
 
 * **What steps should be taken if SLOs are not being met to determine the problem?**
+If the admin notices an increase in failed expansion operations via the aforementioned metrics or
+by observing `pvc.Status.ResizeStatus`, then:
+  - Check events on the PVC and observe why PVC expansion is failing.
+  - Gather logs from kubelet and external-resizer and check for problems.
 
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
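A rough command-level version of the two troubleshooting steps above; all names in angle brackets are placeholders, and the `csi-resizer` container name and the `journalctl` unit are assumptions about a typical CSI deployment on systemd-based nodes:

```sh
# 1. Check events on the PVC to see why expansion is failing.
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>

# 2. Gather logs from the external-resizer sidecar and from kubelet on the node
#    where the pod using the PVC is scheduled.
kubectl logs -n <csi-driver-namespace> <csi-controller-pod> -c csi-resizer
journalctl -u kubelet | grep -i -E 'expand|resize'
```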
