Skip to content

Commit 32bccf4

Browse files
authored
KEP-3333: Graduate RetroactiveDefaultStorageClass to beta in v1.26 (kubernetes#3544)
* update RetroactiveDefaultStorageClass for beta graduation * address review comments 1 * address review comments 2 * address review comments 3 * address review comments 4
1 parent 00b3eaf commit 32bccf4

File tree

3 files changed

+86
-86
lines changed

3 files changed

+86
-86
lines changed
Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
11
kep-number: 3333
22
alpha:
3-
approver: "@wojtek-t"
3+
approver: "@wojtek-t"
4+
beta:
5+
approver: "@wojtek-t"

keps/sig-storage/3333-reconcile-default-storage-class/README.md

Lines changed: 78 additions & 80 deletions
Original file line numberDiff line numberDiff line change
@@ -45,20 +45,20 @@
4545

4646
Items marked with (R) are required *prior to targeting to a milestone / release*.
4747

48-
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
49-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
50-
- [ ] (R) Design details are appropriately documented
48+
- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
49+
- [X] (R) KEP approvers have approved the KEP status as `implementable`
50+
- [X] (R) Design details are appropriately documented
5151
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
52-
- [ ] e2e Tests for all Beta API Operations (endpoints)
53-
- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
52+
- [X] e2e Tests for all Beta API Operations (endpoints)
53+
- [X] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
5454
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
55-
- [ ] (R) Graduation criteria is in place
56-
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
57-
- [ ] (R) Production readiness review completed
58-
- [ ] (R) Production readiness review approved
59-
- [ ] "Implementation History" section is up-to-date for milestone
60-
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
61-
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
55+
- [X] (R) Graduation criteria is in place
56+
- [X] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
57+
- [X] (R) Production readiness review completed
58+
- [X] (R) Production readiness review approved
59+
- [X] "Implementation History" section is up-to-date for milestone
60+
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
61+
- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
6262

6363
[kubernetes.io]: https://kubernetes.io/
6464
[kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -247,11 +247,17 @@ https://storage.googleapis.com/k8s-triage/index.html
247247
- test/integration/volume/persistent_volumes_test.go: Has few cases for PV/PVC
248248
binding.
249249

250-
We plan to extend these test to include default SC and how it will be applied to
250+
Tests were extended to include default SC and how it will be applied to
251251
existing PVCs. We use integration tests almost like e2e tests of the new
252252
behavior to be sure that change of the default SC won't affect other tests
253253
running in parallel.
254254

255+
Test:
256+
https://github.com/kubernetes/kubernetes/blob/dfc9bf0953360201618ad52308ccb34fd8076dff/test/integration/volume/persistent_volumes_test.go#L1043
257+
258+
Testgrid:
259+
https://testgrid.k8s.io/sig-release-master-blocking#integration-master&width=5
260+
255261
##### e2e tests
256262

257263
<!--
@@ -269,10 +275,16 @@ need to create/delete default SCs, which could affect other tests running in
269275
parallel that may expect that a default SC exists (typically StatefulSet e2e
270276
tests).
271277

272-
We plan to add a few `[Disruptive]` `[Serial]` tests with the new behavior as
273-
"smoke" tests of the new behavior, still, most of the tests will be integration
278+
We added a `[Disruptive]` `[Serial]` test with the new behavior as
279+
"smoke" test of the new behavior, still, most of the tests will be integration
274280
ones.
275281

282+
Test:
283+
https://github.com/kubernetes/kubernetes/blob/91a9ce28ac2486c50222aeeec1f76e664155d769/test/e2e/storage/pvc_storageclass.go#L62
284+
285+
Test Grid:
286+
https://testgrid.k8s.io/sig-network-gce#gci-gce-alpha-features.
287+
276288
### Graduation Criteria
277289

278290
#### Alpha
@@ -288,6 +300,7 @@ ones.
288300
- No conformance tests, since we don't test StorageClasses there.
289301
- Manually test version skew between the API server and KCM. See the expected
290302
behavior below in Version Skew Strategy.
303+
- Manually test upgrade->downgrade->upgrade path.
291304

292305
#### GA
293306

@@ -352,68 +365,55 @@ enabled for the first time.
352365

353366
###### Are there any tests for feature enablement/disablement?
354367

355-
Unit tests will cover feature enablement/disablement.
368+
There is no new field that needs to be handled in a special way. The feature
369+
gate just enables/disables a code path in PV controller which is already covered
370+
by existing unit tests.
356371

357-
### Rollout, Upgrade and Rollback Planning
372+
Validation test:
373+
https://github.com/kubernetes/kubernetes/blob/42458952616406922ea59e6d0b65c35c94444172/pkg/apis/core/validation/validation_test.go#L2291
358374

359-
Will be provided when graduating to beta.
375+
PV controller test:
376+
https://github.com/kubernetes/kubernetes/blob/42458952616406922ea59e6d0b65c35c94444172/pkg/controller/volume/persistentvolume/pv_controller_test.go#L753
360377

361-
<!--
362-
This section must be completed when targeting beta to a release.
363-
-->
378+
### Rollout, Upgrade and Rollback Planning
364379

365380
###### How can a rollout or rollback fail? Can it impact already running workloads?
366381

367-
<!--
368-
Try to be as paranoid as possible - e.g., what if some components will restart
369-
mid-rollout?
370-
371-
Be sure to consider highly-available clusters, where, for example,
372-
feature flags will be enabled on some API servers and not others during the
373-
rollout. Similarly, consider large clusters and how enablement/disablement
374-
will rollout across nodes.
375-
-->
382+
In case the feature is not enabled during rollout due to a failure there
383+
should be no impact and the behavior of StorageClass assignment will not change.
384+
If the feature is rolled out partially, so it's enabled only on some API
385+
servers, there will be no impact on running workloads because the request will
386+
be processed as if the feature is disabled - that means nothing happens and PVC
387+
will not be changed.
376388

377389
###### What specific metrics should inform a rollback?
378390

379-
<!--
380-
What signals should users be paying attention to when the feature is young
381-
that might indicate a serious problem?
382-
-->
391+
A KCM metric for failure counts called `retroactive_storageclass_errors_total`
392+
will indicate a problem with the feature.
383393

384394
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
385395

386-
<!--
387-
Describe manual testing that was done and the outcomes.
388-
Longer term, we may want to require automated upgrade/rollback tests, but we
389-
are missing a bunch of machinery and tooling and can't do that now.
390-
-->
396+
Upgrade and rollback will be tested when the feature gate will change to beta.
391397

392398
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
393399

394-
<!--
395-
Even if applying deprecation policies, they may still surprise some users.
396-
-->
400+
No.
397401

398402
### Monitoring Requirements
399403

400-
<!--
401-
This section must be completed when targeting beta to a release.
402-
403-
For GA, this section is required: approvers should be able to confirm the
404-
previous answers based on experience in the field.
405-
-->
406-
407404
###### How can an operator determine if the feature is in use by workloads?
408405

409-
<!--
410-
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
411-
checking if there are objects with field X set) may be a last resort. Avoid
412-
logs or events for this purpose.
413-
-->
406+
A counter metric will be present in KCM metric endpoint, it will show total
407+
count of successful and failed retroactive StorageClass assignments.
414408

415409
###### How can someone using this feature know that it is working for their instance?
416410

411+
By inspecting a `retroactive_storageclass_total` metric value. If the counter
412+
is increasing while letting PVCs being updated retroactively with a default
413+
StorageClass the feature is enabled. And at the same time if
414+
`retroactive_storageclass_errors_total` counter does not increase the feature
415+
works as expected.
416+
417417
<!--
418418
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
419419
for each individual pod.
@@ -425,11 +425,11 @@ Recall that end users cannot usually observe component logs or access metrics.
425425

426426
- [ ] Events
427427
- Event Reason:
428-
- [ ] API .status
428+
- [X] API .spec
429429
- Condition name:
430-
- Other field:
430+
- Other field: `pvc.spec.storageClassName` changing from nil to current default StorageClass name after the default is set
431431
- [ ] Other (treat as last resort)
432-
- Details:
432+
- Details: metric
433433

434434
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
435435

@@ -450,19 +450,17 @@ question.
450450

451451
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
452452

453-
<!--
454-
Pick one more of these and delete the rest.
455-
-->
456-
457-
- [ ] Metrics
458-
- Metric name:
453+
- [X] Metrics
454+
- Metric name: `retroactive_storageclass_total` and `retroactive_storageclass_errors_total`
459455
- [Optional] Aggregation method:
460-
- Components exposing the metric:
461-
- [ ] Other (treat as last resort)
462-
- Details:
456+
- Components exposing the metric: KCM
463457

464458
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
465459

460+
A metric which would include PersistentVolumeClaim name or StorageClass name
461+
as a label to help users debug possible issues. Such metric would have
462+
potentially unbounded cardinality, which is a hard blocker for adding it.
463+
466464
<!--
467465
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
468466
implementation difficulties, etc.).
@@ -476,20 +474,7 @@ This section must be completed when targeting beta to a release.
476474

477475
###### Does this feature depend on any specific services running in the cluster?
478476

479-
<!--
480-
Think about both cluster-level services (e.g. metrics-server) as well
481-
as node-level agents (e.g. specific version of CRI). Focus on external or
482-
optional services that are needed. For example, if this feature depends on
483-
a cloud provider API, or upon an external software-defined storage or network
484-
control plane.
485-
486-
For each of these, fill in the following—thinking about running existing user workloads
487-
and creating new ones, as well as about cluster-level services (e.g. DNS):
488-
- [Dependency name]
489-
- Usage description:
490-
- Impact of its outage on the feature:
491-
- Impact of its degraded performance or high-error rates on the feature:
492-
-->
477+
No.
493478

494479
### Scalability
495480

@@ -553,8 +538,16 @@ details). For now, we leave it here.
553538

554539
###### How does this feature react if the API server and/or etcd is unavailable?
555540

541+
API requests are performed by Persistent volume controller periodically. If
542+
some requests fail the PVC will not get updated with current default storage
543+
class as it would under normal operation and a metric with error count is
544+
increased. PV controller will attempt the PVC update again in the next
545+
periodic sync.
546+
556547
###### What are other known failure modes?
557548

549+
None.
550+
558551
<!--
559552
For each of them, fill in the following information by copying the below template:
560553
- [Failure mode brief description]
@@ -570,6 +563,10 @@ For each of them, fill in the following information by copying the below templat
570563

571564
###### What steps should be taken if SLOs are not being met to determine the problem?
572565

566+
The only case that is considered a failure is when a PVC fails to update due
567+
to API server error so users should investigate API server logs to determine
568+
the problem.
569+
573570
## Implementation History
574571

575572
<!--
@@ -583,6 +580,7 @@ Major milestones might include:
583580
- when the KEP was retired or superseded
584581
-->
585582
- 1.25: initial version
583+
- 1.26: beta
586584

587585
## Drawbacks
588586

keps/sig-storage/3333-reconcile-default-storage-class/kep.yaml

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
title: Retroactive default StorageClass assignement
1+
title: Retroactive default StorageClass assignment
22
kep-number: 3333
33
authors:
44
- "@RomanBednar"
@@ -16,12 +16,12 @@ see-also:
1616
replaces:
1717

1818
# The target maturity stage in the current dev cycle for this KEP.
19-
stage: alpha
19+
stage: beta
2020

2121
# The most recent milestone for which work toward delivery of this KEP has been
2222
# done. This can be the current (upcoming) milestone, if it is being actively
2323
# worked on.
24-
latest-milestone: "v1.25"
24+
latest-milestone: "v1.26"
2525

2626
# The milestone at which this feature was, or is targeted to be, at each stage.
2727
milestone:
@@ -40,5 +40,5 @@ disable-supported: true
4040

4141
# The following PRR answers are required at beta release
4242
metrics:
43-
# TBD
44-
# - my_feature_metric
43+
- retroactive_storageclass_total
44+
- retroactive_storageclass_errors_total

0 commit comments

Comments
 (0)