Skip to content

Commit 7ce7ca2

Browse files
authored
Merge pull request kubernetes#3556 from RaunakShah/3141-kep-2
KEP-3141: Graduate "Prevent unauthorised volume mode conversion" to Beta
2 parents 43fd4c0 + 421ca29 commit 7ce7ca2

File tree

3 files changed

+126
-43
lines changed

3 files changed

+126
-43
lines changed

keps/prod-readiness/sig-storage/3141.yaml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,3 +4,5 @@
44
kep-number: 3141
55
alpha:
66
approver: "@deads2k"
7+
beta:
8+
approver: "@deads2k"

keps/sig-storage/3141-prevent-volume-mode-conversion/README.md

Lines changed: 116 additions & 38 deletions
Original file line numberDiff line numberDiff line change
@@ -77,8 +77,10 @@ SIG Architecture for cross-cutting KEPs).
7777
- [Changes to Snapshot Controller](#changes-to-snapshot-controller)
7878
- [Changes to external-provisioner](#changes-to-external-provisioner)
7979
- [Test Plan](#test-plan)
80-
- [Unit tests](#unit-tests)
81-
- [E2E tests](#e2e-tests)
80+
- [Prerequisite testing updates](#prerequisite-testing-updates)
81+
- [Unit tests](#unit-tests)
82+
- [Integration tests](#integration-tests)
83+
- [e2e tests](#e2e-tests)
8284
- [Graduation Criteria](#graduation-criteria)
8385
- [Alpha](#alpha)
8486
- [Alpha -> Beta](#alpha---beta)
@@ -214,7 +216,7 @@ need to identify the `VolumeSnapshotContent` mapped to the `VolumeSnapshot`
214216
from which the `PVC` is being created.
215217

216218
Either through software or via manual intervention, the annotation
217-
`snapshot.storage.kubernetes.io/allowVolumeModeChange: true` needs to be applied
219+
`snapshot.storage.kubernetes.io/allow-volume-mode-change: true` needs to be applied
218220
to the `VolumeSnapshotContent`. If the backup software is a privileged user,
219221
it will have `Update` and `Patch` permissions on `VolumeSnapshotContents`.
220222

@@ -289,7 +291,7 @@ like below after this change:
289291
kind: VolumeSnapshotContent
290292
metadata:
291293
annotations:
292-
- snapshot.storage.kubernetes.io/allowVolumeModeChange: "true"
294+
- snapshot.storage.kubernetes.io/allow-volume-mode-change: "true"
293295
...
294296
```
295297
@@ -328,7 +330,7 @@ As part of the preprocessing steps, it will:
328330
2. Get the `Spec.VolumeMode` of the `PVC` being created.
329331
If they do not match:
330332
1. Get all annotations on the `VolumeSnapshotContent` and verify if
331-
`snapshot.storage.kubernetes.io/allowVolumeModeChange: true` exists.
333+
`snapshot.storage.kubernetes.io/allow-volume-mode-change: true` exists.
332334
If it does not exist, block volume provisioning by returning an error.
333335
4. In all other cases, let volume provisioning continue.
334336
@@ -338,30 +340,70 @@ decisions.
338340
339341
### Test Plan
340342
341-
E2E tests will be added for this design, that attempt to restore a volume with
342-
and without requisite privileges.
343+
[x] I/we understand the owners of the involved components may require updates to
344+
existing tests to make this code solid enough prior to committing the changes necessary
345+
to implement this enhancement.
343346
344-
#### Unit tests
347+
##### Prerequisite testing updates
345348
346-
- With feature flag disabled:
347-
- attempt to convert volume mode when creating a `PVC`
348-
from a `VolumeSnapshot`.
349-
- With feature flag enabled, attempt to convert volume mode when creating a `PVC`
350-
from a `VolumeSnapshot`:
351-
- With `Spec.SourceVolumeMode` populated and `snapshot.storage.kubernetes.io/allowVolumeModeChange: true`
352-
annotation present.
353-
- With `Spec.SourceVolumeMode` populated but no `snapshot.storage.kubernetes.io/allowVolumeModeChange: true`
354-
annotation.
355-
- With `Spec.SourceVolumeMode` set to `nil`.
349+
<!--
350+
Based on reviewers feedback describe what additional tests need to be added prior
351+
implementing this enhancement to ensure the enhancements have also solid foundations.
352+
-->
353+
354+
None. New E2E tests will be added for the transition to beta.
355+
356+
##### Unit tests
357+
358+
<!--
359+
In principle every added code should have complete unit test coverage, so providing
360+
the exact set of tests will not bring additional value.
361+
However, if complete unit test coverage is not possible, explain the reason of it
362+
together with explanation why this is acceptable.
363+
-->
364+
365+
<!--
366+
Additionally, for Alpha try to enumerate the core package you will be touching
367+
to implement this enhancement and provide the current unit coverage for those
368+
in the form of:
369+
- <package>: <date> - <current test coverage>
370+
The data can be easily read from:
371+
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit
372+
This can inform certain test coverage improvements that we want to do before
373+
extending the production code to implement this enhancement.
374+
-->
375+
376+
The unit tests were added to the CSI external-provisioner repo.
377+
378+
- https://github.com/kubernetes-csi/external-provisioner/pull/726/
379+
380+
##### Integration tests
381+
382+
<!--
383+
This question should be filled when targeting a release.
384+
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
385+
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
386+
https://storage.googleapis.com/k8s-triage/index.html
387+
-->
388+
389+
- No integration tests added.
390+
391+
##### e2e tests
356392
357-
#### E2E tests
393+
<!--
394+
This question should be filled when targeting a release.
395+
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
396+
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
397+
https://storage.googleapis.com/k8s-triage/index.html
398+
We expect no non-infra related flakes in the last month as a GA graduation criteria.
399+
-->
358400
359401
The feature flag will be enabled for e2e tests. The tests will attempt to convert volume
360402
mode when creating a `PVC` from a `VolumeSnapshot`:
361-
- With `Spec.SourceVolumeMode` populated and `snapshot.storage.kubernetes.io/allowVolumeModeChange: true`
403+
- With `Spec.SourceVolumeMode` populated and `snapshot.storage.kubernetes.io/allow-volume-mode-change: true`
362404
annotation present.
363-
- With `Spec.SourceVolumeMode` populated but no `snapshot.storage.kubernetes.io/allowVolumeModeChange: true`
364-
annotation.
405+
- With `Spec.SourceVolumeMode` populated but no `snapshot.storage.kubernetes.io/allow-volume-mode-change: true`
406+
annotation - https://github.com/kubernetes-csi/external-provisioner/pull/832: https://testgrid.k8s.io/sig-storage-csi-external-provisioner#canary
365407
- With `Spec.SourceVolumeMode` set to `nil`.
366408
367409
### Graduation Criteria
@@ -468,13 +510,20 @@ rollout. Similarly, consider large clusters and how enablement/disablement
468510
will rollout across nodes.
469511
-->
470512
513+
Due to the feature gate on the external-provisioner, rolling out this feature
514+
does not affect existing Pods that use PVCs. It also does not affect
515+
VolumeSnapshots that are created prior to rolling out the feature, ie, the
516+
volume mode of an existing VolumeSnapshot can be modified while creating a PVC.
517+
471518
###### What specific metrics should inform a rollback?
472519
473520
<!--
474521
What signals should users be paying attention to when the feature is young
475522
that might indicate a serious problem?
476523
-->
477524
525+
- persistentvolumeclaim_provision_failed_total
526+
478527
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
479528
480529
<!--
@@ -483,12 +532,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
483532
are missing a bunch of machinery and tooling and can't do that now.
484533
-->
485534
535+
Yes. The feature flag was enabled and disabled separately in the csi-provisioner and snapshot-controller.
536+
486537
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
487538
488539
<!--
489540
Even if applying deprecation policies, they may still surprise some users.
490541
-->
491542
543+
No.
544+
492545
### Monitoring Requirements
493546
494547
<!--
@@ -503,6 +556,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
503556
logs or events for this purpose.
504557
-->
505558
559+
If the feature gate is enabled in the external-provisioner and snapshot-controller,
560+
this feature will always be in use when creating a PVC from a VolumeSnapshot.
561+
506562
###### How can someone using this feature know that it is working for their instance?
507563
508564
<!--
@@ -514,13 +570,13 @@ and operation of this feature.
514570
Recall that end users cannot usually observe component logs or access metrics.
515571
-->
516572
517-
- [ ] Events
518-
- Event Reason:
519-
- [ ] API .status
520-
- Condition name:
521-
- Other field:
522-
- [ ] Other (treat as last resort)
523-
- Details:
573+
- [x] Events
574+
- Event Reason: ProvisioningFailed
575+
- Event Message: Failed to provision volume with StorageClass "csi-hostpath-sc": error getting handle for DataSource Type
576+
VolumeSnapshot by Name new-snapshot-demo: requested volume default/hpvc-restore modifies the mode of the source volume
577+
but does not have permission to do so. snapshot.storage.kubernetes.io/allow-volume-mode-change annotation is not present
578+
on snapshotcontent snapcontent-8d709f2e-db04-444f-aae2-e17d6c5398dd
579+
524580
525581
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
526582
@@ -539,18 +595,22 @@ These goals will help you determine what you need to measure (SLIs) in the next
539595
question.
540596
-->
541597
598+
We will add new labels to the existing persistentvolumeclaim_provision_failed_total metric
599+
for the volume data source and status code.
600+
The per-day percentage of calls with error status code <= 1.
601+
However the failure will always happen as long as the feature is correctly enabled and the
602+
annotations are not applied correctly to VolumeSnapshotContent objects.
603+
542604
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
543605
544606
<!--
545607
Pick one more of these and delete the rest.
546608
-->
547609
548-
- [ ] Metrics
549-
- Metric name:
550-
- [Optional] Aggregation method:
551-
- Components exposing the metric:
552-
- [ ] Other (treat as last resort)
553-
- Details:
610+
- [x] Metrics
611+
- Metric name: persistentvolumeclaim_provision_failed_total
612+
- [Optional] Aggregation method:
613+
- Components exposing the metric: external-provisioner
554614
555615
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
556616
@@ -559,6 +619,9 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
559619
implementation difficulties, etc.).
560620
-->
561621
622+
There are no metrics for persistentvolumeclaims created from volumesnapshots. This KEP aims to add those metrics to
623+
the external-provisioner.
624+
562625
### Dependencies
563626
564627
<!--
@@ -582,12 +645,16 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
582645
- Impact of its degraded performance or high-error rates on the feature:
583646
-->
584647
648+
- [external-provisioner]
649+
- Usage description: Failure events are emitted as events by the external-provisioner.
650+
- Impact of its outage on the feature: Outage of this component will prevent error reporting to users.
651+
- Impact of its degraded performance or high-error rates on the feature: Outage of this component will prevent error reporting to users.
652+
585653
### Scalability
586654
587655
###### Will enabling / using this feature result in any new API calls?
588656
589-
590-
This feature does not add any new API calls.
657+
This feature adds an event write to the API server when PVC creation is blocked.
591658
592659
###### Will enabling / using this feature result in introducing new API types?
593660
@@ -609,12 +676,16 @@ The latency of CSI's `CreateVolume` may increase due to this change, when the
609676
`Spec.DataSource` field points to a `VolumeSnapshot` instance. This is because
610677
there is an additional check to determine whether volume provisioning must
611678
continue. However, this increase is expected to be minimal as there are no new
612-
API calls.
679+
API calls and the volume spec has already been loaded into memory of the external-provisioner.
613680
614681
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
615682
616683
No.
617684
685+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
686+
687+
No. This feature does not introduce any resource exhaustive operations.
688+
618689
### Troubleshooting
619690
620691
<!--
@@ -627,6 +698,10 @@ details). For now, we leave it here.
627698
628699
###### How does this feature react if the API server and/or etcd is unavailable?
629700
701+
In case PVC creation is blocked due to this feature, the failure event will not be emitted
702+
due to the unavailability of the API server. Users will need to refer to the external-provisioner
703+
logs to determine why PVC creation is failing.
704+
630705
###### What are other known failure modes?
631706
632707
<!--
@@ -644,6 +719,9 @@ For each of them, fill in the following information by copying the below templat
644719
645720
###### What steps should be taken if SLOs are not being met to determine the problem?
646721
722+
The user needs to read the logs of the external-provisioner to determine the reason
723+
behind why PVC creation is failing.
724+
647725
## Implementation History
648726
649727
<!--

keps/sig-storage/3141-prevent-volume-mode-conversion/kep.yaml

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ authors:
55
- "@xing-yang"
66
owning-sig: sig-storage
77
participating-sigs:
8-
status: provisional
8+
status: implementable
99
creation-date: 2022-01-17
1010
reviewers:
1111
- "@jsafrane"
@@ -17,18 +17,18 @@ approvers:
1717
- "@liggitt"
1818

1919
# The target maturity stage in the current dev cycle for this KEP.
20-
stage: alpha
20+
stage: beta
2121

2222
# The most recent milestone for which work toward delivery of this KEP has been
2323
# done. This can be the current (upcoming) milestone, if it is being actively
2424
# worked on.
25-
latest-milestone: "v1.24"
25+
latest-milestone: "v1.27"
2626

2727
# The milestone at which this feature was, or is targeted to be, at each stage.
2828
milestone:
2929
alpha: "v1.24"
30-
beta: "v1.25"
31-
stable: "v1.26"
30+
beta: "v1.27"
31+
stable: "v1.28"
3232

3333
# The following PRR answers are required at alpha release
3434
# List the feature gate name and the components for which it must be enabled
@@ -39,3 +39,6 @@ disable-supported: true
3939

4040
# The following PRR answers are required at beta release
4141
metrics:
42+
- persistentvolumeclaim_provision_total
43+
- persistentvolumeclaim_provision_failed_total
44+
- persistentvolumeclaim_provision_duration_seconds

0 commit comments

Comments
 (0)