Commit 7da16e5

KEP-2485: Design Details
1 parent 9c91a01 commit 7da16e5

1 file changed

+234
-0
lines changed
  • keps/sig-storage/2485-read-write-once-pod-pv-access-mode

keps/sig-storage/2485-read-write-once-pod-pv-access-mode/README.md

Lines changed: 234 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -92,10 +92,24 @@ tags, and then generate with `hack/update-toc.sh`.
9292
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
9393
- [Risks and Mitigations](#risks-and-mitigations)
9494
- [Design Details](#design-details)
95+
- [Kubernetes Changes, Access Mode](#kubernetes-changes-access-mode)
96+
- [CSI Specification Changes, Volume Capabilities](#csi-specification-changes-volume-capabilities)
9597
- [Test Plan](#test-plan)
98+
- [Validation of PersistentVolumeSpec Object](#validation-of-persistentvolumespec-object)
99+
- [Mounting and Mapping with ReadWriteOncePod](#mounting-and-mapping-with-readwriteoncepod)
100+
- [Mounting and Mapping with ReadWriteOnce](#mounting-and-mapping-with-readwriteonce)
101+
- [Mapping Kubernetes Access Modes to CSI Volume Capability Access Modes](#mapping-kubernetes-access-modes-to-csi-volume-capability-access-modes)
102+
- [End to End Tests](#end-to-end-tests)
96103
- [Graduation Criteria](#graduation-criteria)
104+
- [Alpha](#alpha)
105+
- [Beta](#beta)
106+
- [GA](#ga)
97107
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
98108
- [Version Skew Strategy](#version-skew-strategy)
109+
- [API Server Version N / Scheduler Version N / Kubelet Version N-1 or N-2](#api-server-version-n--scheduler-version-n--kubelet-version-n-1-or-n-2)
110+
- [API Server Version N / Scheduler Version N-1 / Kubelet Version N-1 or N-2](#api-server-version-n--scheduler-version-n-1--kubelet-version-n-1-or-n-2)
111+
- [API Understands ReadWriteOncePod, CSI Sidecars Do Not](#api-understands-readwriteoncepod-csi-sidecars-do-not)
112+
- [CSI Controller Service Understands New CSI Access Modes, CSI Node Service Does Not](#csi-controller-service-understands-new-csi-access-modes-csi-node-service-does-not)
99113
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
100114
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
101115
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
@@ -292,6 +306,10 @@ the system. The goal here is to make this feel real for users without getting
292306
bogged down.
293307
-->
294308

309+
See the [version skew strategy] section below for additional scenarios.
310+
311+
[version skew strategy]: #version-skew-strategy
312+
295313
#### ReadWriteOncePod PVC Used Twice Fails for Second Consumer
296314

297315
This scenario asserts a ReadWriteOncePod can only be bind mounted into a single
@@ -352,6 +370,77 @@ required) or even code snippets. If there's any ambiguity about HOW your
352370
proposal will be implemented, this is the place to discuss them.
353371
-->
354372

373+
### Kubernetes Changes, Access Mode
374+
375+
In Kubernetes, we should add a new ReadWriteOncePod persistent volume access
376+
mode to PersistentVolumes and PersistentVolumeClaims. This change will require
377+
adding a feature gate to the kube-apiserver, kube-controller-manager,
378+
kube-scheduler, and kubelet. Validation logic will need updating to accept this
379+
access mode type if the feature gate is enabled.
380+
381+
```golang
382+
// can be mounted in read/write mode to exactly 1 pod
383+
ReadWriteOncePod PersistentVolumeAccessMode = "ReadWriteOncePod"
384+
```
385+
386+
This access mode will be enforced in two places:
387+
388+
- First is at the time a pod is scheduled. When scheduling a pod, if another pod
389+
is found using the same PVC and the PVC uses ReadWriteOncePod, then scheduling
390+
will fail and the pod will be considered unresolvable.
391+
- As an additional precaution this will also be enforced at the time a volume is
392+
mounted for filesystem devices, and at the time a volume is mapped for block
393+
devices. During the mount operation, kubelet will check the actual state of
394+
the world to determine if the volume is already in-use by another pod. If it
395+
is, kubelet will fail mounting with an appropriate error message.
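
The two enforcement points above share the same core check: a ReadWriteOncePod claim may have at most one consuming pod. The following is a minimal sketch of that check, not the actual scheduler or kubelet code. The `Claim` type and the `checkConflict` helper are hypothetical names introduced for illustration only.

```go
package main

import "fmt"

// AccessMode mirrors the string values of PersistentVolumeAccessMode.
type AccessMode string

const (
	ReadWriteOnce    AccessMode = "ReadWriteOnce"
	ReadWriteOncePod AccessMode = "ReadWriteOncePod"
)

// Claim is a hypothetical, simplified stand-in for a PersistentVolumeClaim.
type Claim struct {
	Name        string
	AccessModes []AccessMode
}

// usesReadWriteOncePod reports whether the claim requests ReadWriteOncePod.
func usesReadWriteOncePod(c Claim) bool {
	for _, m := range c.AccessModes {
		if m == ReadWriteOncePod {
			return true
		}
	}
	return false
}

// checkConflict returns an error when the claim uses ReadWriteOncePod and a
// pod other than podName already uses it. Claims with other access modes are
// never rejected by this check.
func checkConflict(c Claim, podName string, podsUsingClaim []string) error {
	if !usesReadWriteOncePod(c) {
		return nil
	}
	for _, other := range podsUsingClaim {
		if other != podName {
			return fmt.Errorf("claim %q with access mode %s is already in use by pod %q",
				c.Name, ReadWriteOncePod, other)
		}
	}
	return nil
}

func main() {
	claim := Claim{Name: "data", AccessModes: []AccessMode{ReadWriteOncePod}}
	// A second pod asking for a claim already held by pod-a is rejected.
	fmt.Println(checkConflict(claim, "pod-b", []string{"pod-a"}))
	// The pod that already holds the claim is not in conflict with itself.
	fmt.Println(checkConflict(claim, "pod-a", []string{"pod-a"}))
}
```

In this sketch the scheduler would run the check against pods known to the cluster, while kubelet would run it against the pods recorded in its actual state of the world. Where the list of consuming pods comes from is left out on purpose.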
396+
397+
### CSI Specification Changes, Volume Capabilities
398+
399+
In the CSI spec we should add two new access modes that explicitly state the
400+
number of writers on a single node.
401+
402+
```protobuf
403+
// Can only be published once as read/write at a single workload on
404+
// a single node, at any given time.
405+
SINGLE_NODE_SINGLE_WRITER = 6;
406+
407+
// Can be published as read/write at multiple workloads on a
408+
// single node simultaneously.
409+
SINGLE_NODE_MULTI_WRITER = 7;
410+
```
411+
412+
These access modes are modeled after the existing `MULTI_NODE_SINGLE_WRITER` and
413+
`MULTI_NODE_MULTI_WRITER` access modes. The reason for making this distinction
414+
is that the `SINGLE_NODE_WRITER` volume capability has conflicting
415+
definitions (see the [motivation](#motivation) section for context).
416+
417+
For CSI clients, the new ReadWriteOncePod Kubernetes access mode will map to the
418+
`SINGLE_NODE_SINGLE_WRITER` volume capability access mode in the CSI spec.
419+
420+
For the ReadWriteOnce access mode, the value it maps to depends on the CSI
421+
driver. If the CSI driver supports the `SINGLE_NODE_MULTI_WRITER` access mode,
422+
then ReadWriteOnce will map to that value. If the CSI driver does not support
423+
the `SINGLE_NODE_MULTI_WRITER` access mode, then ReadWriteOnce will map to
424+
`SINGLE_NODE_WRITER` to preserve backwards compatibility. In order to determine
425+
which mapping to use, both the controller and node services should have
426+
capability bits for this access mode.
427+
428+
```protobuf
429+
// Indicates the SP supports the SINGLE_NODE_MULTI_WRITER access
430+
// mode.
431+
SINGLE_NODE_MULTI_WRITER = 13;
432+
```
433+
434+
Put more succinctly:
435+
436+
| | Driver Supports `SINGLE_NODE_*_WRITER` | Driver Does Not Support `SINGLE_NODE_*_WRITER` |
437+
|------------------|----------------------------------------|---------------------------------------------------|
438+
| ReadWriteOncePod | SINGLE_NODE_SINGLE_WRITER | Don't use ReadWriteOncePod if driver is incapable |
439+
| ReadWriteOnce | SINGLE_NODE_MULTI_WRITER | SINGLE_NODE_WRITER (Existing behavior) |
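
The table above can be read as a small mapping function. The sketch below covers only the two access modes in the table. The function name `toCSIAccessMode` and the boolean capability parameter are hypothetical, and returning an error for ReadWriteOncePod on an incapable driver is an assumption about how "don't use" might be surfaced, not something this KEP specifies.

```go
package main

import "fmt"

// K8sAccessMode and CSIAccessMode are simplified string stand-ins for the
// Kubernetes and CSI enum types.
type K8sAccessMode string
type CSIAccessMode string

const (
	ReadWriteOnce    K8sAccessMode = "ReadWriteOnce"
	ReadWriteOncePod K8sAccessMode = "ReadWriteOncePod"
)

const (
	Unknown                CSIAccessMode = "UNKNOWN"
	SingleNodeWriter       CSIAccessMode = "SINGLE_NODE_WRITER"
	SingleNodeSingleWriter CSIAccessMode = "SINGLE_NODE_SINGLE_WRITER"
	SingleNodeMultiWriter  CSIAccessMode = "SINGLE_NODE_MULTI_WRITER"
)

// toCSIAccessMode maps a Kubernetes access mode to a CSI access mode.
// supportsNewModes stands for the driver advertising the
// SINGLE_NODE_MULTI_WRITER capability bit.
func toCSIAccessMode(mode K8sAccessMode, supportsNewModes bool) (CSIAccessMode, error) {
	switch mode {
	case ReadWriteOncePod:
		if !supportsNewModes {
			// Assumption: surface the incompatibility as an error.
			return Unknown, fmt.Errorf("driver does not support %s", SingleNodeSingleWriter)
		}
		return SingleNodeSingleWriter, nil
	case ReadWriteOnce:
		if supportsNewModes {
			return SingleNodeMultiWriter, nil
		}
		// Existing behavior, kept for backwards compatibility.
		return SingleNodeWriter, nil
	}
	return Unknown, fmt.Errorf("access mode %q is not covered by this sketch", mode)
}

func main() {
	for _, supported := range []bool{true, false} {
		for _, mode := range []K8sAccessMode{ReadWriteOncePod, ReadWriteOnce} {
			csiMode, err := toCSIAccessMode(mode, supported)
			fmt.Printf("mode=%s supported=%v -> %s (err=%v)\n", mode, supported, csiMode, err)
		}
	}
}
```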
440+
441+
CSI clients that will need updating are kubelet, external-provisioner,
442+
external-attacher, and external-resizer.
443+
355444
### Test Plan
356445

357446
<!--
@@ -372,6 +461,56 @@ when drafting this test plan.
372461
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
373462
-->
374463

464+
#### Validation of PersistentVolumeSpec Object
465+
466+
To test the validation logic of the PersistentVolumeSpec, we need to check the
467+
following cases:
468+
469+
- Validation succeeds when feature gate is enabled and PersistentVolume is created with
470+
ReadWriteOncePod access mode
471+
- Validation fails when feature gate is disabled and PersistentVolume is created with
472+
ReadWriteOncePod access mode
473+
- Validation succeeds when feature gate is enabled and PersistentVolumeClaim is created with
474+
ReadWriteOncePod access mode
475+
- Validation fails when feature gate is disabled and PersistentVolumeClaim is created with
476+
ReadWriteOncePod access mode
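
The four validation cases above could be exercised with a table-driven check along these lines. This is a sketch under stated assumptions: `validateAccessModes` is a hypothetical helper, the feature gate is passed as a plain boolean, and the same function is assumed to serve both PersistentVolume and PersistentVolumeClaim validation.

```go
package main

import "fmt"

// AccessMode mirrors the string values of PersistentVolumeAccessMode.
type AccessMode string

const (
	ReadWriteOnce    AccessMode = "ReadWriteOnce"
	ReadOnlyMany     AccessMode = "ReadOnlyMany"
	ReadWriteMany    AccessMode = "ReadWriteMany"
	ReadWriteOncePod AccessMode = "ReadWriteOncePod"
)

// validateAccessModes accepts the pre-existing access modes unconditionally
// and accepts ReadWriteOncePod only when the feature gate is enabled.
func validateAccessModes(modes []AccessMode, readWriteOncePodEnabled bool) error {
	for _, m := range modes {
		switch m {
		case ReadWriteOnce, ReadOnlyMany, ReadWriteMany:
			// Always valid.
		case ReadWriteOncePod:
			if !readWriteOncePodEnabled {
				return fmt.Errorf("access mode %q requires the ReadWriteOncePod feature gate", m)
			}
		default:
			return fmt.Errorf("unsupported access mode %q", m)
		}
	}
	return nil
}

func main() {
	cases := []struct {
		object  string
		enabled bool
	}{
		{"PersistentVolume", true},
		{"PersistentVolume", false},
		{"PersistentVolumeClaim", true},
		{"PersistentVolumeClaim", false},
	}
	for _, c := range cases {
		err := validateAccessModes([]AccessMode{ReadWriteOncePod}, c.enabled)
		fmt.Printf("%s, gate enabled=%v: err=%v\n", c.object, c.enabled, err)
	}
}
```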
477+
478+
#### Mounting and Mapping with ReadWriteOncePod
479+
480+
To test mount behavior, we need to check the following cases:
481+
482+
- Mounting a volume with ReadWriteOncePod succeeds if the volume isn't already
483+
mounted
484+
- Mounting a volume with ReadWriteOncePod fails if the volume is already mounted
485+
486+
#### Mounting and Mapping with ReadWriteOnce
487+
488+
Existing unit tests should cover this scenario.
489+
490+
#### Mapping Kubernetes Access Modes to CSI Volume Capability Access Modes
491+
492+
This test involves asserting the behavior in the above table. The volume
493+
capability access mode for ReadWriteOnce will depend on the capabilities of the
494+
CSI driver. A test asserting this behavior will be needed in both Kubernetes as
495+
well as in CSI sidecars.
496+
497+
#### End to End Tests
498+
499+
To test this feature end to end, we will need to check the following cases:
500+
501+
- A ReadWriteOncePod volume will succeed mounting when consumed by a single pod
502+
on a node
503+
- A ReadWriteOncePod volume will fail to mount when consumed by a second pod on
504+
the same node
505+
- A ReadWriteOncePod volume will fail to attach when consumed by a second pod on
506+
a different node
507+
508+
For testing the mapping for ReadWriteOnce, we should update the mock CSI driver
509+
to support the new volume capability access modes and cut a release. The
510+
existing Kubernetes end to end tests will be updated to use this version which
511+
will test the mapping behavior because most storage end to end tests rely on the
512+
ReadWriteOnce access mode.
513+
375514
### Graduation Criteria
376515

377516
<!--
@@ -429,6 +568,27 @@ in back-to-back releases.
429568
[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
430569
-->
431570

571+
#### Alpha
572+
573+
- CSI spec supports `SINGLE_NODE_*_WRITER` access modes
574+
- Kubernetes supports ReadWriteOncePod access mode, has unit test coverage, has
575+
updated CSI spec
576+
- CSI sidecars support `SINGLE_NODE_*_WRITER` access modes and have unit test
577+
coverage
578+
579+
#### Beta
580+
581+
- ReadWriteOncePod access mode has end to end test coverage
582+
- Mock CSI driver supports `SINGLE_NODE_*_WRITER` access modes, relevant end to
583+
end tests updated to use this driver
584+
- Hostpath CSI driver supports `SINGLE_NODE_*_WRITER` access modes, relevant end
585+
to end tests updated to use this driver
586+
587+
#### GA
588+
589+
- Kubernetes API and CSI spec changes are stable
590+
- CSI drivers support `SINGLE_NODE_*_WRITER` access modes
591+
432592
### Upgrade / Downgrade Strategy
433593

434594
<!--
@@ -443,6 +603,24 @@ enhancement:
443603
cluster required to make on upgrade, in order to make use of the enhancement?
444604
-->
445605

606+
In order to upgrade a cluster to use this feature, the user will need to restart
607+
the kube-apiserver, kube-controller-manager, kube-scheduler, and kubelet with
608+
the ReadWriteOncePod feature gate enabled. Additionally, they will need to
609+
update their CSI drivers and sidecars to versions that depend on the new
610+
Kubernetes API and CSI spec.
611+
612+
When downgrading a cluster to disable this feature, the user will need to
613+
restart the kube-apiserver with the ReadWriteOncePod feature gate disabled. When
614+
disabling this feature gate, any existing volumes with the ReadWriteOncePod
615+
access mode will continue to exist, but can only be deleted. An alternative is
616+
to allow these volumes to be treated as ReadWriteOnce, however that would
617+
violate the intent of the user and so it is not recommended.
618+
619+
If a user downgrades their CSI drivers or sidecars, any existing volumes using
620+
ReadWriteOnce should continue working (switching from `SINGLE_NODE_MULTI_WRITER`
621+
to `SINGLE_NODE_WRITER`). This behavior is ultimately up to each CSI driver, but
622+
they should be designed with this backwards compatibility in mind.
623+
446624
### Version Skew Strategy
447625

448626
<!--
@@ -458,6 +636,62 @@ enhancement:
458636
CRI or CNI may require updating that component before the kubelet.
459637
-->
460638

639+
640+
#### API Server Version N / Scheduler Version N / Kubelet Version N-1 or N-2
641+
642+
When starting two pods that both use the same PVC with ReadWriteOncePod, one pod
643+
will successfully start, but the other will not be scheduled due to the
644+
ReadWriteOncePod access mode conflict.
645+
646+
When starting the same two pods but also setting `pod.spec.nodeName` to the same
647+
node, kubelet will not enforce the access mode and will proceed with starting
648+
both pods.
649+
650+
For older kubelets, [ReadWriteOncePod will map to access mode `UNKNOWN`]. How
651+
this access mode is used will vary across CSI drivers. By definition, the CSI
652+
spec says ["If ANY of the specified volume capabilities are not supported by the
653+
SP, the call MUST return the appropriate gRPC error code"], see the
654+
`volume_capabilities` field in CreateVolumeRequest. However, not all CSI drivers
655+
strictly adhere to this spec. For example, the EBS CSI driver will [error when
656+
supplied an unsupported access mode]. Other drivers like the mock CSI driver
657+
[won't check the supplied access modes], meaning `UNKNOWN` is valid.
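
To illustrate why the outcome depends on the driver, here is a minimal sketch of the skew described above. It is not the kubelet or any driver's real code: `legacyAccessMode`, `strictDriverCheck`, and `lenientDriverCheck` are hypothetical names, and the mapping only models a client that predates ReadWriteOncePod.

```go
package main

import "fmt"

type K8sAccessMode string
type CSIAccessMode string

const (
	ReadWriteOnce    K8sAccessMode = "ReadWriteOnce"
	ReadOnlyMany     K8sAccessMode = "ReadOnlyMany"
	ReadWriteMany    K8sAccessMode = "ReadWriteMany"
	ReadWriteOncePod K8sAccessMode = "ReadWriteOncePod"
)

const (
	Unknown              CSIAccessMode = "UNKNOWN"
	SingleNodeWriter     CSIAccessMode = "SINGLE_NODE_WRITER"
	MultiNodeReaderOnly  CSIAccessMode = "MULTI_NODE_READER_ONLY"
	MultiNodeMultiWriter CSIAccessMode = "MULTI_NODE_MULTI_WRITER"
)

// legacyAccessMode models an older client that has no case for
// ReadWriteOncePod, so that mode falls through to UNKNOWN.
func legacyAccessMode(mode K8sAccessMode) CSIAccessMode {
	switch mode {
	case ReadWriteOnce:
		return SingleNodeWriter
	case ReadOnlyMany:
		return MultiNodeReaderOnly
	case ReadWriteMany:
		return MultiNodeMultiWriter
	}
	return Unknown
}

// strictDriverCheck models a driver that rejects any access mode it does
// not explicitly support, as the CSI spec requires.
func strictDriverCheck(mode CSIAccessMode, supported []CSIAccessMode) error {
	for _, s := range supported {
		if s == mode {
			return nil
		}
	}
	return fmt.Errorf("unsupported access mode %s", mode)
}

// lenientDriverCheck models a driver that never inspects the access mode.
func lenientDriverCheck(mode CSIAccessMode) error {
	return nil
}

func main() {
	mode := legacyAccessMode(ReadWriteOncePod)
	fmt.Println("older client sends:", mode)
	fmt.Println("strict driver:", strictDriverCheck(mode, []CSIAccessMode{SingleNodeWriter}))
	fmt.Println("lenient driver:", lenientDriverCheck(mode))
}
```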
658+
659+
[ReadWriteOncePod will map to access mode `UNKNOWN`]: https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/volume/csi/csi_client.go#L512
660+
[error when supplied an unsupported access mode]: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/v1.0.0/pkg/driver/controller.go#L117-L122
661+
[won't check the supplied access modes]: https://github.com/kubernetes-csi/csi-test/blob/v4.2.0/mock/service/controller.go#L44-L46
662+
["If ANY of the specified volume capabilities are not supported by the SP, the call MUST return the appropriate gRPC error code"]: https://github.com/container-storage-interface/spec/blob/v1.4.0/spec.md#createvolume
663+
664+
#### API Server Version N / Scheduler Version N-1 / Kubelet Version N-1 or N-2
665+
666+
When creating a pod using ReadWriteOncePod, the scheduler will not enforce this
667+
access mode during scheduling. It will be possible for two pods using the same
668+
PVC with this access mode to be assigned the same node.
669+
670+
As in the case above, with an older kubelet, ReadWriteOncePod will map to
671+
access mode `UNKNOWN`. How this access mode is used will vary across CSI
672+
drivers.
673+
674+
#### API Understands ReadWriteOncePod, CSI Sidecars Do Not
675+
676+
Both the [CSI attacher] and the [CSI resizer] will error if they do not
677+
understand ReadWriteOncePod and this access mode is used on a PV.
678+
679+
The CSI provisioner will [map ReadWriteOncePod to a nil access mode]. How this
680+
access mode is used will vary across CSI drivers.
681+
682+
[CSI attacher]: https://github.com/kubernetes-csi/external-attacher/blob/v3.2.0/pkg/controller/util.go#L196-L197
683+
[CSI resizer]: https://github.com/kubernetes-csi/external-resizer/blob/v1.2.0/pkg/resizer/csi_resizer.go#L237-L238
684+
[map ReadWriteOncePod to a nil access mode]: https://github.com/kubernetes-csi/external-provisioner/blob/v2.2.0/pkg/controller/controller.go#L468-L469
685+
686+
#### CSI Controller Service Understands New CSI Access Modes, CSI Node Service Does Not
687+
688+
If the CSI driver running the controller service understands the new access
689+
modes, then volumes will be provisioned and attached using these access modes
690+
(if ReadWriteOncePod or ReadWriteOnce are used). If the CSI driver running the
691+
node service does not understand these access modes, the behavior will depend on
692+
the CSI driver and how it treats unknown access modes. The recommendation is to
693+
upgrade the CSI drivers for the controller and node services together.
694+
461695
## Production Readiness Review Questionnaire
462696

463697
<!--
