Commit da03a92: Add fsgroup mount proposal

Add wording about behaviour change when File fsgroupPolicy is used and CSI driver has VOLUME_MOUNT_GROUP capability.

3 files changed: +427 -0

File 1 (+3 lines):

kep-number: 2317
alpha:
  approver: "@deads2k"

File 2 (+385 lines):

# Provide fsgroup of pod to CSI driver on mount

## Table of Contents

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha -> Beta Graduation](#alpha---beta-graduation)
    - [Beta -> GA Graduation](#beta---ga-graduation)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary

Currently, for most volume plugins, kubelet applies fsGroup ownership and permission changes by recursively `chown`ing and `chmod`ing the files and directories inside a volume. For certain CSI drivers this may not be possible, because `chown` and `chmod` are Unix primitives and the underlying CSI driver may not support them. This enhancement proposes providing the fsGroup to the CSI driver as an explicit field, so that the CSI driver can apply it at mount time.

## Motivation

Some CSI drivers (Azure File, for example) do not support chmod/chown. We therefore propose that the pod's fsGroup be provided to the CSI driver in the `NodeStageVolume` and `NodePublishVolume` CSI RPC calls. This allows the CSI driver to apply the `fsGroup` as a mount option during `NodeStageVolume` or `NodePublishVolume`, and frees kubelet from the responsibility of applying recursive ownership and permission changes.

This feature hence becomes a prerequisite for CSI migration of the Azure File driver and removal of the Azure cloud provider.
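
For illustration, here is a minimal sketch of what "applying `fsGroup` as a mount option" can look like for an SMB/CIFS-backed driver such as Azure File. The `buildMountOptions` helper and the `0770` modes are hypothetical; `gid=`, `file_mode=`, and `dir_mode=` are standard CIFS mount options, but real drivers assemble their own options inside `NodeStageVolume`/`NodePublishVolume`.

```go
// Minimal sketch, not actual driver code: shows how an SMB/CIFS-backed driver
// could honour a supplied group at mount time instead of relying on chown/chmod.
package main

import "fmt"

func buildMountOptions(volumeMountGroup string) []string {
	// file_mode/dir_mode values are illustrative defaults.
	opts := []string{"file_mode=0770", "dir_mode=0770"}
	if volumeMountGroup != "" {
		// gid= is a standard CIFS mount option; files in the mount will
		// appear to be owned by this group.
		opts = append(opts, "gid="+volumeMountGroup)
	}
	return opts
}

func main() {
	fmt.Println(buildMountOptions("1000")) // [file_mode=0770 dir_mode=0770 gid=1000]
}
```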

### Goals

- Allow CSI drivers to mount volumes with the provided fsGroup.

### Non-Goals

- We are not supplying `fsGroup` as a generic ownership and permission handle to the CSI driver. We do not expect CSI drivers to `chown` or `chmod` files.

## Proposal

We are updating the CSI spec to add an additional field called `volume_mount_group` to the `NodeStageVolume` and `NodePublishVolume` RPC calls. The CSI proposal is available at https://github.com/container-storage-interface/spec/pull/468.
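
For readability, a rough Go mirror of the kind of field being proposed is sketched below. This is illustrative only: the authoritative shape is the protobuf in the CSI spec PR above, and the exact field names and placement there may differ.

```go
// Illustrative only - a rough Go mirror of the proposed addition, not the
// generated CSI bindings. The authoritative definition lives in
// container-storage-interface/spec#468.
package main

// NodeStageVolumeRequest sketches the existing request with the proposed field
// added; NodePublishVolumeRequest would gain the equivalent field.
type NodeStageVolumeRequest struct {
	VolumeId          string
	StagingTargetPath string

	// VolumeMountGroup corresponds to the proposed volume_mount_group field.
	// It carries the pod's fsGroup; a driver advertising the VOLUME_MOUNT_GROUP
	// node capability is expected to make the staged volume readable and
	// writable by this group at mount time.
	VolumeMountGroup string
}

func main() {
	_ = NodeStageVolumeRequest{VolumeId: "vol-1", StagingTargetPath: "/staging", VolumeMountGroup: "1000"}
}
```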

The CSI spec change deliberately avoids asking drivers to use the supplied `fsGroup` as a generic handle for ownership and permissions, because Kubernetes may expect ownership and permissions to be set in ways that are very platform/OS specific. We do not think the CSI driver is the right place to enforce all the different kinds of permissions expected by Kubernetes. The full scope of that discussion is out of scope for this enhancement; interested folks can follow along at https://github.com/container-storage-interface/spec/issues/449.

### Risks and Mitigations

We are not aware of any associated risks. If a driver cannot support using `fsGroup` as a mount option, it can always use the `File` `FSGroupPolicy` and let kubelet handle the ownership and permissions.

## Design Details

We propose that when kubelet determines that a CSI driver has the `VOLUME_MOUNT_GROUP` node capability, it will use the proposed CSI field `volume_mount_group` to pass the pod's `fsGroup` to the CSI driver. Kubelet will expect the driver to use this field to mount the volume with the given `fsGroup`, so that no further permission or ownership change is necessary.

It should be noted that if a CSI driver advertises the `VOLUME_MOUNT_GROUP` node capability, the value defined in `CSIDriver.Spec.FSGroupPolicy` will be ignored and kubelet will always use `fsGroup` as a mount option.
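
A minimal sketch of the decision described above, assuming hypothetical stand-ins for the CSI capability lookup and for kubelet's existing recursive chown/chmod path (this is not the actual kubelet code):

```go
// Minimal sketch of the kubelet-side decision described above.
package main

import (
	"fmt"
	"strconv"
)

// prepareVolumeMountGroup returns the value kubelet would put into the proposed
// volume_mount_group CSI field, and whether kubelet still needs to perform its
// legacy recursive chown/chmod after mount.
func prepareVolumeMountGroup(fsGroup *int64, driverHasVolumeMountGroup bool) (mountGroup string, needsRecursiveChange bool) {
	if fsGroup == nil {
		return "", false // nothing to apply
	}
	if driverHasVolumeMountGroup {
		// Driver advertised VOLUME_MOUNT_GROUP: delegate group ownership to the
		// driver at mount time. CSIDriver.Spec.FSGroupPolicy is ignored here and
		// kubelet skips its recursive permission change entirely.
		return strconv.FormatInt(*fsGroup, 10), false
	}
	// Fall back to today's behaviour: kubelet applies fsGroup recursively after
	// mount, subject to CSIDriver.Spec.FSGroupPolicy.
	return "", true
}

func main() {
	fsGroup := int64(1000)
	group, legacy := prepareVolumeMountGroup(&fsGroup, true)
	fmt.Println(group, legacy) // "1000" false
}
```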
### Test Plan

Unit tests:
1. Test that, whenever supported, the pod's `fsGroup` is passed to the CSI driver via the `volume_mount_group` field (a sketch of such a test appears at the end of this section).

For the alpha feature:
1. Update the Azure File CSI driver to support supplying `fsGroup` via `NodeStageVolume` and `NodePublishVolume`.
2. Run manual tests against the Azure File CSI driver.

For beta:
1. E2E tests that verify volume readability/writability using the Azure File CSI driver.
2. E2E tests using the CSI mock driver.

We already have quite a few e2e tests that verify generic fsGroup functionality for existing drivers - https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/testsuites/fsgroupchangepolicy.go. This should give us reasonable confidence that we won't break any existing drivers.
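
A sketch of the kind of unit test described above, with a hypothetical recording fake standing in for the real CSI node client and kubelet plumbing (the real test would exercise kubelet's CSI volume mounter against the mock CSI driver):

```go
// Sketch of the unit test described above; names are stand-ins, not the real
// kubelet test fixtures.
package csimount

import "testing"

// fakeStageRequest mirrors only the field under test.
type fakeStageRequest struct {
	VolumeMountGroup string
}

// recordingNodeClient records the last NodeStageVolume request it received.
type recordingNodeClient struct {
	lastStage fakeStageRequest
}

func (c *recordingNodeClient) NodeStageVolume(req fakeStageRequest) {
	c.lastStage = req
}

// stageVolume is a stand-in for kubelet's mount path: it forwards fsGroup only
// when the driver advertises the VOLUME_MOUNT_GROUP capability.
func stageVolume(c *recordingNodeClient, fsGroup string, supportsMountGroup bool) {
	req := fakeStageRequest{}
	if supportsMountGroup {
		req.VolumeMountGroup = fsGroup
	}
	c.NodeStageVolume(req)
}

func TestFSGroupPassedToDriver(t *testing.T) {
	client := &recordingNodeClient{}
	stageVolume(client, "1000", true)
	if client.lastStage.VolumeMountGroup != "1000" {
		t.Errorf("expected volume_mount_group=1000, got %q", client.lastStage.VolumeMountGroup)
	}
}
```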

### Graduation Criteria

#### Alpha -> Beta Graduation

- Since this feature is a must-have for Azure File CSI migration, we will perform testing of the driver.
- The CSI spec change is currently being introduced as an alpha change; we will work to move the API change in the CSI spec to stable.

#### Beta -> GA Graduation

- CSI spec change should be stable.
- Tested via e2e and manually using the Azure File CSI driver.

### Upgrade / Downgrade Strategy

Currently there is no way to make a volume readable/writable using Azure File and `fsGroup` unless the pod is running as root.

When the feature gate is disabled, kubelet will no longer pass `fsGroup` to CSI drivers, and such volumes will not be readable/writable by the pod - which is the (already broken) behaviour today anyway.

<!--
If applicable, how will the component be upgraded and downgraded? Make sure
this is in the test plan.

Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
  cluster required to make on upgrade, in order to maintain previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
  cluster required to make on upgrade, in order to make use of the enhancement?
-->

### Version Skew Strategy

<!--
If applicable, how will the component handle version skew with other
components? What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
enhancement:
- Does this enhancement involve coordinating behavior in the control plane and
  in the kubelet? How does an n-2 kubelet without this feature available behave
  when this feature is used?
- Will any other components on the node change? For example, changes to CSI,
  CRI or CNI may require updating that component before the kubelet.
-->

## Production Readiness Review Questionnaire

<!--

Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.

The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.

In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.

The KEP must have an approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.

-->

### Feature Enablement and Rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [x] Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: MountWithFSGroup
    - Components depending on the feature gate:
      - Kubelet
  - [ ] Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control
      plane?
    - Will enabling / disabling the feature require downtime or reprovisioning
      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).

* **Does enabling the feature change any default behavior?**
  Enabling this feature gate could result in `CSIDriver.Spec.FSGroupPolicy` being ignored for a driver that now has
  the `VOLUME_MOUNT_GROUP` capability. Enabling this feature will cause the volume to be mounted with the `fsGroup`
  of the pod rather than kubelet performing the permission/ownership change. This should result in quicker
  pod startup but may still surprise some users. This will be covered via release notes.

  We expect that once a CSI driver accepts the provided `fsGroup` via `volume_mount_group` and the mount
  operation is successful, the permissions on the volume should be correct. A user may write an additional
  healthcheck to verify the permissions if necessary (see the sketch at the end of this section).

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
  the enablement)?**
  Yes - the feature gate can be disabled once enabled, and kubelet will fall back to its current behavior. As a result, any CSI driver that requires using `fsGroup` as a mount option will no longer receive the mount option, in which case
  mounted volumes will be unwritable by any pods other than those running as root.

* **What happens if we reenable the feature if it was previously rolled back?**
  It will cause kubelet to use the `volume_mount_group` field of CSI whenever applicable, as discussed in the design above. The pods running on affected
  nodes have to be restarted for the feature to take effect, though.

* **Are there any tests for feature enablement/disablement?**
  Not yet.
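
For the optional healthcheck mentioned above, a minimal sketch that verifies the group on a mounted path; the path and expected GID are placeholders, and the check only makes sense on a Unix node:

```go
// Sketch of an ad-hoc healthcheck: stat a path inside the mounted volume and
// compare its GID with the expected fsGroup.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func mountedWithGroup(path string, gid uint32) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	stat, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return false, fmt.Errorf("no unix stat available for %s", path)
	}
	return stat.Gid == gid, nil
}

func main() {
	ok, err := mountedWithGroup("/mnt/volume", 1000) // placeholder path and GID
	fmt.Println(ok, err)
}
```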
### Rollout, Upgrade and Rollback Planning

* **How can a rollout fail? Can it impact already running workloads?**
  One way the rollout could fail is if a CSI driver declares the `VOLUME_MOUNT_GROUP` capability but does not implement it correctly.
  This change should not affect running workloads (i.e. running Pods).

  Rolling out the feature, however, may cause group permissions to be applied correctly where they weren't applied before.

* **What specific metrics should inform a rollback?**
  If, after enabling this feature, a spike in the `storage_operation_status_count{operation_name="volume_mount", status="fail-unknown"}` metric is observed,
  then the cluster admin should look into identifying the root cause and consider rolling back the feature.

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
  Describe manual testing that was done and the outcomes.
  Longer term, we may want to require automated upgrade/rollback tests, but we
  are missing a bunch of machinery and tooling and can't do that now.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
  fields of API types, flags, etc.?**
  Even if applying deprecation policies, they may still surprise some users.

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
  We are going to split the metric that captures mount and permission timings. Full details are available in https://github.com/kubernetes/kubernetes/issues/98667.

* **What are the SLIs (Service Level Indicators) an operator can use to determine
  the health of the service?**
  - [ ] Metrics
    - Metric name:
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - [ ] Other (treat as last resort)
    - Details:

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
  At a high level, this usually will be in the form of "high percentile of SLI
  per day <= X". It's impossible to provide comprehensive guidance, but at the very
  high level (needs more precise definitions) those may be things like:
  - per-day percentage of API calls finishing with 5XX errors <= 1%
  - 99% percentile over day of absolute value from (job creation time minus expected
    job creation time) for cron job <= 10%
  - 99.9% of /health requests per day finish with 200 code

* **Are there any missing metrics that would be useful to have to improve observability
  of this feature?**
  Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
  implementation difficulties, etc.).

### Dependencies

* **Does this feature depend on any specific services running in the cluster?**
  This feature depends on the presence of the `fsGroup` field in the Pod and on the CSI driver having the `VOLUME_MOUNT_GROUP`
  capability.

  - [CSI driver]
    - CSI driver must have the `VOLUME_MOUNT_GROUP` capability:
      - If this capability is not available in the CSI driver, kubelet will try to use the default mechanism of applying `fsGroup` to the volume (which is basically `chown` and `chmod`). If the underlying driver does not support applying group permissions via `chown` and `chmod`, then Pods will not run correctly.
      - Workloads may not run correctly, or the volume has to be mounted with permissions `777`.

### Scalability

_For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them._

_For beta, this section is required: reviewers must answer these questions._

_For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field._

* **Will enabling / using this feature result in any new API calls?**
  Describe them, providing:
  - API call type (e.g. PATCH pods)
  - estimated throughput
  - originating component(s) (e.g. Kubelet, Feature-X-controller)
  focusing mostly on:
  - components listing and/or watching resources they didn't before
  - API calls that may be triggered by changes of some Kubernetes resources
    (e.g. update of object X triggers new updates of object Y)
  - periodic API calls to reconcile state (e.g. periodic fetching state,
    heartbeats, leader election, etc.)

* **Will enabling / using this feature result in introducing new API types?**
  Describe them, providing:
  - API type
  - Supported number of objects per cluster
  - Supported number of objects per namespace (for namespace-scoped objects)

* **Will enabling / using this feature result in any new calls to the cloud
  provider?**

* **Will enabling / using this feature result in increasing size or count of
  the existing API objects?**
  Describe them, providing:
  - API type(s):
  - Estimated increase in size: (e.g., new annotation of size 32B)
  - Estimated amount of new objects: (e.g., new Object X for every existing Pod)

* **Will enabling / using this feature result in increasing time taken by any
  operations covered by [existing SLIs/SLOs]?**
  Think about adding additional work or introducing new steps in between
  (e.g. need to do X to start a container), etc. Please describe the details.

* **Will enabling / using this feature result in non-negligible increase of
  resource usage (CPU, RAM, disk, IO, ...) in any components?**
  Things to keep in mind include: additional in-memory state, additional
  non-trivial computations, excessive access to disks (including increased log
  volume), significant amount of data sent and/or received over network, etc.
  Think through this both in small and large cases, again with respect to the
  [supported limits].

### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.

_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**

* **What are other known failure modes?**
  For each of them, fill in the following information by copying the below template:
  - [Failure mode brief description]
    - Detection: How can it be detected via metrics? Stated another way:
      how can an operator troubleshoot without logging into a master or worker node?
    - Mitigations: What can be done to stop the bleeding, especially for already
      running user workloads?
    - Diagnostics: What are the useful log messages and their required logging
      levels that could help debug the issue?
      Not required until feature graduated to beta.
    - Testing: Are there any tests for failure mode? If not, describe why.

* **What steps should be taken if SLOs are not being met to determine the problem?**

[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

## Infrastructure Needed (Optional)

<!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
-->