# Provide fsgroup of pod to CSI driver on mount

## Table of Contents

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha -> Beta Graduation](#alpha---beta-graduation)
    - [Beta -> GA Graduation](#beta---ga-graduation)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary

Currently, for most volume plugins, the kubelet applies fsgroup ownership and permission changes by recursively `chown`ing and `chmod`ing the files and directories inside a volume. For certain CSI drivers this may not be possible, because `chown` and `chmod` are Unix primitives that the underlying CSI driver
may not support. This enhancement proposes providing the CSI driver with fsgroup as an explicit field, so that the CSI driver can apply it at mount time.
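
For illustration, the following is a simplified sketch of the recursive ownership and permission change the kubelet performs today. It is not the actual kubelet implementation, and the function name is hypothetical.

```go
// Simplified, illustrative sketch of the recursive fsGroup application the
// kubelet performs today for volume plugins that support chown/chmod.
package volumeperms

import (
	"io/fs"
	"os"
	"path/filepath"
	"syscall"
)

// applyFSGroupRecursively walks the volume and chowns/chmods every entry so
// that members of fsGroup can access it. The real kubelet logic also honors
// fsGroupChangePolicy and handles modes and symlinks more carefully.
func applyFSGroupRecursively(volumePath string, fsGroup int) error {
	return filepath.WalkDir(volumePath, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		// Keep the owning UID, change only the group.
		uid := int(info.Sys().(*syscall.Stat_t).Uid)
		if err := os.Lchown(path, uid, fsGroup); err != nil {
			return err
		}
		if d.Type()&fs.ModeSymlink != 0 {
			return nil // do not chmod through symlinks
		}
		mode := info.Mode() | 0060 // group read/write
		if d.IsDir() {
			mode |= os.ModeSetgid | 0070 // new entries inherit the group
		}
		return os.Chmod(path, mode)
	})
}
```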

## Motivation

Since some CSI drivers (Azure File, for example) don't support `chmod`/`chown`, we propose that the pod's fsgroup be provided to the CSI driver on the `NodeStageVolume`
and `NodePublishVolume` CSI RPC calls. This allows the CSI driver to apply `fsGroup` as a mount option during `NodeStageVolume` or `NodePublishVolume`, and frees the kubelet
from the responsibility of applying recursive ownership and permission changes.

This feature is therefore a prerequisite for CSI migration of the Azure File driver and removal of the Azure Cloud Provider.
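
As a purely hypothetical illustration of what "apply `fsGroup` as a mount option" can mean for an SMB/CIFS-backed driver such as Azure File, the driver could translate the supplied group into standard CIFS mount options such as `gid`, `file_mode` and `dir_mode`; the exact mapping is up to each driver.

```go
// Hypothetical sketch: an SMB/CIFS-backed CSI driver turning a supplied mount
// group into mount options instead of relying on chown/chmod after mount.
package nodeserver

import "fmt"

// mountOptionsForGroup appends CIFS mount options that make the share usable
// by the given group at mount time. Option values here are illustrative.
func mountOptionsForGroup(baseOptions []string, volumeMountGroup string) []string {
	if volumeMountGroup == "" {
		return baseOptions
	}
	return append(baseOptions,
		fmt.Sprintf("gid=%s", volumeMountGroup), // files appear owned by this group
		"file_mode=0660",                        // group-readable/writable files
		"dir_mode=0770",                         // group-accessible directories
	)
}
```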

### Goals

- Allow CSI drivers to mount volumes with the provided fsgroup.

### Non-Goals

- We are not supplying `fsGroup` as a generic ownership and permission handle to the CSI driver. We do not expect CSI drivers to `chown` or `chmod` files.

## Proposal

We are updating the CSI spec by adding an additional field called `volume_mount_group` to the `NodeStageVolume` and `NodePublishVolume` RPC calls. The CSI proposal is available at https://github.com/container-storage-interface/spec/pull/468.
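
One plausible shape for that change (the merged CSI spec is authoritative; the exact placement of the field is decided there) is to carry the new field in the mount-specific portion of the `VolumeCapability` that both RPCs accept. In generated Go types it would look roughly like this:

```go
// Rough, non-authoritative sketch of the proposed CSI types after adding
// volume_mount_group; consult the merged CSI spec for exact placement.
package csisketch

// MountVolume mirrors VolumeCapability.MountVolume from the CSI spec.
type MountVolume struct {
	FsType     string
	MountFlags []string

	// VolumeMountGroup carries the pod's fsGroup (as a string). A driver that
	// reports the VOLUME_MOUNT_GROUP node capability is expected to make the
	// volume readable and writable by this group at mount time.
	VolumeMountGroup string
}

// Both NodeStageVolumeRequest and NodePublishVolumeRequest carry a
// VolumeCapability, and therefore the mount group, to the driver.
type VolumeCapability struct {
	Mount *MountVolume
	// ... access mode and other fields elided ...
}
```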

The CSI spec change deliberately avoids asking drivers to use the supplied `fsGroup` as a generic handle for ownership and permissions. The reason is that Kubernetes may expect ownership and permissions in ways that are very platform/OS specific, and we do not think the CSI driver is the right place to enforce all the different kinds of permissions Kubernetes expects. The full scope of that discussion is out of scope for this enhancement; interested folks can follow along at https://github.com/container-storage-interface/spec/issues/449.

### Risks and Mitigations

We are not aware of any associated risks. If a driver cannot support using `fsgroup` as a mount option, it can always rely on the `File` value of `CSIDriver.Spec.FSGroupPolicy` and let the kubelet handle ownership and permissions.

## Design Details

We propose that when the kubelet determines a CSI driver has the `VOLUME_MOUNT_GROUP` node capability, the kubelet will use the proposed CSI field `volume_mount_group` to pass the pod's `fsGroup` to the CSI driver. The kubelet will expect the driver to use
this field to mount the volume with the given `fsGroup`, so that no further permission/ownership change is necessary.

It should be noted that if a CSI driver advertises the `VOLUME_MOUNT_GROUP` node capability, the value defined in `CSIDriver.Spec.FSGroupPolicy` will be ignored and the kubelet will always use `fsGroup` as a mount option.
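
A minimal sketch of that kubelet-side decision follows, assuming hypothetical helper names; the real plumbing lives in the kubelet's CSI volume plugin.

```go
// Minimal sketch of the kubelet-side decision described above; names are
// hypothetical and the real logic lives in kubelet's CSI volume plugin.
package kubeletcsi

import "strconv"

// volumeMountGroupValue decides whether fsGroup handling is delegated to the
// driver. When it is, the returned string is copied into the
// volume_mount_group field of the NodeStageVolume/NodePublishVolume calls and
// the kubelet skips its recursive ownership/permission change.
func volumeMountGroupValue(fsGroup *int64, driverSupportsVolumeMountGroup bool) (value string, delegateToDriver bool) {
	if fsGroup == nil || !driverSupportsVolumeMountGroup {
		// Fall back to today's behavior: the kubelet applies ownership and
		// permissions itself after mount, subject to FSGroupPolicy.
		return "", false
	}
	// fsGroup is an int64 GID in the pod's securityContext; CSI carries it as
	// a string. FSGroupPolicy is effectively ignored in this path.
	return strconv.FormatInt(*fsGroup, 10), true
}
```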

### Test Plan

Unit test:
1. Test that, whenever supported, the pod's `fsGroup` is passed to the CSI driver via the `volume_mount_group` field (a sketch follows below).
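
A table-driven sketch of such a unit test, built on the hypothetical helper from the design sketch above; the real test would exercise the kubelet's CSI mounter against a fake CSI client.

```go
// Sketch of a table-driven unit test for the behavior above, using the
// hypothetical volumeMountGroupValue helper.
package kubeletcsi

import "testing"

func TestVolumeMountGroupValue(t *testing.T) {
	gid := int64(2000)
	tests := []struct {
		name          string
		fsGroup       *int64
		driverSupport bool
		wantValue     string
		wantDelegate  bool
	}{
		{"driver supports VOLUME_MOUNT_GROUP", &gid, true, "2000", true},
		{"driver lacks the capability", &gid, false, "", false},
		{"no fsGroup on the pod", nil, true, "", false},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			value, delegate := volumeMountGroupValue(tc.fsGroup, tc.driverSupport)
			if value != tc.wantValue || delegate != tc.wantDelegate {
				t.Errorf("got (%q, %v), want (%q, %v)", value, delegate, tc.wantValue, tc.wantDelegate)
			}
		})
	}
}
```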

For alpha feature:
1. Update the Azure File CSI driver to support supplying `fsGroup` via `NodeStageVolume` and `NodePublishVolume`.
1. Run manual tests against the Azure File CSI driver.

For beta:
1. E2E tests that verify volume readability/writability using the Azure File CSI driver.
2. E2E tests using the CSI mock driver.

We already have quite a few e2e tests that verify generic fsgroup functionality for existing drivers (see https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/testsuites/fsgroupchangepolicy.go). This should give us reasonable
confidence that we won't break any existing drivers.

### Graduation Criteria

#### Alpha -> Beta Graduation

- Since this feature is a must-have for Azure File CSI migration, we will perform testing of the driver.
- The CSI spec change is currently being introduced as an alpha change, and we will work to move the API change in the CSI spec to stable.

#### Beta -> GA Graduation

- The CSI spec change should be stable.
- Tested via e2e tests and manually using the Azure File CSI driver.

### Upgrade / Downgrade Strategy

Currently there is no way to make a volume readable/writable using Azure File and `fsGroup` unless
the pod runs as root.

When the feature gate is disabled, the kubelet will no longer pass `fsGroup` to CSI drivers, and such volumes will not be readable/writable by the Pod. This behavior is already broken today.

<!--
If applicable, how will the component be upgraded and downgraded? Make sure
this is in the test plan.

Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
  cluster required to make on upgrade, in order to maintain previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
  cluster required to make on upgrade, in order to make use of the enhancement?
-->

### Version Skew Strategy

<!--
If applicable, how will the component handle version skew with other
components? What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
enhancement:
- Does this enhancement involve coordinating behavior in the control plane and
  in the kubelet? How does an n-2 kubelet without this feature available behave
  when this feature is used?
- Will any other components on the node change? For example, changes to CSI,
  CRI or CNI may require updating that component before the kubelet.
-->

## Production Readiness Review Questionnaire

<!--

Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.

The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.

In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.

The KEP must have an approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.

-->

### Feature Enablement and Rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [x] Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: MountWithFSGroup
    - Components depending on the feature gate:
      - Kubelet
  - [ ] Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control
      plane?
    - Will enabling / disabling the feature require downtime or reprovisioning
      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).

* **Does enabling the feature change any default behavior?**
  Enabling this feature gate could result in `CSIDriver.Spec.FSGroupPolicy` being ignored for a driver that now has the
  `VOLUME_MOUNT_GROUP` capability. Enabling this feature will cause volumes to be mounted with the `fsGroup`
  of the pod rather than the kubelet performing the permission/ownership change. This should result in quicker
  pod startup but may still surprise some users. This will be covered via release notes.

  We expect that once a CSI driver accepts the provided `fsGroup` via `volume_mount_group` and the mount
  operation is successful, the permissions on the volume should be correct. A user may write an additional
  health check to verify the permissions if necessary.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
  the enablement)?**
  Yes, the feature gate can be disabled once enabled, and the kubelet will fall back to its current behavior. As a result, any CSI driver
  that requires `fsGroup` to be applied as a mount option will no longer receive it, in which case
  mounted volumes will be unwritable by any pods other than those running as root.

* **What happens if we reenable the feature if it was previously rolled back?**
  It will cause the kubelet to use the CSI `volume_mount_group` field whenever applicable, as discussed in the design above. However, pods running on affected
  nodes have to be restarted for the feature to take effect.

* **Are there any tests for feature enablement/disablement?**
  Not yet.

### Rollout, Upgrade and Rollback Planning

* **How can a rollout fail? Can it impact already running workloads?**
  One way the rollout could fail is if a CSI driver advertises the `VOLUME_MOUNT_GROUP` capability but does not implement it correctly.
  This change should not affect running workloads (i.e. running Pods).

  Rolling out the feature may, however, cause group permissions to be applied correctly where they weren't applied before.

* **What specific metrics should inform a rollback?**
  If, after enabling this feature, a spike in the `storage_operation_status_count{operation_name="volume_mount", status="fail-unknown"}` metric is observed,
  the cluster admin should look into identifying the root cause and consider rolling back the feature.

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
  Describe manual testing that was done and the outcomes.
  Longer term, we may want to require automated upgrade/rollback tests, but we
  are missing a bunch of machinery and tooling and can't do that now.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**
  Even if applying deprecation policies, they may still surprise some users.

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
  We are going to split the metric that captures mount and permission timings. The full details are available at https://github.com/kubernetes/kubernetes/issues/98667.

* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
  - [ ] Metrics
    - Metric name:
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - [ ] Other (treat as last resort)
    - Details:

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
  At a high level, this usually will be in the form of "high percentile of SLI
  per day <= X". It's impossible to provide comprehensive guidance, but at the very
  high level (needs more precise definitions) those may be things like:
  - per-day percentage of API calls finishing with 5XX errors <= 1%
  - 99% percentile over day of absolute value from (job creation time minus expected
    job creation time) for cron job <= 10%
  - 99,9% of /health requests per day finish with 200 code

* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**
  Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
  implementation difficulties, etc.).

### Dependencies

* **Does this feature depend on any specific services running in the cluster?**
  This feature depends on the presence of the `fsGroup` field in the Pod and on the CSI driver having the `VOLUME_MOUNT_GROUP`
  capability.

  - [CSI driver]
    - CSI driver must have the `VOLUME_MOUNT_GROUP` capability:
      - If this capability is not available in the CSI driver, then the kubelet will try to use the default mechanism of applying `fsGroup` to the volume (which is basically `chown` and `chmod`). If the underlying driver does not support applying group permissions via `chown` and `chmod`, then Pods will not run correctly.
      - Workloads may not run correctly, or the volume has to be mounted with permissions `777`.

### Scalability

_For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them._

_For beta, this section is required: reviewers must answer these questions._

_For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field._

* **Will enabling / using this feature result in any new API calls?**
  Describe them, providing:
  - API call type (e.g. PATCH pods)
  - estimated throughput
  - originating component(s) (e.g. Kubelet, Feature-X-controller)
  focusing mostly on:
  - components listing and/or watching resources they didn't before
  - API calls that may be triggered by changes of some Kubernetes resources
    (e.g. update of object X triggers new updates of object Y)
  - periodic API calls to reconcile state (e.g. periodic fetching state,
    heartbeats, leader election, etc.)

* **Will enabling / using this feature result in introducing new API types?**
  Describe them, providing:
  - API type
  - Supported number of objects per cluster
  - Supported number of objects per namespace (for namespace-scoped objects)

* **Will enabling / using this feature result in any new calls to the cloud
provider?**

* **Will enabling / using this feature result in increasing size or count of
the existing API objects?**
  Describe them, providing:
  - API type(s):
  - Estimated increase in size: (e.g., new annotation of size 32B)
  - Estimated amount of new objects: (e.g., new Object X for every existing Pod)

* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs]?**
  Think about adding additional work or introducing new steps in between
  (e.g. need to do X to start a container), etc. Please describe the details.

* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**
  Things to keep in mind include: additional in-memory state, additional
  non-trivial computations, excessive access to disks (including increased log
  volume), significant amount of data sent and/or received over network, etc.
  Think through this both in small and large cases, again with respect to the
  [supported limits].

### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.

_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**

* **What are other known failure modes?**
  For each of them, fill in the following information by copying the below template:
  - [Failure mode brief description]
    - Detection: How can it be detected via metrics? Stated another way:
      how can an operator troubleshoot without logging into a master or worker node?
    - Mitigations: What can be done to stop the bleeding, especially for already
      running user workloads?
    - Diagnostics: What are the useful log messages and their required logging
      levels that could help debug the issue?
      Not required until feature graduated to beta.
    - Testing: Are there any tests for failure mode? If not, describe why.

* **What steps should be taken if SLOs are not being met to determine the problem?**

[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

## Infrastructure Needed (Optional)

<!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
-->