|
49 | 49 | - [Change container runtime](#change-container-runtime)
|
50 | 50 | - [Move SELinux label management to kubelet](#move-selinux-label-management-to-kubelet)
|
51 | 51 | - [Merge <code>FSGroupChangePolicy</code> and <code>SELinuxRelabelPolicy</code>](#merge-fsgroupchangepolicy-and-selinuxrelabelpolicy)
|
| 52 | + - [Implement kubelet admission](#implement-kubelet-admission) |
52 | 53 | <!-- /toc -->
|
53 | 54 |
|
54 | 55 | ## Release Signoff Checklist
|
@@ -351,7 +352,6 @@ Apart from the obvious API change and behavior described above, kubelet + volume
|
351 | 352 | * Reconciler must check also SELinux context used to mount a volume (both mounted devices and volumes) before considering what operation to take on a volume (`MountVolume` or `UnmountVolume`/`UnmountDevice` or nothing).
|
352 | 353 | It must throw proper error message telling that a Pod can't start because its volume is used by another Pod with a different SELinux context.
|
353 | 354 | * This is a good point to capture any metrics proposed below.
|
354 |
| - * Kubelet will reject Pods that use a volume that is already mounted with a different SELinux context during pod admission. |
355 | 355 | * Volume plugins will get SELinux context as a new parameter of `MountDevice` and `SetUp`/`SetupAt` calls (resp. as a new field in `DeviceMounterArgs` / `MounterArgs`).
|
356 | 356 | * Each volume plugin can choose to use the mount option `-o context=` (e.g. when `CSIDriver.SELinuxRelabelPolicy` is `true`) or ignore it (e.g. in-tree volume plugins for shared filesystems or when `CSIDriver.SELinuxRelabelPolicy` is `false` or `nil`).
|
357 | 357 | * Each volume plugin then returns `SupportsSELinux` from `GetAttributes()` call, depending on if it wants the container runtime to relabel the volume (`true`) or not (`false`; the volume was already mounted with the right label or it does not support SELinux at all).
|
@@ -409,16 +409,13 @@ Due to change of Kubernetes behavior, we will implement the feature only for cas
|
409 | 409 | Such volumes can be used only in a single pod and two pods can't ever conflict when using it.
|
410 | 410 | - Collect metrics of how many other pods would fail because they use a RWO/RWX volume that's used by a pod with different SELinux context on the same node.
|
411 | 411 | TBD: create Info level alert ("please consider re-architecting your app not to share volumes this way and/or report it to sig-storage")?
|
412 |
| -- During Pod admission, only bump a metric that a Pod would be rejected when it uses already mounted volume, but with a different SELinux context. |
413 |
| - - Note that there already is Pod admission check that rejects Pods that use a already used `ReadWriteOncePod` PV. |
414 | 412 | This phase can go Beta (be enabled by default) or even GA without breaking anything.
|
415 | 413 |
|
416 | 414 | #### Phase 2
|
417 | 415 | Based on Phase 1 results:
|
418 | 416 | - Extend the implementation to all volumes, i.e. to in-line volumes and PVCs with any access mode.
|
419 | 417 | - Bump severity of the alert to Warning.
|
420 | 418 | - Announce the behavior change and deprecate the old behavior.
|
421 |
| -- Enable the Pod admission. |
422 | 419 |
|
423 | 420 | If Phase 1 shows that too many applications would be broken, then go GA only with Phase 1, i.e. `ReadWriteOncePod` PVCs.
|
424 | 421 | Even that will help users to avoid recursive relabeling of volumes if their application can use `ReadWriteOncePod` PVCs.
|
@@ -751,8 +748,6 @@ _This section must be completed when targeting beta graduation to a release._
|
751 | 748 | * Implement bare minimum of `SELinuxMount` for experiments, including:
|
752 | 749 | * Extend SELinux mount to all volume access modes.
|
753 | 750 | * Add label with volume access mode to `volume_manager_selinux_volume_context_mismatch_errors_total` and similar metrics.
|
754 |
| - * Implement aforementioned kubelet admission to reject Pods that use a volume that is already mounted with a different SELinux context. |
755 |
| - This admission was useless with RWOP volumes, because kubelet already rejects Pods that use a RWOP volume that's used by another Pod. |
756 | 751 |
|
757 | 752 | ## Drawbacks [optional]
|
758 | 753 |
|
@@ -820,3 +815,29 @@ At the same time, `FSGroupChangePolicy: OnRootMismatch` may not be desirable for
|
820 | 815 | where various files on the volume may get random owners.
|
821 | 816 |
|
822 | 817 | With a single `VolumeChangePolicy`, user has to fall back to `Recursive` policy and SELinux labels would be unnecessarily changed.
|
| 818 | + |
| 819 | +### Implement kubelet admission |
| 820 | + |
| 821 | +It's not really an alternative, but it's worth mentioning that we originally proposed to implement kubelet admission for Pods with SELinux contexts. |
| 822 | + |
| 823 | +Kubelet _could_ reject a Pod that uses a volume with a different SELinux context than the volume is currently mounted with on the node, right during Pod admission. |
| 824 | +Such a Pod would end in `Failed` state with appropriate message, and it would never reach kubelet internal states (e.g. DSW, ASW, MountDevice, SetUp, etc.). |
| 825 | + |
| 826 | +We thought it would send a clear message to the user that such a pod can't start, and why. |
| 827 | +During implementation and testing it has shown that it's very noisy. Consider this experiment: |
| 828 | + |
| 829 | +1. User runs a single-replica Deployment with no SELinux context set. The Deployment uses a single volume. The Deployment runs a Pod A on Node 1. |
| 830 | +1. User edits the Deployment and sets the SELinux context of the Pod, because they want to benefit from `SELinuxMount` feature. |
| 831 | +1. Deployment / ReplicaSet controllers start a new Pod B with the new SELinux context. |
| 832 | +1. Scheduler (accidentally?) puts Pod B to Node 1, where Pod A still runs. |
| 833 | +1. Kubelet rejects the new Pod, because the volume is already mounted without any SELinux context, while the new Pod wants a specific one. |
| 834 | +1. Deployment / ReplicaSet controllers start another Pod with the new SELinux context. Scheduler puts it to Node 1 again. |
| 835 | +1. Kubelet rejects the Pod again. This loop can continue forever, as longs as the scheduler picks Node 1 and Pod A still runs there. |
| 836 | + |
| 837 | +I got hundreds Pods just created and rejected in a minute or so. While it sends a clear message to the user that something is wrong, it's just too noisy for the API server. |
| 838 | +
|
| 839 | +**As result, we decided to remove pod admission in kubelet from this KEP.** |
| 840 | +
|
| 841 | +Expected behavior without Pod admission is that kubelet admits the Pod B and eventually its volumes reach DSW. |
| 842 | +The Pod B will be `ContainerCreating` until Pod A terminates and the volume can be mounted with the right SELinux context. |
| 843 | +The same behavior was already implemented for `ReadWriteOncePod` AccessMode. |
0 commit comments