keps/sig-storage/1710-selinux-relabeling/README.md
Lines changed: 45 additions & 14 deletions
@@ -118,6 +118,10 @@ Further in this KEP we assume that the SELinux is enabled on the system. This KE
118
118
119
119
See [SELinux documentation](https://selinuxproject.org/page/NB_MLS) for more details.
120
120
121
+
In this document we use the `container_t` and `container_file_t` labels for container processes and files; these are the default labels on Fedora-based distributions (AlmaLinux, CentOS, Red Hat Enterprise Linux, Rocky Linux, ...).
122
+
Other distributions use different defaults, for example Debian uses `svirt_lxc_net_t` and `svirt_lxc_file_t` as the default labels for containers, but the principles are the same.
123
+
The implementation of this KEP does not depend on the actual labels used in the system.
124
+
121
125
### SELinux label assignment
122
126
In Kubernetes, the SELinux label of a pod is assigned in two ways:
123
127
1. Either it is set by the user in PodSpec or Container: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/.
@@ -465,13 +469,13 @@ spec:
465
469
* Same as the previous story. Kubelet mounts the volume without any SELinux option + the container runtime relabels the volumes recursively.
* If `myclaim` is a RWOP volume (`Spec.AccessModes == ["ReadWriteOncePod']`) *and* the corresponding CSI drivers support SELinux mount, kubelet mounts the volume with `-o context=system_u:object_r:container_file_t:s0:c10,c0`.
472
+
* If `myclaim` is a RWOP volume (`Spec.AccessModes == ["ReadWriteOncePod"]`) *and* the corresponding CSI driver supports SELinux mount, kubelet fills the blanks in `seLinuxOptions` from the system defaults (`user: system_u`, `role: object_r`, `type: container_t` on Fedora-based distros), translates them to a file label (`container_t` -> `container_file_t`), and mounts the volume with `-o context=system_u:object_r:container_file_t:s0:c10,c0`.
469
473
* If `myclaim` is any other volume, kubelet mounts the volume without any SELinux option + the container runtime relabels the volume recursively.
470
474
* The secret token volume is relabeled by the container runtime, because Secret and Projected volumes do not support SELinux mount.
* Since there is no `SELinuxChangePolicy` set, kubelet implies `MountOption`.
474
-
If the corresponding CSI driver (or in-tree volume plugin) support SELinux mount, the volume is mounted with `-o context=system_u:object_r:container_file_t:s0:c10,c0`.
478
+
If the corresponding CSI driver (or in-tree volume plugin) supports SELinux mount, kubelet fills the blanks in `seLinuxOptions` from the system defaults as described above, and the volume is mounted with `-o context=system_u:object_r:container_file_t:s0:c10,c0`.
475
479
* Otherwise, kubelet mounts the volume without any SELinux option + the container runtime relabels the volume recursively.
476
480
* The secret token volume is relabeled by the container runtime, because Secret and Projected volumes do not support SELinux mount.
477
481
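The default-filling and label translation described in the bullets above can be sketched roughly as follows. This is only an illustration, not kubelet's code: the `SELinuxOptions` dataclass, `FEDORA_DEFAULTS`, `FILE_TYPE_FOR` and `mount_context_option` are hypothetical names, and the process-type to file-type table is hard-coded here, whereas a real node would derive it from its SELinux policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SELinuxOptions:
    """Hypothetical stand-in for the seLinuxOptions fields of a Pod."""
    user: str = ""
    role: str = ""
    type: str = ""
    level: str = ""


# Assumed defaults for a Fedora-based node, taken from the text above.
FEDORA_DEFAULTS = SELinuxOptions(user="system_u", role="object_r", type="container_t")

# Assumed process-type -> file-type mapping (illustrative, hard-coded).
FILE_TYPE_FOR = {"container_t": "container_file_t"}


def fill_defaults(opts: SELinuxOptions, defaults: SELinuxOptions) -> SELinuxOptions:
    """Fill every empty field of opts from defaults."""
    return SELinuxOptions(
        user=opts.user or defaults.user,
        role=opts.role or defaults.role,
        type=opts.type or defaults.type,
        level=opts.level or defaults.level,
    )


def mount_context_option(opts: SELinuxOptions,
                         defaults: SELinuxOptions = FEDORA_DEFAULTS) -> str:
    """Build the value passed to `mount -o` for the given options."""
    filled = fill_defaults(opts, defaults)
    # Translate the process type to the matching file type, if known.
    file_type = FILE_TYPE_FOR.get(filled.type, filled.type)
    return f"context={filled.user}:{filled.role}:{file_type}:{filled.level}"


if __name__ == "__main__":
    print(mount_context_option(SELinuxOptions(level="s0:c10,c0")))
```

With only `level` set, the sketch produces the same `context=` value as the mount option quoted in the bullets above.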
@@ -602,7 +606,12 @@ Drawbacks:
602
606
* The controller may report a conflict when two Pods are scheduled to the same node, but they will run serially there.
603
607
For example, one pod is already being deleted and the other has just been scheduled there.
604
608
Kubelet's `volume_manager_selinux_volume_context_mismatch_warnings_total` metric is more accurate in this case.
605
-
609
+
* The controller cannot read the SELinux default container labels from the operating system.
610
+
KCM often runs in a container and does not have access to `/etc/selinux` on the worker nodes.
611
+
As a consequence, two labels that are equivalent from the SELinux point of view may be reported as different, such as these two `seLinuxOptions` snippets: `{"type": "container_t", "level": "s0:c10,c0"}` and `{"level": "s0:c10,c1"}`.
612
+
`container_t` is the default type label for containers on Fedora, so kubelet is able to fill it into `seLinuxOptions` when it is not set and see that they're equivalent.
613
+
KCM does not know the defaults on nodes and treats empty fields in `seLinuxOptions` as *incomparable* - it does not emit any event in the above example.
614
+
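One way to picture the "empty fields are not comparable" rule is the sketch below. It is an assumption about the general idea only, not the controller's actual code: `labels_conflict` and `FIELDS` are hypothetical names, and the sketch assumes that a field set on both sides is compared literally while a field missing on either side is skipped.

```python
from typing import Dict

# The four parts of an SELinux label, as used in seLinuxOptions.
FIELDS = ("user", "role", "type", "level")


def labels_conflict(a: Dict[str, str], b: Dict[str, str]) -> bool:
    """Return True only if some field is set on both sides and differs.

    A field that is empty or missing on either side cannot be compared
    without knowing the node defaults, so it is skipped.
    """
    for field in FIELDS:
        va, vb = a.get(field, ""), b.get(field, "")
        if va and vb and va != vb:
            return True
    return False


if __name__ == "__main__":
    # Type is missing on one side, so it is skipped; levels match.
    print(labels_conflict({"type": "container_t", "level": "s0:c1,c2"},
                          {"level": "s0:c1,c2"}))
    # Type is set on both sides and differs.
    print(labels_conflict({"type": "container_t"}, {"type": "spc_t"}))
```

The trade-off of such a rule is that a missing field never produces a false warning, at the cost of possibly missing a real conflict that only the node defaults would reveal.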
606
615
### Implementation phases
607
616
608
617
Because this feature changes Kubernetes behavior, we will first implement it only for cases where it can't break anything.
@@ -647,32 +656,47 @@ No existing / new tests for volume mounting there.
647
656
648
657
* Check no recursive `chcon` is done on a volume when not needed.
649
658
* Check recursive `chcon` is done on a volume when needed.
650
-
* Check that proper metric is emitted when kubelet can't start two pods with different SELinux labels using the same volume on the same node._
651
-
* These tests might use only CSI volumes, GCE PD in-tree volume plugin that we use for e2e tests might be already migrated to CSI by that time.
659
+
* Check that kubelet emits proper metrics when it can't start two pods with different SELinux labels using the same volume on the same node.
660
+
* Check that the SELinux warning controller emits events and the described metrics when pods conflict.
652
661
* Prepare e2e job that runs with SELinux in Enforcing mode.
653
662
* Done:
654
663
* https://testgrid.k8s.io/kops-k8s-ci#kops-aws-selinux: for features enabled by default.
655
-
* https://testgrid.k8s.io/kops-k8s-ci#kops-aws-selinux-alpha: for alpha features.
664
+
* https://testgrid.k8s.io/kops-k8s-ci#kops-aws-selinux-alpha: for all alpha features enabled.
665
+
* https://testgrid.k8s.io/kops-distro-rhel8#kops-aws-selinux-changepolicy: for `SELinuxChangePolicy` enabled + `SELinuxMount` disabled.
656
666
* https://testgrid.k8s.io/presubmits-kubernetes-nonblocking#pull-kubernetes-e2e-gce-storage-selinux: for PRs (needs explicit `/test ` in a PR).
657
667
668
+
All these e2e tests use only CSI volumes. All in-tree volume types that support SELinux and dynamic provisioning were migrated to CSI already.
669
+
658
670
### Graduation Criteria
659
671
660
672
* Alpha of Phase 1:
661
673
* Provided all tests defined above are passing and gated by the feature gate `SELinuxMountReadWriteOncePod` and set to a default of `false`.
662
674
* Documentation exists.
663
675
* Beta of Phase 1:
676
+
* E2e tests implemented + green.
664
677
* The feature gate is `true` by default.
665
678
* Evaluation:
666
679
* During the next release after Phase 1 is beta (= the feature is enabled by default), collect reports from users about possible breakage.
667
680
* KEP author has access to usage data from OpenShift, a Kubernetes distro that runs with SELinux in enforcing mode.
668
681
* Alpha of Phase 2 + 3:
669
682
* Implemented `SELinuxChangePolicy` **with a separate alpha feature gate `SELinuxChangePolicy`** as preparation for `SELinuxMount` feature gate graduation.
670
683
* Implemented SELinuxController.
671
-
* Beta of Phase 2, alpha of phase 3:
684
+
* Beta of Phase 2 + 3 (`SELinuxChangePolicy` is beta and enabled by default; `SELinuxMount` is beta, but disabled by default).
685
+
* E2e tests implemented + green.
672
686
* Telemetry numbers from OpenShift show that <5% of clusters would need to change any of their Pods.
673
-
* GA:
687
+
* This phase signals that the feature is ready for real testing.
688
+
Only non-breaking parts (`SELinuxChangePolicy`) are enabled by default.
689
+
Users willing to test `SELinuxMount` must enable it explicitly.
690
+
* GA of Phase 2 (`SELinuxChangePolicy` + `SELinuxMountReadWriteOncePod` are GA and locked to default, `SELinuxMount` is beta and disabled by default):
674
691
* All known issues fixed. Otherwise, we will GA Phase 1 only.
692
+
* Users can update their clusters safely; there is no breaking change yet.
693
+
Users willing to test `SELinuxMount` must enable it explicitly.
694
+
* This phase allows production clusters to check which Pods (Deployments, StatefulSets) need an update and fix them before the breaking part (`SELinuxMount`) is enabled by default in the next phase.
695
+
* GA of Phase 3 (`SELinuxMount` is GA and locked to default):
696
+
* At least 1 release after `SELinuxChangePolicy` is GA to give cluster admins enough time to apply `SELinuxChangePolicy` to their Pods.
675
697
* Telemetry numbers from OpenShift show that <2% of clusters would need to change any of their Pods (i.e. most clusters already applied opt-out).
698
+
* This is the phase that may break existing applications during cluster upgrade.
699
+
Users that use SELinux should carefully evaluate the metrics emitted by kubelet and SELinuxWarningController and fix their workloads before upgrading to this version.
676
700
677
701
### Upgrade / Downgrade Strategy
678
702
@@ -711,9 +735,9 @@ _This section must be completed when targeting alpha to a release._
711
735
* **How can this feature be enabled / disabled in a live cluster?**
712
736
- [X] Feature gate (also fill in values in `kep.yaml`)
713
737
- Feature gate name: `SELinuxMountReadWriteOncePod` (beta in 1.28)
714
-
- Feature gate name: `SELinuxChangePolicy`(alpha in 1.30)
738
+
- Feature gate name: `SELinuxChangePolicy` (alpha in 1.30, proposing beta in 1.33)
715
739
- To enable `SELinuxChangePolicy` feature gate, `SELinuxMountReadWriteOncePod` **must** be enabled too.
716
-
- Feature gate name: `SELinuxMount`(alpha in 1.30)
740
+
- Feature gate name: `SELinuxMount` (alpha in 1.30, proposing beta in 1.33)
717
741
- To enable `SELinuxMount` feature gate, `SELinuxMountReadWriteOncePod` and `SELinuxChangePolicy` **must** be enabled too.
718
742
- Components depending on the feature gate: apiserver (API validation only), kubelet
719
743
- [ ] Other
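The dependencies between the three feature gates listed above can be checked with a small helper like the one below. This is a minimal sketch for illustration only: `REQUIRES` and `missing_prerequisites` are hypothetical names, and nothing here reflects how Kubernetes components actually validate feature gates.

```python
from typing import Dict, List, Set

# Prerequisites of each gate, as stated in the list above.
REQUIRES = {
    "SELinuxMountReadWriteOncePod": (),
    "SELinuxChangePolicy": ("SELinuxMountReadWriteOncePod",),
    "SELinuxMount": ("SELinuxMountReadWriteOncePod", "SELinuxChangePolicy"),
}


def missing_prerequisites(enabled: Set[str]) -> Dict[str, List[str]]:
    """Map each enabled gate to the prerequisites that are not enabled."""
    missing = {}
    for gate in enabled:
        absent = [dep for dep in REQUIRES.get(gate, ()) if dep not in enabled]
        if absent:
            missing[gate] = absent
    return missing


if __name__ == "__main__":
    # SELinuxMount alone is an invalid combination.
    print(missing_prerequisites({"SELinuxMount"}))
    # All three gates together are consistent.
    print(missing_prerequisites(set(REQUIRES)))
```

An empty result means the chosen combination of gates satisfies the stated prerequisites.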
@@ -728,6 +752,7 @@ _This section must be completed when targeting alpha to a release._
728
752
automations, so be extremely careful here.
729
753
730
754
**Yes.** See [Conflict with other Pods](#conflicts-with-other-pods) for details.
755
+
We offer metrics + events + proactive opt-out per Pod before the breaking part (`SELinuxMount`) is enabled by default.
731
756
732
757
* **Can the feature be disabled once it has been enabled (i.e. can we rollback
733
758
the enablement)?**
@@ -896,7 +921,8 @@ previous answers based on experience in the field._
896
921
897
922
* **Will enabling / using this feature result in any new API calls?**
898
923
899
-
No new API calls are required. Kubelet / CSI volume plugin already has CSIDriver informer.
924
+
* No new API calls are required in kubelet; its CSI volume plugin already has a CSIDriver informer.
925
+
* KCM will emit new events when SELinuxWarningController is enabled. It already has Pod, PV, PVC, and CSIDriver informers and makes no other API calls.
900
926
901
927
* **Will enabling / using this feature result in introducing new API types?**
902
928
@@ -909,8 +935,9 @@ previous answers based on experience in the field._
909
935
910
936
* **Will enabling / using this feature result in increasing size or count of the existing API objects?**
911
937
912
-
CSIDriver gets one new field. We expect only few CSIDriver objects in a cluster.
913
-
PodSpec gets one new field, and we expect it to be `null` for the vast majority of Pods.
938
+
* CSIDriver gets one new field. We expect only a few CSIDriver objects in a cluster.
939
+
* PodSpec gets one new field, and we expect it to be `null` for the vast majority of Pods.
940
+
* Event(s) will be created for every conflicting Pod pair when SELinuxWarningController is enabled.
914
941
915
942
* **Will enabling / using this feature result in increasing time taken by any
916
943
operations covered by [existing SLIs/SLOs][]?**
@@ -927,7 +954,7 @@ previous answers based on experience in the field._
927
954
Think through this both in small and large cases, again with respect to the
928
955
[supported limits][].
929
956
930
-
No. Kubelet already has a cache of desired / existing mounts, we need to add
957
+
No. KCM and kubelet already have a cache of desired / existing mounts; we only need to add
931
958
a string with SELinux label to each one, which should be negligible.
932
959
933
960
* **Can enabling / using this feature result in resource exhaustion of some node
@@ -968,6 +995,7 @@ _This section must be completed when targeting beta graduation to a release._
968
995
969
996
- *Kubelet does not start new Pods*
970
997
- Detection: `volume_manager_selinux_container_errors_total`, `volume_manager_selinux_pod_context_mismatch_errors_total` or `volume_manager_selinux_volume_context_mismatch_errors_total` grows.
998
+
In addition, each such Pod has an event about SELinux label mismatch.
971
999
- Mitigations: What can be done to stop the bleeding, especially for already
972
1000
running user workloads?
973
1001
Workloads that run keep running, only new Pods can't start.
@@ -998,6 +1026,9 @@ _This section must be completed when targeting beta graduation to a release._
998
1026
* We discovered that sharing volumes between privileged and unprivileged containers as described [here](#privileged-containers) is a valid use case.
999
1027
We cannot mount *all* volumes with `-o context` and it must be an explicit opt-out using `SELinuxChangePolicy: Recursive`.
1000
1028
* Implement `SELinuxChangePolicy` as an alpha field.
1029
+
* 1.33: Graduate `SELinuxMount` to beta / disabled by default, `SELinuxChangePolicy` to beta / enabled by default.
1030
+
* Add e2e tests for the SELinuxWarningController.
1031
+
* Test on a non-Fedora-based Linux distribution (e.g. Debian) with SELinux enabled.