|
| 1 | +--- |
| 2 | +title: Linux kernel security constraints for Pods and containers |
| 3 | +description: > |
| 4 | + Overview of Linux kernel security modules and constraints that you can use to |
| 5 | + harden your Pods and containers. |
| 6 | +content_type: concept |
| 7 | +weight: 100 |
| 8 | +--- |
| 9 | + |
| 10 | +<!-- overview --> |
| 11 | + |
| 12 | +This page describes some of the security features that are built into the Linux |
| 13 | +kernel that you can use in your Kubernetes workloads. To learn how to apply |
| 14 | +these features to your Pods and containers, refer to |
| 15 | +[Configure a SecurityContext for a Pod or Container](/docs/tasks/configure-pod-container/security-context/). |
| 16 | +You should already be familiar with Linux and with the basics of Kubernetes |
| 17 | +workloads. |
| 18 | + |
| 19 | +<!-- body --> |
| 20 | + |
| 21 | +## Run workloads without root privileges {#run-without-root} |
| 22 | + |
| 23 | +When you deploy a workload in Kubernetes, use the Pod specification to restrict |
| 24 | +that workload from running as the root user on the node. You can use the Pod |
| 25 | +`securityContext` to define the specific Linux user and group for the processes in |
| 26 | +the Pod, and explicitly restrict containers from running as root users. Setting |
| 27 | +these values in the Pod manifest takes precedence over similar values in the |
| 28 | +container image, which is especially useful if you're running images that you |
| 29 | +don't own. |
| 30 | + |
| 31 | +{{< caution >}} |
| 32 | +Ensure that the user or group that you assign to the workload has the permissions |
| 33 | +required for the application to function correctly. Changing the user or group |
| 34 | +to one that doesn't have the correct permissions could lead to file access |
| 35 | +issues or failed operations. |
| 36 | +{{< /caution >}} |
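| | +
| | +As a minimal sketch of the settings described in this section, the following
| | +manifest runs every container in a Pod as a non-root user and group. The Pod
| | +name, image, UID, and GID are placeholders chosen for illustration; pick values
| | +that have the permissions your application needs:
| | +
| | +```yaml
| | +apiVersion: v1
| | +kind: Pod
| | +metadata:
| | +  name: non-root-example              # hypothetical name
| | +spec:
| | +  securityContext:
| | +    runAsNonRoot: true                # containers that would run as UID 0 are not started
| | +    runAsUser: 1000                   # assumed UID; overrides the user set in the image
| | +    runAsGroup: 3000                  # assumed GID
| | +  containers:
| | +  - name: app
| | +    image: registry.example/app:1.0   # placeholder image
| | +```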
| 37 | + |
| 38 | +Configuring the kernel security features on this page provides fine-grained |
| 39 | +control over the actions that processes in your cluster can take, but managing |
| 40 | +these configurations can be challenging at scale. Running containers as |
| 41 | +non-root, or in user namespaces if you need root privileges, helps to reduce the |
| 42 | +chance that you'll need to enforce your configured kernel security capabilities. |
| 43 | + |
| 44 | +## Security features in the Linux kernel {#linux-security-features} |
| 45 | + |
| 46 | +Kubernetes lets you configure and use Linux kernel features to improve isolation |
| 47 | +and harden your containerized workloads. Common features include the following: |
| 48 | + |
| 49 | +* **Secure computing mode (seccomp)**: Filter which system calls a process can |
| 50 | + make |
| 51 | +* **AppArmor**: Restrict the access privileges of individual programs |
| 52 | +* **Security Enhanced Linux (SELinux)**: Assign security labels to objects for |
| 53 | + more manageable security policy enforcement |
| 54 | + |
| 55 | +To configure settings for one of these features, the operating system that you |
| 56 | +choose for your nodes must enable the feature in the kernel. For example, |
| 57 | +Ubuntu 7.10 and later enable AppArmor by default. To learn whether your OS |
| 58 | +enables a specific feature, consult the OS documentation. |
| 59 | + |
| 60 | +You use the `securityContext` field in your Pod specification to define the |
| 61 | +constraints that apply to the processes in the Pod. The `securityContext` field also
| 62 | +supports other security settings, such as specific Linux capabilities or file |
| 63 | +access permissions using UIDs and GIDs. To learn more, refer to |
| 64 | +[Configure a SecurityContext for a Pod or Container](/docs/tasks/configure-pod-container/security-context/). |
| 65 | + |
| 66 | +### seccomp |
| 67 | + |
| 68 | +Some of your workloads might need privileges to perform specific actions as the |
| 69 | +root user on your node's host machine. Linux uses *capabilities* to divide the |
| 70 | +available privileges into categories, so that processes can get the privileges |
| 71 | +required to perform specific actions without being granted all privileges. Each |
| 72 | +capability has a set of system calls (syscalls) that a process can make. seccomp |
| 73 | +lets you restrict these individual syscalls. <!--Copied from seccomp tutorial--> |
| 74 | +It can be used to sandbox the privileges of a process, restricting the calls it |
| 75 | +is able to make from userspace into the kernel.<!--End copy--> |
| 76 | + |
| 77 | +In Kubernetes, you use a *container runtime* on each node to run your |
| 78 | +containers. Example runtimes include CRI-O, Docker, and containerd. Each runtime
| 79 | +allows only a subset of Linux capabilities by default. You can further limit the |
| 80 | +allowed syscalls individually by using a seccomp profile. Container runtimes |
| 81 | +usually include a default seccomp profile. <!--Copied from seccomp tutorial--> |
| 82 | +Kubernetes lets you automatically |
| 83 | +apply seccomp profiles loaded onto a node to your Pods and containers.<!--End copy--> |
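| | +
| | +For illustration, a Pod might opt in to the container runtime's default seccomp
| | +profile as sketched below. The Pod name and image are placeholders; the
| | +`allowPrivilegeEscalation` setting is included only to show where it sits
| | +relative to the seccomp profile:
| | +
| | +```yaml
| | +apiVersion: v1
| | +kind: Pod
| | +metadata:
| | +  name: seccomp-default-example       # hypothetical name
| | +spec:
| | +  securityContext:
| | +    seccompProfile:
| | +      type: RuntimeDefault            # use the profile bundled with the container runtime
| | +  containers:
| | +  - name: app
| | +    image: registry.example/app:1.0   # placeholder image
| | +    securityContext:
| | +      allowPrivilegeEscalation: false # processes cannot gain new privileges
| | +```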
| 84 | + |
| 85 | +{{<note>}} |
| 86 | +Kubernetes also has the `allowPrivilegeEscalation` setting for Pods and |
| 87 | +containers. When set to `false`, this prevents processes from gaining new |
| 88 | +capabilities and restricts unprivileged users from changing the applied seccomp |
| 89 | +profile to a more permissive profile. |
| 90 | +{{</note>}} |
| 91 | + |
| 92 | +To learn how to implement seccomp in Kubernetes, refer to |
| 93 | +[Restrict a Container's Syscalls with seccomp](/docs/tutorials/security/seccomp/). |
| 94 | + |
| 95 | +To learn more about seccomp, see |
| 96 | +[Seccomp BPF](https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html) |
| 97 | +in the Linux kernel documentation. |
| 98 | + |
| 99 | +#### Considerations for seccomp {#seccomp-considerations} |
| 100 | + |
| 101 | +seccomp is a low-level security configuration that you should only configure |
| 102 | +yourself if you require fine-grained control over Linux syscalls. Using |
| 103 | +seccomp, especially at scale, has the following risks: |
| 104 | + |
| 105 | +* Configurations might break during application updates |
| 106 | +* Attackers can still use allowed syscalls to exploit vulnerabilities |
| 107 | +* Profile management for individual applications becomes challenging at scale |
| 108 | + |
| 109 | +**Recommendation**: Use the default seccomp profile that's bundled with your |
| 110 | +container runtime. If you need a more isolated environment, consider using a |
| 111 | +sandbox, such as gVisor. Sandboxes address the preceding risks of custom
| 112 | +seccomp profiles, but require more compute resources on your nodes and might
| 113 | +have compatibility issues with GPUs and other specialized hardware. |
| 114 | + |
| 115 | +### AppArmor and SELinux: policy-based mandatory access control {#policy-based-mac} |
| 116 | + |
| 117 | +You can use Linux policy-based mandatory access control (MAC) mechanisms, such |
| 118 | +as AppArmor and SELinux, to harden your Kubernetes workloads. |
| 119 | + |
| 120 | +#### AppArmor |
| 121 | + |
| 122 | +<!-- Original text from https://kubernetes.io/docs/tutorials/security/apparmor/ --> |
| 123 | + |
| 124 | +[AppArmor](https://apparmor.net/) is a Linux kernel security module that |
| 125 | +supplements the standard Linux user and group based permissions to confine |
| 126 | +programs to a limited set of resources. AppArmor can be configured for any |
| 127 | +application to reduce its potential attack surface and provide greater defense
| 128 | +in depth. It is configured through profiles tuned to allow the access needed by a
| 129 | +specific program or container, such as Linux capabilities, network access, and |
| 130 | +file permissions. Each profile can be run in either enforcing mode, which blocks |
| 131 | +access to disallowed resources, or complain mode, which only reports violations. |
| 132 | + |
| 133 | +AppArmor can help you to run a more secure deployment by restricting what |
| 134 | +containers are allowed to do, and/or provide better auditing through system |
| 135 | +logs. The container runtime that you use might ship with a default AppArmor |
| 136 | +profile, or you can use a custom profile. |
| 137 | + |
| 138 | +To learn how to use AppArmor in Kubernetes, refer to |
| 139 | +[Restrict a Container's Access to Resources with AppArmor](/docs/tutorials/security/apparmor/). |
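| | +
| | +The following sketch shows one way to request the container runtime's default
| | +AppArmor profile. It assumes a cluster running Kubernetes v1.30 or later, where
| | +the `appArmorProfile` field is available in the `securityContext`; earlier
| | +versions use an annotation instead, as described in the linked tutorial. The
| | +Pod name and image are placeholders:
| | +
| | +```yaml
| | +apiVersion: v1
| | +kind: Pod
| | +metadata:
| | +  name: apparmor-example              # hypothetical name
| | +spec:
| | +  containers:
| | +  - name: app
| | +    image: registry.example/app:1.0   # placeholder image
| | +    securityContext:
| | +      appArmorProfile:
| | +        type: RuntimeDefault          # or Localhost with localhostProfile for a custom profile
| | +```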
| 140 | + |
| 141 | +#### SELinux |
| 142 | + |
| 143 | +SELinux is a Linux kernel security module that lets you restrict the access |
| 144 | +that a specific *subject*, such as a process, has to the files on your system. |
| 145 | +You define security policies that apply to subjects that have specific SELinux |
| 146 | +labels. When a process that has an SELinux label attempts to access a file, the |
| 147 | +SELinux server checks whether that process's security policy allows the access
| 148 | +and makes an authorization decision. |
| 149 | + |
| 150 | +In Kubernetes, you can set an SELinux label in the `securityContext` field of |
| 151 | +your manifest. The specified labels are assigned to the processes in the Pod or container. If you
| 152 | +have configured security policies that affect those labels, the host OS kernel |
| 153 | +enforces these policies. |
| 154 | + |
| 155 | +To learn how to use SELinux in Kubernetes, refer to |
| 156 | +[Assign SELinux labels to a container](/docs/tasks/configure-pod-container/security-context/#assign-selinux-labels-to-a-container). |
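| | +
| | +As a rough sketch, an SELinux label can be set on a container as follows. The
| | +level value, Pod name, and image are illustrative assumptions; the label only
| | +has an effect if the node's operating system has SELinux enabled and a policy
| | +that applies to that label:
| | +
| | +```yaml
| | +apiVersion: v1
| | +kind: Pod
| | +metadata:
| | +  name: selinux-example               # hypothetical name
| | +spec:
| | +  containers:
| | +  - name: app
| | +    image: registry.example/app:1.0   # placeholder image
| | +    securityContext:
| | +      seLinuxOptions:
| | +        level: "s0:c123,c456"         # assumed level; choose one that matches your policy
| | +```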
| 157 | + |
| 158 | +#### Differences between AppArmor and SELinux {#apparmor-selinux-diff} |
| 159 | + |
| 160 | +The operating system on your Linux nodes usually includes either
| 161 | +AppArmor or SELinux. Both mechanisms provide similar types of protection, but |
| 162 | +have differences such as the following: |
| 163 | + |
| 164 | +* **Configuration**: AppArmor uses profiles to define access to resources. |
| 165 | + SELinux uses policies that apply to specific labels. |
| 166 | +* **Policy application**: In AppArmor, you define resources using file paths. |
| 167 | + SELinux uses the index node (inode) of a resource to identify the resource. |
| 168 | + |
| 169 | +### Summary of features {#summary} |
| 170 | + |
| 171 | +The following table describes the use cases and scope of each security control. |
| 172 | +You can use all of these controls together to build a more hardened system. |
| 173 | + |
| 174 | +<table> |
| 175 | + <caption>Summary of Linux kernel security features</caption> |
| 176 | + <thead> |
| 177 | + <tr> |
| 178 | + <th>Security feature</th> |
| 179 | + <th>Description</th> |
| 180 | + <th>How to use</th> |
| 181 | + <th>Example</th> |
| 182 | + </tr> |
| 183 | + </thead> |
| 184 | + <tbody> |
| 185 | + <tr> |
| 186 | + <td>seccomp</td> |
| 187 | + <td>Restrict the individual syscalls that userspace processes can make. Reduces the
| 188 | + likelihood that a vulnerability that uses a restricted syscall would |
| 189 | + compromise the system.</td> |
| 190 | + <td>Specify a loaded seccomp profile in the Pod or container specification |
| 191 | + to apply its constraints to the processes in the Pod.</td> |
| 192 | + <td>Reject the <code>unshare</code> syscall, which was used in |
| 193 | + <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-0185">CVE-2022-0185</a>.</td> |
| 194 | + </tr> |
| 195 | + <tr> |
| 196 | + <td>AppArmor</td> |
| 197 | + <td>Restrict program access to specific resources. Reduces the attack |
| 198 | + surface of the program. Improves audit logging.</td> |
| 199 | + <td>Specify a loaded AppArmor profile in the container specification.</td> |
| 200 | + <td>Restrict a read-only program from writing to any file path |
| 201 | + in the system.</td> |
| 202 | + </tr> |
| 203 | + <tr> |
| 204 | + <td>SELinux</td> |
| 205 | + <td>Restrict access to resources such as files, applications, ports, and |
| 206 | + processes using labels and security policies.</td> |
| 207 | + <td>Specify access restrictions for specific labels. Tag processes with |
| 208 | + those labels to enforce the access restrictions related to the label.</td> |
| 209 | + <td>Restrict a container from accessing files outside its own filesystem.</td> |
| 210 | + </tr> |
| 211 | + </tbody> |
| 212 | +</table> |
| 213 | + |
| 214 | +{{< note >}} |
| 215 | +Mechanisms like AppArmor and SELinux can provide protection that extends beyond |
| 216 | +the container. For example, you can use SELinux to help mitigate |
| 217 | +[CVE-2019-5736](https://access.redhat.com/security/cve/cve-2019-5736). |
| 218 | +{{< /note >}} |
| 219 | + |
| 220 | +### Considerations for managing custom configurations {#considerations-custom-configurations} |
| 221 | + |
| 222 | +seccomp, AppArmor, and SELinux usually have a default configuration that offers |
| 223 | +basic protections. You can also create custom profiles and policies that meet |
| 224 | +the requirements of your workloads. Managing and distributing these custom |
| 225 | +configurations at scale might be challenging, especially if you use all three |
| 226 | +features together. To help you to manage these configurations at scale, use a |
| 227 | +tool like the |
| 228 | +[Kubernetes Security Profiles Operator](https://github.com/kubernetes-sigs/security-profiles-operator). |
| 229 | + |
| 230 | +## Kernel-level security features and privileged containers {#kernel-security-features-privileged-containers} |
| 231 | + |
| 232 | +Kubernetes lets you specify that some trusted containers can run in |
| 233 | +*privileged* mode. Any container in a Pod can run in privileged mode to use |
| 234 | +operating system administrative capabilities that would otherwise be |
| 235 | +inaccessible. This is available for both Windows and Linux. |
| 236 | + |
| 237 | +Privileged containers explicitly override some of the Linux kernel constraints |
| 238 | +that you might use in your workloads, as follows: |
| 239 | + |
| 240 | +* **seccomp**: Privileged containers run with the `Unconfined` seccomp profile,
| 241 | + overriding any seccomp profile that you specified in your manifest. |
| 242 | +* **AppArmor**: Privileged containers ignore any applied AppArmor profiles. |
| 243 | +* **SELinux**: Privileged containers run as the `unconfined_t` domain. |
| 244 | + |
| 245 | +### Privileged containers {#privileged-containers} |
| 246 | + |
| 247 | +<!-- Content from https://kubernetes.io/docs/concepts/workloads/pods/#privileged-mode-for-containers --> |
| 248 | + |
| 249 | +Any container in a Pod can run in *privileged mode* if you set
| 250 | +`privileged: true` in the
| 251 | +[`securityContext`](/docs/tasks/configure-pod-container/security-context/)
| 252 | +of that container. Privileged containers override or undo many other hardening settings, such as the applied seccomp profile, AppArmor profile, or
| 253 | +SELinux constraints. Privileged containers are given all Linux capabilities, |
| 254 | +including capabilities that they don't require. For example, a root user in a |
| 255 | +privileged container might be able to use the `CAP_SYS_ADMIN` and |
| 256 | +`CAP_NET_ADMIN` capabilities on the node, bypassing the runtime seccomp |
| 257 | +configuration and other restrictions. |
| 258 | + |
| 259 | +In most cases, you should avoid using privileged containers, and instead grant |
| 260 | +the specific capabilities required by your container using the `capabilities` |
| 261 | +field in the `securityContext` field. Only use privileged mode if you need a
| 262 | +capability that you can't grant with the `securityContext`. Privileged mode is useful for
| 263 | +containers that need operating system administrative capabilities, such
| 264 | +as manipulating the network stack or accessing hardware devices.
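| | +
| | +As an illustration of granting a specific capability instead of using
| | +privileged mode, the sketch below drops all capabilities and adds back only
| | +`NET_ADMIN`. The choice of capability, the Pod name, and the image are
| | +assumptions for the example; grant only what your workload requires:
| | +
| | +```yaml
| | +apiVersion: v1
| | +kind: Pod
| | +metadata:
| | +  name: capabilities-example          # hypothetical name
| | +spec:
| | +  containers:
| | +  - name: app
| | +    image: registry.example/app:1.0   # placeholder image
| | +    securityContext:
| | +      privileged: false               # stay out of privileged mode
| | +      capabilities:
| | +        drop: ["ALL"]                 # start from no capabilities
| | +        add: ["NET_ADMIN"]            # Kubernetes omits the CAP_ prefix
| | +```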
| 265 | + |
| 266 | +In Kubernetes version 1.26 and later, you can also run Windows containers in a |
| 267 | +similarly privileged mode by setting the `windowsOptions.hostProcess` flag on |
| 268 | +the security context of the Pod spec. For details and instructions, see |
| 269 | +[Create a Windows HostProcess Pod](/docs/tasks/configure-pod-container/create-hostprocess-pod/). |
| 270 | + |
| 271 | +## Recommendations and best practices {#recommendations-best-practices} |
| 272 | + |
| 273 | +* Before configuring kernel-level security capabilities, you should consider |
| 274 | + implementing network-level isolation. For more information, read the |
| 275 | + [Security Checklist](/docs/concepts/security/security-checklist/#network-security). |
| 276 | +* Unless necessary, run Linux workloads as non-root by setting specific user and |
| 277 | + group IDs in your Pod manifest and by specifying `runAsNonRoot: true`. |
| 278 | + |
| 279 | +Additionally, you can run workloads in user namespaces by setting |
| 280 | +`hostUsers: false` in your Pod manifest. This lets you run containers as root |
| 281 | +users in the user namespace, but as non-root users in the host namespace on the |
| 282 | +node. This feature is still in the early stages of development and might not have the level
| 283 | +of support that you need. For instructions, refer to |
| 284 | +[Use a User Namespace With a Pod](/docs/tasks/configure-pod-container/user-namespaces/). |
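| | +
| | +A minimal sketch of a Pod that runs in a user namespace might look like the
| | +following. The Pod name and image are placeholders, and the example assumes
| | +that your cluster, container runtime, and node kernel all support user
| | +namespaces:
| | +
| | +```yaml
| | +apiVersion: v1
| | +kind: Pod
| | +metadata:
| | +  name: userns-example                # hypothetical name
| | +spec:
| | +  hostUsers: false                    # run the Pod in its own user namespace
| | +  containers:
| | +  - name: app
| | +    image: registry.example/app:1.0   # placeholder image
| | +    securityContext:
| | +      runAsUser: 0                    # root inside the user namespace, non-root on the node
| | +```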
| 285 | + |
| 286 | +## {{% heading "whatsnext" %}} |
| 287 | + |
| 288 | +* [Learn how to use AppArmor](/docs/tutorials/security/apparmor/) |
| 289 | +* [Learn how to use seccomp](/docs/tutorials/security/seccomp/) |
| 290 | +* [Learn how to use SELinux](/docs/tasks/configure-pod-container/security-context/#assign-selinux-labels-to-a-container) |