Merge pull request #43214 from shannonxtreme/apparmor-seccomp

k8s-ci-robot · web-flow · commit ea4444a8498b · 2024-04-30T04:56:15.000-07:00
Add new page for kernel-level constraints
diff --git a/content/en/docs/concepts/security/linux-kernel-security-constraints.md b/content/en/docs/concepts/security/linux-kernel-security-constraints.md
@@ -0,0 +1,290 @@
+---
+title: Linux kernel security constraints for Pods and containers
+description: >
+  Overview of Linux kernel security modules and constraints that you can use to
+  harden your Pods and containers.
+content_type: concept
+weight: 100
+---
+
+<!-- overview -->
+
+This page describes some of the security features that are built into the Linux
+kernel that you can use in your Kubernetes workloads. To learn how to apply
+these features to your Pods and containers, refer to
+[Configure a SecurityContext for a Pod or Container](/docs/tasks/configure-pod-container/security-context/).
+You should already be familiar with Linux and with the basics of Kubernetes
+workloads.
+
+<!-- body -->
+
+## Run workloads without root privileges {#run-without-root}
+
+When you deploy a workload in Kubernetes, use the Pod specification to restrict
+that workload from running as the root user on the node. You can use the Pod
+`securityContext` to define the specific Linux user and group for the processes in
+the Pod, and explicitly restrict containers from running as root users. Setting
+these values in the Pod manifest takes precedence over similar values in the
+container image, which is especially useful if you're running images that you
+don't own.
+
+{{< caution >}}
+Ensure that the user or group that you assign to the workload has the permissions
+required for the application to function correctly. Changing the user or group
+to one that doesn't have the correct permissions could lead to file access
+issues or failed operations.
+{{< /caution >}}
+
+Configuring the kernel security features on this page provides fine-grained
+control over the actions that processes in your cluster can take, but managing
+these configurations can be challenging at scale. Running containers as
+non-root, or in user namespaces if you need root privileges, helps to reduce the
+chance that you'll need to enforce your configured kernel security capabilities.
+
+## Security features in the Linux kernel {#linux-security-features}
+
+Kubernetes lets you configure and use Linux kernel features to improve isolation
+and harden your containerized workloads. Common features include the following:
+
+* **Secure computing mode (seccomp)**: Filter which system calls a process can
+  make
+* **AppArmor**: Restrict the access privileges of individual programs
+* **Security Enhanced Linux (SELinux)**: Assign security labels to objects for
+  more manageable security policy enforcement
+
+To configure settings for one of these features, the operating system that you
+choose for your nodes must enable the feature in the kernel. For example,
+Ubuntu 7.10 and later enable AppArmor by default. To learn whether your OS
+enables a specific feature, consult the OS documentation.
+
+You use the `securityContext` field in your Pod specification to define the
+constraints that apply to those processes. The `securityContext` field also
+supports other security settings, such as specific Linux capabilities or file
+access permissions using UIDs and GIDs. To learn more, refer to
+[Configure a SecurityContext for a Pod or Container](/docs/tasks/configure-pod-container/security-context/).
+
+### seccomp
+
+Some of your workloads might need privileges to perform specific actions as the
+root user on your node's host machine. Linux uses *capabilities* to divide the
+available privileges into categories, so that processes can get the privileges
+required to perform specific actions without being granted all privileges. Each
+capability has a set of system calls (syscalls) that a process can make. seccomp
+lets you restrict these individual syscalls. <!--Copied from seccomp tutorial-->
+It can be used to sandbox the privileges of a process, restricting the calls it
+is able to make from userspace into the kernel.<!--End copy-->
+
+In Kubernetes, you use a *container runtime* on each node to run your
+containers. Example runtimes include CRI-O, Docker, or containerd. Each runtime
+allows only a subset of Linux capabilities by default. You can further limit the
+allowed syscalls individually by using a seccomp profile. Container runtimes
+usually include a default seccomp profile. <!--Copied from seccomp tutorial-->
+Kubernetes lets you automatically
+apply seccomp profiles loaded onto a node to your Pods and containers.<!--End copy-->
+
+{{<note>}}
+Kubernetes also has the `allowPrivilegeEscalation` setting for Pods and
+containers. When set to `false`, this prevents processes from gaining new
+capabilities and restricts unprivileged users from changing the applied seccomp
+profile to a more permissive profile.
+{{</note>}}
+
+To learn how to implement seccomp in Kubernetes, refer to
+[Restrict a Container's Syscalls with seccomp](/docs/tutorials/security/seccomp/).
+
+To learn more about seccomp, see
+[Seccomp BPF](https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html)
+in the Linux kernel documentation.
+
+#### Considerations for seccomp {#seccomp-considerations}
+
+seccomp is a low-level security configuration that you should only configure
+yourself if you require fine-grained control over Linux syscalls. Using
+seccomp, especially at scale, has the following risks:
+
+* Configurations might break during application updates
+* Attackers can still use allowed syscalls to exploit vulnerabilities
+* Profile management for individual applications becomes challenging at scale
+
+**Recommendation**: Use the default seccomp profile that's bundled with your
+container runtime. If you need a more isolated environment, consider using a
+sandbox, such as gVisor. Sandboxes solve the preceding risks with custom
+seccomp profiles, but require more compute resources on your nodes and might
+have compatibility issues with GPUs and other specialized hardware.
+
+### AppArmor and SELinux: policy-based mandatory access control {#policy-based-mac}
+
+You can use Linux policy-based mandatory access control (MAC) mechanisms, such
+as AppArmor and SELinux, to harden your Kubernetes workloads.
+
+#### AppArmor
+
+<!-- Original text from https://kubernetes.io/docs/tutorials/security/apparmor/ -->
+
+[AppArmor](https://apparmor.net/) is a Linux kernel security module that
+supplements the standard Linux user and group based permissions to confine
+programs to a limited set of resources. AppArmor can be configured for any
+application to reduce its potential attack surface and provide greater in-depth
+defense. It is configured through profiles tuned to allow the access needed by a
+specific program or container, such as Linux capabilities, network access, and
+file permissions. Each profile can be run in either enforcing mode, which blocks
+access to disallowed resources, or complain mode, which only reports violations.
+
+AppArmor can help you to run a more secure deployment by restricting what
+containers are allowed to do, and/or provide better auditing through system
+logs. The container runtime that you use might ship with a default AppArmor
+profile, or you can use a custom profile.
+
+To learn how to use AppArmor in Kubernetes, refer to
+[Restrict a Container's Access to Resources with AppArmor](/docs/tutorials/security/apparmor/).
+
+#### SELinux
+
+SELinux is a Linux kernel security module that lets you restrict the access
+that a specific *subject*, such as a process, has to the files on your system.
+You define security policies that apply to subjects that have specific SELinux
+labels. When a process that has an SELinux label attempts to access a file, the
+SELinux server checks whether that process' security policy allows the access
+and makes an authorization decision.
+
+In Kubernetes, you can set an SELinux label in the `securityContext` field of
+your manifest. The specified labels are assigned to those processes. If you
+have configured security policies that affect those labels, the host OS kernel
+enforces these policies.
+
+To learn how to use SELinux in Kubernetes, refer to
+[Assign SELinux labels to a container](/docs/tasks/configure-pod-container/security-context/#assign-selinux-labels-to-a-container).
+
+#### Differences between AppArmor and SELinux {#apparmor-selinux-diff}
+
+The operating system on your Linux nodes usually includes one of either
+AppArmor or SELinux. Both mechanisms provide similar types of protection, but
+have differences such as the following:
+
+* **Configuration**: AppArmor uses profiles to define access to resources.
+  SELinux uses policies that apply to specific labels.
+* **Policy application**: In AppArmor, you define resources using file paths.
+  SELinux uses the index node (inode) of a resource to identify the resource.
+
+### Summary of features {#summary}
+
+The following table describes the use cases and scope of each security control.
+You can use all of these controls together to build a more hardened system.
+
+<table>
+  <caption>Summary of Linux kernel security features</caption>
+  <thead>
+    <tr>
+      <th>Security feature</th>
+      <th>Description</th>
+      <th>How to use</th>
+      <th>Example</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>seccomp</td>
+      <td>Restrict individual kernel calls in the userspace. Reduces the
+      likelihood that a vulnerability that uses a restricted syscall would
+      compromise the system.</td>
+      <td>Specify a loaded seccomp profile in the Pod or container specification
+      to apply its constraints to the processes in the Pod.</td>
+      <td>Reject the <code>unshare</code> syscall, which was used in
+      <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-0185">CVE-2022-0185</a>.</td>
+    </tr>
+    <tr>
+      <td>AppArmor</td>
+      <td>Restrict program access to specific resources. Reduces the attack
+      surface of the program. Improves audit logging.</td>
+      <td>Specify a loaded AppArmor profile in the container specification.</td>
+      <td>Restrict a read-only program from writing to any file path
+      in the system.</td>
+    </tr>
+    <tr>
+      <td>SELinux</td>
+      <td>Restrict access to resources such as files, applications, ports, and
+      processes using labels and security policies.</td>
+      <td>Specify access restrictions for specific labels. Tag processes with
+      those labels to enforce the access restrictions related to the label.</td>
+      <td>Restrict a container from accessing files outside its own filesystem.</td>
+    </tr>
+  </tbody>
+</table>
+
+{{< note >}}
+Mechanisms like AppArmor and SELinux can provide protection that extends beyond
+the container. For example, you can use SELinux to help mitigate
+[CVE-2019-5736](https://access.redhat.com/security/cve/cve-2019-5736).
+{{< /note >}}
+
+### Considerations for managing custom configurations {#considerations-custom-configurations}
+
+seccomp, AppArmor, and SELinux usually have a default configuration that offers
+basic protections.  You can also create custom profiles and policies that meet
+the requirements of your workloads. Managing and distributing these custom
+configurations at scale might be challenging, especially if you use all three
+features together. To help you to manage these configurations at scale, use a
+tool like the
+[Kubernetes Security Profiles Operator](https://github.com/kubernetes-sigs/security-profiles-operator).
+
+## Kernel-level security features and privileged containers {#kernel-security-features-privileged-containers}
+
+Kubernetes lets you specify that some trusted containers can run in
+*privileged* mode. Any container in a Pod can run in privileged mode to use
+operating system administrative capabilities that would otherwise be
+inaccessible. This is available for both Windows and Linux.
+
+Privileged containers explicitly override some of the Linux kernel constraints
+that you might use in your workloads, as follows:
+
+* **seccomp**: Privileged containers run as the `Unconfined` seccomp profile,
+  overriding any seccomp profile that you specified in your manifest.
+* **AppArmor**: Privileged containers ignore any applied AppArmor profiles.
+* **SELinux**: Privileged containers run as the `unconfined_t` domain.
+
+### Privileged containers {#privileged-containers}
+
+<!-- Content from https://kubernetes.io/docs/concepts/workloads/pods/#privileged-mode-for-containers  -->
+
+Any container in a Pod can enable *Privileged mode* if you set the
+`privileged: true` field in the
+[`securityContext`](/docs/tasks/configure-pod-container/security-context/)
+field for the container. Privileged containers override or undo many other hardening settings such as the applied seccomp profile, AppArmor profile, or
+SELinux constraints. Privileged containers are given all Linux capabilities,
+including capabilities that they don't require. For example, a root user in a
+privileged container might be able to use the `CAP_SYS_ADMIN` and
+`CAP_NET_ADMIN` capabilities on the node, bypassing the runtime seccomp
+configuration and other restrictions.
+
+In most cases, you should avoid using privileged containers, and instead grant
+the specific capabilities required by your container using the `capabilities`
+field in the `securityContext` field. Only use privileged mode if you have a
+capability that you can't grant with the securityContext. This is useful for
+containers that want to use operating system administrative capabilities such
+as manipulating the network stack or accessing hardware devices.
+
+In Kubernetes version 1.26 and later, you can also run Windows containers in a
+similarly privileged mode by setting the `windowsOptions.hostProcess` flag on
+the security context of the Pod spec. For details and instructions, see
+[Create a Windows HostProcess Pod](/docs/tasks/configure-pod-container/create-hostprocess-pod/).
+
+## Recommendations and best practices {#recommendations-best-practices}
+
+* Before configuring kernel-level security capabilities, you should consider
+  implementing network-level isolation. For more information, read the
+  [Security Checklist](/docs/concepts/security/security-checklist/#network-security).
+* Unless necessary, run Linux workloads as non-root by setting specific user and
+  group IDs in your Pod manifest and by specifying `runAsNonRoot: true`.
+
+Additionally, you can run workloads in user namespaces by setting
+`hostUsers: false` in your Pod manifest. This lets you run containers as root
+users in the user namespace, but as non-root users in the host namespace on the
+node. This is still in early stages of development and might not have the level
+of support that you need. For instructions, refer to
+[Use a User Namespace With a Pod](/docs/tasks/configure-pod-container/user-namespaces/).
+
+## {{% heading "whatsnext" %}}
+
+* [Learn how to use AppArmor](/docs/tutorials/security/apparmor/)
+* [Learn how to use seccomp](/docs/tutorials/security/seccomp/)
+* [Learn how to use SELinux](/docs/tasks/configure-pod-container/security-context/#assign-selinux-labels-to-a-container)
diff --git a/content/en/docs/concepts/workloads/pods/_index.md b/content/en/docs/concepts/workloads/pods/_index.md
@@ -276,30 +276,34 @@ Containers within the Pod see the system hostname as being the same as the confi
 `name` for the Pod. There's more about this in the [networking](/docs/concepts/cluster-administration/networking/)
 section.
 
-## Privileged mode for containers
-
-{{< note >}}
-Your {{< glossary_tooltip text="container runtime" term_id="container-runtime" >}} must support the concept of a privileged container for this setting to be relevant.
-{{< /note >}}
-
-Any container in a pod can run in privileged mode to use operating system administrative capabilities
-that would otherwise be inaccessible. This is available for both Windows and Linux.
-
-### Linux privileged containers
-
-In Linux, any container in a Pod can enable privileged mode using the `privileged` (Linux) flag
-on the [security context](/docs/tasks/configure-pod-container/security-context/) of the
-container spec. This is useful for containers that want to use operating system administrative
-capabilities such as manipulating the network stack or accessing hardware devices.
-
-### Windows privileged containers
-
-{{< feature-state for_k8s_version="v1.26" state="stable" >}}
-
-In Windows, you can create a [Windows HostProcess pod](/docs/tasks/configure-pod-container/create-hostprocess-pod) by setting the 
-`windowsOptions.hostProcess` flag on the security context of the pod spec. All containers in these
-pods must run as Windows HostProcess containers. HostProcess pods run directly on the host and can also be used
-to perform administrative tasks as is done with Linux privileged containers.
+## Pod security settings {#pod-security}
+
+To set security constraints on Pods and containers, you use the
+`securityContext` field in the Pod specification. This field gives you
+granular control over what a Pod or individual containers can do. For example:
+
+* Drop specific Linux capabilities to avoid the impact of a CVE.
+* Force all processes in the Pod to run as a non-root user or as a specific
+  user or group ID.
+* Set a specific seccomp profile.
+* Set Windows security options, such as whether containers run as HostProcess.
+
+{{< caution >}}
+You can also use the Pod securityContext to enable
+[_privileged mode_](/docs/concepts/security/linux-kernel-security-constraints/#privileged-containers)
+in Linux containers. Privileged mode overrides many of the other security
+settings in the securityContext. Avoid using this setting unless you can't grant
+the equivalent permissions by using other fields in the securityContext.
+In Kubernetes 1.26 and later, you can run Windows containers in a similarly
+privileged mode by setting the `windowsOptions.hostProcess` flag on the
+security context of the Pod spec. For details and instructions, see
+[Create a Windows HostProcess Pod](/docs/tasks/configure-pod-container/create-hostprocess-pod/).
+{{< /caution >}}
+
+* To learn about kernel-level security constraints that you can use,
+  see [Linux kernel security constraints for Pods and containers](/docs/concepts/security/linux-kernel-security-constraints).
+* To learn more about the Pod security context, see
+  [Configure a Security Context for a Pod or Container](/docs/tasks/configure-pod-container/security-context/).
 
 ## Static Pods
 
diff --git a/content/en/docs/tutorials/security/apparmor.md b/content/en/docs/tutorials/security/apparmor.md
@@ -10,22 +10,10 @@ weight: 30
 
 {{< feature-state feature_gate_name="AppArmor" >}}
 
-
-[AppArmor](https://apparmor.net/) is a Linux kernel security module that supplements the standard Linux user and group based
-permissions to confine programs to a limited set of resources. AppArmor can be configured for any
-application to reduce its potential attack surface and provide greater in-depth defense. It is
-configured through profiles tuned to allow the access needed by a specific program or container,
-such as Linux capabilities, network access, file permissions, etc. Each profile can be run in either
-*enforcing* mode, which blocks access to disallowed resources, or *complain* mode, which only reports
-violations.
-
-On Kubernetes, AppArmor can help you to run a more secure deployment by restricting what containers are allowed to
-do, and/or provide better auditing through system logs. However, it is important to keep in mind
-that AppArmor is not a silver bullet and can only do so much to protect against exploits in your
-application code. It is important to provide good, restrictive profiles, and harden your
-applications and cluster from other angles as well.
-
-
+This page shows you how to load AppArmor profiles on your nodes and enforce
+those profiles in Pods. To learn more about how Kubernetes can confine Pods using
+AppArmor, see
+[Linux kernel security constraints for Pods and containers](/docs/concepts/security/linux-kernel-security-constraints/#apparmor).
 
 ## {{% heading "objectives" %}}