|
| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: "Enable seccomp for all workloads with a new v1.22 alpha feature" |
| 4 | +date: 2021-08-25 |
| 5 | +slug: seccomp-default |
| 6 | +--- |
| 7 | + |
| 8 | +**Author:** Sascha Grunert, Red Hat |
| 9 | + |
| 10 | +This blog post is about a new Kubernetes feature introduced in v1.22, which adds |
| 11 | +an additional security layer on top of the existing seccomp support. Seccomp is |
| 12 | +a security mechanism for Linux processes to filter system calls (syscalls) based |
| 13 | +on a set of defined rules. Applying seccomp profiles to containerized workloads |
| 14 | +is one of the key tasks when it comes to enhancing the security of the |
| 15 | +application deployment. Developers, site reliability engineers and |
| 16 | +infrastructure administrators have to work hand in hand to create, distribute |
| 17 | +and maintain the profiles over the application's life cycle. |
| 18 | + |
| 19 | +You can use the [`securityContext`][seccontext] field of Pods and their |
| 20 | +containers to adjust security-related configurations of the |
| 21 | +workload. Kubernetes introduced dedicated [seccomp related API |
| 22 | +fields][seccontext] in this `SecurityContext` with the [graduation of seccomp to |
| 23 | +General Availability (GA)][ga] in v1.19.0. This enhancement allowed an easier |
| 24 | +way to specify if the whole pod or a specific container should run as: |
| 25 | + |
| 26 | +[seccontext]: /docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1 |
| 27 | +[ga]: https://kubernetes.io/blog/2020/08/26/kubernetes-release-1.19-accentuate-the-paw-sitive/#graduated-to-stable |
| 28 | + |
| 29 | +- `Unconfined`: seccomp will not be enabled |
| 30 | +- `RuntimeDefault`: the container runtime's default profile will be used |
| 31 | +- `Localhost`: a node-local profile will be applied, which is referenced |
| 32 | + by a path relative to the seccomp profile root (`<kubelet-root-dir>/seccomp`) |
| 33 | + of the kubelet |
| 34 | + |
| 35 | +With the graduation of seccomp, nothing has changed from an overall security |
| 36 | +perspective, because `Unconfined` is still the default. This is totally fine if |
| 37 | +you consider this from the upgrade path and backwards compatibility perspective of |
| 38 | +Kubernetes releases. But it also means that it is more likely that a workload |
| 39 | +runs without seccomp at all, which should be fixed in the long term. |
| 40 | + |
| 41 | +## `SeccompDefault` to the rescue |
| 42 | + |
| 43 | +Kubernetes v1.22.0 introduces a new kubelet [feature gate][gate] |
| 44 | +`SeccompDefault`, which has been added in `alpha` state like every other new |
| 45 | +feature. This means that it is disabled by default and can be enabled manually |
| 46 | +for every single Kubernetes node. |
| 47 | + |
| 48 | +[gate]: /docs/reference/command-line-tools-reference/feature-gates |
| 49 | + |
| 50 | +What does the feature do? Well, it just changes the default seccomp profile from |
| 51 | +`Unconfined` to `RuntimeDefault`. If not specified differently in the pod |
| 52 | +manifest, then the feature will add a higher set of security constraints by |
| 53 | +using the default profile of the container runtime. These profiles may differ |
| 54 | +between runtimes like [CRI-O][crio] or [containerd][ctrd]. They also differ |
| 55 | +depending on the hardware architecture. But generally speaking, those default profiles |
| 56 | +allow a common set of syscalls while blocking the more dangerous ones, which |
| 57 | +are unlikely or unsafe to be used in a containerized application. |
| 58 | + |
| 59 | +[crio]: https://github.com/cri-o/cri-o/blob/fe30d62/vendor/github.com/containers/common/pkg/seccomp/default_linux.go#L45 |
| 60 | +[ctrd]: https://github.com/containerd/containerd/blob/e1445df/contrib/seccomp/seccomp_default.go#L51 |
| 61 | + |
| 62 | +### Enabling the feature |
| 63 | + |
| 64 | +Two kubelet configuration changes have to be made to enable the feature: |
| 65 | + |
| 66 | +1. **Enable the feature gate** by setting `SeccompDefault=true` via the command |
| 67 | + line (`--feature-gates`) or the [kubelet configuration][kubelet] file. |
| 68 | +2. **Turn on the feature** by adding the |
| 69 | + `--seccomp-default` command line flag or via the [kubelet |
| 70 | + configuration][kubelet] file (`seccompDefault: true`). |
| 71 | + |
| 72 | +[kubelet]: /docs/tasks/administer-cluster/kubelet-config-file |
| 73 | + |
| 74 | +The kubelet will error on startup if only one of the above steps has been done. |
| 75 | + |
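The two steps above can be sketched as a single kubelet configuration file. This is a minimal example, assuming the `kubelet.config.k8s.io/v1beta1` `KubeletConfiguration` format; the two settings would normally be merged into the node's existing configuration rather than replacing it:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Step 1: enable the alpha feature gate
featureGates:
  SeccompDefault: true
# Step 2: turn on the feature itself
seccompDefault: true
```

If command line flags are preferred, the equivalent would be passing both `--feature-gates=SeccompDefault=true` and `--seccomp-default` to the kubelet.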
| 76 | +### Trying it out |
| 77 | + |
| 78 | +If the feature is enabled on a node, then you can create a new workload like |
| 79 | +this: |
| 80 | + |
| 81 | +```yaml |
| 82 | +apiVersion: v1 |
| 83 | +kind: Pod |
| 84 | +metadata: |
| 85 | +  name: test-pod |
| 86 | +spec: |
| 87 | +  containers: |
| 88 | +    - name: test-container |
| 89 | +      image: nginx:1.21 |
| 90 | +``` |
| 91 | + |
| 92 | +Now it is possible to inspect the used seccomp profile by using |
| 93 | +[`crictl`][crictl] while investigating the container's [runtime |
| 94 | +specification][rspec]: |
| 95 | + |
| 96 | +[crictl]: https://github.com/kubernetes-sigs/cri-tools |
| 97 | +[rspec]: https://github.com/opencontainers/runtime-spec/blob/0c021c1/config-linux.md#seccomp |
| 98 | + |
| 99 | +```bash |
| 100 | +CONTAINER_ID=$(sudo crictl ps -q --name=test-container) |
| 101 | +sudo crictl inspect $CONTAINER_ID | jq .info.runtimeSpec.linux.seccomp |
| 102 | +``` |
| 103 | + |
| 104 | +```json |
| 105 | +{ |
| 106 | +  "defaultAction": "SCMP_ACT_ERRNO", |
| 107 | +  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"], |
| 108 | +  "syscalls": [ |
| 109 | +    { |
| 110 | +      "names": ["_llseek", "_newselect", "accept", …, "write", "writev"], |
| 111 | +      "action": "SCMP_ACT_ALLOW" |
| 112 | +    }, |
| 113 | +    … |
| 114 | +  ] |
| 115 | +} |
| 116 | +``` |
| 117 | + |
| 118 | +You can see that the lower-level container runtime ([CRI-O][crio-home] and |
| 119 | +[runc][runc] in our case) successfully applied the default seccomp profile. |
| 120 | +This profile denies all syscalls by default, while allowing commonly used ones |
| 121 | +like [`accept`][accept] or [`write`][write]. |
| 122 | + |
| 123 | +[crio-home]: https://github.com/cri-o/cri-o |
| 124 | +[runc]: https://github.com/opencontainers/runc |
| 125 | +[accept]: https://man7.org/linux/man-pages/man2/accept.2.html |
| 126 | +[write]: https://man7.org/linux/man-pages/man2/write.2.html |
| 127 | + |
| 128 | +Please note that the feature will not influence any Kubernetes API for now. |
| 129 | +Therefore, it is not possible to retrieve the used seccomp profile via `kubectl` |
| 130 | +`get` or `describe` if the [`SeccompProfile`][api] field is unset within the |
| 131 | +`SecurityContext`. |
| 132 | + |
| 133 | +[api]: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1 |
| 134 | + |
| 135 | +The feature also works when using multiple containers within a pod, for example |
| 136 | +if you create a pod like this: |
| 137 | + |
| 138 | +```yaml |
| 139 | +apiVersion: v1 |
| 140 | +kind: Pod |
| 141 | +metadata: |
| 142 | +  name: test-pod |
| 143 | +spec: |
| 144 | +  containers: |
| 145 | +    - name: test-container-nginx |
| 146 | +      image: nginx:1.21 |
| 147 | +      securityContext: |
| 148 | +        seccompProfile: |
| 149 | +          type: Unconfined |
| 150 | +    - name: test-container-redis |
| 151 | +      image: redis:6.2 |
| 152 | +``` |
| 153 | + |
| 154 | +then you should see that the `test-container-nginx` runs without a seccomp profile: |
| 155 | + |
| 156 | +```bash |
| 157 | +sudo crictl inspect $(sudo crictl ps -q --name=test-container-nginx) | |
| 158 | + jq '.info.runtimeSpec.linux.seccomp == null' |
| 159 | +true |
| 160 | +``` |
| 161 | + |
| 162 | +Whereas the container `test-container-redis` runs with `RuntimeDefault`: |
| 163 | + |
| 164 | +```bash |
| 165 | +sudo crictl inspect $(sudo crictl ps -q --name=test-container-redis) | |
| 166 | + jq '.info.runtimeSpec.linux.seccomp != null' |
| 167 | +true |
| 168 | +``` |
| 169 | + |
| 170 | +The same applies to the pod itself, which also runs with the default profile: |
| 171 | + |
| 172 | +```bash |
| 173 | +sudo crictl inspectp $(sudo crictl pods -q --name test-pod) | |
| 174 | +  jq '.info.runtimeSpec.linux.seccomp != null' |
| 175 | +true |
| 176 | +``` |
| 177 | + |
| 178 | +### Upgrade strategy |
| 179 | + |
| 180 | +It is recommended to enable the feature in multiple steps, with different |
| 181 | +risks and mitigations for each one. |
| 182 | + |
| 183 | +#### Feature gate enabling |
| 184 | + |
| 185 | +Enabling the feature gate at the kubelet level will not turn on the feature, but |
| 186 | +will make it possible to do so by using the `SeccompDefault` kubelet configuration or the |
| 187 | +`--seccomp-default` CLI flag. This can be done by an administrator for the whole |
| 188 | +cluster or only a set of nodes. |
| 189 | + |
| 190 | +#### Testing the application |
| 191 | + |
| 192 | +If you're trying this within a dedicated test environment, you have to ensure |
| 193 | +that the application code does not trigger syscalls blocked by the |
| 194 | +`RuntimeDefault` profile before enabling the feature on a node. This can be done |
| 195 | +by: |
| 196 | + |
| 197 | +- _Recommended_: Analyzing the code (manually or by running the application with |
| 198 | + [strace][strace]) for any executed syscalls which may be blocked by the |
| 199 | + default profiles. If that's the case, then you can override the default by |
| 200 | + explicitly setting the pod or container to run as `Unconfined`. Alternatively, |
| 201 | + you can create a custom seccomp profile (see optional step below) based on |
| 202 | + the default by adding the additional syscalls to the |
| 203 | + `"action": "SCMP_ACT_ALLOW"` section. |
| 204 | + |
| 205 | +- _Recommended_: Manually set the profile on the target workload and use a |
| 206 | + rolling upgrade to deploy into production. Roll back the deployment if the |
| 207 | + application does not work as intended. |
| 208 | + |
| 209 | +- _Optional_: Run the application against an end-to-end test suite to trigger |
| 210 | + all relevant code paths with `RuntimeDefault` enabled. If a test fails, use |
| 211 | + the same mitigation as mentioned above. |
| 212 | + |
| 213 | +- _Optional_: Create a custom seccomp profile based on the default and change |
| 214 | + its default action from `SCMP_ACT_ERRNO` to `SCMP_ACT_LOG`. This means that |
| 215 | + the seccomp filter for unknown syscalls will have no effect on the application |
| 216 | + at all, but the system logs will now indicate which syscalls may be blocked. |
| 217 | + This requires at least kernel version 4.14 as well as a recent [runc][runc] |
| 218 | + release. Monitor the application host's audit logs (defaults to |
| 219 | + `/var/log/audit/audit.log`) or syslog entries (defaults to `/var/log/syslog`) |
| 220 | + for syscalls via `type=SECCOMP` (for audit) or `type=1326` (for syslog). |
| 221 | + Compare the syscall ID with those [listed in the Linux Kernel |
| 222 | + sources][syscalls] and add them to the custom profile. Be aware that custom |
| 223 | + audit policies may lead to missing syscalls, depending on the configuration |
| 224 | + of auditd. |
| 225 | + |
| 226 | +- _Optional_: Use cluster additions like the [Security Profiles Operator][spo] |
| 227 | + for profiling the application via its [log enrichment][logs] capabilities or |
| 228 | + recording a profile by using its [recording feature][rec]. This makes the |
| 229 | + above-mentioned manual log investigation obsolete. |
| 230 | + |
| 231 | +[syscalls]: https://github.com/torvalds/linux/blob/7bb7f2a/arch/x86/entry/syscalls/syscall_64.tbl |
| 232 | +[spo]: https://github.com/kubernetes-sigs/security-profiles-operator |
| 233 | +[logs]: https://github.com/kubernetes-sigs/security-profiles-operator/blob/c90ef3a/installation-usage.md#using-the-log-enricher |
| 234 | +[rec]: https://github.com/kubernetes-sigs/security-profiles-operator/blob/c90ef3a/installation-usage.md#record-profiles-from-workloads-with-profilerecordings |
| 235 | +[strace]: https://man7.org/linux/man-pages/man1/strace.1.html |
| 236 | + |
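For the manual rollout described in the list above, the profile can be set explicitly on the workload before the node-level default is switched on. The following is a sketch only, reusing the `test-pod` example from earlier in this post:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  securityContext:
    # Applies to all containers in the pod unless a container overrides it
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: test-container
      image: nginx:1.21
```

Setting the profile at the pod level keeps the manifest short; a single container can still opt out through its own `securityContext`.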
| 237 | +#### Deploying the modified application |
| 238 | + |
| 239 | +Based on the outcome of the application tests, it may be required to change the |
| 240 | +application deployment by either specifying `Unconfined` or a custom seccomp |
| 241 | +profile. This is not the case if the application works as intended with |
| 242 | +`RuntimeDefault`. |
| 243 | + |
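As an illustration of the custom profile case, a container can reference a node-local profile via the `Localhost` type. This is a sketch under assumptions: the file name `profiles/custom.json` is hypothetical, and the profile has to exist below `<kubelet-root-dir>/seccomp` on every node that may run the pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: test-container
      image: nginx:1.21
      securityContext:
        seccompProfile:
          type: Localhost
          # Hypothetical path, relative to <kubelet-root-dir>/seccomp on the node
          localhostProfile: profiles/custom.json
```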
| 244 | +#### Enable the kubelet configuration |
| 245 | + |
| 246 | +If everything went well, then the feature is ready to be enabled by the kubelet |
| 247 | +configuration or its corresponding CLI flag. This should be done on a per-node |
| 248 | +basis to reduce the overall risk of missing a syscall during the investigations |
| 249 | +when running the application tests. If it's possible to monitor audit logs |
| 250 | +within the cluster, then it's recommended to do this to catch any seccomp |
| 251 | +events that may have been missed. If the application works as intended then the feature can be |
| 252 | +enabled for further nodes within the cluster. |
| 253 | + |
| 254 | +## Conclusion |
| 255 | + |
| 256 | +Thank you for reading this blog post! I hope you enjoyed seeing how the usage of |
| 257 | +seccomp profiles has evolved in Kubernetes over the past releases as much |
| 258 | +as I did. On your own cluster, change the default seccomp profile to |
| 259 | +`RuntimeDefault` (using this new feature) and see the security benefits, and, of |
| 260 | +course, feel free to reach out any time for feedback or questions. |
| 261 | + |
| 262 | +--- |
| 263 | + |
| 264 | +_Editor's note: If you have any questions or feedback about this blog post, feel |
| 265 | +free to reach out via the [Kubernetes Slack in #sig-node][slack]._ |
| 266 | + |
| 267 | +[slack]: https://kubernetes.slack.com/messages/sig-node |