| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: "Finding suspicious syscalls with the seccomp notifier" |
| 4 | +date: 2022-12-02 |
| 5 | +slug: seccomp-notifier |
| 6 | +--- |
| 7 | + |
| 8 | +**Authors:** Sascha Grunert |
| 9 | + |
| 10 | +Debugging software in production is one of the biggest challenges we have to |
| 11 | +face in our containerized environments. Being able to understand the impact of |
| 12 | +the available security options, especially when it comes to configuring our |
| 13 | +deployments, is one of the key aspects of making the default security in |
| 14 | +Kubernetes stronger. We already have all that logging, tracing and metrics data |
| 15 | +at hand, but how do we assemble the information it provides into something |
| 16 | +human readable and actionable? |
| 17 | + |
| 18 | +[Seccomp][seccomp] is one of the standard mechanisms to protect a Linux-based |
| 19 | +Kubernetes application from malicious actions by interfering with its [system |
| 20 | +calls][syscalls]. This allows us to restrict the application to a defined set of |
| 21 | +actionable items, like modifying files or responding to HTTP requests. Knowing |
| 22 | +which set of syscalls is required to, for example, modify a local file, and |
| 23 | +linking that knowledge to the actual source code, is equally non-trivial. Seccomp |
| 24 | +profiles for Kubernetes have to be written in [JSON][json] and can be understood |
| 25 | +as an architecture-specific allow-list with superpowers, for example: |
| 26 | + |
| 27 | +[seccomp]: https://en.wikipedia.org/wiki/Seccomp |
| 28 | +[syscalls]: https://en.wikipedia.org/wiki/Syscall |
| 29 | +[json]: https://www.json.org |
| 30 | + |
| 31 | +```json |
| 32 | +{ |
| 33 | + "defaultAction": "SCMP_ACT_ERRNO", |
| 34 | + "defaultErrnoRet": 38, |
| 35 | + "defaultErrno": "ENOSYS", |
| 36 | + "syscalls": [ |
| 37 | + { |
| 38 | + "names": ["chmod", "chown", "open", "write"], |
| 39 | + "action": "SCMP_ACT_ALLOW" |
| 40 | + } |
| 41 | + ] |
| 42 | +} |
| 43 | +``` |
| 44 | + |
| 45 | +The above profile returns an error by default, because it specifies a |
| 46 | +`defaultAction` of `SCMP_ACT_ERRNO`. This means we have to allow a set of |
| 47 | +syscalls via `SCMP_ACT_ALLOW`, otherwise the application would not be able to do |
| 48 | +anything at all. Okay cool, to allow file operations, all we have to do is add a |
| 49 | +bunch of file-specific syscalls like `open` or `write`, and probably also allow |
| 50 | +changing the permissions via `chmod` and `chown`, right? |
| 51 | +Basically yes, but there are issues with the simplicity of that approach: |
| 52 | + |
| 53 | +Seccomp profiles need to include the minimum set of syscalls required to start |
| 54 | +the application. This also includes some syscalls from the lower level |
| 55 | +[Open Container Initiative (OCI)][oci] container runtime, for example |
| 56 | +[runc][runc] or [crun][crun]. Besides that, we can only guarantee the required |
| 57 | +syscalls for a very specific version of the runtimes and our application, |
| 58 | +because the code can change between releases. The same applies to the |
| 59 | +termination of the application as well as the target architecture we're |
| 60 | +deploying on. Features like executing commands within containers also require |
| 61 | +another subset of syscalls. Not to mention that there are multiple versions of |
| 62 | +syscalls doing slightly different things, and that seccomp profiles are able to |
| 63 | +filter on their arguments. It's also not always clearly visible to developers |
| 64 | +which syscalls are used by their own code, because they rely on |
| 65 | +programming language abstractions or frameworks. |
| 66 | + |
| 67 | +[oci]: https://opencontainers.org |
| 68 | +[runc]: https://github.com/opencontainers/runc |
| 69 | +[crun]: https://github.com/containers/crun |
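
To make the allow-list semantics of the example profile more concrete, here is a minimal Python sketch of how a profile maps a syscall name to an action. The helper `action_for` is hypothetical and only illustrates the lookup; a real seccomp filter is evaluated by the kernel and also takes architectures and argument conditions into account, which this sketch ignores.

```python
import json

# The example profile from this post, embedded so the sketch is self-contained.
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "defaultErrno": "ENOSYS",
  "syscalls": [
    {
      "names": ["chmod", "chown", "open", "write"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
""")


def action_for(profile, syscall):
    """Return the action a profile assigns to a syscall name (hypothetical helper)."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # No rule matched, so the profile's default action applies.
    return profile["defaultAction"]


for name in ("open", "write", "openat", "chroot"):
    print(f"{name}: {action_for(PROFILE, name)}")
```

Note that `openat` falls through to the default action even though `open` is allowed. An application whose libc opens files via `openat` would still fail under this profile, which is one instance of the "multiple versions of syscalls" problem described above.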
| 70 | + |
| 70 | +_How can we know which syscalls are even required then? Who should create and |
| 71 | +maintain those profiles during the application's development life cycle?_ |
| 73 | + |
| 73 | +Well, recording and distributing seccomp profiles is one of the problem domains |
| 74 | +of the [Security Profiles Operator][spo], which already solves that. The |
| 75 | +operator is able to record [seccomp][seccomp], [SELinux][selinux] and even |
| 76 | +[AppArmor][apparmor] profiles into a [Custom Resource Definition (CRD)][crd], |
| 77 | +reconcile them to each node and make them available for usage. |
| 79 | + |
| 80 | +[spo]: https://github.com/kubernetes-sigs/security-profiles-operator |
| 81 | +[selinux]: https://en.wikipedia.org/wiki/Security-Enhanced_Linux |
| 82 | +[apparmor]: https://en.wikipedia.org/wiki/AppArmor |
| 83 | +[crd]: https://k8s.io/docs/concepts/extend-kubernetes/api-extension/custom-resources |
| 84 | + |
| 84 | +The biggest challenge in creating security profiles is to catch all code |
| 85 | +paths which execute syscalls. We could achieve that by having **100%** logical |
| 86 | +coverage of the application when running an end-to-end test suite. You can see the |
| 87 | +problem with the previous statement: it's too idealistic to ever be fulfilled, |
| 88 | +even without taking all the moving parts during application development and |
| 89 | +deployment into account. |
| 91 | + |
| 91 | +Missing a syscall in the seccomp profile's allow list can have a tremendously |
| 92 | +negative impact on the application. It's not only that we can encounter crashes, |
| 93 | +which are trivially detectable. Blocked syscalls can also subtly change |
| 94 | +logical paths, change the business logic, make parts of the application |
| 95 | +unusable, slow down performance or even expose security vulnerabilities. We're |
| 96 | +simply not able to see the whole impact of that, especially because syscalls |
| 97 | +blocked via `SCMP_ACT_ERRNO` do not provide any additional [audit][audit] |
| 98 | +logging on the system. |
| 100 | + |
| 101 | +[audit]: https://linux.die.net/man/8/auditd |
| 102 | + |
| 103 | +Does that mean we're lost? Is it just not realistic to dream about a Kubernetes |
| 104 | +where [everyone uses the default seccomp profile][seccomp-default]? Should we |
| 105 | +stop striving towards maximum security in Kubernetes and accept that it's not |
| 106 | +meant to be secure by default? |
| 107 | + |
| 108 | +[seccomp-default]: https://github.com/kubernetes/enhancements/issues/2413 |
| 109 | + |
| 109 | +**Definitely not.** Technology evolves over time and there are many folks |
| 110 | +working behind the scenes of Kubernetes to indirectly deliver features to |
| 111 | +address such problems. One of those features is the _seccomp notifier_, |
| 112 | +which can be used to find suspicious syscalls in Kubernetes. |
| 114 | + |
| 115 | +The seccomp notify feature consists of a set of changes introduced in Linux 5.9. |
| 115 | +It makes the kernel capable of communicating seccomp related events to user |
| 116 | +space. That allows applications to act based on the syscalls and opens up a |
| 118 | +wide range of possible use cases. We not only need the right kernel version, |
| 119 | +but also at least runc v1.1.0 (or crun v0.19) to be able to make the notifier |
| 120 | +work at all. The Kubernetes container runtime [CRI-O][cri-o] gets [support for |
| 121 | +the seccomp notifier in v1.26.0][cri-o-notifier]. The new feature allows us to |
| 122 | +identify possibly malicious syscalls in our application, and therefore makes it |
| 123 | +possible to verify profiles for consistency and completeness. Let's give that a |
| 124 | +try. |
| 125 | + |
| 126 | +[cri-o]: https://cri-o.io |
| 127 | +[cri-o-notifier]: https://github.com/cri-o/cri-o/pull/6120 |
| 128 | + |
| 129 | +First of all we need to run the latest `main` version of CRI-O, because v1.26.0 |
| 129 | +has not been released yet at the time of writing. You can do that by either |
| 131 | +compiling it from the [source code][sources] or by using the pre-built binary |
| 132 | +bundle via [the get-script][script]. The seccomp notifier feature of CRI-O is |
| 133 | +guarded by an annotation, which has to be explicitly allowed, for example by |
| 134 | +using a configuration drop-in like this: |
| 135 | + |
| 136 | +```console |
| 137 | +> cat /etc/crio/crio.conf.d/02-runtimes.conf |
| 138 | +``` |
| 139 | + |
| 140 | +```toml |
| 141 | +[crio.runtime] |
| 142 | +default_runtime = "runc" |
| 143 | + |
| 144 | +[crio.runtime.runtimes.runc] |
| 145 | +allowed_annotations = [ "io.kubernetes.cri-o.seccompNotifierAction" ] |
| 146 | +``` |
| 147 | + |
| 148 | +[sources]: https://github.com/cri-o/cri-o/blob/main/install.md#build-and-install-cri-o-from-source |
| 149 | +[script]: https://github.com/cri-o/cri-o#installing-cri-o |
| 150 | + |
| 151 | +If CRI-O is up and running, then it should indicate that the seccomp notifier is |
| 152 | +available as well: |
| 153 | + |
| 154 | +```console |
| 155 | +> sudo ./bin/crio --enable-metrics |
| 156 | +… |
| 157 | +INFO[…] Starting seccomp notifier watcher |
| 158 | +INFO[…] Serving metrics on :9090 via HTTP |
| 159 | +… |
| 160 | +``` |
| 161 | + |
| 162 | +We also enable the metrics, because they provide additional telemetry data about |
| 163 | +the notifier. Now we need a running Kubernetes cluster for demonstration |
| 164 | +purposes. For this demo, we mainly stick to the |
| 165 | +[`hack/local-up-cluster.sh`][local-up] approach to locally spawn a single node |
| 166 | +Kubernetes cluster. |
| 167 | + |
| 168 | +[local-up]: https://github.com/cri-o/cri-o#running-kubernetes-with-cri-o |
| 169 | + |
| 170 | +If everything is up and running, then we would have to define a seccomp profile |
| 170 | +for testing purposes. But we do not have to create our own; we can just use the |
| 171 | +`RuntimeDefault` profile which gets shipped with each container runtime. For |
| 172 | +example, the `RuntimeDefault` profile for CRI-O can be found in the |
| 174 | +[containers/common][runtime-default] library. |
| 175 | + |
| 176 | +[runtime-default]: https://github.com/containers/common/blob/afff1d6/pkg/seccomp/seccomp.json |
| 177 | + |
| 178 | +Now we need a test container, which can be a simple [nginx][nginx] pod like |
| 179 | +this: |
| 180 | + |
| 181 | +[nginx]: https://www.nginx.com |
| 182 | + |
| 183 | +```yaml |
| 184 | +apiVersion: v1 |
| 185 | +kind: Pod |
| 186 | +metadata: |
| 187 | + name: nginx |
| 188 | + annotations: |
| 189 | + io.kubernetes.cri-o.seccompNotifierAction: "stop" |
| 190 | +spec: |
| 191 | + restartPolicy: Never |
| 192 | + containers: |
| 193 | + - name: nginx |
| 194 | + image: nginx:1.23.2 |
| 195 | + securityContext: |
| 196 | + seccompProfile: |
| 197 | + type: RuntimeDefault |
| 198 | +``` |
| 199 | + |
| 200 | +Please note the annotation `io.kubernetes.cri-o.seccompNotifierAction`, which |
| 201 | +enables the seccomp notifier for this workload. The value of the annotation can |
| 202 | +be either `stop` to stop the workload, or anything else to do nothing more |
| 203 | +than log and emit metrics. Because of the termination we also use |
| 204 | +`restartPolicy: Never` so that the container is not automatically recreated on |
| 205 | +failure. |
| 206 | + |
| 207 | +Let's run the pod and check if it works: |
| 208 | + |
| 209 | +```console |
| 210 | +> kubectl apply -f nginx.yaml |
| 211 | +``` |
| 212 | + |
| 213 | +```console |
| 214 | +> kubectl get pods -o wide |
| 215 | +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES |
| 216 | +nginx 1/1 Running 0 3m39s 10.85.0.3 127.0.0.1 <none> <none> |
| 217 | +``` |
| 218 | + |
| 219 | +We can also test if the web server itself works as intended: |
| 220 | + |
| 221 | +```console |
| 222 | +> curl 10.85.0.3 |
| 223 | +<!DOCTYPE html> |
| 224 | +<html> |
| 225 | +<head> |
| 226 | +<title>Welcome to nginx!</title> |
| 227 | +… |
| 228 | +``` |
| 229 | + |
| 230 | +While everything is now up and running, CRI-O also indicates that it has started |
| 231 | +the seccomp notifier: |
| 232 | + |
| 233 | +``` |
| 234 | +… |
| 235 | +INFO[…] Injecting seccomp notifier into seccomp profile of container 662a3bb0fdc7dd1bf5a88a8aa8ef9eba6296b593146d988b4a9b85822422febb |
| 236 | +… |
| 237 | +``` |
| 238 | + |
| 239 | +If we now run a forbidden syscall inside of the container, then we can |
| 240 | +expect that the workload gets terminated. Let's give that a try by running |
| 241 | +`chroot` in the container's namespaces: |
| 242 | + |
| 243 | +```console |
| 244 | +> kubectl exec -it nginx -- bash |
| 245 | +``` |
| 246 | + |
| 247 | +```console |
| 248 | +root@nginx:/# chroot /tmp |
| 249 | +chroot: cannot change root directory to '/tmp': Function not implemented |
| 250 | +root@nginx:/# command terminated with exit code 137 |
| 251 | +``` |
| 252 | + |
| 253 | +The exec session got terminated, so it looks like the container is not running |
| 254 | +any more: |
| 255 | + |
| 256 | +```console |
| 257 | +> kubectl get pods |
| 258 | +NAME READY STATUS RESTARTS AGE |
| 259 | +nginx 0/1 seccomp killed 0 96s |
| 260 | +``` |
| 261 | + |
| 262 | +Alright, the container got killed by seccomp, do we get any more information |
| 263 | +about what was going on? |
| 264 | + |
| 265 | +```console |
| 266 | +> kubectl describe pod nginx |
| 267 | +Name: nginx |
| 268 | +… |
| 269 | +Containers: |
| 270 | + nginx: |
| 271 | + … |
| 272 | + State: Terminated |
| 273 | + Reason: seccomp killed |
| 274 | + Message: Used forbidden syscalls: chroot (1x) |
| 275 | + Exit Code: 137 |
| 276 | + Started: Mon, 14 Nov 2022 12:19:46 +0100 |
| 277 | + Finished: Mon, 14 Nov 2022 12:20:26 +0100 |
| 278 | +… |
| 279 | +``` |
| 280 | + |
| 281 | +The seccomp notifier feature of CRI-O correctly set the termination reason and |
| 282 | +message, including which forbidden syscall has been used and how often (`1x`). Why |
| 283 | +a count? The notifier gives the application up to 5 seconds after the last |
| 284 | +seen syscall until it starts the termination. This means that it's possible to |
| 285 | +catch multiple forbidden syscalls within one test, avoiding time-consuming |
| 286 | +trial and error. |
| 287 | + |
| 288 | +```console |
| 289 | +> kubectl exec -it nginx -- chroot /tmp |
| 290 | +chroot: cannot change root directory to '/tmp': Function not implemented |
| 291 | +command terminated with exit code 125 |
| 292 | +> kubectl exec -it nginx -- chroot /tmp |
| 293 | +chroot: cannot change root directory to '/tmp': Function not implemented |
| 294 | +command terminated with exit code 125 |
| 295 | +> kubectl exec -it nginx -- swapoff -a |
| 296 | +command terminated with exit code 32 |
| 297 | +> kubectl exec -it nginx -- swapoff -a |
| 298 | +command terminated with exit code 32 |
| 299 | +``` |
| 300 | + |
| 301 | +```console |
| 302 | +> kubectl describe pod nginx | grep Message |
| 303 | + Message: Used forbidden syscalls: chroot (2x), swapoff (2x) |
| 304 | +``` |
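
The collect-then-stop behaviour can be modelled in a few lines of Python. This is only a sketch of the idea, not CRI-O's implementation: the class `SyscallRecorder`, its method names and the alphabetical ordering of the message are assumptions made for illustration. Timestamps are passed in explicitly so that the example stays deterministic.

```python
from collections import Counter


class SyscallRecorder:
    """Hypothetical model of the notifier's collect-then-stop behaviour."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout  # seconds to wait after the last seen syscall
        self.counts = Counter()
        self.last_seen = None

    def record(self, syscall, now):
        # Every new notification restarts the waiting period.
        self.counts[syscall] += 1
        self.last_seen = now

    def should_stop(self, now):
        # Only stop once the timeout has passed without a new syscall.
        if self.last_seen is None:
            return False
        return now - self.last_seen >= self.timeout

    def message(self):
        # Ordering by name is an assumption of this sketch.
        parts = [f"{name} ({count}x)" for name, count in sorted(self.counts.items())]
        return "Used forbidden syscalls: " + ", ".join(parts)


recorder = SyscallRecorder()
events = [(0.0, "chroot"), (1.0, "chroot"), (3.0, "swapoff"), (4.0, "swapoff")]
for timestamp, syscall in events:
    recorder.record(syscall, timestamp)

print(recorder.should_stop(now=6.0))  # False: only 2 seconds since the last syscall
print(recorder.should_stop(now=9.0))  # True: 5 seconds have passed
print(recorder.message())
```

Because each recorded syscall resets the waiting period, several forbidden syscalls issued in quick succession end up in a single termination message.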
| 305 | + |
| 306 | +The CRI-O metrics will also reflect that: |
| 307 | + |
| 308 | +```console |
| 309 | +> curl -sf localhost:9090/metrics | grep seccomp_notifier |
| 310 | +# HELP container_runtime_crio_containers_seccomp_notifier_count_total Amount of containers stopped because they used a forbidden syscalls by their name |
| 311 | +# TYPE container_runtime_crio_containers_seccomp_notifier_count_total counter |
| 312 | +container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (1x)"} 1 |
| 313 | +container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (2x), swapoff (2x)"} 1 |
| 314 | +``` |
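
If you want to consume these metrics programmatically, for example as a starting point for alerting, the text output can be parsed with a small script. The following Python sketch works on the sample output shown above (embedded as a string, with the elided `name` label kept as-is). The function `parse_notifier_metrics` is a hypothetical helper, and the regular expressions cover only the simple line shape seen here, not the full Prometheus exposition format.

```python
import re

METRIC = "container_runtime_crio_containers_seccomp_notifier_count_total"

# Matches lines of the form: metric_name{labels} value
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
# Matches key="value" pairs, allowing commas inside the quoted value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

SAMPLE = '''
# TYPE container_runtime_crio_containers_seccomp_notifier_count_total counter
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (1x)"} 1
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (2x), swapoff (2x)"} 1
'''


def parse_notifier_metrics(text):
    """Return (labels, value) tuples for the seccomp notifier counter."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None or match.group(1) != METRIC:
            continue
        labels = dict(LABEL_RE.findall(match.group(2)))
        results.append((labels, float(match.group(3))))
    return results


samples = parse_notifier_metrics(SAMPLE)
for labels, value in samples:
    print(f"{labels['syscalls']}: {value:g} container(s) stopped")
print("total:", sum(value for _, value in samples))
```

In a real setup you would more likely let Prometheus scrape the endpoint and define alert rules there; the sketch only shows what information the labels carry.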
| 315 | + |
| 316 | +How does it work in detail? CRI-O uses the chosen seccomp profile and injects |
| 317 | +the action `SCMP_ACT_NOTIFY` instead of `SCMP_ACT_ERRNO`, `SCMP_ACT_KILL`, |
| 318 | +`SCMP_ACT_KILL_PROCESS` or `SCMP_ACT_KILL_THREAD`. It also sets a local listener |
| 319 | +path which will be used by the lower level OCI runtime (runc or crun) to create |
| 320 | +the seccomp notifier socket. Once the connection between the socket and CRI-O has |
| 321 | +been established, CRI-O will receive notifications for each syscall that |
| 322 | +seccomp interferes with. CRI-O stores the syscalls, allows a bit of timeout for |
| 323 | +them to arrive and then terminates the container if the chosen |
| 324 | +`seccompNotifierAction` is `stop`. Unfortunately, the seccomp notifier is not able to |
| 325 | +notify on the `defaultAction`, which means that it's required to have |
| 326 | +a list of syscalls to test for custom profiles. CRI-O also states that |
| 327 | +limitation in the logs: |
| 328 | + |
| 329 | +```log |
| 330 | +INFO[…] The seccomp profile default action SCMP_ACT_ERRNO cannot be overridden to SCMP_ACT_NOTIFY, |
| 331 | + which means that syscalls using that default action can't be traced by the notifier |
| 332 | +``` |
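
The profile rewrite described above can be sketched as follows. This is an illustration in Python under stated assumptions, not CRI-O's actual code: `inject_notifier` is a hypothetical helper, and the input profile is made up for the example. It swaps the blocking actions of explicit syscall rules for `SCMP_ACT_NOTIFY` and leaves `defaultAction` untouched, mirroring the limitation from the log message.

```python
import copy

# Actions that get replaced by SCMP_ACT_NOTIFY, as listed in the text above.
REPLACED_ACTIONS = {
    "SCMP_ACT_ERRNO",
    "SCMP_ACT_KILL",
    "SCMP_ACT_KILL_PROCESS",
    "SCMP_ACT_KILL_THREAD",
}


def inject_notifier(profile):
    """Return a copy of the profile with blocking rule actions set to notify."""
    patched = copy.deepcopy(profile)  # leave the caller's profile unmodified
    for rule in patched.get("syscalls", []):
        if rule.get("action") in REPLACED_ACTIONS:
            rule["action"] = "SCMP_ACT_NOTIFY"
    # The default action is intentionally not changed, so syscalls that only
    # hit the default action stay invisible to the notifier.
    return patched


# Hypothetical custom profile with an explicit deny rule.
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["open", "write"], "action": "SCMP_ACT_ALLOW"},
        {"names": ["chroot", "swapoff"], "action": "SCMP_ACT_ERRNO"},
    ],
}

patched = inject_notifier(profile)
for rule in patched["syscalls"]:
    print(rule["names"], "->", rule["action"])
print("defaultAction ->", patched["defaultAction"])
```

This is why a custom profile needs explicit rules for the syscalls you want to trace: only those rules can be switched to `SCMP_ACT_NOTIFY`.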
| 333 | + |
| 334 | +In conclusion, the seccomp notifier implementation in CRI-O can be used to |
| 335 | +verify if your applications behave correctly when using `RuntimeDefault` or any |
| 336 | +other custom profile. Alerts can be created based on the metrics to create long |
| 337 | +running test scenarios around that feature. Making seccomp understandable and |
| 338 | +easier to use will increase adoption as well as help us to move towards a more |
| 339 | +secure Kubernetes by default! |
| 340 | + |
| 341 | +Thank you for reading this blog post. If you'd like to read more about the |
| 342 | +seccomp notifier, check out the following resources: |
| 343 | + |
| 344 | +- The Seccomp Notifier - New Frontiers in Unprivileged Container Development: https://brauner.io/2020/07/23/seccomp-notify.html |
| 345 | +- Bringing Seccomp Notify to Runc and Kubernetes: https://kinvolk.io/blog/2022/03/bringing-seccomp-notify-to-runc-and-kubernetes |
| 346 | +- Seccomp Agent reference implementation: https://github.com/opencontainers/runc/tree/6b16d00/contrib/cmd/seccompagent |