Commit 8ef95bf

Add seccomp notifier blog post
Signed-off-by: Sascha Grunert <[email protected]>
1 parent d3a2bb6 commit 8ef95bf

---
layout: blog
title: "Finding suspicious syscalls with the seccomp notifier"
date: 2022-12-02
slug: seccomp-notifier
---

**Authors:** Sascha Grunert

Debugging software in production is one of the biggest challenges we face in
our containerized environments. Understanding the impact of the available
security options, especially when configuring our deployments, is key to making
the default security in Kubernetes stronger. We already have logging, tracing
and metrics data at hand, but how do we assemble the information they provide
into something human-readable and actionable?

[Seccomp][seccomp] is one of the standard mechanisms to protect a Linux-based
Kubernetes application from malicious actions by interfering with its [system
calls][syscalls]. This allows us to restrict the application to a defined set of
actions, like modifying files or responding to HTTP requests. Mapping such an
action, for example modifying a local file, to the set of syscalls it requires
and then to the actual source code is non-trivial. Seccomp profiles for
Kubernetes have to be written in [JSON][json] and can be understood as an
architecture-specific allow-list with superpowers, for example:

[seccomp]: https://en.wikipedia.org/wiki/Seccomp
[syscalls]: https://en.wikipedia.org/wiki/Syscall
[json]: https://www.json.org

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "defaultErrno": "ENOSYS",
  "syscalls": [
    {
      "names": ["chmod", "chown", "open", "write"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

The profile above returns an error by default, because it specifies a
`defaultAction` of `SCMP_ACT_ERRNO`. This means we have to allow a set of
syscalls via `SCMP_ACT_ALLOW`, otherwise the application would not be able to do
anything at all. Okay, cool: to allow file operations, all we have to do is add
a bunch of file-specific syscalls like `open` or `write`, and probably also
allow changing permissions via `chmod` and `chown`, right? Basically yes, but
there are issues with the simplicity of that approach:

Seccomp profiles need to include the minimum set of syscalls required to start
the application. This also includes some syscalls from the lower-level
[Open Container Initiative (OCI)][oci] container runtime, for example
[runc][runc] or [crun][crun]. Besides that, we can only guarantee the required
syscalls for a very specific version of the runtimes and our application,
because the code can change between releases. The same applies to the
termination of the application as well as to the target architecture we're
deploying on. Features like executing commands within containers also require
another subset of syscalls. Not to mention that there are multiple variants of
some syscalls doing slightly different things, and that seccomp profiles are
able to filter on their arguments. It's also not always clearly visible to
developers which syscalls their own code uses, because they rely on programming
language abstractions or frameworks.

[oci]: https://opencontainers.org
[runc]: https://github.com/opencontainers/runc
[crun]: https://github.com/containers/crun

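To make the allow-list semantics concrete, here is a minimal Python sketch of
how a profile like the example above maps a syscall name to an action. It is an
illustration only, not how the kernel or a container runtime evaluates filters:
the helper `resolve_action` is hypothetical, and it deliberately ignores
architectures and argument filters. It shows that `open` is allowed, while the
closely related `openat` is not listed and therefore falls through to the
`defaultAction`.

```python
import json

# The example profile from above, reduced to the fields this sketch looks at.
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "syscalls": [
    {
      "names": ["chmod", "chown", "open", "write"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
""")


def resolve_action(profile, syscall):
    """Return the (action, errno) pair the profile assigns to a syscall name.

    Hypothetical helper: it ignores argument filters and architectures.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"], rule.get("errnoRet")
    # No rule matched, so the default action of the profile applies.
    return profile["defaultAction"], profile.get("defaultErrnoRet")


for name in ("open", "openat"):
    action, errno_ret = resolve_action(PROFILE, name)
    print(f"{name}: {action} (errno: {errno_ret})")
```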
_How can we know which syscalls are even required then? Who should create and
maintain those profiles during the application's development life-cycle?_

Well, recording and distributing seccomp profiles is one of the problem domains
of the [Security Profiles Operator][spo], which is already solving that. The
operator is able to record [seccomp][seccomp], [SELinux][selinux] and even
[AppArmor][apparmor] profiles into a [Custom Resource Definition (CRD)][crd],
reconcile them to each node and make them available for usage.

[spo]: https://github.com/kubernetes-sigs/security-profiles-operator
[selinux]: https://en.wikipedia.org/wiki/Security-Enhanced_Linux
[apparmor]: https://en.wikipedia.org/wiki/AppArmor
[crd]: https://k8s.io/docs/concepts/extend-kubernetes/api-extension/custom-resources

The biggest challenge in creating security profiles is catching all code paths
which execute syscalls. We could achieve that by having **100%** logical
coverage of the application when running an end-to-end test suite. You can see
the problem with that statement: it's too idealistic to ever be fulfilled, even
without taking all the moving parts during application development and
deployment into account.

Missing a syscall in the seccomp profile's allow list can have a tremendously
negative impact on the application. It's not only that we can encounter crashes,
which are trivially detectable. A blocked syscall can also subtly change
logical paths, alter the business logic, make parts of the application
unusable, slow down performance or even expose security vulnerabilities. We're
simply not able to see the whole impact of that, especially because syscalls
blocked via `SCMP_ACT_ERRNO` do not provide any additional [audit][audit]
logging on the system.

[audit]: https://linux.die.net/man/8/auditd

Does that mean we're lost? Is it just not realistic to dream about a Kubernetes
where [everyone uses the default seccomp profile][seccomp-default]? Should we
stop striving towards maximum security in Kubernetes and accept that it's not
meant to be secure by default?

[seccomp-default]: https://github.com/kubernetes/enhancements/issues/2413

**Definitely not.** Technology evolves over time and there are many folks
working behind the scenes of Kubernetes to indirectly deliver features to
address such problems. One of those features is the _seccomp notifier_, which
can be used to find suspicious syscalls in Kubernetes.

The seccomp notify feature consists of a set of changes introduced in Linux 5.9.
It makes the kernel capable of communicating seccomp-related events to user
space. That allows applications to act based on the syscalls and opens up a
wide range of possible use cases. We not only need the right kernel version,
but also at least runc v1.1.0 (or crun v0.19) to be able to make the notifier
work at all. The Kubernetes container runtime [CRI-O][cri-o] gets [support for
the seccomp notifier in v1.26.0][cri-o-notifier]. The new feature allows us to
identify possibly malicious syscalls in our application, and therefore makes it
possible to verify profiles for consistency and completeness. Let's give that a
try.

[cri-o]: https://cri-o.io
[cri-o-notifier]: https://github.com/cri-o/cri-o/pull/6120

First of all we need to run the latest `main` version of CRI-O, because v1.26.0
has not been released yet at the time of writing. You can do that by either
compiling it from the [source code][sources] or by using the pre-built binary
bundle via [the get-script][script]. The seccomp notifier feature of CRI-O is
guarded by an annotation, which has to be explicitly allowed, for example by
using a configuration drop-in like this:

```console
> cat /etc/crio/crio.conf.d/02-runtimes.conf
```

```toml
[crio.runtime]
default_runtime = "runc"

[crio.runtime.runtimes.runc]
allowed_annotations = [ "io.kubernetes.cri-o.seccompNotifierAction" ]
```

[sources]: https://github.com/cri-o/cri-o/blob/main/install.md#build-and-install-cri-o-from-source
[script]: https://github.com/cri-o/cri-o#installing-cri-o

If CRI-O is up and running, then it should indicate that the seccomp notifier is
available as well:

```console
> sudo ./bin/crio --enable-metrics

INFO[…] Starting seccomp notifier watcher
INFO[…] Serving metrics on :9090 via HTTP

```

We also enable the metrics, because they provide additional telemetry data about
the notifier. Now we need a running Kubernetes cluster for demonstration
purposes. For this demo, we mainly stick to the
[`hack/local-up-cluster.sh`][local-up] approach to locally spawn a single-node
Kubernetes cluster.

[local-up]: https://github.com/cri-o/cri-o#running-kubernetes-with-cri-o

Once everything is up and running, we need a seccomp profile for testing
purposes. We do not have to create our own, though: we can just use the
`RuntimeDefault` profile which gets shipped with each container runtime. For
example, the `RuntimeDefault` profile for CRI-O can be found in the
[containers/common][runtime-default] library.

[runtime-default]: https://github.com/containers/common/blob/afff1d6/pkg/seccomp/seccomp.json

Now we need a test container, which can be a simple [nginx][nginx] pod like
this:

[nginx]: https://www.nginx.com

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    io.kubernetes.cri-o.seccompNotifierAction: "stop"
spec:
  restartPolicy: Never
  containers:
    - name: nginx
      image: nginx:1.23.2
      securityContext:
        seccompProfile:
          type: RuntimeDefault
```

Please note the annotation `io.kubernetes.cri-o.seccompNotifierAction`, which
enables the seccomp notifier for this workload. The value of the annotation can
be either `stop` to stop the workload, or anything else to do nothing beyond
logging and emitting metrics. Because of the termination, we also set
`restartPolicy: Never` so that the container is not automatically recreated on
failure.

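The annotation semantics described above can be summarized in a few lines of
Python. This is only a sketch of the rule as stated in this post, not CRI-O
code: the function `notifier_mode` and the mode names `disabled`, `stop` and
`log-only` are made up for illustration.

```python
ANNOTATION = "io.kubernetes.cri-o.seccompNotifierAction"


def notifier_mode(annotations):
    """Classify a pod's annotations according to the rule described above.

    Hypothetical helper; the returned mode names are illustrative only.
    """
    if ANNOTATION not in annotations:
        # Without the annotation the notifier is not enabled for the workload.
        return "disabled"
    if annotations[ANNOTATION] == "stop":
        # The workload gets stopped when a forbidden syscall is seen.
        return "stop"
    # Any other value: only log and emit metrics.
    return "log-only"


print(notifier_mode({ANNOTATION: "stop"}))
print(notifier_mode({ANNOTATION: "log"}))
print(notifier_mode({}))
```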
Let's run the pod and check if it works:

```console
> kubectl apply -f nginx.yaml
```

```console
> kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP          NODE        NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          3m39s   10.85.0.3   127.0.0.1   <none>           <none>
```

We can also test if the web server itself works as intended:

```console
> curl 10.85.0.3
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>

```

While everything is now up and running, CRI-O also indicates that it has started
the seccomp notifier:

```

INFO[…] Injecting seccomp notifier into seccomp profile of container 662a3bb0fdc7dd1bf5a88a8aa8ef9eba6296b593146d988b4a9b85822422febb

```

If we now run a forbidden syscall inside the container, we can expect the
workload to get terminated. Let's give that a try by running `chroot` in the
container's namespaces:

```console
> kubectl exec -it nginx -- bash
```

```console
root@nginx:/# chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
root@nginx:/# command terminated with exit code 137
```

The exec session got terminated, so it looks like the container is not running
any more:

```console
> kubectl get pods
NAME    READY   STATUS           RESTARTS   AGE
nginx   0/1     seccomp killed   0          96s
```

Alright, the container got killed by seccomp, do we get any more information
about what was going on?

```console
> kubectl describe pod nginx
Name:         nginx

Containers:
  nginx:

    State:          Terminated
      Reason:       seccomp killed
      Message:      Used forbidden syscalls: chroot (1x)
      Exit Code:    137
      Started:      Mon, 14 Nov 2022 12:19:46 +0100
      Finished:     Mon, 14 Nov 2022 12:20:26 +0100

```

The seccomp notifier feature of CRI-O correctly set the termination reason and
message, including which forbidden syscall has been used and how often (`1x`).
How often? Yes, because the notifier gives the application up to 5 seconds
after the last seen syscall before it starts the termination. This means that
it's possible to catch multiple forbidden syscalls within one test, avoiding
time-consuming trial and error.

```console
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
```

```console
> kubectl describe pod nginx | grep Message
      Message:      Used forbidden syscalls: chroot (2x), swapoff (2x)
```

The CRI-O metrics will also reflect that:

```console
> curl -sf localhost:9090/metrics | grep seccomp_notifier
# HELP container_runtime_crio_containers_seccomp_notifier_count_total Amount of containers stopped because they used a forbidden syscalls by their name
# TYPE container_runtime_crio_containers_seccomp_notifier_count_total counter
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (1x)"} 1
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (2x), swapoff (2x)"} 1
```

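The counting and the grace period described above can be illustrated with a
small Python sketch. It is not CRI-O's implementation (CRI-O is written in Go);
the class `SyscallRecorder` and its methods are hypothetical, timestamps are
passed in explicitly to keep the example deterministic, and sorting the
syscalls by name is an assumption. Only the 5 second window and the message
format are taken from this post.

```python
from collections import Counter

# Grace period after the last seen syscall, as described in the text above.
QUIET_PERIOD_SECONDS = 5.0


class SyscallRecorder:
    """Hypothetical recorder that counts forbidden syscalls per name."""

    def __init__(self):
        self.counts = Counter()
        self.last_seen = None

    def record(self, syscall, now):
        """Store one notification and remember when it arrived."""
        self.counts[syscall] += 1
        self.last_seen = now

    def should_stop(self, now):
        """True once the quiet period has passed since the last syscall."""
        if self.last_seen is None:
            return False
        return now - self.last_seen >= QUIET_PERIOD_SECONDS

    def message(self):
        """Format the counts like the termination message shown above."""
        used = ", ".join(
            f"{name} ({count}x)" for name, count in sorted(self.counts.items())
        )
        return f"Used forbidden syscalls: {used}"


recorder = SyscallRecorder()
events = [(0.0, "chroot"), (1.5, "chroot"), (3.0, "swapoff"), (4.0, "swapoff")]
for timestamp, name in events:
    recorder.record(name, now=timestamp)

print(recorder.should_stop(now=6.0))  # only 2 seconds after the last syscall
print(recorder.should_stop(now=9.0))  # 5 seconds after the last syscall
print(recorder.message())
```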
How does it work in detail? CRI-O takes the chosen seccomp profile and injects
the action `SCMP_ACT_NOTIFY` in place of `SCMP_ACT_ERRNO`, `SCMP_ACT_KILL`,
`SCMP_ACT_KILL_PROCESS` or `SCMP_ACT_KILL_THREAD`. It also sets a local listener
path which will be used by the lower-level OCI runtime (runc or crun) to create
the seccomp notifier socket. Once the connection between the socket and CRI-O
has been established, CRI-O receives a notification for each syscall that
seccomp intercepts. CRI-O stores the syscalls, allows a short timeout for
further ones to arrive and then terminates the container if the chosen
`seccompNotifierAction` is `stop`. Unfortunately, the seccomp notifier is not
able to notify on the `defaultAction`, which means that custom profiles need an
explicit list of syscalls to test for. CRI-O also states that limitation in the
logs:

```log
INFO[…] The seccomp profile default action SCMP_ACT_ERRNO cannot be overridden to SCMP_ACT_NOTIFY,
which means that syscalls using that default action can't be traced by the notifier
```

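The action rewrite described above can be sketched in Python on a profile in
the JSON format shown at the beginning of this post. This is an illustration
under assumptions, not CRI-O's actual code: the function `inject_notifier` and
the sample profile are hypothetical. The sketch replaces the blocking actions
of explicit syscall rules with `SCMP_ACT_NOTIFY` and leaves the `defaultAction`
untouched, which mirrors the limitation mentioned in the log message.

```python
import copy

# Actions that get replaced by the notifier action, as listed in the text.
REPLACED_ACTIONS = {
    "SCMP_ACT_ERRNO",
    "SCMP_ACT_KILL",
    "SCMP_ACT_KILL_PROCESS",
    "SCMP_ACT_KILL_THREAD",
}


def inject_notifier(profile):
    """Return a copy of the profile with blocking rules set to notify.

    Hypothetical helper: only explicit syscall rules are rewritten, the
    default action stays as it is.
    """
    patched = copy.deepcopy(profile)
    for rule in patched.get("syscalls", []):
        if rule.get("action") in REPLACED_ACTIONS:
            rule["action"] = "SCMP_ACT_NOTIFY"
    return patched


# Made-up sample profile for demonstration purposes.
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["open", "write"], "action": "SCMP_ACT_ALLOW"},
        {"names": ["chroot", "swapoff"], "action": "SCMP_ACT_ERRNO"},
    ],
}

patched = inject_notifier(profile)
for rule in patched["syscalls"]:
    print(", ".join(rule["names"]), "->", rule["action"])
print("defaultAction ->", patched["defaultAction"])
```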
In conclusion, the seccomp notifier implementation in CRI-O can be used to
verify that your applications behave correctly when using `RuntimeDefault` or
any other custom profile. Alerts can be created based on the metrics to build
long-running test scenarios around that feature. Making seccomp understandable
and easier to use will increase adoption as well as help us to move towards a
more secure Kubernetes by default!

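As a starting point for such metrics-based checks, the following Python sketch
extracts the notifier samples from metrics text like the output shown earlier.
It is a minimal, hand-rolled parser for illustration only: the function
`stopped_containers` is hypothetical, the sample text is copied from the output
above (including the elided `name` label), and a real setup would more likely
rely on Prometheus alerting rules.

```python
import re

METRIC = "container_runtime_crio_containers_seccomp_notifier_count_total"

# Sample lines copied from the metrics output shown earlier in this post.
SAMPLE = """\
# TYPE container_runtime_crio_containers_seccomp_notifier_count_total counter
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (1x)"} 1
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (2x), swapoff (2x)"} 1
"""

SAMPLE_LINE = re.compile(
    r"^" + re.escape(METRIC) + r"\{(?P<labels>.*)\}\s+(?P<value>\S+)$"
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def stopped_containers(metrics_text):
    """Return (syscalls label, value) pairs for every notifier sample."""
    results = []
    for line in metrics_text.splitlines():
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # comment lines and unrelated metrics
        labels = dict(LABEL.findall(match.group("labels")))
        results.append((labels.get("syscalls", ""), float(match.group("value"))))
    return results


for syscalls, value in stopped_containers(SAMPLE):
    print(f"{value:g} container(s) stopped for: {syscalls}")
```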
Thank you for reading this blog post. If you'd like to read more about the
seccomp notifier, check out the following resources:

- The Seccomp Notifier - New Frontiers in Unprivileged Container Development: https://brauner.io/2020/07/23/seccomp-notify.html
- Bringing Seccomp Notify to Runc and Kubernetes: https://kinvolk.io/blog/2022/03/bringing-seccomp-notify-to-runc-and-kubernetes
- Seccomp Agent reference implementation: https://github.com/opencontainers/runc/tree/6b16d00/contrib/cmd/seccompagent
