Commit b9a4727

contrib: memfd-bind: add helper for memfd-sealed-bind trick
This really isn't ideal but it can be used to avoid the largest issues with the memfd-based runc binary protection. There are several caveats with using this tool, see the help page for the new binary for details.

Signed-off-by: Aleksa Sarai <[email protected]>
1 parent dac4171 commit b9a4727

6 files changed: +330 -8 lines changed

.gitignore

Lines changed: 5 additions & 4 deletions
@@ -1,10 +1,11 @@
 vendor/pkg
 /runc
 /runc-*
-contrib/cmd/recvtty/recvtty
-contrib/cmd/sd-helper/sd-helper
-contrib/cmd/seccompagent/seccompagent
-contrib/cmd/fs-idmap/fs-idmap
+/contrib/cmd/recvtty/recvtty
+/contrib/cmd/sd-helper/sd-helper
+/contrib/cmd/seccompagent/seccompagent
+/contrib/cmd/fs-idmap/fs-idmap
+/contrib/cmd/memfd-bind/memfd-bind
 man/man8
 release
 Vagrantfile

Makefile

Lines changed: 4 additions & 3 deletions
@@ -67,9 +67,9 @@ runc: runc-dmz
 	$(GO_BUILD) -o runc .
 	make verify-dmz-arch
 
-all: runc recvtty sd-helper seccompagent fs-idmap
+all: runc recvtty sd-helper seccompagent fs-idmap memfd-bind
 
-recvtty sd-helper seccompagent fs-idmap:
+recvtty sd-helper seccompagent fs-idmap memfd-bind:
 	$(GO_BUILD) -o contrib/cmd/$@/$@ ./contrib/cmd/$@
 
 static: runc-dmz
@@ -161,10 +161,11 @@ install-man: man
 
 clean:
 	rm -f runc runc-* libcontainer/dmz/runc-dmz
+	rm -f contrib/cmd/fs-idmap/fs-idmap
 	rm -f contrib/cmd/recvtty/recvtty
 	rm -f contrib/cmd/sd-helper/sd-helper
 	rm -f contrib/cmd/seccompagent/seccompagent
-	rm -f contrib/cmd/fs-idmap/fs-idmap
+	rm -f contrib/cmd/memfd-bind/memfd-bind
 	sudo rm -rf release
 	rm -rf man/man8
 

README.md

Lines changed: 3 additions & 1 deletion
@@ -68,13 +68,15 @@ make BUILDTAGS=""
 | Build Tag | Feature | Enabled by Default | Dependencies |
 |---------------|---------------------------------------|--------------------|---------------------|
 | `seccomp` | Syscall filtering using `libseccomp`. | yes | `libseccomp` |
-| `!runc_nodmz` | Reduce memory usage for CVE-2019-5736 protection by using a small C binary. `runc_nodmz` disables this feature and causes runc to use a different protection mechanism which will further increases memory usage temporarily during container startup. This feature can also be disabled at runtime by setting the `RUNC_DMZ=legacy` environment variable. | yes ||
+| `!runc_nodmz` | Reduce memory usage for CVE-2019-5736 protection by using a small C binary, [see `memfd-bind` for more details][contrib-memfd-bind]. `runc_nodmz` disables this feature and causes runc to use a different protection mechanism which will further increases memory usage temporarily during container startup. This feature can also be disabled at runtime by setting the `RUNC_DMZ=legacy` environment variable. | yes ||
 
 The following build tags were used earlier, but are now obsoleted:
 - **nokmem** (since runc v1.0.0-rc94 kernel memory settings are ignored)
 - **apparmor** (since runc v1.0.0-rc93 the feature is always enabled)
 - **selinux** (since runc v1.0.0-rc93 the feature is always enabled)
 
+[contrib-memfd-bind]: /contrib/memfd-bind/README.md
+
 ### Running the test suite
 
 `runc` currently supports running its test suite via Docker.

contrib/cmd/memfd-bind/README.md

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+## memfd-bind ##
+
+`runc` normally has to make a binary copy of itself (or of a smaller helper
+binary called `runc-dmz`) when constructing a container process in order to
+defend against certain container runtime attacks such as CVE-2019-5736.
+
+This cloned binary only exists until the container process starts (this means
+for `runc run` and `runc exec`, it only exists for a few hundred milliseconds
+-- for `runc create` it exists until `runc start` is called). However, because
+the clone is done using a memfd (or by creating files in directories that are
+likely to be a `tmpfs`), this can lead to temporary increases in *host* memory
+usage. Unless you are running on a cgroupv1 system with the cgroupv1 memory
+controller enabled and the (deprecated) `memory.move_charge_at_immigrate`
+enabled, there is no effect on the container's memory.
+
+However, for certain configurations this can still be undesirable. This daemon
+allows you to create a sealed memfd copy of the `runc` binary, which will cause
+`runc` to skip all binary copying, resulting in no additional memory usage for
+each container process (instead there is a single in-memory copy of the
+binary). It should be noted that (strictly speaking) this is slightly less
+secure if you are concerned about Dirty Cow-like 0-day kernel vulnerabilities,
+but for most users the security benefit is identical.
+
+The provided `memfd-bind@.service` file can be used to get systemd to manage
+this daemon. You can supply the path like so:
+
+```
+% systemctl start memfd-bind@/usr/bin/runc
+```
+
+Thus, there are three ways of protecting against CVE-2019-5736, in order of how
+much memory they can use:
+
+* `memfd-bind` only creates a single in-memory copy of the `runc` binary (about
+  10MB), regardless of how many containers are running.
+
+* `runc-dmz` is (depending on which libc it was compiled with) between 10kB and
+  1MB in size, and a copy is created once per process spawned inside a
+  container by runc (both the pid1 and every `runc exec`). There are
+  circumstances where using `runc-dmz` will fail in ways that runc cannot
+  predict ahead of time (such as restrictive LSMs applied to containers), in
+  which case users can disable it with the `RUNC_DMZ=legacy` setting.
+  `runc-dmz` also requires an additional `execve` over the other options,
+  though since the binary is so small the cost is probably not even noticeable.
+
+* The classic method of making a copy of the entire `runc` binary during
+  container process setup takes up about 10MB per process spawned inside the
+  container by runc (both pid1 and `runc exec`).
+
+### Caveats ###
+
+There are several downsides with using `memfd-bind` on the `runc` binary:
+
+* The `memfd-bind` process needs to continue to run indefinitely in order for
+  the memfd reference to stay alive. If the process is forcefully killed, the
+  bind-mount on top of the `runc` binary will become stale and nobody will be
+  able to execute it (you can use `memfd-bind --cleanup` to clean up the stale
+  mount).
+
+* Only root can execute the cloned binary due to permission restrictions on
+  accessing other processes' files. More specifically, only users with ptrace
+  privileges over the memfd-bind daemon can access the file (but in practice
+  this is usually only root).
+
+* When updating `runc`, the daemon needs to be stopped before the update (so
+  the package manager can access the underlying file) and then restarted after
+  the update.
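
To make the comparison of the three protection methods concrete, here is a rough back-of-the-envelope sketch in Go. It only restates the approximate sizes quoted in the README ("about 10MB", "between 10kB and 1MB"). The `strategy` type, its constants, and `estimateBytes` are names invented for this illustration and are not runc identifiers. Since the per-process copies are short-lived, `procs` should be read as the number of container processes being set up at the same moment, not the number of running containers.

```go
package main

import "fmt"

// Approximate sizes quoted in the README; real binaries will differ.
const (
	runcBinaryBytes = 10 << 20 // "about 10MB"
	dmzMinBytes     = 10 << 10 // "10kB"
	dmzMaxBytes     = 1 << 20  // "1MB"
)

// strategy labels the three protection methods (hypothetical names).
type strategy int

const (
	memfdBind strategy = iota
	runcDmz
	classicClone
)

// estimateBytes returns a rough (low, high) range of extra host memory
// used while procs container processes are being set up concurrently.
func estimateBytes(s strategy, procs int) (low, high int64) {
	n := int64(procs)
	switch s {
	case memfdBind:
		// One persistent copy, independent of the process count.
		return runcBinaryBytes, runcBinaryBytes
	case runcDmz:
		// One small copy per process; size depends on the libc used.
		return n * dmzMinBytes, n * dmzMaxBytes
	case classicClone:
		// One full copy of the runc binary per process.
		return n * runcBinaryBytes, n * runcBinaryBytes
	}
	return 0, 0
}

func main() {
	names := map[strategy]string{
		memfdBind:    "memfd-bind",
		runcDmz:      "runc-dmz",
		classicClone: "classic clone",
	}
	for _, s := range []strategy{memfdBind, runcDmz, classicClone} {
		low, high := estimateBytes(s, 50)
		fmt.Printf("%-13s %d..%d bytes\n", names[s], low, high)
	}
}
```

Note that the `memfd-bind` copy stays resident for as long as the daemon runs, whereas the other two figures are transient peaks during container process setup.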
Lines changed: 240 additions & 0 deletions
@@ -0,0 +1,240 @@
+/*
+ * Copyright (c) 2023 SUSE LLC
+ * Copyright (c) 2023 Aleksa Sarai <[email protected]>
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package main
+
+import (
+	"errors"
+	"fmt"
+	"io"
+	"os"
+	"os/signal"
+	"runtime"
+	"strings"
+	"time"
+
+	"github.com/opencontainers/runc/libcontainer/dmz"
+
+	"github.com/sirupsen/logrus"
+	"github.com/urfave/cli"
+	"golang.org/x/sys/unix"
+)
+
+// version will be populated by the Makefile, read from
+// VERSION file of the source code.
+var version = ""
+
+// gitCommit will be the hash that the binary was built from
+// and will be populated by the Makefile.
+var gitCommit = ""
+
+const (
+	usage = `Open Container Initiative contrib/cmd/memfd-bind
+
+In order to protect against certain container attacks, every runc invocation
+that involves creating or joining a container will cause runc to make a copy of
+the runc binary in memory (usually to a memfd). While "runc init" is very
+short-lived, this extra memory usage can cause problems for containers with
+very small memory limits (or containers that have many "runc exec" invocations
+applied to them at the same time).
+
+memfd-bind is a tool to create a persistent memfd-sealed-copy of the runc binary,
+which will cause runc to not make its own copy. This means you can get the
+benefits of using a sealed memfd as runc's binary (even if a container breakout
+attack gains write access to the runc binary, neither the underlying binary
+nor the memfd copy can be changed).
+
+To use memfd-bind, just specify the path to the runc binary that you want to
+protect with a sealed memfd copy:
+
+  $ sudo memfd-bind /usr/bin/runc
+
+Note that (due to kernel restrictions on bind-mounts), this program must remain
+running on the host in order for the binary to be readable (it is recommended
+you use a systemd unit to keep this process around).
+
+If this program dies, there will be a leftover mountpoint that always returns
+-EINVAL when attempting to access it. You need to use memfd-bind --cleanup on the
+path in order to unmount the path (regular umount(8) will not work):
+
+  $ sudo memfd-bind --cleanup /usr/bin/runc
+
+Note that (due to restrictions on /proc/$pid/fd/$fd magic-link resolution),
+only privileged users (specifically, those that have ptrace privileges over the
+memfd-bind daemon) can access the memfd bind-mount. This means that using this
+tool to harden your /usr/bin/runc binary would result in unprivileged users
+being unable to execute the binary. If this is an issue, you could make all
+privileged processes use a different copy of runc (by making a copy somewhere
+like /usr/sbin/runc) and only use memfd-bind for the version used by
+privileged users.
+`
+)
+
+func cleanup(path string) error {
+	file, err := os.OpenFile(path, unix.O_PATH|unix.O_NOFOLLOW|unix.O_CLOEXEC, 0)
+	if err != nil {
+		return fmt.Errorf("cleanup: failed to open runc binary path: %w", err)
+	}
+	defer file.Close()
+	fdPath := fmt.Sprintf("/proc/self/fd/%d", file.Fd())
+
+	// Keep umounting until we hit a umount error.
+	for unix.Unmount(fdPath, unix.MNT_DETACH) == nil {
+		// loop...
+		logrus.Debugf("memfd-bind: path %q unmount succeeded...", path)
+	}
+	logrus.Infof("memfd-bind: path %q has been cleared of all old bind-mounts", path)
+	return nil
+}
+
+// memfdClone is a memfd-only implementation of dmz.CloneBinary.
+func memfdClone(path string) (*os.File, error) {
+	binFile, err := os.Open(path)
+	if err != nil {
+		return nil, fmt.Errorf("failed to open runc binary path: %w", err)
+	}
+	defer binFile.Close()
+	stat, err := binFile.Stat()
+	if err != nil {
+		return nil, fmt.Errorf("checking %s size: %w", path, err)
+	}
+	size := stat.Size()
+	memfd, sealFn, err := dmz.Memfd("/proc/self/exe")
+	if err != nil {
+		return nil, fmt.Errorf("creating memfd failed: %w", err)
+	}
+	copied, err := io.Copy(memfd, binFile)
+	if err != nil {
+		return nil, fmt.Errorf("copy binary: %w", err)
+	} else if copied != size {
+		return nil, fmt.Errorf("copied binary size mismatch: %d != %d", copied, size)
+	}
+	if err := sealFn(&memfd); err != nil {
+		return nil, fmt.Errorf("could not seal fd: %w", err)
+	}
+	if !dmz.IsCloned(memfd) {
+		return nil, fmt.Errorf("cloned memfd is not properly sealed")
+	}
+	return memfd, nil
+}
+
+func mount(path string) error {
+	memfdFile, err := memfdClone(path)
+	if err != nil {
+		return fmt.Errorf("memfd clone: %w", err)
+	}
+	defer memfdFile.Close()
+	memfdPath := fmt.Sprintf("/proc/self/fd/%d", memfdFile.Fd())
+
+	// We have to open an O_NOFOLLOW|O_PATH to the memfd magic-link because we
+	// cannot bind-mount the memfd itself (it's in the internal kernel mount
+	// namespace and cross-mount-namespace bind-mounts are not allowed). This
+	// also requires that this program stay alive continuously for the
+	// magic-link to stay alive...
+	memfdLink, err := os.OpenFile(memfdPath, unix.O_PATH|unix.O_NOFOLLOW|unix.O_CLOEXEC, 0)
+	if err != nil {
+		return fmt.Errorf("mount: failed to open /proc/self/fd magic-link for memfd: %w", err)
+	}
+	defer memfdLink.Close()
+	memfdLinkFdPath := fmt.Sprintf("/proc/self/fd/%d", memfdLink.Fd())
+
+	exeFile, err := os.OpenFile(path, unix.O_PATH|unix.O_NOFOLLOW|unix.O_CLOEXEC, 0)
+	if err != nil {
+		return fmt.Errorf("mount: failed to open target runc binary path: %w", err)
+	}
+	defer exeFile.Close()
+	exeFdPath := fmt.Sprintf("/proc/self/fd/%d", exeFile.Fd())
+
+	err = unix.Mount(memfdLinkFdPath, exeFdPath, "", unix.MS_BIND, "")
+	if err != nil {
+		return fmt.Errorf("mount: failed to mount memfd on top of runc binary path target: %w", err)
+	}
+
+	// If there is a signal we want to do cleanup.
+	sigCh := make(chan os.Signal, 1)
+	signal.Notify(sigCh, os.Interrupt, unix.SIGTERM, unix.SIGINT)
+	go func() {
+		<-sigCh
+		logrus.Infof("memfd-bind: exit signal caught! cleaning up the bind-mount on %q...", path)
+		_ = cleanup(path)
+		os.Exit(0)
+	}()
+
+	// Clean up things we don't need...
+	_ = exeFile.Close()
+	_ = memfdLink.Close()
+
+	// We now have to stay alive to keep the magic-link alive...
+	logrus.Infof("memfd-bind: bind-mount of memfd over %q created -- looping forever!", path)
+	for {
+		// loop forever...
+		time.Sleep(time.Duration(1<<63 - 1))
+		// make sure the memfd isn't gc'd
+		runtime.KeepAlive(memfdFile)
+	}
+}
+
+func main() {
+	app := cli.NewApp()
+	app.Name = "memfd-bind"
+	app.Usage = usage
+
+	// Set version to be the same as runC.
+	var v []string
+	if version != "" {
+		v = append(v, version)
+	}
+	if gitCommit != "" {
+		v = append(v, "commit: "+gitCommit)
+	}
+	app.Version = strings.Join(v, "\n")
+
+	// Set the flags.
+	app.Flags = []cli.Flag{
+		cli.BoolFlag{
+			Name:  "cleanup",
+			Usage: "Do not create a new memfd-sealed file, only clean up an existing one at <path>.",
+		},
+		cli.BoolFlag{
+			Name:  "debug",
+			Usage: "Enable debug logging.",
+		},
+	}
+
+	app.Action = func(ctx *cli.Context) error {
+		args := ctx.Args()
+		if len(args) != 1 {
+			return errors.New("need to specify a single path to the runc binary")
+		}
+		path := ctx.Args()[0]
+
+		if ctx.Bool("debug") {
+			logrus.SetLevel(logrus.DebugLevel)
+		}
+
+		err := cleanup(path)
+		// We only care about cleanup errors when doing --cleanup.
+		if ctx.Bool("cleanup") {
+			return err
+		}
+		return mount(path)
+	}
+	if err := app.Run(os.Args); err != nil {
+		fmt.Fprintf(os.Stderr, "memfd-bind: %v\n", err)
+		os.Exit(1)
+	}
+}
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+[Unit]
+Description=Manage memfd-bind of %I
+Documentation=https://github.com/opencontainers/runc
+
+[Service]
+Type=simple
+ExecStart=memfd-bind "%I"
+ExecStop=memfd-bind --cleanup "%I"
+
+[Install]
+WantedBy=multi-user.target
