
Commit 48233bc

Use private cgroup namespaces for cgroup v2
Using the host's cgroup namespace along with a writable mount of the entire cgroup fs messes with container isolation quite a bit. The main purpose of this is to get a writable mount of the cgroup fs inside containers, so that init systems are able to set up their own cgroups accordingly.

Use a different approach to achieve the same effect: Use a private cgroup namespace. Privileged containers will automatically have write access. A read-write mount is only performed when running non-privileged containers.

Signed-off-by: Tom Wieczorek <[email protected]>
1 parent 781d1c2 commit 48233bc

1 file changed: +52 -4 lines changed


pkg/cluster/cluster.go

Lines changed: 52 additions & 4 deletions
@@ -12,6 +12,7 @@ import (
 	"fmt"
 	"io"
 	"os"
+	"path"
 	"path/filepath"
 	"regexp"
 	"strconv"
@@ -307,10 +308,57 @@ func (c *Cluster) createMachineRunArgs(machine *Machine, name string, i int) []s
 		"--tmpfs", "/tmp:exec,mode=777",
 	}
 	if docker.CgroupVersion() == "2" {
-		runArgs = append(runArgs, "--cgroupns", "host",
-			"--cgroup-parent", "bootloose.slice",
-			"-v", "/sys/fs/cgroup:/sys/fs/cgroup:rw")
-
+		runArgs = append(runArgs, "--cgroupns", "private")
+
+		if !machine.spec.Privileged {
+			// Non-privileged containers will have their /sys/fs/cgroup folder
+			// mounted read-only, even when running in private cgroup
+			// namespaces. This is a bummer for init systems. Containers could
+			// probably remount the cgroup fs in read-write mode, but that would
+			// require CAP_SYS_ADMIN _and_ a custom logic in the container's
+			// entry point. Podman has `--security-opt unmask=/sys/fs/cgroup`,
+			// but that's not a thing for Docker. The only other way to get a
+			// writable cgroup fs inside the container is to explicitly mount
+			// it. Some references:
+			// - https://github.com/moby/moby/issues/42275
+			// - https://serverfault.com/a/1054414
+
+			// Docker will use cgroups like
+			// <cgroup-parent>/docker-{{ContainerID}}.scope.
+			//
+			// Ideally, we could mount those to /sys/fs/cgroup inside the
+			// containers. But there's some chicken-and-egg problem, as we only
+			// know the container ID _after_ the container creation. As a
+			// duct-tape solution, we mount our own cgroup as the root, which is
+			// unrelated to the Docker-managed one:
+			// <cgroup-parent>/cluster-{{ClusterID}}.scope/machine-{{MachineID}}.scope
+
+			// FIXME: How to clean this up? Especially when Docker is being run
+			// on a different machine?
+
+			// Just assume that the cgroup fs is mounted at its default
+			// location. We could try to figure this out via
+			// /proc/self/mountinfo, but it's really not worth the hassle.
+			const cgroupMountpoint = "/sys/fs/cgroup"
+
+			// Use this as the parent cgroup for everything. Note that if Docker
+			// uses the systemd cgroup driver, the cgroup name has to end with
+			// .slice. This is not a requirement for the cgroupfs driver; it
+			// won't care. Hence, just always use the .slice suffix, no matter
+			// if it's required or not.
+			const cgroupParent = "bootloose.slice"
+
+			cg := path.Join(
+				cgroupMountpoint, cgroupParent,
+				fmt.Sprintf("cluster-%s.scope", c.spec.Cluster.Name),
+				fmt.Sprintf("machine-%s.scope", name),
+			)
+
+			runArgs = append(runArgs,
+				"--cgroup-parent", cgroupParent,
+				"-v", fmt.Sprintf("%s:%s:rw", cg, cgroupMountpoint),
+			)
+		}
 	} else {
 		runArgs = append(runArgs, "-v", "/sys/fs/cgroup:/sys/fs/cgroup:ro")
 	}
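
To make the new cgroup v2 branch easier to follow in isolation, here is a minimal standalone sketch of the same argument construction. The function name `cgroupV2RunArgs` and its parameters are hypothetical and not part of the commit; the flag values and the path layout are taken from the diff, with the cluster and machine names passed in directly instead of being read from `c.spec` and `machine.spec`.

```go
package main

import (
	"fmt"
	"path"
)

// cgroupV2RunArgs returns the extra `docker run` arguments for one machine on
// a cgroup v2 host. Hypothetical helper for illustration; the values mirror
// the cgroup v2 branch of createMachineRunArgs.
func cgroupV2RunArgs(clusterName, machineName string, privileged bool) []string {
	// Every container gets its own cgroup namespace.
	args := []string{"--cgroupns", "private"}
	if privileged {
		// Per the commit message, privileged containers already have write
		// access, so no explicit mount is added.
		return args
	}

	const cgroupMountpoint = "/sys/fs/cgroup" // assumed default mount location
	const cgroupParent = "bootloose.slice"    // .slice suffix satisfies the systemd driver

	// Host-side cgroup directory that is mounted as the container's cgroup root.
	cg := path.Join(
		cgroupMountpoint, cgroupParent,
		fmt.Sprintf("cluster-%s.scope", clusterName),
		fmt.Sprintf("machine-%s.scope", machineName),
	)

	return append(args,
		"--cgroup-parent", cgroupParent,
		"-v", fmt.Sprintf("%s:%s:rw", cg, cgroupMountpoint),
	)
}
```

For a non-privileged machine `node0` in a cluster named `demo`, this yields `--cgroupns private --cgroup-parent bootloose.slice -v /sys/fs/cgroup/bootloose.slice/cluster-demo.scope/machine-node0.scope:/sys/fs/cgroup:rw`. A privileged machine only gets `--cgroupns private`.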
