container escape and denial of service due to arbitrary write gadgets and procfs write redirects

@lifubang

Impact

This attack is primarily a more sophisticated version of CVE-2019-19921, a
flaw that allowed an attacker to trick runc into writing the LSM process
labels for a container process into a dummy tmpfs file, so that the correct
LSM labels were never applied to the container process. The mitigation we
applied for CVE-2019-19921 was fairly limited: it effectively only made runc
verify, when writing LSM labels, that the files being written to are actual
procfs files.

Rather than using a fake tmpfs file for /proc/self/attr/<label>, an
attacker could instead (through various means) make /proc/self/attr/<label>
reference a real procfs file, but one that would still be a no-op (such as
/proc/self/sched). This would have the same effect but would pass the "is a
procfs file" check. We were aware that this kind of attack would be possible
(even going so far as to discuss this publicly as "future work" at
conferences), and we were working on a far more comprehensive mitigation of
this attack, but this security issue was disclosed before we could complete
this work.
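
To make the limitation concrete, the following is a minimal Python sketch of an "is this a procfs file" check in the spirit of the CVE-2019-19921 mitigation. It is not runc's code (runc is written in Go). The helper name is_procfs_file, the use of ctypes to call fstatfs(2), and the assumption of a Linux libc whose struct statfs begins with a native long f_type field are ours. Such a check gives the same answer for /proc/self/attr/current and /proc/self/sched, so it cannot tell the intended target from a redirected one.

```python
import ctypes
import os

PROC_SUPER_MAGIC = 0x9FA0  # procfs magic number from <linux/magic.h>

# Assumption: Linux, with fstatfs(2) exported by the already-loaded libc.
_libc = ctypes.CDLL(None, use_errno=True)


def is_procfs_file(path):
    """Return True if the file opened at `path` lives on a procfs mount."""
    fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    try:
        # Oversized scratch buffer standing in for struct statfs. Only the
        # first field (f_type, assumed to be a native long) is read.
        buf = ctypes.create_string_buffer(512)
        if _libc.fstatfs(fd, buf) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err), path)
        return ctypes.c_long.from_buffer(buf).value == PROC_SUPER_MAGIC
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Both paths are expected to report True on a typical Linux host, which
    # is exactly why this check cannot distinguish them.
    for path in ("/proc/self/attr/current", "/proc/self/sched"):
        try:
            print(path, is_procfs_file(path))
        except OSError as exc:
            print(path, "unavailable:", exc)
```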

In all known versions of runc, an attacker can trick runc into misdirecting
writes intended for one /proc file to other procfs files by racing a second
container that has shared mounts. We have also verified that this attack can
be exploited using a standard Dockerfile with docker buildx build, as that
also permits triggering parallel execution of containers with custom shared
mounts configured. The redirect could be achieved through symbolic links in a
tmpfs or, theoretically, through other methods such as regular bind-mounts.
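
The redirect primitive itself is ordinary symlink following. The Python sketch below reproduces it harmlessly inside a temporary directory, with no procfs, no containers and no race. A path-based write lands in whatever the final path component points to, while opening with O_NOFOLLOW refuses a symlinked final component. The file names label and decoy are illustrative stand-ins, and O_NOFOLLOW is shown only for contrast. It is not presented here as runc's fix, since it does not cover intermediate path components or bind-mounts.

```python
import errno
import os
import tempfile


def demo_symlink_redirect():
    """Show a path-based write being redirected by a symlink."""
    with tempfile.TemporaryDirectory() as root:
        decoy = os.path.join(root, "decoy")  # stand-in for a no-op file
        label = os.path.join(root, "label")  # stand-in for the intended target
        open(decoy, "w").close()
        os.symlink(decoy, label)             # attacker-controlled redirect

        # Naive path-based write: silently follows the symlink.
        with open(label, "w") as f:
            f.write("docker-default")
        with open(decoy) as f:
            redirected = f.read()

        # O_NOFOLLOW rejects a symlink in the final path component (ELOOP).
        try:
            fd = os.open(label, os.O_WRONLY | os.O_NOFOLLOW)
            os.close(fd)
            refused = False
        except OSError as exc:
            refused = exc.errno == errno.ELOOP
        return redirected, refused
```

Calling demo_symlink_redirect() returns the text that ended up in the decoy file and whether the O_NOFOLLOW open was refused.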

Note that while /proc/self/attr/<label> was the example used above (which is
LSM-specific), this issue affects all writes to /proc in runc and thus also
affects sysctls (written to /proc/sys/...) and some other APIs.

Additional Impacts

While investigating this issue, we discovered that another risk with these
redirected writes is that they could be redirected to dangerous files such as
/proc/sysrq-trigger rather than just no-op files like /proc/self/sched.
For instance, the default AppArmor profile name in Docker is docker-default,
which when written to /proc/sysrq-trigger would cause the host system to
crash.

When this was discovered, we conducted an audit of other write operations
within runc and found several possible areas where runc could be used as a
semi-arbitrary write gadget when combined with the above race attacks. The most
concerning attack scenario was the configuration of sysctls. Because the
contents of the sysctl are free-form text, an attacker could use a misdirected
write to write to /proc/sys/kernel/core_pattern and break out of the
container (as described in CVE-2025-31133, kernel upcalls are not namespaced
and so coredump helpers will run with complete root privileges on the host).
Even if the attacker cannot configure custom sysctls, a valid sysctl string
(when redirected to /proc/sysrq-trigger) can easily cause the machine to
hang.
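
For readers unfamiliar with how sysctls reach procfs: each dotted key names a file under /proc/sys, and the configured value is written to that file as text. The Python helper below (a hypothetical sysctl_path, simplified and not runc's code) shows only that mapping. It performs a purely lexical check, which by itself offers no protection against the redirects described above, because the file actually opened at the resulting path may not be the file the name suggests.

```python
import posixpath


def sysctl_path(key):
    """Map a dotted sysctl key to its file under /proc/sys.

    Simplified sketch: keys whose components are empty or contain "/"
    are rejected rather than normalised.
    """
    parts = key.split(".")
    if any(part == "" or "/" in part for part in parts):
        raise ValueError(f"suspicious sysctl key: {key!r}")
    return posixpath.join("/proc/sys", *parts)


if __name__ == "__main__":
    print(sysctl_path("kernel.core_pattern"))
```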

Note that the fact that this attack allows you to disable LSM labels makes it a
very useful attack to combine with CVE-2025-31133 (as one of the only
mitigations available to most users for that issue is AppArmor, and this attack
would let you bypass that). However, the misdirected write issue above means
that you could also achieve most of the same goals without needing to chain
together attacks.

Taking the above additional impacts into account, this attack was analysed as
having a CVSSv4 severity of 7.3 (High) using the vector
CVSS:4.0/AV:L/AC:L/AT:P/PR:L/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H.
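
As a reading aid, the vector string can be split into its CVSS v4.0 base metrics. The Python sketch below only expands the metric abbreviations and does not compute the score. The function name parse_cvss4_vector is our own, and metrics outside the base group are deliberately rejected.

```python
# CVSS v4.0 base metric abbreviations and their full names.
BASE_METRICS = {
    "AV": "Attack Vector",
    "AC": "Attack Complexity",
    "AT": "Attack Requirements",
    "PR": "Privileges Required",
    "UI": "User Interaction",
    "VC": "Vulnerable System Confidentiality",
    "VI": "Vulnerable System Integrity",
    "VA": "Vulnerable System Availability",
    "SC": "Subsequent System Confidentiality",
    "SI": "Subsequent System Integrity",
    "SA": "Subsequent System Availability",
}


def parse_cvss4_vector(vector):
    """Return {metric name: value abbreviation} for a base-metric vector."""
    prefix, *pairs = vector.split("/")
    if prefix != "CVSS:4.0":
        raise ValueError("not a CVSS v4.0 vector")
    metrics = {}
    for pair in pairs:
        key, sep, value = pair.partition(":")
        if not sep or key not in BASE_METRICS:
            raise ValueError(f"unexpected metric: {pair!r}")
        metrics[BASE_METRICS[key]] = value
    return metrics


if __name__ == "__main__":
    vector = "CVSS:4.0/AV:L/AC:L/AT:P/PR:L/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H"
    for name, value in parse_cvss4_vector(vector).items():
        print(f"{name}: {value}")
```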

Patches

This advisory is being published as part of a set of three advisories:

The patches fixing this issue have accordingly been combined into a single
patchset. The following patches from that patchset resolve the issues in this
advisory:

- db19bbe ("internal/sys: add VerifyInode helper")
- 6fc1914 ("internal: move utils.MkdirAllInRoot to internal/pathrs")
- ff94f99 ("*: switch to safer securejoin.Reopen")
- 44a0fcf ("go.mod: update to github.com/cyphar/[email protected]")
- 77889b5 ("internal: add wrappers for securejoin.Proc*")
- fdcc9d3 ("apparmor: use safe procfs API for labels")
- ff6fe13 ("utils: use safe procfs for /proc/self/fd loop code")
- b3dd1bc ("utils: remove unneeded EnsureProcHandle")
- 77d217c ("init: write sysctls using safe procfs API")
- 435cc81 ("init: use securejoin for /proc/self/setgroups")
- d61fd29 ("libct/system: use securejoin for /proc/$pid/stat")
- 4b37cd9 ("libct: align param type for mountCgroupV1/V2 functions")
- d40b343 ("rootfs: switch to fd-based handling of mountpoint targets")
- ed6b169 ("selinux: use safe procfs API for labels")
  - Please note that this patch includes a private patch for
    github.com/opencontainers/selinux that could not be made public through
    a public pull request (as it would necessarily disclose this embargoed
    security issue).

    The patch includes a complete copy of the forked code and a replace
    directive (as well as go mod vendor applied), which should still work
    with downstream build systems. If you cannot apply this patch, you can
    safely drop it -- some of the other patches in this series should block
    these kinds of racing mount attacks entirely.

    See opencontainers/selinux#237 for the upstream patch.
- 3f92552 ("rootfs: re-allow dangling symlinks in mount targets")
- a41366e ("openat2: improve resilience on busy systems")

runc 1.2.8, 1.3.3, and 1.4.0-rc.3 have been released and all contain fixes for these
issues. As per our new release model, runc 1.1.x and earlier are
no longer supported and thus have not been patched.
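
The version statement above can be turned into a quick check. The Python sketch below is our own helper (runc_version_is_patched is a hypothetical name) and rests on two assumptions not stated in this advisory: that the final 1.4.0 release and any later series carry the same fixes as 1.4.0-rc.3, and that versions are written as X.Y.Z or X.Y.Z-rc.N with an optional leading "v".

```python
import re

_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-rc\.(\d+))?$")

# First fixed patch release of each stable series named in the advisory.
_FIRST_FIXED_PATCH = {(1, 2): 8, (1, 3): 3}


def runc_version_is_patched(version):
    """Return True if `version` is at or after a release containing the fixes."""
    m = _VERSION_RE.match(version.strip())
    if m is None:
        raise ValueError(f"unrecognised runc version: {version!r}")
    major, minor, patch = (int(m.group(i)) for i in (1, 2, 3))
    rc = int(m.group(4)) if m.group(4) is not None else None
    series = (major, minor)

    if series < (1, 2):
        return False  # 1.1.x and earlier: unsupported, not patched
    if series in _FIRST_FIXED_PATCH:
        fixed = _FIRST_FIXED_PATCH[series]
        # A release candidate sorts before the final release of that number.
        return patch > fixed or (patch == fixed and rc is None)
    if series == (1, 4):
        # Fixed from 1.4.0-rc.3; assumption: 1.4.0 final and later keep the fix.
        return patch > 0 or rc is None or rc >= 3
    # Assumption: series after 1.4 are taken to include the fixes.
    return True
```

Pass only the version number (for example the "1.2.8" part of the first line printed by runc --version), not the whole output.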

Mitigations

- Do not run untrusted container images from unknown or unverified sources.
- The basic no-op attack allows a container process to run with the same LSM
  labels as runc. For most AppArmor deployments this means it will be
  unconfined, and for SELinux it will likely be container_runtime_t. We
  haven't conducted in-depth testing of the impact on SELinux -- it is
  possible that it provides some reasonable protection but it seems likely
  that an attacker could cause harm to systems even with such an SELinux
  setup.
- For the more involved redirect and write gadget attacks, unfortunately most
  LSM profiles (including the standard container-selinux profiles) provide
  the container runtime access to sysctl files (including
  /proc/sysrq-trigger) and so LSMs likely do not provide much protection
  against these attacks.
- Using rootless containers provides some protection against these kinds of
  bugs (privileged writes in runc being redirected) -- by having runc itself
  be an unprivileged process, in general you would expect the impact scope of
  a runc bug to be less severe as it would only have the privileges afforded
  to the host user which spawned runc. For this particular bug, the privilege
  escalation caused by the inadvertent write issue is entirely mitigated with
  rootless containers because the unprivileged user that the runc process is
  executing as cannot write to the aforementioned procfs files (even
  intentionally).
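
The rootless point can be checked directly on a host. The Python sketch below is a rough illustration: the function name writable_sensitive_files and the choice of the two files named in this advisory as the list to test are ours. It reports which of those files the calling user could open for writing. Run as the unprivileged user that launches rootless runc, it would be expected to return an empty list.

```python
import os

# Files named in this advisory as dangerous redirect targets.
SENSITIVE_FILES = ("/proc/sysrq-trigger", "/proc/sys/kernel/core_pattern")


def writable_sensitive_files(paths=SENSITIVE_FILES):
    """Return the subset of `paths` the calling user may open for writing.

    os.access() checks with the real uid/gid and only reflects the calling
    process's own view of the filesystem.
    """
    return [path for path in paths if os.access(path, os.W_OK)]


if __name__ == "__main__":
    print(writable_sensitive_files())
```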

References

Other Runtimes

As this vulnerability boils down to a fairly easy-to-make logic bug, we have
provided information to other OCI (crun, youki) and non-OCI (LXC) container
runtimes about this vulnerability.

Based on discussions with other runtimes, it seems that crun and youki may have
similar security issues and will release a co-ordinated security release along
with runc. LXC appears to use the host's /proc for all procfs operations, and
so is likely not vulnerable to this issue (this is a trade-off -- runc uses the
container's procfs to avoid CVE-2016-9962-style attacks).

Credits

Thanks to Li Fubang (@lifubang from acmcoder.com, CIIC) and Tõnis Tiigi
(@tonistiigi from Docker) for both independently discovering this
vulnerability, as well as Aleksa Sarai (@cyphar from SUSE) for the original
research into this class of security issues and solutions.

Additional thanks go to Tõnis Tiigi for finding some very useful exploit
templates for these kinds of race attacks using docker buildx build.

Weaknesses

UNIX Symbolic Link (Symlink) Following

Race Condition Enabling Link Following
