Commit 4270d17

Merge pull request kubernetes#639 from jessfraz/pr-180
Add support for `no_new_privs` via allowPrivilegeEscalation
2 parents 65e448e + 01390e7 commit 4270d17

1 file changed

+141
-0
lines changed

no-new-privs.md

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
1+
# No New Privileges
2+
3+
- [Description](#description)
  * [Interactions with other Linux primitives](#interactions-with-other-linux-primitives)
- [Current Implementations](#current-implementations)
  * [Support in Docker](#support-in-docker)
  * [Support in rkt](#support-in-rkt)
  * [Support in OCI runtimes](#support-in-oci-runtimes)
- [Existing SecurityContext objects](#existing-securitycontext-objects)
- [Changes of SecurityContext objects](#changes-of-securitycontext-objects)
- [Pod Security Policy changes](#pod-security-policy-changes)
12+
13+
14+
## Description
15+
16+
In Linux, the `execve` system call can grant a newly executed program more
privileges than the calling process had, for example through setuid binaries
or file capabilities. To close this escalation path, Linux kernel v3.5 added a
flag named `no_new_privs` that prevents such new privileges from being granted.
20+
21+
[`no_new_privs`](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt)
22+
is inherited across `fork`, `clone` and `execve` and cannot be unset. With
23+
`no_new_privs` set, `execve` promises not to grant the privilege to do anything
24+
that could not have been done without the `execve` call.
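
The inheritance described above can be observed directly from user space. The following is a minimal sketch, assuming a Linux kernel of v3.5 or later and a libc reachable through `ctypes`. The helper names (`_prctl`, `flag_seen_by_grandchild`) are made up for illustration; the option numbers come from `<linux/prctl.h>`. The flag is set in a forked child rather than in the calling process, because it cannot be cleared once set.

```python
import ctypes
import os

# Option numbers from <linux/prctl.h>.
PR_SET_NO_NEW_PRIVS = 38
PR_GET_NO_NEW_PRIVS = 39

# Handle to the C library already loaded into this process (Linux only).
_libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option: int, arg2: int) -> int:
    """Call prctl(2); the unused arguments must be zero for these options."""
    zero = ctypes.c_ulong(0)
    result = _libc.prctl(option, ctypes.c_ulong(arg2), zero, zero, zero)
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def get_no_new_privs() -> int:
    """Return 1 if no_new_privs is set for the calling thread, else 0."""
    return _prctl(PR_GET_NO_NEW_PRIVS, 0)


def set_no_new_privs() -> None:
    """Set no_new_privs for the calling thread; this cannot be undone."""
    _prctl(PR_SET_NO_NEW_PRIVS, 1)


def flag_seen_by_grandchild() -> int:
    """Set the flag in a child, fork again, and return what the grandchild reads."""
    child = os.fork()
    if child == 0:
        try:
            set_no_new_privs()
            grandchild = os.fork()
            if grandchild == 0:
                # The grandchild never set the flag itself; it inherited it.
                os._exit(get_no_new_privs())
            _, status = os.waitpid(grandchild, 0)
            os._exit(os.WEXITSTATUS(status))
        except BaseException:
            os._exit(2)  # signal an unexpected failure to the parent
    _, status = os.waitpid(child, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print("grandchild sees no_new_privs =", flag_seen_by_grandchild())
```

The calling process is left untouched, so the sketch can be run repeatedly without permanently changing the shell or interpreter that launches it.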
25+
26+
For more details about `no_new_privs`, please check the
27+
[Linux kernel documentation](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt).
28+
29+
This differs from `NOSUID` in that `no_new_privs` also lets the container
process further restrict its child processes with seccomp. The permission is
one-way: the container process cannot grant more permissions, only restrict
further.
33+
34+
### Interactions with other Linux primitives
35+
36+
- suid binaries: will break when `no_new_privs` is enabled
37+
- seccomp2 as a non root user: requires `no_new_privs`
38+
- seccomp2 with dropped `CAP_SYS_ADMIN`: requires `no_new_privs`
39+
- ambient capabilities: requires `no_new_privs`
40+
- selinux transitions: related bugs have been fixed, as documented [here](https://github.com/moby/moby/issues/23981#issuecomment-233121969)
41+
42+
43+
## Current Implementations
44+
45+
### Support in Docker
46+
47+
Since Docker 1.11, a user can specify `--security-opt` to enable `no_new_privs`
48+
while creating containers, for example
49+
`docker run --security-opt=no-new-privileges busybox`.
50+
51+
Docker's Go API provides an object named `ContainerCreateConfig` for
configuring container creation parameters. It contains a string array,
`HostConfig.SecurityOpt`, that holds the security options. Clients can use
this field to pass security options when creating new containers.
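
As a rough illustration of how a client would populate that field, the sketch below builds the JSON body for a container-create request. The helper name `container_create_body` is hypothetical. The sketch assumes the option string accepted by Docker is `no-new-privileges` and that the body follows the `Image` / `HostConfig.SecurityOpt` layout of the Docker Engine API; it does not contact a Docker daemon.

```python
import json
from typing import Any, Dict


def container_create_body(image: str, no_new_privileges: bool = False) -> Dict[str, Any]:
    """Build a container-create request body (hypothetical helper)."""
    # SecurityOpt is a flat list of strings; each entry is one security option.
    security_opt = ["no-new-privileges"] if no_new_privileges else []
    return {"Image": image, "HostConfig": {"SecurityOpt": security_opt}}


if __name__ == "__main__":
    body = container_create_body("busybox", no_new_privileges=True)
    print(json.dumps(body, indent=2))
```

Because every option is encoded as a free-form string in one list, callers have to know the exact spelling of each option, which hints at why the field is awkward to extend.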
56+
57+
This field did not scale well for the Docker client, so it's suggested that
58+
Kubernetes does not follow that design.
59+
60+
This is not on by default in Docker.
61+
62+
More details of the Docker implementation can be read
63+
[here](https://github.com/moby/moby/pull/20727) as well as the original
64+
discussion [here](https://github.com/moby/moby/issues/20329).
65+
66+
### Support in rkt
67+
68+
Since rkt v1.26.0, the `NoNewPrivileges` option has been enabled in rkt.
69+
70+
More details of the rkt implementation can be read
71+
[here](https://github.com/rkt/rkt/pull/2677).
72+
73+
### Support in OCI runtimes
74+
75+
Since version 0.3.0 of the OCI runtime specification, a user can specify the
76+
`noNewPrivileges` boolean flag in the configuration file.
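
For illustration, the sketch below enables the flag in an OCI configuration that has already been loaded as a dictionary. It assumes the flag is spelled `noNewPrivileges` and lives under the `process` object of `config.json`, as in the runtime specification; the helper name `with_no_new_privileges` is made up for this example.

```python
import json
from typing import Any, Dict


def with_no_new_privileges(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an OCI config with process.noNewPrivileges enabled."""
    updated = dict(config)
    # Copy the nested object so the caller's config is not modified.
    process = dict(updated.get("process", {}))
    process["noNewPrivileges"] = True
    updated["process"] = process
    return updated


if __name__ == "__main__":
    original = {"ociVersion": "1.0.0", "process": {"args": ["sh"]}}
    print(json.dumps(with_no_new_privileges(original), indent=2))
```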
77+
78+
More details of the OCI implementation can be read
79+
[here](https://github.com/opencontainers/runtime-spec/pull/290).
80+
81+
## Existing SecurityContext objects
82+
83+
Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext`
84+
for `PodSpec`. `SecurityContext` objects define the related security options
85+
for Kubernetes containers, e.g. selinux options.
86+
87+
To support "no new privileges" options in Kubernetes, it is proposed to make
88+
the following changes:
89+
90+
## Changes of SecurityContext objects
91+
92+
Add a new `*bool` type field named `allowPrivilegeEscalation` to the `SecurityContext`
93+
definition.
94+
95+
By default, i.e. when `allowPrivilegeEscalation=nil`, we will set `no_new_privs=true`
96+
with the following exceptions:
97+
98+
- when a container is `privileged`
99+
- when `CAP_SYS_ADMIN` is added to a container
100+
- when a container is not run as root, i.e. its uid is not `0` (to prevent
breaking suid binaries)
102+
103+
The API will reject as invalid the combination of `privileged=true` and
`allowPrivilegeEscalation=false`, as well as the combination of
`capAdd=CAP_SYS_ADMIN` and `allowPrivilegeEscalation=false`.
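
The two rejected combinations can be sketched as a small validation function. This is only an illustration of the rule, not the Kubernetes validation code: the function name, its parameters, and the error messages are assumptions, and the capability is spelled `CAP_SYS_ADMIN` as in this proposal.

```python
from typing import Iterable, List, Optional


def validate_security_context(
    allow_privilege_escalation: Optional[bool],
    privileged: bool = False,
    added_capabilities: Iterable[str] = (),
) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors: List[str] = []
    # Only an explicit False conflicts; None (unset) and True are accepted.
    if allow_privilege_escalation is False:
        if privileged:
            errors.append(
                "privileged=true cannot be combined with allowPrivilegeEscalation=false"
            )
        if "CAP_SYS_ADMIN" in set(added_capabilities):
            errors.append(
                "capAdd=CAP_SYS_ADMIN cannot be combined with allowPrivilegeEscalation=false"
            )
    return errors


if __name__ == "__main__":
    print(validate_security_context(False, privileged=True))
    print(validate_security_context(None, privileged=True))
```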
106+
107+
When `allowPrivilegeEscalation` is set to `false` it will enable `no_new_privs`
108+
for that container.
109+
110+
`allowPrivilegeEscalation` in `SecurityContext` provides container-level
control of the `no_new_privs` flag: setting it explicitly to `true` or `false`
overrides the default behavior in either direction.
113+
114+
This requires changes to the Docker, rkt, and CRI runtime integrations so that
115+
kubelet will add the specific `no_new_privs` option.
116+
117+
## Pod Security Policy changes
118+
119+
The default can be set via a new `*bool` type field named `defaultAllowPrivilegeEscalation`
120+
in a Pod Security Policy.
121+
This would allow users to set `defaultAllowPrivilegeEscalation=false`, overriding the
122+
default `nil` behavior of `no_new_privs=false` for containers
123+
whose uids are not 0.
124+
125+
This would also keep the behavior of setting the security context as
126+
`allowPrivilegeEscalation=true`
127+
for privileged containers and those with `capAdd=CAP_SYS_ADMIN`.
128+
129+
To recap, below is a table defining the default behavior at the pod security
130+
policy level and what can be set as a default with a pod security policy.
131+
132+
| allowPrivilegeEscalation setting | uid = 0 or unset   | uid != 0           | privileged/CAP_SYS_ADMIN |
|----------------------------------|--------------------|--------------------|--------------------------|
| nil                              | no_new_privs=true  | no_new_privs=false | no_new_privs=false       |
| false                            | no_new_privs=true  | no_new_privs=true  | no_new_privs=false       |
| true                             | no_new_privs=false | no_new_privs=false | no_new_privs=false       |
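
The table can be read as a small decision function. The sketch below encodes it literally, one branch per column group. The function name and parameters are illustrative only, with `None` standing in for `nil` and for an unset uid.

```python
from typing import Optional


def no_new_privs(
    allow_privilege_escalation: Optional[bool],
    uid: Optional[int] = None,
    privileged: bool = False,
    cap_sys_admin: bool = False,
) -> bool:
    """Return the no_new_privs value the table assigns to a container."""
    # Last column: privileged or CAP_SYS_ADMIN containers never get the flag.
    if privileged or cap_sys_admin:
        return False
    # First row: the default depends on whether the container runs as root.
    if allow_privilege_escalation is None:
        return uid is None or uid == 0
    # Remaining rows: an explicit setting is simply inverted.
    return not allow_privilege_escalation


if __name__ == "__main__":
    for setting in (None, False, True):
        row = [
            no_new_privs(setting, uid=0),
            no_new_privs(setting, uid=1000),
            no_new_privs(setting, privileged=True),
        ]
        print(setting, row)
```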
137+
138+
A new `bool` field named `allowPrivilegeEscalation` will be added to the Pod
139+
Security Policy as well to gate whether or not a user is allowed to set the
140+
security context to `allowPrivilegeEscalation=true`. This field will default to
141+
false.
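
How the two Pod Security Policy fields might interact with a container's own setting is sketched below. This is one possible reading under stated assumptions, not the admission logic itself: the function name is hypothetical, an unset container value is filled from `defaultAllowPrivilegeEscalation`, an effective value of `true` is rejected unless the policy's `allowPrivilegeEscalation` gate is `true`, and a value that remains unset is passed through unchanged.

```python
from typing import Optional


def apply_pod_security_policy(
    requested: Optional[bool],
    default_allow_privilege_escalation: Optional[bool] = None,
    allow_privilege_escalation: bool = False,
) -> Optional[bool]:
    """Return the effective allowPrivilegeEscalation value or raise ValueError."""
    # Assumption: the policy default only applies when the container left it unset.
    effective = requested
    if effective is None:
        effective = default_allow_privilege_escalation
    # Assumption: the gate blocks an effective True, whatever its origin.
    if effective is True and not allow_privilege_escalation:
        raise ValueError("policy does not permit allowPrivilegeEscalation=true")
    return effective


if __name__ == "__main__":
    print(apply_pod_security_policy(None, default_allow_privilege_escalation=False))
    print(apply_pod_security_policy(True, allow_privilege_escalation=True))
```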
