Skip to content
Merged
148 changes: 79 additions & 69 deletions docs/jailer.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,59 +17,70 @@ The jailer is invoked in this manner:
jailer --id <id> \
--exec-file <exec_file> \
--uid <uid> \
--gid <gid>
[--parent-cgroup <relative_path>]
[--cgroup-version <cgroup-version>]
[--cgroup <cgroup>]
[--chroot-base-dir <chroot_base>]
[--netns <netns>]
[--resource-limit <resource=value>]
[--daemonize]
[--new-pid-ns]
--gid <gid> \
[--cgroup-version <cgroup_version>] \
[--cgroup <cgroup>] \
[--parent-cgroup <parent_cgroup>] \
[--chroot-base-dir <chroot_base>] \
[--netns <netns>] \
[--resource-limit <resource=value>] \
[--daemonize] \
[--new-pid-ns] \
[--...extra arguments for Firecracker]
```

- `id` is the unique VM identification string, which may contain alphanumeric
characters and hyphens. The maximum `id` length is currently 64 characters.
- `exec_file` is the path to the Firecracker binary that will be exec-ed by the
jailer. The filename must include the string `firecracker`. This is enforced
because the interaction with the jailer is Firecracker specific.
- `uid` and `gid` are the uid and gid the jailer switches to as it execs the
target binary.
- `parent-cgroup` is used to allow the placement of microvm cgroups in custom
nested hierarchies. By specifying this parameter, the jailer will create a new
cgroup named `id` for the microvm in the `<cgroup_base>/<parent_cgroup>`
subfolder. `cgroup_base` is the cgroup controller root for `cgroup v1` (e.g.
`/sys/fs/cgroup/cpu`) or the unified controller hierarchy for `cgroup v2` (
e.g. `/sys/fs/cgroup/unified`. `<parent_cgroup>` is a relative path within
that hierarchy. For example, if `--parent-cgroup all_uvms/external_uvms` is
specified, the jailer will write all cgroup parameters specified through
`--cgroup` in `/sys/fs/cgroup/<controller_name>/all_uvms/external_uvms/<id>`.
By default, the parent cgroup is `exec-file`. If there are no `--cgroup`
parameters specified and `--group-version=2` was passed, then the jailer will
move the process to the specified cgroup.
- `cgroup-version` is used to select which type of cgroup hierarchy to use for
- `--id` specifies the unique VM identification string, which may contain
alphanumeric characters and hyphens. The maximum length is currently 64
characters.
- `--exec-file` specifies the path to the Firecracker binary that will be
exec-ed by the jailer.
- `--uid` and `--gid` specify the uid and gid the jailer switches to as it execs
the target binary.
- `--cgroup-version` is used to select which type of cgroup hierarchy to use for
the creation of cgroups. The default value is "1" which means that cgroups
specified with the `cgroup` argument will be created within a v1 hierarchy.
Supported options are "1" for cgroup-v1 and "2" for cgroup-v2.
- `cgroup` cgroups can be passed to the jailer to let it set the values when the
microVM process is spawned. The `--cgroup` argument must follow this format:
specified with `--cgroup` will be created within a v1 hierarchy. Supported
options are "1" for cgroup-v1 and "2" for cgroup-v2.
- `--cgroup` can be passed to the jailer to let it set the values when the
microVM process is spawned. The argument must follow this format:
`<cgroup_file>=<value>` (e.g `cpuset.cpus=0`). This argument can be used
multiple times to set multiple cgroups. This is useful to avoid providing
privileged permissions to another process for setting the cgroups before or
after the jailer is executed. The `--cgroup` flag can help as well to set
Firecracker process cgroups before the VM starts running, with no need to
create the entire cgroup hierarchy manually (which requires privileged
permissions).
- `chroot_base` represents the base folder where chroot jails are built. The
default is `/srv/jailer`.
- `netns` represents the path to a network namespace handle. If present, the
- `--parent-cgroup` is used to allow the placement of microvm cgroups in custom
nested hierarchies. The default value is the filename of `<exec_file>`, which
will be henceforth referred to as `<exec_file_name>`. The behavior of this
parameter depends on the following condition:
- If either any `--cgroup` parameter is specifed or `--cgroup-version=1` is
passed, the jailer will create a new cgroup named `<id>` for the microvm in
the `<cgroup_base>/<parent_cgroup>` subfolder. `<cgroup_base>` is the cgroup
controller root for cgroup v1 (e.g. `/sys/fs/cgroup/cpu`) or the unified
controller hierarchy for cgroup v2 (e.g. `/sys/fs/cgroup/unified`).
`<parent_cgroup>` is a relative path within that hierarchy. For example, if
`--parent-cgroup all_uvms/external_uvms` is specified, the jailer will write
all cgroup parameters specified through `--cgroup` in
`/sys/fs/cgroup/<controller_name>/all_uvms/external_uvms/<id>`.
- If no `--cgroup` parameters are specified and `--cgroup-version=2` is
passed, the jailer will not create a new cgroup. If the cgroup specified
with `--parent-cgroup` exists, the jailer will move the process to the
specified cgroup, contrary to its name. This behavior can be used when users
want to configure a cgroup beforehand by themselves and move the process to
the configured cgroup. Note that, if the specified cgroup has domain
controllers (e.g. memory) enabled in `cgroup.subtree_control`, the move
fails due to ["no internal process constraint"][1] and jailer exits with an
error. If the cgroup spcified with `--parent-cgroup` does not exist, the
jailer does not move the process to any cgroup and proceeds without error.
- `--chroot-base-dir` specifies the base folder where chroot jails are built.
The default is `/srv/jailer`.
- `--netns` specifies the path to a network namespace handle. If present, the
jailer will use this to join the associated network namespace.
- For extra security and control over resource usage, `resource-limit` can be
used to set bounds to the process resources. The `--resource-limit` argument
must follow this format: `<resource>=<value>` (e.g `no-file=1024`) and can be
used multiple times to set multiple bounds. Current available resources that
can be limited using this argument are:
- For extra security and control over resource usage, `--resource-limit` can be
used to set bounds to the process resources. The argument must follow this
format: `<resource>=<value>` (e.g `no-file=1024`) and can be used multiple
times to set multiple bounds. Current available resources that can be limited
using this argument are:
- `fsize`: The maximum size in bytes for files created by the process.
- `no-file`: Specifies a value one greater than the maximum file descriptor
number that can be opened by this process.
Expand All @@ -80,13 +91,13 @@ Here is an example on how to set multiple resource limits using this argument:
--resource-limit fsize=250000000 --resource-limit no-file=1024
```

- When present, the `--daemonize` flag causes the jailer to call `setsid()` and
redirect all three standard I/O file descriptors to `/dev/null`.
- When present, the `--new-pid-ns` flag causes the jailer to spawn the provided
binary into a new PID namespace. It makes use of the libc `clone()` function
with the `CLONE_NEWPID` flag. As a result, the jailer and the process running
the exec file have different PIDs. The PID of the child process is stored in
the jail root directory inside `<exec_file_name>.pid`.
- When present, `--daemonize` causes the jailer to call `setsid()` and redirect
all three standard I/O file descriptors to `/dev/null`.
- When present, `--new-pid-ns` causes the jailer to spawn the provided binary
into a new PID namespace. It makes use of the libc `clone()` function with the
`CLONE_NEWPID` flag. As a result, the jailer and the process running the exec
file have different PIDs. The PID of the child process is stored in the jail
root directory inside `<exec_file_name>.pid`.
- The jailer adheres to the "end of command options" convention, meaning all
parameters specified after `--` are forwarded to Firecracker. For example,
this can be paired with the `--config-file` Firecracker argument to specify a
Expand All @@ -99,41 +110,38 @@ Here is an example on how to set multiple resource limits using this argument:

After starting, the Jailer goes through the following operations:

- Validate **all provided paths** and the VM `id`.
- Validate **all provided paths** and the VM ID.
- Close all open file descriptors based on `/proc/<jailer-pid>/fd` except input,
output and error.
- Cleanup all environment variables received from the parent process.
- Create the `<chroot_base>/<exec_file_name>/<id>/root` folder, which will be
henceforth referred to as `chroot_dir`. `exec_file_name` is the last path
component of `exec_file` (for example, that would be `firecracker` for
`/usr/bin/firecracker`). Nothing is done if the path already exists (it should
not, since `id` is supposed to be unique).
- Copy `exec_file` to
`<chroot_base>/<exec_file_name>/<id>/root/<exec_file_name>`. This ensures the
new process will not share memory with any other Firecracker process.
henceforth referred to as `<chroot_dir>`. Nothing is done if the path already
exists (it should not, since `<id>` is supposed to be unique).
- Copy the file specified with `--exec-file` to `<chroot_dir>/<exec_file_name>`.
This ensures the new process will not share memory with any other Firecracker
process.
- Set resource bounds for current process and its children through
`--resource-limit` argument, by calling `setrlimit()` system call with the
specific resource argument. If no limits are provided, the jailer bounds
`no-file` to a maximum default value of 2048.
- Create the `cgroup` sub-folders. The jailer can use either `cgroup v1` or
- Create the cgroup sub-folders. The jailer can use either `cgroup v1` or
`cgroup v2`. On most systems, this is mounted by default in `/sys/fs/cgroup`
(should be mounted by the user otherwise). The jailer will parse
`/proc/mounts` to detect where each of the controllers required in `--cgroup`
can be found (multiple controllers may share the same path). For each
identified location (referred to as `<cgroup_base>`), the jailer creates the
`<cgroup_base>/<parent_cgroup>/<id>` subfolder, and writes the current pid to
`<cgroup_base>/<parent_cgroup>/<id>/tasks`. Also, the value passed for each
`<cgroup_file>` is written to the file. If `--node` is used the corresponding
values are written to the appropriate `cpuset.mems` and `cpuset.cpus` files.
`<cgroup_file>` is written to the file.
- Call `unshare()` into a new mount namespace, use `pivot_root()` to switch the
old system root mount point with a new one base in `chroot_dir`, switch the
old system root mount point with a new one base in `<chroot_dir>`, switch the
current working directory to the new root, unmount the old root mount point,
and call `chroot` into the current directory.
- Use `mknod` to create a `/dev/net/tun` equivalent inside the jail.
- Use `mknod` to create a `/dev/kvm` equivalent inside the jail.
- Use `chown` to change ownership of the `chroot_dir` (root path `/` as seen by
the jailed firecracker), `/dev/net/tun`, `/dev/kvm`. The ownership is changed
to the provided `uid:gid`.
- Use `chown` to change ownership of the `<chroot_dir>` (root path `/` as seen
by the jailed firecracker), `/dev/net/tun`, `/dev/kvm`. The ownership is
changed to the provided `<uid>:<gid>`.
- If `--netns <netns>` is present, attempt to join the specified network
namespace.
- If `--daemonize` is specified, call `setsid()` and redirect `STDIN`, `STDOUT`,
Expand All @@ -148,8 +156,8 @@ After starting, the Jailer goes through the following operations:
`<exec_file_name> --id=<id> --start-time-us=<opaque> --start-time-cpu-us=<opaque>`
(and also forward any extra arguments provided to the jailer after `--`, as
mentioned in the **Jailer Usage** section), where:
- `id`: (`string`) - The `id` argument provided to jailer.
- `opaque`: (`number`) time calculated by the jailer that it spent doing its
- `<id>`: (`string`) - The `<id>` argument provided to jailer.
- `<opaque>`: (`number`) time calculated by the jailer that it spent doing its
work.

## Example Run and Notes
Expand Down Expand Up @@ -192,7 +200,7 @@ It’s worth noting that, whenever a folder already exists, nothing will be done
and we move on to the next directory that needs to be created. This should only
happen for the common `firecracker` subfolder (but, as for creating the chroot
path before, we do not issue an error if folders directly associated with the
supposedly unique `id` already exist).
supposedly unique `<id>` already exist).

The jailer then writes the current pid to
`/sys/fs/cgroup/cpuset/firecracker/551e7604-e35c-42b3-b825-416853441234/tasks`,
Expand Down Expand Up @@ -229,16 +237,16 @@ call `chown(“/dev/net/tun”, 123, 100)`, so Firecracker can use it after drop
privileges. This is required to use multiple TAP interfaces when running jailed.
Do the same for `/dev/kvm`.

Change ownership of `<chroot_dir>` to `uid:gid` so that Firecracker can create
its API socket there.
Change ownership of `<chroot_dir>` to `<uid>:<gid>` so that Firecracker can
create its API socket there.

Since the `--daemonize` flag is present, call `setsid()` to join a new session,
a new process group, and to detach from the controlling terminal. Then, redirect
standard file descriptors to `/dev/null` by calling `dup2(dev_null_fd, STDIN)`,
`dup2(dev_null_fd, STDOUT)`, and `dup2(dev_null_fd, STDERR)`. Close
`dev_null_fd`, because it is no longer necessary.

Finally, the jailer switches the `uid` to `123`, and `gid` to `100`, and execs
Finally, the jailer switches the uid to `123`, and gid to `100`, and execs

```console
./firecracker \
Expand Down Expand Up @@ -301,3 +309,5 @@ Note: default value for `<api-sock>` is `/run/firecracker.socket`.
- If all the cgroup controllers are bunched up on a single mount point using the
"all" option, our current program logic will complain it cannot detect
individual controller mount points.

[1]: https://docs.kernel.org/admin-guide/cgroup-v2.html#no-internal-process-constraint
2 changes: 1 addition & 1 deletion src/jailer/src/env.rs
Original file line number Diff line number Diff line change
Expand Up @@ -218,7 +218,7 @@ impl Env {
let cg_parent_procs = cg_parent.join("cgroup.procs");
if cg_parent.exists() {
fs::write(cg_parent_procs, std::process::id().to_string())
.map_err(|_| JailerError::CgroupWrite(io::Error::last_os_error()))?;
.map_err(|_| JailerError::CgroupMove(cg_parent, io::Error::last_os_error()))?;
}
}

Expand Down
7 changes: 5 additions & 2 deletions src/jailer/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -43,8 +43,11 @@ pub enum JailerError {
CgroupInvalidVersion(String),
#[error("Parent cgroup path is invalid. Path should not be absolute or contain '..' or '.'")]
CgroupInvalidParentPath(),
#[error("Failed to write to cgroups file: {0}")]
CgroupWrite(io::Error),
#[error(
"Failed to move process to cgroup ({0}): {1}.\nHint: If you intended to create a child \
cgroup under {0}, pass any --cgroup parameters."
)]
CgroupMove(PathBuf, io::Error),
#[error("Failed to change owner for {0}: {1}")]
ChangeFileOwner(PathBuf, io::Error),
#[error("Failed to chdir into chroot directory: {0}")]
Expand Down
79 changes: 68 additions & 11 deletions tests/integration_tests/security/test_jail.py
Original file line number Diff line number Diff line change
Expand Up @@ -194,6 +194,20 @@ def move_pid(self, cgname, pid):
cg_pids = self.root.joinpath(f"{cgname}/cgroup.procs")
cg_pids.write_text(f"{pid}\n", encoding="ascii")

def enable_controller_in_subtree(self, cgname, controller):
"""Enable a controller in subtree_control of a cgroup and its ancestors"""
# Enable the controller in all ancestors if not already enabled.
parent_cg = self.root.joinpath(cgname).parent
parent_subtree_control = parent_cg.joinpath("cgroup.subtree_control")
if controller not in parent_subtree_control.read_text(encoding="ascii"):
self.enable_controller_in_subtree(
parent_cg.relative_to(self.root), controller
)

subtree_control = self.root.joinpath(f"{cgname}/cgroup.subtree_control")
subtree_control.write_text(f"+{controller}", encoding="ascii")
assert controller in subtree_control.read_text(encoding="ascii")


@pytest.fixture(scope="session", autouse=True)
def cgroups_info():
Expand Down Expand Up @@ -230,11 +244,7 @@ def check_cgroups_v2(vm):
cg_parent = cg.root / parent_cgroup
cg_jail = cg_parent / vm.jailer.jailer_id

# if no cgroups were specified, then the jailer should move the FC process
# to the parent group
if len(vm.jailer.cgroups) == 0:
procs = cg_parent.joinpath("cgroup.procs").read_text().splitlines()
assert str(vm.firecracker_pid) in procs
assert len(vm.jailer.cgroups) > 0

for cgroup in vm.jailer.cgroups:
controller = cgroup.split(".")[0]
Expand Down Expand Up @@ -393,11 +403,23 @@ def test_v1_default_cgroups(uvm_plain, cgroups_info):
check_cgroups_v1(test_microvm.jailer.cgroups, test_microvm.jailer.jailer_id)


def test_cgroups_custom_parent_move(uvm_plain, cgroups_info):
@pytest.mark.parametrize(
"parent_exists,domain_controller_in_subtree",
[(True, False), (True, True), (False, None)],
)
def test_cgroups_parent_cgroup_but_no_cgroup(
uvm_plain, cgroups_info, parent_exists, domain_controller_in_subtree
):
"""
Test cgroups when a custom parent cgroup is used and no cgroups are specified
Test cgroups when `--parent-cgroup` is used but no `--cgroup` are specified.

In this case we just want to move under the parent cgroup
If the cgroup specified with `--parent-cgroup` exists, the jailer should
move to the specified cgroup instead of creating a new cgroup under it.
However, if the specified cgroup has domain controllers (e.g. `memory`)
enabled in `cgroup.subtree_control`, the move should fail.

If the specified cgroup does not exist, the jailer does not move the process
to any cgroup and proceeds without error.
"""
if cgroups_info.version != 2:
pytest.skip("cgroupsv2 only")
Expand All @@ -407,9 +429,44 @@ def test_cgroups_custom_parent_move(uvm_plain, cgroups_info):
parent_cgroup = f"custom_cgroup/{test_microvm.id[:8]}"
test_microvm.jailer.parent_cgroup = parent_cgroup

cgroups_info.new_cgroup(parent_cgroup)
test_microvm.spawn()
check_cgroups_v2(test_microvm)
if parent_exists:
# Create the parent cgroup.
cgroups_info.new_cgroup(parent_cgroup)
if domain_controller_in_subtree:
# Enable "memory" controller in cgroup.subtree_control of the parent.
cgroups_info.enable_controller_in_subtree(parent_cgroup, "memory")

# Check no --cgroups are specified just in case.
assert len(test_microvm.jailer.cgroups) == 0

cg_parent = cgroups_info.root / parent_cgroup

if parent_exists:
if domain_controller_in_subtree:
# The jailer should have failed to move to the `parent_cgroup`
# since it has domain controllers enabled in
# `cgroup.subtree_control` due to the no internal process
# constraint.
# https://docs.kernel.org/admin-guide/cgroup-v2.html#no-internal-process-constraint
with pytest.raises(
ChildProcessError,
match=(
rf"Failed to move process to cgroup \({cg_parent}\): "
r"Resource busy \(os error 16\)"
),
):
test_microvm.spawn()
else:
# The jailer should have moved to the `parent_cgroup` instead of
# creating a new cgroup under it and move to the new cgroup.
test_microvm.spawn()
procs = cg_parent.joinpath("cgroup.procs").read_text().splitlines()
assert str(test_microvm.firecracker_pid) in procs
else:
# The jailer should not have moved to any cgroup and the parent
# still does not exist.
test_microvm.spawn()
assert not cg_parent.exists()


def test_args_default_resource_limits(uvm_plain):
Expand Down