diff --git a/docs/jailer.md b/docs/jailer.md index 5fda82154e7..98db675d726 100644 --- a/docs/jailer.md +++ b/docs/jailer.md @@ -17,43 +17,31 @@ The jailer is invoked in this manner: jailer --id \ --exec-file \ --uid \ - --gid - [--parent-cgroup ] - [--cgroup-version ] - [--cgroup ] - [--chroot-base-dir ] - [--netns ] - [--resource-limit ] - [--daemonize] - [--new-pid-ns] + --gid \ + [--cgroup-version ] \ + [--cgroup ] \ + [--parent-cgroup ] \ + [--chroot-base-dir ] \ + [--netns ] \ + [--resource-limit ] \ + [--daemonize] \ + [--new-pid-ns] \ [--...extra arguments for Firecracker] ``` -- `id` is the unique VM identification string, which may contain alphanumeric - characters and hyphens. The maximum `id` length is currently 64 characters. -- `exec_file` is the path to the Firecracker binary that will be exec-ed by the - jailer. The filename must include the string `firecracker`. This is enforced - because the interaction with the jailer is Firecracker specific. -- `uid` and `gid` are the uid and gid the jailer switches to as it execs the - target binary. -- `parent-cgroup` is used to allow the placement of microvm cgroups in custom - nested hierarchies. By specifying this parameter, the jailer will create a new - cgroup named `id` for the microvm in the `/` - subfolder. `cgroup_base` is the cgroup controller root for `cgroup v1` (e.g. - `/sys/fs/cgroup/cpu`) or the unified controller hierarchy for `cgroup v2` ( - e.g. `/sys/fs/cgroup/unified`. `` is a relative path within - that hierarchy. For example, if `--parent-cgroup all_uvms/external_uvms` is - specified, the jailer will write all cgroup parameters specified through - `--cgroup` in `/sys/fs/cgroup//all_uvms/external_uvms/`. - By default, the parent cgroup is `exec-file`. If there are no `--cgroup` - parameters specified and `--group-version=2` was passed, then the jailer will - move the process to the specified cgroup. -- `cgroup-version` is used to select which type of cgroup hierarchy to use for +- `--id` specifies the unique VM identification string, which may contain + alphanumeric characters and hyphens. The maximum length is currently 64 + characters. +- `--exec-file` specifies the path to the Firecracker binary that will be + exec-ed by the jailer. +- `--uid` and `--gid` specify the uid and gid the jailer switches to as it execs + the target binary. +- `--cgroup-version` is used to select which type of cgroup hierarchy to use for the creation of cgroups. The default value is "1" which means that cgroups - specified with the `cgroup` argument will be created within a v1 hierarchy. - Supported options are "1" for cgroup-v1 and "2" for cgroup-v2. -- `cgroup` cgroups can be passed to the jailer to let it set the values when the - microVM process is spawned. The `--cgroup` argument must follow this format: + specified with `--cgroup` will be created within a v1 hierarchy. Supported + options are "1" for cgroup-v1 and "2" for cgroup-v2. +- `--cgroup` can be passed to the jailer to let it set the values when the + microVM process is spawned. The argument must follow this format: `=` (e.g `cpuset.cpus=0`). This argument can be used multiple times to set multiple cgroups. This is useful to avoid providing privileged permissions to another process for setting the cgroups before or @@ -61,15 +49,38 @@ jailer --id \ Firecracker process cgroups before the VM starts running, with no need to create the entire cgroup hierarchy manually (which requires privileged permissions). -- `chroot_base` represents the base folder where chroot jails are built. The - default is `/srv/jailer`. -- `netns` represents the path to a network namespace handle. If present, the +- `--parent-cgroup` is used to allow the placement of microvm cgroups in custom + nested hierarchies. The default value is the filename of ``, which + will be henceforth referred to as ``. The behavior of this + parameter depends on the following condition: + - If either any `--cgroup` parameter is specifed or `--cgroup-version=1` is + passed, the jailer will create a new cgroup named `` for the microvm in + the `/` subfolder. `` is the cgroup + controller root for cgroup v1 (e.g. `/sys/fs/cgroup/cpu`) or the unified + controller hierarchy for cgroup v2 (e.g. `/sys/fs/cgroup/unified`). + `` is a relative path within that hierarchy. For example, if + `--parent-cgroup all_uvms/external_uvms` is specified, the jailer will write + all cgroup parameters specified through `--cgroup` in + `/sys/fs/cgroup//all_uvms/external_uvms/`. + - If no `--cgroup` parameters are specified and `--cgroup-version=2` is + passed, the jailer will not create a new cgroup. If the cgroup specified + with `--parent-cgroup` exists, the jailer will move the process to the + specified cgroup, contrary to its name. This behavior can be used when users + want to configure a cgroup beforehand by themselves and move the process to + the configured cgroup. Note that, if the specified cgroup has domain + controllers (e.g. memory) enabled in `cgroup.subtree_control`, the move + fails due to ["no internal process constraint"][1] and jailer exits with an + error. If the cgroup spcified with `--parent-cgroup` does not exist, the + jailer does not move the process to any cgroup and proceeds without error. +- `--chroot-base-dir` specifies the base folder where chroot jails are built. + The default is `/srv/jailer`. +- `--netns` specifies the path to a network namespace handle. If present, the jailer will use this to join the associated network namespace. -- For extra security and control over resource usage, `resource-limit` can be - used to set bounds to the process resources. The `--resource-limit` argument - must follow this format: `=` (e.g `no-file=1024`) and can be - used multiple times to set multiple bounds. Current available resources that - can be limited using this argument are: +- For extra security and control over resource usage, `--resource-limit` can be + used to set bounds to the process resources. The argument must follow this + format: `=` (e.g `no-file=1024`) and can be used multiple + times to set multiple bounds. Current available resources that can be limited + using this argument are: - `fsize`: The maximum size in bytes for files created by the process. - `no-file`: Specifies a value one greater than the maximum file descriptor number that can be opened by this process. @@ -80,13 +91,13 @@ Here is an example on how to set multiple resource limits using this argument: --resource-limit fsize=250000000 --resource-limit no-file=1024 ``` -- When present, the `--daemonize` flag causes the jailer to call `setsid()` and - redirect all three standard I/O file descriptors to `/dev/null`. -- When present, the `--new-pid-ns` flag causes the jailer to spawn the provided - binary into a new PID namespace. It makes use of the libc `clone()` function - with the `CLONE_NEWPID` flag. As a result, the jailer and the process running - the exec file have different PIDs. The PID of the child process is stored in - the jail root directory inside `.pid`. +- When present, `--daemonize` causes the jailer to call `setsid()` and redirect + all three standard I/O file descriptors to `/dev/null`. +- When present, `--new-pid-ns` causes the jailer to spawn the provided binary + into a new PID namespace. It makes use of the libc `clone()` function with the + `CLONE_NEWPID` flag. As a result, the jailer and the process running the exec + file have different PIDs. The PID of the child process is stored in the jail + root directory inside `.pid`. - The jailer adheres to the "end of command options" convention, meaning all parameters specified after `--` are forwarded to Firecracker. For example, this can be paired with the `--config-file` Firecracker argument to specify a @@ -99,23 +110,21 @@ Here is an example on how to set multiple resource limits using this argument: After starting, the Jailer goes through the following operations: -- Validate **all provided paths** and the VM `id`. +- Validate **all provided paths** and the VM ID. - Close all open file descriptors based on `/proc//fd` except input, output and error. - Cleanup all environment variables received from the parent process. - Create the `///root` folder, which will be - henceforth referred to as `chroot_dir`. `exec_file_name` is the last path - component of `exec_file` (for example, that would be `firecracker` for - `/usr/bin/firecracker`). Nothing is done if the path already exists (it should - not, since `id` is supposed to be unique). -- Copy `exec_file` to - `///root/`. This ensures the - new process will not share memory with any other Firecracker process. + henceforth referred to as ``. Nothing is done if the path already + exists (it should not, since `` is supposed to be unique). +- Copy the file specified with `--exec-file` to `/`. + This ensures the new process will not share memory with any other Firecracker + process. - Set resource bounds for current process and its children through `--resource-limit` argument, by calling `setrlimit()` system call with the specific resource argument. If no limits are provided, the jailer bounds `no-file` to a maximum default value of 2048. -- Create the `cgroup` sub-folders. The jailer can use either `cgroup v1` or +- Create the cgroup sub-folders. The jailer can use either `cgroup v1` or `cgroup v2`. On most systems, this is mounted by default in `/sys/fs/cgroup` (should be mounted by the user otherwise). The jailer will parse `/proc/mounts` to detect where each of the controllers required in `--cgroup` @@ -123,17 +132,16 @@ After starting, the Jailer goes through the following operations: identified location (referred to as ``), the jailer creates the `//` subfolder, and writes the current pid to `///tasks`. Also, the value passed for each - `` is written to the file. If `--node` is used the corresponding - values are written to the appropriate `cpuset.mems` and `cpuset.cpus` files. + `` is written to the file. - Call `unshare()` into a new mount namespace, use `pivot_root()` to switch the - old system root mount point with a new one base in `chroot_dir`, switch the + old system root mount point with a new one base in ``, switch the current working directory to the new root, unmount the old root mount point, and call `chroot` into the current directory. - Use `mknod` to create a `/dev/net/tun` equivalent inside the jail. - Use `mknod` to create a `/dev/kvm` equivalent inside the jail. -- Use `chown` to change ownership of the `chroot_dir` (root path `/` as seen by - the jailed firecracker), `/dev/net/tun`, `/dev/kvm`. The ownership is changed - to the provided `uid:gid`. +- Use `chown` to change ownership of the `` (root path `/` as seen + by the jailed firecracker), `/dev/net/tun`, `/dev/kvm`. The ownership is + changed to the provided `:`. - If `--netns ` is present, attempt to join the specified network namespace. - If `--daemonize` is specified, call `setsid()` and redirect `STDIN`, `STDOUT`, @@ -148,8 +156,8 @@ After starting, the Jailer goes through the following operations: ` --id= --start-time-us= --start-time-cpu-us=` (and also forward any extra arguments provided to the jailer after `--`, as mentioned in the **Jailer Usage** section), where: - - `id`: (`string`) - The `id` argument provided to jailer. - - `opaque`: (`number`) time calculated by the jailer that it spent doing its + - ``: (`string`) - The `` argument provided to jailer. + - ``: (`number`) time calculated by the jailer that it spent doing its work. ## Example Run and Notes @@ -192,7 +200,7 @@ It’s worth noting that, whenever a folder already exists, nothing will be done and we move on to the next directory that needs to be created. This should only happen for the common `firecracker` subfolder (but, as for creating the chroot path before, we do not issue an error if folders directly associated with the -supposedly unique `id` already exist). +supposedly unique `` already exist). The jailer then writes the current pid to `/sys/fs/cgroup/cpuset/firecracker/551e7604-e35c-42b3-b825-416853441234/tasks`, @@ -229,8 +237,8 @@ call `chown(“/dev/net/tun”, 123, 100)`, so Firecracker can use it after drop privileges. This is required to use multiple TAP interfaces when running jailed. Do the same for `/dev/kvm`. -Change ownership of `` to `uid:gid` so that Firecracker can create -its API socket there. +Change ownership of `` to `:` so that Firecracker can +create its API socket there. Since the `--daemonize` flag is present, call `setsid()` to join a new session, a new process group, and to detach from the controlling terminal. Then, redirect @@ -238,7 +246,7 @@ standard file descriptors to `/dev/null` by calling `dup2(dev_null_fd, STDIN)`, `dup2(dev_null_fd, STDOUT)`, and `dup2(dev_null_fd, STDERR)`. Close `dev_null_fd`, because it is no longer necessary. -Finally, the jailer switches the `uid` to `123`, and `gid` to `100`, and execs +Finally, the jailer switches the uid to `123`, and gid to `100`, and execs ```console ./firecracker \ @@ -301,3 +309,5 @@ Note: default value for `` is `/run/firecracker.socket`. - If all the cgroup controllers are bunched up on a single mount point using the "all" option, our current program logic will complain it cannot detect individual controller mount points. + +[1]: https://docs.kernel.org/admin-guide/cgroup-v2.html#no-internal-process-constraint diff --git a/src/jailer/src/env.rs b/src/jailer/src/env.rs index 57d28724a57..1f106e2894b 100644 --- a/src/jailer/src/env.rs +++ b/src/jailer/src/env.rs @@ -218,7 +218,7 @@ impl Env { let cg_parent_procs = cg_parent.join("cgroup.procs"); if cg_parent.exists() { fs::write(cg_parent_procs, std::process::id().to_string()) - .map_err(|_| JailerError::CgroupWrite(io::Error::last_os_error()))?; + .map_err(|_| JailerError::CgroupMove(cg_parent, io::Error::last_os_error()))?; } } diff --git a/src/jailer/src/main.rs b/src/jailer/src/main.rs index 9532fa1eba4..694a951d0a4 100644 --- a/src/jailer/src/main.rs +++ b/src/jailer/src/main.rs @@ -43,8 +43,11 @@ pub enum JailerError { CgroupInvalidVersion(String), #[error("Parent cgroup path is invalid. Path should not be absolute or contain '..' or '.'")] CgroupInvalidParentPath(), - #[error("Failed to write to cgroups file: {0}")] - CgroupWrite(io::Error), + #[error( + "Failed to move process to cgroup ({0}): {1}.\nHint: If you intended to create a child \ + cgroup under {0}, pass any --cgroup parameters." + )] + CgroupMove(PathBuf, io::Error), #[error("Failed to change owner for {0}: {1}")] ChangeFileOwner(PathBuf, io::Error), #[error("Failed to chdir into chroot directory: {0}")] diff --git a/tests/integration_tests/security/test_jail.py b/tests/integration_tests/security/test_jail.py index e9cdfc2c644..aa2a206a6a9 100644 --- a/tests/integration_tests/security/test_jail.py +++ b/tests/integration_tests/security/test_jail.py @@ -194,6 +194,20 @@ def move_pid(self, cgname, pid): cg_pids = self.root.joinpath(f"{cgname}/cgroup.procs") cg_pids.write_text(f"{pid}\n", encoding="ascii") + def enable_controller_in_subtree(self, cgname, controller): + """Enable a controller in subtree_control of a cgroup and its ancestors""" + # Enable the controller in all ancestors if not already enabled. + parent_cg = self.root.joinpath(cgname).parent + parent_subtree_control = parent_cg.joinpath("cgroup.subtree_control") + if controller not in parent_subtree_control.read_text(encoding="ascii"): + self.enable_controller_in_subtree( + parent_cg.relative_to(self.root), controller + ) + + subtree_control = self.root.joinpath(f"{cgname}/cgroup.subtree_control") + subtree_control.write_text(f"+{controller}", encoding="ascii") + assert controller in subtree_control.read_text(encoding="ascii") + @pytest.fixture(scope="session", autouse=True) def cgroups_info(): @@ -230,11 +244,7 @@ def check_cgroups_v2(vm): cg_parent = cg.root / parent_cgroup cg_jail = cg_parent / vm.jailer.jailer_id - # if no cgroups were specified, then the jailer should move the FC process - # to the parent group - if len(vm.jailer.cgroups) == 0: - procs = cg_parent.joinpath("cgroup.procs").read_text().splitlines() - assert str(vm.firecracker_pid) in procs + assert len(vm.jailer.cgroups) > 0 for cgroup in vm.jailer.cgroups: controller = cgroup.split(".")[0] @@ -393,11 +403,23 @@ def test_v1_default_cgroups(uvm_plain, cgroups_info): check_cgroups_v1(test_microvm.jailer.cgroups, test_microvm.jailer.jailer_id) -def test_cgroups_custom_parent_move(uvm_plain, cgroups_info): +@pytest.mark.parametrize( + "parent_exists,domain_controller_in_subtree", + [(True, False), (True, True), (False, None)], +) +def test_cgroups_parent_cgroup_but_no_cgroup( + uvm_plain, cgroups_info, parent_exists, domain_controller_in_subtree +): """ - Test cgroups when a custom parent cgroup is used and no cgroups are specified + Test cgroups when `--parent-cgroup` is used but no `--cgroup` are specified. - In this case we just want to move under the parent cgroup + If the cgroup specified with `--parent-cgroup` exists, the jailer should + move to the specified cgroup instead of creating a new cgroup under it. + However, if the specified cgroup has domain controllers (e.g. `memory`) + enabled in `cgroup.subtree_control`, the move should fail. + + If the specified cgroup does not exist, the jailer does not move the process + to any cgroup and proceeds without error. """ if cgroups_info.version != 2: pytest.skip("cgroupsv2 only") @@ -407,9 +429,44 @@ def test_cgroups_custom_parent_move(uvm_plain, cgroups_info): parent_cgroup = f"custom_cgroup/{test_microvm.id[:8]}" test_microvm.jailer.parent_cgroup = parent_cgroup - cgroups_info.new_cgroup(parent_cgroup) - test_microvm.spawn() - check_cgroups_v2(test_microvm) + if parent_exists: + # Create the parent cgroup. + cgroups_info.new_cgroup(parent_cgroup) + if domain_controller_in_subtree: + # Enable "memory" controller in cgroup.subtree_control of the parent. + cgroups_info.enable_controller_in_subtree(parent_cgroup, "memory") + + # Check no --cgroups are specified just in case. + assert len(test_microvm.jailer.cgroups) == 0 + + cg_parent = cgroups_info.root / parent_cgroup + + if parent_exists: + if domain_controller_in_subtree: + # The jailer should have failed to move to the `parent_cgroup` + # since it has domain controllers enabled in + # `cgroup.subtree_control` due to the no internal process + # constraint. + # https://docs.kernel.org/admin-guide/cgroup-v2.html#no-internal-process-constraint + with pytest.raises( + ChildProcessError, + match=( + rf"Failed to move process to cgroup \({cg_parent}\): " + r"Resource busy \(os error 16\)" + ), + ): + test_microvm.spawn() + else: + # The jailer should have moved to the `parent_cgroup` instead of + # creating a new cgroup under it and move to the new cgroup. + test_microvm.spawn() + procs = cg_parent.joinpath("cgroup.procs").read_text().splitlines() + assert str(test_microvm.firecracker_pid) in procs + else: + # The jailer should not have moved to any cgroup and the parent + # still does not exist. + test_microvm.spawn() + assert not cg_parent.exists() def test_args_default_resource_limits(uvm_plain):