# Introduction to gVisor security

This document is a high-level introduction to gVisor for security researchers.
It explains how gVisor's approach to isolation differs from that of other
security products. It assumes a solid understanding of how kernels and
operating systems work.

A look at the [Security Model page](/docs/architecture_guide/security) is also
recommended.

[TOC]

## What is gVisor?

[gVisor](https://gvisor.dev/) is an open-source workload isolation solution to
safely run untrusted code, containers, and applications. It
[fundamentally differs](#how-does-gvisor-work) from other isolation solutions in
that it is an application kernel, not a virtual machine hypervisor or a system
call filter.

## How does gVisor work? {#how-does-gvisor-work}

The two most common approaches to sandboxing workloads are virtualization
(virtual machines or VMs) and/or Linux kernel security primitives such as
`seccomp-bpf`, Linux namespaces, AppArmor, Landlock, etc. **gVisor uses these
technologies, but not in the standalone way they are typically used**. They are
used for defense-in-depth rather than as a primary layer of defense.

To explain the difference, it is useful to first contrast gVisor with these
other approaches.

### How do Linux kernel security primitives work?

When using Linux kernel security primitives like `seccomp-bpf`, AppArmor,
Landlock, and namespaces, the attack surface of the sandboxed application is
reduced, but enforcement is still done by the single monolithic Linux kernel
that the sandboxed application can still talk to.

This means the workload is only one system call away from host compromise. The
Linux kernel security primitives help reduce that surface, but any attack
within it (or one that undoes the security mechanism itself) can still be
executed. Additionally, these primitives must be carefully configured for the
particular workload being sandboxed in order to be meaningful. For example,
system call filters need to be whittled down to the exact set of system calls
the workload needs and no others. This also means that if an application
depends on an "unsafe" or broad system call (like `ioctl(2)` or
`io_uring(2)`), it can be very difficult or even impossible to create a secure
set of filters for that application. Conversely, a generic configuration that
works for all or most workloads ends up exposing all or most of the kernel's
surface.

While gVisor leverages `seccomp-bpf` and namespaces to minimize its own surface
to the host kernel, it does so only as a second layer of defense, and not in a
way that requires workload-specific tailoring in order to be meaningful.

### How does virtualization work?

When using virtual machines, a hypervisor (which can run in userspace,
kernelspace, or both; in the diagram below, it is shown in kernelspace, but the
principle applies without loss of generality) manages the coordination between
two kernels: one running on the host as normal, and another running inside a
hardware-enforced virtual machine where the sandboxed workload's activity is
contained.

The virtual machine acts as a strong security boundary that restricts the
application from accessing any host resource. The only way out of the virtual
machine is through a "VM exit", an event triggered only in certain
circumstances and handled by the hypervisor. While virtual machines are the
gold standard of workload isolation, they come at a steep cost in resource
overhead and efficiency, due to the need to pre-allocate machine resources to
each virtual machine on each host and to boot a full, separate Linux kernel.

While gVisor can use virtualization (specifically KVM), it can also work
without virtualization while maintaining a high level of security.

### How does gVisor provide isolation?

Now that we've seen how Linux kernel security primitives and virtual machines
work, let's turn to gVisor.

**gVisor acts as an application kernel**, but **runs in userspace**. This means
it takes the role that a kernel would from the perspective of a sandboxed
workload, while gVisor itself otherwise acts as a regular user application from
the host kernel's perspective. Like a kernel, gVisor intercepts and handles
system calls and page faults from the sandboxed workload. This handling logic
happens entirely within gVisor code, written in memory-safe Go. This
kernel-like component is called the "gVisor Sentry".

Like a user application, the gVisor Sentry *may* make limited system calls to
the host Linux kernel. It does so when it determines that servicing the
sandboxed workload's request requires information from the host machine and
that the sandboxed workload was initially configured to allow such access.

This means **the gVisor Sentry needs to re-implement Linux in Go**. The gVisor
Sentry contains a Go-based, from-scratch reimplementation of the Linux system
call interface, memory management, filesystems, a network stack, process
management, signal handling, namespaces, etc. **gVisor never passes through any
system call to the host**. Therefore, if a kernel feature isn't reimplemented
in gVisor, then the sandboxed workload cannot use it.

Let's walk through an example. Say a sandboxed process calls `getpid(2)`.
gVisor intercepts this system call. gVisor keeps track of its own PID table
representing the processes in the sandbox. These are not real host processes!
Running `top(1)` on the host will not show them. gVisor uses its own PID table
to find the PID of the sandboxed process, and returns that. From its
perspective, the sandboxed process just ran `getpid(2)`, yet no host system
call was made.

Some system calls made by a sandboxed process may result in one or more host
system calls being made. As a second example, if a sandboxed process wishes to
`read(2)` from a Unix `pipe(2)` that another process in the sandbox is
`write(2)`'ing to, the gVisor Sentry (and more specifically, the Go runtime it
relies on) may call the host `futex(2)` system call to perform blocking and
synchronization between these operations. Therefore, the Sentry does need to be
able to perform real system calls, but they do not map 1-to-1 to the system
calls made by the sandboxed processes.

The gVisor Sentry runs in a very restricted environment, leveraging all of the
Linux kernel security primitives available (system call filtering, namespacing,
cgroups, `pivot_root(2)`, etc.). Its system call filter prohibits system calls
like `exec(2)`, `connect(2)`, and their respective variants (with caveats
depending on sandbox configuration). It has an isolated view of the host
filesystem using mount namespaces, and runs in an isolated user namespace with
minimal capabilities. **This does *not* mean that the sandboxed workload can't
use these system calls; it actually can!** But their logic and implementation
are entirely handled within the gVisor Sentry's kernel logic, rather than
delegating any of it to the host kernel.

For requests that cannot be serviced from within this restricted environment,
there is a sidecar process called the Gofer: a slightly-more-trusted companion
running in a slightly-more-privileged context.

This security architecture is similar to virtual machines in that there are two
separate kernels, with the innermost one being exclusive to the sandboxed
workload, and with very restricted access to the host kernel. However, unlike
virtual machines, gVisor sandboxes have the flexibility to allocate and release
host resources (CPU, memory) at runtime, providing better efficiency and
utilization without compromising on the security benefits of the VM-like
dual-kernel security architecture.

Additionally, the gVisor components are all written in memory-safe Go,
eliminating the largest class of security vulnerabilities that would otherwise
be present in a typical VM setup (Linux as guest kernel). In order to break out
of a gVisor sandbox, an attacker would need to simultaneously exploit the
gVisor Sentry kernel *and* the host Linux kernel, which do not share any code.

gVisor contains multiple mechanisms by which it can intercept system calls and
page faults from the sandboxed workload. These are called
"[gVisor platforms](https://gvisor.dev/docs/architecture_guide/platforms/)".
There are currently two supported platforms:

* "Systrap" (the default). This platform is based on the use of Linux's
    `seccomp-bpf` subsystem for system call ***interception*** (as opposed to
    the typical use-case of `seccomp-bpf` being for system call
    ***filtering***). It does not require virtualization support from the host
    and is therefore well-suited to run *inside* a virtual machine. Read our
    [announcement post for more details on Systrap](https://gvisor.dev/blog/2023/04/28/systrap-release/).
* "KVM". This platform is based on the use of Linux's KVM subsystem and uses
    virtualization as a means to provide address space isolation and
    interception of page faults. Sandboxed workload code runs in guest ring 3.
    This platform requires virtualization support. It can also work with nested
    virtualization, but is generally slower than Systrap in such a mode.

Platforms are meant to be transparently interchangeable from the system
administrator's perspective. However, they are still different from a security
perspective, as the Linux kernel functionality they rely on to provide system
call and page fault interception differs.

For more information on gVisor security, please see the
[Security Model page](https://gvisor.dev/docs/architecture_guide/security/).

## What does gVisor *not* protect against?

Generally speaking, gVisor protects against Linux kernel exploits by preventing
the sandboxed workload from accessing the host kernel directly.

Where gVisor does ***not*** help:

* Attacks in higher-level components of the stack, before the sandbox or
    container runtime even enters the picture, e.g. an exploit in containerd
    that causes it to start a container without gVisor.
* Spectre-style CPU side-channel attacks. gVisor only intercepts system calls
    and page faults, so the application is free to use the CPU as it wants
    (within host cgroup limits), similar to the VM case. Side-channel attacks
    need to be mitigated at the host kernel or hardware level.
* Exploits *within* the sandboxed workload itself, e.g. a gVisor sandbox
    running nginx and PHP being compromised via a vulnerability in the PHP
    code. While gVisor *does* help prevent the attacker from escalating the
    attack further out to the host, the attacker will still have access to
    whatever the sandbox is configured to have access to. In general, this
    means that different customer workloads should be run in different
    sandboxes to prevent a malicious customer from leaking data or exploiting
    another customer's workload. Additionally, note that gVisor has a
    [runtime monitoring feature](https://gvisor.dev/docs/user_guide/runtimemonitor/)
    that can be used as an intrusion detection mechanism to detect compromise
    of the sandboxed workload itself.

## How can I test gVisor?

gVisor is available as an [OCI-compliant](https://opencontainers.org/)
container runtime named [runsc](https://gvisor.dev/docs/user_guide/install/).
It can be used with container ecosystem tools like Docker
([gVisor guide](https://gvisor.dev/docs/user_guide/quick_start/docker/)) or
Kubernetes
([gVisor guide](https://gvisor.dev/docs/user_guide/quick_start/kubernetes/)).
It can also be used directly for one-off testing, like this:

```shell
$ sudo runsc do echo Hello world
Hello world
```

Note the use of `sudo`, which may give you pause. It's a sandboxing tool, after
all; shouldn't it run as an unprivileged user? gVisor-sandboxed workloads *do*
run with minimal capabilities in an isolated user namespace from the
perspective of the host kernel. However, the sandbox setup process requires
privileges, specifically for setting up the userspace network stack. Once the
sandbox setup is complete, gVisor re-executes itself and drops all privileges
in the process. This takes place before any untrusted code runs. For sandboxes
that don't require networking, it is possible to run in rootless mode without
`sudo`:

```shell
$ runsc --rootless --network=none do echo Hello world
Hello world
```

How can you tell that gVisor is working? Well, try to do something that
involves the host kernel. For example, you can call `dmesg(1)`, which reads the
kernel logs:

```shell
# Without gVisor (unsandboxed):
$ dmesg
dmesg: read kernel buffer failed: Operation not permitted

# With gVisor (sandboxed):
$ runsc --rootless --network=none do dmesg
[ 0.000000] Starting gVisor...
[ 0.498943] Waiting for children...
[ 0.972223] Committing treasure map to memory...
[ 1.192981] Segmenting fault lines...
[ 1.591823] Verifying that no non-zero bytes made their way into /dev/zero...
[ 1.787191] Consulting tar man page...
[ 2.083245] Searching for needles in stacks...
[ 2.534575] Forking spaghetti code...
[ 2.742140] Digging up root...
[ 2.921313] Gathering forks...
[ 3.342436] Creating cloned children...
[ 3.511124] Setting up VFS...
[ 3.812459] Setting up FUSE...
[ 4.233037] Ready!
```

This demonstrates that the `dmesg(1)` binary is talking to the gVisor kernel
instead of the host Linux kernel. The humorous messages displayed are part of
gVisor's kernel code for when a sandboxed workload asks for kernel logs. These
logs are fictitious and are generated by gVisor's system call handler on
demand, which is why re-running this command will yield different messages. Try
to catch them all!

Note: `runsc do` gives the sandbox read-only access to the host's entire
filesystem by default, as `runsc do` is just a convenience feature to test out
gVisor quickly. In real-world usage, when runsc is used as an OCI container
runtime, host filesystem mappings are strictly defined by the OCI runtime
configuration, and gVisor will only expose the paths that the OCI configuration
dictates should be exposed (and will `pivot_root(2)` away from being able to
access any other host directory, for defense in depth). For this reason, when
testing gVisor from a security standpoint, it's better to
[install it as a Docker runtime](https://gvisor.dev/docs/user_guide/quick_start/docker/)
and then use it as follows:

```shell
$ sudo docker run --rm --runtime=runsc -it -v /tmp/vol:/vol ubuntu /bin/bash
```

This will spawn a Bash shell inside a gVisor sandbox, with the host directory
`/tmp/vol` mapped to `/vol` inside the sandbox. You can then poke around within
the sandbox and see if you can escape out to the host or glean information from
it (other than the contents of `/tmp/vol`).

## Further reading

* For more in-depth details on gVisor's security model and architecture, see
    [gVisor Security Model](/docs/architecture_guide/security).
* For more details on how system call interception works, see
    [gVisor platforms](/docs/architecture_guide/platforms).
* For guides on how to get started, see
    [Docker Quick Start](/docs/user_guide/quick_start/docker).