Commit 8f2903a

EtiennePerot authored and gvisor-bot committed
Create intro to gVisor security page.
This is meant to be an overview of how gVisor works and how to try it out for security researchers unfamiliar with gVisor as a technology.

PiperOrigin-RevId: 770361419
1 parent dc7222d commit 8f2903a

6 files changed

+314
-3
lines changed


g3doc/architecture_guide/BUILD

Lines changed: 16 additions & 3 deletions
@@ -30,6 +30,19 @@ doc(
     weight = "30",
 )
 
+doc(
+    name = "intro_to_gvisor",
+    src = "intro_to_gvisor.md",
+    category = "Architecture Guide",
+    data = [
+        "isolation_with_gvisor.svg",
+        "isolation_with_linux_security_primitives.svg",
+        "isolation_with_virtualization.svg",
+    ],
+    permalink = "/docs/architecture_guide/intro/",
+    weight = "10",
+)
+
 doc(
     name = "security",
     src = "security.md",
@@ -39,15 +52,15 @@ doc(
         "security.svg",
     ],
     permalink = "/docs/architecture_guide/security/",
-    weight = "10",
+    weight = "20",
 )
 
 doc(
     name = "performance",
     src = "performance.md",
     category = "Architecture Guide",
     permalink = "/docs/architecture_guide/performance/",
-    weight = "20",
+    weight = "50",
 )
 
 doc(
@@ -58,5 +71,5 @@ doc(
         "packetflow.svg",
     ],
     permalink = "/docs/architecture_guide/networking/",
-    weight = "50",
+    weight = "60",
 )

g3doc/architecture_guide/intro_to_gvisor.md

Lines changed: 294 additions & 0 deletions
@@ -0,0 +1,294 @@
1+
# Introduction to gVisor security
2+
3+
This document is meant to be a high-level introduction to gVisor for security
4+
researchers. It explains how gVisor differs from the way other security products
5+
provide isolation. It assumes a solid understanding of how kernels and operating
6+
systems work.
7+
8+
A look at the [Security Model page](/docs/architecture_guide/security) is also
9+
recommended.
10+
11+
[TOC]
12+
13+
## What is gVisor?
14+
15+
[gVisor](https://gvisor.dev/) is an open-source workload isolation solution to
16+
safely run untrusted code, containers and applications. It
17+
[fundamentally differs](#how-does-gvisor-work) from other isolation solutions in
18+
that it is an application kernel, not a virtual machine hypervisor or a system
19+
call filter.
20+
21+
## How does gVisor work? {#how-does-gvisor-work}
22+
23+
The two most common approaches to sandboxing workloads are to use virtualization
24+
(virtual machines or VMs), and/or Linux kernel security primitives such as
25+
`seccomp-bpf`, Linux namespaces, AppArmor, Landlock, etc. **gVisor uses these
26+
technologies, but not in the standalone way they are typically used**. They are
27+
used for defense-in-depth rather than as a primary layer of defense.
28+
29+
To explain the difference, it is useful to contrast gVisor with these other
30+
approaches first.
31+
32+
### How do Linux kernel security primitives work?
33+
34+
![Isolation with Linux kernel security primitives](isolation_with_linux_security_primitives.svg "Isolation with Linux kernel security primitives")
35+
36+
When using Linux kernel security primitives like `seccomp-bpf`, AppArmor,
37+
Landlock, namespaces, and such, the attack surface of the sandboxed applications
38+
is reduced, but that enforcement is still done by the single monolithic Linux
39+
kernel that the sandboxed application can still talk to.
40+
41+
This means the workload is only one system call away from host compromise. The
42+
Linux kernel security primitives help in reducing the surface, but any attack
43+
within that surface (or that undoes the Linux kernel security mechanism itself)
44+
can still be executed. Additionally, these security primitives need to be
45+
carefully configured for the particular workload being sandboxed in order to be
46+
meaningful. For example, system call filters need to be whittled down to the
47+
exact set of system calls the workload needs and no others. This also means that
48+
if an application depends on an "unsafe" or broad system call (like `ioctl(2)`
49+
or the `io_uring(7)` interface), it can be very difficult or even impossible to create a
50+
secure set of filters for that application. Additionally, creating a generic
51+
configuration that works for all or most workloads will result in needing to
52+
allow all or most of the kernel surface to be exposed.
53+
54+
While gVisor leverages `seccomp-bpf` and namespaces to minimize its own surface
55+
to the host kernel, it does so only as a second layer of defense, and not in a
56+
way that requires workload-specific tailoring in order to be meaningful.
57+
58+
### How does virtualization work?
59+
60+
![Isolation with virtualization](isolation_with_virtualization.svg "Isolation with virtualization")
61+
62+
When using virtual machines, a hypervisor (which can run in userspace or
63+
kernelspace or both; in the diagram above, it is shown in kernelspace but the
64+
principle applies without loss of generality) manages the coordination between
65+
two kernels: one running on the host as normal, and another running inside a
66+
hardware-enforced virtual machine where the sandboxed workload's activity is
67+
contained.
68+
69+
The virtual machine acts as a strong security boundary that restricts the
70+
application from accessing any host resource. The only way out of the virtual
71+
machine is through a "VM exit", an event triggered only in certain circumstances
72+
and handled by the hypervisor. While virtual machines are the gold standard of
73+
workload isolation, they come at a steep cost in terms of resource overhead and
74+
efficiency, due to the need to pre-allocate machine resources to each virtual
75+
machine on each host and to boot a full, separate Linux kernel.
76+
77+
While gVisor can use virtualization (specifically: KVM), it can also work
78+
without virtualization while maintaining a high level of security.
79+
80+
### How does gVisor provide isolation?
81+
82+
![Isolation with gVisor](isolation_with_gvisor.svg "Isolation with gVisor")
83+
84+
Now that we've seen how Linux kernel security primitives and virtual machines
85+
work, let's turn to gVisor.
86+
87+
**gVisor acts as an application kernel**, but **runs in userspace**. This means
88+
it takes the role that a kernel would from the perspective of a sandboxed
89+
workload, while gVisor itself otherwise acts as a regular user application from
90+
the host kernel's perspective. Like a kernel, gVisor intercepts and handles
91+
system calls and page faults from the sandboxed workload. This handling logic
92+
happens entirely within gVisor code, written in memory-safe Go. This kernel-like
93+
component is called the "gVisor Sentry".
94+
95+
Like a user application, the gVisor Sentry *may* make limited system calls to
96+
the host Linux kernel. It does so when it determines that servicing the
97+
sandboxed workload's request requires information from the host machine and that
98+
the sandboxed workload was initially configured to allow such access.
99+
100+
This means **the gVisor Sentry needs to re-implement Linux in Go**. The gVisor
101+
Sentry contains a Go-based, from-scratch reimplementation of the Linux system
102+
call interface, memory management, filesystems, a network stack, process
103+
management, signal handling, namespaces, etc. **gVisor never passes through any
104+
system call to the host**. Therefore, if a kernel feature isn't reimplemented in
105+
gVisor, then the sandboxed workload cannot use it.
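
The "no pass-through" rule can be illustrated with a minimal Go sketch of a system call dispatcher. This is a toy under stated assumptions, not gVisor's actual code: the names `toyKernel`, `handler`, `errNoSys`, and the single `getuid` entry are hypothetical and exist only for illustration. The point it shows is that a call is either served by a handler in the application kernel's own table or rejected; there is no branch that forwards it to the host.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSys stands in for ENOSYS ("function not implemented").
var errNoSys = errors.New("ENOSYS: function not implemented")

// handler is a hypothetical signature for a system call that is
// implemented entirely in userspace.
type handler func(args ...uintptr) (uintptr, error)

// toyKernel is an illustrative stand-in for an application kernel.
// It is not gVisor's real data structure.
type toyKernel struct {
	table map[string]handler
}

// newToyKernel registers the only call this toy implements.
func newToyKernel() *toyKernel {
	return &toyKernel{table: map[string]handler{
		// Pretend every sandboxed task runs as UID 0 inside the sandbox.
		"getuid": func(args ...uintptr) (uintptr, error) { return 0, nil },
	}}
}

// dispatch serves a call from the kernel's own table. There is
// deliberately no fallback that forwards the call to the host.
func (k *toyKernel) dispatch(name string, args ...uintptr) (uintptr, error) {
	h, ok := k.table[name]
	if !ok {
		return 0, errNoSys
	}
	return h(args...)
}

func main() {
	k := newToyKernel()
	for _, name := range []string{"getuid", "some_unimplemented_call"} {
		ret, err := k.dispatch(name)
		fmt.Printf("%s -> ret=%d err=%v\n", name, ret, err)
	}
}
```

In this sketch, adding support for a new system call means writing a new handler, which mirrors why an unimplemented kernel feature is simply unavailable to the workload.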
106+
107+
Let's walk through an example. Say a sandboxed process calls `getpid(2)`. gVisor
108+
intercepts this system call. gVisor keeps track of its own PID table
109+
representing the processes in the sandbox. These are not real host processes!
110+
Running `top(1)` on the host will not show them. gVisor uses its own PID table
111+
to find the PID of the sandboxed process, and returns that. From its
112+
perspective, the sandboxed process just ran `getpid(2)`, yet no host system call
113+
was made.
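
As a rough sketch of the `getpid(2)` walkthrough above, the following Go program keeps a PID table purely in userspace memory. The types `task` and `pidTable` and their methods are hypothetical names invented for this example, and real bookkeeping in gVisor is far more involved (threads, PID namespaces, and so on).

```go
package main

import "fmt"

// task is a hypothetical record for one sandboxed process.
type task struct {
	name string
}

// pidTable is a toy process table that lives only in the sandbox
// kernel's memory. The host kernel knows nothing about its entries.
type pidTable struct {
	next int
	pids map[*task]int
}

func newPIDTable() *pidTable {
	return &pidTable{next: 1, pids: make(map[*task]int)}
}

// spawn registers a new sandboxed task and assigns it the next sandbox PID.
func (t *pidTable) spawn(name string) *task {
	tk := &task{name: name}
	t.pids[tk] = t.next
	t.next++
	return tk
}

// getpid answers the caller by looking it up in the table. No host
// system call is issued on the caller's behalf.
func (t *pidTable) getpid(caller *task) int {
	return t.pids[caller]
}

func main() {
	table := newPIDTable()
	initTask := table.spawn("init")
	worker := table.spawn("worker")
	fmt.Println(initTask.name, "sees PID", table.getpid(initTask))
	fmt.Println(worker.name, "sees PID", table.getpid(worker))
}
```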
114+
115+
Some system calls made by a sandboxed process may result in one or more host
116+
system calls being made. As a second example, if a sandboxed process wishes to
117+
`read(2)` from a Unix `pipe(2)` that another process in the sandbox is
118+
`write(2)`'ing to, the gVisor Sentry (and more specifically, the Go runtime it
119+
relies on) may call the host `futex(2)` system call to perform blocking and
120+
synchronization between these operations. Therefore, the Sentry does need to be
121+
able to perform real system calls, but they do not map 1-to-1 to the system
122+
calls made by the sandboxed processes.
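
The pipe example can be sketched in Go as an in-memory buffer guarded by a `sync.Cond`. The `toyPipe` type and its methods are hypothetical and are not how gVisor implements pipes; the sketch only illustrates that the sandboxed `read(2)` and `write(2)` are served from userspace memory, while any host system call (such as `futex(2)` on a Linux host) is made by the Go runtime when it needs to put a thread to sleep or wake it up.

```go
package main

import (
	"fmt"
	"sync"
)

// toyPipe is an in-memory byte queue standing in for a sandbox pipe.
type toyPipe struct {
	mu   sync.Mutex
	cond *sync.Cond
	buf  []byte
}

func newToyPipe() *toyPipe {
	p := &toyPipe{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// write appends data to the buffer and wakes any blocked reader.
func (p *toyPipe) write(data []byte) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.buf = append(p.buf, data...)
	p.cond.Broadcast()
	return len(data)
}

// read blocks until data is available, then copies out up to len(dst) bytes.
func (p *toyPipe) read(dst []byte) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	for len(p.buf) == 0 {
		// Parks this goroutine. Whether and when the host kernel gets
		// involved is decided by the Go runtime, not by the caller.
		p.cond.Wait()
	}
	n := copy(dst, p.buf)
	p.buf = p.buf[n:]
	return n
}

func main() {
	p := newToyPipe()
	go p.write([]byte("hello")) // the "other process" in the sandbox
	dst := make([]byte, 16)
	n := p.read(dst)
	fmt.Println(string(dst[:n]))
}
```

Note that one sandboxed `read(2)` here may correspond to zero, one, or several host system calls, which is the sense in which the mapping is not 1-to-1.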
123+
124+
The gVisor Sentry runs in a very restricted environment, leveraging all of the
125+
Linux kernel security primitives available (system call filtering, namespacing,
126+
cgroups, `pivot_root(2)`, etc.). Its system call filter prohibits system calls
127+
like `execve(2)`, `connect(2)`, and their respective variants (with caveats
128+
depending on sandbox configurations). It has an isolated view of the host
129+
filesystem using mount namespaces, and runs in an isolated user namespace
130+
with minimal capabilities. **This does *not* mean that the sandboxed workload
131+
can't use these system calls; it actually can!** But their logic and
132+
implementation are entirely handled within the gVisor Sentry's kernel logic,
133+
rather than delegating any of it to the host kernel.
134+
135+
For requests that cannot be serviced from within this restricted environment,
136+
there is a sidecar process called the Gofer which is a slightly-more-trusted
137+
companion process running in a slightly-more-privileged context.
138+
139+
This security architecture is similar to virtual machines in that there are two
140+
separate kernels, with the innermost one being exclusive to the sandboxed
141+
workload, and with very restricted access to the host kernel. However, unlike
142+
virtual machines, gVisor sandboxes have the flexibility to allocate and release
143+
host resources (CPU, memory) at runtime, providing better efficiency and
144+
utilization without compromising on the security benefits of the VM-like
145+
dual-kernel security architecture.
146+
147+
Additionally, the gVisor components are all written in memory-safe Go,
148+
eliminating the largest class of security vulnerabilities that would otherwise
149+
be present in a typical VM setup (Linux as guest kernel). In order to break out
150+
of a gVisor sandbox, an attacker would need to simultaneously exploit the gVisor
151+
Sentry kernel *and* the host Linux kernel, which do not share any code.
152+
153+
gVisor contains multiple mechanisms by which it can intercept system calls and
154+
page faults from the sandboxed workload. These are called
155+
"[gVisor platforms](https://gvisor.dev/docs/architecture_guide/platforms/)".
156+
There are currently two supported platforms:
157+
158+
* "Systrap" (the default). This platform is based on the use of Linux's
159+
`seccomp-bpf` subsystem for system call ***interception*** (as opposed to
160+
the typical use-case of `seccomp-bpf` being for system call
161+
***filtering***). It does not require virtualization support from the host
162+
and is therefore well-suited to run *inside* a virtual machine. Read our
163+
[announcement post for more details on Systrap](https://gvisor.dev/blog/2023/04/28/systrap-release/).
164+
* "KVM". This platform is based on the use of Linux's KVM subsystem and uses
165+
virtualization as a means to provide address space isolation and
166+
interception of page faults. Sandboxed workload code runs in guest ring 3.
167+
This platform requires virtualization support. It can also work with nested
168+
virtualization, but is generally slower than Systrap in such a mode.
169+
170+
Platforms are meant to be transparently interchangeable from the system
171+
administrator's perspective. However, they are still different from a security
172+
perspective, as the Linux kernel functionality they rely on to provide system
173+
call and page fault interception differs.
174+
175+
For more information on gVisor security, please see the
176+
[Security Model page](https://gvisor.dev/docs/architecture_guide/security/).
177+
178+
## What does gVisor *not* protect against?
179+
180+
Generally speaking, gVisor protects against Linux kernel exploits by separating
181+
the sandboxed workload from accessing the host kernel directly.
182+
183+
Where gVisor does ***not*** help:
184+
185+
* Attacks in higher-level components of the stack before the sandbox or
186+
container runtime even enters the picture, e.g. an exploit in containerd
187+
that would cause it to start a container without gVisor.
188+
* Side-channel Spectre-style CPU attacks. gVisor only intercepts system calls
189+
and page faults, so the application is free to use the CPU as it wants
190+
(within host cgroup limits), similar to the VM case. Side-channel attacks
191+
need to be mitigated at the host kernel or hardware level.
192+
* Exploits *within* the sandboxed workload itself, e.g. a gVisor sandbox
193+
running nginx and PHP being exploited via an exploit in the PHP code. While
194+
gVisor *does* help in preventing the attacker from escalating the attack
195+
further out to the host, the attacker will still have access to whatever the
196+
sandbox is configured to have access to. In general, this means that
197+
different customer workloads should be run in different sandboxes to prevent
198+
a malicious customer from leaking data or exploiting another customer
199+
workload. Additionally, note that gVisor has a
200+
[runtime monitoring feature](https://gvisor.dev/docs/user_guide/runtimemonitor/)
201+
that can be used as an intrusion detection mechanism to detect compromise of
202+
the sandboxed workload itself.
203+
204+
## How can I test gVisor?
205+
206+
gVisor is available as an [OCI-compliant](https://opencontainers.org/) container
207+
runtime named [runsc](https://gvisor.dev/docs/user_guide/install/). It can be
208+
used with container ecosystem tools like Docker
209+
([gVisor guide](https://gvisor.dev/docs/user_guide/quick_start/docker/)) or
210+
Kubernetes
211+
([gVisor guide](https://gvisor.dev/docs/user_guide/quick_start/kubernetes/)). It
212+
can also be used directly for one-off testing, like this:
213+
214+
```shell
215+
$ sudo runsc do echo Hello world
216+
Hello world
217+
```
218+
219+
Note the use of `sudo`, which may give you pause. It's a sandboxing tool, after
220+
all; shouldn't it run as an unprivileged user? gVisor-sandboxed workloads *do*
221+
run with minimal capabilities in an isolated user namespace from the perspective
222+
of the host kernel. However, the sandbox setup process requires privileges,
223+
specifically for setting up the userspace network stack. Once the sandbox setup
224+
is complete, gVisor re-executes itself and drops all privileges in the process.
225+
This takes place before any untrusted code runs. For sandboxes that don't
226+
require networking, it is possible to run in rootless mode without sudo:
227+
228+
```shell
229+
$ runsc --rootless --network=none do echo Hello world
230+
Hello world
231+
```
232+
233+
How can you tell that gVisor is working? Well, try to do something that involves
234+
the host kernel. For example, you can call `dmesg(1)`, which reads the kernel
235+
logs:
236+
237+
```shell
238+
# Without gVisor (unsandboxed):
239+
$ dmesg
240+
dmesg: read kernel buffer failed: Operation not permitted
241+
242+
# With gVisor (sandboxed):
243+
$ runsc --rootless --network=none do dmesg
244+
[ 0.000000] Starting gVisor...
245+
[ 0.498943] Waiting for children...
246+
[ 0.972223] Committing treasure map to memory...
247+
[ 1.192981] Segmenting fault lines...
248+
[ 1.591823] Verifying that no non-zero bytes made their way into /dev/zero...
249+
[ 1.787191] Consulting tar man page...
250+
[ 2.083245] Searching for needles in stacks...
251+
[ 2.534575] Forking spaghetti code...
252+
[ 2.742140] Digging up root...
253+
[ 2.921313] Gathering forks...
254+
[ 3.342436] Creating cloned children...
255+
[ 3.511124] Setting up VFS...
256+
[ 3.812459] Setting up FUSE...
257+
[ 4.233037] Ready!
258+
```
259+
260+
This demonstrates that the `dmesg(1)` binary is talking to the gVisor kernel
261+
instead of the host Linux kernel. The humorous messages displayed are part of
262+
gVisor's kernel code when a sandboxed workload asks for kernel logs. These logs
263+
are fictitious and are generated by gVisor's system call handler on demand,
264+
which is why re-running this command will yield different messages. Try to catch
265+
them all!
266+
267+
Note: `runsc do` gives the sandbox read-only access to the host's entire
268+
filesystem by default, as `runsc do` is just a convenience feature to test out
269+
gVisor quickly. In real-world usage, when runsc is used as an OCI container
270+
runtime, host filesystem mappings are strictly defined by the OCI runtime
271+
configuration and gVisor will only expose the paths that the OCI configuration
272+
dictates should be exposed (and will `pivot_root(2)` away from being able to
273+
access any other host directory, for defense in depth). For this reason, when
274+
testing gVisor from a security standpoint, it's better to
275+
[install it as a Docker runtime](https://gvisor.dev/docs/user_guide/quick_start/docker/),
276+
and then use it as follows:
277+
278+
```shell
279+
$ sudo docker run --rm --runtime=runsc -it -v /tmp/vol:/vol ubuntu /bin/bash
280+
```
281+
282+
This will spawn a Bash shell inside a gVisor sandbox, with access to the host
283+
directory `/tmp/vol` mapped to `/vol` inside the sandbox. You can then poke
284+
around within the sandbox and see if you can escape out to the host or glean
285+
information from it (other than the contents of `/tmp/vol`).
286+
287+
## Further reading
288+
289+
* For more in-depth details on gVisor's security model and architecture, see
290+
[gVisor Security Model](/docs/architecture_guide/security).
291+
* For more details on how system call interception works, see
292+
[gVisor platforms](/docs/architecture_guide/platforms).
293+
* For guides on how to get started, see
294+
[Docker Quick Start](/docs/user_guide/quick_start/docker).

g3doc/architecture_guide/isolation_with_gvisor.svg

Lines changed: 1 addition & 0 deletions

g3doc/architecture_guide/isolation_with_linux_security_primitives.svg

Lines changed: 1 addition & 0 deletions

g3doc/architecture_guide/isolation_with_virtualization.svg

Lines changed: 1 addition & 0 deletions

website/BUILD

Lines changed: 1 addition & 0 deletions
@@ -147,6 +147,7 @@ docs(
         "//g3doc:index",
         "//g3doc:roadmap",
         "//g3doc:style",
+        "//g3doc/architecture_guide:intro_to_gvisor",
         "//g3doc/architecture_guide:networking",
         "//g3doc/architecture_guide:performance",
         "//g3doc/architecture_guide:platforms",
