Commit 2c7ce65

Merge branch 'main' into fix/kvmclock_ctrl_test
2 parents 2d22702 + 567b1ea commit 2c7ce65

File tree

30 files changed

+390
-485
lines changed

CHANGELOG.md

Lines changed: 6 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -45,6 +45,10 @@ and this project adheres to
4545
- [#5007](https://github.com/firecracker-microvm/firecracker/pull/5007): Fixed
4646
watchdog softlockup warning on x86_64 guests when a vCPU is paused during GDB
4747
debugging.
48+
- [#5021](https://github.com/firecracker-microvm/firecracker/pull/5021): If a
49+
balloon device is inflated post UFFD-backed snapshot restore, Firecracker now
50+
causes `remove` UFFD messages to be sent to the UFFD handler. Previously, no
51+
such message would be sent.
4852

4953
## [1.10.1]
5054

@@ -118,7 +122,8 @@ and this project adheres to
118122
VMGenID support for microVMs running on ARM hosts with 6.1 guest kernels.
119123
Support for VMGenID via DeviceTree bindings exists only on mainline 6.10 Linux
120124
onwards. Users of Firecracker will need to backport the relevant patches on
121-
top of their 6.1 kernels to make use of the feature.
125+
top of their 6.1 kernels to make use of the feature. As a result, Firecracker
126+
snapshot version is now 3.0.0.
122127
- [#4732](https://github.com/firecracker-microvm/firecracker/pull/4732),
123128
[#4733](https://github.com/firecracker-microvm/firecracker/pull/4733),
124129
[#4741](https://github.com/firecracker-microvm/firecracker/pull/4741),

Cargo.lock

Lines changed: 16 additions & 16 deletions

docs/snapshotting/snapshot-support.md

Lines changed: 15 additions & 22 deletions
@@ -25,6 +25,8 @@
2525
- [Secure and insecure usage examples](#usage-examples)
2626
- [Reusing snapshotted states securely](#reusing-snapshotted-states-securely)
2727
- [Vsock device limitation](#vsock-device-limitation)
28+
- [VMGenID device limitation](#vmgenid-device-limitation)
29+
- [Where can I resume my snapshots?](#where-can-i-resume-my-snapshots)
2830

2931
## About microVM snapshotting
3032

@@ -638,28 +640,19 @@ might not be able to handle the injected notification and crash. We suggest to
638640
users that they take snapshots only after the guest kernel has completed
639641
booting, to avoid this issue.
640642

641-
## Snapshot compatibility across kernel versions
643+
## Where can I resume my snapshots?
642644

643-
We have a mechanism in place to experiment with snapshot compatibility across
644-
supported host kernel versions by generating snapshot artifacts through
645-
[this test](../../tests/integration_tests/functional/test_snapshot_phase1.py)
646-
and checking devices' functionality using
647-
[this test](../../tests/integration_tests/functional/test_snapshot_restore_cross_kernel.py).
648-
The test restores the snapshot and ensures that all the devices set-up (network
649-
devices, disk, vsock, balloon and MMDS) are operational post-load.
645+
Snapshots must be resumed on a software and hardware configuration which is
646+
identical to what they were generated on. However, in limited cases, snapshots
647+
can be resumed on the same hardware instance type they were taken on, but
648+
using newer host kernel versions. While we do not provide any guarantees on this
649+
setup (and do not recommend doing this in production), we are currently aware of
650+
the compatibility table reported below:
650651

651-
In those tests the instance is fixed, except some combinations where we also
652-
test across the same CPU family (Intel x86, Gravitons). In general cross-CPU
653-
snapshots [are not supported](./versioning.md#cpu-model)
652+
| .metal instance type | taken on host kernel | restored on host kernel |
653+
| -------------------- | -------------------- | ----------------------- |
654+
| {c5n,m5n,m6i,m6a} | 5.10 | 6.1 |
654655

655-
The tables below reflect the snapshot compatibility observed on the AWS
656-
instances we support.
657-
658-
**all** means all currently supported Intel/AMD/ARM metal instances (m6g, m7g,
659-
m5n, c5n, m6i, m6a). It does not mean cross-instance, i.e. a snapshot taken on
660-
m6i won't work on an m6g instance.
661-
662-
| *CPU family* | *taken on host kernel* | *restored on host kernel* | *working?* |
663-
| ------------ | ---------------------- | ------------------------- | ---------- |
664-
| **x86_64** | 5.10 | 6.1 | Y |
665-
| **x86_64** | 6.1 | 5.10 | N |
656+
For example, a snapshot taken on an m6i.metal host running a 5.10 host kernel can
657+
be restored on a different m6i.metal host running a 6.1 host kernel (but not
658+
vice versa), but could not be restored on a c5n.metal host.
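
The compatibility rule described in the new documentation text can be read as a small predicate. The following is a minimal sketch, not part of the commit and not a Firecracker API: the `Host` type, the `TABLE_INSTANCES` constant and the `may_restore` function are hypothetical names introduced only for illustration, and the rule encodes nothing beyond the single table row and the m6i/c5n example above.

```rust
/// Hypothetical description of a host, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Host<'a> {
    instance_type: &'a str, // e.g. "m6i.metal"
    kernel: &'a str,        // host kernel version, e.g. "5.10"
}

/// Instance types listed in the compatibility table above.
const TABLE_INSTANCES: [&str; 4] = ["c5n.metal", "m5n.metal", "m6i.metal", "m6a.metal"];

/// Returns true if the documentation text suggests the restore may work.
/// This mirrors the table only; it gives no guarantee of its own.
fn may_restore(taken_on: Host, restore_on: Host) -> bool {
    // An identical software and hardware configuration is the supported case.
    if taken_on == restore_on {
        return true;
    }
    // Restoring on a different instance type is not covered by the table.
    if taken_on.instance_type != restore_on.instance_type {
        return false;
    }
    // Same instance type: only 5.10 -> 6.1 is listed, not the reverse.
    TABLE_INSTANCES.iter().any(|t| *t == taken_on.instance_type)
        && taken_on.kernel == "5.10"
        && restore_on.kernel == "6.1"
}

fn main() {
    let m6i_510 = Host { instance_type: "m6i.metal", kernel: "5.10" };
    let m6i_61 = Host { instance_type: "m6i.metal", kernel: "6.1" };
    let c5n_61 = Host { instance_type: "c5n.metal", kernel: "6.1" };

    println!("m6i 5.10 -> m6i 6.1: {}", may_restore(m6i_510, m6i_61));
    println!("m6i 6.1 -> m6i 5.10: {}", may_restore(m6i_61, m6i_510));
    println!("m6i 5.10 -> c5n 6.1: {}", may_restore(m6i_510, c5n_61));
}
```

The predicate is deliberately conservative: anything not listed in the table is treated as unsupported.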

src/clippy-tracing/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ clap = { version = "4.5.27", features = ["derive"] }
1414
itertools = "0.14.0"
1515
proc-macro2 = { version = "1.0.93", features = ["span-locations"] }
1616
quote = "1.0.38"
17-
syn = { version = "2.0.96", features = ["full", "extra-traits", "visit", "visit-mut", "printing"] }
17+
syn = { version = "2.0.98", features = ["full", "extra-traits", "visit", "visit-mut", "printing"] }
1818
walkdir = "2.5.0"
1919

2020
[dev-dependencies]

src/cpu-template-helper/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ displaydoc = "0.2.5"
1515
libc = "0.2.169"
1616
log-instrument = { path = "../log-instrument", optional = true }
1717
serde = { version = "1.0.217", features = ["derive"] }
18-
serde_json = "1.0.137"
18+
serde_json = "1.0.138"
1919
thiserror = "2.0.11"
2020

2121
vmm = { path = "../vmm" }

src/firecracker/Cargo.toml

Lines changed: 2 additions & 2 deletions
@@ -24,7 +24,7 @@ micro_http = { git = "https://github.com/firecracker-microvm/micro-http" }
2424

2525
serde = { version = "1.0.217", features = ["derive"] }
2626
serde_derive = "1.0.136"
27-
serde_json = "1.0.137"
27+
serde_json = "1.0.138"
2828
thiserror = "2.0.11"
2929
timerfd = "1.6.0"
3030
utils = { path = "../utils" }
@@ -43,7 +43,7 @@ userfaultfd = "0.8.1"
4343
[build-dependencies]
4444
seccompiler = { path = "../seccompiler" }
4545
serde = { version = "1.0.217" }
46-
serde_json = "1.0.137"
46+
serde_json = "1.0.138"
4747

4848
[features]
4949
tracing = ["log-instrument", "utils/tracing", "vmm/tracing"]

src/firecracker/examples/uffd/fault_all_handler.rs

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ fn main() {
3636
userfaultfd::Event::Pagefault { .. } => {
3737
for region in uffd_handler.mem_regions.clone() {
3838
uffd_handler
39-
.serve_pf(region.mapping.base_host_virt_addr as _, region.mapping.size)
39+
.serve_pf(region.mapping.base_host_virt_addr as _, region.mapping.size);
4040
}
4141
}
4242
_ => panic!("Unexpected event on userfaultfd"),

src/firecracker/examples/uffd/uffd_utils.rs

Lines changed: 33 additions & 10 deletions
@@ -116,7 +116,7 @@ impl UffdHandler {
116116
}
117117
}
118118

119-
pub fn serve_pf(&mut self, addr: *mut u8, len: usize) {
119+
pub fn serve_pf(&mut self, addr: *mut u8, len: usize) -> bool {
120120
// Find the start of the page that the current faulting address belongs to.
121121
let dst = (addr as usize & !(self.page_size - 1)) as *mut libc::c_void;
122122
let fault_page_addr = dst as u64;
@@ -133,14 +133,18 @@ impl UffdHandler {
133133
// event was received. This can be a consequence of guest reclaiming back its
134134
// memory from the host (through balloon device)
135135
Some(MemPageState::Uninitialized) | Some(MemPageState::FromFile) => {
136-
let (start, end) = self.populate_from_file(region, fault_page_addr, len);
137-
self.update_mem_state_mappings(start, end, MemPageState::FromFile);
138-
return;
136+
match self.populate_from_file(region, fault_page_addr, len) {
137+
Some((start, end)) => {
138+
self.update_mem_state_mappings(start, end, MemPageState::FromFile)
139+
}
140+
None => return false,
141+
}
142+
return true;
139143
}
140144
Some(MemPageState::Removed) | Some(MemPageState::Anonymous) => {
141145
let (start, end) = self.zero_out(fault_page_addr);
142146
self.update_mem_state_mappings(start, end, MemPageState::Anonymous);
143-
return;
147+
return true;
144148
}
145149
None => {}
146150
}
@@ -152,20 +156,39 @@ impl UffdHandler {
152156
);
153157
}
154158

155-
fn populate_from_file(&self, region: &MemRegion, dst: u64, len: usize) -> (u64, u64) {
159+
fn populate_from_file(&self, region: &MemRegion, dst: u64, len: usize) -> Option<(u64, u64)> {
156160
let offset = dst - region.mapping.base_host_virt_addr;
157161
let src = self.backing_buffer as u64 + region.mapping.offset + offset;
158162

159163
let ret = unsafe {
160-
self.uffd
161-
.copy(src as *const _, dst as *mut _, len, true)
162-
.expect("Uffd copy failed")
164+
match self.uffd.copy(src as *const _, dst as *mut _, len, true) {
165+
Ok(value) => value,
166+
// Catch EAGAIN errors, which occur when a `remove` event lands in the UFFD
167+
// queue while we're processing `pagefault` events.
168+
// The weird cast is because the `bytes_copied` field is based on the
169+
// `uffdio_copy->copy` field, which is a signed 64 bit integer, and if something
170+
// goes wrong, it gets set to a -errno code. However, uffd-rs always casts this
171+
// value to an unsigned `usize`, which scrambles the errno.
172+
Err(Error::PartiallyCopied(bytes_copied))
173+
if bytes_copied == 0 || bytes_copied == (-libc::EAGAIN) as usize =>
174+
{
175+
return None
176+
}
177+
Err(Error::CopyFailed(errno))
178+
if std::io::Error::from(errno).raw_os_error().unwrap() == libc::EEXIST =>
179+
{
180+
len
181+
}
182+
Err(e) => {
183+
panic!("Uffd copy failed: {e:?}");
184+
}
185+
}
163186
};
164187

165188
// Make sure the UFFD copied some bytes.
166189
assert!(ret > 0);
167190

168-
(dst, dst + len as u64)
191+
Some((dst, dst + len as u64))
169192
}
170193

171194
fn zero_out(&mut self, addr: u64) -> (u64, u64) {
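
The comment in `populate_from_file` about the "weird cast" can be illustrated in isolation. This is a minimal sketch under stated assumptions, not code from the commit: `EAGAIN` is hard-coded to its Linux value (11) so the example needs no `libc` crate, and `scramble` and `unscramble` are hypothetical helpers. The bound of 4095 is the Linux kernel's `MAX_ERRNO` and is used here only to tell an encoded errno apart from a large byte count.

```rust
/// Linux value of EAGAIN, hard-coded so this sketch has no dependencies.
const EAGAIN: i32 = 11;

/// Mimics what the comment describes: a negative errno stored in a signed
/// 64-bit field is reinterpreted as an unsigned `usize`.
fn scramble(copy_field: i64) -> usize {
    copy_field as usize
}

/// Reverses the cast. Returns `Some(errno)` if the value looks like an
/// encoded negative errno, and `None` if it is a plausible byte count.
fn unscramble(bytes_copied: usize) -> Option<i32> {
    let signed = bytes_copied as isize;
    if (-4095..0).contains(&signed) {
        Some((-signed) as i32)
    } else {
        None
    }
}

fn main() {
    let scrambled = scramble(-(EAGAIN as i64));

    // The comparison used in the diff: cast the negated errno the same way.
    assert_eq!(scrambled, (-EAGAIN) as usize);

    // Recovering the errno from the scrambled value.
    assert_eq!(unscramble(scrambled), Some(EAGAIN));

    // An ordinary byte count is not mistaken for an errno.
    assert_eq!(unscramble(4096), None);

    println!("scrambled EAGAIN = {scrambled:#x}");
}
```

Comparing against `(-libc::EAGAIN) as usize`, as the diff does, applies the same cast to both sides, so the check stays correct regardless of pointer width.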

src/firecracker/examples/uffd/valid_handler.rs

Lines changed: 72 additions & 17 deletions
@@ -26,24 +26,79 @@ fn main() {
2626
let mut runtime = Runtime::new(stream, file);
2727
runtime.install_panic_hook();
2828
runtime.run(|uffd_handler: &mut UffdHandler| {
29-
// Read an event from the userfaultfd.
30-
let event = uffd_handler
31-
.read_event()
32-
.expect("Failed to read uffd_msg")
33-
.expect("uffd_msg not ready");
34-
35-
// We expect to receive either a Page Fault or Removed
36-
// event (if the balloon device is enabled).
37-
match event {
38-
userfaultfd::Event::Pagefault { addr, .. } => {
39-
uffd_handler.serve_pf(addr.cast(), uffd_handler.page_size)
29+
// !DISCLAIMER!
30+
// When using UFFD together with the balloon device, this handler needs to deal with
31+
// `remove` and `pagefault` events. There are multiple things to keep in mind in
32+
// such setups:
33+
//
34+
// As long as any `remove` event is pending in the UFFD queue, all ioctls return EAGAIN
35+
// -----------------------------------------------------------------------------------
36+
//
37+
// This means we cannot process UFFD events simply one-by-one anymore - if a `remove` event
38+
// arrives, we need to pre-fetch all other events up to the `remove` event, to unblock the
39+
// UFFD, and then go back to process the pre-fetched events.
40+
//
41+
// UFFD might receive events not in their causal order
42+
// -----------------------------------------------------
43+
//
44+
// For example, the guest
45+
// kernel might first respond to a balloon inflation by freeing some memory, and
46+
// telling Firecracker about this. Firecracker will then madvise(MADV_DONTNEED) the
47+
// free memory range, which causes a `remove` event to be sent to UFFD. Then, the
48+
// guest kernel might immediately fault the page in again (for example because
49+
// deflate_on_oom was set), which causes a `pagefault` event to be sent to UFFD.
50+
//
51+
// However, the pagefault will be triggered from inside KVM on the vCPU thread, while the
52+
// balloon device is handled by Firecracker on its VMM thread. This means that potentially
53+
// this handler can receive the `pagefault` _before_ the `remove` event.
54+
//
55+
// This means that the simple "greedy" strategy of simply prefetching _all_ UFFD events
56+
// to make sure no `remove` event is blocking us can result in the handler acting on
57+
// the `pagefault` event before the `remove` message (despite the `remove` event being
58+
// in the causal past of the `pagefault` event), which means that we will fault in a page
59+
// from the snapshot file, while really we should be faulting in a zero page.
60+
//
61+
// In this example handler, we ignore this problem, to avoid
62+
// complexity (under the assumption that the guest kernel will zero a newly faulted in
63+
// page anyway). A production handler will most likely want to ensure that `remove`
64+
// events for a specific range are always handled before `pagefault` events.
65+
//
66+
// Lastly, we still need to deal with the race condition where a `remove` event arrives
67+
// in the UFFD queue after we got done reading all events, in which case we need to go
68+
// back to reading more events before we can continue processing `pagefault`s.
69+
let mut deferred_events = Vec::new();
70+
71+
loop {
72+
// First, try events that we couldn't handle last round
73+
let mut events_to_handle = Vec::from_iter(deferred_events.drain(..));
74+
75+
// Read all events from the userfaultfd.
76+
while let Some(event) = uffd_handler.read_event().expect("Failed to read uffd_msg") {
77+
events_to_handle.push(event);
78+
}
79+
80+
for event in events_to_handle.drain(..) {
81+
// We expect to receive either a Page Fault or `remove`
82+
// event (if the balloon device is enabled).
83+
match event {
84+
userfaultfd::Event::Pagefault { addr, .. } => {
85+
if !uffd_handler.serve_pf(addr.cast(), uffd_handler.page_size) {
86+
deferred_events.push(event);
87+
}
88+
}
89+
userfaultfd::Event::Remove { start, end } => uffd_handler
90+
.update_mem_state_mappings(start as u64, end as u64, MemPageState::Removed),
91+
_ => panic!("Unexpected event on userfaultfd"),
92+
}
93+
}
94+
95+
// We assume that really only the above removed/pagefault interaction can result in
96+
// deferred events. In that scenario, the loop will always terminate (unless
97+
// newly arriving `remove` events end up indefinitely blocking it, but there's nothing
98+
// we can do about that, and it's a largely theoretical problem).
99+
if deferred_events.is_empty() {
100+
break;
40101
}
41-
userfaultfd::Event::Remove { start, end } => uffd_handler.update_mem_state_mappings(
42-
start as u64,
43-
end as u64,
44-
MemPageState::Removed,
45-
),
46-
_ => panic!("Unexpected event on userfaultfd"),
47102
}
48103
});
49104
}
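
The deferral loop added to `valid_handler.rs` can be exercised without a real userfaultfd. The sketch below is a simplified model, not the commit's code: `MockUffd`, its `late_arrival` field and the `run` function are hypothetical, and a `false` return from `serve_pf` stands in for the EAGAIN case described in the disclaimer. It models the race where a `remove` event lands in the queue after all events were read.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    Pagefault { addr: u64 },
    Remove { start: u64, end: u64 },
}

/// Hypothetical stand-in for the userfaultfd event queue.
struct MockUffd {
    queue: VecDeque<Event>,
    /// Event that lands in the queue during the first page-fault attempt.
    late_arrival: Option<Event>,
}

impl MockUffd {
    fn read_event(&mut self) -> Option<Event> {
        self.queue.pop_front()
    }

    /// Returns false (modelling EAGAIN) while a `remove` is still queued.
    fn serve_pf(&mut self, _addr: u64) -> bool {
        if let Some(ev) = self.late_arrival.take() {
            self.queue.push_back(ev);
        }
        !self.queue.iter().any(|e| matches!(e, Event::Remove { .. }))
    }
}

/// Same shape as the loop in the diff: retry deferred events, drain the
/// queue, handle the batch, and repeat until nothing is deferred.
fn run(uffd: &mut MockUffd) -> Vec<String> {
    let mut log = Vec::new();
    let mut deferred: Vec<Event> = Vec::new();
    loop {
        let mut batch: Vec<Event> = deferred.drain(..).collect();
        while let Some(ev) = uffd.read_event() {
            batch.push(ev);
        }
        for ev in batch {
            match ev {
                Event::Pagefault { addr } => {
                    if uffd.serve_pf(addr) {
                        log.push(format!("served {addr:#x}"));
                    } else {
                        log.push(format!("deferred {addr:#x}"));
                        deferred.push(ev);
                    }
                }
                Event::Remove { start, end } => {
                    log.push(format!("removed {start:#x}..{end:#x}"));
                }
            }
        }
        if deferred.is_empty() {
            break;
        }
    }
    log
}

fn main() {
    let mut uffd = MockUffd {
        queue: VecDeque::from([Event::Pagefault { addr: 0x1000 }]),
        late_arrival: Some(Event::Remove { start: 0x2000, end: 0x3000 }),
    };
    for line in run(&mut uffd) {
        println!("{line}");
    }
}
```

In this model the retried page fault is served before the `remove` that blocked it is recorded, which is the ordering caveat the disclaimer says the example handler ignores. A production handler would likely process `remove` events for a range before page faults in that range.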

src/log-instrument-macros/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ bench = false
1313
[dependencies]
1414
proc-macro2 = "1.0.93"
1515
quote = "1.0.38"
16-
syn = { version = "2.0.96", features = ["full", "extra-traits"] }
16+
syn = { version = "2.0.98", features = ["full", "extra-traits"] }
1717

1818
[lints]
1919
workspace = true
