Conversation

roypat
Contributor

@roypat roypat commented Sep 3, 2025

Changes

...

Reason

...

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

codecov bot commented Sep 3, 2025

Codecov Report

❌ Patch coverage is 47.96651% with 435 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.79%. Comparing base (8208ee8) to head (0d2a0bc).

Files with missing lines Patch % Lines
src/vmm/src/builder.rs 45.29% 93 Missing ⚠️
src/vmm/src/lib.rs 2.12% 92 Missing ⚠️
src/vmm/src/resources.rs 37.50% 55 Missing ⚠️
src/vmm/src/vstate/vcpu.rs 20.75% 42 Missing ⚠️
src/vmm/src/vstate/vm.rs 71.02% 31 Missing ⚠️
src/vmm/src/vstate/memory.rs 74.13% 30 Missing ⚠️
src/vmm/src/persist.rs 37.77% 28 Missing ⚠️
src/vmm/src/devices/virtio/net/device.rs 44.44% 25 Missing ⚠️
src/vmm/src/devices/virtio/block/device.rs 0.00% 10 Missing ⚠️
src/vmm/src/devices/virtio/block/virtio/device.rs 40.00% 6 Missing ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5429      +/-   ##
==========================================
- Coverage   82.68%   81.79%   -0.90%     
==========================================
  Files         263      263              
  Lines       27473    28104     +631     
==========================================
+ Hits        22717    22987     +270     
- Misses       4756     5117     +361     
Flag Coverage Δ
5.10-m5n.metal 81.83% <46.03%> (-1.02%) ⬇️
5.10-m6a.metal 81.01% <46.03%> (-1.09%) ⬇️
5.10-m6g.metal 78.52% <45.96%> (-0.96%) ⬇️
5.10-m6i.metal 81.77% <46.03%> (-1.07%) ⬇️
5.10-m7a.metal-48xl 81.00% <46.03%> (-1.09%) ⬇️
5.10-m7g.metal 78.52% <45.96%> (-0.96%) ⬇️
5.10-m7i.metal-24xl 81.75% <46.03%> (-1.07%) ⬇️
5.10-m7i.metal-48xl 81.75% <46.03%> (-1.07%) ⬇️
5.10-m8g.metal-24xl 78.52% <45.96%> (-0.96%) ⬇️
5.10-m8g.metal-48xl 78.52% <45.96%> (-0.96%) ⬇️
6.1-m5n.metal 81.86% <46.03%> (-1.03%) ⬇️
6.1-m6a.metal 81.05% <46.03%> (-1.09%) ⬇️
6.1-m6g.metal 78.52% <45.96%> (-0.96%) ⬇️
6.1-m6i.metal 81.86% <46.03%> (-1.03%) ⬇️
6.1-m7a.metal-48xl 81.04% <46.03%> (-1.09%) ⬇️
6.1-m7g.metal 78.52% <45.96%> (-0.96%) ⬇️
6.1-m7i.metal-24xl 81.83% <46.03%> (-1.07%) ⬇️
6.1-m7i.metal-48xl 81.83% <46.03%> (-1.06%) ⬇️
6.1-m8g.metal-24xl 78.51% <45.96%> (-0.96%) ⬇️
6.1-m8g.metal-48xl 78.51% <45.96%> (-0.97%) ⬇️
6.16-m5n.metal 81.82% <46.03%> (?)
6.16-m6a.metal 81.04% <46.03%> (?)
6.16-m6g.metal 78.52% <45.96%> (?)
6.16-m6i.metal 81.81% <46.03%> (?)
6.16-m7a.metal-48xl 81.03% <46.03%> (?)
6.16-m7g.metal 78.52% <45.96%> (?)
6.16-m7i.metal-24xl 81.83% <46.03%> (?)
6.16-m7i.metal-48xl 81.82% <46.03%> (?)
6.16-m8g.metal-24xl 78.52% <45.96%> (?)
6.16-m8g.metal-48xl 78.52% <45.96%> (?)

Flags with carried forward coverage won't be shown.

@roypat roypat force-pushed the secret-hiding-rebase branch 3 times, most recently from d44d66c to ca0cf78 Compare September 8, 2025 16:05
JackThomson2 and others added 26 commits September 9, 2025 07:45
Creating a script to build and install a modified kernel with patches
applied.

Signed-off-by: Jack Thomson <[email protected]>
Adding a new integration test to assert that the kernel build script
will succeed.

Signed-off-by: Jack Thomson <[email protected]>
Adding the secret hiding kernel as a default for the Buildkite
pipeline. This means that PRs made against the branch will now run with
the new secret-hiding-enabled AMIs.

Some tests have been marked as skipped because they are kernel
dependent; while we are compiling our kernel in CI, these could change
again.

Signed-off-by: Jack Thomson <[email protected]>
To make it easier to track the upstream kernels which may change as we
rebase, let's mark kernels newer than 6.12 as next for now to make
dashboarding easier.

Signed-off-by: Jack Thomson <[email protected]>
Addressing a comment to move away from dir stacks in our install
scripts. We now store the start directory before moving into the build
directory and cd back to it explicitly.

Signed-off-by: Jack Thomson <[email protected]>
Run the kernel build as part of our nightly tests so we can monitor its
success.

Signed-off-by: Jack Thomson <[email protected]>
Add all Linux host kernel patches required for secret hiding.
These are:
- Direct map removal patches
- make kvm_clock work with direct map removed guest_memfd
- v2 of KVM_USERFAULT patches [1]
- support for UFFDIO_CONTINUE in guest_memfd VMAs
- support for write(2) syscall for guest_memfd

Based on kvm/next

[1]: https://lore.kernel.org/kvm/[email protected]/

Signed-off-by: Patrick Roy <[email protected]>
The patches are in the `patches` subdirectory of `hiding_ci`, so if only
patches were added, then the check of "any files with parent directory
`hiding_ci`" would be false, and the CI step for testing the build of
patches wouldn't actually run.

Fix this by updating the check to be "any files where any parent
directory is `hiding_ci`", which will also catch patches.

Reported-by: Jack Thomson <[email protected]>
Signed-off-by: Patrick Roy <[email protected]>
Update the build script to allow us to install the secret hidden kernels
onto Amazon Linux 2023 instances.

As part of this, we have to include a script to download and install ENA
drivers so the instance can boot.

Signed-off-by: Jack Thomson <[email protected]>
The output from the build on x86 is archived, so update the script to
support installing either output type from the build.

Signed-off-by: Jack Thomson <[email protected]>
Add an 'apt update' before `apt install`. Otherwise, we might hold an
old view of the package versions and installation might fail.

Signed-off-by: Babis Chalios <[email protected]>
This lint forbids using `..Default::default()` in struct initializers
after all fields have already been initialized, but this is a useful
pattern if you know you want to add more fields to a struct in a future
PR without needing to touch a ton of initializers in unit tests again
(_heavy foreshadowing_). So silence the paperclip.

Signed-off-by: Patrick Roy <[email protected]>
There's no need to test this through VmResources when it can be tested
in isolation. Also, every time I touch MachineConfig I get confused by
where the tests are, because not only are they in a different module,
they're also one directory level away. So move the tests into
machine_config.rs, where it makes sense to have them.

Signed-off-by: Patrick Roy <[email protected]>
With secret freedom, direct accesses to guest memory from the context of
the host kernel are no longer possible. This particularly means that we
cannot pass pointers to guest memory to the host kernel anymore (at
least if the host kernel tries to GUP them). For these scenarios,
introduce a utility decorator struct `MaybeBounce` that can optionally
do indirect read and write syscalls on guest memory by first memcpy-ing
to firecracker userspace, and passing a pointer to firecracker heap
memory into the kernel instead.

Signed-off-by: Patrick Roy <[email protected]>
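The decorator described above can be sketched roughly as follows. This is a minimal illustration, not Firecracker's actual implementation: the type name `MaybeBounce` comes from the commit message, but the fields, the `bounce_roundtrip` helper, and the use of `std::io::Write` as the wrapped interface are assumptions of this sketch.

```rust
use std::io::{self, Write};

/// Illustrative decorator: wraps a writer and, when bouncing is enabled,
/// copies the data into a private heap buffer before issuing the write, so
/// the kernel is handed a pointer to VMM heap memory rather than to
/// (inaccessible) guest memory.
struct MaybeBounce<W> {
    inner: W,
    bounce: Option<Vec<u8>>, // Some(buf) => bounce through this heap buffer
}

impl<W: Write> MaybeBounce<W> {
    fn new(inner: W, bounce: bool) -> Self {
        MaybeBounce {
            inner,
            bounce: if bounce { Some(Vec::new()) } else { None },
        }
    }
}

impl<W: Write> Write for MaybeBounce<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        match &mut self.bounce {
            Some(bb) => {
                // memcpy "guest memory" into VMM-owned heap memory first,
                // then pass the heap pointer into the kernel instead.
                bb.clear();
                bb.extend_from_slice(buf);
                self.inner.write(bb)
            }
            None => self.inner.write(buf), // direct path, no secret freedom
        }
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Demonstration helper: write `data` through a bouncing writer and return
/// what actually reached the destination.
fn bounce_roundtrip(data: &[u8]) -> Vec<u8> {
    let mut w = MaybeBounce::new(Vec::new(), true);
    w.write_all(data).unwrap();
    w.inner
}

fn main() {
    assert_eq!(bounce_roundtrip(b"guest data"), b"guest data");
    println!("bounced write ok");
}
```

The same shape works for reads: read into the heap buffer first, then memcpy into guest memory from userspace, where access is still possible.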
This is particularly useful for virtio devices, where on-demand
allocation of bounce buffers leads to severe performance impacts (~80%)
in synthetic throughput tests. Additionally, for virtio devices we can
know approximately what the optimal size of a statically allocated
bounce buffer is.

Allocate bounce buffers on the heap, as trying to even temporarily place
a 65k bounce buffer on the stack can lead to stack overflow errors.

Signed-off-by: Patrick Roy <[email protected]>
Add support to our virtio devices to allow userspace bounce buffering of
virtio buffers. This is an alternative to swiotlb.

Don't implement it for vhost-user-blk and for virtio-block with async
engine, because I have no idea how that would even work.

Signed-off-by: Patrick Roy <[email protected]>
If the CI artifacts don't contain old Firecracker releases, still succeed
at setting them up after downloading them.

Signed-off-by: Patrick Roy <[email protected]>
Add a utility function for creating a guest_memfd and wrapping it into a
`File` object.

Signed-off-by: Patrick Roy <[email protected]>
There'll be a lot more things that are incompatible going forward (mostly
related to secret freedom), so instead of adding a ton of error variants
for each pair of incompatible features, let's just have a single one
where we can insert arbitrary features via a string argument.

Signed-off-by: Patrick Roy <[email protected]>
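A minimal sketch of that single-variant approach (the enum name, the variant name, and the `check_compat` helper are all illustrative, not the actual Firecracker types):

```rust
/// One catch-all variant carrying a string that names the incompatible
/// feature pair, instead of a dedicated variant per pair.
#[derive(Debug, PartialEq)]
enum ResourcesError {
    Incompatible(String),
    // ...other variants elided...
}

/// Example check using the catch-all variant; the feature pair shown here
/// (secret freedom vs. vhost-user) is one the PR actually rules out.
fn check_compat(secret_free: bool, vhost_user: bool) -> Result<(), ResourcesError> {
    if secret_free && vhost_user {
        return Err(ResourcesError::Incompatible(
            "secret freedom and vhost-user devices".to_string(),
        ));
    }
    Ok(())
}

fn main() {
    assert!(check_compat(true, true).is_err());
    assert!(check_compat(true, false).is_ok());
    assert!(check_compat(false, true).is_ok());
}
```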
This will later indicate to Firecracker that guest memory should be
backed by guest_memfd.

Mark vhost-user and async block engine as incompatible, as I/O will
require userspace bounce buffers. For vhost-user-blk, we would need
to communicate the need for bounce buffers to the backend somehow, and
for the async block engine we would need to somehow keep the bounce
buffers around until io_uring finishes requests (which is not
impossible, but complicated and not needed for now).

Signed-off-by: Patrick Roy <[email protected]>
If secret freedom is enabled, the guest kernel and any initrd need to
be loaded via a bounce buffer, as we cannot directly do `read`
syscalls that target guest memory.

Signed-off-by: Patrick Roy <[email protected]>
Needed because we cannot do I/O straight into secret hidden memory - the
host kernel cannot access it.

Signed-off-by: Patrick Roy <[email protected]>
Fall back to kvm_userspace_memory_region in case the v2 version of the
struct isn't supported.

Signed-off-by: Patrick Roy <[email protected]>
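The fallback pattern can be sketched generically like this. The closures stand in for the real `KVM_SET_USER_MEMORY_REGION2` / `KVM_SET_USER_MEMORY_REGION` ioctls, and mapping "kernel doesn't know this ioctl" to `ErrorKind::Unsupported` is an assumption of the sketch (real code would match the raw errno):

```rust
use std::io;

/// Try the v2 path first; only fall back to the legacy path when the
/// kernel reports the newer ioctl as unsupported. Any other error is a
/// genuine failure and is returned as-is.
fn set_memslot_with_fallback<T>(
    v2: impl FnOnce() -> io::Result<T>,
    legacy: impl FnOnce() -> io::Result<T>,
) -> io::Result<T> {
    match v2() {
        // An old kernel that doesn't know the v2 struct: retry legacy.
        Err(e) if e.kind() == io::ErrorKind::Unsupported => legacy(),
        other => other,
    }
}

fn main() {
    // Simulate an old kernel: v2 fails as unsupported, legacy succeeds.
    let r = set_memslot_with_fallback(
        || Err::<u32, _>(io::ErrorKind::Unsupported.into()),
        || Ok(1),
    );
    assert_eq!(r.unwrap(), 1);

    // On a new kernel, the v2 result is used directly.
    let r = set_memslot_with_fallback(|| Ok(2), || Ok(1));
    assert_eq!(r.unwrap(), 2);
}
```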
vm-memory has faulty validation logic that prevents us from mmap-ing
guest_memfds, so just bypass that by calling mmap ourselves for the time
being.

See also rust-vmm/vm-memory#320

Signed-off-by: Patrick Roy <[email protected]>
Have the `struct Vm` constructor take an argument to indicate whether
the VM should be secret free. Use this to determine the correct vm type
for guest_memfd support, and store it inside the VM so that we don't
have to pass bools to various functions.

Signed-off-by: Patrick Roy <[email protected]>
If the `secret_free` field of the memory_config is set to true in the
/machine-config endpoint, back all memory regions using
guest_memfd. For our setup, this means both setting the
guest_memfd[_offset] fields in kvm_userspace_memory_region2, as well as
mmap-ing the guest memory and reflecting this VMA back into the memslot's
userspace_addr (which is how kvm internal accesses to guest memory will
work for these guest_memfd regions, such as mmio emulation on x86).

Signed-off-by: Patrick Roy <[email protected]>
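The memslot setup described above can be sketched as follows. The struct layout follows the kvm_userspace_memory_region2 pattern, but treat the exact field names, the flag value, and the `secret_free_slot` builder as assumptions of this sketch rather than KVM's definitive ABI:

```rust
/// Illustrative memslot struct in the style of kvm_userspace_memory_region2.
#[repr(C)]
#[derive(Default, Debug)]
struct UserMemoryRegion2 {
    slot: u32,
    flags: u32,
    guest_phys_addr: u64,
    memory_size: u64,
    userspace_addr: u64,
    guest_memfd_offset: u64,
    guest_memfd: u32,
    pad: u32,
}

const KVM_MEM_GUEST_MEMFD: u32 = 1 << 2; // value assumed for this sketch

/// Build a secret-free memslot: the region is backed by guest_memfd, and
/// the mmap of that guest_memfd (`vma`) is reflected back into
/// userspace_addr so KVM-internal accesses to guest memory (e.g. x86 MMIO
/// emulation) still work.
fn secret_free_slot(slot: u32, gpa: u64, size: u64, vma: u64, gmem_fd: u32) -> UserMemoryRegion2 {
    UserMemoryRegion2 {
        slot,
        flags: KVM_MEM_GUEST_MEMFD,
        guest_phys_addr: gpa,
        memory_size: size,
        userspace_addr: vma,
        guest_memfd_offset: 0,
        guest_memfd: gmem_fd,
        pad: 0,
    }
}

fn main() {
    let s = secret_free_slot(0, 0x8000_0000, 1 << 21, 0x7f00_0000_0000, 3);
    assert_eq!(s.flags, KVM_MEM_GUEST_MEMFD);
    assert_eq!(s.userspace_addr, 0x7f00_0000_0000);
    assert_eq!(s.guest_memfd, 3);
}
```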
kalyazin and others added 25 commits September 9, 2025 07:45
It is used by Secret-Free-enabled UFFD handlers to disable vCPU fault
notifications from the kernel.

Signed-off-by: Nikita Kalyazin <[email protected]>
Accept receiving 3 fds instead of 1, where fds[1] is guest_memfd and
fds[2] is userfault bitmap memfd.

Also handle the FaultRequest message over the UDS socket by calling a
new callback in the Runtime and sending a FaultReply.

Co-authored-by: Patrick Roy <[email protected]>
Signed-off-by: Patrick Roy <[email protected]>
Signed-off-by: Nikita Kalyazin <[email protected]>
There are two ways a UFFD handler receives a fault notification if
Secret Freedom is enabled (which is inferred from 3 fds sent by
Firecracker instead of 1):
 - a VMM- or KVM-triggered fault is delivered via a minor UFFD fault
   event.  The handler is supposed to respond to it via memcpying the
   content of the page (if the page hasn't already been populated)
   followed by a UFFDIO_CONTINUE call.
 - a vCPU-triggered fault is delivered via a FaultRequest message on
   the UDS socket.  The handler is supposed to reply with a pwrite64
   call on the guest_memfd to populate the page followed by a FaultReply
   message on the UDS socket.

In both cases, the handler also needs to clear the bit in the userfault
bitmap at the corresponding offset in order to stop further fault
notifications for the same page.

UFFD handlers use the userfault bitmap for two purposes:
 - communicate to the kernel whether a fault at the corresponding
   guest_memfd offset will cause a VM exit
 - keep track of pages that have already been populated in order to
   avoid overwriting the content of the page that is already
   initialised.

Signed-off-by: Nikita Kalyazin <[email protected]>
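The bitmap bookkeeping common to both paths can be sketched like this. One bit per guest_memfd page is what the commit describes; the 4 KiB page size and LSB-first bit order within a byte are assumptions of the sketch:

```rust
const PAGE_SIZE: u64 = 4096; // assumption: 4 KiB pages

/// Clear the userfault bit for the page containing `offset` into the
/// guest_memfd, so the kernel stops raising fault notifications for it
/// once the page has been populated.
fn clear_userfault_bit(bitmap: &mut [u8], offset: u64) {
    let page = (offset / PAGE_SIZE) as usize;
    bitmap[page / 8] &= !(1u8 << (page % 8));
}

/// A set bit doubles as "page not yet populated", letting the handler
/// avoid overwriting content it already initialised.
fn userfault_bit_set(bitmap: &[u8], offset: u64) -> bool {
    let page = (offset / PAGE_SIZE) as usize;
    bitmap[page / 8] & (1u8 << (page % 8)) != 0
}

fn main() {
    // All pages initially marked as "notify on fault".
    let mut bitmap = vec![0xffu8; 2]; // 16 pages
    assert!(userfault_bit_set(&bitmap, 9 * PAGE_SIZE));
    clear_userfault_bit(&mut bitmap, 9 * PAGE_SIZE); // page 9 populated
    assert!(!userfault_bit_set(&bitmap, 9 * PAGE_SIZE));
    assert!(userfault_bit_set(&bitmap, 8 * PAGE_SIZE)); // neighbours untouched
}
```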
These are used for communication of page faults between Firecracker and
a UFFD handler.

Signed-off-by: Nikita Kalyazin <[email protected]>
If configured, the userfault bitmap is registered with KVM and controls
whether KVM will exit to userspace on a fault of the corresponding page.

We are going to allocate the bitmap in a memfd in Firecracker, set bits
for all pages to request notifications for vCPU faults and send
it to the UFFD handler to delegate clearing the bits as pages get
populated.

Since the KVM userfault patches are still in review,
set_user_memory_region2 is not aware of the userfault flag and the
userfault bitmap address in its input structure.  Define it in
Firecracker code temporarily.

Signed-off-by: Nikita Kalyazin <[email protected]>
This is needed to instruct the kernel to exit to userspace when a vCPU
fault occurs and the corresponding bit in the userfault bitmap is set.

The userfault bitmap is allocated in a memfd by Firecracker and sent to
the UFFD handler.

This also sends 3 fds to the UFFD handler in the handshake:
 - UFFD (original)
 - guest_memfd: for the handler to be able to populate guest memory
 - userfault bitmap memfd: for the handler to be able to disable exits
   to userspace for the pages that have already been populated

Signed-off-by: Nikita Kalyazin <[email protected]>
This is because vCPUs reason in GPAs while the secret-free UFFD
protocol is guest_memfd-offset-based.

Note that offset_to_gpa is not used yet, but will likely be needed to
support async PF to pass the GPA to a new ioctl when notifying KVM of a
fault resolution.

Signed-off-by: Nikita Kalyazin <[email protected]>
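The two translations can be sketched over a simple region table. The `Region` layout and function names here are illustrative; the point is only that each direction is a range lookup plus an offset shift:

```rust
/// Illustrative region table entry mapping a guest-physical range to a
/// range of guest_memfd offsets.
struct Region {
    gpa: u64,
    offset: u64,
    len: u64,
}

/// vCPU exits report GPAs, but the secret-free UFFD protocol speaks in
/// guest_memfd offsets, so both directions of translation are needed.
fn gpa_to_offset(regions: &[Region], gpa: u64) -> Option<u64> {
    regions
        .iter()
        .find(|r| gpa >= r.gpa && gpa - r.gpa < r.len)
        .map(|r| r.offset + (gpa - r.gpa))
}

fn offset_to_gpa(regions: &[Region], offset: u64) -> Option<u64> {
    regions
        .iter()
        .find(|r| offset >= r.offset && offset - r.offset < r.len)
        .map(|r| r.gpa + (offset - r.offset))
}

fn main() {
    // Two regions: 1 MiB at GPA 0, and 1 MiB at GPA 0x8000_0000 that
    // starts at offset 1 MiB in the guest_memfd.
    let regions = [
        Region { gpa: 0, offset: 0, len: 1 << 20 },
        Region { gpa: 0x8000_0000, offset: 1 << 20, len: 1 << 20 },
    ];
    assert_eq!(gpa_to_offset(&regions, 0x8000_1000), Some((1 << 20) + 0x1000));
    assert_eq!(offset_to_gpa(&regions, (1 << 20) + 0x1000), Some(0x8000_1000));
    assert_eq!(gpa_to_offset(&regions, 0x4000_0000), None); // hole in GPA space
}
```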
It contains two parts:
 - external: between the VMM thread and the UFFD handler
 - internal: between vCPUs and the VMM thread

An outline of the workflow:
 - When a vCPU fault occurs, vCPU exits to userspace
 - The vCPU thread sends the exit syndrome on the vCPU-to-VMM
   channel and writes to the eventfd
 - The VMM thread forwards the syndrome to the UFFD handler via the UDS
   socket
 - The UFFD handler populates the page, clears the corresponding bit in
   the userfault bitmap and sends a reply to Firecracker
 - The VMM thread receives the reply and updates a vCPU condvar to
   notify the vCPU that the fault has been resolved
 - The vCPU resumes execution

Note that as a result of this change, the ability to exit the VM
gracefully is lost (at least on x86).  In the existing implementation,
the VMM thread initiated an exit if an event was read from the eventfd
but no VcpuResponse::Exited responses were read, for an unknown reason.
Since the exit_evt eventfd is now also used by vCPUs to notify the VMM
thread of VM exits caused by page faults, this situation (an eventfd
event, but no response in the channel) can now also occur because we
have already read all VcpuResponse::Userfault messages in response to
the previous eventfd event.

Signed-off-by: Nikita Kalyazin <[email protected]>
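The internal vCPU-to-VMM-thread half of the workflow can be sketched with a condvar. This is a minimal stand-in: the real code forwards the exit syndrome over the UDS socket and waits for a FaultReply, while here a spawned thread resolves the "fault" immediately, and all names are illustrative:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// The vCPU thread blocks on a condvar until the VMM thread (after
/// receiving the UFFD handler's reply) marks the fault resolved.
fn fault_roundtrip() -> bool {
    let resolved = Arc::new((Mutex::new(false), Condvar::new()));

    let vmm_thread = {
        let resolved = Arc::clone(&resolved);
        thread::spawn(move || {
            // Stand-in for: forward the syndrome to the UFFD handler over
            // the UDS socket and receive its FaultReply.
            let (lock, cvar) = &*resolved;
            *lock.lock().unwrap() = true;
            cvar.notify_one();
        })
    };

    // vCPU side: wait until the fault is resolved, then resume execution.
    let (lock, cvar) = &*resolved;
    let mut done = lock.lock().unwrap();
    while !*done {
        done = cvar.wait(done).unwrap();
    }
    let ok = *done;
    drop(done);
    vmm_thread.join().unwrap();
    ok
}

fn main() {
    assert!(fault_roundtrip());
    println!("fault resolved, vCPU resumed");
}
```

The condvar wait sits in a loop because condvar wakeups can be spurious; the vCPU only resumes once the resolved flag is actually set.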
In a regular VM, we mmap the memory snapshot file and supply the address
in the KVM memory slot.  In Secret Free VMs, we provide guest_memfd in
the memory slot instead.  There is no way we can restore a Secret Free
VM from a file, unless we prepopulate the guest_memfd with the file
content, which is inefficient and is not practically useful.

Signed-off-by: Nikita Kalyazin <[email protected]>
This includes both functional and performance tests.

Signed-off-by: Nikita Kalyazin <[email protected]>
Do not add a balloon device to a Secret Free VM as it is not currently
supported.

Signed-off-by: Nikita Kalyazin <[email protected]>
This is because the error type has changed due to the implementation of
snapshot restore support for Secret Free VMs.

Signed-off-by: Nikita Kalyazin <[email protected]>
Graceful shutdown is currently broken on x86_64.

Signed-off-by: Nikita Kalyazin <[email protected]>
Writing to the noturbo sysfs immediately locks up the entire instance,
so stop doing this for now.

Signed-off-by: Patrick Roy <[email protected]>
Without this, the script will ask for user input and get stuck if run
unattended.

Signed-off-by: Patrick Roy <[email protected]>
Started seeing the below failure in test_population_latency:

thread 'main' panicked at .../uffd/fault_all_handler.rs:41:18:
uffd_msg not ready
note: run with `RUST_BACKTRACE=1` environment variable to display a
backtrace

I am not entirely sure how this can happen, because the read from the
uffd is supposed to be blocking, but maybe it's a weird interaction
with the fault-all behavior (e.g. there was a uffd event queued, but
because we faulted everything it got cancelled again?), so let's just
retry the read(2) if we don't read anything.

Signed-off-by: Patrick Roy <[email protected]>
Currently, we often get stuck with the problem where something in the
host kernel breaks in a way that causes functional tests to fail, but we
cannot update the patch series from which the host kernel gets built,
because functional tests are failing. Break this cyclic dependency by
simply not running functional tests when updating only the patch series
(as they don't test the updated kernel anyway).

Signed-off-by: Patrick Roy <[email protected]>
Return errors up the stack instead of panicking.

Signed-off-by: Nikita Kalyazin <[email protected]>
Return None if file_offset() is None instead.

Signed-off-by: Nikita Kalyazin <[email protected]>
This is to make sure that we always write the entire FaultRequest
message even if the syscall was interrupted.

Signed-off-by: Nikita Kalyazin <[email protected]>
Make sure we continue reading the FaultReply if the syscall was
interrupted.

Signed-off-by: Nikita Kalyazin <[email protected]>
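The restart-on-EINTR pattern from the last two commits can be sketched for the write side as follows (the read side loops the same way around `read()`). The `write_full` helper and the `FlakyWriter` test double are illustrative; Rust maps EINTR to `ErrorKind::Interrupted`:

```rust
use std::io::{self, Write};

/// Keep writing until the whole message is out, treating Interrupted
/// (EINTR) as "retry", so a signal cannot leave a half-written
/// FaultRequest on the socket.
fn write_full<W: Write>(w: &mut W, mut buf: &[u8]) -> io::Result<()> {
    while !buf.is_empty() {
        match w.write(buf) {
            Ok(0) => return Err(io::ErrorKind::WriteZero.into()),
            Ok(n) => buf = &buf[n..], // partial write: advance and continue
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

/// Test double that fails with EINTR once before succeeding.
struct FlakyWriter {
    out: Vec<u8>,
    interrupted_once: bool,
}

impl Write for FlakyWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if !self.interrupted_once {
            self.interrupted_once = true;
            return Err(io::ErrorKind::Interrupted.into());
        }
        self.out.extend_from_slice(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() {
    let mut w = FlakyWriter { out: Vec::new(), interrupted_once: false };
    write_full(&mut w, b"FaultRequest").unwrap();
    assert_eq!(w.out, b"FaultRequest");
}
```

`std::io::Write::write_all` already restarts on Interrupted, which is why switching to it (or an equivalent loop) fixes the truncation risk.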
Get rid of the expect by using indexing.

Signed-off-by: Nikita Kalyazin <[email protected]>
Upgrade uffd-rs to 0.9.0, which comes with support for UFFDIO_CONTINUE,
so we can drop our homegrown version of it.

Signed-off-by: Patrick Roy <[email protected]>
By unconditionally grabbing the second part of the path of a modified
file in `run_all_tests()`, we ended up indexing out of bounds if a
modified file does not _have_ a second component in its path (e.g. if
the file is at the repository root, like `Cargo.lock`). Fix this by
checking the length of x.parts first, and using Python's
short-circuiting behavior of logical operators.

Signed-off-by: Patrick Roy <[email protected]>
We are seeing the execution time sometimes exceed the default 60-minute
Buildkite timeout.

Signed-off-by: Nikita Kalyazin <[email protected]>
@roypat roypat force-pushed the secret-hiding-rebase branch from ca0cf78 to 11811c7 Compare September 9, 2025 06:45
With secret freedom, Firecracker tracks more per-vcpu metadata, so in a
test with 32 vcpus, we manage to barely go above the memory limit. Just
disable the monitor for these tests.

Signed-off-by: Patrick Roy <[email protected]>
In the uvm_restored fixture, we create a throwaway VM to take a snapshot
of. This VM is completely invisible to the test, so cannot be configured
differently. If the memory monitor triggers in this VM, then it has
nothing to do with the test itself, and it is not recoverable. So just
disable the memory monitor for this VM.

Signed-off-by: Patrick Roy <[email protected]>
We open-coded MemoryMonitor.stop() inside __exit__. Stop doing that.

Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat closed this Sep 9, 2025