Skip to content

Conversation

@bchalios
Copy link
Contributor

@bchalios bchalios commented Jul 3, 2025

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

@bchalios bchalios changed the base branch from feature/pcie to main July 3, 2025 22:07
bchalios and others added 28 commits July 4, 2025 10:40
This is just code organization changes. Create a new module under
`virtio`, called `transport`. For the time being the only transport
supported is `mmio`. Also, move `IrqInterrupt` type within the MMIO
transport code, as it is MMIO specific.

Signed-off-by: Babis Chalios <[email protected]>
`IrqTrigger::new()` returns a `Result` because creating an `EventFd`
might fail with an `std::io::Error` error. All users of `IrqTrigger`
create the object and directly unwrap the error.

To avoid unwraps all over the place, change `IrqTrigger::new()` to
unwrap a potential error while creating the EventFd internally and just
return `Self`.

Signed-off-by: Babis Chalios <[email protected]>
The MMIO transport for VirtIO devices uses an `IrqTrigger` object as the
object that models the logic for sending interrupts from the device to
the guest. We create one such object for every VirtIO device when
creating it. The MMIO device manager associates this object with an IRQ
number and registers it with KVM.

This commit changes the timing of association of an `IrqTrigger` with a
VirtIO-mmio device. It only assigns such an object to the device during
its activation. We do this to prepare for supporting a PCI transport for
VirtIO devices. The cloud hypervisor implementation for these passes the
interrupt objects used by the device during activation, so we make this
change to have a uniform way to handle interrupts for both transports.
Functionally, nothing changes for MMIO devices, as before activation we
don't trigger any interrupts.

Signed-off-by: Babis Chalios <[email protected]>
Describing the APIs that need to implement types that are used as
interrupts for VirtIO devices. Currently, we only use `IrqInterrupt`
interrupts, but this will change once we have MSI-X with PCIe devices.

Signed-off-by: Babis Chalios <[email protected]>
VirtIO devices assume they're operating under an MMIO transport and as a
consequence they use IrqTrigger as interrupts. Switch that to using
VirtioInterrupt for all VirtIO device objects. Only assume a
VirtioInterrupt is an IrqTrigger in MMIO specific code.

Signed-off-by: Babis Chalios <[email protected]>
Bring in the vm-device crate from CloudHypervisor. We will be using it
for adding PCIe support.

Signed-off-by: Babis Chalios <[email protected]>
We use `SerialDevice` with Stdin as the input source. Encode this in the
type so that we don't spill the generic all over the place.

Signed-off-by: Babis Chalios <[email protected]>
Co-authored-by: Egor Lazarchuk <[email protected]>
Signed-off-by: Egor Lazarchuk <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
Use the vm_device::Bus bus for all MMIO devices. This is mainly to
prepare for using it for PCIe devices. Also, sepate VirtIO devices from
other MMIO devices inside the MMIODeviceManager struct. This makes
iterating over VirtIO devices easier since we don't need to access two
data structures to get a reference to a VirtIO device any more.

Signed-off-by: Babis Chalios <[email protected]>
We were always constructing RTCDevice using a set of metrics that were
defined in the RTC module itself. Don't leak the metrics to other
modules. Instead, create a new() function that always constructs it the
correct way.

Signed-off-by: Babis Chalios <[email protected]>
Use the vm_device::Bus bus for PortIO devices on x86. PCIe devices will
use this as well.

Signed-off-by: Babis Chalios <[email protected]>
PCIe spec mandates that software can access the configuration space of
PCIe devices both via MMIO and Port IO accesses. As a result, PCIe
devices will need to register to both buses (on x86).

Change the organization of devices, so that MMIO and PIO device managers
do not own the buses. Instead, introduce a DeviceManager object which
holds the buses, the resource allocator and includes also all types of
device managers (at the moment MMIO, PIO and ACPI).

Signed-off-by: Babis Chalios <[email protected]>
We always create anew the keyboard interrupt event. Just create it
inside `I8042Device::new()` and return an error if that fails.

Signed-off-by: Babis Chalios <[email protected]>
test_serial_dos test checks that when we send a lot of bytes in the
serial device the emulation logic does not increase indefinitely the
underlying buffer that we use for when the device is set in loopback
mode.

However, the test does not wait for the microVM to start and sometimes
the virtual memory allocation may increase between readings. Add a
network device to the microVM so that we implicitly wait until it has
booted before taking the first measurement.

Signed-off-by: Babis Chalios <[email protected]>
Bring in pci crate from cloud hypervisor with a few modifications.
We use the rust-vmm vm-allocator crate instead of Cloud Hypervisor's
downstream one. For the time being, rust-vmm's implementation should
include all we need for supporting the devices we care about. If we need
more functionality from our allocators, we will implement the logic
directly in the rust-vmm vm-allocator crate.

Signed-off-by: Babis Chalios <[email protected]>
PCIe distinguishes MMIO regions between 32bit and 64bit, caring for
devices that can't deal with 64-bit addresses. This commit defines the
appropriate regions for both x86 and aarch64 architectures, extends the
resource allocator to handle allocations for both of these regions and
adjusts the logic that calculates the memory regions for the
architecture.

Also, un-do the change that added an `offset` argument
`arch_memory_regions` function. We won't be using this for "secret
hiding" so it just made the logic (especially for kani proofs) too
convoluted.

Signed-off-by: Babis Chalios <[email protected]>
PCIe devices need some times to relocate themselves in memory. To do so,
they need to keep an (atomic) reference to a type that implements
`DeviceRelocation` trait. The logic for relocation involves removing the
device from the bus it has been registered to, allocate a new address
range for it and reinsert it.

Instead of creating a new type for it, reuse `ResourceAllocator`. This
means that we need to move the buses from the `DeviceManager` inside
`ResourceAllocator`.

Signed-off-by: Babis Chalios <[email protected]>
Add a PCIe segment which includes a single PCIe root port and a bus.

At the moment, the PCIe segment is always enabled. Later commit will
make it optional and enable it only when a command line argument flag is
passed to Firecracker binary.

Signed-off-by: Babis Chalios <[email protected]>
So that we can declare which memory region is used by PCIe devices for
MMCONFIG.

Signed-off-by: Babis Chalios <[email protected]>
Write the PCI root bridge in FDT when PCI is enabled.

Signed-off-by: Babis Chalios <[email protected]>
Add a command line argument to enable PCIe support. By default, PCIe
is disabled. The reason for making PCIe off by default is that users
need to explicitly enable PCI support in their kernels. Requiring users
to explicitly enable it, does not break existing deployments, i.e. users
can upgrade Firecracker within their existing environments without
breaking any deployment.

Signed-off-by: Babis Chalios <[email protected]>
At the moment, the logic just restores the device manager and add the
PCIe root complex if PCI is enabled.

Signed-off-by: Babis Chalios <[email protected]>
Add an integration test that checks that `lspci` correctly locates the
PCIe root complex if PCI is enabled for the microVM. Also, add a
negative test that checks that PCIe root complex doesn't exist when PCI
is not enabled.

Also, extend coverage of, at least some of, the tests to ensure that
they run with and without PCI configuration enabled. Do that by
extending the `uvm_any*` fixtures to yield both variants.

Signed-off-by: Babis Chalios <[email protected]>
PCI-enabled guest kernels enable the `extd_apicid` CPU feature for AMD
CPU families after 16h. Our supported AMD families (Milan & Genoa) are
both 19h. This is irrespective of whether PCI is enabled in Firecracker.

Do not mark this as host-only when running with PCI enabled kernels,
i.e. all kernels that support ACPI.

Signed-off-by: Babis Chalios <[email protected]>
We have some Rust integration tests that check building and
booting of microVMs works correctly. Add variants for PCI-enabled
microVMs.

Signed-off-by: Babis Chalios <[email protected]>
Tests test_spectre_meltdown_checker_on_guest and
test_check_vulnerability_files_ab run A/B tests between the HEAD of the
target branch and the tip of a PR branch. This will currently fail,
because Firecracker builds from the HEAD of the target branch know
nothing about the `--enable-pci` command line flag, so launching
the Firecracker binary for revision A will fail.

Only run these tests for non-PCI uVMs for now. Once this commit gets
merged we will re-enable and make sure that everything works as
expected.

Signed-off-by: Babis Chalios <[email protected]>
1. build the kernel with PCI/e support.
2. fix a race condition between udev renaming the network devices and
   fcnet setting up the network interfaces
3. install pciutils on the image

Signed-off-by: Riccardo Mancini <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
I've rebuilt the CI artifacts for the new development version.

Signed-off-by: Riccardo Mancini <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
The memory monitor was only assuming a single MMIO gap on x86_64 when
calculating the memory regions that corresponded to guest memory. Now we
need to account for two MMIO gaps in x86 and one in ARM.

Signed-off-by: Babis Chalios <[email protected]>
bchalios added 27 commits July 4, 2025 10:40
Commit be5a600 (tests: fix MMIO gaps in memory monitor tool) that fixed
the memory monitor to account for the 64-bit MMIO region included a
left-over debug print. Remove it.

Signed-off-by: Babis Chalios <[email protected]>
We need the new KvmIrqRouting FamStruct wrapper from kvm-bindings, which
though forces us to update vmm-sys-util to 0.14.0 and also bump all
downstream dependencies of vmm-sys-util to use that version.

Signed-off-by: Babis Chalios <[email protected]>
Define thiserror::Error and displaydoc::Display for various error types
in the vended PCI crate. This way we can embed them in our error types
downstream. Also export a few types and struct fields that were private
and we will be needing them.

Signed-off-by: Babis Chalios <[email protected]>
Instead of returning an `EventFd` type, which will actually force us to
clone the file descriptor in the Firecracker side.

Signed-off-by: Babis Chalios <[email protected]>
This is code we are not going to use in Firecracker. Remove it, so we
can keep the crates we vend as minimal as possible, including only
things we are actually using.

Signed-off-by: Babis Chalios <[email protected]>
We'd like to be able to store Vm within an atomic reference so we can
pass it around and share it with other components. The main issue with
doing this change is that we need Vm to be `mut` during initialization
and the builder.rs code was creating Vmm with Vm embedded in it.

To solve this, we break down the initialization of the Vmm object. We
first create its individual parts (Vm, Kvm and DeviceManager), perform
any necessary initialization logic on Vm and once this done add it
within an Arc.

Signed-off-by: Babis Chalios <[email protected]>
Add logic to track the device interrupts used by the microVM. This is
not strictly needed right now, but we will need it when adding support
for MSI-X interrupts. MSI-X interrupts are configured at runtime and we
need to interact with KVM to set the interruput routes. To do it, we
need to keep track all of the interrupts the VM is using.

Signed-off-by: Babis Chalios <[email protected]>
Enable Vm to vend and manage MSI/MSI-X interrupts. This adds the logic
to create a set of MSI vectors and then handle their lifetime.

Signed-off-by: Babis Chalios <[email protected]>
Vm object is now maintaining information about the interrupts (both
traditional IRQs and MSI-X vectors) that are being used by microVM
devices. Derive Serialize/Deserialize add logic for recreating objects
for relevant types.

Signed-off-by: Babis Chalios <[email protected]>
Apparently, PCI needs Queue::size to be initialized to the maximum
possible size supported by the device, otherwise initialization fails.

Signed-off-by: Babis Chalios <[email protected]>
Remove the flags in FADT that were declaring we do not support MSI and
PCI ASPM.

Signed-off-by: Babis Chalios <[email protected]>
Merge the device-related errors that DeviceManager might return. This
way, we can avoid adding yet another error type for PCI devices and
reduce some the variants of StartMicrovmError.

Suggested-by: Egor Lazarchuk <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
Add a VirtIO PCI transport implementation. When a Firecracker microVM is
launched with --enable-pci, we will create all VirtIO devices using the
PCI transport layer. Snapshotting of VirtIO PCI devices is not supported
and we will add this functionality in later commit.

Add a couple of tests that ensure that PCI configuration space is what
expected. We read common fields and make sure the BAR we allocate for
the VirtIO device is what expected.

Signed-off-by: Babis Chalios <[email protected]>
We are now calling KVM_CHECK_EXTENSION for checking the
KVM_CAP_MSI_DEVID capability. We are also calling KVM_SET_GSI_ROUTING to
set the interrupts routes and KVM_IRQFD to set/unset interrupt lines.

Signed-off-by: Babis Chalios <[email protected]>
Add some unit tests to PciSegment. We now test that the
next_device_bdf() method and the initialization logic work as expected.
We also check that the configuration space of the PCI segment is
correctly registered with the MMIO and, on x86, PIO bus.

Signed-off-by: Babis Chalios <[email protected]>
vm-allocator now allows us to (De)serialize IdAllocator and
AddressAllocator types. Add ResourceAllocator in DeviceManager snapshot
state and restore it when loading a snapshot. Like this we can avoid
doing the ExactMatch allocations during snapshot resumes for reserving
the exact same MMIO ranges.

Moreover, change DeviceManager and PciDevices to provide save/restore
functionality via the Persist trait. Like that we can avoid first
creating the objects and then restoring their state, overwriting their
fields.

Signed-off-by: Babis Chalios <[email protected]>
VirtIO MMIO restore logic activates the device the moment we restore the
device state, if the device was activated when snapshotted. Move the
activation responsibility to the logic the restores the MMIO transport.
The reason for this change is that that's how it will be done for the
PCI transport. Unifying this will allow us reusing the same types for
restoring the non-transport state of devices.

Note that we needed to change the way Net devices are saved/restored.
RxBuffer type of Net devices holds RX descriptors that we have parsed
from the Queue ahead of time. The way we restored this info was
manipulating the queue to re-parse the RX descriptors during the restore
phase. However, we need the device to be activated to do so, which now
isn't. So, instead of storing this info inside the snapshot make sure we
have flushed everything before taking the snapshot.

Also, simplify a bit the types that we use for serializing/deserializing
the state of a device.

Signed-off-by: Babis Chalios <[email protected]>
Support serializing the device-specific and transport state of a VirtIO
device that uses the PCI transport.

Signed-off-by: Babis Chalios <[email protected]>
ResourceAllocator object was part of DeviceManager since it is (mainly)
devices that use it. ResourceAllocator is as well the object that
implements (in a dummy way, for the moment) the DeviceRelocation trait
which PciDevices use to move the address space of a PciDevice when
triggered from the guest.

Problem with DeviceRelocation is that it also needs the Vm file
descriptor to perform the relocation, because we need to move register
the new IO event fd for VirtIO devices.

To make things simpler, move ResourceAllocator inside the Vm object. In
subsequent commit we will remove the DeviceRelocation from
ResourceAllocator and move it to Vm instead.

This has the nice secondary effect that we were able to simplify the
signature of many device-related methods that received Vm and
ResourceAllocator arguments.

Signed-off-by: Babis Chalios <[email protected]>
We had previously added MMIO and Port IO buses inside ResourceAllocator
so that we could implement DeviceRelocation for the type. Now, we will
delegate device relocation responsibilities to ArchVm instead. That is
because device relocation requires access to the Vm file descriptor as
well.

As a result, we can move buses to the Vm object itself. Add MMIO bus to
VmCommon as both architectures use it. Add PortIO bus for x86
architecture only.

Not that we don't still support DeviceRelocation. VirtIO devices should
not request us to relocate them. Also, for adding such support we would
need to also support VirtIO reset. We will look into adding this
functionaliyt later on.

Signed-off-by: Babis Chalios <[email protected]>
Add support for ITS device which provides support for MSI interrupts on
ARM architecture. This is currently supported only on systems with GICv3
interrupt controller.

In order to make saving/restore of ITS state work properly, we need to
change the order in which we restore redistributor register GICR_CTLR.
We need to make sure that this register is restored last. Otherwise,
restoring GICR_PROPBASER doesn't have any effect and ITS depends on it
in order to save/restore ITS tables to/from guest memory.

Signed-off-by: Babis Chalios <[email protected]>
Refactor the test code that inserts VirtIO devices in a Vmm object and
then add a test which creates a Vmm with PCI devices and then serializes
and deserializes the device manager and ensures that everything is as
restored as expected.

Signed-off-by: Babis Chalios <[email protected]>
Use pci_enabled fixture for boot time, block, and network tests to
create PCI microVM variants as well.

Signed-off-by: Babis Chalios <[email protected]>
We only pass pci=off if PCI is disabled in Firecracker. Adapt tests and
comments to reflect that.

Signed-off-by: Babis Chalios <[email protected]>
So that we don't have to downcast VirtioDevice trait objects to the
actual device type before calling the logic to process events for each
device.

Signed-off-by: Babis Chalios <[email protected]>
Instead of storing internal allocators of ResourceAllocator within an
Arc<Mutex<>> container, just store `ResourceAllocator` itself in an
`Arc<Mutex<>>`.

Apart from that, we get rid of the `ResourceAllocatorState` state
object, and just clone `ResourceAllocator` itself when we want to
save/restore.

Also, make the creation of `ResourceAllocato` infallible, since we know
that the ranges we are using are correct.

Finally, fix saving/restoring the state of ResourceAllocator. We were
actually not resetting it correctly upon snapshot restore. The reason
why this was not a problem is that we don't actually need to perform any
new allocations post restore at the moment. However, like this we are
ready when we need to perform any hot-plugging operations. Also, add a
unit-test to ensure that this logic works correctly.

Signed-off-by: Babis Chalios <[email protected]>
We were confusing queue indexex with event indexes, when passing the
index of the queue that needed to be triggered after handling events.

Fix the logic to pass the correct index. This refactors a bit the code
to signal the queues in each event handler method. With MMIO we don't
need to signal each queue independently (one signal will cause the guest
to scan all queues). With PCI though, we are using MSI-X, so we need to
signal each queue independently.

Also, change vsock functional integration tests to also run for
PCI-enabled microVMs.

Signed-off-by: Babis Chalios <[email protected]>
@codecov
Copy link

codecov bot commented Jul 4, 2025

Codecov Report

Attention: Patch coverage is 64.14791% with 1784 lines in your changes missing coverage. Please review.

Project coverage is 80.16%. Comparing base (68ed647) to head (3181444).

Files with missing lines Patch % Lines
src/pci/src/configuration.rs 41.12% 378 Missing ⚠️
src/vmm/src/devices/virtio/transport/pci/device.rs 54.30% 329 Missing ⚠️
src/pci/src/msix.rs 19.49% 289 Missing ⚠️
src/pci/src/bus.rs 23.76% 231 Missing ⚠️
.../src/devices/virtio/transport/pci/common_config.rs 52.15% 89 Missing ⚠️
src/vmm/src/device_manager/mod.rs 84.59% 47 Missing ⚠️
src/vmm/src/devices/pci/pci_segment.rs 85.17% 47 Missing ⚠️
src/vmm/src/builder.rs 74.30% 37 Missing ⚠️
src/vmm/src/device_manager/pci_mngr.rs 91.72% 34 Missing ⚠️
src/pci/src/device.rs 0.00% 30 Missing ⚠️
... and 34 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5289      +/-   ##
==========================================
- Coverage   82.86%   80.16%   -2.70%     
==========================================
  Files         250      265      +15     
  Lines       26897    30830    +3933     
==========================================
+ Hits        22288    24715    +2427     
- Misses       4609     6115    +1506     
Flag Coverage Δ
5.10-c5n.metal 79.90% <61.28%> (-3.46%) ⬇️
5.10-m5n.metal 79.90% <61.28%> (-3.45%) ⬇️
5.10-m6a.metal 79.10% <61.28%> (-3.47%) ⬇️
5.10-m6g.metal 76.49% <59.69%> (-2.69%) ⬇️
5.10-m6i.metal 79.90% <61.28%> (-3.45%) ⬇️
5.10-m7a.metal-48xl 79.08% <61.28%> (?)
5.10-m7g.metal 76.49% <59.69%> (-2.69%) ⬇️
5.10-m7i.metal-24xl 79.87% <61.28%> (?)
5.10-m7i.metal-48xl 79.86% <61.28%> (?)
5.10-m8g.metal-24xl 76.48% <59.69%> (?)
5.10-m8g.metal-48xl 76.48% <59.69%> (?)
6.1-c5n.metal 79.95% <61.28%> (-3.45%) ⬇️
6.1-m5n.metal 79.95% <61.28%> (-3.45%) ⬇️
6.1-m6a.metal 79.14% <61.28%> (-3.47%) ⬇️
6.1-m6g.metal 76.49% <59.69%> (-2.68%) ⬇️
6.1-m6i.metal 79.94% <61.28%> (-3.46%) ⬇️
6.1-m7a.metal-48xl 79.13% <61.28%> (?)
6.1-m7g.metal 76.49% <59.69%> (-2.69%) ⬇️
6.1-m7i.metal-24xl 79.96% <61.28%> (?)
6.1-m7i.metal-48xl 79.96% <61.28%> (?)
6.1-m8g.metal-24xl 76.48% <59.69%> (?)
6.1-m8g.metal-48xl 76.48% <59.69%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bchalios bchalios closed this Jul 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants