Skip to content

Conversation

bchalios
Copy link
Contributor

Changes

Add support for PCI devices in Firecracker. This is an optional feature that users can enable by passing the --enable-pci flag when launching the Firecracker binary. When this flag is present, Firecracker will create all VirtIO devices using a PCI transport rather than the default virtio-mmio.

Reason

PCI transport for VirtIO devices allows using MSI-X interrupts which lead to more efficient data plane communication for VirtIO devices.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

bchalios and others added 30 commits August 12, 2025 13:05
This is just code organization changes. Create a new module under
`virtio`, called `transport`. For the time being the only transport
supported is `mmio`. Also, move `IrqInterrupt` type within the MMIO
transport code, as it is MMIO specific.

Signed-off-by: Babis Chalios <[email protected]>
`IrqTrigger::new()` returns a `Result` because creating an `EventFd`
might fail with an `std::io::Error` error. All users of `IrqTrigger`
create the object and directly unwrap the error.

To avoid unwraps all over the place, change `IrqTrigger::new()` to
unwrap a potential error while creating the EventFd internally and just
return `Self`.

Signed-off-by: Babis Chalios <[email protected]>
The MMIO transport for VirtIO devices uses an `IrqTrigger` object as the
object that models the logic for sending interrupts from the device to
the guest. We create one such object for every VirtIO device when
creating it. The MMIO device manager associates this object with an IRQ
number and registers it with KVM.

This commit changes the timing of association of an `IrqTrigger` with a
VirtIO-mmio device. It only assigns such an object to the device during
its activation. We do this to prepare for supporting a PCI transport for
VirtIO devices. The cloud hypervisor implementation for these passes the
interrupt objects used by the device during activation, so we make this
change to have a uniform way to handle interrupts for both transports.
Functionally, nothing changes for MMIO devices, as before activation we
don't trigger any interrupts.

Signed-off-by: Babis Chalios <[email protected]>
Describing the APIs that need to implement types that are used as
interrupts for VirtIO devices. Currently, we only use `IrqInterrupt`
interrupts, but this will change once we have MSI-X with PCIe devices.

Signed-off-by: Babis Chalios <[email protected]>
VirtIO devices assume they're operating under an MMIO transport and as a
consequence they use IrqTrigger as interrupts. Switch that to using
VirtioInterrupt for all VirtIO device objects. Only assume a
VirtioInterrupt is an IrqTrigger in MMIO specific code.

Signed-off-by: Babis Chalios <[email protected]>
Bring in the vm-device crate from CloudHypervisor. We will be using it
for adding PCIe support.

Signed-off-by: Babis Chalios <[email protected]>
We use `SerialDevice` with Stdin as the input source. Encode this in the
type so that we don't spill the generic all over the place.

Signed-off-by: Babis Chalios <[email protected]>
Co-authored-by: Egor Lazarchuk <[email protected]>
Signed-off-by: Egor Lazarchuk <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
Use the vm_device::Bus bus for all MMIO devices. This is mainly to
prepare for using it for PCIe devices. Also, sepate VirtIO devices from
other MMIO devices inside the MMIODeviceManager struct. This makes
iterating over VirtIO devices easier since we don't need to access two
data structures to get a reference to a VirtIO device any more.

Signed-off-by: Babis Chalios <[email protected]>
We were always constructing RTCDevice using a set of metrics that were
defined in the RTC module itself. Don't leak the metrics to other
modules. Instead, create a new() function that always constructs it the
correct way.

Signed-off-by: Babis Chalios <[email protected]>
Use the vm_device::Bus bus for PortIO devices on x86. PCIe devices will
use this as well.

Signed-off-by: Babis Chalios <[email protected]>
PCIe spec mandates that software can access the configuration space of
PCIe devices both via MMIO and Port IO accesses. As a result, PCIe
devices will need to register to both buses (on x86).

Change the organization of devices, so that MMIO and PIO device managers
do not own the buses. Instead, introduce a DeviceManager object which
holds the buses, the resource allocator and includes also all types of
device managers (at the moment MMIO, PIO and ACPI).

Signed-off-by: Babis Chalios <[email protected]>
We always create anew the keyboard interrupt event. Just create it
inside `I8042Device::new()` and return an error if that fails.

Signed-off-by: Babis Chalios <[email protected]>
test_serial_dos test checks that when we send a lot of bytes in the
serial device the emulation logic does not increase indefinitely the
underlying buffer that we use for when the device is set in loopback
mode.

However, the test does not wait for the microVM to start and sometimes
the virtual memory allocation may increase between readings. Add a
network device to the microVM so that we implicitly wait until it has
booted before taking the first measurement.

Signed-off-by: Babis Chalios <[email protected]>
Bring in pci crate from cloud hypervisor with a few modifications.
We use the rust-vmm vm-allocator crate instead of Cloud Hypervisor's
downstream one. For the time being, rust-vmm's implementation should
include all we need for supporting the devices we care about. If we need
more functionality from our allocators, we will implement the logic
directly in the rust-vmm vm-allocator crate.

Signed-off-by: Babis Chalios <[email protected]>
PCIe distinguishes MMIO regions between 32bit and 64bit, caring for
devices that can't deal with 64-bit addresses. This commit defines the
appropriate regions for both x86 and aarch64 architectures, extends the
resource allocator to handle allocations for both of these regions and
adjusts the logic that calculates the memory regions for the
architecture.

Also, un-do the change that added an `offset` argument
`arch_memory_regions` function. We won't be using this for "secret
hiding" so it just made the logic (especially for kani proofs) too
convoluted.

Signed-off-by: Babis Chalios <[email protected]>
PCIe devices need some times to relocate themselves in memory. To do so,
they need to keep an (atomic) reference to a type that implements
`DeviceRelocation` trait. The logic for relocation involves removing the
device from the bus it has been registered to, allocate a new address
range for it and reinsert it.

Instead of creating a new type for it, reuse `ResourceAllocator`. This
means that we need to move the buses from the `DeviceManager` inside
`ResourceAllocator`.

Signed-off-by: Babis Chalios <[email protected]>
Add a PCIe segment which includes a single PCIe root port and a bus.

At the moment, the PCIe segment is always enabled. Later commit will
make it optional and enable it only when a command line argument flag is
passed to Firecracker binary.

Signed-off-by: Babis Chalios <[email protected]>
So that we can declare which memory region is used by PCIe devices for
MMCONFIG.

Signed-off-by: Babis Chalios <[email protected]>
Write the PCI root bridge in FDT when PCI is enabled.

Signed-off-by: Babis Chalios <[email protected]>
Add a command line argument to enable PCIe support. By default, PCIe
is disabled. The reason for making PCIe off by default is that users
need to explicitly enable PCI support in their kernels. Requiring users
to explicitly enable it, does not break existing deployments, i.e. users
can upgrade Firecracker within their existing environments without
breaking any deployment.

Signed-off-by: Babis Chalios <[email protected]>
At the moment, the logic just restores the device manager and add the
PCIe root complex if PCI is enabled.

Signed-off-by: Babis Chalios <[email protected]>
Add an integration test that checks that `lspci` correctly locates the
PCIe root complex if PCI is enabled for the microVM. Also, add a
negative test that checks that PCIe root complex doesn't exist when PCI
is not enabled.

Also, extend coverage of, at least some of, the tests to ensure that
they run with and without PCI configuration enabled. Do that by
extending the `uvm_any*` fixtures to yield both variants.

Signed-off-by: Babis Chalios <[email protected]>
PCI-enabled guest kernels enable the `extd_apicid` CPU feature for AMD
CPU families after 16h. Our supported AMD families (Milan & Genoa) are
both 19h. This is irrespective of whether PCI is enabled in Firecracker.

Do not mark this as host-only when running with PCI enabled kernels,
i.e. all kernels that support ACPI.

Signed-off-by: Babis Chalios <[email protected]>
We have some Rust integration tests that check building and
booting of microVMs works correctly. Add variants for PCI-enabled
microVMs.

Signed-off-by: Babis Chalios <[email protected]>
Tests test_spectre_meltdown_checker_on_guest and
test_check_vulnerability_files_ab run A/B tests between the HEAD of the
target branch and the tip of a PR branch. This will currently fail,
because Firecracker builds from the HEAD of the target branch know
nothing about the `--enable-pci` command line flag, so launching
the Firecracker binary for revision A will fail.

Only run these tests for non-PCI uVMs for now. Once this commit gets
merged we will re-enable and make sure that everything works as
expected.

Signed-off-by: Babis Chalios <[email protected]>
1. build the kernel with PCI/e support.
2. fix a race condition between udev renaming the network devices and
   fcnet setting up the network interfaces
3. install pciutils on the image

Signed-off-by: Riccardo Mancini <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
The memory monitor was only assuming a single MMIO gap on x86_64 when
calculating the memory regions that corresponded to guest memory. Now we
need to account for two MMIO gaps in x86 and one in ARM.

Signed-off-by: Babis Chalios <[email protected]>
When we re-arranged the MMIO address space in commit 9a165d1 (arch:
define 64-bit capable MMIO memory regions) we moved the MMIO region of
the boot timer device for x86 systems, but we didn't update the init
scripts that hardcode it and use it to report boot time timestamp back
to Firecracker.

Update the init.c and initramfs values for the region. Also, add a
functional test that runs during CI PR tests and makes sure the boot
timer works.

Signed-off-by: Babis Chalios <[email protected]>
Commit be5a600 (tests: fix MMIO gaps in memory monitor tool) that fixed
the memory monitor to account for the 64-bit MMIO region included a
left-over debug print. Remove it.

Signed-off-by: Babis Chalios <[email protected]>
We need the new KvmIrqRouting FamStruct wrapper from kvm-bindings, which
though forces us to update vmm-sys-util to 0.14.0 and also bump all
downstream dependencies of vmm-sys-util to use that version.

Signed-off-by: Babis Chalios <[email protected]>
bchalios and others added 22 commits August 12, 2025 13:05
We are using a HashMap to track the interrupt routes we use in the
system. The index to the HashMap is the GSI of the interrupt route.
We know the maximum number of GSIs we have available so pre-allocate the
space for the HashMap to avoid reallocations at runtime.

Signed-off-by: Babis Chalios <[email protected]>
Fix a bug in the common VirtIO configuration for PCI transport where we
would use `queue_select` to read the queue's MSI vector without
validating it matches a valid queue. This could lead in panics when
accessing the `msix_queue` array. The spec states that in such cases we
should return `NO_VECTOR` (0xffff), so do that.

Signed-off-by: Babis Chalios <[email protected]>
We were shifting by the wrong number of bits for 2-byte accesses.

Signed-off-by: Babis Chalios <[email protected]>
This is the equivalent to the fix in 741c29f but for guest writes.
We need to make sure that `queue_select` points to a valid queue before
setting the MSI-X vector otherwise we'll hit a panic when accessing the
underlying `Vec`.

Signed-off-by: Babis Chalios <[email protected]>
swiotlb=noforce disables SWIOTLB, which is enabled by the kernel if the
physical addresses exceed 32b. This is not needed for us and causes 64MB
to be wasted on the microvm.

Signed-off-by: Riccardo Mancini <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
We are vending PCI code taken from Cloud Hypervisor's implementation.
There is code in there that we don't actually use. So drop it to reduce
the dead code in the project.

Signed-off-by: Babis Chalios <[email protected]>
Cloud Hypervisor code is passing a resource type argument in the logic
that allocates BARs for devices. `Resource` is an Enum where one of its
variants is `PciBar`. Not sure what Cloud Hypervisor uses this for, but
it seems redundant since this method is specifically used to allocate
BAR memory for devices.

We definitely don't use it, so remove it.

Signed-off-by: Babis Chalios <[email protected]>
Despite the fact we are using a single 64bit MMIO BARs for VirtIO
devices, we had code that allowed for multiple BARs including BARs of
other types (IO and 32bit MMIO). Remove this code and always assume we
are only using 64bit BAR at index 0.

Signed-off-by: Babis Chalios <[email protected]>
This method is used to notify the guest about queue events in a way
that's most performant with the underlying interrupt implementation.
As in IrqTrigger there is no distinction between different queues, it's
best to send just one interrupt notifiying that "some queues" have a
pending event.
Conversely, in VirtioInterruptMsix, we need to trigger a MSI for each
distinct queue.

Signed-off-by: Riccardo Mancini <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
We noticed that, compared to main, the vsock device is up to 30% slower
when PCI is disabled. This is due to the refactor in 087e185
("fix(vsock): pass correct index when triggering interrupts"), as we're
now sending interrupts as soon as we detect the event, rather than
deferring them and sending only one interrupt at the end.
While with MSI we need to send multiple interrupts, so there is little
difference (still, tests show up to 5% improvement with this change),
with legacy IRQ there's only one interrupt line so we end up sending
multiple back to back interrupts rather than a single one.
This patch reverts to the previous behaviour and uses the newly
introduced `trigger_queues` method to deduplicate interrupts in the case
of IrqTrigger.

Signed-off-by: Riccardo Mancini <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
According to the VirtIO spec section 2.1.2, when we enter a failed state
we need to to set `DEVICE_NEEDS_RESET` status field and if `DRIVER_OK`
is set we need to must send a device configuration change interrupt to
the driver.

In MMIO code we were trying to follow this logic, but we were trying to
get the interrupt object from the device, which fails (and panics)
because interrupts are only set in the device upon successful
activation. In the PCI transport, instead, we were not doing this at
all.

The transport layers hold a reference to the interrupts at all times.
So, add the missing logic to the PCI transport and use the transport
layer reference in both transports to avoid panics.

Signed-off-by: Babis Chalios <[email protected]>
MMIO transport layer holds a Mutex to the VirtIO device. Within the
logic that handles the handshake between the driver and the device,
there were cases where we would take and release the lock multiple times
within the same code branch.

Change this to only take the lock once.

Signed-off-by: Babis Chalios <[email protected]>
We should check that the MSI-X vector that a guest assigns to a VirtIO
queue or configuration is a valid one. If it is a value bigger than the
available vectors allocated for the device, it should not update the
vector field and mark the corresponding vector to the NO_VECTOR value.

Also, add checks in the MSI implementation for VirtioInterrupt trait,
for making sure that we don't try to interact with out-of-range
interrupt vectors.

Signed-off-by: Babis Chalios <[email protected]>
If a function is called in the `.ok_or`, this gets executed
independently on whether the function succeeds or not, although the
result is only used on failure.

Replace it with a .ok_or_else to avoid a useless string allocation and
memcpy.

Signed-off-by: Riccardo Mancini <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
"or_fun_call" warns us when a function is called in side a "*_or(...)"
function. In this case the value is computed (which may be expensive),
but is only used if the Result/Option is an Error/None.
In these cases, using the "*_or_else" variant is the best thing to do.

Signed-off-by: Riccardo Mancini <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
We are not really using the `as_any` and `id` members of the PciDevice
trait. At the moment, only VirtIO devices and the PCI root port is
implementing PciDevice and we are never iterating over the container of
PCI devices.

Signed-off-by: Babis Chalios <[email protected]>
Add a unit test for the logic that handles adding BARs for a device.
Also, ensure that we check that the BAR index we are adding is within
range before we use it to index the BARs vector.

Signed-off-by: Babis Chalios <[email protected]>
Make sure that configuring the MSI-X capability in PCI configuration
space works properly (configuration space is initialized as expected).
Also, make sure that the implementation respects the read/write
properties of the respective bits.

Finally, fix logic in the code that adds capabilities to take into
account properly the size of capabilities.

Signed-off-by: Babis Chalios <[email protected]>
Make sure we handle correctly accessing invalid registers.

Signed-off-by: Babis Chalios <[email protected]>
Make sure we detect correctly valid intents to reprogram (move) BARs.
Also, make sure we correctly ignore buggy ones.

Signed-off-by: Babis Chalios <[email protected]>
Add a few unit tests to check the logic that accesses PCI configuration
space via the PCI Bus. Ensure that negative cases are being handled
properly.

Signed-off-by: Babis Chalios <[email protected]>
Also, drop some of effectively dead code that Cloud Hypervisor was using
because they were not relying on KVM to handle interrupt controllers.
Finally, fixup some error cases on guest reads which need to return
all-ones when bad accesses happen.

Signed-off-by: Babis Chalios <[email protected]>
@bchalios bchalios changed the title Feature/pcie Add PCI support on Firecracker Aug 12, 2025
@bchalios bchalios enabled auto-merge (rebase) August 12, 2025 15:17
@bchalios bchalios added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Aug 12, 2025
Copy link

codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 77.31589% with 1075 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.35%. Comparing base (e5b4f26) to head (75eb51d).
⚠️ Report is 99 commits behind head on main.

Files with missing lines Patch % Lines
src/vmm/src/devices/virtio/transport/pci/device.rs 55.02% 304 Missing ⚠️
src/pci/src/configuration.rs 69.20% 166 Missing ⚠️
.../src/devices/virtio/transport/pci/common_config.rs 53.36% 97 Missing ⚠️
src/vmm/src/device_manager/mod.rs 79.14% 68 Missing ⚠️
src/vmm/src/devices/pci/pci_segment.rs 85.17% 47 Missing ⚠️
src/pci/src/msix.rs 85.57% 46 Missing ⚠️
src/vmm/src/builder.rs 74.30% 37 Missing ⚠️
src/vmm/src/devices/virtio/vsock/device.rs 31.81% 30 Missing ⚠️
src/pci/src/bus.rs 91.78% 23 Missing ⚠️
src/pci/src/device.rs 24.13% 22 Missing ⚠️
... and 31 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5364      +/-   ##
==========================================
- Coverage   82.98%   82.35%   -0.64%     
==========================================
  Files         250      265      +15     
  Lines       26837    30640    +3803     
==========================================
+ Hits        22271    25233    +2962     
- Misses       4566     5407     +841     
Flag Coverage Δ
5.10-c5n.metal 82.32% <75.64%> (-1.17%) ⬇️
5.10-m5n.metal 82.32% <75.64%> (-1.16%) ⬇️
5.10-m6a.metal 81.60% <75.64%> (-1.11%) ⬇️
5.10-m6g.metal 78.92% <74.33%> (-0.24%) ⬇️
5.10-m6i.metal 82.32% <75.64%> (-1.16%) ⬇️
5.10-m7a.metal-48xl 81.59% <75.64%> (?)
5.10-m7g.metal 78.92% <74.33%> (-0.24%) ⬇️
5.10-m7i.metal-24xl 82.29% <75.64%> (?)
5.10-m7i.metal-48xl 82.29% <75.64%> (?)
5.10-m8g.metal-24xl 78.92% <74.33%> (?)
5.10-m8g.metal-48xl 78.91% <74.33%> (?)
6.1-c5n.metal 82.37% <75.64%> (-1.16%) ⬇️
6.1-m5n.metal 82.37% <75.64%> (-1.16%) ⬇️
6.1-m6a.metal 81.65% <75.64%> (-1.10%) ⬇️
6.1-m6g.metal 78.92% <74.33%> (-0.24%) ⬇️
6.1-m6i.metal 82.36% <75.64%> (-1.16%) ⬇️
6.1-m7a.metal-48xl 81.63% <75.64%> (?)
6.1-m7g.metal 78.92% <74.33%> (-0.24%) ⬇️
6.1-m7i.metal-24xl 82.37% <75.64%> (?)
6.1-m7i.metal-48xl 82.37% <75.64%> (?)
6.1-m8g.metal-24xl 78.91% <74.33%> (?)
6.1-m8g.metal-48xl 78.91% <74.33%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@Manciukic Manciukic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it!

                        A_______
                        |______<
                        ||
                  ______||_______
                  \##############\
                   \##############\
                   |               |
                   |               |
                   |###############|
                   |###############|
                   |               |
 VK                |               |
                   |###############|
                  /###############/          @@
          /\     /_______________/          (  C
 (@\     ( ")   /\\  /\\||/\\  /\\  /\\    / /'
   \\_   (\_)  ( "))( ")|( "))( "))( "))  / /
    \```--/----/(o)-/(o)-/(o)-/(o)-/(o)--' /
~~~~~~~~~/ ~~~/~~~~/~~~~/~~~~/~~~~/~~~~~~~~~~~~~~~~~
    - - '  ( ' )( ' )( ' )( ' )( ' )
------------------------------------------------

@bchalios bchalios merged commit 80aa1de into main Aug 12, 2025
6 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants