VirtIO 1.0 and multi-queue support #998

dancrossnyc · 2025-12-23T02:17:04Z

No description provided.

dancrossnyc · 2025-12-23T02:18:51Z

Cc @papertigers @rcgoodfellow

citrus-it

I've reviewed this mostly from a VirtIO/viona perspective.

crates/viona-api/src/ffi.rs

crates/viona-api/src/lib.rs

lib/propolis/src/hw/virtio/pci.rs

lib/propolis/src/hw/virtio/queue.rs

citrus-it · 2025-12-30T16:53:04Z

lib/propolis/src/hw/virtio/viona.rs

    }
 }

 impl MigrateMulti for PciVirtioViona {


It seems we'll need to push at least the number of selected queue pairs into the migration state here.

We do; that information is bundled up into the VirtioQueues type and its migration machinery.

I was expecting to see both 'max pairs' and 'use pairs' in the state so we know what to pass to viona on resume/migration. Will we end up setting the right number of pairs in that case based on negotiated features?

That is implicitly encapsulated in the distinction between the vector size and length in VirtQueues.

iximeow

I've gone through most of this except for queue.rs and part of viona.rs but I've got to task switch so no reason hold the half-finished review. generally, thanks for getting this together, and apologies for the forever delay in review.

iximeow · 2026-01-07T20:53:26Z

lib/propolis/src/hw/virtio/mod.rs

+    /// XXX: Check these mappings against some reference.
+    pub fn pci_sub_dev_id(self) -> u16 {
+        self as u16


VirtIO 1.2 section 4.1.2.1 says:

Non-transitional devices SHOULD have a PCI Subsystem Device ID of 0x40 or higher

so this at least wants to branch on Mode::Modern and return self as u16 + 0x40 in that case, it seems?

We went back and forth a bit on this In c-bhyve and ended up just setting this to the same as the device ID for legacy, transitional and modern devices. It fits the spirit of the spec.

Yeah, this one's a bit of a pickle. As Andy said, c-bhyve does it the way we do, and it seems valuable to maintain parity with it; on the other hand, the spec does contain that wording. In practice, I don't think any drivers really care. I find myself falling on the side of wanting to maintain parity with c-bhyve here.

I don't think this is doing it like c-bhyve at the moment.

What we want is this (for a network device - ID 1, dev 0x1000

Legacy/Transitional:

Vendor: 0x1af4

Device: 0x1000

Subvendor: 0x1af4

Subdevice: 0x1

PCI Revision: 0x0

Modern:

Vendor: 0x1af4

Device: 0x1041

Subvendor: 0x1af4

Subdevice: 0x1041

PCI Revision: 0x1

My inference from the spec wording is that we don't want, say, a modern network device having 0x1af4,1 in the subdevice fields to reduce the change of a guest attaching a legacy driver based on that. Anything >= 0x40 achieves it.

Ah, ok. I think the code now reflects this.

iximeow · 2026-01-07T21:32:01Z

lib/propolis/src/hw/virtio/mod.rs

+    /// However, for legacy and transitional mode devices, the mapping is
+    /// irregular, and defined by a table.  Moreover, not all virtio devices
+    /// have a well-defined transitional mapping.  This last thing is not
+    /// really an issue for us, since we only expose a handful of device
+    /// models; regardless, we provide mappings for everything defined in
+    /// the VirtIO spec.


wording nit, because when I first read this I missed the "not all have a well-defined transitional mapping" -> "so some are defined ad-hoc" implication. and pedantically, going by the VirtIO spec won't be sufficient since I assume we'll want DeviceId::Socket.pci_dev_id(Mode::Legacy) when Mike's at that point too?

Suggested change

/// However, for legacy and transitional mode devices, the mapping is

/// irregular, and defined by a table. Moreover, not all virtio devices

/// have a well-defined transitional mapping. This last thing is not

/// really an issue for us, since we only expose a handful of device

/// models; regardless, we provide mappings for everything defined in

/// the VirtIO spec.

/// However, for legacy and transitional mode devices, the mapping is

/// irregular, and defined by a table. The table describes the subset of

/// VirtIO devices that typically support operation by legacy drivers. Even

/// then there are cases of devices that in-practice have legacy drivers but

/// no specified transitional device ID. In such cases we use a device ID

/// that seems to be the consensus across existing implementations. To date

/// this means "follow QEMU's lead."

I word-smithed it a bit, but think I've captured the spirit of what you're getting at. PTAL.

iximeow · 2026-01-07T21:48:35Z

lib/propolis/src/hw/virtio/mod.rs

+    /// Maps a VirtIO Device ID to a PCI Device Class.
+    pub fn pci_class(self) -> u8 {
+        use crate::hw::pci as pci_hw;
+        match self {
+            Self::Network => pci_hw::bits::CLASS_NETWORK,
+            Self::Block | Self::NineP => pci_hw::bits::CLASS_STORAGE,
+            _ => pci_hw::bits::CLASS_UNCLASSIFIED,


I realize this just moves the class definitions from PciVirtioState::new() to one more central lookup, but there's not really an authoritative spec definition for the device classes to use, is there? looking at QEMU I see that 9P is classed a CLASS_NETWORK (o_O?) and other devices seem to be PCI_CLASS_OTHERS == 0xff where CLASS_UNCLASSIFIED has us setting the class to 0.

while I'm primarily wondering if there's a reference to note (or non-reference to lament :) ), we actually don't expect to get CLASS_UNCLASSIFIED at any point right? seems like this could take the same approach as pci_dev_id and return an Err if we're translating an unexpected DeviceId.

9P is network in QEMU? That's kind of weird; it's literally a file protocol. sigh Then again, it's designed to go over a network, so....

I think we probably do expect to see some unclassified things eventually. For instance, if we ever implement virtio console or the entropy device. But we don't have any now, so...result it is.

iximeow · 2026-01-07T22:07:30Z

lib/propolis/src/hw/virtio/pci.rs

+        let vendor_id = VENDOR_VIRTIO;
+        let sub_vendor_id = VENDOR_VIRTIO;
+        let device_id = device_type.pci_dev_id(mode).expect("Device ID");
+        let sub_device_id = device_type.pci_sub_dev_id();
+        let device_class = device_type.pci_class();
        let mut builder = pci::Builder::new(pci::Ident {
-            vendor_id: VENDOR_VIRTIO,
-            device_id: dev_id,
-            sub_vendor_id: VENDOR_VIRTIO,
-            sub_device_id: sub_dev_id,
-            class: dev_class,
+            vendor_id,
+            device_id,
+            sub_vendor_id,
+            sub_device_id,
+            device_class,
            ..Default::default()


does it make sense to have a device_type.pci_ident() that does this shuffling? then it could return a Result and pci_dev_id/pci_sub_dev_id/pci_class can be Result-ful as appropriate for their mappings.

I've added it, but I'm honestly kind of ambivalent about it. I don't know if it stays on the right side of the balance of finding common ground between the "generic" VirtIO stuff and the VirtIO PCI stuff; maybe the impl bits should have been moved to virtio/pci.rs. But let me know what you think.

iximeow · 2026-01-07T22:24:22Z

lib/propolis/src/hw/virtio/pci.rs

+            // Note: we don't presently support a non-zero multiplier for the
+            // notification register, so we don't need to size this for the
+            // number of queues; hence the fixed size.


me not really understanding the value of a non-zero multiplier: if I'm reading correctly, the guest notifies a device by writing a virtqueue number to this register, so it's not like a write to this register notifies all queues. but with a non-zero multiplier you could do something like write to queue 1 to say that.. queue 0 has available buffers? is it just an error to notify queue N about a virtqueue other than N?

The data written to the register can be more arbitrary if VIRTIO_F_NOTIF_CONFIG_DATA is negotiated, but that still doesn't need a multiplier.
If the multiplier was chosen to put each register on a separate cache line I suppose there could be a benefit there.
Qemu has a flag (VIRTIO_PCI_FLAG_PAGE_PER_VQ) that sets the multiplier to 0x1000, otherwise it uses 0x4.

I suppose it's also possible to support multiple queue notifications simultaneously if you use the multiplier, using multiple threads or something in the host.

iximeow · 2026-01-08T01:36:55Z

lib/propolis/src/hw/virtio/pci.rs

+                        if let Some(queue) = self.queues.get(state.queue_select)
+                        {
+                            let mut size = queue.size.lock().unwrap();
+                            *size = offered;
+                        }


Suggested change

if let Some(queue) = self.queues.get(state.queue_select)

{

let mut size = queue.size.lock().unwrap();

*size = offered;

}

let Some(queue) = self.queues.get(state.queue_select) else {

// Invalid queue: drop the write.

return;

}

let mut size = queue.size.lock().unwrap();

*size = offered;

here the wrapping might end up hideouser.. but in general this helps fight the rightward drift of getting the queue and gives us a clearer place to say why the non-behavior on queue_select is reasonable.

not demanding this change here, especially because this continues the prior pattern. but if you don't get to it here I'll probably do it in a followup next week.

The wrapping was worse, but I introduced a temporary to hold the queue select value and it was ok.

In general, I haven't stressed too hard on things like this because I think we ought to redo the entire mechanism.

iximeow · 2026-01-08T01:41:34Z

lib/propolis/src/hw/virtio/pci.rs

+    fn pci_cfg_cap_read(
+        &self,
+        dev: &dyn VirtioDevice,
+        id: &PciCfgCapReg,
+        op: &mut ReadOp,
+    ) {
+        let _todo = dev;
+        match id {
+            PciCfgCapReg::Common(common_id) => match common_id {
+                CommonCfgCapReg::CfgType => {
+                    op.write_u8(VirtioCfgCapTag::Pci as u8)
+                }
+                CommonCfgCapReg::Bar => op.write_u8(0), // TODO: Handle
+                CommonCfgCapReg::Offset => op.write_u32(0), // TODO: Handle
+                CommonCfgCapReg::Length => op.write_u32(0), // TODO: Handle
+                _ => self.common_cfg_cap_read(common_id, op),
+            },
+            PciCfgCapReg::PciData => {
+                // TODO: We actually need to handle this.
+                op.write_u32(0);
+            }
+        }
+    }
+
+    fn pci_cfg_cap_write(
+        &self,
+        dev: &dyn VirtioDevice,
+        id: &PciCfgCapReg,
+        op: &mut WriteOp,
+    ) {
+        let _todo = (dev, op);
+        match id {
+            PciCfgCapReg::Common(common_id) => {
+                match common_id {
+                    CommonCfgCapReg::Bar => {
+                        // TODO: Store the bar
+                    }
+                    CommonCfgCapReg::Offset => {
+                        // TODO: Store the offset
+                    }
+                    CommonCfgCapReg::Length => {
+                        // TODO: Store the length
+                    }
+                    // Everything else is read-only for the driver.
+                    _ => {}
+                }
+            }
+            PciCfgCapReg::PciData => {
+                // TODO: Handle the write.
+            }
+        }
+    }


how TODO is this? it seems like leaving this unplumbed would be pretty unfortunate but clearly it's not necessary to see a working NIC?

We haven't found a driver that requires it, so it's pretty "meh, should be done... eventually." Given the time constraints, it didn't seem worth it at the moment.

iximeow · 2026-01-08T02:13:25Z

lib/propolis/src/hw/virtio/viona.rs

+    // The notification addresses (both port and MMIO) for the device can change
+    // due to guest action, or other administrative tasks within propolis.  We
+    // want to update the in-kernel IO port hook only in the former case, when
+    // the device emulation is actually running.


I wouldn't move this comment out onto the function itself: it's really "why is it correct to guard hdl.set_notify_*_port on self.indicator.state() == IndicatedState::Run".

maybe leave this one where it was and leave a // Only update the kernel for the same reason as notify_port_update() in notify_mmio_addr_update? I think what you're going for is that this really describes the if in both functions, at least.

I've word-smithed (or perhaps munged, if not out-right mangled) the comments here. PTAL.

iximeow · 2026-01-08T02:19:38Z

lib/propolis/src/hw/virtio/viona.rs

+pub const RX_QUEUE_SIZE: u16 = 0x800;
+pub const TX_QUEUE_SIZE: u16 = 0x100;
+pub const CTL_QUEUE_SIZE: u16 = 32;


VqSize::new is already const, is there a reason to not make these VqSize and call it here?

iximeow · 2026-01-08T02:42:18Z

lib/propolis/src/hw/virtio/viona.rs

+/// Types and so forth for supporting the control queue.
+pub mod control {


these types and definitions come from VirtIO also, right? I see what looks like the The Truth in 5.1.6.5-ish but that was after a bit of confusion thinking these would be defined by viona for the VMM to control it.

Oops; I missed this in my earlier swing. I added a comment here.

dancrossnyc requested review from citrus-it, iximeow and papertigers and removed request for papertigers December 23, 2025 02:17

dancrossnyc force-pushed the virtio-modern branch 2 times, most recently from e7b1927 to 59f9711 Compare December 23, 2025 21:18

askfongjojo added this to the 18 milestone Dec 29, 2025

citrus-it reviewed Dec 30, 2025

View reviewed changes

papertigers self-requested a review December 30, 2025 20:58

dancrossnyc force-pushed the virtio-modern branch 5 times, most recently from 9586765 to 2c7aa84 Compare January 6, 2026 19:14

citrus-it self-requested a review January 7, 2026 11:22

citrus-it approved these changes Jan 7, 2026

View reviewed changes

citrus-it assigned dancrossnyc Jan 7, 2026

dancrossnyc force-pushed the virtio-modern branch 6 times, most recently from bf65fd9 to 45326c0 Compare January 8, 2026 02:14

iximeow reviewed Jan 8, 2026

View reviewed changes

dancrossnyc force-pushed the virtio-modern branch 4 times, most recently from 09f3da1 to 2f401e9 Compare January 8, 2026 21:43

VirtIO 1.0 and multi-queue support

8e2f55f

dancrossnyc force-pushed the virtio-modern branch from 2f401e9 to 8e2f55f Compare January 8, 2026 21:59

		/// Types and so forth for supporting the control queue.
		pub mod control {

VirtIO 1.0 and multi-queue support #998

Are you sure you want to change the base?

VirtIO 1.0 and multi-queue support #998

Conversation

dancrossnyc commented Dec 23, 2025

Uh oh!

dancrossnyc commented Dec 23, 2025

Uh oh!

citrus-it left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iximeow left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Legacy/Transitional:

Modern:

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment