-
Notifications
You must be signed in to change notification settings - Fork 29
VirtIO 1.0 and multi-queue support #998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
e7b1927 to
59f9711
Compare
citrus-it
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed this mostly from a VirtIO/viona perspective.
| } | ||
| } | ||
|
|
||
| impl MigrateMulti for PciVirtioViona { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we'll need to push at least the number of selected queue pairs into the migration state here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do; that information is bundled up into the VirtioQueues type and its migration machinery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting to see both 'max pairs' and 'use pairs' in the state so we know what to pass to viona on resume/migration. Will we end up setting the right number of pairs in that case based on negotiated features?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is implicitly encapsulated in the distinction between the vector size and length in VirtQueues.
9586765 to
2c7aa84
Compare
bf65fd9 to
45326c0
Compare
iximeow
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone through most of this except for queue.rs and part of viona.rs but I've got to task switch so no reason hold the half-finished review. generally, thanks for getting this together, and apologies for the forever delay in review.
lib/propolis/src/hw/virtio/mod.rs
Outdated
| /// XXX: Check these mappings against some reference. | ||
| pub fn pci_sub_dev_id(self) -> u16 { | ||
| self as u16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VirtIO 1.2 section 4.1.2.1 says:
Non-transitional devices SHOULD have a PCI Subsystem Device ID of 0x40 or higher
so this at least wants to branch on Mode::Modern and return self as u16 + 0x40 in that case, it seems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We went back and forth a bit on this In c-bhyve and ended up just setting this to the same as the device ID for legacy, transitional and modern devices. It fits the spirit of the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this one's a bit of a pickle. As Andy said, c-bhyve does it the way we do, and it seems valuable to maintain parity with it; on the other hand, the spec does contain that wording. In practice, I don't think any drivers really care. I find myself falling on the side of wanting to maintain parity with c-bhyve here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is doing it like c-bhyve at the moment.
What we want is this (for a network device - ID 1, dev 0x1000
Legacy/Transitional:
- Vendor:
0x1af4 - Device:
0x1000 - Subvendor:
0x1af4 - Subdevice:
0x1 - PCI Revision:
0x0
Modern:
- Vendor:
0x1af4 - Device:
0x1041 - Subvendor:
0x1af4 - Subdevice:
0x1041 - PCI Revision:
0x1
My inference from the spec wording is that we don't want, say, a modern network device having 0x1af4,1 in the subdevice fields to reduce the change of a guest attaching a legacy driver based on that. Anything >= 0x40 achieves it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok. I think the code now reflects this.
| /// However, for legacy and transitional mode devices, the mapping is | ||
| /// irregular, and defined by a table. Moreover, not all virtio devices | ||
| /// have a well-defined transitional mapping. This last thing is not | ||
| /// really an issue for us, since we only expose a handful of device | ||
| /// models; regardless, we provide mappings for everything defined in | ||
| /// the VirtIO spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wording nit, because when I first read this I missed the "not all have a well-defined transitional mapping" -> "so some are defined ad-hoc" implication. and pedantically, going by the VirtIO spec won't be sufficient since I assume we'll want DeviceId::Socket.pci_dev_id(Mode::Legacy) when Mike's at that point too?
| /// However, for legacy and transitional mode devices, the mapping is | |
| /// irregular, and defined by a table. Moreover, not all virtio devices | |
| /// have a well-defined transitional mapping. This last thing is not | |
| /// really an issue for us, since we only expose a handful of device | |
| /// models; regardless, we provide mappings for everything defined in | |
| /// the VirtIO spec. | |
| /// However, for legacy and transitional mode devices, the mapping is | |
| /// irregular, and defined by a table. The table describes the subset of | |
| /// VirtIO devices that typically support operation by legacy drivers. Even | |
| /// then there are cases of devices that in-practice have legacy drivers but | |
| /// no specified transitional device ID. In such cases we use a device ID | |
| /// that seems to be the consensus across existing implementations. To date | |
| /// this means "follow QEMU's lead." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I word-smithed it a bit, but think I've captured the spirit of what you're getting at. PTAL.
lib/propolis/src/hw/virtio/mod.rs
Outdated
| /// Maps a VirtIO Device ID to a PCI Device Class. | ||
| pub fn pci_class(self) -> u8 { | ||
| use crate::hw::pci as pci_hw; | ||
| match self { | ||
| Self::Network => pci_hw::bits::CLASS_NETWORK, | ||
| Self::Block | Self::NineP => pci_hw::bits::CLASS_STORAGE, | ||
| _ => pci_hw::bits::CLASS_UNCLASSIFIED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this just moves the class definitions from PciVirtioState::new() to one more central lookup, but there's not really an authoritative spec definition for the device classes to use, is there? looking at QEMU I see that 9P is classed a CLASS_NETWORK (o_O?) and other devices seem to be PCI_CLASS_OTHERS == 0xff where CLASS_UNCLASSIFIED has us setting the class to 0.
while I'm primarily wondering if there's a reference to note (or non-reference to lament :) ), we actually don't expect to get CLASS_UNCLASSIFIED at any point right? seems like this could take the same approach as pci_dev_id and return an Err if we're translating an unexpected DeviceId.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
9P is network in QEMU? That's kind of weird; it's literally a file protocol. sigh Then again, it's designed to go over a network, so....
I think we probably do expect to see some unclassified things eventually. For instance, if we ever implement virtio console or the entropy device. But we don't have any now, so...result it is.
lib/propolis/src/hw/virtio/pci.rs
Outdated
| let vendor_id = VENDOR_VIRTIO; | ||
| let sub_vendor_id = VENDOR_VIRTIO; | ||
| let device_id = device_type.pci_dev_id(mode).expect("Device ID"); | ||
| let sub_device_id = device_type.pci_sub_dev_id(); | ||
| let device_class = device_type.pci_class(); | ||
| let mut builder = pci::Builder::new(pci::Ident { | ||
| vendor_id: VENDOR_VIRTIO, | ||
| device_id: dev_id, | ||
| sub_vendor_id: VENDOR_VIRTIO, | ||
| sub_device_id: sub_dev_id, | ||
| class: dev_class, | ||
| vendor_id, | ||
| device_id, | ||
| sub_vendor_id, | ||
| sub_device_id, | ||
| device_class, | ||
| ..Default::default() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it make sense to have a device_type.pci_ident() that does this shuffling? then it could return a Result and pci_dev_id/pci_sub_dev_id/pci_class can be Result-ful as appropriate for their mappings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added it, but I'm honestly kind of ambivalent about it. I don't know if it stays on the right side of the balance of finding common ground between the "generic" VirtIO stuff and the VirtIO PCI stuff; maybe the impl bits should have been moved to virtio/pci.rs. But let me know what you think.
| // Note: we don't presently support a non-zero multiplier for the | ||
| // notification register, so we don't need to size this for the | ||
| // number of queues; hence the fixed size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
me not really understanding the value of a non-zero multiplier: if I'm reading correctly, the guest notifies a device by writing a virtqueue number to this register, so it's not like a write to this register notifies all queues. but with a non-zero multiplier you could do something like write to queue 1 to say that.. queue 0 has available buffers? is it just an error to notify queue N about a virtqueue other than N?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data written to the register can be more arbitrary if VIRTIO_F_NOTIF_CONFIG_DATA is negotiated, but that still doesn't need a multiplier.
If the multiplier was chosen to put each register on a separate cache line I suppose there could be a benefit there.
Qemu has a flag (VIRTIO_PCI_FLAG_PAGE_PER_VQ) that sets the multiplier to 0x1000, otherwise it uses 0x4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it's also possible to support multiple queue notifications simultaneously if you use the multiplier, using multiple threads or something in the host.
lib/propolis/src/hw/virtio/pci.rs
Outdated
| if let Some(queue) = self.queues.get(state.queue_select) | ||
| { | ||
| let mut size = queue.size.lock().unwrap(); | ||
| *size = offered; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if let Some(queue) = self.queues.get(state.queue_select) | |
| { | |
| let mut size = queue.size.lock().unwrap(); | |
| *size = offered; | |
| } | |
| let Some(queue) = self.queues.get(state.queue_select) else { | |
| // Invalid queue: drop the write. | |
| return; | |
| } | |
| let mut size = queue.size.lock().unwrap(); | |
| *size = offered; |
here the wrapping might end up hideouser.. but in general this helps fight the rightward drift of getting the queue and gives us a clearer place to say why the non-behavior on queue_select is reasonable.
not demanding this change here, especially because this continues the prior pattern. but if you don't get to it here I'll probably do it in a followup next week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wrapping was worse, but I introduced a temporary to hold the queue select value and it was ok.
In general, I haven't stressed too hard on things like this because I think we ought to redo the entire mechanism.
| fn pci_cfg_cap_read( | ||
| &self, | ||
| dev: &dyn VirtioDevice, | ||
| id: &PciCfgCapReg, | ||
| op: &mut ReadOp, | ||
| ) { | ||
| let _todo = dev; | ||
| match id { | ||
| PciCfgCapReg::Common(common_id) => match common_id { | ||
| CommonCfgCapReg::CfgType => { | ||
| op.write_u8(VirtioCfgCapTag::Pci as u8) | ||
| } | ||
| CommonCfgCapReg::Bar => op.write_u8(0), // TODO: Handle | ||
| CommonCfgCapReg::Offset => op.write_u32(0), // TODO: Handle | ||
| CommonCfgCapReg::Length => op.write_u32(0), // TODO: Handle | ||
| _ => self.common_cfg_cap_read(common_id, op), | ||
| }, | ||
| PciCfgCapReg::PciData => { | ||
| // TODO: We actually need to handle this. | ||
| op.write_u32(0); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| fn pci_cfg_cap_write( | ||
| &self, | ||
| dev: &dyn VirtioDevice, | ||
| id: &PciCfgCapReg, | ||
| op: &mut WriteOp, | ||
| ) { | ||
| let _todo = (dev, op); | ||
| match id { | ||
| PciCfgCapReg::Common(common_id) => { | ||
| match common_id { | ||
| CommonCfgCapReg::Bar => { | ||
| // TODO: Store the bar | ||
| } | ||
| CommonCfgCapReg::Offset => { | ||
| // TODO: Store the offset | ||
| } | ||
| CommonCfgCapReg::Length => { | ||
| // TODO: Store the length | ||
| } | ||
| // Everything else is read-only for the driver. | ||
| _ => {} | ||
| } | ||
| } | ||
| PciCfgCapReg::PciData => { | ||
| // TODO: Handle the write. | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how TODO is this? it seems like leaving this unplumbed would be pretty unfortunate but clearly it's not necessary to see a working NIC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We haven't found a driver that requires it, so it's pretty "meh, should be done... eventually." Given the time constraints, it didn't seem worth it at the moment.
lib/propolis/src/hw/virtio/viona.rs
Outdated
| // The notification addresses (both port and MMIO) for the device can change | ||
| // due to guest action, or other administrative tasks within propolis. We | ||
| // want to update the in-kernel IO port hook only in the former case, when | ||
| // the device emulation is actually running. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't move this comment out onto the function itself: it's really "why is it correct to guard hdl.set_notify_*_port on self.indicator.state() == IndicatedState::Run".
maybe leave this one where it was and leave a // Only update the kernel for the same reason as notify_port_update() in notify_mmio_addr_update? I think what you're going for is that this really describes the if in both functions, at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've word-smithed (or perhaps munged, if not out-right mangled) the comments here. PTAL.
lib/propolis/src/hw/virtio/viona.rs
Outdated
| pub const RX_QUEUE_SIZE: u16 = 0x800; | ||
| pub const TX_QUEUE_SIZE: u16 = 0x100; | ||
| pub const CTL_QUEUE_SIZE: u16 = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VqSize::new is already const, is there a reason to not make these VqSize and call it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| /// Types and so forth for supporting the control queue. | ||
| pub mod control { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these types and definitions come from VirtIO also, right? I see what looks like the The Truth in 5.1.6.5-ish but that was after a bit of confusion thinking these would be defined by viona for the VMM to control it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops; I missed this in my earlier swing. I added a comment here.
09f3da1 to
2f401e9
Compare
2f401e9 to
8e2f55f
Compare
No description provided.