Conversation

@pfmooney (Collaborator) commented Oct 3, 2025

This addresses considerable lock contention and thundering herd issues present in the block IO abstractions. Devices with multiple queues (read: NVMe), serviced by multiple workers, should be in much better shape when it comes to efficiently farming out work.

@pfmooney (Collaborator, Author) commented Oct 3, 2025

This should not be merged before the R17 release branch is cut. Considering the scope of the work, we'll want soak time. I wanted to get it up and available for folks to review.

@pfmooney requested a review from iximeow on October 3, 2025 02:48

@iximeow (Member) left a comment

i've mostly not looked at propolis/src/block yet but figured this first round of thoughts would be useful. on the whole this seems pretty reasonable! but I see that transmute in attachment.rs which I look forward to in.. round 2

/// A wrapper for an IO [Request] bearing necessary tracking information to
/// issue its completion back to the [queue](DeviceQueue) from which it came.
///
/// A panic will occur an [DeviceRequest] instance is dropped without calling

iximeow (Member):

Suggested change
/// A panic will occur an [DeviceRequest] instance is dropped without calling
/// A panic will occur if a `DeviceRequest` instance is dropped without calling

grammar nit, and doc nit that i'd skip the rustdoc link since it should take you back to this same struct that you'd be reading the docs of.

pfmooney (Collaborator, Author):

Done.

pub struct DeviceRequest {
    req: Request,
    id: ReqId,
    completed: bool,

iximeow (Member):

it's somewhat pleasant that complete() ends up optimizing out both the assignment and assert but this does tick the size of DeviceRequest from 64 bytes up to 72 (so I'm also a little sad to learn that done isn't packed into the discriminant of Operation)

if you move the assert-ful Drop so this is like:

struct NoDrop;

impl Drop for NoDrop {
    fn drop(&mut self) {
        panic!("DeviceRequest should be completed before drop");
    }
}

pub struct DeviceRequest {
    req: Request,
    id: ReqId,
    source: Weak<QueueMinder>,
    _nodrop: NoDrop,
}

then you can destructure self in complete() and mem::forget() the NoDrop. it's probably not a big deal but going just above 64 bytes on the size makes me a bit itchy.
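
(A minimal sketch of that shape, assuming the fields above; CmdResult and complete_req are stand-ins for whatever the real completion path takes:)

impl DeviceRequest {
    pub fn complete(self, res: CmdResult) {
        // Move the fields out of self so nothing is left behind to drop...
        let DeviceRequest { req, id, source, _nodrop } = self;
        // ...and forget the guard so its panicking Drop never runs.
        std::mem::forget(_nodrop);

        if let Some(minder) = source.upgrade() {
            // Hypothetical hook: hand the completion back to the queue.
            minder.complete_req(id, req, res);
        }
    }
}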


that said, I see this is also how Permit tracks completion and Drop. if you want to keep this as-is for consistency and we can shuffle that later, fine by me (I'm mostly operating on "we're here for every I/O" instinct. Assuming XXk iops I'd guess this is in the "important but individual changes are approximately noise" zone?)

pfmooney (Collaborator, Author):

Although I'm not all that worried about the size (because of the "pull" model, we don't end up holding very many outstanding requests at any given time), I do prefer this structure as you've suggested. I'll fix it here and for permit.

Comment on lines +203 to +222
match req.op {
    Operation::Read(off, len) => {
        probes::block_begin_read!(|| {
            (devqid, id, off as u64, len as u64)
        });
    }
    Operation::Write(off, len) => {
        probes::block_begin_write!(|| {
            (devqid, id, off as u64, len as u64)
        });
    }
    Operation::Flush => {
        probes::block_begin_flush!(|| { (devqid, id) });
    }
    Operation::Discard(off, len) => {
        probes::block_begin_discard!(|| {
            (devqid, id, off as u64, len as u64)
        });
    }
}

iximeow (Member):

relatively unfortunate that this goes through an indirect branch on the discriminant of Operation even if none of the probes are enabled. what do you think about having one probes::block_begin and leaving filtering on the type to the probe user?

this and the one below make me a little suspicious in a way that next_req_fn doesn't because I figure the operation mix is basically unpredictable whereas those indirect calls should be straightforward?
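
(A sketch of the consolidated form, with a hypothetical probes::block_begin probe and a u8 op code standing in as the decoder ring:)

probes::block_begin!(|| {
    // The match now lives inside the lazy closure, so it only
    // executes when the probe is actually enabled.
    let (op, off, len) = match req.op {
        Operation::Read(off, len) => (0u8, off as u64, len as u64),
        Operation::Write(off, len) => (1u8, off as u64, len as u64),
        Operation::Flush => (2u8, 0, 0),
        Operation::Discard(off, len) => (3u8, off as u64, len as u64),
    };
    (devqid, id, op, off, len)
});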

pfmooney (Collaborator, Author):

I agree that it's not ideal, but I'm also not sure about consolidating the probes:

  1. We're already pushing the limits when it comes to number of args (hence the amalgamation of devqid)
  2. Communicating the request type would mean a string, or a shared decoder ring for casting it to an integer, which is not fantastic, ergonomically speaking

I think the costs imposed here are relatively small, compared to the rest of the work required to process an IO.

}
}

pub(super) struct QueueMinder {

iximeow (Member):

I think the whole QueueCollection/QueueSlot/QueueSlotState/QueueMinder could use a few words on why we're picking indirection here rather than somewhere up that sequence. My read (which I think is reasonable, just would be nice to say up front) is that you decided to have NextReqFn and CompleteReqFn here so we don't have to plumb DQ through Request and QueueMinder up into an eventual

struct QueueSlotState {
    minder: Option<Arc<dyn QueueMinderTrait>>
}

when realistically there's no inner loop in the call chain under QueueSlotState that having a concrete function type would help with. and so if we ever end up in a situation with a loop over indirect calls under QueueSlotState, it might change the balance and justify plumbing types through, but we just don't have that today.
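
(For contrast, the concrete-function shape being described might look roughly like this; every name and signature here is a guess:)

type NextReqFn = fn(&(dyn Any + Send + Sync)) -> Option<Request>;
type CompleteReqFn = fn(&(dyn Any + Send + Sync), ReqId, CmdResult);

struct QueueSlotState {
    next_req: NextReqFn,
    complete_req: CompleteReqFn,
    // Type-erased queue handle that the functions downcast internally,
    // rather than an Arc<dyn QueueMinderTrait>.
    token: Box<dyn Any + Send + Sync>,
}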

pfmooney (Collaborator, Author):

When I was originally writing this, I wanted to avoid that type specialization leaking out far and wide. It turned into kind of a mess before. I'm not sure that it's especially more readable to push the QueueMinder interface into a trait in order to move the indirection thusly.

>;

struct QmEntry {
    token: Box<dyn Any + Send + Sync>,

iximeow (Member):

though plumbing DQ and the token type through would let us avoid the indirection here, and I think avoid a copy or two in building QmEntry. maybe profiling just says the hot spots are elsewhere and so indirection just keeps this a bit more flexible 'til we've chased out other problems?

pfmooney (Collaborator, Author):

Yeah, avoiding the alloc for Box-ing it, as well as the copies. Not plumbing it through is justified (or at least explained) below.

match self.delete_sq(sqid, nvme) {
    Ok(()) => {
        nvme.block_attach.queue_dissociate(sqid_to_block_qid(sqid));
        // TODO: wait until requests are done?

iximeow (Member):

this is just "we hadn't been waiting before, but we probably should do something about that", right?

pfmooney (Collaborator, Author):

Yeah. There's a whole area of stuff we should probably be better at when it comes to emulating admin commands. Now was not the time to fall down that hole.

// doorbell, now that we've dropped all involved locks.
if let Some(sqids) = to_notify {
    for (sqid, occupied) in
        sqids.into_iter().filter(|(id, _)| *id != queue::ADMIN_QUEUE_ID)

iximeow (Member):

it's basically fine to see ADMIN_QUEUE_ID here, right? just, there isn't a block queue to be notified. either someone set up a submission queue with the admin queue as the completion queue, or someone literally just rang the doorbell for the admin submission queue. could be helpful to say so here.

pfmooney (Collaborator, Author):

Not really? The existing register parsing structure has the AdminSq/AdminCq offsets split out, so we only end up calling ring_doorbell() for IO queues.

Comment on lines 635 to 637
/// Controller-enabled state, duplicated outside the protection of the
/// [NvmeCtrl] lock.
is_enabled: AtomicBool,

iximeow (Member):

probably not getting to it in this first pass, but "duplicated outside the protection" makes me immediately curious what the relationship between these enabled bits is and if there's a procedure around how they must be managed (can this only be written with NvmeCtrl locked? or is there some more precise situation here)

pfmooney (Collaborator, Author):

Yeah, it's written with NvmeCtrl locked. I've expanded the comment here to hopefully make it clearer?
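
(The discipline being described, sketched with hypothetical field names:)

fn set_enabled(&self, enabled: bool) {
    // Writers must hold the NvmeCtrl lock while flipping the flag...
    let mut ctrl = self.ctrl.lock().unwrap();
    ctrl.enabled = enabled;
    // ...then mirror the value for lock-free readers.
    self.is_enabled.store(enabled, Ordering::Release);
}

fn enabled_fast(&self) -> bool {
    // Hot-path readers consult only the atomic, never the lock.
    self.is_enabled.load(Ordering::Acquire)
}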

Comment on lines +1041 to +1045
let guard = self
    .queues
    .get_cq(qid)
    .ok_or(NvmeError::InvalidCompQueue(qid))?;
let cq = guard.as_ref().unwrap();

iximeow (Member):

nitty: I think spelling this as

Suggested change
let guard = self
    .queues
    .get_cq(qid)
    .ok_or(NvmeError::InvalidCompQueue(qid))?;
let cq = guard.as_ref().unwrap();
let guard = self.queues.get_cq(qid);
let Some(cq) = guard.as_ref() else {
    return Err(NvmeError::InvalidCompQueue(qid));
};

would be pretty acceptable if you wanted to drop the outer Option-ness. I'm pretty sure rustc would tolerate the borrows, at least.. (same for get_sq)

pfmooney (Collaborator, Author):

The outer option is used to guard against a doorbell rung for > MAX_QUEUES, which is possible.
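
(i.e., something along these lines, with the exact signature a guess:)

fn get_cq(
    &self,
    qid: QueueId,
) -> Option<MutexGuard<'_, Option<Arc<CompQueue>>>> {
    // A guest can ring a doorbell for any queue id, so bounds-check
    // before indexing into the fixed-size slot array.
    self.cqs.get(qid as usize).map(|slot| slot.lock().unwrap())
}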

    cqs: [Mutex<Option<Arc<CompQueue>>>; MAX_NUM_QUEUES],
}
impl NvmeQueues {
    /// Replace the contents of a [SubQueue] slot.

iximeow (Member):

this and set_cq_slot assert about old because if it was Some you must be destroying the queue, and if it was None we should have rejected the command to destroy the queue? so in both cases something must have happened that doesn't correspond to a legal NVMe command. the asserts make sense through that lens but I think a sentence about that would be helpful for later readers.
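
(Roughly this invariant, with hypothetical names:)

fn set_sq_slot(&self, qid: QueueId, val: Option<Arc<SubQueue>>) {
    let mut slot = self.sqs[qid as usize].lock().unwrap();
    let old = std::mem::replace(&mut *slot, val);
    // Creating a queue (Some) must find the slot empty; destroying one
    // (None) must find it occupied. Anything else means an NVMe command
    // that should have been rejected made it this far.
    assert_ne!(old.is_some(), slot.is_some());
}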

pfmooney (Collaborator, Author):

Ok. I've added a comment here.

@iximeow (Member) left a comment

somewhat fewer comments on the back half (plus what we've been talking about re. NoneProcessing)

(Some(ds), Some((bs, _))) if self.eq(ds) && self.eq(bs) => {
    // Device and backend agree about mutual attachment
}
_ => return,

iximeow (Member):

this could reasonably be an assertion, right? if they're both Some but disagree on their attachment, something has come and changed attachment state outside AttachPair::attach, which at least currently should be impossible. if one is None but not the other, we'd have to be racing an AttachPair::detach, but only one of attach or detach is happening at a time because of the locks above.

pfmooney (Collaborator, Author):

I'm not so sure that's safe. I believe it's possible that detach could race with something which detaches the device/backend, and then subsequently attaches them to different respective things. In that case the AttachPair here would remain valid, but the pointers would be different once we made it inside the locks.

let (Some(dev), Some(be)) =
    (self.dev_attach.upgrade(), self.backend_attach.upgrade())
else {
    return;

iximeow (Member):

not that I think we should assert this, but a comment might be helpful: if we've gotten here, one of dev_attach or backend_attach's Drop has run and already detached the device and block state. so it ought to be the case that if one of these is Some, its .att_state.lock() should be None.

mostly important to me that we're clear we're not leaking attachment state if we were racing attach/detach operations.

pfmooney (Collaborator, Author):

Comment added.

Comment on lines +321 to +322
let mut dev_att_state = dev.0.att_state.lock().unwrap();
let mut be_att_state = be.0.att_state.lock().unwrap();

iximeow (Member):

on the topic of racing, it seems worth mentioning on AttachPair that we are handling a series of locks, and so attach and detach have to agree on the ordering (not that it matters for how we do attach/detach in practice, because we detach by walking block backends rather than detaching backends and devices in a super racy way)
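
(Concretely, attach and detach just need to take the pair in one fixed order, device before backend as above; a hypothetical helper that would centralize the invariant:)

fn lock_both<'a>(
    dev: &'a DeviceAttachInner,
    be: &'a BackendAttachInner,
) -> (MutexGuard<'a, AttachState>, MutexGuard<'a, AttachState>) {
    // Same acquisition order on every path rules out deadlock
    // between racing attach and detach operations.
    let d = dev.att_state.lock().unwrap();
    let b = be.att_state.lock().unwrap();
    (d, b)
}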

pfmooney (Collaborator, Author):

I've added a comment. I hope it's adequate?

@pfmooney (Collaborator, Author) left a comment

Thanks for looking through so far. I think I've at least touched on all the points you've raised.

    lock: Mutex<()>,
    cv: Condvar,
/// Static for generating unique block [DeviceId]s within a process
static NEXT_DEVICE_ID: AtomicU32 = AtomicU32::new(1);

pfmooney (Collaborator, Author):

Changed to 0.
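
(Presumably feeding an allocator along these lines; the function and DeviceId constructor are assumptions:)

fn next_device_id() -> DeviceId {
    // Relaxed suffices: uniqueness needs only the atomic increment,
    // not ordering against other memory operations.
    DeviceId(NEXT_DEVICE_ID.fetch_add(1, Ordering::Relaxed))
}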

impl QueueCollection {
    fn new(max_queues: NonZeroUsize, devid: DeviceId) -> Arc<Self> {
        let count = max_queues.get();
        assert!(count < MAX_WORKERS.get());

pfmooney (Collaborator, Author):

Whoops, yup.

impl WorkerCollection {
    fn new(max_workers: NonZeroUsize) -> Arc<Self> {
        let max_workers = max_workers.get();
        assert!(max_workers < MAX_WORKERS.get());

pfmooney (Collaborator, Author):

Yup.

    size: u32,
    nvme: &PciNvme,
    mem: &MemCtx,
) -> Result<Arc<SubQueue>, NvmeError> {

pfmooney (Collaborator, Author):

I'm not sold on returning the QueueId at the moment, but we definitely need some stronger checks here. I've included some additional logic around the admin queues, and fixed the part where we were not doing proper association during state import.
