Conversation

@pfmooney (Collaborator) commented Oct 3, 2025

This addresses considerable lock contention and thundering herd issues present in the block IO abstractions. Devices with multiple queues (read: NVMe), serviced by multiple workers, should be in much better shape when it comes to efficiently farming out work.

@pfmooney (Collaborator, Author) commented Oct 3, 2025

This should not be merged before the R17 release branch is cut. Considering the scope of the work, we'll want soak time. I wanted to get it up and available for folks to review.

@pfmooney requested a review from iximeow on October 3, 2025 02:48

@iximeow (Member) left a comment

i've mostly not looked at propolis/src/block yet but figured this first round of thoughts would be useful. on the whole this seems pretty reasonable! but I see that transmute in attachment.rs which I look forward to in.. round 2

/// A wrapper for an IO [Request] bearing necessary tracking information to
/// issue its completion back to the [queue](DeviceQueue) from which it came.
///
/// A panic will occur an [DeviceRequest] instance is dropped without calling

iximeow (Member):

Suggested change
/// A panic will occur an [DeviceRequest] instance is dropped without calling
/// A panic will occur if a `DeviceRequest` instance is dropped without calling

grammar nit, and doc nit that i'd skip the rustdoc link since it should take you back to this same struct that you'd be reading the docs of.

pfmooney (Collaborator, Author):

Done.

pub struct DeviceRequest {
    req: Request,
    id: ReqId,
    completed: bool,

iximeow (Member):

it's somewhat pleasant that complete() ends up optimizing out both the assignment and assert but this does tick the size of DeviceRequest from 64 bytes up to 72 (so I'm also a little sad to learn that done isn't packed into the discriminant of Operation)

if you move the assert-ful Drop so this is like:

struct NoDrop;

impl Drop for NoDrop {
    fn drop(&mut self) {
        panic!("DeviceRequest should be completed before drop");
    }
}

pub struct DeviceRequest {
    req: Request,
    id: ReqId,
    source: Weak<QueueMinder>,
    _nodrop: NoDrop,
}

then you can destructure self in complete() and mem::forget() the NoDrop. it's probably not a big deal but going just above 64 bytes on the size makes me a bit itchy.
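
(A minimal sketch of that shape, assuming the fields above; CmdResult and complete_req are stand-ins for whatever the real completion path takes:)

impl DeviceRequest {
    pub fn complete(self, res: CmdResult) {
        // Move the fields out of self so nothing is left behind to drop...
        let DeviceRequest { req, id, source, _nodrop } = self;
        // ...and forget the guard so its panicking Drop never runs.
        std::mem::forget(_nodrop);

        if let Some(minder) = source.upgrade() {
            // Hypothetical hook: hand the completion back to the queue.
            minder.complete_req(id, req, res);
        }
    }
}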


that said, I see this is also how Permit tracks completion and Drop. if you want to keep this as-is for consistency and we can shuffle that later, fine by me (I'm mostly operating on "we're here for every I/O" instinct. Assuming XXk iops I'd guess this is in the "important but individual changes are approximately noise" zone?)

pfmooney (Collaborator, Author):

Although I'm not all that worried about the size (because of the "pull" model, we don't end up holding very many outstanding requests at any given time), I do prefer this structure as you've suggested. I'll fix it here and for permit.

Comment on lines +203 to +222
match req.op {
    Operation::Read(off, len) => {
        probes::block_begin_read!(|| {
            (devqid, id, off as u64, len as u64)
        });
    }
    Operation::Write(off, len) => {
        probes::block_begin_write!(|| {
            (devqid, id, off as u64, len as u64)
        });
    }
    Operation::Flush => {
        probes::block_begin_flush!(|| { (devqid, id) });
    }
    Operation::Discard(off, len) => {
        probes::block_begin_discard!(|| {
            (devqid, id, off as u64, len as u64)
        });
    }
}

iximeow (Member):

relatively unfortunate that this goes through an indirect branch on the discriminant of Operation even if none of the probes are enabled. what do you think about having one probes::block_begin and leaving filtering on the type to the probe user?

this and the one below make me a little suspicious in a way that next_req_fn doesn't because I figure the operation mix is basically unpredictable whereas those indirect calls should be straightforward?
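
(A sketch of the consolidated form, with a hypothetical probes::block_begin probe and a u8 op code standing in as the decoder ring:)

probes::block_begin!(|| {
    // The match now lives inside the lazy closure, so it only
    // executes when the probe is actually enabled.
    let (op, off, len) = match req.op {
        Operation::Read(off, len) => (0u8, off as u64, len as u64),
        Operation::Write(off, len) => (1u8, off as u64, len as u64),
        Operation::Flush => (2u8, 0, 0),
        Operation::Discard(off, len) => (3u8, off as u64, len as u64),
    };
    (devqid, id, op, off, len)
});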

pfmooney (Collaborator, Author):

I agree that it's not ideal, but I'm also not sure about consolidating the probes:

  1. We're already pushing the limits when it comes to number of args (hence the amalgamation of devqid)
  2. Communicating the request type would mean a string, or a shared decoder ring for casting it to an integer, which is not fantastic, ergonomically speaking

I think the costs imposed here are relatively small, compared to the rest of the work required to process an IO.

}
}

pub(super) struct QueueMinder {

iximeow (Member):

I think the whole QueueCollection/QueueSlot/QueueSlotState/QueueMinder could use a few words on why we're picking indirection here rather than somewhere up that sequence. My read (which I think is reasonable, just would be nice to say up front) is that you decided to have NextReqFn and CompleteReqFn here so we don't have to plumb DQ through Request and QueueMinder up into an eventual

struct QueueSlotState {
    minder: Option<Arc<dyn QueueMinderTrait>>
}

when realistically there's no inner loop in the call chain under QueueSlotState that having a concrete function type would help with. and so if we ever end up in a situation with a loop over indirect calls under QueueSlotState, it might change the balance and justify plumbing types through, but we just don't have that today.
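
(For contrast, the concrete-function shape being described might look roughly like this; every name and signature here is a guess:)

type NextReqFn = fn(&(dyn Any + Send + Sync)) -> Option<Request>;
type CompleteReqFn = fn(&(dyn Any + Send + Sync), ReqId, CmdResult);

struct QueueSlotState {
    next_req: NextReqFn,
    complete_req: CompleteReqFn,
    // Type-erased queue handle that the functions downcast internally,
    // rather than an Arc<dyn QueueMinderTrait>.
    token: Box<dyn Any + Send + Sync>,
}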

pfmooney (Collaborator, Author):

When I was originally writing this, I wanted to avoid that type specialization leaking out far and wide. It turned into kind of a mess before. I'm not sure that it's especially more readable to push the QueueMinder interface into a trait in order to move the indirection thusly.

>;

struct QmEntry {
    token: Box<dyn Any + Send + Sync>,

iximeow (Member):

though plumbing DQ and the token type through would let us avoid the indirection here, and I think avoid a copy or two in building QmEntry. maybe profiling just says the hot spots are elsewhere and so indirection just keeps this a bit more flexible 'til we've chased out other problems?

pfmooney (Collaborator, Author):

Yeah, avoiding the alloc for Box-ing it, as well as the copies. Not plumbing it through is justified (or at least explained) below.

match self.delete_sq(sqid, nvme) {
    Ok(()) => {
        nvme.block_attach.queue_dissociate(sqid_to_block_qid(sqid));
        // TODO: wait until requests are done?

iximeow (Member):

this is just "we hadn't been waiting before, but we probably should do something about that", right?

pfmooney (Collaborator, Author):

Yeah. There's a whole area of stuff we should probably be better at when it comes to emulating admin commands. Now was not the time to fall down that hole.

// doorbell, now that we've dropped all involved locks.
if let Some(sqids) = to_notify {
    for (sqid, occupied) in
        sqids.into_iter().filter(|(id, _)| *id != queue::ADMIN_QUEUE_ID)

iximeow (Member):

it's basically fine to see ADMIN_QUEUE_ID here, right? just, there isn't a block queue to be notified. either someone set up a submission queue with the admin queue as the completion queue, or someone literally just rang the doorbell for the admin submission queue. could be helpful to say so here.

pfmooney (Collaborator, Author):

Not really? The existing register parsing structure has the AdminSq/AdminCq offsets split out, so we only end up calling ring_doorbell() for IO queues.

Comment on lines 635 to 637
/// Controller-enabled state, duplicated outside the protection of the
/// [NvmeCtrl] lock.
is_enabled: AtomicBool,

iximeow (Member):

probably not getting to it in this first pass, but "duplicated outside the protection" makes me immediately curious what the relationship between these enabled bits is and if there's a procedure around how they must be managed (can this only be written with NvmeCtrl locked? or is there some more precise situation here)

pfmooney (Collaborator, Author):

Yeah, it's written with NvmeCtrl locked. I've expanded the comment here to hopefully make it clearer?
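
(The discipline being described, sketched with hypothetical field names:)

fn set_enabled(&self, enabled: bool) {
    // Writers must hold the NvmeCtrl lock while flipping the flag...
    let mut ctrl = self.ctrl.lock().unwrap();
    ctrl.enabled = enabled;
    // ...then mirror the value for lock-free readers.
    self.is_enabled.store(enabled, Ordering::Release);
}

fn enabled_fast(&self) -> bool {
    // Hot-path readers consult only the atomic, never the lock.
    self.is_enabled.load(Ordering::Acquire)
}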

Comment on lines +1041 to +1045
let guard = self
    .queues
    .get_cq(qid)
    .ok_or(NvmeError::InvalidCompQueue(qid))?;
let cq = guard.as_ref().unwrap();

iximeow (Member):

nitty: I think spelling this as

Suggested change
let guard = self
    .queues
    .get_cq(qid)
    .ok_or(NvmeError::InvalidCompQueue(qid))?;
let cq = guard.as_ref().unwrap();
let guard = self.queues.get_cq(qid);
let Some(cq) = guard.as_ref() else {
    return Err(NvmeError::InvalidCompQueue(qid));
};

would be pretty acceptable if you wanted to drop the outer Option-ness. I'm pretty sure rustc would tolerate the borrows, at least.. (same for get_sq)

pfmooney (Collaborator, Author):

The outer option is used to guard against a doorbell rung for > MAX_QUEUES, which is possible.
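
(i.e., something along these lines, with the exact signature a guess:)

fn get_cq(
    &self,
    qid: QueueId,
) -> Option<MutexGuard<'_, Option<Arc<CompQueue>>>> {
    // A guest can ring a doorbell for any queue id, so bounds-check
    // before indexing into the fixed-size slot array.
    self.cqs.get(qid as usize).map(|slot| slot.lock().unwrap())
}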

    cqs: [Mutex<Option<Arc<CompQueue>>>; MAX_NUM_QUEUES],
}
impl NvmeQueues {
    /// Replace the contents of a [SubQueue] slot.

iximeow (Member):

this and set_cq_slot assert about old because if it was Some you must be destroying the queue, and if it was None we should have rejected the command to destroy the queue? so in both cases something must have happened that doesn't correspond to a legal NVMe command. the asserts make sense through that lens but I think a sentence about that would be helpful for later readers.
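
(Roughly this invariant, with hypothetical names:)

fn set_sq_slot(&self, qid: QueueId, val: Option<Arc<SubQueue>>) {
    let mut slot = self.sqs[qid as usize].lock().unwrap();
    let old = std::mem::replace(&mut *slot, val);
    // Creating a queue (Some) must find the slot empty; destroying one
    // (None) must find it occupied. Anything else means an NVMe command
    // that should have been rejected made it this far.
    assert_ne!(old.is_some(), slot.is_some());
}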

pfmooney (Collaborator, Author):

Ok. I've added a comment here.

@iximeow (Member) left a comment

somewhat fewer comments on the back half (plus what we've been talking about re. NoneProcessing)

(Some(ds), Some((bs, _))) if self.eq(ds) && self.eq(bs) => {
    // Device and backend agree about mutual attachment
}
_ => return,

iximeow (Member):

this could reasonably be an assertion, right? if they're both Some but disagree on their attachment, something has come and changed attachment state outside AttachPair::attach, which at least currently should be impossible. if one is None but not the other, we'd have to be racing an AttachPair::detach, but only one of attach or detach is happening at a time because of the locks above.

pfmooney (Collaborator, Author):

I'm not so sure that's safe. I believe it's possible that detach could race with something which detaches the device/backend, and then subsequently attaches them to different respective things. In that case the AttachPair here would remain valid, but the pointers would be different once we made it inside the locks.

let (Some(dev), Some(be)) =
    (self.dev_attach.upgrade(), self.backend_attach.upgrade())
else {
    return;

iximeow (Member):

not that I think we should assert this, but a comment might be helpful: if we've gotten here, one of dev_attach or backend_attach's Drop has run and already detached the device and block state. so it ought to be the case that if one of these is Some, its .att_state.lock() should be None.

mostly important to me that we're clear we're not leaking attachment state if we were racing attach/detach operations.

pfmooney (Collaborator, Author):

Comment added.

Comment on lines +321 to +322
let mut dev_att_state = dev.0.att_state.lock().unwrap();
let mut be_att_state = be.0.att_state.lock().unwrap();

iximeow (Member):

on the topic of racing, it seems worth mentioning on AttachPair that we are handling a series of locks, and so attach and detach have to agree on the ordering (not that it matters for how we do attach/detach in practice, because we detach by walking block backends rather than detaching backends and devices in a super racy way)
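
(Concretely, attach and detach just need to take the pair in one fixed order, device before backend as above; a hypothetical helper that would centralize the invariant:)

fn lock_both<'a>(
    dev: &'a DeviceAttachInner,
    be: &'a BackendAttachInner,
) -> (MutexGuard<'a, AttachState>, MutexGuard<'a, AttachState>) {
    // Same acquisition order on every path rules out deadlock
    // between racing attach and detach operations.
    let d = dev.att_state.lock().unwrap();
    let b = be.att_state.lock().unwrap();
    (d, b)
}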

pfmooney (Collaborator, Author):

I've added a comment. I hope it's adequate?

@pfmooney (Collaborator, Author) left a comment

Thanks for looking through so far. I think I've at least touched on all the points you've raised.

    lock: Mutex<()>,
    cv: Condvar,
/// Static for generating unique block [DeviceId]s within a process
static NEXT_DEVICE_ID: AtomicU32 = AtomicU32::new(1);

pfmooney (Collaborator, Author):

Changed to 0.
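
(Presumably feeding an allocator along these lines; the function and DeviceId constructor are assumptions:)

fn next_device_id() -> DeviceId {
    // Relaxed suffices: uniqueness needs only the atomic increment,
    // not ordering against other memory operations.
    DeviceId(NEXT_DEVICE_ID.fetch_add(1, Ordering::Relaxed))
}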

impl QueueCollection {
    fn new(max_queues: NonZeroUsize, devid: DeviceId) -> Arc<Self> {
        let count = max_queues.get();
        assert!(count < MAX_WORKERS.get());

pfmooney (Collaborator, Author):

Whoops, yup.

impl WorkerCollection {
    fn new(max_workers: NonZeroUsize) -> Arc<Self> {
        let max_workers = max_workers.get();
        assert!(max_workers < MAX_WORKERS.get());

pfmooney (Collaborator, Author):

Yup.

    size: u32,
    nvme: &PciNvme,
    mem: &MemCtx,
) -> Result<Arc<SubQueue>, NvmeError> {

pfmooney (Collaborator, Author):

I'm not sold on returning the QueueId at the moment, but we definitely need some stronger checks here. I've included some additional logic around the admin queues, and fixed the part where we were not doing proper association during state import.
