Overhaul block attachment and request dispatch #953
Conversation
This addresses considerable lock contention and thundering herd issues present in the block IO abstractions. Devices with multiple queues (read: NVMe), serviced by multiple workers, should be in much better shape when it comes to efficiently farming out work.
This should not be merged before the R17 release branch is cut. We'll want soak time, considering the scope of work. I wanted to get it up and available for folks to review.
i've mostly not looked at propolis/src/block yet but figured this first round of thoughts would be useful. on the whole this seems pretty reasonable! but I see that transmute in attachment.rs, which I look forward to in.. round 2
lib/propolis/src/block/minder.rs
Outdated
/// A wrapper for an IO [Request] bearing necessary tracking information to
/// issue its completion back to the [queue](DeviceQueue) from which it came.
///
/// A panic will occur an [DeviceRequest] instance is dropped without calling
Suggested change:
- /// A panic will occur an [DeviceRequest] instance is dropped without calling
+ /// A panic will occur if a `DeviceRequest` instance is dropped without calling
grammar nit, and doc nit that i'd skip the rustdoc link since it should take you back to this same struct that you'd be reading the docs of.
Done.
lib/propolis/src/block/minder.rs
Outdated
pub struct DeviceRequest {
    req: Request,
    id: ReqId,
    completed: bool,
it's somewhat pleasant that complete() ends up optimizing out both the assignment and assert, but this does tick the size of DeviceRequest from 64 bytes up to 72 (so I'm also a little sad to learn that done isn't packed into the discriminant of Operation)
if you move the assert-ful Drop onto a zero-sized marker type, so this is like:
struct NoDrop;
impl Drop for NoDrop {
    fn drop(&mut self) {
        panic!("DeviceRequest should be completed before drop");
    }
}

pub struct DeviceRequest {
    req: Request,
    id: ReqId,
    source: Weak<QueueMinder>,
    _nodrop: NoDrop,
}
then you can destructure self in complete() and mem::forget() the NoDrop. it's probably not a big deal but going just above 64 bytes on the size makes me a bit itchy.
that said, I see this is also how Permit tracks completion and Drop. if you want to keep this as-is for consistency and we can shuffle that later, fine by me (I'm mostly operating on "we're here for every I/O" instinct. Assuming XXk iops I'd guess this is in the "important but individual changes are approximately noise" zone?)
Although I'm not all that worried about the size (because of the "pull" model, we don't end up holding very many outstanding requests at any given time), I do prefer this structure as you've suggested. I'll fix it here and for permit.
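A minimal sketch of what that destructure-and-forget shape might look like on the complete() side (the res parameter, its type, and the on_complete method are assumptions for illustration, not the actual propolis API):

impl DeviceRequest {
    pub fn complete(self, res: Result) {
        // Destructure so the NoDrop marker can be moved out and
        // forgotten; its panicking Drop impl then never runs.
        let DeviceRequest { req, id, source, _nodrop } = self;
        std::mem::forget(_nodrop);
        // Hand the completion back to the originating queue, if it
        // is still alive (hypothetical on_complete method).
        if let Some(minder) = source.upgrade() {
            minder.on_complete(id, req, res);
        }
    }
}

Since DeviceRequest itself no longer implements Drop in this shape, the destructure is legal, and forgetting the zero-sized marker costs nothing.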
match req.op {
    Operation::Read(off, len) => {
        probes::block_begin_read!(|| {
            (devqid, id, off as u64, len as u64)
        });
    }
    Operation::Write(off, len) => {
        probes::block_begin_write!(|| {
            (devqid, id, off as u64, len as u64)
        });
    }
    Operation::Flush => {
        probes::block_begin_flush!(|| { (devqid, id) });
    }
    Operation::Discard(off, len) => {
        probes::block_begin_discard!(|| {
            (devqid, id, off as u64, len as u64)
        });
    }
}
relatively unfortunate that this goes through an indirect branch on the discriminant of Operation even if none of the probes are enabled. what do you think about having one probes::block_begin and leaving filtering on the type to the probe user?
this and the one below make me a little suspicious in a way that next_req_fn doesn't, because I figure the operation mix is basically unpredictable whereas those indirect calls should be straightforward?
I agree that it's not ideal, but I'm also not sure about consolidating the probes:
- We're already pushing the limits when it comes to number of args (hence the amalgamation of devqid)
- Communicating the request type would mean a string, or a shared decoder ring for casting it to an integer, which is not fantastic, ergonomically speaking

I think the costs imposed here are relatively small, compared to the rest of the work required to process an IO.
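For reference, the consolidated probe being weighed might look roughly like this; the probe name, the extra op-kind argument, and its integer encoding (the "decoder ring" mentioned above) are all illustrative:

// Hypothetical shared encoding for the operation type.
fn op_kind(op: &Operation) -> u8 {
    match op {
        Operation::Read(..) => 0,
        Operation::Write(..) => 1,
        Operation::Flush => 2,
        Operation::Discard(..) => 3,
    }
}

probes::block_begin!(|| {
    // The match on the discriminant now lives inside the lazy
    // closure, so it only runs when the probe is enabled.
    let (off, len) = match req.op {
        Operation::Read(o, l)
        | Operation::Write(o, l)
        | Operation::Discard(o, l) => (o as u64, l as u64),
        Operation::Flush => (0, 0),
    };
    (devqid, id, op_kind(&req.op), off, len)
});

Five arguments is right at the comfortable limit the reply alludes to, which is part of the ergonomic cost.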
}
}

pub(super) struct QueueMinder {
I think the whole QueueCollection/QueueSlot/QueueSlotState/QueueMinder could use a few words on why we're picking indirection here rather than somewhere up that sequence. My read (which I think is reasonable, just would be nice to say up front) is that you decided to have NextReqFn and CompleteReqFn here so we don't have to plumb DQ through Request and QueueMinder up into an eventual

struct QueueSlotState {
    minder: Option<Arc<dyn QueueMinderTrait>>,
}

when realistically there's no inner loop in the call chain under QueueSlotState that having a concrete function type would help with. and so if we ever end up in a situation with a loop over indirect calls under QueueSlotState, it might change the balance and justify plumbing types through, but we just don't have that today.
When I was originally writing this, I wanted to avoid that type specialization leaking out far and wide. It turned into kind of a mess before. I'm not sure that it's especially more readable to push the QueueMinder interface into a trait in order to move the indirection that way.
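For contrast, a rough sketch of the concrete-function shape under discussion; the exact signatures are guesses from the names in this thread:

// Hypothetical signatures for the two dispatch hooks named above.
type NextReqFn = Box<dyn Fn() -> Option<(Request, ReqId)> + Send + Sync>;
type CompleteReqFn = Box<dyn Fn(ReqId, Result) + Send + Sync>;

pub(super) struct QueueMinder {
    // Queue-type-specific behavior is captured in these closures at
    // construction time, so nothing above this type needs the
    // device-queue type parameter DQ.
    next_req: NextReqFn,
    complete_req: CompleteReqFn,
}

Either way the call is indirect; the difference is where the type parameter stops propagating.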
>;

struct QmEntry {
    token: Box<dyn Any + Send + Sync>,
though plumbing DQ and the token type through would let us avoid the indirection here, and I think avoid a copy or two in building QmEntry. maybe profiling just says the hot spots are elsewhere and so indirection just keeps this a bit more flexible 'til we've chased out other problems?
Yeah, avoiding the alloc for Box-ing it, as well as the copies. Not plumbing it through is justified (or at least explained) below.
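The alternative being passed over is roughly this generic shape (the DeviceQueue trait and its associated Token type are assumptions from context):

// Plumbing DQ through would let the token live inline, trading
// flexibility for one less Box allocation per entry.
struct QmEntry<DQ: DeviceQueue> {
    token: DQ::Token,
}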
match self.delete_sq(sqid, nvme) {
    Ok(()) => {
        nvme.block_attach.queue_dissociate(sqid_to_block_qid(sqid));
        // TODO: wait until requests are done?
this is just "we hadn't been waiting before, but we probably should do something about that", right?
Yeah. There's a whole area of stuff we should probably be better at when it comes to emulating admin commands. Now was not the time to fall down that hole.
// doorbell, now that we've dropped all involved locks.
if let Some(sqids) = to_notify {
    for (sqid, occupied) in
        sqids.into_iter().filter(|(id, _)| *id != queue::ADMIN_QUEUE_ID)
it's basically fine to see ADMIN_QUEUE_ID here, right? just, there isn't a block queue to be notified. either someone set up a submission queue with the admin queue as the completion queue, or someone literally just rang the doorbell for the admin submission queue. could be helpful to say so here.
Not really? The existing register parsing structure has the AdminSq/AdminCq offsets split out, so we only end up calling ring_doorbell() for IO queues.
lib/propolis/src/hw/nvme/mod.rs
Outdated
/// Controller-enabled state, duplicated outside the protection of the
/// [NvmeCtrl] lock.
is_enabled: AtomicBool,
probably not getting to it in this first pass, but "duplicated outside the protection" makes me immediately curious what the relationship between these enabled bits is and if there's a procedure around how they must be managed (can this only be written with NvmeCtrl locked? or is there some more precise situation here)
Yeah, it's written with NvmeCtrl locked. I've expanded the comment here to hopefully make it clearer?
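Presumably this is the common write-under-lock, read-lock-free pattern; a sketch of the protocol the expanded comment describes (method name and field layout assumed):

fn set_enabled(&self, ctrl: &mut NvmeCtrl, enabled: bool) {
    // Authoritative copy lives under the NvmeCtrl lock; the &mut
    // borrow of the guarded data means the caller must hold it.
    ctrl.enabled = enabled;
    // Racy duplicate for readers that can't afford the lock; they
    // may see a slightly stale value and must re-check under the
    // lock when it matters.
    self.is_enabled.store(enabled, Ordering::Release);
}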
let guard = self
    .queues
    .get_cq(qid)
    .ok_or(NvmeError::InvalidCompQueue(qid))?;
let cq = guard.as_ref().unwrap();
nitty: I think spelling this as
Suggested change:
- let guard = self
-     .queues
-     .get_cq(qid)
-     .ok_or(NvmeError::InvalidCompQueue(qid))?;
- let cq = guard.as_ref().unwrap();
+ let guard = self
+     .queues
+     .get_cq(qid);
+ let Some(cq) = guard.as_ref() else {
+     return Err(NvmeError::InvalidCompQueue(qid));
+ };
would be pretty acceptable if you wanted to drop the outer Option-ness. I'm pretty sure rustc would tolerate the borrows, at least.. (same for get_sq)
The outer option is used to guard against a doorbell rung for > MAX_QUEUES, which is possible.
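In other words, the two Option layers answer different questions. A sketch of the assumed accessor shape:

// Outer Option: None for a qid beyond the queue array (a doorbell
// write can name any qid).  Inner Option: None for an in-range slot
// with no live queue.
fn get_cq(
    &self,
    qid: QueueId,
) -> Option<MutexGuard<'_, Option<Arc<CompQueue>>>> {
    self.cqs.get(qid as usize).map(|m| m.lock().unwrap())
}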
    cqs: [Mutex<Option<Arc<CompQueue>>>; MAX_NUM_QUEUES],
}

impl NvmeQueues {
    /// Replace the contents of a [SubQueue] slot.
this and set_cq_slot assert about old because if it was Some you must be destroying the queue, and if it was None we should have rejected the command to destroy the queue? so in both cases something must have happened that doesn't correspond to a legal NVMe command. the asserts make sense through that lens but I think a sentence about that would be helpful for later readers.
Ok. I've added a comment here.
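Something in that spirit, sketched (names follow the thread; the body is illustrative, not the actual patch):

fn set_sq_slot(&self, qid: QueueId, new: Option<Arc<SubQueue>>) {
    let mut slot = self.sqs[qid as usize].lock().unwrap();
    let old = std::mem::replace(&mut *slot, new);
    // A slot may only go None -> Some (queue created) or Some ->
    // None (queue deleted); anything else means an illegal NVMe
    // admin command slipped past validation.
    assert!(old.is_some() != slot.is_some());
}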
somewhat fewer comments on the back half (plus what we've been talking about re. NoneProcessing)
(Some(ds), Some((bs, _))) if self.eq(ds) && self.eq(bs) => {
    // Device and backend agree about mutual attachment
}
_ => return,
this could reasonably be an assertion, right? if they're both Some but disagree on their attachment, something has come and changed attachment state outside AttachPair::attach, which at least currently should be impossible. if one is None but not the other, we'd have to be racing an AttachPair::detach, but only one of attach or detach is happening at a time because of the locks above.
I'm not so sure that's safe. I believe it's possible that detach could race with something which detaches the device/backend, and then subsequently attaches them to different respective things. In that case the AttachPair here would remain valid, but the pointers would be different once we made it inside the locks.
let (Some(dev), Some(be)) =
    (self.dev_attach.upgrade(), self.backend_attach.upgrade())
else {
    return;
not that I think we should assert this, but a comment might be helpful: if we've gotten here, one of dev_attach or backend_attach's Drop has run and already detached the device and block state. so it ought to be the case that if one of these is Some, its .att_state.lock() should be None.

mostly important to me that we're clear we're not leaking attachment state if we were racing attach/detach operations.
Comment added.
let mut dev_att_state = dev.0.att_state.lock().unwrap();
let mut be_att_state = be.0.att_state.lock().unwrap();
on the topic of racing, it seems worth mentioning on AttachPair that we are taking a series of locks, and so attach and detach have to agree on the ordering (not that it matters for how we do attach/detach in practice, because we detach by walking block backends rather than detaching backends and devices in a super racy way)
I've added a comment. I hope it's adequate?
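One way such a comment might read, attached to the lock sequence in question (wording assumed, not the actual patch):

// Lock-ordering invariant for AttachPair: attach() and detach()
// must both take the device-side att_state lock before the
// backend-side one; if either path reversed the order, two racing
// calls could deadlock.
let mut dev_att_state = dev.0.att_state.lock().unwrap();
let mut be_att_state = be.0.att_state.lock().unwrap();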
Thanks for looking through so far. I think I've at least touched on all the points you've raised.
lib/propolis/src/block/attachment.rs
Outdated
lock: Mutex<()>,
cv: Condvar,

/// Static for generating unique block [DeviceId]s within a process
static NEXT_DEVICE_ID: AtomicU32 = AtomicU32::new(1);
Changed to 0
lib/propolis/src/block/attachment.rs
Outdated
impl QueueCollection {
    fn new(max_queues: NonZeroUsize, devid: DeviceId) -> Arc<Self> {
        let count = max_queues.get();
        assert!(count < MAX_WORKERS.get());
Whoops, yup.
lib/propolis/src/block/attachment.rs
Outdated
impl WorkerCollection {
    fn new(max_workers: NonZeroUsize) -> Arc<Self> {
        let max_workers = max_workers.get();
        assert!(max_workers < MAX_WORKERS.get());
Yup.
    size: u32,
    nvme: &PciNvme,
    mem: &MemCtx,
) -> Result<Arc<SubQueue>, NvmeError> {
I'm not sold on returning the QueueId at the moment, but we definitely need some stronger checks here. I've included some additional logic around the admin queues, and fixed the part where we were not doing proper association during state import.