
Conversation

@iximeow iximeow (Member) commented Jan 6, 2026

`access()` typically implies an `Arc::clone` of the guarded resource. When that resource is `access()`'d frequently, especially across many cores, contention from incrementing and decrementing the shared refcount can become the primary bottleneck for Propolis.

This is particularly visible in `nvme`, where queues live in guest memory, are mediated by `MemCtx`, are managed across dozens of threads (hundreds in some cases!), and have millions of items read from and written to them per second. On large processors an increment of the refcount for an `Arc<MemCtx>` can take microseconds(!)
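As a standalone illustration of the problem (this is not Propolis code, just a hypothetical microbenchmark), every clone and drop of a shared `Arc` is an atomic read-modify-write on the same refcount cache line, so many cores doing it concurrently serialize on that line:

```rust
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    // Stand-in for a heavily shared resource like Arc<MemCtx>.
    let shared = Arc::new(vec![0u8; 4096]);

    let start = Instant::now();
    let handles: Vec<_> = (0..32)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..1_000_000 {
                    // Each clone/drop pair is two atomic RMWs on the shared
                    // refcount; with many threads this cache line ping-pongs
                    // between cores.
                    let r = Arc::clone(&shared);
                    std::hint::black_box(&r);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("32 threads x 1M clone/drop pairs: {:?}", start.elapsed());
}
```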


This patch is a less comprehensive improvement to the problem than I'd hoped for, but it *is* a substantial enough improvement to include for now.

The first try at this change set out to add an alternative to `accessors::Guard`, call it `accessor::View`, which would have a `refresh()` analogous to `Guard::access()`. `View` was expected to hold a reference on a `Node`, and `View::refresh()` was expected to go through the same "check `res_leaf` or else re-derive the leaf resource" steps as `guard()`.

After getting through the `lock_tree()` changes to keep a guard for a node when refreshing the leaf, it became clear that in most cases where we'd have an `accessor::View`, we would *not* have that view via a mutable reference.

`accessor::View`'s cached reference would need to be behind a Mutex for interior mutability, at which point a "View" is not much different from any other "Node" in the accessor tree.
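For reference, the abandoned design would have had roughly the shape sketched below; the types and fields here are purely illustrative and don't match the real accessor internals:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical, simplified stand-in for an accessor Node.
struct Node<T> {
    res_leaf: Mutex<Option<Arc<T>>>,
}

// The abandoned idea: a cached, refreshable handle on the guarded resource.
struct View<T> {
    node: Arc<Node<T>>,
    // refresh() only has &self available, so the cached reference needs a
    // Mutex for interior mutability -- which makes View behave much like
    // just another Node.
    cached: Mutex<Option<Arc<T>>>,
}

impl<T> View<T> {
    fn refresh(&self) -> Option<Arc<T>> {
        // Re-derive the leaf resource from the node, analogous to the
        // "check res_leaf or else re-derive" steps guard() performs.
        let leaf = self.node.res_leaf.lock().unwrap().clone();
        *self.cached.lock().unwrap() = leaf.clone();
        leaf
    }
}
```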

So, instead, this implements a less aspirational change: an "`access()`, but return the guard instead of cloning the reference". This is strictly less work than before, but it still implies going out to `mutex_lock` and such for something that, in the happy case, should be nothing more than a load. Even so, the mutexes in question are per-`Node` and generally uncontended, so the extra cost is hard to spot even under load.
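The shape of the change is roughly the sketch below; the names and signatures are illustrative rather than the actual accessor API:

```rust
use std::ops::Deref;
use std::sync::{Arc, Mutex, MutexGuard};

// Hypothetical, simplified stand-in for an accessor Node.
struct Node<T> {
    res_leaf: Mutex<Option<Arc<T>>>,
}

// A borrow of the node's cached resource: holding this never touches the
// shared refcount, only the (rarely contended) per-Node mutex.
struct ResourceRef<'a, T> {
    guard: MutexGuard<'a, Option<Arc<T>>>,
}

impl<'a, T> Deref for ResourceRef<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.guard.as_deref().expect("checked at construction")
    }
}

impl<T> Node<T> {
    // "access(), but return the guard instead of cloning the reference."
    fn access_ref(&self) -> Option<ResourceRef<'_, T>> {
        let guard = self.res_leaf.lock().unwrap();
        if guard.is_some() {
            Some(ResourceRef { guard })
        } else {
            None
        }
    }
}
```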


In the limit, an accessor tree that supports disabling access to subtrees is already up against the issue that `access()`'d `Guard`s don't synchronize with operations on the resource. A PCI bus can "deny" access to memory, but a device or its queues may still hold a `Guard` permitting access to memory after the fact, and use it.

It seems theoretically possible to lean into accessed resources being an `Arc<T>` and have nodes hold an `AtomicUsize` of that pointer, updating it as appropriate while managing the tree. Then, nodes and the tree generally could have generation counters, where on access the caller need only compare generations to check that the cached access is still valid. To deny access to a (sub)tree, the relevant `Node` could be modified, the tree's generation bumped, and the access-denier could wait for the reference count on the old tree to drop to zero.

Critically, comparing the generation numbers requires only loads, so cores don't fight over writes, and in the typical case the generation number won't change, so a view's reference can remain counted until a generation change is finally witnessed.
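A rough sketch of that direction, with entirely hypothetical types (and a plain `Arc` field standing in for the atomic pointer so the sketch stays in safe code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical node that can revoke access by bumping a generation counter.
struct GenNode<T> {
    generation: AtomicU64,
    resource: Arc<T>,
}

// A long-lived, cached access to the resource plus the generation it saw.
struct CachedAccess<T> {
    seen_generation: u64,
    resource: Arc<T>,
}

impl<T> GenNode<T> {
    fn check_out(&self) -> CachedAccess<T> {
        CachedAccess {
            seen_generation: self.generation.load(Ordering::Acquire),
            resource: Arc::clone(&self.resource),
        }
    }

    // Fast path on every use: a single load, no shared writes.
    fn still_valid(&self, cached: &CachedAccess<T>) -> bool {
        self.generation.load(Ordering::Acquire) == cached.seen_generation
    }

    // Deny access: bump the generation, then wait for outstanding cached
    // references to drain (busy-wait shown only for brevity).
    fn revoke(&self) {
        self.generation.fetch_add(1, Ordering::Release);
        while Arc::strong_count(&self.resource) > 1 {
            std::hint::spin_loop();
        }
    }
}
```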

I *think* the main difficulty with this approach is that it requires components holding a long-lived reference on the resource to regularly "check in" that the reference is still valid, and it would be very easy to accidentally sleep while holding a reference. This is the same general problem as blocking a writer because you've slept holding a read lock.

I didn't get to fully fleshing that out here, only due to time constraints; I don't see a reason it won't work (yet?).

@iximeow iximeow (Member Author) commented on this hunk of `try_lock_tree`:

```rust
    &'a self,
    tree_ref: &'a TreeBackref<T>,
) -> Result<MutexGuard<'a, Tree<T>>, TreeBackref<T>> {
fn try_lock_tree<'node, 'guard>(
```
I'm pretty sure this works as `<'a>`, but I had a hard time reasoning about how long the guards are held until I'd given them distinct names.

@iximeow iximeow (Member Author) commented Jan 6, 2026

This partially unwinds #678 (and in fact the comment on `try_tree_lock` about "the two guards are returned" has been stale since then!)

As I was writing this, I didn't see a way this could introduce novel locking issues, except via `lock_tree()` functions that didn't expect to get the current entry already locked as an argument. But I hadn't seen #678 before, and need to take a moment to digest it in case there's something more subtle here.

@AlejandroME AlejandroME added the local storage label Jan 6, 2026
@AlejandroME AlejandroME added this to the 18 milestone Jan 6, 2026
@iximeow iximeow force-pushed the ixi/borrowful-accessor branch from 72a2f97 to 66a6cec on January 8, 2026 17:31