
Conversation

@iximeow iximeow (Member) commented Jan 6, 2026

`access()` typically implies an `Arc::clone` of the guarded resource. When that resource is `access()`'d frequently, especially across many cores, contention from incrementing and decrementing the shared refcount can become the primary bottleneck for Propolis.

This is particularly visible in `nvme`, where queues live in guest memory, are mediated by `MemCtx`, are managed across dozens of threads (hundreds in some cases!), and have millions of items read from and written to them per second. On large processors an increment of the refcount for an `Arc<MemCtx>` can take microseconds(!)
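As a standalone illustration of the problem (this is not Propolis code, just a hypothetical microbenchmark), every clone and drop of a shared `Arc` is an atomic read-modify-write on the same refcount cache line, so many cores doing it concurrently serialize on that line:

```rust
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    // Stand-in for a heavily shared resource like Arc<MemCtx>.
    let shared = Arc::new(vec![0u8; 4096]);

    let start = Instant::now();
    let handles: Vec<_> = (0..32)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..1_000_000 {
                    // Each clone/drop pair is two atomic RMWs on the shared
                    // refcount; with many threads this cache line ping-pongs
                    // between cores.
                    let r = Arc::clone(&shared);
                    std::hint::black_box(&r);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("32 threads x 1M clone/drop pairs: {:?}", start.elapsed());
}
```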


This patch is a less comprehensive improvement to the problem than I'd hoped for, but it *is* a substantial enough improvement to include for now.

The first try at this change set out to add an alternative to `accessors::Guard`, call it `accessor::View`, which would have a `refresh()` analogous to `Guard::access()`. `View` was expected to hold a reference on a `Node`, and `View::refresh()` was expected to go through the same "check `res_leaf` or else re-derive the leaf resource" steps as `guard()`.

After getting through the `lock_tree()` changes to keep a guard for a node when refreshing the leaf, it became clear that in most cases where we'd have an `accessor::View`, we would *not* have that view via a mutable reference.

`accessor::View`'s cached reference would need to be behind a Mutex for interior mutability, at which point a "View" is not much different from any other "Node" in the accessor tree.
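For reference, the abandoned design would have had roughly the shape sketched below; the types and fields here are purely illustrative and don't match the real accessor internals:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical, simplified stand-in for an accessor Node.
struct Node<T> {
    res_leaf: Mutex<Option<Arc<T>>>,
}

// The abandoned idea: a cached, refreshable handle on the guarded resource.
struct View<T> {
    node: Arc<Node<T>>,
    // refresh() only has &self available, so the cached reference needs a
    // Mutex for interior mutability -- which makes View behave much like
    // just another Node.
    cached: Mutex<Option<Arc<T>>>,
}

impl<T> View<T> {
    fn refresh(&self) -> Option<Arc<T>> {
        // Re-derive the leaf resource from the node, analogous to the
        // "check res_leaf or else re-derive" steps guard() performs.
        let leaf = self.node.res_leaf.lock().unwrap().clone();
        *self.cached.lock().unwrap() = leaf.clone();
        leaf
    }
}
```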

So, instead, this implements a less aspirational change: an "`access()`, but return the guard instead of cloning the reference". This is strictly less work than before, but it still implies going out to `mutex_lock` and such for something that, in the happy case, should be nothing more than a load. Even so, the mutexes in question are per-`Node` and generally uncontended, so the extra cost is hard to spot even under load.
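The shape of the change is roughly the sketch below; the names and signatures are illustrative rather than the actual accessor API:

```rust
use std::ops::Deref;
use std::sync::{Arc, Mutex, MutexGuard};

// Hypothetical, simplified stand-in for an accessor Node.
struct Node<T> {
    res_leaf: Mutex<Option<Arc<T>>>,
}

// A borrow of the node's cached resource: holding this never touches the
// shared refcount, only the (rarely contended) per-Node mutex.
struct ResourceRef<'a, T> {
    guard: MutexGuard<'a, Option<Arc<T>>>,
}

impl<'a, T> Deref for ResourceRef<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.guard.as_deref().expect("checked at construction")
    }
}

impl<T> Node<T> {
    // "access(), but return the guard instead of cloning the reference."
    fn access_ref(&self) -> Option<ResourceRef<'_, T>> {
        let guard = self.res_leaf.lock().unwrap();
        if guard.is_some() {
            Some(ResourceRef { guard })
        } else {
            None
        }
    }
}
```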


In the limit, an accessor tree that supports disabling access to subtrees is already up against the issue that `access()`'d `Guard`s don't synchronize with operations on the resource. A PCI bus can "deny" access to memory, but a device or its queues may still hold a `Guard` permitting access to memory after the fact, and use it.

It seems theoretically possible to lean into accessed resources being an `Arc<T>` and have nodes hold an `AtomicUsize` of that pointer, updating it as appropriate while managing the tree. Then, nodes and the tree generally could have generation counters, where on access the caller need only compare generations to check that the cached access is still valid. To deny access to a (sub)tree, the relevant `Node` could be modified, the tree's generation bumped, and the access-denier could wait for the reference count on the old tree to drop to zero.

Critically, comparing the generation numbers requires only loads, so cores don't fight over writes, and in the typical case the generation number won't change, so a view's reference can remain counted until a generation change is finally witnessed.
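A rough sketch of that direction, with entirely hypothetical types (and a plain `Arc` field standing in for the atomic pointer so the sketch stays in safe code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical node that can revoke access by bumping a generation counter.
struct GenNode<T> {
    generation: AtomicU64,
    resource: Arc<T>,
}

// A long-lived, cached access to the resource plus the generation it saw.
struct CachedAccess<T> {
    seen_generation: u64,
    resource: Arc<T>,
}

impl<T> GenNode<T> {
    fn check_out(&self) -> CachedAccess<T> {
        CachedAccess {
            seen_generation: self.generation.load(Ordering::Acquire),
            resource: Arc::clone(&self.resource),
        }
    }

    // Fast path on every use: a single load, no shared writes.
    fn still_valid(&self, cached: &CachedAccess<T>) -> bool {
        self.generation.load(Ordering::Acquire) == cached.seen_generation
    }

    // Deny access: bump the generation, then wait for outstanding cached
    // references to drain (busy-wait shown only for brevity).
    fn revoke(&self) {
        self.generation.fetch_add(1, Ordering::Release);
        while Arc::strong_count(&self.resource) > 1 {
            std::hint::spin_loop();
        }
    }
}
```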

I *think* the main difficulty with this approach is that it requires components holding a long-lived reference on the resource to regularly "check in" that the reference is still valid, and it would be very easy to accidentally sleep while holding a reference. This is the same general problem as blocking a writer because you've slept holding a read lock.

I didn't get to fully fleshing that out here, only due to time constraints; I don't see a reason it won't work (yet?).

@iximeow iximeow (Member Author) commented on this hunk of `try_lock_tree`:

```rust
    &'a self,
    tree_ref: &'a TreeBackref<T>,
) -> Result<MutexGuard<'a, Tree<T>>, TreeBackref<T>> {
fn try_lock_tree<'node, 'guard>(
```
I'm pretty sure this works as `<'a>`, but I had a hard time reasoning about how long the guards are held until I'd given them distinct names.

@iximeow iximeow (Member Author) commented Jan 6, 2026

This partially unwinds #678 (and in fact the comment on `try_tree_lock` about "the two guards are returned" has been stale since then!)

As I was writing this, I didn't see a way this could introduce novel locking issues, except via `lock_tree()` functions that didn't expect to get the current entry already locked as an argument. But I hadn't seen #678 before, and need to take a moment to digest it in case there's something more subtle here.

@AlejandroME AlejandroME added the local storage label Jan 6, 2026
@AlejandroME AlejandroME added this to the 18 milestone Jan 6, 2026
@iximeow iximeow force-pushed the ixi/borrowful-accessor branch from 72a2f97 to 66a6cec on January 8, 2026 17:31