
refactor: replace PersistSemaphore with PersistChannel #1710

Open
RodrigoVillar wants to merge 9 commits into main from rodrigo/replace-persist-semaphore-with-channel-condvar

Conversation

Contributor

@RodrigoVillar RodrigoVillar commented Feb 24, 2026

Why this should be merged

Quoting #1694:

The current solution uses a combination of a modified semaphore and a channel to apply back pressure, with the goal of bounding the staleness of the most recent persisted revision. In general, mixing semaphores and channels (or any other synchronization primitive) can result in tricky and difficult to reason about solutions. In this case, there is an implicit dependence between the values of the counter inside the semaphore, and a separate counter managed by the event loop to determine when to "release permits" in the modified semaphore. This can lead to a deadlock if the semaphore counter reaches zero before the event loop counter falls below the threshold to reset the semaphore. The current solution may work, but it is fragile since a change to just one of the counter thresholds (max permits or count value before reset) could lead to a deadlock that might be difficult to find during standard tests.

How this works

Replaces the PersistSemaphore + crossbeam::channel combination with PersistChannel, a channel-like abstraction built on the locks and condvars proposed in #1694. With PersistChannel, the PersistWorker and the PersistLoop no longer keep separate notions of progress: all progress is tracked in PersistChannelState.
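To make the idea concrete, here is a minimal sketch of a condvar-backed channel in which the permit count, the pending revision, and the shutdown flag all live behind one mutex. It is not the code in this PR: Channel, State, push, pop, mark_persisted, close, and run_demo are hypothetical names, revisions are plain u64 values, and a single persister thread is assumed.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical shared state: both sides read and write these fields under
// one mutex, so there is a single notion of progress.
#[derive(Default)]
struct State {
    latest_committed: Option<u64>, // newest revision awaiting persistence
    permits_available: u64,        // commits allowed before push() blocks
    shutdown: bool,
}

struct Channel {
    state: Mutex<State>,
    committer_cv: Condvar, // signalled when permits return or on close
    persister_cv: Condvar, // signalled when a revision arrives or on close
    max_permits: u64,
}

impl Channel {
    fn new(max_permits: u64) -> Self {
        assert!(max_permits > 0, "zero permits would block every push");
        let state = State { permits_available: max_permits, ..State::default() };
        Channel {
            state: Mutex::new(state),
            committer_cv: Condvar::new(),
            persister_cv: Condvar::new(),
            max_permits,
        }
    }

    // Committer side: blocks while no permits remain; fails after close().
    fn push(&self, revision: u64) -> Result<(), &'static str> {
        let mut state = self.state.lock().unwrap();
        while state.permits_available == 0 && !state.shutdown {
            state = self.committer_cv.wait(state).unwrap();
        }
        if state.shutdown {
            return Err("channel closed");
        }
        state.latest_committed = Some(revision); // newer revisions supersede older ones
        state.permits_available -= 1; // non-zero: the loop exited and shutdown is false
        self.persister_cv.notify_one();
        Ok(())
    }

    // Persister side: returns the newest pending revision and how many
    // commits it covers, or None once the channel is closed and drained.
    // Assumes a single persister that calls mark_persisted() between pops.
    fn pop(&self) -> Option<(u64, u64)> {
        let mut state = self.state.lock().unwrap();
        loop {
            if let Some(revision) = state.latest_committed.take() {
                return Some((revision, self.max_permits - state.permits_available));
            }
            if state.shutdown {
                return None;
            }
            state = self.persister_cv.wait(state).unwrap();
        }
    }

    // Returns the permits for commits that are now durable.
    fn mark_persisted(&self, covered: u64) {
        self.state.lock().unwrap().permits_available += covered;
        self.committer_cv.notify_all();
    }

    fn close(&self) {
        self.state.lock().unwrap().shutdown = true;
        self.committer_cv.notify_all();
        self.persister_cv.notify_all();
    }
}

// Runs one committer (the calling thread) against one persister thread and
// returns the revisions the persister handled, in order.
fn run_demo(max_permits: u64, commits: u64) -> Vec<u64> {
    let channel = Arc::new(Channel::new(max_permits));
    let persister = {
        let channel = Arc::clone(&channel);
        thread::spawn(move || {
            let mut persisted = Vec::new();
            while let Some((revision, covered)) = channel.pop() {
                persisted.push(revision); // stand-in for writing to disk
                channel.mark_persisted(covered);
            }
            persisted
        })
    };
    for revision in 0..commits {
        channel.push(revision).expect("channel is open");
    }
    channel.close();
    persister.join().expect("persister thread panicked")
}
```

Because push and pop evaluate their wait conditions under the same lock that guards the counters, there is no second counter that can drift out of step with the first, which is the failure mode described in #1694. How many intermediate revisions the persister sees depends on scheduling, but the last committed revision is always handled before pop returns None.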

This PR also adds a drop guard to the PersistLoop: if the background thread exits for any reason, the system is marked as shut down, which prevents the PersistWorker from hanging.
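The drop-guard pattern can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: Shared, ShutdownGuard, wait_for_shutdown, and run_demo are hypothetical names, and the sketch relies on the default panic = "unwind" behaviour, since destructors do not run under panic = "abort".

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical state shared between the background thread and its parent.
#[derive(Default)]
struct Shared {
    shutdown: Mutex<bool>,
    wake: Condvar,
}

// Hypothetical guard owned by the background thread for its whole lifetime.
struct ShutdownGuard(Arc<Shared>);

impl Drop for ShutdownGuard {
    fn drop(&mut self) {
        // Runs on a normal return and during panic unwinding alike. A
        // poisoned mutex is tolerated because the flag must be set anyway.
        let mut flag = self.0.shutdown.lock().unwrap_or_else(|e| e.into_inner());
        *flag = true;
        drop(flag);
        self.0.wake.notify_all();
    }
}

// Parent side: blocks until the background thread has signalled shutdown.
fn wait_for_shutdown(shared: &Shared) {
    let mut flag = shared.shutdown.lock().unwrap_or_else(|e| e.into_inner());
    while !*flag {
        flag = shared.wake.wait(flag).unwrap_or_else(|e| e.into_inner());
    }
}

// Spawns a worker that panics immediately and reports whether it panicked.
fn run_demo() -> bool {
    let shared = Arc::new(Shared::default());
    let worker = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            let _guard = ShutdownGuard(shared);
            panic!("simulated persistence failure");
        })
    };
    // Without the guard this wait would never return, because nothing else
    // sets the flag once the worker has died.
    wait_for_shutdown(&shared);
    worker.join().is_err()
}
```

Binding the guard to a named variable (`_guard`, not `_`) matters: a bare `_` pattern would drop the guard immediately and signal shutdown before the thread had done any work.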

How this was tested

CI and the existing deferred persistence unit tests.

@RodrigoVillar RodrigoVillar self-assigned this Feb 24, 2026
@RodrigoVillar RodrigoVillar changed the title from "Rodrigo/replace persist semaphore with channel condvar" to "refactor: replace PersistSemaphore with locks and condvars" Feb 25, 2026
@RodrigoVillar RodrigoVillar force-pushed the rodrigo/replace-persist-semaphore-with-channel-condvar branch from a0d2f34 to d63cb45 Compare February 25, 2026 15:29
@RodrigoVillar RodrigoVillar force-pushed the rodrigo/replace-persist-semaphore-with-channel-condvar branch 2 times, most recently from 9757aa7 to 0a22f43 Compare February 26, 2026 16:31
@RodrigoVillar RodrigoVillar changed the title from "refactor: replace PersistSemaphore with locks and condvars" to "refactor: replace PersistSemaphore with PersistChannel" Feb 26, 2026
@RodrigoVillar RodrigoVillar marked this pull request as ready for review February 26, 2026 16:47
Contributor

@bernard-avalabs bernard-avalabs left a comment

A few minor suggestions.

@RodrigoVillar RodrigoVillar force-pushed the rodrigo/replace-persist-semaphore-with-channel-condvar branch 3 times, most recently from a1dab9d to b72ad76 Compare March 3, 2026 16:05
@RodrigoVillar RodrigoVillar force-pushed the rodrigo/replace-persist-semaphore-with-channel-condvar branch from b72ad76 to 4c33dee Compare March 4, 2026 17:37
@RodrigoVillar RodrigoVillar force-pushed the rodrigo/replace-persist-semaphore-with-channel-condvar branch from 4c33dee to 8154a8c Compare March 4, 2026 21:16
Member

@rkuris rkuris left a comment

Mostly nits and some questions. Please re-request review if you make substantial changes.

@RodrigoVillar RodrigoVillar force-pushed the rodrigo/replace-persist-semaphore-with-channel-condvar branch from 8154a8c to 9590622 Compare March 5, 2026 14:42
rkuris and others added 4 commits March 5, 2026 09:59
This version might be slightly more efficient and has tighter
backpressure at the cost of higher synchronization complexity.

IMO the channel version is more flexible and straightforward for message
ordering and multiple work types. It does have more moving parts and
ends up creating a bit more work for the worker thread.
@RodrigoVillar RodrigoVillar force-pushed the rodrigo/replace-persist-semaphore-with-channel-condvar branch from 9590622 to fef1e1f Compare March 5, 2026 15:00
Contributor

@demosdemon demosdemon left a comment

just nits on overflowing arithmetic

Comment on lines 100 to +102
let persist_interval = NonZeroU64::new(commit_count.get().div_ceil(2))
    .expect("a nonzero div_ceil(2) is always positive");
let persist_threshold = commit_count.get().saturating_sub(persist_interval.get());
Contributor

nits:

  • the unwrap should no longer trigger a lint inside the const { ... } block, and div_ceil is defined on NonZero<_> and returns a NonZero<_>.
  • wrapping_sub is fine here because we logically know that persist_interval is less than or equal to commit_count, so the subtraction will never wrap.
Suggested change
- let persist_interval = NonZeroU64::new(commit_count.get().div_ceil(2))
-     .expect("a nonzero div_ceil(2) is always positive");
- let persist_threshold = commit_count.get().saturating_sub(persist_interval.get());
+ let persist_interval = commit_count.div_ceil(const { NonZeroU64::new(2).unwrap() });
+ let persist_threshold = commit_count.get().wrapping_sub(persist_interval.get());

It's important to be aware of the mathematical operations you're doing and not blanket choose saturating_, wrapping_, or checked_. Saturating and checked operations have overhead that we can and should skip if we know our logical operations won't trigger overflow.
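The arithmetic behind this claim can be checked with a small standalone sketch. persist_params is a hypothetical helper, not a function in the PR, and it uses the long-stable u64::div_ceil in place of the NonZero variant so that it does not depend on a recent toolchain. For commit_count = n >= 1, the interval is ceil(n / 2), which lies in 1..=n, so the threshold n - ceil(n / 2) equals floor(n / 2) and the subtraction cannot underflow.

```rust
use std::num::NonZeroU64;

// Hypothetical helper: returns (persist_interval, persist_threshold) for a
// given commit count, mirroring the two lines under review.
fn persist_params(commit_count: NonZeroU64) -> (NonZeroU64, u64) {
    let n = commit_count.get();
    // ceil(n / 2) is at least 1 whenever n is at least 1.
    let interval = NonZeroU64::new(n.div_ceil(2)).expect("ceil(n / 2) >= 1 for n >= 1");
    // interval <= n, so plain subtraction cannot underflow here.
    let threshold = n - interval.get();
    (interval, threshold)
}
```

For example, a commit count of 5 gives an interval of 3 and a threshold of 2. Since the result never leaves the valid range, wrapping_sub, saturating_sub, and plain subtraction all agree on every input.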

}

state.latest_committed = Some(revision);
state.permits_available = state.permits_available.saturating_sub(1);
Contributor

Another example here. The compiler might actually skip the extra work from saturation because we have an earlier check that permits_available is not zero. However, that check is not the only condition under which control flows to this point, so the compiler might conservatively check again. Because we know that state.shutdown is false at this point, we also know that state.permits_available is non-zero, so saturation is extra work.

Development

Successfully merging this pull request may close these issues.

Placeholder: investigate switching from a channel to locks and convars
The persist_worker could deadlock the parent on panic
