sim-rs: Improve simulation performance, observability, and memory management #821
Open
sandtreader wants to merge 7 commits into main
The flag was parsed but never wired into the simulation. The sim uses virtual time and already runs as fast as possible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Transactions in the per-node txs HashMap were never cleaned up, causing unbounded memory growth proportional to node count * total transactions. Prune transactions that are both older than a configurable max age and no longer present in the mempool. The mempool check ensures TXs that could still be included in future EBs (if previous voting failed) are retained. New config option `linear-tx-max-age-slots` (default: null/disabled). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
VTBundleNotGenerated events were already emitted but silently ignored by EventMonitor. Count them by NoVoteReason and print a breakdown after the vote stats, making issues like LateRBHeader immediately visible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
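The counting described in this commit can be sketched roughly as follows. This is an illustration only: `NoVoteStats`, `record`, and `breakdown` are hypothetical names, and the variant list is assumed from the reasons named in this PR (the real `NoVoteReason` enum may differ).

```rust
use std::collections::BTreeMap;

/// Assumed subset of reasons; the real enum in sim-rs may have other variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum NoVoteReason {
    InvalidSlot,
    ExtraIB,
    MissingIB,
    LateRBHeader,
}

/// Hypothetical accumulator, standing in for the relevant part of EventMonitor.
#[derive(Default)]
pub struct NoVoteStats {
    // BTreeMap keeps the printed breakdown in a stable order.
    counts: BTreeMap<NoVoteReason, u64>,
}

impl NoVoteStats {
    /// Called once per VTBundleNotGenerated event.
    pub fn record(&mut self, reason: NoVoteReason) {
        *self.counts.entry(reason).or_insert(0) += 1;
    }

    /// One line per reason that occurred at least once.
    pub fn breakdown(&self) -> Vec<String> {
        self.counts
            .iter()
            .map(|(reason, n)| format!("{reason:?}: {n}"))
            .collect()
    }
}
```

Reasons that never occurred are simply absent from the breakdown, so a healthy run prints nothing extra.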
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Linear Leios and Stracciatella variants don't produce IBs, so the end-of-run stats were printing empty/NaN IB lines. Gate all IB stats, IB-in-EB stats, IB latency, and IB network messages on a new LeiosVariant::has_ibs() method. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
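A minimal sketch of the gating idea, not the actual implementation: the variant list below is an assumption (only `Linear`, `Stracciatella`, and `FullWithoutIbs` are named in this PR; `Full` is added here as a hypothetical IB-producing variant), and `ib_stats_line` is an illustrative helper.

```rust
/// Assumed variant list; the real LeiosVariant enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LeiosVariant {
    Full, // hypothetical variant that does produce IBs
    FullWithoutIbs,
    Linear,
    Stracciatella,
}

impl LeiosVariant {
    /// True only for variants that produce input blocks (IBs).
    pub fn has_ibs(self) -> bool {
        !matches!(
            self,
            Self::FullWithoutIbs | Self::Linear | Self::Stracciatella
        )
    }
}

/// Hypothetical end-of-run helper: returns a stats line only when IBs exist.
pub fn ib_stats_line(variant: LeiosVariant, ibs_generated: u64) -> Option<String> {
    variant
        .has_ibs()
        .then(|| format!("{ibs_generated} IB(s) were generated."))
}
```

Returning `Option` keeps the caller from printing an empty or NaN line for variants without IBs.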
finish_task() previously sent a ClockEvent::FinishTask through the same mpsc channel as Wait/CancelWait events, creating contention. Now it does an atomic fetch_sub and signals a Notify, letting the coordinator wake without channel round-trips. Also handles the resulting race where time can advance before a Wait event arrives. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
This PR addresses three categories of issues encountered during long-running simulations:
a clock coordinator performance bottleneck, unbounded memory growth from accumulated
transactions, and gaps in end-of-run observability.
Performance: Clock coordinator optimisation
The `ClockCoordinator` previously received `FinishTask` events through its mpsc channel, the
same channel used for actor registration and time-wait requests. In simulations with
high CPU task throughput, this created a bottleneck: every task completion required the
coordinator to wake, match the event, and update its counter.
Fix: Replace the channel-based `FinishTask` path with an `AtomicUsize` counter
decremented directly by the completing actor, plus a `tokio::sync::Notify` to wake the
coordinator only when it's actually blocked waiting for tasks to drain. This eliminates
per-task-completion channel traffic entirely.
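The pattern can be sketched with the standard library alone. This is not the PR's code: `TaskTracker` and its methods are hypothetical names, and a `Mutex`/`Condvar` pair stands in for `tokio::sync::Notify` so the example runs without an async runtime.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical state shared between actors and the coordinator.
pub struct TaskTracker {
    running: AtomicUsize,
    // Stand-in for tokio::sync::Notify (assumption made for this sketch).
    lock: Mutex<()>,
    drained: Condvar,
}

impl TaskTracker {
    pub fn new() -> Self {
        Self { running: AtomicUsize::new(0), lock: Mutex::new(()), drained: Condvar::new() }
    }

    pub fn start_task(&self) {
        self.running.fetch_add(1, Ordering::AcqRel);
    }

    /// Called by the completing actor; no channel message is sent.
    pub fn finish_task(&self) {
        // fetch_sub returns the previous value, so 1 means the count just hit zero.
        if self.running.fetch_sub(1, Ordering::AcqRel) == 1 {
            // Taking the lock ensures the wake-up cannot slip in between the
            // coordinator's check of the counter and its call to wait().
            let _guard = self.lock.lock().unwrap();
            self.drained.notify_all();
        }
    }

    /// Called by the coordinator; blocks only while tasks are still running.
    pub fn wait_until_drained(&self) {
        let mut guard = self.lock.lock().unwrap();
        while self.running.load(Ordering::Acquire) != 0 {
            guard = self.drained.wait(guard).unwrap();
        }
    }

    pub fn running(&self) -> usize {
        self.running.load(Ordering::Acquire)
    }
}

/// Starts `n` tasks, finishes each on its own thread, and returns the
/// number still running once the coordinator side has been released.
pub fn run_demo(n: usize) -> usize {
    let tracker = Arc::new(TaskTracker::new());
    for _ in 0..n {
        tracker.start_task();
    }
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let t = Arc::clone(&tracker);
            thread::spawn(move || t.finish_task())
        })
        .collect();
    tracker.wait_until_drained();
    for h in handles {
        h.join().unwrap();
    }
    tracker.running()
}
```

Only the task that brings the counter to zero signals the coordinator; every other completion is a single atomic decrement.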
Also fixes a bug in `ClockBarrier::wait()` where `ts == self.now()` should have been
`ts <= self.now()`, causing waits for already-passed timestamps to block instead of
completing immediately.
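The comparison fix can be illustrated in isolation. The names below (`WaitDecision`, `decide_wait`) are hypothetical and timestamps are modelled as plain `u64` values, which is an assumption; the real `ClockBarrier` is not shown in this PR description.

```rust
/// Hypothetical outcome of a wait request for timestamp `ts`.
#[derive(Debug, PartialEq, Eq)]
pub enum WaitDecision {
    CompleteNow,
    Block,
}

/// Decides whether a wait for `ts` can return immediately at time `now`.
pub fn decide_wait(ts: u64, now: u64) -> WaitDecision {
    // `<=` rather than `==`: once finish_task() no longer goes through the
    // channel, time may advance before the Wait request arrives, so `ts`
    // can already be in the past and must not block.
    if ts <= now {
        WaitDecision::CompleteNow
    } else {
        WaitDecision::Block
    }
}
```

With the old `==` check, `decide_wait(5, 10)` would have fallen through to the blocking branch.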
Memory: Slot-based transaction pruning
During long simulations (hundreds of slots), every node's
`txs: HashMap<TransactionId, TransactionView>` grows without bound: nodes accumulate
every transaction they've ever seen. For 100-node simulations running 500+ slots, this
causes significant memory pressure.
Fix: Add a `prune_old_txs()` pass at the start of each slot that removes transactions
older than `linear-tx-max-age-slots` (configurable, default disabled). Transactions still
in the mempool are retained regardless of age. The `Mempool` struct gains a
`HashSet<TransactionId>` for O(1) membership checks.
The age threshold of 23 slots used in `linear.yaml` is derived from: vote stage (5) +
diffuse stage (5) + 3× header diffusion time (3) + buffer (10).
Observability improvements
`LeiosVariant::has_ibs()` gates IB-related statistics, so variants like
`FullWithoutIbs` and `Linear` no longer show misleading zero-IB stats.
certification threshold.
votes were not generated (e.g., `InvalidSlot`, `ExtraIB`, `MissingIB`).
Cleanup
Remove the vestigial `-t`/`--timescale` CLI flag: the simulator runs in virtual time and
this flag was unused.
New configuration
`linear-tx-max-age-slots` (type: `number | null`, default: `null`): `null` disables pruning.
Test plan
`cargo test` passes
🤖 Generated with Claude Code