Conversation

@songgaoye songgaoye commented Jan 16, 2026

Closes: #XXX

Description

The optimization provides 40-44% performance improvement across all batch sizes by:

  • Replacing HashMap with Vec for sequential index access

  • Eliminating cloning by using drain() instead of iter() + clone()

  • Pre-allocating vector capacity

Batch Size | Legacy (clone) | Optimized (drain) | Improvement
32         | 286 µs         | 171 µs            | 40%
64         | 548 µs         | 321 µs            | 41%
128        | 1,115 µs       | 640 µs            | 43%
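
The before/after shape of the change, sketched with a toy Message type in place of Hermes' TransitMessage (a hedged illustration of the pattern described above, not the actual relay_path.rs or benchmark code):

use std::collections::HashMap;

#[derive(Clone)]
struct Message(u64);

// Legacy shape: a HashMap keyed by position, filled by cloning each message.
fn legacy(batch: &[Message]) -> HashMap<usize, Message> {
    batch.iter().cloned().enumerate().collect()
}

// Optimized shape: a pre-allocated Vec filled by draining (moving) the batch,
// so no Message is cloned and lookups become sequential index accesses.
fn optimized(batch: &mut Vec<Message>) -> Vec<Message> {
    let mut out = Vec::with_capacity(batch.len());
    out.extend(batch.drain(..));
    out
}

fn main() {
    let mut batch: Vec<Message> = (0..4).map(Message).collect();
    let _by_index = legacy(&batch);
    let moved = optimized(&mut batch);
    assert!(batch.is_empty() && moved.len() == 4);
}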

PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).
    • If guide has been updated, tag GitHub user mircea-c
  • Tagged one reviewer who will be the one responsible for shepherding this PR.

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

@KirilMihaylov KirilMihaylov left a comment

Heavy refactoring necessary.

P.S.:
Leaving this review as a member of the @nolus-protocol dev team.

use std::str::FromStr;
use std::time::Duration;

use criterion::{black_box, criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion};

criterion::black_box is deprecated; its deprecation notice says to use std::hint::black_box instead.
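
One way to address this in the benchmark's imports, assuming nothing else relies on the criterion re-export (std::hint::black_box has been stable since Rust 1.66):

use std::hint::black_box;
use std::str::FromStr;
use std::time::Duration;

use criterion::{criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion};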

for gm in odata.batch.drain(..) {

Capacity is not reused later; it is instead replaced (L1739, R1737) with an empty vector, whose construction is purely computational (a const fn) and does not allocate.
That can instead be done via std::mem::take, which will also allow current or future versions of std::vec::Vec to take advantage of the fact that the whole container is consumed.

Suggested change
for gm in odata.batch.drain(..) {
for gm in std::mem::take(&mut odata.batch) {

@songgaoye songgaoye (Author)

Using Vec::with_capacity there is intentional: we need to take ownership of every TransitMessage so we can move “timed-out” ones into timed_out and keep only the survivors.
The current flow (for gm in odata.batch.drain(..) { … } followed by odata.batch = retain_batch;) gives us owned gms and preserves ordering; the price is allocating a fresh buffer for retain_batch.
Replacing this with std::mem::take doesn’t actually remove that allocation: once you take the batch you still need a second vector to store the retained items (because retain_mut/retain only hand out &mut references, which we can’t move), so you’d end up allocating again for that second vector or cloning messages.
In short, there’s no practical win here—mem::take can’t simultaneously give us ownership of the drained messages and let us reuse the original buffer without deeper restructuring, so the current approach at relay_path.rs (lines 1688-1739) is fine as-is.
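
A minimal, self-contained sketch of the flow described above, with u64 standing in for TransitMessage and a hypothetical is_timed_out predicate (not the actual relay_path.rs code):

fn split_timed_out(batch: &mut Vec<u64>, is_timed_out: impl Fn(&u64) -> bool) -> Vec<u64> {
    let mut timed_out = Vec::new();
    let mut retain_batch = Vec::with_capacity(batch.len());
    for gm in batch.drain(..) {
        if is_timed_out(&gm) {
            timed_out.push(gm); // move the owned message out
        } else {
            retain_batch.push(gm); // keep the survivor; ordering is preserved
        }
    }
    *batch = retain_batch; // the cost: a freshly allocated buffer for the survivors
    timed_out
}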

@KirilMihaylov KirilMihaylov replied

I'm sorry, but I'll have to strongly disagree. Please look at the signature and documentation of std::mem::take: it only needs the given mutable reference temporarily, replaces the referenced value with Default::default(), and returns the old value. It's a convenience wrapper around std::mem::replace.
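
A minimal illustration of that behavior (plain std, nothing Hermes-specific):

fn main() {
    let mut batch = vec![1, 2, 3];
    // Equivalent to std::mem::replace(&mut batch, Vec::new()): the old Vec,
    // buffer and all, is handed back; an empty, non-allocating Vec is left behind.
    let owned: Vec<i32> = std::mem::take(&mut batch);
    assert_eq!(owned, [1, 2, 3]);
    assert!(batch.is_empty());
}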

The code base is already in an unacceptable state, and we at @nolus-protocol are currently rewriting parts of it, with the possibility of upstreaming, so I personally would rather not add more unnecessary complexity.

I have already removed the need for retain_batch, as it's very error-prone, and instead add the retained messages directly back to odata.batch.

@KirilMihaylov KirilMihaylov Jan 23, 2026

Leaving our implementation, which can be used as a reference and as a proof of concept that this approach works.

https://github.com/nolus-protocol/hermes/blob/hermes-lite-stable/crates/relayer/src/link/relay_path.rs#L1723

Link to the commit: nolus-protocol/hermes@52479c3

let mut all_dst_odata = self.dst_operational_data.clone_vec();

let mut timed_out: HashMap<usize, OperationalData> = HashMap::default();
let mut timed_out: Vec<Option<OperationalData>> = vec![None; all_dst_odata.len()];

[timed_out 1/4]

This only wastes computational time and resources, as it isn't necessary to allocate it with the same length. On top of that, filling it with None instances is also redundant, since all_dst_odata is traversed sequentially from start to end and elements can simply be pushed to the back.

Suggested change
let mut timed_out: Vec<Option<OperationalData>> = vec![None; all_dst_odata.len()];
let mut timed_out = vec![];

If proven necessary, it could be allocated with max(1, half-capacity), as I believe that it's highly unlikely during normal circumstances to get past half-capacity.

let mut timed_out = Vec::with_capacity(1.max(all_dst_odata.len() >> 1));

let mut timed_out: Vec<Option<OperationalData>> = vec![None; all_dst_odata.len()];

// For each operational data targeting the destination chain...
for (odata_pos, odata) in all_dst_odata.iter_mut().enumerate() {

[timed_out 2/4]

Suggested change
for (odata_pos, odata) in all_dst_odata.iter_mut().enumerate() {
for odata in &mut all_dst_odata {

Comment on lines +1705 to +1719
let slot = timed_out
    .get_mut(odata_pos)
    .expect("timed_out vector is sized to match operational data");
slot.get_or_insert_with(|| {
    OperationalData::new(
        dst_current_height,
        OperationalDataTarget::Source,
        odata_tracking_id,
        self.channel.connection_delay,
    )
})
.push(TransitMessage {
    event_with_height: gm.event_with_height.clone(),
    msg: new_msg,
});

[timed_out 3/4]

Suggested change
let mut odata = OperationalData::new(
    dst_current_height,
    OperationalDataTarget::Source,
    odata_tracking_id,
    self.channel.connection_delay,
);
odata.push(TransitMessage { msg: new_msg, ..gm });
timed_out.push(odata);


// Schedule new operational data targeting the source chain
for (_, new_od) in timed_out.into_iter() {
for new_od in timed_out.into_iter().flatten() {

[timed_out 4/4]

Suggested change
for new_od in timed_out.into_iter().flatten() {
for new_od in timed_out {
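
Putting the four [timed_out] suggestions together, a hedged sketch of the resulting loop, with toy types standing in for OperationalData and TransitMessage and a hypothetical has_timed_out check (the real implementation is the linked relay_path.rs above):

struct TransitMessage {
    msg: u64,
}

struct OperationalData {
    batch: Vec<TransitMessage>,
}

fn reschedule_timeouts(
    all_dst_odata: &mut Vec<OperationalData>,
    has_timed_out: impl Fn(&TransitMessage) -> bool,
) -> Vec<OperationalData> {
    let mut timed_out = vec![]; // plain Vec; no pre-sized Option slots
    for odata in all_dst_odata.iter_mut() {
        // Take the whole batch by value; odata.batch is left empty (no allocation).
        for gm in std::mem::take(&mut odata.batch) {
            if has_timed_out(&gm) {
                // New operational data is simply pushed to the back.
                timed_out.push(OperationalData { batch: vec![gm] });
            } else {
                odata.batch.push(gm); // survivors go straight back; no retain_batch
            }
        }
    }
    timed_out
}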

This micro-benchmark only measures the difference between std::vec::Vec and std::collections::HashMap, as it does NOT test against the implementation actually used but rather a dummy, out-of-sync version of the algorithm.

That makes this micro-benchmark wholly redundant, as the algorithm I have provided does not allocate without necessity AND does not allocate for the whole index range.
