apollo_network_benchmark: add message index detection mechanism #11557

Open
sirandreww-starkware wants to merge 1 commit into 01-08-apollo_network_benchmark_add_messageindextracker_struct from 01-08-apollo_network_benchmark_add_message_index_detection_mechanism
sirandreww-starkware commented Jan 8, 2026

Note

Medium Risk
Adds new async coordination between the receive path and an index-tracking task via an unbounded channel, which can impact runtime behavior and memory under high load. Also changes the receive_stress_test_message callback signature and receiver task wiring, which is easy to get wrong and could break compilation or message processing if misused.

Overview
Adds message index gap detection to the broadcast network stress test node by sending (sender_id, message_index) from receive_stress_test_message into a new record_indexed_message task that maintains per-peer MessageIndexTracker state.

Introduces a new gauge metric, RECEIVE_MESSAGE_PENDING_COUNT, and updates the node to run two receive-side tasks (receiver + index tracker) instead of a single receiver task, so dashboards can observe how many messages are still “missing” based on observed indices.

Written by Cursor Bugbot for commit 2186fdd. This will update automatically on new commits.
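
The gap-detection flow in the overview above could look roughly like the following sketch. `MessageIndexTracker` is defined in the downstack PR and its real API is not shown on this page, so `GapTracker`, `record`, `pending_count`, and `record_and_update_total` are hypothetical stand-ins, and the returned total stands in for the value the `RECEIVE_MESSAGE_PENDING_COUNT` gauge would be set to. The delta update mirrors the `old_pending` read visible in the reviewed snippet, but the rest of the loop body is an assumption.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for `MessageIndexTracker` (the real struct is in the downstack PR).
#[derive(Default)]
struct GapTracker {
    /// One past the highest index seen so far.
    next_expected: u64,
    /// Indices below `next_expected` that have not arrived yet.
    missing: BTreeSet<u64>,
}

impl GapTracker {
    /// Records one received index for this peer.
    fn record(&mut self, index: u64) {
        if index >= self.next_expected {
            // Every skipped index becomes a pending (missing) message.
            self.missing.extend(self.next_expected..index);
            self.next_expected = index + 1;
        } else {
            // A late arrival fills a gap; a duplicate is a no-op.
            self.missing.remove(&index);
        }
    }

    /// Number of indices that were skipped and have not arrived yet.
    fn pending_count(&self) -> u64 {
        self.missing.len() as u64
    }
}

/// Records `index` for one peer and returns the updated total across all peers.
/// `all_pending` must be the sum of every peer's pending count, so it is never
/// smaller than this peer's `old_pending`.
fn record_and_update_total(tracker: &mut GapTracker, all_pending: u64, index: u64) -> u64 {
    let old_pending = tracker.pending_count();
    tracker.record(index);
    // Apply only this peer's delta to the running total.
    all_pending - old_pending + tracker.pending_count()
}
```

Applying the per-peer delta keeps the total cheap to maintain: each message touches one tracker rather than re-summing all of them. A real tracker would likely want a more compact representation than a set of every missing index if large gaps are expected.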

sirandreww-starkware commented Jan 8, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

github-actions bot commented Feb 8, 2026

There hasn't been any activity on this pull request recently, and in order to prioritize active work, it has been marked as stale.
This PR will be closed and locked in 7 days if no further activity occurs.
Thank you for your contributions!

github-actions bot added the stale label Feb 8, 2026
github-actions bot closed this Feb 16, 2026

    let mut index_tracker = vec![MessageIndexTracker::default(); num_peers];
    let mut all_pending = 0;
    while let Some((peer_id, index)) = rx.recv().await {
        let old_pending = index_tracker[peer_id].pending_messages_count();

Sender ID indexing can panic receiver

High Severity

record_indexed_message indexes index_tracker by sender_id, but the vector is sized with bootstrap.len(). This assumes sender IDs are dense zero-based indices, which NodeArgs.runner.id does not enforce. Valid deployments with sparse or non-zero-based IDs can trigger out-of-bounds access and crash the receiver path.
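
One way to avoid assuming dense, zero-based IDs is to key the per-peer state by sender ID instead of indexing a vector. The sketch below is only an illustration of that idea: `PeerState` and `record` are hypothetical names, and `PeerState` is a simplified stand-in rather than the PR's `MessageIndexTracker`.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified per-peer state (not the PR's `MessageIndexTracker`).
#[derive(Default, Debug, PartialEq)]
struct PeerState {
    highest_index: Option<u64>,
    messages_seen: u64,
}

/// Records a message for `sender_id`, creating that peer's state on first sight.
/// Sparse or non-zero-based IDs are fine because nothing is indexed by position.
fn record(peers: &mut HashMap<u64, PeerState>, sender_id: u64, index: u64) {
    let state = peers.entry(sender_id).or_default();
    state.messages_seen += 1;
    state.highest_index = Some(state.highest_index.map_or(index, |h| h.max(index)));
}
```

The trade-off is a hash lookup per message in place of a direct index, in exchange for removing the out-of-bounds panic entirely.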

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

        }
        .boxed()
        let (tx, rx) = tokio::sync::mpsc::unbounded_channel();
        let num_peers = self.args.runner.bootstrap.len();

Index tracker sized by bootstrap peers, indexed by sender_id

High Severity

num_peers is set to self.args.runner.bootstrap.len(), which is the number of other peers (N−1 for N nodes). But record_indexed_message uses this to size index_tracker and indexes it by sender_id (which is runner.id, ranging from 0 to N−1). When a message arrives from the node with the highest ID, index_tracker[N-1] is out of bounds on a vec of length N−1, causing a panic at runtime.
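
If the vector is kept, the off-by-one described here could be addressed by sizing it for all N nodes and using a checked lookup, so an unexpected ID is reported rather than crashing the task. This is a minimal sketch under the reviewer's premise that `bootstrap.len()` is N−1 and that IDs fall in `0..N`; `tracker_slots` and `record_checked` are hypothetical helpers, and a plain counter stands in for the tracker.

```rust
/// Number of tracker slots when `bootstrap_len` counts only the *other* peers.
/// Assumes IDs are dense in `0..N`, which the review notes is not enforced.
fn tracker_slots(bootstrap_len: usize) -> usize {
    bootstrap_len + 1
}

/// Bumps the per-sender counter and returns `true`, or returns `false`
/// (instead of panicking) when `sender_id` has no slot.
fn record_checked(counts: &mut [u64], sender_id: usize) -> bool {
    match counts.get_mut(sender_id) {
        Some(slot) => {
            *slot += 1;
            true
        }
        None => false,
    }
}
```

A caller could log or count the `false` case, which keeps the receive path alive even if the density assumption turns out to be wrong.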


sirandreww-starkware force-pushed the 01-08-apollo_network_benchmark_add_message_index_detection_mechanism branch from f10f148 to 6bed8fb, February 19, 2026 08:04
sirandreww-starkware force-pushed the 01-08-apollo_network_benchmark_add_messageindextracker_struct branch from b05e636 to 41a1832, February 19, 2026 08:04
sirandreww-starkware force-pushed the 01-08-apollo_network_benchmark_add_messageindextracker_struct branch from 41a1832 to 3093d1d, March 16, 2026 15:13
sirandreww-starkware force-pushed the 01-08-apollo_network_benchmark_add_message_index_detection_mechanism branch from 6bed8fb to 2186fdd, March 16, 2026 15:13