Skip to content

apollo_network_benchmark: draft#12726

Open
sirandreww-starkware wants to merge 1 commit into02-18-apollo_network_benchmark_moved_peerid_creation_filesfrom
02-16-apollo_network_benchmark_draft
Open

apollo_network_benchmark: draft#12726
sirandreww-starkware wants to merge 1 commit into02-18-apollo_network_benchmark_moved_peerid_creation_filesfrom
02-16-apollo_network_benchmark_draft

Conversation

@sirandreww-starkware
Copy link
Contributor

@sirandreww-starkware sirandreww-starkware commented Feb 19, 2026

Note

Medium Risk
Adds substantial new orchestration code that builds Docker images, runs shell commands, and generates Kubernetes manifests, so failures could impact developer workflows and cluster resources (namespaces, jobs, port-forwards). No production runtime or security-critical logic appears to be modified beyond the benchmark tooling.

Overview
Introduces a new apollo_network_benchmark_run orchestrator binary to spin up multi-node network stress tests either locally (Docker Compose) or on GKE (Indexed Job), including commands to start/stop deployments, fetch pod logs, and port-forward Grafana/Prometheus.

Adds end-to-end observability packaging for the benchmark: generated Grafana dashboard JSON (covering network, Propeller, and Tokio runtime metrics), Prometheus scrape config generation, and checked-in Grafana/Prometheus/K8s config templates.

Adds Docker runtime assets (Dockerfile.fast/Dockerfile.slow, entrypoint.sh) to run the node with optional traffic shaping via tc (LATENCY/THROUGHPUT) and deterministic node IDs/hostnames for cluster runs, plus new crate dependencies (anyhow, chrono, serde_json, tokio-stream, apollo_propeller, etc.).

Written by Cursor Bugbot for commit 71a3230. This will update automatically on new commits. Configure here.

@reviewable-StarkWare
Copy link

This change is Reviewable

This was referenced Feb 19, 2026
Copy link
Contributor Author

sirandreww-starkware commented Feb 19, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@sirandreww-starkware sirandreww-starkware self-assigned this Feb 19, 2026
@sirandreww-starkware sirandreww-starkware marked this pull request as ready for review February 19, 2026 07:27
@sirandreww-starkware sirandreww-starkware requested a review from a team as a code owner February 19, 2026 07:27
@github-actions
Copy link

github-actions bot commented Feb 19, 2026

[security]
disable_initial_admin_creation = true
cookie_secure = false
cookie_samesite = lax
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate INI section causes config split

Medium Severity

The grafana.ini file has two separate [security] sections — one at line 5 setting admin_user, admin_password, allow_sign_up, and disable_gravatar, and another at line 53 setting disable_initial_admin_creation, cookie_secure, and cookie_samesite. Depending on the INI parser behavior, the first section's keys may be silently dropped or the sections may merge unpredictably. These settings belong in a single [security] block.

Additional Locations (1)

Fix in Cursor Fix in Web


delete_namespace(namespace_name, true)?;
create_namespace(namespace_name)?;
delete_namespace(namespace_name, false)?;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stop function unnecessarily creates namespace before deleting

Medium Severity

The cluster_stop::run function performs a delete-create-delete sequence on the Kubernetes namespace. The create_namespace call between the two deletes is problematic: if the first delete_namespace puts the namespace in a Terminating state (common with stuck finalizers or slow resource cleanup), create_namespace will fail because Kubernetes doesn't allow creating a namespace while one with the same name is terminating. This causes the entire stop operation to error out, leaving resources uncleaned.

Fix in Cursor Fix in Web

@sirandreww-starkware sirandreww-starkware force-pushed the 02-18-apollo_network_benchmark_moved_peerid_creation_files branch from d9248de to 41712e4 Compare February 19, 2026 08:04
@sirandreww-starkware sirandreww-starkware force-pushed the 02-16-apollo_network_benchmark_draft branch from 5306714 to 2371fa1 Compare February 19, 2026 08:04
@sirandreww-starkware sirandreww-starkware force-pushed the 02-16-apollo_network_benchmark_draft branch from 2371fa1 to 71a3230 Compare March 16, 2026 15:13
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

(m(&BROADCAST_MESSAGE_THROUGHPUT), "binBps"),
(rate(&RECEIVE_MESSAGE_BYTES_SUM, "20s"), "binBps"),
(m(&RECEIVE_MESSAGE_PENDING_COUNT), "short"),
("receive_message_negative_delay_seconds_count".into(), "short"),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dashboard references metrics that don't exist in codebase

Medium Severity

The Grafana dashboard configuration references receive_message_negative_delay_seconds_count and receive_message_negative_delay_seconds_bucket in multiple panels, but these metrics are never defined in metrics.rs or registered anywhere. These dashboard panels will permanently show "No data," giving operators a misleading view of clock-skew behavior.

Additional Locations (1)
Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants