@rustielin rustielin commented Dec 17, 2025

Description

Support Prometheus and Prometheus-compatible metrics sinks via remote write. Previously, only the VictoriaMetrics Prometheus import endpoint was supported.

  • New Prometheus Remote Write 1.0 client
  • Refactor some of the configs, including adding a BasicAuth type
  • Improve the e2e testing setup for manual testing. For now it may still be a bit too complex to run in CI, but the scripts are more portable now, so it is possible to replicate.
  • Add a healthcheck
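
For readers unfamiliar with Remote Write 1.0, the sketch below shows the request shape a sender generally has to produce: each series carries a `__name__` label, labels are sorted by name, and the POST carries a fixed set of headers. All type and function names here (`Label`, `Sample`, `TimeSeries`, `normalize_series`, `remote_write_headers`) are hypothetical and not taken from this PR; the protobuf encoding and snappy compression steps are deliberately left out.

```rust
// Illustrative sketch only: these types and helpers are hypothetical and are
// not the PR's implementation. Protobuf encoding and snappy compression of
// the request body are omitted.

/// One label pair on a series, e.g. ("__name__", "up").
#[derive(Debug, Clone, PartialEq)]
pub struct Label {
    pub name: String,
    pub value: String,
}

/// One sample: a float value and a Unix timestamp in milliseconds.
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub value: f64,
    pub timestamp_ms: i64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct TimeSeries {
    pub labels: Vec<Label>,
    pub samples: Vec<Sample>,
}

/// Checks that the series has a `__name__` label and sorts its labels by
/// name, as Remote Write 1.0 expects. Returns an error message otherwise.
pub fn normalize_series(mut series: TimeSeries) -> Result<TimeSeries, String> {
    if !series.labels.iter().any(|l| l.name == "__name__") {
        return Err("series is missing the __name__ label".to_string());
    }
    series.labels.sort_by(|a, b| a.name.cmp(&b.name));
    Ok(series)
}

/// Headers the Remote Write 1.0 spec lists for each POST, besides User-Agent.
pub fn remote_write_headers() -> Vec<(&'static str, &'static str)> {
    vec![
        ("Content-Encoding", "snappy"),
        ("Content-Type", "application/x-protobuf"),
        ("X-Prometheus-Remote-Write-Version", "0.1.0"),
    ]
}
```

A sender would typically normalize every series, encode the batch as a protobuf `WriteRequest`, compress it with snappy block format, and POST it with these headers.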

How Has This Been Tested?

# Start everything
$ ./e2e-test/setup.sh 

# Run the tests
$ ./e2e-test/run-test.sh

...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Phase 5: Final Verification
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Waiting for metrics to be ingested...

Checking VictoriaMetrics...
  Found 14 time series in VictoriaMetrics

Checking Prometheus...
  Found 14 time series in Prometheus

Checking Loki...
  Found 10 log streams in Loki

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Test Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Test Configuration:
  • Total accounts tested: 11
  • Iterations per account: 1
  • Contract address: beff311f6b71da98c69aa838dd4b36ed3d919ce73743005464749bdf3da07066

Metrics Backends:
  • VictoriaMetrics: 14 series
  • Prometheus:      14 series
  • Loki:            10 streams

========================================
  ✓ E2E TEST PASSED                    
========================================

Key Areas to Review

New Prometheus remote write sink

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

Note

Introduces a Prometheus Remote Write metrics sink and refactors ingestion to support multiple backends with flexible auth.

  • New prometheus_remote_write client (protobuf + snappy) and unified MetricsIngestClient; refactors VictoriaMetrics client
  • Config revamp: add backend_type (victoria_metrics|prometheus_remote_write / humio|loki), auth_type (bearer|basic|none), keys_env_var/basic_auth_env_var, and support both metrics_sink and metrics_sinks
  • Enhance Humio/Loki clients with basic auth and none; extend logs and metrics sinks across custom contracts and main service
  • Add /api/v1/health endpoint; update tests to use new clients; minor renames/struct moves
  • E2E tooling: Docker Compose adds Prometheus and Grafana; new setup/run/cleanup scripts and configs to validate ingestion into VictoriaMetrics, Prometheus, and Loki
  • Dependencies: add prost and snap
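
As a rough illustration of the config values listed above, the sketch below parses the `backend_type` and `auth_type` strings for a metrics sink into enums. `SinkAuthType::None` appears in the PR's diff; the `MetricsBackendType` name, the other variant names, and the error strings are assumptions made for this example, and the real config is presumably deserialized with serde rather than parsed by hand.

```rust
use std::str::FromStr;

// Hypothetical sketch: only `SinkAuthType::None` is visible in the PR diff.
// Every other name below is an assumption for illustration.

/// Metrics backends named in the summary above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetricsBackendType {
    VictoriaMetrics,
    PrometheusRemoteWrite,
}

/// Auth modes named in the summary above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SinkAuthType {
    Bearer,
    Basic,
    None,
}

impl FromStr for MetricsBackendType {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "victoria_metrics" => Ok(Self::VictoriaMetrics),
            "prometheus_remote_write" => Ok(Self::PrometheusRemoteWrite),
            other => Err(format!("unknown backend_type: {other}")),
        }
    }
}

impl FromStr for SinkAuthType {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "bearer" => Ok(Self::Bearer),
            "basic" => Ok(Self::Basic),
            "none" => Ok(Self::None),
            other => Err(format!("unknown auth_type: {other}")),
        }
    }
}
```

Rejecting unknown strings up front means a typo in a sink config fails at load time instead of silently falling back to a default backend or auth mode.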

Written by Cursor Bugbot for commit a8c9dc0.

@rustielin rustielin force-pushed the 12-17-_telemetry_support_prometheus_sink branch 3 times, most recently from c408abc to d795e3c Compare December 18, 2025 16:30
@rustielin rustielin force-pushed the 12-17-_telemetry_support_prometheus_sink branch 2 times, most recently from 82accfa to df61586 Compare January 8, 2026 20:25
@rustielin rustielin marked this pull request as ready for review January 8, 2026 20:26
@rustielin rustielin requested a review from ibalajiarun as a code owner January 8, 2026 20:26
@rustielin rustielin force-pushed the 12-17-_telemetry_support_prometheus_sink branch from df61586 to a8c9dc0 Compare January 8, 2026 20:29
@rustielin rustielin requested review from a team and JoshLind January 8, 2026 20:34
)
}
},
SinkAuthType::None => victoria_metrics::AuthToken::Bearer("".to_string()),

Metrics clients send empty Bearer auth when auth_type is none

Medium Severity

When auth_type: none is configured, metrics clients (VictoriaMetricsClient, PrometheusRemoteWriteClient) still send an Authorization: Bearer header with an empty token. This happens because SinkAuthType::None maps to AuthToken::Bearer("") instead of truly disabling authentication. The Loki client correctly handles this with a LokiAuth::None variant that skips adding any auth header, but the metrics AuthToken enum lacks a None variant. This inconsistency could cause issues with strict servers that reject malformed auth headers.
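
One possible shape for the fix, sketched under assumptions: give the metrics `AuthToken` enum a `None` variant and have callers attach an `Authorization` header only when a value is returned. The `Basic` variant and the `authorization_header` method are hypothetical names for this example and may not match the code in the PR.

```rust
// Hypothetical sketch: at the reviewed commit, the PR's AuthToken has no
// None variant. The Basic variant and method name are assumptions.

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AuthToken {
    /// Token sent as `Authorization: Bearer <token>`.
    Bearer(String),
    /// Already base64-encoded `user:password` credentials.
    Basic(String),
    /// No authentication: no Authorization header is sent at all.
    None,
}

impl AuthToken {
    /// Returns the Authorization header value, or `None` (the Option
    /// variant) when the request should carry no auth header.
    pub fn authorization_header(&self) -> Option<String> {
        match self {
            AuthToken::Bearer(token) => Some(format!("Bearer {token}")),
            AuthToken::Basic(encoded) => Some(format!("Basic {encoded}")),
            AuthToken::None => None,
        }
    }
}
```

With this shape, `SinkAuthType::None` could map to `AuthToken::None`, and a client would skip the header whenever `authorization_header()` returns `None`, which mirrors what the comment describes for `LokiAuth::None`.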

Additional Locations (2)

@rustielin rustielin force-pushed the 12-17-_telemetry_support_prometheus_sink branch from a8c9dc0 to e032be9 Compare January 8, 2026 21:12

@JoshLind JoshLind left a comment

Unblocking!

(Big boss @ibalajiarun can give the final stamp 😄 Most of this is dark magic to me.)

serde = { workspace = true }
serde_json = { workspace = true }
serde_yaml = { workspace = true }
snap = "1.1"
Contributor

You don't want to use the workspace version?

Contributor Author

It's only used in this crate actually, so it doesn't make sense to include it in the workspace.

@rustielin rustielin enabled auto-merge (squash) January 12, 2026 18:49
@github-actions

✅ Forge suite compat success on e3a949a68b4c493613c58ef894c639cad6a115e8 ==> e032be912f281fae7678548f71e1d61cb4873f69

Compatibility test results for e3a949a68b4c493613c58ef894c639cad6a115e8 ==> e032be912f281fae7678548f71e1d61cb4873f69 (PR)
1. Check liveness of validators at old version: e3a949a68b4c493613c58ef894c639cad6a115e8
compatibility::simple-validator-upgrade::liveness-check : committed: 13568.30 txn/s, latency: 2567.66 ms, (p50: 2700 ms, p70: 2900, p90: 3100 ms, p99: 3400 ms), latency samples: 445280
2. Upgrading first Validator to new version: e032be912f281fae7678548f71e1d61cb4873f69
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 5887.17 txn/s, latency: 5737.73 ms, (p50: 6400 ms, p70: 6400, p90: 6500 ms, p99: 6600 ms), latency samples: 204440
3. Upgrading rest of first batch to new version: e032be912f281fae7678548f71e1d61cb4873f69
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 5849.53 txn/s, latency: 5811.60 ms, (p50: 6400 ms, p70: 6500, p90: 6600 ms, p99: 6700 ms), latency samples: 200160
4. upgrading second batch to new version: e032be912f281fae7678548f71e1d61cb4873f69
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9850.53 txn/s, latency: 3448.40 ms, (p50: 3700 ms, p70: 3800, p90: 3900 ms, p99: 4100 ms), latency samples: 323600
5. check swarm health
Compatibility test for e3a949a68b4c493613c58ef894c639cad6a115e8 ==> e032be912f281fae7678548f71e1d61cb4873f69 passed
Test Ok

@github-actions

✅ Forge suite realistic_env_max_load success on e032be912f281fae7678548f71e1d61cb4873f69

two traffics test: inner traffic : committed: 13664.05 txn/s, latency: 2757.00 ms, (p50: 2700 ms, p70: 2900, p90: 3000 ms, p99: 3400 ms), latency samples: 5087060
two traffics test : committed: 100.02 txn/s, latency: 751.19 ms, (p50: 700 ms, p70: 800, p90: 900 ms, p99: 1100 ms), latency samples: 1780
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 2.235, avg: 2.142", "ConsensusProposalToOrdered: max: 0.170, avg: 0.166", "ConsensusOrderedToCommit: max: 0.048, avg: 0.043", "ConsensusProposalToCommit: max: 0.215, avg: 0.209"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.50s no progress at version 32334 (avg 0.07s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.26s no progress at version 2436322 (avg 0.26s) [limit 16].
Test Ok

@github-actions

✅ Forge suite framework_upgrade success on e3a949a68b4c493613c58ef894c639cad6a115e8 ==> e032be912f281fae7678548f71e1d61cb4873f69

Compatibility test results for e3a949a68b4c493613c58ef894c639cad6a115e8 ==> e032be912f281fae7678548f71e1d61cb4873f69 (PR)
Upgrade the nodes to version: e032be912f281fae7678548f71e1d61cb4873f69
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2199.83 txn/s, submitted: 2208.19 txn/s, failed submission: 8.36 txn/s, expired: 8.36 txn/s, latency: 1316.40 ms, (p50: 1200 ms, p70: 1500, p90: 1800 ms, p99: 2700 ms), latency samples: 200103
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2119.26 txn/s, submitted: 2125.40 txn/s, failed submission: 6.14 txn/s, expired: 6.14 txn/s, latency: 1386.99 ms, (p50: 1200 ms, p70: 1500, p90: 1800 ms, p99: 3500 ms), latency samples: 193200
5. check swarm health
Compatibility test for e3a949a68b4c493613c58ef894c639cad6a115e8 ==> e032be912f281fae7678548f71e1d61cb4873f69 passed
Upgrade the remaining nodes to version: e032be912f281fae7678548f71e1d61cb4873f69
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2419.45 txn/s, submitted: 2426.54 txn/s, failed submission: 7.09 txn/s, expired: 7.09 txn/s, latency: 1201.25 ms, (p50: 1200 ms, p70: 1200, p90: 1800 ms, p99: 2100 ms), latency samples: 218401
Test Ok

@rustielin rustielin merged commit fd0d96e into main Jan 12, 2026
90 checks passed
@rustielin rustielin deleted the 12-17-_telemetry_support_prometheus_sink branch January 12, 2026 19:44
@rustielin rustielin requested a review from a team January 12, 2026 19:58