feat: Add prometheus metrics by cbrit · Pull Request #288 · PeggyJV/steward

cbrit · 2025-04-14T17:30:15Z

Summary by CodeRabbit

New Features
- Introduced asynchronous metrics monitoring, enabling real-time insights into system performance and operational errors.
Refactor
- Updated configuration to use separate endpoints for different service roles.
- Enhanced error reporting and observability across key processes.

coderabbitai · 2025-04-14T17:33:07Z

Walkthrough

The update integrates Prometheus-based metrics monitoring into the application. A new asynchronous metrics server is launched during the application’s configuration phase via a spawned task. The configuration has been refactored to replace a single metrics listening address with dedicated endpoints for orchestrator and steward components. Various business logic functions, including proposals polling, scheduling, and trust state management, have been extended to increment Prometheus counters for errors and successful operations. Additionally, an unused test import has been removed without affecting exported public entities.

Changes

| File(s) | Change Summary |
|---|---|
| `src/application.rs`, `src/lib.rs`, `src/metrics.rs` | Added new asynchronous metrics server functionality. Introduced a public metrics module with an async function to start the server using Axum, and defined multiple Prometheus counters for monitoring application metrics. |
| `src/proposals.rs`, `src/pubsub.rs`, `src/pubsub/cache.rs` | Integrated metrics tracking into core business logic. Enhanced error handling in proposals processing, scheduling confirmation, and trust state cache refresh by incrementing counters for gRPC operations, scheduling retries/failures, trust state loads, and cache refresh outcomes. |
| `src/config.rs`, `src/commands/orchestrator/start.rs` | Updated metrics configuration by removing the `listen_addr` field and adding `orchestrator_endpoint` and `steward_endpoint`. Adjusted the orchestrator start command to use the new `orchestrator_endpoint` for communication. |
| `tests/remote_signing.rs` | Removed the unused import of `H256` from `ethers::core::types`, streamlining test dependencies. |

Sequence Diagram(s)

sequenceDiagram
    participant App as Application (after_config)
    participant Task as Async Task (tokio::spawn)
    participant MS as start_metrics_server
    participant HTTP as HTTP Server (Axum)

    App->>Task: Spawn async task to start metrics server
    Task->>MS: Call start_metrics_server() asynchronously
    MS->>HTTP: Launch HTTP server on configured endpoint
    HTTP-->>MS: Serve metrics on /metrics endpoint
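The flow in the diagram can be approximated with the standard library alone. The following is a minimal sketch and not the PR's implementation: it swaps `tokio::spawn` for `std::thread::spawn` and Axum for a hand-rolled `TcpListener` loop, and the counter name `demo_requests_total` is hypothetical.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread::{self, JoinHandle};

// Hypothetical counter; the PR defines its own Prometheus counters.
static REQUESTS_TOTAL: AtomicU64 = AtomicU64::new(0);

/// Renders the counter in the Prometheus text exposition format.
fn render_metrics() -> String {
    format!(
        "# TYPE demo_requests_total counter\ndemo_requests_total {}\n",
        REQUESTS_TOTAL.load(Ordering::Relaxed)
    )
}

/// Answers one HTTP request: 200 for GET /metrics, 404 otherwise.
fn respond(mut stream: TcpStream) -> std::io::Result<()> {
    let mut buf = [0u8; 1024];
    let n = stream.read(&mut buf)?;
    REQUESTS_TOTAL.fetch_add(1, Ordering::Relaxed);
    let request = String::from_utf8_lossy(&buf[..n]);
    let (status, body) = if request.starts_with("GET /metrics ") {
        ("200 OK", render_metrics())
    } else {
        ("404 Not Found", String::new())
    };
    write!(
        stream,
        "HTTP/1.1 {}\r\nContent-Type: text/plain; version=0.0.4\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        status,
        body.len(),
        body
    )
}

/// Binds synchronously so bind errors reach the caller, then serves on a
/// background thread so application startup is not blocked.
pub fn start_metrics_server(addr: SocketAddr) -> std::io::Result<(SocketAddr, JoinHandle<()>)> {
    let listener = TcpListener::bind(addr)?;
    let local_addr = listener.local_addr()?;
    let handle = thread::spawn(move || {
        for conn in listener.incoming() {
            if let Err(e) = conn.and_then(respond) {
                eprintln!("metrics request failed: {}", e);
            }
        }
    });
    Ok((local_addr, handle))
}
```

Binding before spawning is a deliberate choice in this sketch: an address that is already in use is reported to the caller instead of being lost inside the background thread.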

Suggested reviewers

zmanian


coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 88d58c8 and 9f67c62.

⛔ Files ignored due to path filters (2)

Cargo.lock is excluded by !**/*.lock, !**/*.lock
Cargo.toml is excluded by !**/*.toml

📒 Files selected for processing (9)

src/application.rs (1 hunks)
src/commands/orchestrator/start.rs (1 hunks)
src/config.rs (1 hunks)
src/lib.rs (2 hunks)
src/metrics.rs (1 hunks)
src/proposals.rs (6 hunks)
src/pubsub.rs (3 hunks)
src/pubsub/cache.rs (2 hunks)
tests/remote_signing.rs (1 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (4)

src/application.rs (2)
- src/lib.rs (1): start_metrics_server (205-210)
- src/metrics.rs (1): start_metrics_server (69-78)

src/pubsub/cache.rs (2)
- src/pubsub.rs (1): get_trust_state (31-155)
- src/server.rs (1): extract_subject_key_identifier (348-362)

src/lib.rs (2)
- src/metrics.rs (1): start_metrics_server (69-78)
- src/application.rs (1): config (47-49)

src/proposals.rs (1)
- src/application.rs (2): config (47-49), state (52-54)

⏰ Context from checks skipped due to timeout of 90000ms (5)

GitHub Check: Clippy
GitHub Check: rust-test
GitHub Check: rust-build
GitHub Check: hardhat-build
GitHub Check: steward-build

🔇 Additional comments (21)

tests/remote_signing.rs (1)

3-3: Clean import statement - nice optimization.

Removing the unused H256 import is a good cleanup: the test now imports only what it actually uses.

src/application.rs (1)

79-85: Properly implemented background metrics server.

The metrics server is correctly started as a non-blocking background task using tokio::spawn. Error handling is appropriate, logging errors without failing the application startup.

One note: the double .await (start_metrics_server().await.await) arises because start_metrics_server is an async fn that returns a JoinHandle. The first .await yields the handle and the second waits for the spawned server task to finish, which matches the snippets from src/lib.rs and src/metrics.rs.

src/commands/orchestrator/start.rs (1)

123-123: Updated endpoint parameter to use orchestrator-specific endpoint.

This change correctly updates the parameter to use the new orchestrator-specific metrics endpoint instead of the previous shared endpoint, aligning with the configuration changes that replaced listen_addr with separate orchestrator_endpoint and steward_endpoint fields.

src/pubsub.rs (3)

21-21: Added metrics imports for trust state monitoring.

The imports for TRUST_STATE_LOAD_ERRORS and TRUST_STATE_LOAD_SUCCESS metrics are properly added from the new metrics module.

61-61: Properly incrementing error metric when subscription mappings are empty.

This correctly implements error tracking by incrementing the error counter when no subscriptions are found, providing visibility into trust state loading failures.

153-153: Added success metric tracking before returning trust state data.

The success counter is properly incremented just before the successful return of the trust state data, allowing for monitoring of successful trust state loads.

src/pubsub/cache.rs (2)

20-21: New metrics for tracking cache refresh operations.

The new imports add telemetry for tracking successful refreshes and errors when updating the publisher trust state cache.

53-62: Improved error handling with metrics tracking.

This change enhances the error handling by properly tracking both successful and failed cache refreshes. The match statement cleanly separates the success and error paths, ensuring metrics are incremented appropriately in each case.
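The success and error paths described above can be sketched as follows. This is an illustration under assumptions, not the code in `src/pubsub/cache.rs`: plain `AtomicU64` values stand in for the Prometheus counters, the names `CACHE_REFRESH_SUCCESS`, `CACHE_REFRESH_ERRORS`, and `refresh_cache` are hypothetical, and keeping the old cache contents on failure is an assumed policy.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-ins for Prometheus counters; names are hypothetical.
pub static CACHE_REFRESH_SUCCESS: AtomicU64 = AtomicU64::new(0);
pub static CACHE_REFRESH_ERRORS: AtomicU64 = AtomicU64::new(0);

/// Replaces `cache` with freshly loaded state and counts the outcome.
/// On error the previous cache contents are kept (assumed policy).
pub fn refresh_cache<T, E>(
    cache: &mut Vec<T>,
    load: impl FnOnce() -> Result<Vec<T>, E>,
) -> Result<(), E> {
    match load() {
        Ok(fresh) => {
            *cache = fresh;
            CACHE_REFRESH_SUCCESS.fetch_add(1, Ordering::Relaxed);
            Ok(())
        }
        Err(e) => {
            CACHE_REFRESH_ERRORS.fetch_add(1, Ordering::Relaxed);
            Err(e)
        }
    }
}
```

Because each arm increments exactly one counter, the sum of the two counters equals the number of refresh attempts, which makes an error ratio straightforward to derive.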

src/config.rs (2)

338-339: Configuration restructured for separate metrics endpoints.

The single listen_addr has been replaced with separate endpoints for orchestrator and steward components, allowing for more granular control over metrics collection.

Ensure that any existing configuration files in production or test environments are updated to use the new fields instead of the removed listen_addr.

345-346: Default configuration updated for separate endpoints.

The default configuration now sets separate ports for orchestrator (3000) and steward (3001) metrics endpoints, ensuring the application works correctly with minimal configuration.
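For illustration, the refactored section and its defaults might look roughly like the sketch below. The struct name `MetricsSection` is hypothetical, the `SocketAddr` type for `orchestrator_endpoint` is an assumption, and the loopback host is also an assumption: the review only states the ports 3000 and 3001.

```rust
use std::net::{Ipv4Addr, SocketAddr};

/// Sketch of the metrics configuration after the refactor.
/// `listen_addr` is gone; each role has its own endpoint.
#[derive(Debug, Clone, PartialEq)]
pub struct MetricsSection {
    pub orchestrator_endpoint: SocketAddr,
    pub steward_endpoint: SocketAddr,
}

impl Default for MetricsSection {
    fn default() -> Self {
        // Ports come from the review; the loopback host is an assumption.
        let host = Ipv4Addr::LOCALHOST;
        Self {
            orchestrator_endpoint: SocketAddr::from((host, 3000)),
            steward_endpoint: SocketAddr::from((host, 3001)),
        }
    }
}
```

Distinct default ports let both roles run on one host without colliding on the metrics listener.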

src/lib.rs (2)

21-21: New metrics module added.

The metrics module is now publicly exposed, allowing it to be accessed from other parts of the application.

205-210: Added metrics server initialization function.

❓ Verification inconclusive

This function provides a centralized way to start the metrics server using the configured endpoint. It properly obtains the configuration and delegates to the actual implementation in the metrics module.

Make sure this function is called at the appropriate point in the application lifecycle, likely in the after_config method as mentioned in the AI summary.

🏁 Script executed:
#!/bin/bash
# Check where start_metrics_server is being called
rg -A 3 "start_metrics_server" --type rust
Length of output: 837

Metrics Server Initialization Verification

The start_metrics_server function in src/lib.rs correctly retrieves the configuration and delegates to the metrics::start_metrics_server implementation using the configured endpoint. The shell script confirms that this function is invoked in src/application.rs, indicating that the metrics server is started during the application’s asynchronous initialization.

Please verify that this invocation occurs at the appropriate lifecycle phase (e.g., immediately after configuration is loaded or within the after_config hook) to ensure proper startup sequencing.
pub async fn start_metrics_server() -> tokio::task::JoinHandle<()> {
    use abscissa_core::Application;

    let config = crate::prelude::APP.config();
    metrics::start_metrics_server(config.metrics.steward_endpoint).await
}

src/proposals.rs (6)

25-25: Added metrics imports for proposal operations.

These metrics will help track gRPC connection issues, request counts, and scheduling problems.

127-132: Enhanced error handling with metrics for gRPC connections.

This change adds proper error tracking for gRPC connection failures, which improves observability for debugging connectivity issues.

134-134: Added request tracking metrics.

This counter will provide visibility into the volume of gRPC requests being made, which is useful for capacity planning and performance monitoring.

378-388: Added metrics for scheduling confirmation failures.

The code now tracks when scheduling confirmations fail, which provides better observability into retry patterns. This is important for understanding reliability issues.

391-401: Consistent metrics for Axelar scheduling confirmations.

Similar to the Ethereum implementation, this adds metrics tracking for Axelar scheduling confirmations, ensuring consistent observability across different chain types.

414-414: Added metrics for scheduling failures.

This metric will help track the overall rate of scheduling failures, which is critical for monitoring system health.
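One way to read the retry and failure counters together is sketched below. This is an assumed structure, not the logic in `src/proposals.rs`: `AtomicU64` values stand in for the Prometheus counters, and `SCHEDULING_RETRIES`, `SCHEDULING_FAILURES`, and `confirm_with_retries` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-ins for Prometheus counters; names are hypothetical.
pub static SCHEDULING_RETRIES: AtomicU64 = AtomicU64::new(0);
pub static SCHEDULING_FAILURES: AtomicU64 = AtomicU64::new(0);

/// Calls `confirm` until it succeeds or `max_attempts` is reached
/// (at least one attempt is always made). Every failed attempt that is
/// followed by another try counts as a retry; giving up counts as one failure.
pub fn confirm_with_retries<E>(
    max_attempts: u32,
    mut confirm: impl FnMut(u32) -> Result<(), E>,
) -> Result<(), E> {
    let mut attempt = 1;
    loop {
        match confirm(attempt) {
            Ok(()) => return Ok(()),
            Err(e) if attempt >= max_attempts => {
                SCHEDULING_FAILURES.fetch_add(1, Ordering::Relaxed);
                return Err(e);
            }
            Err(_) => {
                SCHEDULING_RETRIES.fetch_add(1, Ordering::Relaxed);
                attempt += 1;
            }
        }
    }
}
```

Keeping the two counters separate distinguishes transient trouble (retries rising, failures flat) from operations that never complete (failures rising).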

src/metrics.rs (3)

1-5: Well-structured imports.

Clean and minimal imports covering the necessary dependencies for metrics and HTTP server functionality.

6-61: Well-organized Prometheus metrics with clear naming conventions.

The metrics are logically grouped by functionality (proposals, trust state, gRPC, etc.) and follow Prometheus naming conventions with the steward_ prefix and _total suffix. Each counter has clear and descriptive help text.
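The naming convention described above can be checked mechanically. The sketch below is illustrative only: the helper names are hypothetical and the PR does not contain such a check. It combines the Prometheus metric-name character rule (first character `[a-zA-Z_:]`, remaining characters `[a-zA-Z0-9_:]`) with the `steward_` prefix and `_total` suffix.

```rust
/// Returns true if `name` is a legal Prometheus metric name:
/// first character in [a-zA-Z_:], the rest in [a-zA-Z0-9_:].
pub fn is_valid_metric_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || c == ':' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == ':')
}

/// Returns true if `name` is legal and follows the counter convention
/// noted in the review: `steward_` prefix and `_total` suffix.
pub fn follows_counter_convention(name: &str) -> bool {
    is_valid_metric_name(name) && name.starts_with("steward_") && name.ends_with("_total")
}
```

A check like this could run in a unit test over the registered counter names so that a misnamed metric fails in CI and never reaches a dashboard.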

1-78: Security consideration: Metrics endpoint lacks authentication.

❓ Verification inconclusive

The metrics endpoint is exposed without authentication. While this is a common pattern with Prometheus, ensure that access to this endpoint is restricted at the network level to prevent unauthorized access to potentially sensitive performance data.

Consider whether the metrics server should:
- Be bound to localhost only if it's meant for local scraping
- Implement authentication if exposed to a wider network
- Be accessed through a reverse proxy that handles authentication

🏁 Script executed:
#!/bin/bash
# Check if there's any authentication middleware or network restrictions defined elsewhere in the codebase
echo "Checking for potential authentication middleware..."
rg "auth.*middleware" -n
echo "Checking for network configuration related to metrics..."
rg -n "metrics.*addr|addr.*metrics" 
Length of output: 390

Security Notice: Metrics Endpoint Lacks In-App Authentication

The metrics endpoint in src/metrics.rs does not include any built-in authentication.

A repository-wide search for authentication middleware or network-level restrictions did not reveal any additional protection.

If this endpoint is exposed beyond a secure internal environment, please consider one of the following:
- Binding the metrics server to localhost (or another internal address) if it is solely for local scraping.
- Implementing authentication directly on the endpoint.
- Routing requests through a reverse proxy that enforces proper access controls.

Please verify that your deployment strategy incorporates appropriate network-level restrictions to safeguard sensitive performance data.
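As a small aid for the first option, a startup check could classify how widely a configured endpoint is exposed. This is a sketch under assumptions: the `Exposure` enum and `classify_exposure` function are hypothetical and not part of the PR.

```rust
use std::net::SocketAddr;

/// How widely a listening address is reachable.
#[derive(Debug, PartialEq)]
pub enum Exposure {
    /// 127.0.0.0/8 or ::1: reachable from the local host only.
    LoopbackOnly,
    /// 0.0.0.0 or ::, which listens on every interface.
    AllInterfaces,
    /// A concrete non-loopback address.
    SpecificAddress,
}

/// Classifies a metrics endpoint so a warning can be logged at startup
/// when it is reachable from other hosts.
pub fn classify_exposure(addr: &SocketAddr) -> Exposure {
    let ip = addr.ip();
    if ip.is_loopback() {
        Exposure::LoopbackOnly
    } else if ip.is_unspecified() {
        Exposure::AllInterfaces
    } else {
        Exposure::SpecificAddress
    }
}
```

Anything other than `LoopbackOnly` would then be a cue to confirm that a firewall or reverse proxy sits in front of the endpoint.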

coderabbitai · 2025-04-14T17:33:10Z

src/metrics.rs

+async fn metrics_handler() -> impl IntoResponse {
+    prometheus::TextEncoder::new()
+        .encode_to_string(&prometheus::gather())
+        .unwrap()
+}


🛠️ Refactor suggestion

Replace unwrap() with proper error handling in metrics_handler.

The unwrap() call could lead to panics in production if encoding fails.

 async fn metrics_handler() -> impl IntoResponse {
-    prometheus::TextEncoder::new()
-        .encode_to_string(&prometheus::gather())
-        .unwrap()
+    match prometheus::TextEncoder::new().encode_to_string(&prometheus::gather()) {
+        Ok(metrics) => metrics,
+        Err(e) => {
+            tracing::error!("Failed to encode metrics: {}", e);
+            "# ERROR: Failed to encode metrics".to_string()
+        }
+    }
 }


coderabbitai · 2025-04-14T17:33:10Z

src/metrics.rs

+pub async fn start_metrics_server(addr: SocketAddr) -> JoinHandle<()> {
+    let app = Router::new().route("/metrics", get(metrics_handler));
+
+    tokio::spawn(async move {
+        axum::Server::bind(&addr)
+            .serve(app.into_make_service())
+            .await
+            .unwrap();
+    })
+}


🛠️ Refactor suggestion

Improve error handling in start_metrics_server and add logging.

The server task uses unwrap(), so a serve error panics the spawned task instead of being logged and handled. The function also emits no log line when the server starts.

 pub async fn start_metrics_server(addr: SocketAddr) -> JoinHandle<()> {
     let app = Router::new().route("/metrics", get(metrics_handler));
+    tracing::info!("Starting metrics server on {}", addr);
     tokio::spawn(async move {
-        axum::Server::bind(&addr)
-            .serve(app.into_make_service())
-            .await
-            .unwrap();
+        if let Err(e) = axum::Server::bind(&addr)
+            .serve(app.into_make_service())
+            .await
+        {
+            tracing::error!("Metrics server error: {}", e);
+        }
     })
 }

Additionally, consider adding a graceful shutdown mechanism to ensure the server can be properly stopped when the application exits.
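The shape of such a shutdown mechanism can be sketched with the standard library. This is not how the Axum server in the PR would do it (an async server would normally take a shutdown signal future instead); the sketch swaps in a thread, a non-blocking `TcpListener`, and a stop flag, and the `MetricsServer` type is hypothetical.

```rust
use std::io::ErrorKind;
use std::net::{SocketAddr, TcpListener};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical handle that owns the server thread and can stop it.
pub struct MetricsServer {
    pub local_addr: SocketAddr,
    stop: Arc<AtomicBool>,
    worker: JoinHandle<std::io::Result<()>>,
}

impl MetricsServer {
    /// Binds the listener and starts the accept loop on a background thread.
    pub fn start(addr: SocketAddr) -> std::io::Result<Self> {
        let listener = TcpListener::bind(addr)?;
        // Non-blocking accept lets the loop re-check the stop flag.
        listener.set_nonblocking(true)?;
        let local_addr = listener.local_addr()?;
        let stop = Arc::new(AtomicBool::new(false));
        let stop_flag = Arc::clone(&stop);
        let worker = thread::spawn(move || {
            while !stop_flag.load(Ordering::Acquire) {
                match listener.accept() {
                    // Request handling is elided; the connection is dropped.
                    Ok((_stream, _peer)) => {}
                    Err(e) if e.kind() == ErrorKind::WouldBlock => {
                        thread::sleep(Duration::from_millis(10))
                    }
                    Err(e) => return Err(e),
                }
            }
            Ok(())
        });
        Ok(Self { local_addr, stop, worker })
    }

    /// Signals the loop to stop and waits for the thread to exit.
    pub fn shutdown(self) -> std::io::Result<()> {
        self.stop.store(true, Ordering::Release);
        match self.worker.join() {
            Ok(result) => result,
            Err(_) => Err(std::io::Error::new(
                ErrorKind::Other,
                "metrics thread panicked",
            )),
        }
    }
}
```

Returning the accept loop's `Result` through `shutdown` means a server that died early is reported when the application exits and does not go unnoticed.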


Add prometheus metrics

9f67c62

cbrit requested a review from zmanian April 14, 2025 17:30

coderabbitai bot reviewed Apr 14, 2025

Move proto build into crate

be6d976

cbrit had a problem deploying to CI April 14, 2025 17:45 — with GitHub Actions Failure

cbrit added 4 commits April 14, 2025 12:48

Move abi build to crate

238dc56

Depend on new ABI crate

7b37803

Ignore clippy warnings in steward-abi

f863b6f

Merge branch 'collin/build-perf' into collin/prometheus

e50feb1

cbrit changed the base branch from main to collin/build-perf April 14, 2025 21:04

cbrit had a problem deploying to CI April 14, 2025 21:18 — with GitHub Actions Failure

Merge branch 'main' into collin/prometheus

f21cb4c

cbrit changed the base branch from collin/build-perf to main April 15, 2025 14:10

cbrit had a problem deploying to CI April 15, 2025 14:24 — with GitHub Actions Failure

cbrit wants to merge 7 commits into main from collin/prometheus
