
feat: Add prometheus metrics#288

Open
cbrit wants to merge 7 commits into main from collin/prometheus

Conversation

cbrit (Member) commented Apr 14, 2025

Summary by CodeRabbit

  • New Features
    • Introduced asynchronous metrics monitoring, enabling real-time insights into system performance and operational errors.
  • Refactor
    • Updated configuration to use separate endpoints for different service roles.
    • Enhanced error reporting and observability across key processes.

@cbrit cbrit requested a review from zmanian April 14, 2025 17:30
coderabbitai bot commented Apr 14, 2025

Walkthrough

The update integrates Prometheus-based metrics monitoring into the application. A new asynchronous metrics server is launched during the application’s configuration phase via a spawned task. The configuration has been refactored to replace a single metrics listening address with dedicated endpoints for orchestrator and steward components. Various business logic functions, including proposals polling, scheduling, and trust state management, have been extended to increment Prometheus counters for errors and successful operations. Additionally, an unused test import has been removed without affecting exported public entities.

Changes

| File(s) | Change Summary |
| --- | --- |
| src/application.rs, src/lib.rs, src/metrics.rs | Added new asynchronous metrics server functionality. Introduced a public metrics module with an async function to start the server using Axum, and defined multiple Prometheus counters for monitoring application metrics. |
| src/proposals.rs, src/pubsub.rs, src/pubsub/cache.rs | Integrated metrics tracking into core business logic. Enhanced error handling in proposals processing, scheduling confirmation, and trust state cache refresh by incrementing counters for gRPC operations, scheduling retries/failures, trust state loads, and cache refresh outcomes. |
| src/config.rs, src/commands/orchestrator/start.rs | Updated metrics configuration by removing the listen_addr field and adding orchestrator_endpoint and steward_endpoint. Adjusted the orchestrator start command to use the new orchestrator_endpoint for communication. |
| tests/remote_signing.rs | Removed the unused import of H256 from ethers::core::types, streamlining test dependencies. |

Sequence Diagram(s)

sequenceDiagram
    participant App as Application (after_config)
    participant Task as Async Task (tokio::spawn)
    participant MS as start_metrics_server
    participant HTTP as HTTP Server (Axum)

    App->>Task: Spawn async task to start metrics server
    Task->>MS: Call start_metrics_server() asynchronously
    MS->>HTTP: Launch HTTP server on configured endpoint
    HTTP-->>MS: Serve metrics on /metrics endpoint

Suggested reviewers

  • zmanian


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 88d58c8 and 9f67c62.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock, !**/*.lock
  • Cargo.toml is excluded by !**/*.toml
📒 Files selected for processing (9)
  • src/application.rs (1 hunks)
  • src/commands/orchestrator/start.rs (1 hunks)
  • src/config.rs (1 hunks)
  • src/lib.rs (2 hunks)
  • src/metrics.rs (1 hunks)
  • src/proposals.rs (6 hunks)
  • src/pubsub.rs (3 hunks)
  • src/pubsub/cache.rs (2 hunks)
  • tests/remote_signing.rs (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
src/application.rs (2)
src/lib.rs (1)
  • start_metrics_server (205-210)
src/metrics.rs (1)
  • start_metrics_server (69-78)
src/pubsub/cache.rs (2)
src/pubsub.rs (1)
  • get_trust_state (31-155)
src/server.rs (1)
  • extract_subject_key_identifier (348-362)
src/lib.rs (2)
src/metrics.rs (1)
  • start_metrics_server (69-78)
src/application.rs (1)
  • config (47-49)
src/proposals.rs (1)
src/application.rs (2)
  • config (47-49)
  • state (52-54)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Clippy
  • GitHub Check: rust-test
  • GitHub Check: rust-build
  • GitHub Check: hardhat-build
  • GitHub Check: steward-build
🔇 Additional comments (21)
tests/remote_signing.rs (1)

3-3: Clean import statement - nice optimization.

Removing the unused H256 import is a good cleanup to maintain a clean codebase by only importing what's actually used.

src/application.rs (1)

79-85: Properly implemented background metrics server.

The metrics server is correctly started as a non-blocking background task using tokio::spawn. Error handling is appropriate, logging errors without failing the application startup.

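One note: the double .await pattern (start_metrics_server().await.await) arises because start_metrics_server is an async function that resolves to a JoinHandle, which is itself a future. The first .await yields the handle; the second waits for the spawned server task to finish. This is confirmed by the relevant code snippets from src/lib.rs and src/metrics.rs.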
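The nested-future shape behind a double `.await` can be reproduced with the standard library alone. This is only a sketch, not the PR's code: `start_task`, `run`, and `block_on` are hypothetical names, and the inner future is a plain `async` block standing in for tokio's `JoinHandle` (a real spawned task starts running immediately, whereas this stand-in only runs when awaited).

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the blocked thread when the future can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal single-threaded executor (hypothetical helper, std only).
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

/// An async fn whose output is itself a future, similar in shape to an
/// async fn that resolves to a `JoinHandle`.
async fn start_task() -> impl Future<Output = u32> {
    async { 42 }
}

/// The first `.await` yields the inner future; the second drives it.
async fn run() -> u32 {
    start_task().await.await
}
```

Calling `block_on(run())` drives both layers to completion. If the outer function were not `async`, a single `.await` on the returned handle would be enough.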

src/commands/orchestrator/start.rs (1)

123-123: Updated endpoint parameter to use orchestrator-specific endpoint.

This change correctly updates the parameter to use the new orchestrator-specific metrics endpoint instead of the previous shared endpoint, aligning with the configuration changes that replaced listen_addr with separate orchestrator_endpoint and steward_endpoint fields.

src/pubsub.rs (3)

21-21: Added metrics imports for trust state monitoring.

The imports for TRUST_STATE_LOAD_ERRORS and TRUST_STATE_LOAD_SUCCESS metrics are properly added from the new metrics module.


61-61: Properly incrementing error metric when subscription mappings are empty.

This correctly implements error tracking by incrementing the error counter when no subscriptions are found, providing visibility into trust state loading failures.


153-153: Added success metric tracking before returning trust state data.

The success counter is properly incremented just before the successful return of the trust state data, allowing for monitoring of successful trust state loads.

src/pubsub/cache.rs (2)

20-21: New metrics for tracking cache refresh operations.

The new imports add telemetry for tracking successful refreshes and errors when updating the publisher trust state cache.


53-62: Improved error handling with metrics tracking.

This change enhances the error handling by properly tracking both successful and failed cache refreshes. The match statement cleanly separates the success and error paths, ensuring metrics are incremented appropriately in each case.
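The match-based pattern described here might look roughly like the following. It is a std-only sketch under stated assumptions: the counter names, `load_trust_state`, and `refresh_cache` are hypothetical, and `AtomicU64` stands in for the Prometheus counters the PR actually uses.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical std-only stand-ins for the Prometheus counters.
static CACHE_REFRESH_SUCCESS: AtomicU64 = AtomicU64::new(0);
static CACHE_REFRESH_ERRORS: AtomicU64 = AtomicU64::new(0);

/// Hypothetical loader: fails when the source is empty.
fn load_trust_state(source: &str) -> Result<Vec<String>, String> {
    if source.is_empty() {
        return Err("no subscription mappings found".to_string());
    }
    Ok(source.split(',').map(str::to_string).collect())
}

/// Refreshes the cache, counting each outcome exactly once.
fn refresh_cache(cache: &mut Vec<String>, source: &str) {
    match load_trust_state(source) {
        Ok(state) => {
            *cache = state;
            CACHE_REFRESH_SUCCESS.fetch_add(1, Ordering::Relaxed);
        }
        Err(err) => {
            // Keep the previous cache contents on failure.
            eprintln!("cache refresh failed: {err}");
            CACHE_REFRESH_ERRORS.fetch_add(1, Ordering::Relaxed);
        }
    }
}
```

Keeping the increment inside each match arm makes it hard to count an outcome twice or to miss one when a new early return is added later.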

src/config.rs (2)

338-339: Configuration restructured for separate metrics endpoints.

The single listen_addr has been replaced with separate endpoints for orchestrator and steward components, allowing for more granular control over metrics collection.

Ensure that any existing configuration files in production or test environments are updated to use the new fields instead of the deprecated listen_addr.


345-346: Default configuration updated for separate endpoints.

The default configuration now sets separate ports for orchestrator (3000) and steward (3001) metrics endpoints, ensuring the application works correctly with minimal configuration.
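For orientation, the configuration shape described above could be sketched as follows. Only the field names and the ports 3000 and 3001 come from the review. The struct name `MetricsConfig`, the `SocketAddr` field types, and the loopback host are assumptions made for illustration.

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Hypothetical stand-in for the metrics section of the config.
#[derive(Debug, Clone, PartialEq)]
pub struct MetricsConfig {
    pub orchestrator_endpoint: SocketAddr,
    pub steward_endpoint: SocketAddr,
}

impl Default for MetricsConfig {
    fn default() -> Self {
        // Assumption: loopback host. Only the ports are taken from the review.
        let host = IpAddr::V4(Ipv4Addr::LOCALHOST);
        Self {
            orchestrator_endpoint: SocketAddr::new(host, 3000),
            steward_endpoint: SocketAddr::new(host, 3001),
        }
    }
}
```

Distinct default ports let both components run on one host without a bind conflict.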

src/lib.rs (2)

21-21: New metrics module added.

The metrics module is now publicly exposed, allowing it to be accessed from other parts of the application.


205-210: Added metrics server initialization function.

❓ Verification inconclusive

This function provides a centralized way to start the metrics server using the configured endpoint. It properly obtains the configuration and delegates to the actual implementation in the metrics module.

Make sure this function is called at the appropriate point in the application lifecycle, likely in the after_config method as mentioned in the AI summary.


🏁 Script executed:

#!/bin/bash
# Check where start_metrics_server is being called
rg -A 3 "start_metrics_server" --type rust

Length of output: 837


Metrics Server Initialization Verification

The start_metrics_server function in src/lib.rs correctly retrieves the configuration and delegates to the metrics::start_metrics_server implementation using the configured endpoint. The shell script confirms that this function is invoked in src/application.rs, indicating that the metrics server is started during the application’s asynchronous initialization.

Please verify that this invocation occurs at the appropriate lifecycle phase (e.g., immediately after configuration is loaded or within the after_config hook) to ensure proper startup sequencing.

pub async fn start_metrics_server() -> tokio::task::JoinHandle<()> {
    use abscissa_core::Application;

    let config = crate::prelude::APP.config();
    metrics::start_metrics_server(config.metrics.steward_endpoint).await
}
src/proposals.rs (6)

25-25: Added metrics imports for proposal operations.

These metrics will help track gRPC connection issues, request counts, and scheduling problems.


127-132: Enhanced error handling with metrics for gRPC connections.

This change adds proper error tracking for gRPC connection failures, which improves observability for debugging connectivity issues.


134-134: Added request tracking metrics.

This counter will provide visibility into the volume of gRPC requests being made, which is useful for capacity planning and performance monitoring.


378-388: Added metrics for scheduling confirmation failures.

The code now tracks when scheduling confirmations fail, which provides better observability into retry patterns. This is important for understanding reliability issues.


391-401: Consistent metrics for Axelar scheduling confirmations.

Similar to the Ethereum implementation, this adds metrics tracking for Axelar scheduling confirmations, ensuring consistent observability across different chain types.


414-414: Added metrics for scheduling failures.

This metric will help track the overall rate of scheduling failures, which is critical for monitoring system health.
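The relationship between per-attempt retry counts and an overall failure count can be sketched as below. This is an illustration only and not the PR's scheduling logic: `confirm_with_retries` and both counter names are hypothetical, and `AtomicU64` stands in for Prometheus counters.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical std-only stand-ins for the scheduling counters.
static SCHEDULING_RETRIES: AtomicU64 = AtomicU64::new(0);
static SCHEDULING_FAILURES: AtomicU64 = AtomicU64::new(0);

/// Runs `attempt` up to `max_attempts` times, passing the attempt number.
pub fn confirm_with_retries<F>(max_attempts: u32, mut attempt: F) -> Result<(), String>
where
    F: FnMut(u32) -> Result<(), String>,
{
    let mut last_err = String::from("no attempts made");
    for n in 1..=max_attempts {
        match attempt(n) {
            Ok(()) => return Ok(()),
            Err(err) => {
                // Each failed confirmation counts toward the retry metric.
                SCHEDULING_RETRIES.fetch_add(1, Ordering::Relaxed);
                last_err = err;
            }
        }
    }
    // All attempts exhausted: count one overall failure.
    SCHEDULING_FAILURES.fetch_add(1, Ordering::Relaxed);
    Err(last_err)
}
```

Tracking the two separately distinguishes transient problems (retries rise, failures stay flat) from persistent ones (both rise).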

src/metrics.rs (3)

1-5: Well-structured imports.

Clean and minimal imports covering the necessary dependencies for metrics and HTTP server functionality.


6-61: Well-organized Prometheus metrics with clear naming conventions.

The metrics are logically grouped by functionality (proposals, trust state, gRPC, etc.) and follow Prometheus naming conventions with the steward_ prefix and _total suffix. Each counter has clear and descriptive help text.
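To make the naming convention concrete, here is a std-only sketch of a counter and how it would appear in the Prometheus text exposition format. The `Counter` type and the metric name `steward_grpc_requests_total` are hypothetical. The PR itself uses the prometheus crate, whose API is not reproduced here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal monotonic counter (std-only stand-in for a Prometheus counter).
pub struct Counter {
    name: &'static str,
    help: &'static str,
    value: AtomicU64,
}

impl Counter {
    pub const fn new(name: &'static str, help: &'static str) -> Self {
        Self { name, help, value: AtomicU64::new(0) }
    }

    /// Increments the counter by one.
    pub fn inc(&self) {
        self.value.fetch_add(1, Ordering::Relaxed);
    }

    /// Renders the counter in the Prometheus text exposition format.
    pub fn render(&self) -> String {
        format!(
            "# HELP {name} {help}\n# TYPE {name} counter\n{name} {value}\n",
            name = self.name,
            help = self.help,
            value = self.value.load(Ordering::Relaxed),
        )
    }
}

// Hypothetical metric following the `steward_` prefix and `_total` suffix.
pub static GRPC_REQUESTS: Counter =
    Counter::new("steward_grpc_requests_total", "Total gRPC requests made");
```

The `_total` suffix signals a cumulative counter, so dashboards would typically apply a rate over it rather than read the raw value.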


1-78: Security consideration: Metrics endpoint lacks authentication.

❓ Verification inconclusive

The metrics endpoint is exposed without authentication. While this is a common pattern with Prometheus, ensure that access to this endpoint is restricted at the network level to prevent unauthorized access to potentially sensitive performance data.

Consider whether the metrics server should:

  1. Be bound to localhost only if it's meant for local scraping
  2. Implement authentication if exposed to a wider network
  3. Be accessed through a reverse proxy that handles authentication

🏁 Script executed:

#!/bin/bash
# Check if there's any authentication middleware or network restrictions defined elsewhere in the codebase
echo "Checking for potential authentication middleware..."
rg "auth.*middleware" -n
echo "Checking for network configuration related to metrics..."
rg -n "metrics.*addr|addr.*metrics" 

Length of output: 390


Security Notice: Metrics Endpoint Lacks In-App Authentication

  • The metrics endpoint in src/metrics.rs does not include any built-in authentication.
  • A repository-wide search for authentication middleware or network-level restrictions did not reveal any additional protection.
  • If this endpoint is exposed beyond a secure internal environment, please consider one of the following:
    • Binding the metrics server to localhost (or another internal address) if it is solely for local scraping.
    • Implementing authentication directly on the endpoint.
    • Routing requests through a reverse proxy that enforces proper access controls.

Please verify that your deployment strategy incorporates appropriate network-level restrictions to safeguard sensitive performance data.
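One way to act on the localhost-binding option is a startup guard that rejects non-loopback addresses unless the operator opts in. This is a sketch under assumptions: `is_local_only`, `check_metrics_bind`, and the `allow_public` flag are hypothetical and do not exist in the PR.

```rust
use std::net::SocketAddr;

/// Returns true when the address is only reachable from the local host.
pub fn is_local_only(addr: &SocketAddr) -> bool {
    addr.ip().is_loopback()
}

/// Hypothetical startup guard: rejects non-loopback binds unless explicitly allowed.
pub fn check_metrics_bind(addr: &SocketAddr, allow_public: bool) -> Result<(), String> {
    if is_local_only(addr) || allow_public {
        Ok(())
    } else {
        Err(format!("refusing to expose unauthenticated metrics on {addr}"))
    }
}
```

A guard like this makes public exposure a deliberate configuration choice rather than an accident of the default address.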

Comment on lines +63 to +67
async fn metrics_handler() -> impl IntoResponse {
    prometheus::TextEncoder::new()
        .encode_to_string(&prometheus::gather())
        .unwrap()
}

🛠️ Refactor suggestion

Replace unwrap() with proper error handling in metrics_handler.

The unwrap() call could lead to panics in production if encoding fails.

async fn metrics_handler() -> impl IntoResponse {
-    prometheus::TextEncoder::new()
-        .encode_to_string(&prometheus::gather())
-        .unwrap()
+    match prometheus::TextEncoder::new().encode_to_string(&prometheus::gather()) {
+        Ok(metrics) => metrics,
+        Err(e) => {
+            tracing::error!("Failed to encode metrics: {}", e);
+            "# ERROR: Failed to encode metrics".to_string()
+        }
+    }
}
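A possible refinement, offered as an assumption rather than part of the suggestion: returning a plain string from an Axum handler produces a 200 response, so a scraper would see the error text as a successful scrape. The std-only sketch below maps the encoding result to a status code and body. `metrics_response` is a hypothetical helper, and in Axum a `(StatusCode, String)` tuple would play the role of the `(u16, String)` pair.

```rust
use std::fmt::Display;

/// Maps an encoding result to an HTTP status code and response body.
pub fn metrics_response<E: Display>(encoded: Result<String, E>) -> (u16, String) {
    match encoded {
        Ok(body) => (200, body),
        Err(err) => {
            // Log the cause, but return a generic body to the client.
            eprintln!("failed to encode metrics: {err}");
            (500, "failed to encode metrics".to_string())
        }
    }
}
```

With a 500 status, Prometheus would record the scrape as failed, which tends to be easier to alert on than an unexpectedly empty payload.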

Comment on lines +69 to +78
pub async fn start_metrics_server(addr: SocketAddr) -> JoinHandle<()> {
    let app = Router::new().route("/metrics", get(metrics_handler));

    tokio::spawn(async move {
        axum::Server::bind(&addr)
            .serve(app.into_make_service())
            .await
            .unwrap();
    })
}

🛠️ Refactor suggestion

Improve error handling in start_metrics_server and add logging.

The server start function uses unwrap() which could cause the application to panic if the server fails to start. It also lacks logging for server operations.

pub async fn start_metrics_server(addr: SocketAddr) -> JoinHandle<()> {
    let app = Router::new().route("/metrics", get(metrics_handler));
+    tracing::info!("Starting metrics server on {}", addr);

    tokio::spawn(async move {
-        axum::Server::bind(&addr)
-            .serve(app.into_make_service())
-            .await
-            .unwrap();
+        if let Err(e) = axum::Server::bind(&addr)
+            .serve(app.into_make_service())
+            .await 
+        {
+            tracing::error!("Metrics server error: {}", e);
+        }
    })
}

Additionally, consider adding a graceful shutdown mechanism to ensure the server can be properly stopped when the application exits.
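The general shape of a graceful shutdown can be shown with a std-only analogue: a worker that keeps running until it receives a signal, then exits cleanly so the caller can join it. This is a sketch, not the PR's server. `spawn_worker` is hypothetical, and a thread plus channel stands in for the tokio task. For the hyper-based server that `axum::Server` re-exports, `with_graceful_shutdown` is likely the relevant hook, though that should be checked against the versions the PR pins.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Spawns a worker that runs until a shutdown signal arrives.
/// Returns the sender used to request shutdown and the worker's handle.
pub fn spawn_worker() -> (Sender<()>, JoinHandle<u32>) {
    let (tx, rx): (Sender<()>, Receiver<()>) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut ticks = 0;
        loop {
            // Stand-in for serving requests: wake periodically to check for shutdown.
            match rx.recv_timeout(Duration::from_millis(10)) {
                // Stop on an explicit signal or when the sender is dropped.
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
                Err(RecvTimeoutError::Timeout) => ticks += 1,
            }
        }
        ticks
    });
    (tx, handle)
}
```

The caller sends `()` on the returned sender and then joins the handle, which guarantees the worker has finished before the process exits.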

