Conversation
WalkthroughThe update integrates Prometheus-based metrics monitoring into the application. A new asynchronous metrics server is launched during the application’s configuration phase via a spawned task. The configuration has been refactored to replace a single metrics listening address with dedicated endpoints for orchestrator and steward components. Various business logic functions, including proposals polling, scheduling, and trust state management, have been extended to increment Prometheus counters for errors and successful operations. Additionally, an unused test import has been removed without affecting exported public entities. Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application (after_config)
participant Task as Async Task (tokio::spawn)
participant MS as start_metrics_server
participant HTTP as HTTP Server (Axum)
App->>Task: Spawn async task to start metrics server
Task->>MS: Call start_metrics_server() asynchronously
MS->>HTTP: Launch HTTP server on configured endpoint
HTTP-->>MS: Serve metrics on /metrics endpoint
Suggested reviewers
Tip ⚡💬 Agentic Chat (Pro Plan, General Availability)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.lock,!**/*.lockCargo.tomlis excluded by!**/*.toml
📒 Files selected for processing (9)
src/application.rs(1 hunks)src/commands/orchestrator/start.rs(1 hunks)src/config.rs(1 hunks)src/lib.rs(2 hunks)src/metrics.rs(1 hunks)src/proposals.rs(6 hunks)src/pubsub.rs(3 hunks)src/pubsub/cache.rs(2 hunks)tests/remote_signing.rs(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
src/application.rs (2)
src/lib.rs (1)
start_metrics_server(205-210)src/metrics.rs (1)
start_metrics_server(69-78)
src/pubsub/cache.rs (2)
src/pubsub.rs (1)
get_trust_state(31-155)src/server.rs (1)
extract_subject_key_identifier(348-362)
src/lib.rs (2)
src/metrics.rs (1)
start_metrics_server(69-78)src/application.rs (1)
config(47-49)
src/proposals.rs (1)
src/application.rs (2)
config(47-49)state(52-54)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Clippy
- GitHub Check: rust-test
- GitHub Check: rust-build
- GitHub Check: hardhat-build
- GitHub Check: steward-build
🔇 Additional comments (21)
tests/remote_signing.rs (1)
3-3: Clean import statement - nice optimization.Removing the unused
H256import is a good cleanup to maintain a clean codebase by only importing what's actually used.src/application.rs (1)
79-85: Properly implemented background metrics server.The metrics server is correctly started as a non-blocking background task using
tokio::spawn. Error handling is appropriate, logging errors without failing the application startup.One note: The double
.awaitpattern (start_metrics_server().await.await) indicates that the function returns aJoinHandlecontaining anotherFuture. This is confirmed by the relevant code snippets fromsrc/lib.rsandsrc/metrics.rs.src/commands/orchestrator/start.rs (1)
123-123: Updated endpoint parameter to use orchestrator-specific endpoint.This change correctly updates the parameter to use the new orchestrator-specific metrics endpoint instead of the previous shared endpoint, aligning with the configuration changes that replaced
listen_addrwith separateorchestrator_endpointandsteward_endpointfields.src/pubsub.rs (3)
21-21: Added metrics imports for trust state monitoring.The imports for
TRUST_STATE_LOAD_ERRORSandTRUST_STATE_LOAD_SUCCESSmetrics are properly added from the new metrics module.
61-61: Properly incrementing error metric when subscription mappings are empty.This correctly implements error tracking by incrementing the error counter when no subscriptions are found, providing visibility into trust state loading failures.
153-153: Added success metric tracking before returning trust state data.The success counter is properly incremented just before the successful return of the trust state data, allowing for monitoring of successful trust state loads.
src/pubsub/cache.rs (2)
20-21: New metrics for tracking cache refresh operations.The new imports add telemetry for tracking successful refreshes and errors when updating the publisher trust state cache.
53-62: Improved error handling with metrics tracking.This change enhances the error handling by properly tracking both successful and failed cache refreshes. The match statement cleanly separates the success and error paths, ensuring metrics are incremented appropriately in each case.
src/config.rs (2)
338-339: Configuration restructured for separate metrics endpoints.The single
listen_addrhas been replaced with separate endpoints for orchestrator and steward components, allowing for more granular control over metrics collection.Ensure that any existing configuration files in production or test environments are updated to use the new fields instead of the deprecated
listen_addr.
345-346: Default configuration updated for separate endpoints.The default configuration now sets separate ports for orchestrator (3000) and steward (3001) metrics endpoints, ensuring the application works correctly with minimal configuration.
src/lib.rs (2)
21-21: New metrics module added.The metrics module is now publicly exposed, allowing it to be accessed from other parts of the application.
205-210:Details
❓ Verification inconclusive
Added metrics server initialization function.
This function provides a centralized way to start the metrics server using the configured endpoint. It properly obtains the configuration and delegates to the actual implementation in the metrics module.
Make sure this function is called at the appropriate point in the application lifecycle, likely in the
after_configmethod as mentioned in the AI summary.
🏁 Script executed:
#!/bin/bash # Check where start_metrics_server is being called rg -A 3 "start_metrics_server" --type rustLength of output: 837
Metrics Server Initialization Verification
The
start_metrics_serverfunction insrc/lib.rscorrectly retrieves the configuration and delegates to themetrics::start_metrics_serverimplementation using the configured endpoint. The shell script confirms that this function is invoked insrc/application.rs, indicating that the metrics server is started during the application’s asynchronous initialization.Please verify that this invocation occurs at the appropriate lifecycle phase (e.g., immediately after configuration is loaded or within the
after_confighook) to ensure proper startup sequencing.pub async fn start_metrics_server() -> tokio::task::JoinHandle<()> { use abscissa_core::Application; let config = crate::prelude::APP.config(); metrics::start_metrics_server(config.metrics.steward_endpoint).await }src/proposals.rs (6)
25-25: Added metrics imports for proposal operations.These metrics will help track GRPC connection issues, request counts, and scheduling problems.
127-132: Enhanced error handling with metrics for GRPC connections.This change adds proper error tracking for GRPC connection failures, which improves observability for debugging connectivity issues.
134-134: Added request tracking metrics.This counter will provide visibility into the volume of GRPC requests being made, which is useful for capacity planning and performance monitoring.
378-388: Added metrics for scheduling confirmation failures.The code now tracks when scheduling confirmations fail, which provides better observability into retry patterns. This is important for understanding reliability issues.
391-401: Consistent metrics for Axelar scheduling confirmations.Similar to the Ethereum implementation, this adds metrics tracking for Axelar scheduling confirmations, ensuring consistent observability across different chain types.
414-414: Added metrics for scheduling failures.This metric will help track the overall rate of scheduling failures, which is critical for monitoring system health.
src/metrics.rs (3)
1-5: Well-structured imports.Clean and minimal imports covering the necessary dependencies for metrics and HTTP server functionality.
6-61: Well-organized Prometheus metrics with clear naming conventions.The metrics are logically grouped by functionality (proposals, trust state, GRPC, etc.) and follow Prometheus naming conventions with the
steward_prefix and_totalsuffix. Each counter has clear and descriptive help text.
1-78:Details
❓ Verification inconclusive
Security consideration: Metrics endpoint lacks authentication.
The metrics endpoint is exposed without authentication. While this is a common pattern with Prometheus, ensure that access to this endpoint is restricted at the network level to prevent unauthorized access to potentially sensitive performance data.
Consider whether the metrics server should:
- Be bound to localhost only if it's meant for local scraping
- Implement authentication if exposed to a wider network
- Be accessed through a reverse proxy that handles authentication
🏁 Script executed:
#!/bin/bash # Check if there's any authentication middleware or network restrictions defined elsewhere in the codebase echo "Checking for potential authentication middleware..." rg "auth.*middleware" -n echo "Checking for network configuration related to metrics..." rg -n "metrics.*addr|addr.*metrics"Length of output: 390
Security Notice: Metrics Endpoint Lacks In-App Authentication
- The metrics endpoint in
src/metrics.rsdoes not include any built-in authentication.- A repository-wide search for authentication middleware or network-level restrictions did not reveal any additional protection.
- If this endpoint is exposed beyond a secure internal environment, please consider one of the following:
- Binding the metrics server to
localhost(or another internal address) if it is solely for local scraping.- Implementing authentication directly on the endpoint.
- Routing requests through a reverse proxy that enforces proper access controls.
Please verify that your deployment strategy incorporates appropriate network-level restrictions to safeguard sensitive performance data.
| async fn metrics_handler() -> impl IntoResponse { | ||
| prometheus::TextEncoder::new() | ||
| .encode_to_string(&prometheus::gather()) | ||
| .unwrap() | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Replace unwrap() with proper error handling in metrics_handler.
The unwrap() call could lead to panics in production if encoding fails.
async fn metrics_handler() -> impl IntoResponse {
- prometheus::TextEncoder::new()
- .encode_to_string(&prometheus::gather())
- .unwrap()
+ match prometheus::TextEncoder::new().encode_to_string(&prometheus::gather()) {
+ Ok(metrics) => metrics,
+ Err(e) => {
+ tracing::error!("Failed to encode metrics: {}", e);
+ "# ERROR: Failed to encode metrics".to_string()
+ }
+ }
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async fn metrics_handler() -> impl IntoResponse { | |
| prometheus::TextEncoder::new() | |
| .encode_to_string(&prometheus::gather()) | |
| .unwrap() | |
| } | |
| async fn metrics_handler() -> impl IntoResponse { | |
| match prometheus::TextEncoder::new().encode_to_string(&prometheus::gather()) { | |
| Ok(metrics) => metrics, | |
| Err(e) => { | |
| tracing::error!("Failed to encode metrics: {}", e); | |
| "# ERROR: Failed to encode metrics".to_string() | |
| } | |
| } | |
| } |
| pub async fn start_metrics_server(addr: SocketAddr) -> JoinHandle<()> { | ||
| let app = Router::new().route("/metrics", get(metrics_handler)); | ||
|
|
||
| tokio::spawn(async move { | ||
| axum::Server::bind(&addr) | ||
| .serve(app.into_make_service()) | ||
| .await | ||
| .unwrap(); | ||
| }) | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Improve error handling in start_metrics_server and add logging.
The server start function uses unwrap() which could cause the application to panic if the server fails to start. It also lacks logging for server operations.
pub async fn start_metrics_server(addr: SocketAddr) -> JoinHandle<()> {
let app = Router::new().route("/metrics", get(metrics_handler));
+ tracing::info!("Starting metrics server on {}", addr);
tokio::spawn(async move {
- axum::Server::bind(&addr)
- .serve(app.into_make_service())
- .await
- .unwrap();
+ if let Err(e) = axum::Server::bind(&addr)
+ .serve(app.into_make_service())
+ .await
+ {
+ tracing::error!("Metrics server error: {}", e);
+ }
})
}Additionally, consider adding a graceful shutdown mechanism to ensure the server can be properly stopped when the application exits.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pub async fn start_metrics_server(addr: SocketAddr) -> JoinHandle<()> { | |
| let app = Router::new().route("/metrics", get(metrics_handler)); | |
| tokio::spawn(async move { | |
| axum::Server::bind(&addr) | |
| .serve(app.into_make_service()) | |
| .await | |
| .unwrap(); | |
| }) | |
| } | |
| pub async fn start_metrics_server(addr: SocketAddr) -> JoinHandle<()> { | |
| let app = Router::new().route("/metrics", get(metrics_handler)); | |
| tracing::info!("Starting metrics server on {}", addr); | |
| tokio::spawn(async move { | |
| if let Err(e) = axum::Server::bind(&addr) | |
| .serve(app.into_make_service()) | |
| .await | |
| { | |
| tracing::error!("Metrics server error: {}", e); | |
| } | |
| }) | |
| } |
Summary by CodeRabbit