-
Notifications
You must be signed in to change notification settings - Fork 146
fix(epoxy): add metrics #3847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 01-12-chore_clean_up_errors
Are you sure you want to change the base?
fix(epoxy): add metrics #3847
Conversation
PR Review: Add metrics to EpoxySummaryThis PR adds comprehensive Prometheus metrics to the Epoxy consensus system. The implementation covers consensus operations, message handling, quorum tracking, replica state, cluster health, and error conditions. ✅ Strengths
🔍 Issues & SuggestionsCritical Issues1. Error classification mismatch in proposals (ops/propose.rs:94-99) The proposal status metric incorrectly classifies metrics::PROPOSALS_TOTAL
.with_label_values(&[if let ProposalResult::Committed = res {
"ok"
} else {
"err"
}])
.inc();According to the code comments at
Issue: Recommendation: Change the status label to distinguish three cases: metrics::PROPOSALS_TOTAL
.with_label_values(&[match &res {
ProposalResult::Committed => "committed",
ProposalResult::CommandError(_) => "command_error",
ProposalResult::ConsensusFailed => "consensus_failed",
}])
.inc();This provides better observability by separating:
2. Commit error classification (replica/messages/commit.rs:43-45) Similar issue with commit metrics: metrics::COMMIT_TOTAL
.with_label_values(&[if cmd_err.is_none() { "ok" } else { "cmd_err" }])
.inc();While this is better than the proposal metric, it's inconsistent. The commit always succeeds at the consensus level even if there's a command error. Consider using labels like Medium Priority3. Missing error details in request metrics (replica/message_request.rs:35-36) metrics::REQUESTS_TOTAL
.with_label_values(&[request_type, if res.is_ok() { "ok" } else { "err" }])
.inc();Issue: All errors are bucketed together. For a distributed consensus system, different error types have different operational implications (network failures, ballot rejections, validation failures, etc.). Recommendation: Consider adding error type information, though be careful about cardinality. At minimum, you could add a separate counter for specific critical error types. 4. Quorum metric placement (http_client.rs:95-104) The quorum metric is recorded after the function returns successful responses, but there's no failure path metric if an error occurs before reaching that point. Issue: If Recommendation: Use a defer-like pattern or move the metric to a finally block to ensure it's always recorded. You could use a pattern like: let result = /* ... */;
let status = if result.is_ok() { "ok" } else { "error" };
metrics::QUORUM_ATTEMPTS_TOTAL
.with_label_values(&[quorum_type.to_string().as_str(), status])
.inc();
resultMinor Issues5. Inconsistent result label naming The PR uses different label values for similar concepts:
Recommendation: Standardize on either 6. Missing histogram for quorum operations You have Suggestion: Consider adding a 7. Ballot gauge semantics (replica/ballot.rs:53-54) metrics::BALLOT_EPOCH.set(current_ballot.epoch as i64);
metrics::BALLOT_NUMBER.set(current_ballot.ballot as i64);These are only updated in Verification needed: Ensure Code Quality8. Good use of existing reqwest client (http_client.rs:146) ✅ Correctly uses 9. Proper logging style ✅ Log messages follow the guidelines:
10. Import organization ✅ Imports are properly placed at the top of files, not inline within functions (per CLAUDE.md style guide). Performance Considerations
Security ConsiderationsNo security concerns identified. Metrics don't expose sensitive data. Test CoverageThe PR doesn't include tests for the metrics. While metrics themselves are typically not unit tested, consider:
DocumentationSuggestion: Consider adding a
Overall AssessmentThis is a solid addition to the Epoxy package that will greatly improve observability. The main issues are around error classification semantics, which could lead to misleading dashboards if not addressed. Once the critical issues are fixed, this will be ready to merge. Recommendation: Request changes to fix the error classification issues before merging. Generated by Claude Code |
More templates
@rivetkit/cloudflare-workers
@rivetkit/db
@rivetkit/framework-base
@rivetkit/next-js
@rivetkit/react
rivetkit
@rivetkit/sql-loader
@rivetkit/virtual-websocket
@rivetkit/engine-runner
@rivetkit/engine-runner-protocol
commit: |
9d56f3c to
816b2fc
Compare
485634a to
964ae0f
Compare
02463d5 to
2ae3899
Compare
964ae0f to
f3c17ef
Compare
2ae3899 to
5d200c7
Compare
f3c17ef to
b5999a3
Compare
5d200c7 to
69795ba
Compare
27fc351 to
77119ca
Compare
7c91643 to
40274fa
Compare
77119ca to
1fe301c
Compare
40274fa to
cb0fad3
Compare
830c32d to
dfbb13a
Compare
cb0fad3 to
7a06ae3
Compare
dfbb13a to
81663b7
Compare
5109a19 to
2e32f50
Compare
81663b7 to
260900c
Compare
2e32f50 to
b007d1a
Compare
260900c to
a4f42a6
Compare

Fixes RVT-5331