Conversation

@MasterPtato
Contributor

@MasterPtato MasterPtato commented Jan 12, 2026

Fixes RVT-5331

Contributor Author

MasterPtato commented Jan 12, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@linear

linear bot commented Jan 12, 2026

@claude

claude bot commented Jan 12, 2026

PR Review: Add metrics to Epoxy

Summary

This PR adds comprehensive Prometheus metrics to the Epoxy consensus system. The implementation covers consensus operations, message handling, quorum tracking, replica state, cluster health, and error conditions.

✅ Strengths

  1. Comprehensive coverage - Metrics cover all critical aspects of the EPaxos consensus protocol
  2. Well-organized - Metrics are logically grouped with clear MARK comments
  3. Proper metric types - Correct use of counters, gauges, and histograms
  4. Good label cardinality - Labels are well-chosen without creating cardinality explosions
  5. Consistent naming - All metrics follow the epoxy_* naming convention

🔍 Issues & Suggestions

Critical Issues

1. Error classification mismatch in proposals (ops/propose.rs:94-99)

The proposal status metric incorrectly classifies CommandError as an error:

metrics::PROPOSALS_TOTAL
    .with_label_values(&[if let ProposalResult::Committed = res {
        "ok"
    } else {
        "err"
    }])
    .inc();

According to the code comments at ops/propose.rs:18-20:

Command errors indicate that a proposal succeeded but the command did not apply.
Proposals that have command errors are still written to the log but have no effect.

Issue: CommandError represents a successful consensus operation with an expected application-level failure (e.g., optimistic locking failure). Tagging it as "err" will make it appear as if consensus failed, which is misleading for monitoring.

Recommendation: Change the status label to distinguish three cases:

metrics::PROPOSALS_TOTAL
    .with_label_values(&[match &res {
        ProposalResult::Committed => "committed",
        ProposalResult::CommandError(_) => "command_error",
        ProposalResult::ConsensusFailed => "consensus_failed",
    }])
    .inc();

This provides better observability by separating:

  • committed - Full success
  • command_error - Consensus succeeded but command didn't apply (expected in optimistic concurrency scenarios)
  • consensus_failed - Actual consensus failure
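To keep the label set fixed and easy to test, the mapping could live in a small helper. This is a minimal sketch, not the PR's code: the `ProposalResult` definition below is a hypothetical stand-in (its variant payloads are assumed), and `proposal_status_label` is an illustrative name.

```rust
// Hypothetical stand-in for the real ProposalResult; variant payloads are assumed.
#[derive(Debug)]
pub enum ProposalResult {
    Committed,
    CommandError(String),
    ConsensusFailed,
}

/// Maps a proposal outcome to one of three fixed, low-cardinality status labels.
pub fn proposal_status_label(res: &ProposalResult) -> &'static str {
    match res {
        ProposalResult::Committed => "committed",
        // Consensus succeeded; only the command failed to apply.
        ProposalResult::CommandError(_) => "command_error",
        ProposalResult::ConsensusFailed => "consensus_failed",
    }
}
```

The call site would then pass `proposal_status_label(&res)` as the single label value to `metrics::PROPOSALS_TOTAL`. Because the match is exhaustive, adding a new variant later forces a decision about its label at compile time.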

2. Commit error classification (replica/messages/commit.rs:43-45)

Similar issue with commit metrics:

metrics::COMMIT_TOTAL
    .with_label_values(&[if cmd_err.is_none() { "ok" } else { "cmd_err" }])
    .inc();

This is better than the proposal metric, but the two label schemes are inconsistent with each other. The commit always succeeds at the consensus level even if there's a command error, so "ok" versus "cmd_err" can read as success versus failure. Consider using labels like "success" / "command_error" to make it clear this is tracking successful commits, with command errors being a subset.

Medium Priority

3. Missing error details in request metrics (replica/message_request.rs:35-36)

metrics::REQUESTS_TOTAL
    .with_label_values(&[request_type, if res.is_ok() { "ok" } else { "err" }])
    .inc();

Issue: All errors are bucketed together. For a distributed consensus system, different error types have different operational implications (network failures, ballot rejections, validation failures, etc.).

Recommendation: Consider adding error type information, though be careful about cardinality. At minimum, you could add a separate counter for specific critical error types.
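One way to add error detail without unbounded cardinality is to map each error to a small, fixed set of label strings. The sketch below assumes a hypothetical `RequestError` enum, since the real Epoxy error type is not shown in this review; the category names are illustrative only.

```rust
// Hypothetical error categories; the real Epoxy error type is not shown here.
#[derive(Debug)]
pub enum RequestError {
    Network(String),
    InvalidBallot,
    Validation(String),
    Other(String),
}

/// Collapses an error into a fixed label so that free-form error text
/// never becomes a label value.
pub fn error_kind_label(err: &RequestError) -> &'static str {
    match err {
        RequestError::Network(_) => "network",
        RequestError::InvalidBallot => "invalid_ballot",
        RequestError::Validation(_) => "validation",
        RequestError::Other(_) => "other",
    }
}

/// Produces the result label for a request: "ok" or a bounded error kind.
pub fn request_result_label<T>(res: &Result<T, RequestError>) -> &'static str {
    match res {
        Ok(_) => "ok",
        Err(err) => error_kind_label(err),
    }
}
```

With this shape the number of time series per request type is bounded by the number of enum variants plus one, regardless of what the error messages contain.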

4. Quorum metric placement (http_client.rs:95-104)

The quorum metric is recorded after the function returns successful responses, but there's no failure path metric if an error occurs before reaching that point.

Issue: If fanout_to_replicas returns an error (e.g., at line 106), the metric won't be recorded, creating a blind spot in monitoring.

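Recommendation: Rust has no finally block, so bind the result to a variable and record the metric before returning it (or use a defer-like guard that records on drop). Either way the metric is recorded on both the success and failure paths. You could use a pattern like: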

let result = /* ... */;
let status = if result.is_ok() { "ok" } else { "error" };
metrics::QUORUM_ATTEMPTS_TOTAL
    .with_label_values(&[quorum_type.to_string().as_str(), status])
    .inc();
result
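If the function has several early returns via `?`, a guard that records on `Drop` is an alternative to binding the result. This is a rough sketch under stated assumptions: the two atomics stand in for the labeled Prometheus counter, and `QuorumMetricGuard` and `fanout` are hypothetical names, not the PR's API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in counters; real code would increment the labeled Prometheus counter.
pub static QUORUM_OK: AtomicU64 = AtomicU64::new(0);
pub static QUORUM_ERR: AtomicU64 = AtomicU64::new(0);

/// Records the outcome when dropped, so early returns are still counted.
pub struct QuorumMetricGuard {
    succeeded: bool,
}

impl QuorumMetricGuard {
    pub fn new() -> Self {
        Self { succeeded: false }
    }

    /// Call once the quorum has been reached.
    pub fn mark_ok(&mut self) {
        self.succeeded = true;
    }
}

impl Drop for QuorumMetricGuard {
    fn drop(&mut self) {
        let counter = if self.succeeded { &QUORUM_OK } else { &QUORUM_ERR };
        counter.fetch_add(1, Ordering::Relaxed);
    }
}

/// Hypothetical fanout: fails when fewer than `quorum_size` responses arrive.
pub fn fanout(responses: usize, quorum_size: usize) -> Result<usize, String> {
    let mut guard = QuorumMetricGuard::new();
    if responses < quorum_size {
        // Early return: the guard's Drop still records the failure.
        return Err(format!("insufficient responses: {responses}/{quorum_size}"));
    }
    guard.mark_ok();
    Ok(responses)
}
```

The guard defaults to the failure outcome, so any exit path that forgets to call `mark_ok` is counted as an error rather than silently dropped.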

Minor Issues

5. Inconsistent result label naming

The PR uses different label values for similar concepts:

  • PROPOSALS_TOTAL: "ok" / "err"
  • PRE_ACCEPT_TOTAL: "ok" / "invalid_ballot"
  • ACCEPT_TOTAL: "ok" / "invalid_ballot"
  • COMMIT_TOTAL: "ok" / "cmd_err"
  • REQUESTS_TOTAL: "ok" / "err"
  • QUORUM_ATTEMPTS_TOTAL: "ok" / "insufficient_responses"

Recommendation: Standardize on either "ok"/"error" or "success"/"failure" for binary outcomes, and use specific descriptive labels for other cases.

6. Missing histogram for quorum operations

Proposals have a duration histogram (PROPOSAL_DURATION), but quorum operations only have a counter (QUORUM_ATTEMPTS_TOTAL). Quorum latency would be valuable for debugging slow consensus.

Suggestion: Consider adding a QUORUM_DURATION histogram to track how long quorum operations take.
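As an illustration of the timing pattern, the sketch below wraps an operation and observes its duration whether it succeeds or fails. `DurationHistogram` is a minimal stand-in written for this example (cumulative buckets, as in Prometheus), not the Prometheus client type, and `timed` is a hypothetical helper name.

```rust
use std::time::{Duration, Instant};

/// Minimal stand-in for a histogram with cumulative buckets.
pub struct DurationHistogram {
    bounds: Vec<f64>, // upper bounds in seconds, ascending
    counts: Vec<u64>, // counts[i] = observations <= bounds[i]
    total: u64,
}

impl DurationHistogram {
    pub fn new(bounds: Vec<f64>) -> Self {
        let counts = vec![0; bounds.len()];
        Self { bounds, counts, total: 0 }
    }

    /// Adds one observation to every bucket whose bound covers it.
    pub fn observe(&mut self, d: Duration) {
        let secs = d.as_secs_f64();
        for (i, bound) in self.bounds.iter().enumerate() {
            if secs <= *bound {
                self.counts[i] += 1;
            }
        }
        self.total += 1;
    }

    pub fn bucket_count(&self, i: usize) -> u64 {
        self.counts[i]
    }

    pub fn total(&self) -> u64 {
        self.total
    }
}

/// Runs `op` and records its wall-clock duration on success and on failure.
pub fn timed<T, E>(
    hist: &mut DurationHistogram,
    op: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let res = op();
    hist.observe(start.elapsed());
    res
}
```

Recording the duration before returning the result keeps slow failures visible, which is often where quorum latency problems show up first.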

7. Ballot gauge semantics (replica/ballot.rs:53-54)

metrics::BALLOT_EPOCH.set(current_ballot.epoch as i64);
metrics::BALLOT_NUMBER.set(current_ballot.ballot as i64);

These are only updated in increment_ballot, not when ballots are read or validated. This means the metrics might not reflect the true current state if ballots are updated through other paths.

Verification needed: Ensure increment_ballot is the only place ballots can change, or add metric updates to other ballot modification points.
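If other write paths do exist, one option is to route every ballot change through a single setter that also updates the gauges. The sketch below is hypothetical throughout: `BallotStore`, `adopt_if_higher`, and the atomic gauge stand-ins are assumptions for illustration, and whether Epoxy has a second write path is not visible from this review.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Stand-ins for the BALLOT_EPOCH and BALLOT_NUMBER gauges.
pub static BALLOT_EPOCH_GAUGE: AtomicI64 = AtomicI64::new(0);
pub static BALLOT_NUMBER_GAUGE: AtomicI64 = AtomicI64::new(0);

// Hypothetical ballot shape; field names follow the snippet above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Ballot {
    pub epoch: u64,
    pub ballot: u64,
}

pub struct BallotStore {
    current: Ballot,
}

impl BallotStore {
    pub fn new(initial: Ballot) -> Self {
        let mut store = Self { current: initial };
        store.set(initial);
        store
    }

    /// Single write path: every ballot change updates the gauges here.
    pub fn set(&mut self, next: Ballot) {
        self.current = next;
        BALLOT_EPOCH_GAUGE.store(next.epoch as i64, Ordering::Relaxed);
        BALLOT_NUMBER_GAUGE.store(next.ballot as i64, Ordering::Relaxed);
    }

    pub fn increment(&mut self) -> Ballot {
        let next = Ballot {
            epoch: self.current.epoch,
            ballot: self.current.ballot + 1,
        };
        self.set(next);
        next
    }

    /// Hypothetical second path: adopt a higher ballot observed from a peer.
    pub fn adopt_if_higher(&mut self, seen: Ballot) -> bool {
        if (seen.epoch, seen.ballot) > (self.current.epoch, self.current.ballot) {
            self.set(seen);
            true
        } else {
            false
        }
    }
}
```

Keeping `current` private means callers cannot change the ballot without going through `set`, so the gauges cannot drift from the stored value.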

Code Quality

8. Good use of existing reqwest client (http_client.rs:146)

✅ Correctly uses rivet_pools::reqwest::client().await? instead of creating a new client. This follows the CLAUDE.md optimization guidelines.

9. Proper logging style

✅ Log messages follow the guidelines:

  • Lowercase: "updating config" (update_config.rs:13)
  • Structured logging with ? for debug: ?err patterns
  • Complete sentences

10. Import organization

✅ Imports are properly placed at the top of files, not inline within functions (per CLAUDE.md style guide).

Performance Considerations

  • Metric overhead: The metrics should add minimal overhead (at most on the order of microseconds per operation). The consensus protocol involves network I/O and database transactions, so metric recording is negligible in comparison.
  • Label cardinality: Current label cardinality is low and safe. The highest cardinality is request_type which has ~11 values, well within Prometheus best practices.

Security Considerations

No security concerns identified. Metrics don't expose sensitive data.

Test Coverage

The PR doesn't include tests for the metrics. While metrics themselves are typically not unit tested, consider:

  1. Adding integration tests that verify key metrics are recorded during consensus operations
  2. Documenting expected metric behavior in the package README

Documentation

Suggestion: Consider adding a METRICS.md file documenting:

  • What each metric measures
  • Expected values/ranges
  • Alerting thresholds
  • Example PromQL queries for common operational questions

Overall Assessment

This is a solid addition to the Epoxy package that will greatly improve observability. The main issues are around error classification semantics, which could lead to misleading dashboards if not addressed. Once the critical issues are fixed, this will be ready to merge.

Recommendation: Request changes to fix the error classification issues before merging.


Generated by Claude Code

@pkg-pr-new

pkg-pr-new bot commented Jan 12, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3847

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3847

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3847

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3847

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3847

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3847

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3847

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3847

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3847

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3847

commit: b007d1a

@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 9d56f3c to 816b2fc Compare January 13, 2026 00:27
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch 2 times, most recently from 485634a to 964ae0f Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch 2 times, most recently from 02463d5 to 2ae3899 Compare January 13, 2026 01:01
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 964ae0f to f3c17ef Compare January 13, 2026 01:02
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 2ae3899 to 5d200c7 Compare January 13, 2026 01:29
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from f3c17ef to b5999a3 Compare January 13, 2026 01:29
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 5d200c7 to 69795ba Compare January 14, 2026 02:05
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch 2 times, most recently from 27fc351 to 77119ca Compare January 14, 2026 02:53
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch 2 times, most recently from 7c91643 to 40274fa Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 77119ca to 1fe301c Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 40274fa to cb0fad3 Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch 2 times, most recently from 830c32d to dfbb13a Compare January 14, 2026 20:01
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from cb0fad3 to 7a06ae3 Compare January 14, 2026 20:01
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from dfbb13a to 81663b7 Compare January 14, 2026 22:52
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch 2 times, most recently from 5109a19 to 2e32f50 Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 81663b7 to 260900c Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 2e32f50 to b007d1a Compare January 14, 2026 23:07
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 260900c to a4f42a6 Compare January 14, 2026 23:07
3 participants