fix(epoxy): add metrics #3847

MasterPtato · 2026-01-12T22:13:08Z

Fixes RVT-5331

MasterPtato · 2026-01-12T22:13:27Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

linear · 2026-01-12T22:13:39Z

RVT-5331 Add metrics to epoxy

claude · 2026-01-12T22:15:07Z

PR Review: Add metrics to Epoxy

Summary

This PR adds comprehensive Prometheus metrics to the Epoxy consensus system. The implementation covers consensus operations, message handling, quorum tracking, replica state, cluster health, and error conditions.

✅ Strengths

Comprehensive coverage - Metrics cover all critical aspects of the EPaxos consensus protocol
Well-organized - Metrics are logically grouped with clear MARK comments
Proper metric types - Correct use of counters, gauges, and histograms
Good label cardinality - Labels are well-chosen without creating cardinality explosions
Consistent naming - All metrics follow the epoxy_* naming convention

🔍 Issues & Suggestions

Critical Issues

1. Error classification mismatch in proposals (ops/propose.rs:94-99)

The proposal status metric incorrectly classifies CommandError as an error:

metrics::PROPOSALS_TOTAL
    .with_label_values(&[if let ProposalResult::Committed = res {
        "ok"
    } else {
        "err"
    }])
    .inc();

According to the code comments at ops/propose.rs:18-20:

Command errors indicate that a proposal succeeded but the command did not apply.
Proposals that have command errors are still written to the log but have no effect.

Issue: CommandError represents a successful consensus operation with an expected application-level failure (e.g., optimistic locking failure). Tagging it as "err" will make it appear as if consensus failed, which is misleading for monitoring.

Recommendation: Change the status label to distinguish three cases:

metrics::PROPOSALS_TOTAL
    .with_label_values(&[match &res {
        ProposalResult::Committed => "committed",
        ProposalResult::CommandError(_) => "command_error",
        ProposalResult::ConsensusFailed => "consensus_failed",
    }])
    .inc();

This provides better observability by separating:

committed - Full success
command_error - Consensus succeeded but command didn't apply (expected in optimistic concurrency scenarios)
consensus_failed - Actual consensus failure
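
A minimal, self-contained sketch of the three-way mapping, pulled into a helper so it can be unit tested without a metrics registry. The `ProposalResult` enum here is a stand-in that mirrors the variant names used above; the `String` payload on `CommandError` and the helper name `proposal_status_label` are assumptions, not the actual Epoxy definitions.

```rust
// Stand-in for the real type in ops/propose.rs. The payload type is assumed.
#[derive(Debug)]
pub enum ProposalResult {
    Committed,
    CommandError(String),
    ConsensusFailed,
}

/// Maps a proposal outcome to a fixed, low-cardinality status label.
/// (Hypothetical helper; not part of the PR.)
pub fn proposal_status_label(res: &ProposalResult) -> &'static str {
    match res {
        // Consensus succeeded and the command applied.
        ProposalResult::Committed => "committed",
        // Consensus succeeded, but the command had no effect.
        ProposalResult::CommandError(_) => "command_error",
        // Consensus itself failed.
        ProposalResult::ConsensusFailed => "consensus_failed",
    }
}
```

The call site would then pass `proposal_status_label(&res)` as the single label value. Because the match is exhaustive, adding a new variant later forces a decision about its label at compile time.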

2. Commit error classification (replica/messages/commit.rs:43-45)

Similar issue with commit metrics:

metrics::COMMIT_TOTAL
    .with_label_values(&[if cmd_err.is_none() { "ok" } else { "cmd_err" }])
    .inc();

This is better than the proposal metric, but its labels are inconsistent with it. A commit always succeeds at the consensus level, even when there is a command error. Consider labels such as "success" / "command_error" to make it clear that both values count successful commits, with command errors being a subset.

Medium Priority

3. Missing error details in request metrics (replica/message_request.rs:35-36)

metrics::REQUESTS_TOTAL
    .with_label_values(&[request_type, if res.is_ok() { "ok" } else { "err" }])
    .inc();

Issue: All errors are bucketed together. For a distributed consensus system, different error types have different operational implications (network failures, ballot rejections, validation failures, etc.).

Recommendation: Consider adding error type information, though be careful about cardinality. At minimum, you could add a separate counter for specific critical error types.
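
One way to add error type information without unbounded cardinality is to map errors onto a small, fixed set of label values. The sketch below assumes a hypothetical `RequestError` enum; its variants are illustrative and the real Epoxy error types may look different.

```rust
// Hypothetical error categories, for illustration only.
#[derive(Debug)]
pub enum RequestError {
    Network(String),
    InvalidBallot,
    Validation(String),
    Other(String),
}

/// Returns a status label drawn from a closed set, so the number of
/// time series stays bounded no matter what the error message contains.
pub fn request_status_label<T>(res: &Result<T, RequestError>) -> &'static str {
    match res {
        Ok(_) => "ok",
        Err(RequestError::Network(_)) => "network",
        Err(RequestError::InvalidBallot) => "invalid_ballot",
        Err(RequestError::Validation(_)) => "validation",
        // Catch-all bucket keeps unknown errors from creating new labels.
        Err(RequestError::Other(_)) => "other",
    }
}
```

The error message itself never becomes a label value, which is what keeps cardinality predictable.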

4. Quorum metric placement (http_client.rs:95-104)

The quorum metric is recorded only after successful responses have been collected; nothing is recorded if an error occurs before that point.

Issue: If fanout_to_replicas returns an error (e.g., at line 106), the metric won't be recorded, creating a blind spot in monitoring.

Recommendation: Rust has no finally block, so either capture the result before returning and record the metric from it, or use a drop guard so the metric is recorded on every exit path. You could use a pattern like:

let result = /* ... */;
let status = if result.is_ok() { "ok" } else { "error" };
metrics::QUORUM_ATTEMPTS_TOTAL
    .with_label_values(&[quorum_type.to_string().as_str(), status])
    .inc();
result
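
The defer-like alternative can be sketched with a drop guard. This is an illustration under stated assumptions: `LabeledCounter` is an in-memory stand-in for the real counter, and `AttemptGuard` and `collect_quorum` are hypothetical names, not functions from the PR.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// In-memory stand-in for a labeled counter such as QUORUM_ATTEMPTS_TOTAL.
#[derive(Default)]
pub struct LabeledCounter {
    counts: Mutex<HashMap<String, u64>>,
}

impl LabeledCounter {
    pub fn inc(&self, status: &str) {
        *self.counts.lock().unwrap().entry(status.to_string()).or_insert(0) += 1;
    }

    pub fn get(&self, status: &str) -> u64 {
        self.counts.lock().unwrap().get(status).copied().unwrap_or(0)
    }
}

/// Records exactly one attempt when dropped, including on early returns.
pub struct AttemptGuard<'a> {
    counter: &'a LabeledCounter,
    status: &'static str,
}

impl<'a> AttemptGuard<'a> {
    pub fn new(counter: &'a LabeledCounter) -> Self {
        // Assume failure until the caller explicitly marks success.
        Self { counter, status: "error" }
    }

    pub fn mark_ok(&mut self) {
        self.status = "ok";
    }
}

impl Drop for AttemptGuard<'_> {
    fn drop(&mut self) {
        self.counter.inc(self.status);
    }
}

/// Hypothetical quorum call: fails when fewer than `needed` responses arrive.
pub fn collect_quorum(
    counter: &LabeledCounter,
    responses: usize,
    needed: usize,
) -> Result<usize, String> {
    let mut guard = AttemptGuard::new(counter);
    if responses < needed {
        // Early return: the guard still records "error" when it is dropped.
        return Err(format!("got {responses} of {needed} responses"));
    }
    guard.mark_ok();
    Ok(responses)
}
```

Defaulting the guard to "error" means a forgotten `mark_ok` shows up as an inflated error count rather than as a silent gap in the metric.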

Minor Issues

5. Inconsistent result label naming

The PR uses different label values for similar concepts:

PROPOSALS_TOTAL: "ok" / "err"
PRE_ACCEPT_TOTAL: "ok" / "invalid_ballot"
ACCEPT_TOTAL: "ok" / "invalid_ballot"
COMMIT_TOTAL: "ok" / "cmd_err"
REQUESTS_TOTAL: "ok" / "err"
QUORUM_ATTEMPTS_TOTAL: "ok" / "insufficient_responses"

Recommendation: Standardize on either "ok"/"error" or "success"/"failure" for binary outcomes, and use specific descriptive labels for other cases.

6. Missing histogram for quorum operations

Proposals have a PROPOSAL_DURATION histogram, but quorum operations are only covered by the QUORUM_ATTEMPTS_TOTAL counter. Quorum latency would be valuable for debugging slow consensus.

Suggestion: Consider adding a QUORUM_DURATION histogram to track how long quorum operations take.
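
A rough sketch of what timing a quorum operation could look like. `DurationHistogram` is a simplified stand-in using only the standard library (cumulative buckets, as Prometheus histograms expose them), and `timed` is a hypothetical wrapper; the bucket bounds are arbitrary and none of this is taken from the PR.

```rust
use std::time::{Duration, Instant};

/// Minimal cumulative-bucket histogram standing in for a real metrics type.
pub struct DurationHistogram {
    // Upper bounds in seconds, in ascending order.
    bounds: Vec<f64>,
    // bucket_counts[i] counts observations <= bounds[i] (cumulative).
    bucket_counts: Vec<u64>,
    count: u64,
    sum: f64,
}

impl DurationHistogram {
    pub fn new(bounds: Vec<f64>) -> Self {
        let n = bounds.len();
        Self { bounds, bucket_counts: vec![0; n], count: 0, sum: 0.0 }
    }

    pub fn observe(&mut self, d: Duration) {
        let secs = d.as_secs_f64();
        for (i, bound) in self.bounds.iter().enumerate() {
            if secs <= *bound {
                self.bucket_counts[i] += 1;
            }
        }
        self.count += 1;
        self.sum += secs;
    }

    pub fn bucket(&self, i: usize) -> u64 {
        self.bucket_counts[i]
    }

    pub fn count(&self) -> u64 {
        self.count
    }

    pub fn sum_secs(&self) -> f64 {
        self.sum
    }
}

/// Times `op` and records its duration whether it succeeds or fails.
pub fn timed<T, E>(
    hist: &mut DurationHistogram,
    op: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let res = op();
    hist.observe(start.elapsed());
    res
}
```

Recording the duration before inspecting the result keeps slow failures visible, which is often where quorum latency problems first show up.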

7. Ballot gauge semantics (replica/ballot.rs:53-54)

metrics::BALLOT_EPOCH.set(current_ballot.epoch as i64);
metrics::BALLOT_NUMBER.set(current_ballot.ballot as i64);

These are only updated in increment_ballot, not when ballots are read or validated. This means the metrics might not reflect the true current state if ballots are updated through other paths.

Verification needed: Ensure increment_ballot is the only place ballots can change, or add metric updates to other ballot modification points.
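
If other mutation paths do exist, one option is to route every ballot write through a single private setter that also updates the gauges. The sketch below uses atomics as stand-ins for the two gauges; `BallotState`, `store`, and `adopt` are hypothetical names, and the `u64` field types are an assumption.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Stand-ins for the BALLOT_EPOCH and BALLOT_NUMBER gauges.
pub static BALLOT_EPOCH: AtomicI64 = AtomicI64::new(0);
pub static BALLOT_NUMBER: AtomicI64 = AtomicI64::new(0);

// Field types are assumed; the source only shows them being cast to i64.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Ballot {
    pub epoch: u64,
    pub ballot: u64,
}

/// Hypothetical holder that keeps the ballot private so every write
/// has to go through `store`.
pub struct BallotState {
    current: Ballot,
}

impl BallotState {
    pub fn new(initial: Ballot) -> Self {
        let mut state = Self { current: initial };
        state.store(initial);
        state
    }

    /// Single write path: updates the ballot and both gauges together.
    fn store(&mut self, next: Ballot) {
        self.current = next;
        BALLOT_EPOCH.store(next.epoch as i64, Ordering::Relaxed);
        BALLOT_NUMBER.store(next.ballot as i64, Ordering::Relaxed);
    }

    pub fn increment(&mut self) -> Ballot {
        let next = Ballot { ballot: self.current.ballot + 1, ..self.current };
        self.store(next);
        next
    }

    /// Adopts a higher ballot seen elsewhere (an assumed second mutation path).
    pub fn adopt(&mut self, seen: Ballot) {
        if (seen.epoch, seen.ballot) > (self.current.epoch, self.current.ballot) {
            self.store(seen);
        }
    }
}
```

With the field private and `store` as the only writer, the gauges cannot drift from the stored ballot regardless of which path changed it.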

Code Quality

8. Good use of existing reqwest client (http_client.rs:146)

✅ Correctly uses rivet_pools::reqwest::client().await? instead of creating a new client. This follows the CLAUDE.md optimization guidelines.

9. Proper logging style

✅ Log messages follow the guidelines:

Lowercase: "updating config" (update_config.rs:13)
Structured logging with ? for debug: ?err patterns
Complete sentences

10. Import organization

✅ Imports are properly placed at the top of files, not inline within functions (per CLAUDE.md style guide).

Performance Considerations

Metric overhead: The metrics add minimal overhead (microseconds per operation). The consensus protocol involves network I/O and database transactions, so metric recording is negligible in comparison.
Label cardinality: Current label cardinality is low and safe. The highest cardinality is request_type which has ~11 values, well within Prometheus best practices.

Security Considerations

No security concerns identified. Metrics don't expose sensitive data.

Test Coverage

The PR doesn't include tests for the metrics. While metrics themselves are typically not unit tested, consider:

Adding integration tests that verify key metrics are recorded during consensus operations
Documenting expected metric behavior in the package README

Documentation

Suggestion: Consider adding a METRICS.md file documenting:

What each metric measures
Expected values/ranges
Alerting thresholds
Example PromQL queries for common operational questions

Overall Assessment

This is a solid addition to the Epoxy package that will greatly improve observability. The main issues are around error classification semantics, which could lead to misleading dashboards if not addressed. Once the critical issues are fixed, this will be ready to merge.

Recommendation: Request changes to fix the error classification issues before merging.

Generated by Claude Code

pkg-pr-new · 2026-01-12T22:15:21Z

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3847

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3847

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3847

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3847

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3847

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3847

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3847

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3847

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3847

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3847

commit: b007d1a

MasterPtato requested a review from NathanFlurry January 12, 2026 22:13

NathanFlurry mentioned this pull request Jan 12, 2026

chore(engine): add support for epoxy debugging #3794

Open

MasterPtato mentioned this pull request Jan 12, 2026

chore: clean up errors #3846

Open

MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 9d56f3c to 816b2fc Compare January 13, 2026 00:27

MasterPtato force-pushed the 01-12-chore_clean_up_errors branch 2 times, most recently from 485634a to 964ae0f Compare January 13, 2026 00:44

MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch 2 times, most recently from 02463d5 to 2ae3899 Compare January 13, 2026 01:01

MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 964ae0f to f3c17ef Compare January 13, 2026 01:02

MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 2ae3899 to 5d200c7 Compare January 13, 2026 01:29

MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from f3c17ef to b5999a3 Compare January 13, 2026 01:29

This was referenced Jan 13, 2026

fix: add ready_chunks to worker bumps, bumps per tick metric #3855

Open

fix(guard): clean up guard #3862

Open

fix(otel): enrich http traces #3878

Open

chore(gas): add overview and history docs #3879

Open

MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 5d200c7 to 69795ba Compare January 14, 2026 02:05

MasterPtato force-pushed the 01-12-chore_clean_up_errors branch 2 times, most recently from 27fc351 to 77119ca Compare January 14, 2026 02:53

MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch 2 times, most recently from 7c91643 to 40274fa Compare January 14, 2026 03:06

MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 77119ca to 1fe301c Compare January 14, 2026 03:06

NathanFlurry approved these changes Jan 14, 2026

MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 40274fa to cb0fad3 Compare January 14, 2026 19:47

MasterPtato force-pushed the 01-12-chore_clean_up_errors branch 2 times, most recently from 830c32d to dfbb13a Compare January 14, 2026 20:01

MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from cb0fad3 to 7a06ae3 Compare January 14, 2026 20:01

This was referenced Jan 14, 2026

fix(engine-runner): handle shutdown close codes correctly #3896

Open

fix(pegboard): cache none runner pool errors too #3899

Open

MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from dfbb13a to 81663b7 Compare January 14, 2026 22:52

MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch 2 times, most recently from 5109a19 to 2e32f50 Compare January 14, 2026 23:02

MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 81663b7 to 260900c Compare January 14, 2026 23:02

fix(epoxy): add metrics

b007d1a

MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 2e32f50 to b007d1a Compare January 14, 2026 23:07

MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 260900c to a4f42a6 Compare January 14, 2026 23:07
