refactor(metrics): migrate ffi histogram recording to macros by rkuris · Pull Request #1712 · ava-labs/firewood

rkuris · 2026-02-24T22:05:23Z

Why this should be merged

Centralize the histogram code and follow the same pattern as used for the counters and timing metrics.

How this works

Brand new macro. I kept "expensive" as a specific option because I think we might want some histograms even if expensive is disabled.

How this was tested

Just unit tests and CI.

Elvis339

A Prometheus histogram already provides _sum and _count, so the manual counters are redundant:

rate(ffi_commit_ms_sum[5m]) replaces rate(ffi_commit_ms[5m])
rate(ffi_commit_ms_count[5m]) replaces rate(ffi_commit[5m])

COMMIT_MS, COMMIT_COUNT, PROPOSE_MS, PROPOSE_COUNT, BATCH_MS, BATCH_COUNT can all go, along with the token: impl FnOnce(Duration) callback that only exists to feed them.

The bucket range also needs fixing. It caps at 50ms, but non-deferred commits run 280-360ms and proposes run 50-200ms everything piles into +Inf and the distribution tells us nothing.

Since this PR is establishing the macro as the standard approach, worth getting the buckets right before this ships.

One naming note: the awkward ffi_commit_ms_bucket_sum (double _bucket) comes from the histogram being named ffi.commit_ms_bucket. Renaming it to something like ffi.commit_latency_ms would clean that up.

Elvis339 · 2026-02-25T20:02:37Z

ffi/src/proposal.rs

The #[histogram_timing] macro on line 76 wraps the entire function, so the histogram

(ffi.commit_ms_bucket) measures: cached view promotion + proposal.commit() + cached view clearing + token callback.

But the manual timer on line 93 only measures proposal.commit(), and that's what gets passed to the counter (ffi.commit_ms).

After this change, the counter and histogram for the same "commit" operation measure different scopes. Deriving average latency from the counter versus the histogram's _sum / _count will get different numbers with no obvious explanation.

Ah actually I see what you're talking about, ignore my prior comment.

Resolved in latest push

ffi/src/handle.rs

Elvis339

Who owns the bucket configuration for these histograms, and how was it decided?

Latency distributions are likely different.

Elvis339 · 2026-02-27T17:10:47Z

metrics/src/lib.rs

+#[macro_export]
+macro_rules! fwd_timed_result {
+    ($name:expr, expensive, $expr:expr) => {{
+        let __fwd_start = ::coarsetime::Instant::now();


Should we replace coarsetime with std::time this would partially solve. #1720

This is done under another PR.

Can you link it here #1736 please? I couldn't find it.

Sorry I mis-spoke. I meant to say "should be done under another PR". I think we're looking for numbers before deciding to make this change though.

Elvis339 · 2026-02-27T17:18:10Z

metrics/src/lib.rs

+        let __fwd_result = $expr;
+        let __fwd_elapsed = __fwd_start.elapsed();
+        if __fwd_result.is_ok() {
+            $crate::fwd_histogram_record!($name, __fwd_elapsed.as_f64() * 1000.0, expensive);


What's the rationale for __fwd_elapsed.as_f64() * 1000.0? Could we replace it with __fwd_elapsed..as_millis()?

This was pre-existing and probably broken. It could use another issue as well.

Seems like this is related: #1735

Elvis339 · 2026-02-27T17:19:48Z

metrics/src/lib.rs

+/// let value = result?;
+/// ```
+#[macro_export]
+macro_rules! fwd_timed_result {


Would it make sense to split this into fwd_timed_result! (always records) and fwd_expensive_timed_result!? It's cleaner and no ambiguity about argument order.

Elvis339 · 2026-02-27T17:24:11Z

firewood-macros/src/lib.rs

+/// }
+/// ```
+#[proc_macro_attribute]
+pub fn histogram_timing(args: TokenStream, input: TokenStream) -> TokenStream {


Is this still needed? It seems like it isn’t used anywhere.

If it's needed and you want to keep it, clarify in the docs that it measures the whole function body, not a specific sub-expression unlike fwd_timed_result!, which can wrap only the part you care about. That difference matters when deciding between the attribute and the macro.

Elvis339 · 2026-02-27T17:35:48Z

ffi/src/proposal.rs

+        let (proposal_result, propose_time) =
+            fwd_timed_result!(crate::registry::PROPOSE_MS_BUCKET, expensive, f());
+        let proposal = proposal_result?;
        firewood_increment!(crate::registry::PROPOSE_MS, propose_time.as_millis());
        firewood_increment!(crate::registry::PROPOSE_COUNT, 1);


We accidentally built a manual histogram. PROPOSE_MS accumulates total latency, PROPOSE_COUNT tracks observations that's exactly what _sum and _count give us for free on any histogram.

Then we added an actual histogram and put it behind the expensive flag, so the worse version runs always and the better one would get dropped if expensive flag is off.

The histogram is strictly better because counters only give you averages, which hide tail latency. With the histogram we can answer questions like "what % of proposes exceeded X ms?"

Elvis339 · 2026-02-27T17:47:32Z

metrics/src/lib.rs

 /// ```
 #[macro_export]
-macro_rules! firewood_record {
+macro_rules! fwd_histogram_record {


This looks like an unnecessary abstraction over metrics::histogram! and it drops label support in the process.

Callsites could just use metrics::histogram! directly.

This looks like an unnecessary abstraction over metrics::histogram! and it drops label support in the process.

Callsites could just use metrics::histogram! directly.

This was just a rename, not a redesign of this macro. Changing this macro should be done under another PR.

Let me know and I can open an issue that describes all these problems for followups.

Yes, let's add it here.

#1736

Elvis339 · 2026-02-27T17:54:52Z

ffi/src/handle.rs

-
-        let elapsed = start_time.elapsed();
+        let (root_hash_result, elapsed) =
+            fwd_timed_result!(crate::registry::BATCH_MS_BUCKET, expensive, {


Same comment as in propsal.

Elvis339 · 2026-02-27T17:56:53Z

ffi/src/proposal.rs

-        let proposal = f()?;
-        let propose_time = start_time.elapsed();
+        let (proposal_result, propose_time) =
+            fwd_timed_result!(crate::registry::PROPOSE_MS_BUCKET, expensive, f());


_BUCKET is redundant rename PROPOSE_MS_BUCKET to just PROPOSE_MS and remove PROPOSE_COUNT. See comment below for rationale.

rkuris · 2026-03-04T06:15:38Z

I don't want this PR to turn into a "fix every metric issue" so I'm going to push back some here.

A Prometheus histogram already provides _sum and _count, so the manual counters are redundant:

rate(ffi_commit_ms_sum[5m]) replaces rate(ffi_commit_ms[5m])

rate(ffi_commit_ms_count[5m]) replaces rate(ffi_commit[5m])

COMMIT_MS, COMMIT_COUNT, PROPOSE_MS, PROPOSE_COUNT, BATCH_MS, BATCH_COUNT can all go, along with the token: impl FnOnce(Duration) callback that only exists to feed them.

This problem existed before this PR was started, and as such is unrelated to creating this macro. Removing these extra counters can (and should) be done under a separate PR.

The bucket range also needs fixing. It caps at 50ms, but non-deferred commits run 280-360ms and proposes run 50-200ms everything piles into +Inf and the distribution tells us nothing.

Also a pre-existing problem. Looks like @PlatCore is looking to fix this in a separate PR.

Since this PR is establishing the macro as the standard approach, worth getting the buckets right before this ships.

Let's separate the macroization from the fixes. I think the fixes will be easier once we have the macro in place.

One naming note: the awkward ffi_commit_ms_bucket_sum (double _bucket) comes from the histogram being named ffi.commit_ms_bucket. Renaming it to something like ffi.commit_latency_ms would clean that up.

Also pre-existing.

Elvis339 · 2026-03-04T14:04:38Z

I don't want this PR to turn into a "fix every metric issue" so I'm going to push back some here.

A Prometheus histogram already provides _sum and _count, so the manual counters are redundant:

rate(ffi_commit_ms_sum[5m]) replaces rate(ffi_commit_ms[5m])

rate(ffi_commit_ms_count[5m]) replaces rate(ffi_commit[5m])

COMMIT_MS, COMMIT_COUNT, PROPOSE_MS, PROPOSE_COUNT, BATCH_MS, BATCH_COUNT can all go, along with the token: impl FnOnce(Duration) callback that only exists to feed them.

This problem existed before this PR was started, and as such is unrelated to creating this macro. Removing these extra counters can (and should) be done under a separate PR.

The bucket range also needs fixing. It caps at 50ms, but non-deferred commits run 280-360ms and proposes run 50-200ms everything piles into +Inf and the distribution tells us nothing.

Also a pre-existing problem. Looks like @PlatCore is looking to fix this in a separate PR.

Since this PR is establishing the macro as the standard approach, worth getting the buckets right before this ships.

Let's separate the macroization from the fixes. I think the fixes will be easier once we have the macro in place.

One naming note: the awkward ffi_commit_ms_bucket_sum (double _bucket) comes from the histogram being named ffi.commit_ms_bucket. Renaming it to something like ffi.commit_latency_ms would clean that up.

Also pre-existing.

I opened #1736, feel free to edit if needed. My intent is to have these things centralized for simpler tracking.

demosdemon · 2026-03-09T15:06:02Z

ffi/src/proposal.rs

 #[derive(Debug)]
 pub struct CreateProposalResult<'db> {
    pub handle: ProposalHandle<'db>,
-    pub start_time: coarsetime::Instant,
 }


You can get rid of this struct and return the ProposalHandle directly instead. The only reason I made a struct for this was to include the time field. But, that's no longer needed with this change.

We need to scope the histogram correctly, this should be correct.

rkuris requested a review from Elvis339 February 24, 2026 22:11

rkuris marked this pull request as ready for review February 24, 2026 22:30

rkuris requested review from RodrigoVillar, alarso16, bernard-avalabs and demosdemon as code owners February 24, 2026 22:30

rkuris added the observability logging, tracing and metrics label Feb 24, 2026

rkuris self-assigned this Feb 24, 2026

Elvis339 requested changes Feb 25, 2026

View reviewed changes

Elvis339 mentioned this pull request Feb 25, 2026

Clock source mismatch between Go and Rust FFI causes incorrect commit latency metrics #1720

Open

Elvis339 requested changes Feb 27, 2026

View reviewed changes

rkuris force-pushed the rkuris/histogram-cleanups branch from f9bdddb to 1b725af Compare March 3, 2026 23:57

Elvis339 mentioned this pull request Mar 4, 2026

Histogram metrics followup #1736

Open

3 tasks

rkuris force-pushed the rkuris/histogram-cleanups branch from 1b725af to ce9ea76 Compare March 4, 2026 15:57

rkuris requested a review from Elvis339 March 4, 2026 16:02

Elvis339 mentioned this pull request Mar 6, 2026

fix: clock source miss-match between Rust and Go #1742

Draft

5 tasks

Elvis339 approved these changes Mar 6, 2026

View reviewed changes

demosdemon approved these changes Mar 9, 2026

View reviewed changes

rkuris force-pushed the rkuris/histogram-cleanups branch 2 times, most recently from 62460e5 to 0efd284 Compare March 9, 2026 21:32

rkuris enabled auto-merge (squash) March 9, 2026 21:35

Address reviewer comments

7d00124

We need to scope the histogram correctly, this should be correct.

rkuris force-pushed the rkuris/histogram-cleanups branch from 0efd284 to 7d00124 Compare March 9, 2026 21:35

rkuris merged commit a293cd3 into main Mar 9, 2026
48 checks passed

rkuris deleted the rkuris/histogram-cleanups branch March 9, 2026 21:43

Conversation

rkuris commented Feb 24, 2026

Why this should be merged

How this works

How this was tested

Uh oh!

Elvis339 left a comment

Choose a reason for hiding this comment

Uh oh!

Elvis339 Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Elvis339 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rkuris Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Elvis339 Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rkuris commented Mar 4, 2026

Uh oh!

Elvis339 commented Mar 4, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Elvis339 Feb 25, 2026 •

edited

Loading

rkuris Mar 4, 2026 •

edited

Loading

Elvis339 Mar 4, 2026 •

edited

Loading