refactor(metrics): migrate ffi histogram recording to macros#1712
refactor(metrics): migrate ffi histogram recording to macros#1712
Conversation
Elvis339
left a comment
There was a problem hiding this comment.
A Prometheus histogram already provides _sum and _count, so the manual counters are redundant:
- rate(ffi_commit_ms_sum[5m]) replaces rate(ffi_commit_ms[5m])
- rate(ffi_commit_ms_count[5m]) replaces rate(ffi_commit[5m])
COMMIT_MS, COMMIT_COUNT, PROPOSE_MS, PROPOSE_COUNT, BATCH_MS, BATCH_COUNT can all go, along with the token: impl FnOnce(Duration) callback that only exists to feed them.
The bucket range also needs fixing. It caps at 50ms, but non-deferred commits run 280-360ms and proposes run 50-200ms everything piles into +Inf and the distribution tells us nothing.
Since this PR is establishing the macro as the standard approach, worth getting the buckets right before this ships.
One naming note: the awkward ffi_commit_ms_bucket_sum (double _bucket) comes from the histogram being named ffi.commit_ms_bucket. Renaming it to something like ffi.commit_latency_ms would clean that up.
ffi/src/proposal.rs
Outdated
There was a problem hiding this comment.
The #[histogram_timing] macro on line 76 wraps the entire function, so the histogram
(ffi.commit_ms_bucket) measures: cached view promotion + proposal.commit() + cached view clearing + token callback.
But the manual timer on line 93 only measures proposal.commit(), and that's what gets passed to the counter (ffi.commit_ms).
After this change, the counter and histogram for the same "commit" operation measure different scopes. Deriving average latency from the counter versus the histogram's _sum / _count will get different numbers with no obvious explanation.
There was a problem hiding this comment.
Ah actually I see what you're talking about, ignore my prior comment.
Elvis339
left a comment
There was a problem hiding this comment.
Who owns the bucket configuration for these histograms, and how was it decided?
Latency distributions are likely different.
| #[macro_export] | ||
| macro_rules! fwd_timed_result { | ||
| ($name:expr, expensive, $expr:expr) => {{ | ||
| let __fwd_start = ::coarsetime::Instant::now(); |
There was a problem hiding this comment.
Should we replace coarsetime with std::time this would partially solve. #1720
There was a problem hiding this comment.
This is done under another PR.
There was a problem hiding this comment.
Can you link it here #1736 please? I couldn't find it.
There was a problem hiding this comment.
Sorry I mis-spoke. I meant to say "should be done under another PR". I think we're looking for numbers before deciding to make this change though.
metrics/src/lib.rs
Outdated
| let __fwd_result = $expr; | ||
| let __fwd_elapsed = __fwd_start.elapsed(); | ||
| if __fwd_result.is_ok() { | ||
| $crate::fwd_histogram_record!($name, __fwd_elapsed.as_f64() * 1000.0, expensive); |
There was a problem hiding this comment.
What's the rationale for __fwd_elapsed.as_f64() * 1000.0? Could we replace it with __fwd_elapsed..as_millis()?
There was a problem hiding this comment.
This was pre-existing and probably broken. It could use another issue as well.
| /// let value = result?; | ||
| /// ``` | ||
| #[macro_export] | ||
| macro_rules! fwd_timed_result { |
There was a problem hiding this comment.
Would it make sense to split this into fwd_timed_result! (always records) and fwd_expensive_timed_result!? It's cleaner and no ambiguity about argument order.
firewood-macros/src/lib.rs
Outdated
| /// } | ||
| /// ``` | ||
| #[proc_macro_attribute] | ||
| pub fn histogram_timing(args: TokenStream, input: TokenStream) -> TokenStream { |
There was a problem hiding this comment.
Is this still needed? It seems like it isn’t used anywhere.
If it's needed and you want to keep it, clarify in the docs that it measures the whole function body, not a specific sub-expression unlike fwd_timed_result!, which can wrap only the part you care about. That difference matters when deciding between the attribute and the macro.
| let (proposal_result, propose_time) = | ||
| fwd_timed_result!(crate::registry::PROPOSE_MS_BUCKET, expensive, f()); | ||
| let proposal = proposal_result?; | ||
| firewood_increment!(crate::registry::PROPOSE_MS, propose_time.as_millis()); | ||
| firewood_increment!(crate::registry::PROPOSE_COUNT, 1); |
There was a problem hiding this comment.
We accidentally built a manual histogram. PROPOSE_MS accumulates total latency, PROPOSE_COUNT tracks observations that's exactly what _sum and _count give us for free on any histogram.
Then we added an actual histogram and put it behind the expensive flag, so the worse version runs always and the better one would get dropped if expensive flag is off.
The histogram is strictly better because counters only give you averages, which hide tail latency. With the histogram we can answer questions like "what % of proposes exceeded X ms?"
metrics/src/lib.rs
Outdated
| /// ``` | ||
| #[macro_export] | ||
| macro_rules! firewood_record { | ||
| macro_rules! fwd_histogram_record { |
There was a problem hiding this comment.
This looks like an unnecessary abstraction over metrics::histogram! and it drops label support in the process.
Callsites could just use metrics::histogram! directly.
There was a problem hiding this comment.
This looks like an unnecessary abstraction over metrics::histogram! and it drops label support in the process.
Callsites could just use
metrics::histogram!directly.
This was just a rename, not a redesign of this macro. Changing this macro should be done under another PR.
Let me know and I can open an issue that describes all these problems for followups.
There was a problem hiding this comment.
Yes, let's add it here.
ffi/src/handle.rs
Outdated
|
|
||
| let elapsed = start_time.elapsed(); | ||
| let (root_hash_result, elapsed) = | ||
| fwd_timed_result!(crate::registry::BATCH_MS_BUCKET, expensive, { |
There was a problem hiding this comment.
Same comment as in propsal.
ffi/src/proposal.rs
Outdated
| let proposal = f()?; | ||
| let propose_time = start_time.elapsed(); | ||
| let (proposal_result, propose_time) = | ||
| fwd_timed_result!(crate::registry::PROPOSE_MS_BUCKET, expensive, f()); |
There was a problem hiding this comment.
_BUCKET is redundant rename PROPOSE_MS_BUCKET to just PROPOSE_MS and remove PROPOSE_COUNT. See comment below for rationale.
f9bdddb to
1b725af
Compare
|
I don't want this PR to turn into a "fix every metric issue" so I'm going to push back some here.
This problem existed before this PR was started, and as such is unrelated to creating this macro. Removing these extra counters can (and should) be done under a separate PR.
Also a pre-existing problem. Looks like @PlatCore is looking to fix this in a separate PR.
Let's separate the macroization from the fixes. I think the fixes will be easier once we have the macro in place.
Also pre-existing. |
I opened #1736, feel free to edit if needed. My intent is to have these things centralized for simpler tracking. |
1b725af to
ce9ea76
Compare
| #[derive(Debug)] | ||
| pub struct CreateProposalResult<'db> { | ||
| pub handle: ProposalHandle<'db>, | ||
| pub start_time: coarsetime::Instant, | ||
| } |
There was a problem hiding this comment.
You can get rid of this struct and return the ProposalHandle directly instead. The only reason I made a struct for this was to include the time field. But, that's no longer needed with this change.
62460e5 to
0efd284
Compare
We need to scope the histogram correctly, this should be correct.
0efd284 to
7d00124
Compare
Why this should be merged
Centralize the histogram code and follow the same pattern as used for the counters and timing metrics.
How this works
Brand new macro. I kept "expensive" as a specific option because I think we might want some histograms even if expensive is disabled.
How this was tested
Just unit tests and CI.