New histogram strategy and minimal changes to start emitting it #7240

Groxx · 2025-09-04T21:52:48Z

Part 1 of rewriting our timers to histograms: create and lightly document the new strategy, and start emitting an "interesting" metric, so we can check the results end-to-end.

Once this and other plans are verified, I'll write up a more thorough doc, and we might also choose to adopt a larger refactor before moving most of our metrics.
But for now, this is just step 1, intentionally kept (relatively) small and isolated to make it easier to review.

Most docs are in code, but in a nutshell this commit includes:

a new "subsettable histogram" type, which follows Prometheus/OTEL's "exponential histogram" pattern
a few probably-useful initial bucket sets to get started
some helpers for ^ making those, as tally's exponential histogram tooling turned out to be unacceptably flawed
dual-emitting a couple medium-large timers as histograms, so we can validate other changes (e.g. rewriting alerts and dashboards) and validate guesses on the runtime/storage/query costs

common/metrics/histograms.go

service/worker/archiver/replay_metrics_client.go

common/metrics/histograms_test.go

common/metrics/histograms.go

c-warren · 2025-09-09T20:30:30Z

common/metrics/histograms.go

+// Even 320 is far too many tbh, target a max of 160 for very-high-value data, and generally around 80 or less
+// (ideally a value that is divisible by 2 many times, to allow "clean" subsetting, but this is NOT necessary).
+//
+// The stop callback will be given the current largest value and the number of values generated so far.


I like the stop function documentation here, but I'd recommend moving it to a type:

// StopFunction has nice documentation type StopFunction func(last time.Duration, length int) bool

I think this might increase readability, and I'd recommend adding some examples to the documentation (e.g the ones above) just for extra clarity.

For me - why wouldn't we always force a user to a power of 2, if we're going to fill it post-fact anyway?

NIT: Would maxDuration or maxDurationNanos make sense rather than last as a parameter? I guess we would only want to do that if we were planning to force a specific comparison, which the stop function currently doesn't do.

force a clean power of 2: yea, I think I like that better too. will do 👍

maxDuration / maxDurationNanos: not sure what you mean here tbh. this isn't called once, it's called per value, and it's already a time.Duration

re docs on a type: I don't believe any IDEs will show that to you as you fill in the argument, unless you make a var x StopFunction variable - function arg type docs are generally not exposed, just the names and the "parent" func's docs:

goland while entering the arg:

vscode:

and generally for small funcs like this I think having to look at two locations to figure out how to call the callback-needing func is worse...

... though I see IDEs are finally showing you the signature inline, rather than just the name, so that's not too bad (unless this is a local LLM, then it's awful because that's not reliable):

in principle I kinda agree, the more types the merrier. ergonomics are just not all that good tho for this kind of case imo, unless there's a lot of reuse or func-as-var usage (where the type would prevent mixing up types) :\

Understood re: type and am happy with the more ergonomic editor usage.

maxDuration / maxDurationNanos: not sure what you mean here tbh. this isn't called once, it's called per value, and it's already a time.Duration

I had a couple of mystery logical leaps in my thinking here. At the moment the stop function gives the user arbitrary power to define how they want to stop, given the duration of the most recently appended bucket + how many buckets exist.

All the examples are just comparisons with:

the length of the list

whether the 'last' duration is greater than or equal to some const

If that's all we want to support, I'd suggest creating the stop function for the user by giving them a method that allows them to specify a max value + a max length:

func CreateStopFunction(maxDuration time.Duration, maxLength int) StopFunction { return func(last time.Duration, l int) bool { return last >= maxDuration && l >= maxLength } }

Or just allowing them to specify maxDuration/maxLength and doing that comparison where the existing stop function is called.

if I'm going to CreateStopFunction, I'll just have those two args passed directly and use it that way internally.

... which I think you're convincing me of, in side-chat. lemme see how it looks 👍

updated! I like this better, thanks for bringing it up and pushing on me :)

common/metrics/histograms.go

c-warren · 2025-09-09T22:34:50Z

common/metrics/histograms.go

+// Even 320 is far too many tbh, target a max of 160 for very-high-value data, and generally around 80 or less
+// (ideally a value that is divisible by 2 many times, to allow "clean" subsetting, but this is NOT necessary).
+//
+// The stop callback will be given the current largest value and the number of values generated so far.


Understood re: type and am happy with the more ergonomic editor usage.

maxDuration / maxDurationNanos: not sure what you mean here tbh. this isn't called once, it's called per value, and it's already a time.Duration

I had a couple of mystery logical leaps in my thinking here. At the moment the stop function gives the user arbitrary power to define how they want to stop, given the duration of the most recently appended bucket + how many buckets exist.

All the examples are just comparisons with:

the length of the list

whether the 'last' duration is greater than or equal to some const

If that's all we want to support, I'd suggest creating the stop function for the user by giving them a method that allows them to specify a max value + a max length:

func CreateStopFunction(maxDuration time.Duration, maxLength int) StopFunction { return func(last time.Duration, l int) bool { return last >= maxDuration && l >= maxLength } }

Or just allowing them to specify maxDuration/maxLength and doing that comparison where the existing stop function is called.

common/metrics/histograms.go

…ehind it

Groxx requested review from 3vilhamster, Shaddoll, davidporter-id-au, demirkayaender, dkrotx, jakobht, neil-xie, sankari165, shijiesheng and taylanisikdemir as code owners September 4, 2025 21:52

Groxx force-pushed the histogram-scope-only branch from 71347ce to 8c1e7a4 Compare September 4, 2025 23:50

scope changes

ab253f0

Groxx force-pushed the histogram-scope-only branch from 8c1e7a4 to ab253f0 Compare September 5, 2025 19:14

better tagging strategy, fixed int-bucket tag values

f1e7df2

c-warren reviewed Sep 9, 2025

View reviewed changes

histogram-making logic cleanup, reducing the mishmash of -1 values

72dc00a

c-warren self-assigned this Sep 9, 2025

c-warren approved these changes Sep 9, 2025

View reviewed changes

davidporter-id-au reviewed Sep 9, 2025

View reviewed changes

common/metrics/histograms.go Outdated Show resolved Hide resolved

davidporter-id-au approved these changes Sep 9, 2025

View reviewed changes

Groxx added 2 commits September 9, 2025 18:53

simplification, cleanup, better docs about the intent and reasoning b…

8d96140

…ehind it

docs on subsetTo

253a836

Groxx merged commit ecc4fb0 into cadence-workflow:master Sep 10, 2025
34 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

New histogram strategy and minimal changes to start emitting it #7240

New histogram strategy and minimal changes to start emitting it #7240

Uh oh!

Groxx commented Sep 4, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

c-warren Sep 9, 2025

Uh oh!

Groxx Sep 9, 2025 •

edited

Loading

Uh oh!

Groxx Sep 9, 2025 •

edited

Loading

Uh oh!

c-warren Sep 9, 2025

Uh oh!

Groxx Sep 9, 2025

Uh oh!

Groxx Sep 9, 2025

Uh oh!

Uh oh!

c-warren Sep 9, 2025

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

New histogram strategy and minimal changes to start emitting it #7240

New histogram strategy and minimal changes to start emitting it #7240

Uh oh!

Conversation

Groxx commented Sep 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

c-warren Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

Groxx Sep 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Groxx Sep 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

c-warren Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

Groxx Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

Groxx Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

c-warren Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Groxx commented Sep 4, 2025 •

edited

Loading

Groxx Sep 9, 2025 •

edited

Loading

Groxx Sep 9, 2025 •

edited

Loading