Skip to content

Conversation

Groxx
Copy link
Member

@Groxx Groxx commented Sep 4, 2025

Part 1 of rewriting our timers to histograms: create and lightly document the new strategy, and start emitting an "interesting" metric, so we can check the results end-to-end.

Once this and other plans are verified, I'll write up a more thorough doc, and we might also choose to adopt a larger refactor before moving most of our metrics.
But for now, this is just step 1, intentionally kept (relatively) small and isolated to make it easier to review.

Most docs are in code, but in a nutshell this commit includes:

  • a new "subsettable histogram" type, which follows Prometheus/OTEL's "exponential histogram" pattern
  • a few probably-useful initial bucket sets to get started
  • some helpers for ^ making those, as tally's exponential histogram tooling turned out to be unacceptably flawed
  • dual-emitting a couple medium-large timers as histograms, so we can validate other changes (e.g. rewriting alerts and dashboards) and validate guesses on the runtime/storage/query costs

@Groxx Groxx force-pushed the histogram-scope-only branch from 8c1e7a4 to ab253f0 Compare September 5, 2025 19:14
// Even 320 is far too many tbh, target a max of 160 for very-high-value data, and generally around 80 or less
// (ideally a value that is divisible by 2 many times, to allow "clean" subsetting, but this is NOT necessary).
//
// The stop callback will be given the current largest value and the number of values generated so far.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the stop function documentation here, but I'd recommend moving it to a type:

// StopFunction has nice documentation
type StopFunction func(last time.Duration, length int) bool

I think this might increase readability, and I'd recommend adding some examples to the documentation (e.g the ones above) just for extra clarity.

For me - why wouldn't we always force a user to a power of 2, if we're going to fill it post-fact anyway?

NIT: Would maxDuration or maxDurationNanos make sense rather than last as a parameter? I guess we would only want to do that if we were planning to force a specific comparison, which the stop function currently doesn't do.

Copy link
Member Author

@Groxx Groxx Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

force a clean power of 2: yea, I think I like that better too. will do 👍

maxDuration / maxDurationNanos: not sure what you mean here tbh. this isn't called once, it's called per value, and it's already a time.Duration

Copy link
Member Author

@Groxx Groxx Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re docs on a type: I don't believe any IDEs will show that to you as you fill in the argument, unless you make a var x StopFunction variable - function arg type docs are generally not exposed, just the names and the "parent" func's docs:

goland while entering the arg:
Image

vscode:
Image

and generally for small funcs like this I think having to look at two locations to figure out how to call the callback-needing func is worse...

... though I see IDEs are finally showing you the signature inline, rather than just the name, so that's not too bad (unless this is a local LLM, then it's awful because that's not reliable):

Image Image

in principle I kinda agree, the more types the merrier. ergonomics are just not all that good tho for this kind of case imo, unless there's a lot of reuse or func-as-var usage (where the type would prevent mixing up types) :\

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood re: type and am happy with the more ergonomic editor usage.

maxDuration / maxDurationNanos: not sure what you mean here tbh. this isn't called once, it's called per value, and it's already a time.Duration

I had a couple of mystery logical leaps in my thinking here. At the moment the stop function gives the user arbitrary power to define how they want to stop, given the duration of the most recently appended bucket + how many buckets exist.

All the examples are just comparisons with:

  • the length of the list
  • whether the 'last' duration is greater than or equal to some const

If that's all we want to support, I'd suggest creating the stop function for the user by giving them a method that allows them to specify a max value + a max length:

func CreateStopFunction(maxDuration time.Duration, maxLength int) StopFunction {
  return func(last time.Duration, l int) bool {
    return last >= maxDuration && l >= maxLength
  }
}

Or just allowing them to specify maxDuration/maxLength and doing that comparison where the existing stop function is called.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I'm going to CreateStopFunction, I'll just have those two args passed directly and use it that way internally.

... which I think you're convincing me of, in side-chat. lemme see how it looks 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated! I like this better, thanks for bringing it up and pushing on me :)

@c-warren c-warren self-assigned this Sep 9, 2025
// Even 320 is far too many tbh, target a max of 160 for very-high-value data, and generally around 80 or less
// (ideally a value that is divisible by 2 many times, to allow "clean" subsetting, but this is NOT necessary).
//
// The stop callback will be given the current largest value and the number of values generated so far.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood re: type and am happy with the more ergonomic editor usage.

maxDuration / maxDurationNanos: not sure what you mean here tbh. this isn't called once, it's called per value, and it's already a time.Duration

I had a couple of mystery logical leaps in my thinking here. At the moment the stop function gives the user arbitrary power to define how they want to stop, given the duration of the most recently appended bucket + how many buckets exist.

All the examples are just comparisons with:

  • the length of the list
  • whether the 'last' duration is greater than or equal to some const

If that's all we want to support, I'd suggest creating the stop function for the user by giving them a method that allows them to specify a max value + a max length:

func CreateStopFunction(maxDuration time.Duration, maxLength int) StopFunction {
  return func(last time.Duration, l int) bool {
    return last >= maxDuration && l >= maxLength
  }
}

Or just allowing them to specify maxDuration/maxLength and doing that comparison where the existing stop function is called.

@Groxx Groxx merged commit ecc4fb0 into cadence-workflow:master Sep 10, 2025
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants