@spkane31
Contributor

What changed?

Remove the invalid_state_transition_workflow_update_message and workflow_update_registry_size_limited metrics. Add the namespace tag to the logs/softasserts in the instrumentation methods where these metrics were emitted.

Why?

The logs will give us the same information and the namespace without cardinality concerns. These metrics are not expected to fire frequently.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

Minimal, metrics changes only.

@spkane31 spkane31 requested review from a team as code owners February 10, 2026 00:25
@spkane31 spkane31 requested a review from stephanos February 10, 2026 00:26
tag.String("update-id", updateID),
tag.String("message", fmt.Sprintf("%T", msg)),
tag.Stringer("state", state),
tag.String("namespace", namespace),
Contributor

👍

Comment on lines 52 to 57
i.oneOf(metrics.WorkflowExecutionUpdateRegistrySizeLimited.Name())
// TODO: remove log once limit is enforced everywhere
func (i *instrumentation) countRegistrySizeLimited(updateCount, registrySize, payloadSize int, namespace string) {
i.log.Warn("update registry size limit reached",
tag.Int("registry-size", registrySize),
tag.Int("payload-size", payloadSize),
tag.Int("update-count", updateCount))
Contributor

Looking back at the original PR, the reason for this log line was to get better data on how and when the registry size limit was hit (avoiding some of the issues with a metrics-based solution). Since this is a rate limit, I think metrics might actually be a better way to capture this, no?

Contributor Author

I think getting the namespace is important here and using logs instead of metrics removes the cardinality issue. I'd rather use a log here and add a metric if we hit this often. We also have workflow_update_registry_size to get similar information.

Contributor

@stephanos stephanos Feb 10, 2026

> I think getting the namespace is important here and using logs instead of metrics removes the cardinality issue. I'd rather use a log here and add a metric if we hit this often.

I might need more clarity on how the cost works out. If (1) we'll need namespace tags for other metrics anyway and (2) the volume is low, what's the issue?

Apart from that, rate limits are typically tracked as metrics across the codebase AFAIK. Logs are less useful because they are much more limiting to query: our log queries put a cap on how much data they can ingest. On a big cluster, that limits how far back in time you can go (I've had it unable to process more than 1h, for example). Metrics don't have that issue.

> We also have workflow_update_registry_size to get similar information.

That's not quite true: you cannot make the leap from that metric to whether a limit was hit. If the size is at 99%, you cannot assume it hit the limit. And even at 10%, an Update can still hit the limit.
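
To make the point about the size metric concrete, here is a minimal sketch. It assumes a byte-based limit, and the `registry` type, its fields, and `admit` are hypothetical names for illustration, not the actual registry code. A registry sitting at 10% can still reject an oversized update, while one that fills to 99% may never reject anything, so only a dedicated counter (or log line) records the hit.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for the update registry; the real
// type and its exact limit semantics are not shown in this thread.
type registry struct {
	sizeBytes  int // current total size: what a size gauge would report
	limitBytes int // assumed byte-based limit
	limitHits  int // what a dedicated "size limited" counter would report
}

// admit accepts an update only if its payload still fits under the limit.
// A rejection bumps the counter and leaves the size unchanged.
func (r *registry) admit(payloadBytes int) bool {
	if r.sizeBytes+payloadBytes > r.limitBytes {
		r.limitHits++
		return false
	}
	r.sizeBytes += payloadBytes
	return true
}

func main() {
	// At 10% usage, one oversized update is rejected; the gauge stays at 10.
	low := &registry{sizeBytes: 10, limitBytes: 100}
	low.admit(95)

	// Filling up to 99% never triggers the limit.
	high := &registry{sizeBytes: 90, limitBytes: 100}
	high.admit(9)

	fmt.Println("low: size", low.sizeBytes, "hits", low.limitHits)
	fmt.Println("high: size", high.sizeBytes, "hits", high.limitHits)
}
```

In this sketch the gauge alone would make the second registry look like the problematic one, even though only the first ever hit the limit.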

Contributor Author

Thoughts on leaving the metric and adding the namespace to the log for now? Once we eventually add the namespace to the metric, we can remove the log entirely.
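
A rough sketch of what that compromise could look like, using only the Go standard library as a stand-in. The `instrumentation` struct here, the `sizeLimited` counter, and the `demo` helper are hypothetical; the real code uses the project's own logger, tags, and metrics handler. The idea is that the counter stays untagged, so it adds no per-namespace cardinality, while the namespace travels on the log line.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"sync/atomic"
)

// instrumentation is a hypothetical stand-in for the real type.
type instrumentation struct {
	log         *slog.Logger
	sizeLimited atomic.Int64 // untagged counter standing in for the metric
}

// countRegistrySizeLimited keeps the metric and carries the namespace on
// the log line only.
func (i *instrumentation) countRegistrySizeLimited(updateCount, registrySize, payloadSize int, namespace string) {
	i.sizeLimited.Add(1)
	i.log.Warn("update registry size limit reached",
		slog.Int("registry-size", registrySize),
		slog.Int("payload-size", payloadSize),
		slog.Int("update-count", updateCount),
		slog.String("namespace", namespace))
}

// demo records one limit hit and returns the emitted log text and the
// counter value.
func demo(namespace string) (string, int64) {
	var buf bytes.Buffer
	i := &instrumentation{log: slog.New(slog.NewTextHandler(&buf, nil))}
	i.countRegistrySizeLimited(3, 1024, 256, namespace)
	return buf.String(), i.sizeLimited.Load()
}

func main() {
	line, count := demo("example-namespace")
	fmt.Print(line)
	fmt.Println("counter:", count)
}
```

Once the metric itself can carry a namespace tag, the log call in a sketch like this would become redundant and could be dropped.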

Contributor

👍 I'm good with that

Contributor Author

Restored

@spkane31 spkane31 requested a review from stephanos February 11, 2026 17:37