-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Prune two workflow update related metrics in favor of logs w/ namespace #9269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| tag.String("update-id", updateID), | ||
| tag.String("message", fmt.Sprintf("%T", msg)), | ||
| tag.Stringer("state", state), | ||
| tag.String("namespace", namespace), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| i.oneOf(metrics.WorkflowExecutionUpdateRegistrySizeLimited.Name()) | ||
| // TODO: remove log once limit is enforced everywhere | ||
| func (i *instrumentation) countRegistrySizeLimited(updateCount, registrySize, payloadSize int, namespace string) { | ||
| i.log.Warn("update registry size limit reached", | ||
| tag.Int("registry-size", registrySize), | ||
| tag.Int("payload-size", payloadSize), | ||
| tag.Int("update-count", updateCount)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, looking back at the original PR, the reason for this log line was to get better data on how/when the registry size limit was hit (avoiding some of the issues with a metrics-based solution). Since this is a rate limit, I think metrics might be a better way to capture this actually, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think getting the namespace is important here and using logs instead of metrics removes the cardinality issue. I'd rather use a log here and add a metric if we hit this often. We also have workflow_update_registry_size to get similar information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think getting the namespace is important here and using logs instead of metrics removes the cardinality issue. I'd rather use a log here and add a metric if we hit this often.
I might need more clarity on how the cost works out; if (1) we'll need namespace tags for other metrics anyway and (2) the volume is low; what's the issue?
Apart from that, rate limits are typically tracked as metrics across the codebase AFAIK; logs are not as useful as they are much more limiting to query. Our log queries puts a cap on how much data it can ingest. On a big cluster that limits how far back in time you can go (I've had it unable to process more than 1h, for example). Metrics don't have that issue.
We also have workflow_update_registry_size to get similar information.
It's not quite true; as you cannot make a leap from that to whether a limit was hit. If the size is at 99%, you cannot assume it hit the limit. Or if it's at 10%, it can still happen that an Update hits the limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on leaving the metric and adding namespace to the log for now? Once we eventually add the namespace to the metric we can remove the log entirely but keep the metric for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I'm good with that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restored
What changed?
Remove
invalid_state_transition_workflow_update_messageandworkflow_update_registry_size_limitedmetrics. Add thenamespacetag to the logs/softasserts in theinstrumentationmethods the metrics are used for.Why?
The logs will give us the same information and the namespace without cardinality concerns. These metrics are not expected to fire frequently.
How did you test it?
Potential risks
Minimal, metrics changes only.