[FLINK-36409] Publish some autoscaler metrics during stabilisation period #945
mxm
left a comment
I'm not 100% sure this change will yield the desired outcome. There are some issues with collecting metrics in the stabilization phase, which is why we explicitly chose not to collect any in that phase:
- Metrics are not available yet, which will be evident in exceptions from the REST API
- Metrics may be incomplete
- Metric values will be skewed in the stabilization phase
After this change, there is no way to externally assess the source-of-truth metrics which will be used for evaluation. This makes debugging the autoscaling algorithm harder.
Perhaps we can add an option to allow collecting metrics in the stabilization phase?
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/ScalingMetricCollector.java
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/ScalingMetricCollector.java
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/ScalingMetricCollector.java
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/utils/DateTimeUtils.java
+1 for @mxm's suggestion to enable this via a config flag, but it should probably be disabled by default (to keep the current behaviour as the default).
All right, I can close this PR.
Why would we close it? I think the PR makes sense, but it would be good to be able to configure the behavior as Max suggested.
Max wrote:
I guess Max is under the assumption that the current logic does not collect metrics during the stabilization period. We do collect samples, and once the stabilization is over we even evaluate them. This PR does not change that logic, so I am not sure what should be controlled by a flag. The only thing the PR does is report those metrics. Can you clarify? I might be missing something obvious in the current logic.
I am not sure that's what Max meant, because you can already reduce the stabilization period to 0 if you want, so we wouldn't need a new option. I thought he meant a flag to determine whether the stabilization metrics should be reported or not. Let's clarify offline.
You're right, we already return metrics from the stabilization phase, but only to measure the observed true processing rate. In the original model, we only returned metrics once the metric window was full. I think that was more elegant, but the source metrics proved so unreliable that we had to manually measure the processing capacity instead of always relying on the processing rate and busyness metrics of sources. I might be a bit pedantic here, but I want to see the actual metrics used for evaluation reported as autoscaler metrics. Reporting metrics during stabilization removes that clarity. You can only observe what the assumptions of the autoscaler were if you observe what is actually used for evaluation. That's why I suggested putting the reporting of autoscaler metrics during the stabilization period behind a flag.
I'm ok with not having a flag / config option. I do see that adding the changes under a feature flag is of little use. Plus, we have too many config options already. You will likely still have gaps in the reported metrics because of the job restarts, during which metrics aren't available, but it may help users to better understand what's happening with the cluster during the stabilization period.
Please rebase on main for the docker fix.
Thanks for the review, folks. I've pushed the requested changes.
1996fanrui
left a comment
Thanks for the contribution and discussion!
+1 for not introducing the new option as we already have too many config options.
LGTM
What is the purpose of the change
Currently, autoscaler metrics are collected during the stabilization period but not published; we report them only after the stabilization period ends. In practice, this can produce large gaps in metric charts during scale operations, which makes the charts hard for end users to interpret. The metrics can appear to be broken, especially when multiple scale operations are executed in a row, for example:

This change mitigates the issue by shortening the gaps in reported metrics: collected metrics are no longer withheld during the stabilization period.
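To make the behavioural difference concrete, here is a minimal sketch of the reporting decision. It is not the code in `ScalingMetricCollector`; the class name, method name, and the `publishDuringStabilization` parameter are hypothetical and exist only to contrast the old behaviour (`false`) with the new one (`true`). This PR does not add a config option.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative sketch only; all names here are hypothetical. */
final class StabilizationReportingSketch {

    private StabilizationReportingSketch() {}

    /**
     * Decides whether collected autoscaler metrics should be published.
     *
     * @param jobRunningSince time at which the job (re)started running
     * @param now current time
     * @param stabilizationInterval length of the stabilization period
     * @param publishDuringStabilization hypothetical toggle: false models the
     *     old behaviour (withhold), true models the behaviour after this change
     */
    static boolean shouldReport(
            Instant jobRunningSince,
            Instant now,
            Duration stabilizationInterval,
            boolean publishDuringStabilization) {
        // The job is stabilizing until the interval has fully elapsed.
        boolean stabilizing = now.isBefore(jobRunningSince.plus(stabilizationInterval));
        // Outside the stabilization period metrics are always reported;
        // inside it, only when publishing during stabilization is enabled.
        return !stabilizing || publishDuringStabilization;
    }
}
```

With a 5 minute stabilization interval, a sample taken 2 minutes after a restart is withheld under the old behaviour and reported under the new one. Whether the sample is used for a scaling decision is a separate question that this sketch does not model.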
Brief change log
Verifying this change
Updated existing unit tests.
Does this pull request potentially affect one of the following parts:
CustomResourceDescriptors: no
Documentation