[FLINK-36409] Publish some autoscaler metrics during stabilisation period #945

morhidi · 2025-02-18T22:25:31Z

What is the purpose of the change

Currently, autoscaler metrics are collected but not published during the stabilization period; we report them only after it ends. In practice this can leave large gaps in metric charts during scale operations, which makes the charts hard for end users to interpret. The metrics can appear to be broken, especially when multiple scale operations are executed in a row.

This change mitigates the issue by shortening the gaps in reported metrics. The collected metrics will no longer be withheld during the stabilization period.

Brief change log

Removed the logic that suppressed metric reporting during stabilization
Adjusted the logging to make the stabilization and metric window periods easier to follow
Updated the timestamp format to include milliseconds (unit tests operate with milliseconds)

2025-02-17 16:56:13,948 o.a.f.a.ScalingMetricCollector [INFO ] Stabilizing... until 1969-12-31 16:00:00.100. 1 samples collected
2025-02-17 16:56:13,952 o.a.f.a.ScalingMetricCollector [INFO ] Stabilizing... until 1969-12-31 16:00:00.100. 2 samples collected
2025-02-17 16:56:13,957 o.a.f.a.ScalingMetricCollector [INFO ] Metric window is not full until 1969-12-31 16:00:00.250. 3 samples collected
2025-02-17 16:56:13,958 o.a.f.a.ScalingMetricCollector [INFO ] Metric window is not full until 1969-12-31 16:00:00.250. 4 samples collected
2025-02-17 16:56:13,960 o.a.f.a.ScalingMetricCollector [INFO ] Metric window is now full. Dropped 3 samples before 1969-12-31 16:00:00.160, keeping 2.
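The three phases visible in the log lines above can be sketched as follows. This is only an illustration, not the code in ScalingMetricCollector: the class name, the `describe` method, its parameters, and the exact boundary handling are assumptions, and the fixed UTC-8 offset is chosen only so the output matches the sample timestamps.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class StabilizationSketch {
    // Millisecond precision, as in the sample log output. The fixed offset is an
    // assumption made for reproducibility; real code would use the local zone.
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
                    .withZone(ZoneOffset.ofHours(-8));

    // Hypothetical helper: classifies "now" relative to the last job update.
    static String describe(
            Instant jobUpdate,
            Duration stabilization,
            Duration metricWindow,
            Instant now,
            int samples) {
        Instant stableAt = jobUpdate.plus(stabilization);
        Instant windowFullAt = stableAt.plus(metricWindow);
        if (now.isBefore(stableAt)) {
            return "Stabilizing... until " + FORMAT.format(stableAt)
                    + ". " + samples + " samples collected";
        }
        if (now.isBefore(windowFullAt)) {
            return "Metric window is not full until " + FORMAT.format(windowFullAt)
                    + ". " + samples + " samples collected";
        }
        return "Metric window is now full.";
    }

    public static void main(String[] args) {
        Duration stabilization = Duration.ofMillis(100);
        Duration window = Duration.ofMillis(150);
        for (long millis : new long[] {50, 200, 310}) {
            System.out.println(describe(
                    Instant.EPOCH, stabilization, window, Instant.ofEpochMilli(millis), 1));
        }
    }
}
```

With a 100 ms stabilization period and a 150 ms metric window starting at the epoch, the sketch yields the same "until" timestamps as the log sample (…00.100 and …00.250).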

Verifying this change

Updated existing unit tests.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., are there any changes to the CustomResourceDescriptors: no
Core observer or reconciler logic that is regularly executed: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

mxm

I'm not 100% sure this change will yield the desired outcome. There are some issues with collecting metrics in the stabilization phase, which is why we explicitly chose not to collect any in that phase:

Metrics are not available yet, which will be evident in exceptions from the REST API
Metrics may be incomplete
Metric values will be skewed in the stabilization phase

After this change, there is no way to externally assess the source-of-truth metrics which will be used for evaluation. This makes debugging the autoscaling algorithm harder.

Perhaps we can add an option to allow collecting metrics in the stabilization phase?

flink-autoscaler/src/main/java/org/apache/flink/autoscaler/ScalingMetricCollector.java

flink-autoscaler/src/main/java/org/apache/flink/autoscaler/utils/DateTimeUtils.java

gyfora · 2025-02-19T12:44:31Z

+1 for @mxm's suggestion to enable this by a config flag, but it should probably be disabled by default (to keep the current behaviour as the default).

morhidi · 2025-02-19T17:17:16Z

> +1 for @mxm's suggestion to enable this by a config flag, but it should probably be disabled by default (to keep the current behaviour as the default).

> I'm not 100% sure this change will yield the desired outcome.

All right, I can close this PR.

gyfora · 2025-02-19T17:29:28Z

Why would we close it? I think the PR makes sense, but it would be good to be able to configure the behavior as Max suggested.

morhidi · 2025-02-19T18:22:32Z

> Why would we close it? I think the PR makes sense, but it would be good to be able to configure the behavior as Max suggested.

Max wrote:

> I'm not 100% sure this change will yield the desired outcome. There are some issues with collecting metrics in the stabilization phase, which is why we explicitly chose not to collect any in that phase:
>
> Metrics are not available yet, which will be evident in exceptions from the REST API
> Metrics may be incomplete
> Metric values will be skewed in the stabilization phase
>
> After this change, there is no way to externally assess the source-of-truth metrics which will be used for evaluation. This makes debugging the autoscaling algorithm harder.
>
> Perhaps we can add an option to allow collecting metrics in the stabilization phase?

I guess Max is under the assumption that the current logic does not collect metrics during the stabilization period. We do collect samples, and once the stabilization is over we even evaluate them. This PR does not change that logic, so I'm not sure what should be controlled by a flag. The only thing the PR does is report those metrics. Can you clarify? I might be missing something obvious in the current logic.

gyfora · 2025-02-19T18:54:29Z

I am not sure that’s what Max meant, because you can already reduce the stabilization period to 0 if you want. So we wouldn’t need a new option.

I thought he meant a flag to determine whether the stabilization metrics should be reported or not. Let’s clarify offline.

mxm · 2025-02-20T10:56:06Z

> I guess Max is under the assumption that the current logic does not collect metrics during the stabilization period. We do collect samples, and once the stabilization is over we even evaluate them. This PR does not change that logic, so not sure what should be controlled by a flag. The only thing the PR does is that it reports those metrics. Can you clarify? I might missing something obvious from the current logic.

You're right, we already return metrics from the stabilization phase, but only to measure the observed true processing rate. In the original model, we only returned metrics once the metric window was full. I think that was more elegant, but the source metrics proved so unreliable that we had to measure the processing capacity manually instead of always relying on the processing rate and busyness metrics of sources.

I might be a bit pedantic here, but I want to see the actual metrics used for evaluation reported as autoscaler metrics. Reporting metrics during stabilization removes that clarity. You can only see what the autoscaler's assumptions were if you observe what is actually used for evaluation. That's why I suggested putting the reporting of autoscaler metrics during the stabilization period behind a flag.
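For illustration only, the distinction being discussed (samples are always collected, while reporting during stabilization could be gated) might look like the sketch below. Every name here is hypothetical, including the `reportDuringStabilization` flag; none of it is taken from the autoscaler code or its configuration options.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ReportingGateSketch {
    // Hypothetical flag: whether samples taken during stabilization are reported.
    private final boolean reportDuringStabilization;
    // Stand-in for a metric reporter; an assumption for this sketch.
    private final Consumer<Double> reporter;
    private final List<Double> samples = new ArrayList<>();

    public ReportingGateSketch(boolean reportDuringStabilization, Consumer<Double> reporter) {
        this.reportDuringStabilization = reportDuringStabilization;
        this.reporter = reporter;
    }

    public void onSample(double value, boolean stabilizing) {
        // Collection is unconditional; only reporting is gated.
        samples.add(value);
        if (!stabilizing || reportDuringStabilization) {
            reporter.accept(value);
        }
    }

    public int collected() {
        return samples.size();
    }

    public static void main(String[] args) {
        List<Double> reported = new ArrayList<>();
        ReportingGateSketch gate = new ReportingGateSketch(false, reported::add);
        gate.onSample(1.0, true);  // collected, not reported
        gate.onSample(2.0, false); // collected and reported
        System.out.println(gate.collected() + " collected, " + reported.size() + " reported");
    }
}
```

With the flag off, the reported series contains only post-stabilization samples, which is the clarity argued for above; with it on, the chart gaps shrink at the cost of mixing in samples that may be skewed.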

mxm · 2025-02-20T11:45:14Z

I'm ok with not having a flag / config option. I do see that adding the changes under a feature flag is of little use. Plus, we have too many config options already. You will likely still have gaps in the reported metrics because of job restarts, during which metrics aren't available, but it may help users to better understand what's happening with the cluster during the stabilization period.

gyfora · 2025-02-20T14:02:33Z

Please rebase on main for the docker fix


morhidi · 2025-02-20T16:06:15Z

Thanks for the review, folks. I've pushed the requested changes.

1996fanrui

Thanks for the contribution and discussion!

+1 for not introducing the new option as we already have too many config options.

LGTM

morhidi requested review from 1996fanrui, gyfora and mxm February 18, 2025 22:25

morhidi force-pushed the FLINK-36409 branch from 4461704 to 3f30475 Compare February 18, 2025 22:42

mxm reviewed Feb 19, 2025


morhidi force-pushed the FLINK-36409 branch from 3f30475 to f7a3b31 Compare February 20, 2025 16:01

[FLINK-36409] Publish some autoscaler metrics during stabilisation pe…

ee00ce2

…riod

morhidi force-pushed the FLINK-36409 branch from f7a3b31 to ee00ce2 Compare February 20, 2025 16:03

morhidi requested a review from mxm February 20, 2025 16:31

1996fanrui approved these changes Feb 21, 2025


morhidi merged commit 9c93f04 into apache:main Feb 21, 2025
115 checks passed
