
Conversation

@morhidi (Contributor) commented Feb 18, 2025

What is the purpose of the change

Currently, autoscaler metrics, although collected, are not published during the stabilization period; metrics are reported only after the stabilization period ends. In practice this can result in large gaps in metric charts during scale operations, making the charts hard for end users to interpret. The metrics can appear broken, especially when multiple scale operations execute in a row, for example:
[screenshot: metric chart with gaps during consecutive scale operations]

This change mitigates the issue by shortening the gaps in reported metrics: collected metrics are no longer withheld during the stabilization period.
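
For illustration, a minimal sketch of the behavioural difference, with hypothetical names (this is not the actual ScalingMetricCollector code): previously, collected samples were withheld until the stabilization period ended; with this change they are published as they arrive.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and structure are assumptions,
// not the actual ScalingMetricCollector implementation.
public class StabilizationReportingSketch {

    record Sample(Instant timestamp, double value) {}

    private final List<Sample> samples = new ArrayList<>();
    private final Instant stabilizationEnd;

    public StabilizationReportingSketch(Instant stabilizationEnd) {
        this.stabilizationEnd = stabilizationEnd;
    }

    /** Old behaviour: nothing is published until stabilization ends. */
    public List<Sample> collectAndReportOld(Instant now, Sample sample) {
        samples.add(sample);
        return now.isBefore(stabilizationEnd)
                ? List.of()              // withheld -> gaps in the charts
                : List.copyOf(samples);
    }

    /** New behaviour: samples are published immediately. */
    public List<Sample> collectAndReportNew(Instant now, Sample sample) {
        samples.add(sample);
        return List.copyOf(samples);     // always published -> shorter gaps
    }
}
```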

Brief change log

  • Removed the logic that suppresses metric reporting during stabilization
  • Adjusted the logging to make the stabilization/metric window periods easier to follow
  • Updated the timestamp format to include milliseconds, since the unit tests operate with millisecond precision (see the formatter sketch after the log excerpt below)
2025-02-17 16:56:13,948 o.a.f.a.ScalingMetricCollector [INFO ] Stabilizing... until 1969-12-31 16:00:00.100. 1 samples collected
2025-02-17 16:56:13,952 o.a.f.a.ScalingMetricCollector [INFO ] Stabilizing... until 1969-12-31 16:00:00.100. 2 samples collected
2025-02-17 16:56:13,957 o.a.f.a.ScalingMetricCollector [INFO ] Metric window is not full until 1969-12-31 16:00:00.250. 3 samples collected
2025-02-17 16:56:13,958 o.a.f.a.ScalingMetricCollector [INFO ] Metric window is not full until 1969-12-31 16:00:00.250. 4 samples collected
2025-02-17 16:56:13,960 o.a.f.a.ScalingMetricCollector [INFO ] Metric window is now full. Dropped 3 samples before 1969-12-31 16:00:00.160, keeping 2.
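
A sketch of a millisecond-precision timestamp pattern consistent with the log excerpt above; the exact pattern used in the PR is an assumption here.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class MillisTimestampSketch {
    public static void main(String[] args) {
        // "yyyy-MM-dd HH:mm:ss.SSS" is assumed; the PR's actual pattern may differ.
        DateTimeFormatter fmt =
                DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
                        .withZone(ZoneId.of("America/Los_Angeles"));
        // The unit tests work with epoch-based instants; epoch + 100 ms
        // renders as "1969-12-31 16:00:00.100" in UTC-8, as in the log above.
        System.out.println(fmt.format(Instant.ofEpochMilli(100)));
    }
}
```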

Verifying this change

Updated existing unit tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@mxm (Contributor) left a comment


I'm not 100% sure this change will yield the desired outcome. There are some issues with collecting metrics in the stabilization phase, which is why we explicitly chose not to collect any in that phase:

  1. Metrics are not available yet, which will be evident in exceptions from the REST API
  2. Metrics may be incomplete
  3. Metric values will be skewed in the stabilization phase

After this change, there is no way to externally assess the source-of-truth metrics that will be used for evaluation. This makes debugging the autoscaling algorithm harder.

Perhaps we can add an option to allow collecting metrics in the stabilization phase?
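
For concreteness, a sketch of what such an option could look like with Flink's ConfigOptions API; the key name and default here are assumptions, not something defined by this PR.

```java
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class StabilizationMetricsOptions {
    // Hypothetical option; the key and default are illustrative only.
    public static final ConfigOption<Boolean> REPORT_DURING_STABILIZATION =
            ConfigOptions.key("job.autoscaler.stabilization.metrics.report")
                    .booleanType()
                    .defaultValue(false) // keep the current behaviour by default
                    .withDescription(
                            "Whether autoscaler metrics collected during the "
                                    + "stabilization period are reported.");
}
```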

@gyfora (Contributor) commented Feb 19, 2025

+1 for @mxm's suggestion to enable this via a config flag, but it should probably be disabled by default (to keep the current behaviour as the default).

@morhidi (Contributor, Author) commented Feb 19, 2025

> +1 for @mxm's suggestion to enable this via a config flag, but it should probably be disabled by default (to keep the current behaviour as the default).

> I'm not 100% sure this change will yield the desired outcome.

All right, I can close this PR.

@gyfora (Contributor) commented Feb 19, 2025

Why would we close it? I think the PR makes sense, but it would be good to be able to configure the behavior as Max suggested.

@morhidi (Contributor, Author) commented Feb 19, 2025

> Why would we close it? I think the PR makes sense, but it would be good to be able to configure the behavior as Max suggested.

Max wrote:

> I'm not 100% sure this change will yield the desired outcome. There are some issues with collecting metrics in the stabilization phase, which is why we explicitly chose not to collect any in that phase:
>
>   1. Metrics are not available yet, which will be evident in exceptions from the REST API
>   2. Metrics may be incomplete
>   3. Metric values will be skewed in the stabilization phase
>
> After this change, there is no way to externally assess the source-of-truth metrics that will be used for evaluation. This makes debugging the autoscaling algorithm harder.
>
> Perhaps we can add an option to allow collecting metrics in the stabilization phase?

I guess Max is under the assumption that the current logic does not collect metrics during the stabilization period. We do collect samples, and once stabilization is over we even evaluate them. This PR does not change that logic, so I'm not sure what should be controlled by a flag. The only thing the PR changes is that it reports those metrics. Can you clarify? I might be missing something obvious in the current logic.
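
To make the distinction concrete, a schematic of the pipeline as described here, with made-up names; under this reading only the report step is affected by the PR.

```java
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Schematic only; method names are invented for illustration.
class CollectEvaluateReportSketch {
    private final Deque<Double> window = new ArrayDeque<>();

    void onSample(double sample, Instant now, Instant stabilizationEnd) {
        window.add(sample);                  // collected in both phases
        if (!now.isBefore(stabilizationEnd)) {
            evaluate();                      // runs once stabilization is over
        }
        report(sample);                      // previously skipped during
                                             // stabilization; the PR always runs it
    }

    private void evaluate() { /* scaling decision based on the window */ }

    private void report(double sample) { /* publish to the metric reporter */ }
}
```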

@gyfora (Contributor) commented Feb 19, 2025

I am not sure that's what Max meant, because you can already reduce the stabilization period to 0 if you want, so we wouldn't need a new option for that.

I thought he meant a flag to determine whether the stabilization metrics should be reported or not. Let's clarify offline.

@mxm (Contributor) commented Feb 20, 2025

> I guess Max is under the assumption that the current logic does not collect metrics during the stabilization period. We do collect samples, and once stabilization is over we even evaluate them. This PR does not change that logic, so I'm not sure what should be controlled by a flag. The only thing the PR changes is that it reports those metrics. Can you clarify? I might be missing something obvious in the current logic.

You're right, we already return metrics from the stabilization phase, but only to measure the observed true processing rate. In the original model, we only returned metrics once the metric window was full. I think that was more elegant, but the source metrics proved unreliable enough that we had to measure the processing capacity manually instead of relying solely on the processing rate and busyness metrics of the sources.

I might be a bit pedantic here, but I want to see the actual metrics used for evaluation reported as autoscaler metrics. Reporting metrics during stabilization removes that clarity: you can only understand the autoscaler's assumptions if you can observe exactly what is used for evaluation. That's why I suggested putting the reporting of autoscaler metrics during the stabilization period behind a flag.

@mxm (Contributor) commented Feb 20, 2025

I'm OK with not having a flag / config option. I see that putting the change behind a feature flag would be of little use, and we have too many config options already. You will likely still see gaps in the metrics because of job restarts, during which metrics aren't available, but this may help users better understand what's happening with the cluster during the stabilization period.

@gyfora (Contributor) commented Feb 20, 2025

Please rebase on main for the Docker fix.

@morhidi (Contributor, Author) commented Feb 20, 2025

Thanks for the review, folks. I've pushed the requested changes.

morhidi requested a review from mxm, February 20, 2025 16:31
@1996fanrui (Member) left a comment


Thanks for the contribution and discussion!

+1 for not introducing a new option, as we already have too many config options.

LGTM

morhidi merged commit 9c93f04 into apache:main on Feb 21, 2025. 115 checks passed.