Conversation

@1996fanrui
Member

What is the purpose of the change

See FLINK-36535 for more details.

Brief change log

[FLINK-36535][autoscaler] Optimize the scale down logic based on historical parallelism

  1. Using the maximum parallelism within the window instead of the latest parallelism when scaling down
  2. Never scale down when (currentTime - triggerTime) < scale-down.interval (see the sketch after this list)
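
To make these two rules concrete, here is a minimal, self-contained sketch of the decision logic. The method and parameter names (decide, maxParallelismInWindow, firstTriggerTime) are illustrative assumptions, not the exact code of this PR; the actual implementation is spread across classes such as JobVertexScaler and DelayedScaleDown (see the tests updated below).

```java
import java.time.Duration;
import java.time.Instant;

// Sketch only: illustrates the two scale-down rules from the change log above.
class ScaleDownDecisionSketch {

    static int decide(
            Instant now,
            Instant firstTriggerTime,        // time the first scale-down request was recorded
            Duration scaleDownInterval,      // job.autoscaler.scale-down.interval
            int maxParallelismInWindow,      // max recommended parallelism seen within the window
            int currentParallelism) {
        // Rule 2: never scale down while (now - firstTriggerTime) < scale-down.interval.
        if (now.isBefore(firstTriggerTime.plus(scaleDownInterval))) {
            return currentParallelism;
        }
        // Rule 1: once the interval has elapsed, scale down to the maximum parallelism
        // observed within the window instead of the latest (possibly too low) value.
        return Math.min(currentParallelism, maxParallelismInWindow);
    }
}
```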

Verifying this change

This change added tests and can be verified as follows:

  • Added the DelayedScaleDownTest
  • Rewrote testRequiredScaleDownAfterInterval as testScaleDownAfterInterval
  • Updated the AbstractAutoScalerStateStoreTest, JobVertexScalerTest and ScalingExecutorTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? no

@1996fanrui 1996fanrui force-pushed the 36535/optimize-delayed-scale-down branch 3 times, most recently from b9c61ae to 3a3c045 on November 26, 2024 02:22
@1996fanrui 1996fanrui marked this pull request as ready for review November 26, 2024 02:37
Member Author

@1996fanrui 1996fanrui left a comment

Hey @gyfora @mxm @huyuanfeng2018, please help take a look at this PR in your free time, thanks~

Contributor

@gyfora gyfora left a comment

I think one other problem with this logic in general is that we are handling the scale down window/interval more or less independently for different vertices.

So if you have 10 vertices and they would be scaled down at different times, you can have 10 restarts within the scale-down window, which does not feel right.

I think we should consider setting global scale-down windows. For example, one possible implementation: if the scale-down interval is 1 hour and one vertex requests a scale down at 12:10 and a second at 12:20, then we scale both down at 13:20 (instead of doing 2 scale downs). So basically we coalesce the intervals to some extent, so that there is at most 1 scale down per hour.
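
As an editor's illustration of the global-window idea described in this comment, here is a rough sketch with hypothetical names (GlobalScaleDownWindowSketch, requestScaleDown, shouldExecute); it is not what the PR implements. Note that every new request pushes the shared deadline out again, which is exactly the drawback discussed in the replies below.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch only: one possible "global scale-down window" where all pending vertices
// are coalesced into a single deadline and scaled down together.
class GlobalScaleDownWindowSketch {
    private final Duration interval;
    private Instant deadline; // null means no pending scale down

    GlobalScaleDownWindowSketch(Duration interval) {
        this.interval = interval;
    }

    // A vertex requests a scale down at `now`: the shared deadline becomes now + interval.
    // e.g. with a 1 hour interval, requests at 12:10 and 12:20 both execute at 13:20.
    void requestScaleDown(Instant now) {
        deadline = now.plus(interval);
    }

    // Execute one coalesced scale down for all pending vertices once the deadline has passed.
    boolean shouldExecute(Instant now) {
        return deadline != null && !now.isBefore(deadline);
    }
}
```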

@mxm
Contributor

mxm commented Nov 26, 2024

So basically we coalesce the intervals to some extent, so that there is at most 1 scale down per hour.

Exactly, I actually thought that's how the originally proposed "lazy" scale down technique should work.

@1996fanrui
Member Author

Thanks @gyfora and @mxm for the quick comment and suggestion!

So if you have 10 vertices and they would be scaled down at different times, you can have 10 restarts within the scale-down window, which does not feel right.

Actually, we only rescale twice (instead of 10 times) if we have 10 vertices and they would be scaled down at different times.
Assuming the scale-down interval is 1 hour:

  • vertex1 triggers scale down at 12:10,
  • vertex2 triggers scale down at 12:15,
  • vertex3 triggers scale down at 12:18,
  • vertex10 triggers scale down at 12:50,

Then vertex1 will be scaled down at 13:10, and the rest of them are not changed.

After 13:10, if all of vertex2 to vertex10 still need to scale down, all of them will trigger scale down at 13:10 and will be scaled down at 14:10.

I think we should consider setting global scale-down windows. For example, one possible implementation: if the scale-down interval is 1 hour and one vertex requests a scale down at 12:10 and a second at 12:20, then we scale both down at 13:20 (instead of doing 2 scale downs). So basically we coalesce the intervals to some extent, so that there is at most 1 scale down per hour.

There may be an unexpected case, which can easily happen when the scale-down interval >= 24 hours.
Assume the scale-down interval is 24 hours, and the peak hours are as follows:

  • vertex1 18:00
  • vertex2 20:00
  • vertex3 22:00

Trigger times:

  • vertex1 triggers scale down at 2024-11-20 19:00,
  • vertex2 triggers scale down at 2024-11-20 21:00,
  • vertex3 triggers scale down at 2024-11-20 23:00,

So this strategy hopes the scale down is executed at 2024-11-21 23:00. If all vertices still need to scale down after 24 hours, it works well.

But the unexpected case is: vertex2 always wants to scale down, while vertex1 and vertex3 run at the parallelism of their peak time.

The scale-down trigger is canceled for vertex1 at 2024-11-21 18:00, and vertex1 re-triggers scale down at 2024-11-21 19:00.

At this time, the trigger statuses are:

  • vertex1 triggers scale down at 2024-11-21 19:00,
  • vertex2 triggers scale down at 2024-11-20 21:00,
  • vertex3 triggers scale down at 2024-11-20 23:00,

It then hopes the scale down is executed at 2024-11-22 19:00.

But vertex3 will cancel and re-trigger again, so the execution time is postponed again.
This loops every day, and the scale down of vertex2 never happens.

Please correct me if I misunderstand anything, and I’m happy to hear more suggestions from you.

Member Author

@1996fanrui 1996fanrui left a comment

I added some comments to make this approach easier to understand.

@gyfora
Contributor

gyfora commented Nov 26, 2024

Thanks @1996fanrui, after some offline discussion I think this works as expected; I just slightly misunderstood the mechanism.

We can later optimize this further by coalescing downscale triggers that are close to each other, but we don't need to do it right now.

Contributor

@mxm mxm left a comment

LGTM

var delayedScaleDownInfo = delayedScaleDown.triggerScaleDown(vertex, now, newParallelism);

// Never scale down within scale down interval
if (now.isBefore(delayedScaleDownInfo.getFirstTriggerTime().plus(scaleDownInterval))) {
Contributor

Could we rename this method to getFirstScaledUpTime()?

Member Author

Is it a typo?

Do you mean renaming firstTriggerTime to firstScaleDownTime inside VertexDelayedScaleDownInfo?

I prefer firstTriggerTime because the class name already includes DelayedScaleDown, so the field or method name doesn't need to include scale down again. WDYT?

Contributor

I thought that getFirstTriggerTime() returns the first time we scaled up, but we are actually recording the time we first try to scale down.

I'm not sure this is correct. We want to delay scale down from the first time we scale up, not the first time we scaled down.

Member Author

@1996fanrui 1996fanrui Dec 2, 2024

Thanks @mxm for the clarification!

Actually, we expect to use the first scale-down time instead of the scale-up time as the first trigger time.

  • job.autoscaler.scale-up.grace-period is based on the scale-up time
  • And job.autoscaler.scale-down.interval is based on the first scale-down time: if scale-down.interval is 1 hour, we expect the scale down to be executed after 1 hour.
    • It's a delayed scale down
    • And it can merge multiple scale-down requests into one scale-down execution.
  • If scale-down.interval used the scale-up time, then when a scale-down request comes, the job would execute the scale down directly if it was scaled up more than 1 hour ago. (It could not merge multiple scale-down requests into one execution.)

As I understand it, the scale-down interval aims to reduce the scale-down frequency.

Please correct me if anything is wrong, thank you~
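
To illustrate the semantics described above, here is a rough sketch of the per-vertex trigger bookkeeping; the class, field, and method names are approximations of DelayedScaleDown / VertexDelayedScaleDownInfo, not the PR's exact code. The first scale-down request records the trigger time, later requests only raise the maximum requested parallelism, and a scale up clears the pending trigger so the interval restarts.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Sketch only: per-vertex bookkeeping for delayed scale down.
class DelayedScaleDownSketch {

    static final class VertexInfo {
        final Instant firstTriggerTime;   // first time a scale down was requested
        final int maxParallelismInWindow; // highest parallelism requested within the window
        VertexInfo(Instant firstTriggerTime, int maxParallelismInWindow) {
            this.firstTriggerTime = firstTriggerTime;
            this.maxParallelismInWindow = maxParallelismInWindow;
        }
    }

    private final Map<String, VertexInfo> triggers = new HashMap<>();

    // Record a scale-down request: keep the earliest trigger time and the
    // largest requested parallelism seen so far for this vertex.
    VertexInfo triggerScaleDown(String vertex, Instant now, int newParallelism) {
        VertexInfo old = triggers.get(vertex);
        VertexInfo updated = old == null
                ? new VertexInfo(now, newParallelism)
                : new VertexInfo(
                        old.firstTriggerTime,
                        Math.max(old.maxParallelismInWindow, newParallelism));
        triggers.put(vertex, updated);
        return updated;
    }

    // Cancel the pending trigger when the vertex no longer needs to scale down
    // (e.g. it scales up), so the interval restarts on the next scale-down request.
    void clearVertex(String vertex) {
        triggers.remove(vertex);
    }
}
```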

Contributor

@huyuanfeng2018 huyuanfeng2018 left a comment

A small comment, PTAL.

Member Author

@1996fanrui 1996fanrui left a comment

Thanks @gyfora for the update!

And thanks @mxm and @huyuanfeng2018 for the quick review!

I have addressed or replied to all comments.

Contributor

@mxm mxm left a comment

Generally, the PR looks good.

I missed this in the original JIRA, even though you wrote it there: #920 (comment). I can open a separate JIRA issue if we don't want to resolve it here.

Contributor

@mxm mxm left a comment

LGTM. We can handle #920 (comment) independently of this PR.

@gyfora
Contributor

gyfora commented Nov 29, 2024

please rebase on main as there is a conflicting test change

@1996fanrui
Member Author

Thanks for the review, I will address all comments next Monday because I'm at the Flink Forward Asia Conference.

@1996fanrui 1996fanrui force-pushed the 36535/optimize-delayed-scale-down branch from 980386d to d912ec3 on December 2, 2024 09:03
[FLINK-36535][autoscaler] Optimize the scale down logic based on historical parallelism to reduce the rescale frequency

1. Using the maximum parallelism within the window instead of the latest parallelism when scaling down
2. Never scale down when (currentTime - triggerTime) < scale-down.interval
@1996fanrui 1996fanrui force-pushed the 36535/optimize-delayed-scale-down branch from d912ec3 to 8fc87ae on December 2, 2024 09:08
@1996fanrui
Member Author

please rebase on main as there is a conflicting test change

Thanks @gyfora for the reminder, I rebased and fixed the conflict.

@1996fanrui
Member Author

Thanks all for the patient review!

We could discuss #920 (comment) asynchronously, so let me merge this PR first.

@1996fanrui 1996fanrui merged commit d9e8cce into apache:main Dec 3, 2024
104 checks passed