shijiesheng
Member

@shijiesheng shijiesheng commented Dec 10, 2024

Detailed Description

Improve performance of poller auto scaler by using more accurate scaling signals and several implementation changes.

Changes

  • A new WorkerOptions option, AutoScalerOptions, is introduced.
  • Several WorkerOptions are deprecated and become no-ops.
  • Read a new signal (poller wait time) to drive scaling.
  • Allow the poller auto scaler to be kill-switched from the server.
  • New implementation that reacts more quickly to traffic changes.
  • Removed the no-longer-used autoscaler package completely (the original implementation was overcomplicated).

Impact Analysis

  • Backward Compatibility: No. Existing autoscaling will be stopped, but this should not have a big impact since this feature was never rolled out in production. OSS users should follow the instructions in the rollout plan below.
  • Forward Compatibility: Yes, introduce new

Testing Plan

  • Unit Tests: Yes
  • Persistence Tests: Not related
  • Integration Tests: No
  • Compatibility Tests: No, because the autoscaler is a feature that was not rolled out in production.

Rollout Plan

  • What is the rollout plan?
    For Uber services, standard client release steps
    For OSS users, turn off the autoscaler feature before upgrading the client.

  • Does the order of deployment matter? No

  • Is it safe to rollback? Does the order of rollback matter? Yes

  • Is there a kill switch to mitigate the impact immediately? Yes, the new autoscaler feature is an opt-in feature.

codecov bot commented Dec 21, 2024

Codecov Report

Attention: Patch coverage is 87.89062% with 31 lines in your changes missing coverage. Please review.

Project coverage is 82.03%. Comparing base (526cb2d) to head (7895fd4).
Report is 1 commits behind head on master.

Files with missing lines                      Patch %   Lines
internal/worker/concurrency_auto_scaler.go    91.47%    10 Missing and 5 partials ⚠️
internal/internal_worker_base.go              62.85%    12 Missing and 1 partial ⚠️
internal/internal_worker.go                   88.46%    2 Missing and 1 partial ⚠️

Files with missing lines                      Coverage Δ
internal/internal_task_handlers.go            81.87% <100.00%> (+0.25%) ⬆️
internal/internal_task_pollers.go             83.16% <100.00%> (+0.61%) ⬆️
internal/internal_utils.go                    76.58% <ø> (ø)
internal/worker.go                            40.00% <100.00%> (ø)
internal/internal_worker.go                   76.29% <88.46%> (-0.42%) ⬇️
internal/internal_worker_base.go              73.95% <62.85%> (+2.59%) ⬆️
internal/worker/concurrency_auto_scaler.go    91.47% <91.47%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Groxx
Member

Groxx commented Jan 2, 2025

to stick it in here too: overall looks pretty good. simpler and the overall goal (and why it achieves it) is clearer too. seems like just minor tweaks (many optional) and it's probably good to go

Contributor

@3vilhamster 3vilhamster left a comment

Overall looks good, but I left some nits


if bw.pollerAutoScaler != nil {
if bw.concurrencyAutoScaler != nil {
if pErr := bw.concurrency.PollerPermit.Acquire(bw.limiterContext); pErr == nil {
Contributor

nit: this looks like a leaky abstraction. This should be handled inside concurrencyAutoScaler.
I suggest moving all concurrencyAutoScaler != nil checks inside the methods where they are required.
This code should be simpler: just call methods on the autoscaler, and if it is nil, do nothing.
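
For illustration, the "if it is nil, do nothing" suggestion can rely on the fact that Go allows calling a pointer-receiver method on a nil pointer. The sketch below is a minimal, hypothetical stand-in: the `autoScaler` type and its `Start`, `Stop`, and `Events` methods are assumptions made for this example, not the PR's actual concurrencyAutoScaler API.

```go
package main

import "fmt"

// autoScaler is a hypothetical stand-in for the PR's concurrencyAutoScaler.
type autoScaler struct {
	events []string
}

// Start is safe to call on a nil *autoScaler: it simply does nothing.
func (a *autoScaler) Start() {
	if a == nil {
		return
	}
	a.events = append(a.events, "start")
}

// Stop is likewise a no-op on a nil receiver.
func (a *autoScaler) Stop() {
	if a == nil {
		return
	}
	a.events = append(a.events, "stop")
}

// Events returns the recorded events, or nil when autoscaling is disabled.
func (a *autoScaler) Events() []string {
	if a == nil {
		return nil
	}
	return a.events
}

func main() {
	var disabled *autoScaler // autoscaling turned off: a nil pointer
	disabled.Start()         // no panic and no nil check at the call site
	disabled.Stop()

	enabled := &autoScaler{}
	enabled.Start()
	enabled.Stop()

	fmt.Println(len(disabled.Events()), enabled.Events())
}
```

With this shape, the caller never needs to know whether autoscaling is enabled; the nil check lives in one place per method.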

Member Author

Makes sense.

Member

We are guarding calls against bw.concurrency based on the nilness of bw.concurrencyAutoScaler, which indicates that these two should be abstracted behind a single interface to avoid additional complexity in this file.

Member Author

I've removed this check in all places. Regarding the comment "these two should be abstracted behind a single interface to avoid additional complexity in this file": I still think these are two separate entities. The client still needs concurrency whether the autoscaler is enabled or not.

@shijiesheng shijiesheng force-pushed the autoscaler-rework branch 2 times, most recently from a9d3781 to 52dd229 Compare January 17, 2025 17:34

return t.autoConfigHint
default:
return nil
}
Member

instead of this switch case (which is not future proof), we can cast the task to autoConfigHintAwareTask interface and get the auto config hint
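
A minimal sketch of that suggestion follows. The interface name autoConfigHintAwareTask comes from this thread; everything else is an assumption for illustration: the local `AutoConfigHint` struct and its field names stand in for shared.AutoConfigHint, and the method name `getAutoConfigHint`, the two task types, and the `hintOf` helper are hypothetical.

```go
package main

import "fmt"

// AutoConfigHint is a local stand-in for shared.AutoConfigHint; the field
// names are assumptions for this sketch.
type AutoConfigHint struct {
	EnableAutoConfig   *bool
	PollerWaitTimeInMs *int64
}

// autoConfigHintAwareTask is implemented by any task that carries a hint.
// The method name is hypothetical.
type autoConfigHintAwareTask interface {
	getAutoConfigHint() *AutoConfigHint
}

// workflowTask is a hypothetical task type that carries a hint.
type workflowTask struct {
	autoConfigHint *AutoConfigHint
}

func (t *workflowTask) getAutoConfigHint() *AutoConfigHint {
	return t.autoConfigHint
}

// localActivityTask is a hypothetical task type that carries no hint.
type localActivityTask struct{}

// hintOf uses a single interface assertion instead of a type switch, so new
// task types only need to implement the interface to be picked up.
func hintOf(task interface{}) *AutoConfigHint {
	if t, ok := task.(autoConfigHintAwareTask); ok {
		return t.getAutoConfigHint()
	}
	return nil
}

func main() {
	enabled, wait := true, int64(12)
	hint := &AutoConfigHint{EnableAutoConfig: &enabled, PollerWaitTimeInMs: &wait}

	fmt.Println(hintOf(&workflowTask{autoConfigHint: hint}) == hint)
	fmt.Println(hintOf(&localActivityTask{}) == nil)
}
```

Compared with a type switch, a new task type needs no change to `hintOf`; it only has to implement the interface.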

Member Author

I've removed this and now use autoConfigHintAwareTask.

Comment on lines 38 to 39
lowerPollerWaitTime = 16 * time.Millisecond
upperPollerWaitTime = 256 * time.Millisecond
Member

it looks like we would want to iterate on these to adjust sensitivity. consider exposing these to worker config

Member Author

@shijiesheng shijiesheng Jun 17, 2025

The poller wait time is an invariant; users don't need to tune it. Sensitivity (time to react) is actually controlled by the Cooldown, which is already a parameter.

},
},
{
"idl pollers waiting for tasks",
Member

nit: typo, should be "idle". Same in other cases below.

Member Author

fixed

@shijiesheng
Member Author

coverage failed due to deprecation changes

Member

@Groxx Groxx left a comment

dropping notes for now, while reading tests carefully 👍

overall looks pretty good I think - fairly easy to follow, behavior looks good (e.g. up to 4x growth when "instant", 0.5x shrink when slow, one scale change every 10 seconds sounds reasonable), everything's pretty close.
so just a small pile of minor stuff, some nits some not.

autoScalerEventStart autoScalerEvent = "auto-scaler-start"
autoScalerEventStop autoScalerEvent = "auto-scaler-stop"
autoScalerEventLogMsg string = "concurrency auto scaler event"
testTimeFormat string = "15:04:05"
Member

Suggested change
testTimeFormat string = "15:04:05"

Comment on lines 129 to 139
"busy pollers, scale up to maximum",
[]*shared.AutoConfigHint{
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, in cool down
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale down to minimum
},
[]eventLog{
{autoScalerEventStart, false, 100, "00:00:00"},
{autoScalerEventEnable, true, 100, "00:00:00"},
{autoScalerEventPollerSkipUpdateCooldown, true, 100, "00:00:01"},
{autoScalerEventPollerScaleUp, true, 200, "00:00:02"},
{autoScalerEventStop, true, 200, "00:00:02"},
},
Member

might be easier to follow the actual behavior of this one with a less-than-1/2-maximum set of values, e.g. start with 10 rather than 100 -> it won't scale to maximum, it'll scale to 42.

kinda similar for others below, e.g. pollers, scale up and down multiple times becomes:

{autoScalerEventStart, false, 10, "00:00:00"},
{autoScalerEventEnable, true, 10, "00:00:00"},
{autoScalerEventPollerSkipUpdateCooldown, true, 10, "00:00:01"},
{autoScalerEventPollerScaleUp, true, 42, "00:00:02"},
{autoScalerEventPollerSkipUpdateCooldown, true, 42, "00:00:03"},
{autoScalerEventPollerScaleDown, true, 25, "00:00:04"},
{autoScalerEventPollerSkipUpdateCooldown, true, 25, "00:00:05"},
{autoScalerEventPollerScaleUp, true, 104, "00:00:06"},
{autoScalerEventPollerSkipUpdateCooldown, true, 104, "00:00:07"},
{autoScalerEventPollerScaleDown, true, 63, "00:00:08"},
{autoScalerEventStop, true, 63, "00:00:08"},

which seems a bit more informative than "to max, down, back to max, back to same down value"

@shijiesheng shijiesheng merged commit 221cc4e into cadence-workflow:master Jun 25, 2025
11 checks passed
5 participants