rework on poller auto scaler #1411

shijiesheng · 2024-12-10T19:37:05Z

Detailed Description

Improve the performance of the poller auto scaler by using more accurate scaling signals and making several implementation changes.

Changes

A new WorkerOptions option, AutoScalerOptions, is introduced.
Several WorkerOptions are deprecated and become no-ops.
Read a new signal (poller wait time) to drive scaling.
Allow the poller auto scaler to be kill-switched from the server.
New implementation that reacts to traffic changes more quickly.
Removed the no-longer-used autoscaler package completely (the original implementation was overcomplicated).

Impact Analysis

Backward Compatibility: No. Existing autoscaling will be stopped, but this should not have a big impact since this feature was never rolled out in production. For OSS users, please follow the instructions in the rollout plan below.
Forward Compatibility: Yes, introduce new

Testing Plan

Unit Tests: Yes
Persistence Tests: Not related
Integration Tests: No
Compatibility Tests: No, because the autoscaler is a feature that was not rolled out in production.

Rollout Plan

What is the rollout plan?
For Uber services, follow the standard client release steps.
For OSS users, turn off the autoscaler feature before upgrading the client.
Does the order of deployment matter? No
Is it safe to rollback? Does the order of rollback matter? Yes
Is there a kill switch to mitigate the impact immediately? Yes, the new autoscaler feature is an opt-in feature.

codecov · 2024-12-21T10:21:50Z

Codecov Report

Attention: Patch coverage is 87.89062% with 31 lines in your changes missing coverage. Please review.

Project coverage is 82.03%. Comparing base (526cb2d) to head (7895fd4).
Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
internal/worker/concurrency_auto_scaler.go	91.47%	10 Missing and 5 partials ⚠️
internal/internal_worker_base.go	62.85%	12 Missing and 1 partial ⚠️
internal/internal_worker.go	88.46%	2 Missing and 1 partial ⚠️

Files with missing lines	Coverage Δ
internal/internal_task_handlers.go	`81.87% <100.00%> (+0.25%)`	⬆️
internal/internal_task_pollers.go	`83.16% <100.00%> (+0.61%)`	⬆️
internal/internal_utils.go	`76.58% <ø> (ø)`
internal/worker.go	`40.00% <100.00%> (ø)`
internal/internal_worker.go	`76.29% <88.46%> (-0.42%)`	⬇️
internal/internal_worker_base.go	`73.95% <62.85%> (+2.59%)`	⬆️
internal/worker/concurrency_auto_scaler.go	`91.47% <91.47%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Groxx · 2025-01-02T18:31:44Z

to stick it in here too: overall looks pretty good. simpler and the overall goal (and why it achieves it) is clearer too. seems like just minor tweaks (many optional) and it's probably good to go

3vilhamster

Overall looks good, but I left some nits

3vilhamster · 2025-01-08T11:23:28Z

internal/internal_worker_base.go


-	if bw.pollerAutoScaler != nil {
+	if bw.concurrencyAutoScaler != nil {
 		if pErr := bw.concurrency.PollerPermit.Acquire(bw.limiterContext); pErr == nil {


nit: this looks like a leaky abstraction. This should be handled inside concurrencyAutoScaler.
I suggest moving all concurrencyAutoScaler != nil checks inside the methods where they are required.
This code should be simpler: just call methods on the autoscaler, and if it is nil, do nothing.

we are guarding calls against bw.concurrency based on nilness of bw.concurrencyAutoScaler, which indicates that these two should be abstracted behind a single interface to avoid additional complexity in this file

I've removed this check in all places. Regarding the comment "these two should be abstracted behind a single interface to avoid additional complexity in this file": I still think these are two separate entities. The client still needs concurrency whether the autoscaler is enabled or not.
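
The "if it is nil, do nothing" suggestion in this thread maps onto Go's nil-receiver methods. Below is a minimal sketch of that pattern, not the PR's actual code: the autoScaler type and its Record and Count methods are hypothetical stand-ins for concurrencyAutoScaler.

```go
package main

import "fmt"

// autoScaler is a hypothetical stand-in for the PR's concurrencyAutoScaler.
type autoScaler struct {
	samples []int64
}

// Record is safe to call on a nil *autoScaler: a nil receiver means the
// feature is disabled, so the call is a no-op.
func (a *autoScaler) Record(waitMs int64) {
	if a == nil {
		return
	}
	a.samples = append(a.samples, waitMs)
}

// Count reports how many samples were recorded; zero for a nil receiver.
func (a *autoScaler) Count() int {
	if a == nil {
		return 0
	}
	return len(a.samples)
}

func main() {
	var disabled *autoScaler // nil: autoscaling turned off
	disabled.Record(12)      // no panic, nothing recorded

	enabled := &autoScaler{}
	enabled.Record(12)

	fmt.Println(disabled.Count(), enabled.Count())
}
```

With the nil check inside each method, callers such as the base worker can invoke the autoscaler unconditionally, which is what removes the scattered != nil guards.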

taylanisikdemir · 2025-06-13T21:44:15Z

internal/internal_worker_base.go


-	if bw.pollerAutoScaler != nil {
+	if bw.concurrencyAutoScaler != nil {
 		if pErr := bw.concurrency.PollerPermit.Acquire(bw.limiterContext); pErr == nil {


we are guarding calls against bw.concurrency based on nilness of bw.concurrencyAutoScaler, which indicates that these two should be abstracted behind a single interface to avoid additional complexity in this file

taylanisikdemir · 2025-06-13T21:45:55Z

internal/internal_worker_base.go

+		return t.autoConfigHint
+	default:
+		return nil
+	}


instead of this switch case (which is not future proof), we can cast the task to autoConfigHintAwareTask interface and get the auto config hint

I've removed this to use autoConfigHintAwareTask
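
For readers following along, the switch-versus-interface point can be sketched with a Go type assertion. Only the interface name autoConfigHintAwareTask comes from this thread; the method name, the task types, and the simplified hint struct below are assumptions made for illustration.

```go
package main

import "fmt"

// autoConfigHint is a simplified, hypothetical stand-in for shared.AutoConfigHint.
type autoConfigHint struct {
	PollerWaitTimeInMs int64
}

// autoConfigHintAwareTask is named in the review; the method name is assumed.
type autoConfigHintAwareTask interface {
	getAutoConfigHint() *autoConfigHint
}

// decisionTask is a hypothetical task type that carries a hint.
type decisionTask struct{ hint *autoConfigHint }

func (t *decisionTask) getAutoConfigHint() *autoConfigHint { return t.hint }

// localTask is a hypothetical task type that does not implement the interface.
type localTask struct{}

// hintOf returns the task's hint if the task exposes one, or nil otherwise.
// New task types only need to implement the interface; no switch to extend.
func hintOf(task interface{}) *autoConfigHint {
	if aware, ok := task.(autoConfigHintAwareTask); ok {
		return aware.getAutoConfigHint()
	}
	return nil
}

func main() {
	withHint := &decisionTask{hint: &autoConfigHint{PollerWaitTimeInMs: 20}}
	fmt.Println(hintOf(withHint) != nil)
	fmt.Println(hintOf(&localTask{}) == nil)
}
```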

taylanisikdemir · 2025-06-13T21:50:07Z

internal/worker/concurrency_auto_scaler.go

+	lowerPollerWaitTime           = 16 * time.Millisecond
+	upperPollerWaitTime           = 256 * time.Millisecond


it looks like we would want to iterate on these to adjust sensitivity. consider exposing these to worker config

The poller wait time thresholds are invariant; users don't need to tune them. Sensitivity (time to react) is actually controlled by Cooldown, which is already a parameter.

taylanisikdemir · 2025-06-13T21:54:52Z

internal/worker/concurrency_auto_scaler_test.go

+			},
+		},
+		{
+			"idl pollers waiting for tasks",


nit: typo, "idl" should be "idle". Same in other cases below.

shijiesheng · 2025-06-17T21:54:00Z

coverage failed due to deprecation changes

Groxx

dropping notes for now, while reading tests carefully 👍

overall looks pretty good I think - fairly easy to follow, behavior looks good (e.g. up to 4x growth when "instant", 0.5x shrink when slow, one scale change every 10 seconds sounds reasonable), everything's pretty close.
so just a small pile of minor stuff, some nits some not.
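
The bounds described above (at most 4x growth, at most 0.5x shrink per change) can be illustrated with a small sketch. This is an assumption-laden illustration, not the formula used in the PR: nextPollerCount, its parameters, and the way the raw factor is derived are all hypothetical.

```go
package main

import "fmt"

// Per-step bounds as described in the review summary.
const (
	maxGrowth = 4.0
	maxShrink = 0.5
)

// nextPollerCount applies a raw scaling factor to current, limiting a single
// step to [0.5x, 4x] and the result to [lo, hi]. Hypothetical helper.
func nextPollerCount(current int, factor float64, lo, hi int) int {
	if factor > maxGrowth {
		factor = maxGrowth
	}
	if factor < maxShrink {
		factor = maxShrink
	}
	next := int(float64(current) * factor)
	if next < lo {
		next = lo
	}
	if next > hi {
		next = hi
	}
	return next
}

func main() {
	fmt.Println(nextPollerCount(100, 10, 1, 200)) // 200: growth is clamped by hi
	fmt.Println(nextPollerCount(10, 10, 1, 200))  // 40: growth is limited to 4x
	fmt.Println(nextPollerCount(40, 0.1, 1, 200)) // 20: shrink is limited to 0.5x
}
```

The first call shows why a start value close to the maximum hides the growth bound: the result is set by the upper limit rather than by the per-step factor.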

Groxx · 2025-06-18T21:00:43Z

internal/worker/concurrency_auto_scaler.go

+	autoScalerEventStart                      autoScalerEvent = "auto-scaler-start"
+	autoScalerEventStop                       autoScalerEvent = "auto-scaler-stop"
+	autoScalerEventLogMsg                     string          = "concurrency auto scaler event"
+	testTimeFormat                            string          = "15:04:05"


Suggested change

testTimeFormat string = "15:04:05"

Groxx · 2025-06-25T20:40:15Z

internal/worker/concurrency_auto_scaler_test.go

+			"busy pollers, scale up to maximum",
+			[]*shared.AutoConfigHint{
+				{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, in cool down
+				{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale down to minimum
+			},
+			[]eventLog{
+				{autoScalerEventStart, false, 100, "00:00:00"},
+				{autoScalerEventEnable, true, 100, "00:00:00"},
+				{autoScalerEventPollerSkipUpdateCooldown, true, 100, "00:00:01"},
+				{autoScalerEventPollerScaleUp, true, 200, "00:00:02"},
+				{autoScalerEventStop, true, 200, "00:00:02"},
+			},


might be easier to follow the actual behavior of this one with a less-than-1/2-maximum set of values, e.g. start with 10 rather than 100 -> it won't scale to maximum, it'll scale to 42.

kinda similar for others below, e.g. pollers, scale up and down multiple times becomes:

{autoScalerEventStart, false, 10, "00:00:00"},
{autoScalerEventEnable, true, 10, "00:00:00"},
{autoScalerEventPollerSkipUpdateCooldown, true, 10, "00:00:01"},
{autoScalerEventPollerScaleUp, true, 42, "00:00:02"},
{autoScalerEventPollerSkipUpdateCooldown, true, 42, "00:00:03"},
{autoScalerEventPollerScaleDown, true, 25, "00:00:04"},
{autoScalerEventPollerSkipUpdateCooldown, true, 25, "00:00:05"},
{autoScalerEventPollerScaleUp, true, 104, "00:00:06"},
{autoScalerEventPollerSkipUpdateCooldown, true, 104, "00:00:07"},
{autoScalerEventPollerScaleDown, true, 63, "00:00:08"},
{autoScalerEventStop, true, 63, "00:00:08"},

which seems a bit more informative than "to max, down, back to max, back to same down value"

… fields related

…ep backward compatibility and fix test cases and other comments

shijiesheng requested review from demirkayaender and dkrotx as code owners December 10, 2024 19:37

shijiesheng mentioned this pull request Dec 18, 2024

poller auto scaler rework #1409

Closed

shijiesheng force-pushed the autoscaler-rework branch from 82e6bb6 to 636c433 December 20, 2024 23:01

shijiesheng requested review from 3vilhamster, Groxx, jakobht and taylanisikdemir as code owners December 20, 2024 23:01