[Serve] Skip steady-state per-tick work in DeploymentState via dirty flags #60840

Open

abrarsheikh wants to merge 3 commits into master from 60680-abrar-broadcast

Conversation


abrarsheikh (Contributor) commented Feb 8, 2026

The Serve controller runs an update loop every tick for every deployment. In steady state (all replicas healthy, nothing scaling), nearly all of this work is pure waste that scales linearly with replica count:

  1. broadcast_running_replicas_if_changed() constructs N RunningReplicaInfo objects and builds two sets just to compare them — every tick, for every deployment.
  2. check_curr_status() (called twice per tick), scale_deployment_replicas(), and migrate_replicas_on_draining_nodes() all pop, count, and re-add replicas unconditionally, even when nothing has changed.
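
As a rough illustration of that per-tick waste, here is a minimal sketch of the comparison pattern — simplified names and a stand-in info class, not the actual controller code:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ReplicaInfo:  # stand-in for Serve's RunningReplicaInfo
        replica_id: str
        node_id: str

    class DeploymentStateSketch:
        def __init__(self, replicas):
            self._replicas = replicas  # e.g. [{"id": "r1", "node": "n1"}, ...]
            self._last_broadcast = set()

        def broadcast_running_replicas_if_changed(self):
            # O(N) every tick: construct fresh info objects for all running replicas...
            infos = {ReplicaInfo(r["id"], r["node"]) for r in self._replicas}
            # ...then build and compare two sets just to discover nothing changed.
            if infos == self._last_broadcast:
                return  # steady state: all of the work above was wasted
            self._last_broadcast = infos
            # (the real method would notify the long-poll host here)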

This directly addresses the existing TODO:

TODO(edoakes): we could make this more efficient in steady-state by having a "healthy" flag that gets flipped if an update or replica failure happens.

What

Two complementary dirty flags on DeploymentState:

_replicas_changed — guards broadcast_running_replicas_if_changed(). When False (and _request_routing_info_updated is also False), the method returns immediately: zero object construction, zero hashing, zero set comparison. Set when replicas transition state, routing stats change, availability-related fields change, or a config update requires a long-poll broadcast.

_needs_reconciliation — guards the expensive reconciliation methods. When False, the deployment is in steady state (all replicas RUNNING at target version, status HEALTHY). The following methods early-return:

  • check_curr_status() → returns (False, False)
  • scale_deployment_replicas() → returns ([], None)
  • migrate_replicas_on_draining_nodes({}) → returns immediately
  • Startup/stopping replica checks inside check_and_update_replicas() → skipped

Health checks on RUNNING/PENDING_MIGRATION replicas and the rank consistency check always run regardless of flag state.

_needs_reconciliation is cleared only when check_curr_status() confirms steady state: no pending operations, all replicas RUNNING at the target version, and the replica count at target.
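
A minimal sketch of the resulting guard pattern — only the flag names and the fast-path return values come from this PR; the method bodies are placeholders:

    class GuardedDeploymentState:
        def __init__(self):
            self._replicas_changed = False
            self._request_routing_info_updated = False
            self._needs_reconciliation = False

        def broadcast_running_replicas_if_changed(self):
            if not (self._replicas_changed or self._request_routing_info_updated):
                return  # O(1) fast path: no object construction, hashing, or set diff
            # ... existing O(N) compare-and-broadcast logic would run here ...
            self._replicas_changed = False  # assumption: cleared once the broadcast runs

        def scale_deployment_replicas(self):
            if not self._needs_reconciliation:
                return [], None  # steady state: skip the pop/count/re-add work
            # ... existing reconciliation logic would run here ...
            return [], None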

Flag-setting sites (both flags are set together at each site; a sketch of the pattern follows this list):

  • _set_target_state() — deploy, redeploy, autoscale, manual replica count change
  • _set_target_state_deleting() — deployment deletion
  • _stop_replica() — replica stopped (health check failure, scaling down, version update)
  • _check_startup_replicas() — replica enters RUNNING
  • _stop_or_update_outdated_version_replicas() — RUNNING→UPDATING, or requires_long_poll_broadcast
  • record_replica_startup_failure() — startup failure (may change availability)
  • check_curr_status() — _replica_has_started flips to True
  • migrate_replicas_on_draining_nodes() — RUNNING→PENDING_MIGRATION
  • _stop_one_running_replica_for_testing() — test helper
  • Health-check loop — routing stats changed on a replica
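
Since both flags are always set together, each site reduces to the same two assignments. A sketch using a hypothetical _mark_dirty() helper (the PR may set the attributes inline at each site instead):

    class DeploymentStateFlags:
        def __init__(self):
            self._replicas_changed = False
            self._needs_reconciliation = False

        def _mark_dirty(self) -> None:
            # Hypothetical helper: force both the broadcast comparison and the
            # reconciliation methods to run on the next control-loop tick.
            self._replicas_changed = True
            self._needs_reconciliation = True

        def _stop_replica(self, replica):
            self._mark_dirty()  # replica set is changing (health failure, scale-down, update)
            # ... existing stop logic ...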

Refactoring: Extracted _check_and_update_transitioning_replicas() from check_and_update_replicas() so the startup/stopping sections can be cleanly guarded by _needs_reconciliation while the rank consistency check remains accessible.
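
The rough shape of that split, as a sketch — check_and_update_replicas() and _check_and_update_transitioning_replicas() are named in the PR, while the health/rank helpers below are assumed stand-ins for the always-run work:

    class ReconcilingDeploymentState:
        def __init__(self):
            self._needs_reconciliation = False

        def check_and_update_replicas(self):
            if self._needs_reconciliation:
                # startup/stopping transitions only matter when something is in flight
                self._check_and_update_transitioning_replicas()
            # health checks and the rank consistency check run unconditionally
            self._check_replica_health()      # assumed name
            self._check_rank_consistency()    # assumed name

        def _check_and_update_transitioning_replicas(self):
            pass  # extracted startup/stopping handling

        def _check_replica_health(self):
            pass

        def _check_rank_consistency(self):
            pass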

Benchmark results

Broadcast optimization — steady-state broadcast_running_replicas_if_changed():

Replicas   OLD p50(µs)   NEW p50(µs)   Speedup
      16          96.3          0.19      507x
     256       1,786.7          0.18    9,926x
   1,024       7,580.1          0.18   42,112x
   4,096      37,560.2          0.18  208,668x

Reconciliation optimization — skippable methods (check_curr_status ×2, scale_deployment_replicas, migrate, broadcast):

Replicas   OLD p50(µs)   NEW p50(µs)   Speedup
      16         243.0          0.5      496x
     256       4,252.5          0.5    8,859x
   1,024      18,335.3          0.5   39,011x
   4,096      97,783.7          0.5  208,051x

Both fast paths are constant-time (~0.2–0.5µs) regardless of replica count. At 4,096 replicas, the optimization eliminates ~98ms of per-deployment per-tick overhead that was previously pure waste in steady state.
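
Numbers like these can be reproduced with a simple median-timing harness along the following lines (an assumed setup; the PR's actual benchmark script is not shown here):

    import statistics
    import time

    def p50_micros(fn, iters=1000):
        """Median wall-clock time of fn() in microseconds."""
        samples = []
        for _ in range(iters):
            start = time.perf_counter()
            fn()
            samples.append((time.perf_counter() - start) * 1e6)
        return statistics.median(samples)

    # e.g., with a deployment state prepared at N steady-state replicas:
    # p50_micros(deployment_state.broadcast_running_replicas_if_changed)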

abrarsheikh requested a review from a team as a code owner February 8, 2026 00:07
abrarsheikh added the "go" label (add ONLY when ready to merge, run all tests) Feb 8, 2026

gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a significant performance optimization for broadcast_running_replicas_if_changed() by using a dirty flag, _replicas_changed. This avoids unnecessary work in steady state, leading to impressive speedups, especially for deployments with a large number of replicas. The implementation is clean and the new tests cover the main scenarios.

I found one potential issue where a lightweight configuration update (like changing max_ongoing_requests) might not set the dirty flag, leading to a missed broadcast. I've added a specific comment with a suggested fix. Other than that, the changes look great.

abrarsheikh changed the title from "[Serve] Skip broadcast_running_replicas_if_changed() in steady state via dirty flag" to "[Serve] Skip steady-state per-tick work in DeploymentState via dirty flags" Feb 8, 2026

cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Context from the diff (the flag is cleared inside check_curr_status()):

    self._replica_constructor_retry_counter = 0
    # Deployment is in steady state: all replicas RUNNING at
    # target version, no pending operations.
    self._needs_reconciliation = False
PENDING_MIGRATION replicas stuck due to incomplete steady-state check

Medium Severity

The _needs_reconciliation flag is cleared at line 3058 when the deployment is considered "in steady state," but the pending operations check at lines 3032-3040 only includes STARTING, UPDATING, RECOVERING, and STOPPING states — not PENDING_MIGRATION. This allows the flag to be cleared while PENDING_MIGRATION replicas still exist. Combined with the new fast path in migrate_replicas_on_draining_nodes (lines 3506-3507), these replicas become stuck and never transition back to RUNNING or get stopped when their node stops draining.
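
A sketch of the kind of check the reviewer is describing — treating PENDING_MIGRATION as pending work before clearing the flag (illustrative only, not the PR's eventual fix):

    # States that should block clearing _needs_reconciliation (sketch).
    PENDING_WORK_STATES = [
        "STARTING",
        "UPDATING",
        "RECOVERING",
        "STOPPING",
        "PENDING_MIGRATION",  # missing from the check flagged above
    ]

    def can_clear_needs_reconciliation(replica_counts: dict) -> bool:
        """replica_counts maps state name -> replica count (illustrative)."""
        return all(replica_counts.get(s, 0) == 0 for s in PENDING_WORK_STATES)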

Additional Locations (2)
