Commit 5939e01

jopemachine and claude committed
feat(BA-3435): Implement Rolling Update deployment strategy
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e903f88 commit 5939e01

22 files changed, +1364 -221 lines changed

changes/9997.feature.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
Implement Rolling Update deployment strategy

proposals/BEP-1049-deployment-strategy-handler.md

Lines changed: 60 additions & 96 deletions
Large diffs are not rendered by default.

proposals/BEP-1049/rolling-update.md

Lines changed: 70 additions & 22 deletions
@@ -28,7 +28,7 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Rolling Update dep
2828

2929
```
3030
┌──────────────────────────────────────┐
31-
│ Any New routes PROVISIONING? │──Yes──→ provisioning
31+
│ Any New routes PROVISIONING? │──Yes──→ provisioning (wait)
3232
└──────────────────┬───────────────────┘
3333
No
3434
@@ -53,15 +53,38 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Rolling Update dep
5353
progressing
5454
```
5555

56-
### Sub-Step Variants
56+
Rollback is **not** decided by the FSM itself. If all new routes fail, the FSM will keep attempting to create new routes via the surge/unavailable calculation. Eventually the DEPLOYING timeout (30 min) is exceeded and the coordinator transitions the deployment to ROLLING_BACK via the `expired` path.
5757

58-
Each cycle evaluation directly returns one of the shared sub-step variants. Completion is not a sub-step but a signal on `CycleEvaluationResult(sub_step=PROGRESSING, completed=True)` — the coordinator handles revision swap and READY transition directly.
58+
### Route Classification
5959

60-
| Sub-Step | Condition | Handler Action |
61-
|----------|-----------|----------------|
62-
| **provisioning** | New routes are PROVISIONING | DeployingProvisioningHandler → DEPLOYING→DEPLOYING, reschedule |
63-
| **progressing** | Calculated surge/unavailable, created/terminated routes | DeployingProgressingHandler → DEPLOYING→DEPLOYING, reschedule |
64-
| **progressing** (`completed=True`) | No Old routes and New healthy >= desired_replicas | Coordinator → atomic revision swap + DEPLOYING→READY |
60+
Routes are classified by revision and status:
61+
62+
| Category | Condition | Description |
63+
|----------|-----------|-------------|
64+
| `old_active` | revision != deploying_revision, is_active() | Old routes currently serving traffic |
65+
| `new_provisioning` | revision == deploying_revision, PROVISIONING | New routes being created |
66+
| `new_healthy` | revision == deploying_revision, HEALTHY | New routes ready to serve |
67+
| `new_unhealthy` | revision == deploying_revision, UNHEALTHY/DEGRADED | New routes with issues |
68+
| `new_failed` | revision == deploying_revision, FAILED/TERMINATED | New routes that failed |
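
The classification table above can be sketched in Python roughly as follows. This is a minimal illustration, not the code from this commit: `Route`, `ClassifiedRoutes`, and `classify_routes` are hypothetical names, and the `is_active()` rule used here (anything not FAILED or TERMINATED) is an assumption. The sketch follows the table, which places DEGRADED under `new_unhealthy`; note that the `is_provisioning()` helper added to `types.py` in this same commit also treats DEGRADED as still warming up.

```python
from dataclasses import dataclass, field
from enum import Enum
from uuid import UUID, uuid4


class RouteStatus(Enum):
    PROVISIONING = "provisioning"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"
    FAILED = "failed"
    TERMINATED = "terminated"

    def is_active(self) -> bool:
        # Assumption: a route counts as active unless it failed or terminated.
        return self not in (RouteStatus.FAILED, RouteStatus.TERMINATED)


@dataclass(frozen=True)
class Route:  # hypothetical stand-in for the real route record
    revision: UUID
    status: RouteStatus


@dataclass
class ClassifiedRoutes:  # hypothetical container, one list per table row
    old_active: list[Route] = field(default_factory=list)
    new_provisioning: list[Route] = field(default_factory=list)
    new_healthy: list[Route] = field(default_factory=list)
    new_unhealthy: list[Route] = field(default_factory=list)
    new_failed: list[Route] = field(default_factory=list)


def classify_routes(routes: list[Route], deploying_revision: UUID) -> ClassifiedRoutes:
    """Bucket routes by revision first, then by status, as in the table."""
    buckets = ClassifiedRoutes()
    for route in routes:
        if route.revision != deploying_revision:
            # Old revision: only routes that are still active are tracked.
            if route.status.is_active():
                buckets.old_active.append(route)
        elif route.status is RouteStatus.PROVISIONING:
            buckets.new_provisioning.append(route)
        elif route.status is RouteStatus.HEALTHY:
            buckets.new_healthy.append(route)
        elif route.status in (RouteStatus.UNHEALTHY, RouteStatus.DEGRADED):
            buckets.new_unhealthy.append(route)
        else:
            # Remaining statuses for the new revision: FAILED or TERMINATED.
            buckets.new_failed.append(route)
    return buckets
```

Classifying by revision before status keeps old routes out of the `new_*` buckets regardless of their health.
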
69+
70+
### Handler Flow
71+
72+
All DEPLOYING deployments are handled by `DeployingProvisioningHandler`, which stays in the PROVISIONING sub-step throughout the entire deployment lifecycle. The handler runs the strategy evaluator each cycle:
73+
74+
| Result | Condition | Handler Action |
75+
|--------|-----------|----------------|
76+
| **success** | Evaluator returns COMPLETED (no Old routes, New healthy >= desired) | Coordinator transitions to READY |
77+
| **need_retry** | Route mutations executed (create/drain) | Stays in DEPLOYING/PROVISIONING, history recorded |
78+
| **skipped** | No changes — routes still provisioning or waiting | No transition; coordinator checks for timeout |
79+
| **expired** | Skipped deployment exceeds DEPLOYING timeout (30 min) | Coordinator transitions to DEPLOYING/ROLLING_BACK |
80+
81+
When a deployment transitions to ROLLING_BACK, the `DeployingRollingBackHandler` clears `deploying_revision` and transitions directly to READY.
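
The four results in the table above can be read as a small decision function. The sketch below is an illustration under stated assumptions: `CycleResult`, `Outcome`, and `classify_cycle` are hypothetical names, and the timeout check is folded into the same function for brevity, whereas the proposal has the handler report `skipped` and the coordinator perform the timeout check. Only the 30-minute value comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum

# 30 minutes, as stated in the proposal.
DEPLOYING_TIMEOUT = timedelta(minutes=30)


class Outcome(Enum):
    SUCCESS = "success"
    NEED_RETRY = "need_retry"
    SKIPPED = "skipped"
    EXPIRED = "expired"


@dataclass(frozen=True)
class CycleResult:  # hypothetical summary of one evaluator cycle
    completed: bool  # no Old routes and New healthy >= desired
    mutated: bool    # routes were created or drained this cycle


def classify_cycle(result: CycleResult, phase_started_at: datetime, now: datetime) -> Outcome:
    """Map one evaluation cycle to the result rows of the table."""
    if result.completed:
        return Outcome.SUCCESS
    if result.mutated:
        return Outcome.NEED_RETRY
    # Only deployments with no changes this cycle are checked for timeout.
    if now - phase_started_at > DEPLOYING_TIMEOUT:
        return Outcome.EXPIRED
    return Outcome.SKIPPED
```

Because only unchanged deployments are checked against the timeout, a deployment that keeps mutating routes is never expired in this sketch; whether the real coordinator behaves the same way is not confirmed by the diff shown here.
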
82+
83+
### Safety Guards
84+
85+
- **Zero-downtime protection**: When `max_unavailable < desired`, never terminates ALL old routes until at least one new route is healthy
86+
- **Deadlock prevention**: `RollingUpdateSpec` validator ensures at least one of `max_surge` or `max_unavailable` is positive
87+
- **Timeout-based rollback**: The FSM does not detect failure — the coordinator's timeout mechanism handles it. If the deployment cannot complete within the DEPLOYING timeout (30 min), the coordinator transitions to ROLLING_BACK via the `expired` path
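
The first two guards can be sketched as below. This is a hedged illustration only: `RollingUpdateSpecSketch` and `guard_terminations` are hypothetical names and do not reproduce the real `RollingUpdateSpec` validator or evaluator. The sketch encodes just the two rules as stated in the list: at least one of `max_surge` / `max_unavailable` must be positive, and when `max_unavailable < desired` the last old route is kept until a new route is healthy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollingUpdateSpecSketch:  # hypothetical stand-in for RollingUpdateSpec
    max_surge: int
    max_unavailable: int

    def __post_init__(self) -> None:
        if self.max_surge < 0 or self.max_unavailable < 0:
            raise ValueError("max_surge and max_unavailable must be non-negative")
        # Deadlock prevention: with both at zero, no route could ever be
        # created or terminated, so the rollout would never progress.
        if self.max_surge == 0 and self.max_unavailable == 0:
            raise ValueError("at least one of max_surge or max_unavailable must be positive")


def guard_terminations(
    requested: int,
    *,
    old_active: int,
    new_healthy: int,
    desired: int,
    spec: RollingUpdateSpecSketch,
) -> int:
    """Cap a requested number of old-route terminations by the zero-downtime rule."""
    requested = min(max(requested, 0), old_active)
    would_remove_all_old = requested >= old_active
    if spec.max_unavailable < desired and new_healthy == 0 and would_remove_all_old:
        # Keep one old route serving until a new route becomes healthy.
        return max(old_active - 1, 0)
    return requested
```

For example, with `desired=3` and `max_unavailable=1`, a request to terminate all 3 old routes while no new route is healthy is reduced to 2 in this sketch.
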
6588

6689
## max_surge / max_unavailable Calculation
6790

@@ -107,6 +130,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
107130
│ healthy=3, min_available=2 → can_terminate=1 │
108131
│ │
109132
│ → Create 1 New, Terminate 1 Old │
133+
│ → need_retry (route mutations executed) │
110134
└─────────────────────────────────────────────────────┘
111135
112136
@@ -115,7 +139,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
115139
│ Old: [■ ■] (2 healthy) │
116140
│ New: [◇] (1 provisioning) │
117141
│ │
118-
│ → PROVISIONING exists → wait
142+
│ → PROVISIONING exists → skipped (wait)
119143
└─────────────────────────────────────────────────────┘
120144
121145
@@ -129,6 +153,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
129153
│ healthy=3, min_available=2 → can_terminate=1 │
130154
│ │
131155
│ → Create 1 New, Terminate 1 Old │
156+
│ → need_retry (route mutations executed) │
132157
└─────────────────────────────────────────────────────┘
133158
134159
@@ -137,7 +162,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
137162
│ Old: [■] (1 healthy) │
138163
│ New: [■ ◇] (1 healthy, 1 provisioning) │
139164
│ │
140-
│ → PROVISIONING exists → wait
165+
│ → PROVISIONING exists → skipped (wait)
141166
└─────────────────────────────────────────────────────┘
142167
143168
@@ -151,6 +176,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
151176
│ healthy=3, min_available=2 → can_terminate=1 │
152177
│ │
153178
│ → Create 1 New, Terminate 1 Old │
179+
│ → need_retry (route mutations executed) │
154180
└─────────────────────────────────────────────────────┘
155181
156182
@@ -159,7 +185,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
159185
│ Old: [] │
160186
│ New: [■ ■ ◇] (2 healthy, 1 provisioning) │
161187
│ │
162-
│ → PROVISIONING exists → wait
188+
│ → PROVISIONING exists → skipped (wait)
163189
└─────────────────────────────────────────────────────┘
164190
165191
@@ -170,12 +196,25 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
170196
│ │
171197
│ No Old and New >= desired_replicas → completed │
172198
│ → deploying_revision → current_revision swap │
173-
│ → DEPLOYING → READY state transition
199+
│ → success → coordinator transitions to READY
174200
└─────────────────────────────────────────────────────┘
175201
176202
Legend: ■ = healthy, ◇ = provisioning
177203
```
178204

205+
## Timeout and Rollback
206+
207+
Deploying timeout is handled through the coordinator's generic `expired` transition mechanism:
208+
209+
1. `DeployingProvisioningHandler` declares `expired → DEPLOYING/ROLLING_BACK` in `status_transitions()`
210+
2. Each cycle, the coordinator checks `result.skipped` deployments against the DEPLOYING timeout (30 min)
211+
3. Timeout is measured using `phase_started_at` from `DeploymentWithHistory` — the `created_at` of the first scheduling history record for this handler phase
212+
4. `phase_started_at` is stable across retries: history records with same phase/error_code/to_status are merged (only `attempts` incremented, `created_at` unchanged)
213+
5. Timed-out deployments transition to DEPLOYING/ROLLING_BACK
214+
6. `DeployingRollingBackHandler` clears `deploying_revision` and transitions to READY
215+
216+
No separate timeout handler or periodic task is needed — timeout checking is built into the coordinator's standard transition handling.
217+
179218
## Component Structure
180219

181220
```
@@ -209,7 +248,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
209248
│ │ old_active: old + is_active() │ │
210249
│ └────────────────────────────────────────────────────┘ │
211250
│ │
212-
│ Route changes returned (applied by coordinator):
251+
│ Route changes returned (applied by applier):
213252
│ ┌────────────────────────────────────────────────────┐ │
214253
│ │ rollout_specs: RouteCreatorSpec( │ │
215254
│ │ revision_id = deploying_revision, │ │
@@ -223,14 +262,18 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
223262
224263
225264
┌──────────────────────────────────────────────────────────────┐
226-
│ Per-Sub-Step Handlers (coordinator generic path) │
265+
│ DeployingProvisioningHandler │
266+
│ (single handler for entire DEPLOYING lifecycle) │
227267
│ │
228-
│ PROVISIONING → DeployingProvisioningHandler │
229-
│ next_status: DEPLOYING → coordinator records history │
268+
│ completed → success → coordinator transitions to READY │
269+
│ route mutations → need_retry → stays in PROVISIONING │
270+
│ no changes → skipped → coordinator checks timeout │
271+
│ evaluation errors → errors → classified by coordinator │
230272
│ │
231-
│ PROGRESSING → DeployingProgressingHandler │
232-
│ next_status: DEPLOYING → coordinator records history │
233-
│ completed=True → coordinator atomic revision swap + READY │
273+
│ DeployingRollingBackHandler │
274+
│ (cleanup on timeout) │
275+
│ │
276+
│ clear deploying_revision → success → READY │
234277
└──────────────────────────────────────────────────────────────┘
235278
```
236279

@@ -242,11 +285,16 @@ When all Old routes are removed and New routes reach desired_replicas or above a
242285
completed determination (evaluator)
243286
244287
245-
Coordinator._transition_completed_deployments()
288+
StrategyResultApplier.apply()
246289
→ Atomic transaction:
247290
1. complete_deployment_revision_swap(ids)
248291
current_revision = deploying_revision
249292
deploying_revision = NULL
250-
2. DEPLOYING → READY lifecycle transition
251-
3. History recording
293+
2. Returns completed_ids in StrategyApplyResult
294+
295+
296+
DeployingProvisioningHandler
297+
→ completed_ids → successes
298+
→ coordinator transitions DEPLOYING → READY
299+
→ History recording
252300
```

src/ai/backend/common/dto/manager/deployment/response.py

Lines changed: 3 additions & 0 deletions
@@ -134,6 +134,9 @@ class DeploymentDTO(BaseModel):
     deployment_policy: DeploymentPolicyDTO | None = Field(
         default=None, description="Deployment rollout policy"
     )
+    sub_step: str | None = Field(
+        default=None, description="Current deployment sub-step (e.g. provisioning, rolling_back)"
+    )


 class CreateDeploymentResponse(BaseResponseModel):

src/ai/backend/manager/api/rest/deployment/adapter.py

Lines changed: 1 addition & 0 deletions
@@ -141,6 +141,7 @@ def convert_to_dto(self, data: ModelDeploymentData) -> DeploymentDTO:
             default_deployment_strategy=data.default_deployment_strategy,
             current_revision=current_revision,
             deployment_policy=deployment_policy,
+            sub_step=data.sub_step,
         )

     def build_querier(self, request: SearchDeploymentsRequest) -> BatchQuerier:

src/ai/backend/manager/data/deployment/types.py

Lines changed: 15 additions & 12 deletions
@@ -121,6 +121,10 @@ def is_active(self) -> bool:
     def is_inactive(self) -> bool:
         return self in self.inactive_route_statuses()

+    def is_provisioning(self) -> bool:
+        """PROVISIONING or DEGRADED (still warming up, health checks not yet passing)."""
+        return self in (RouteStatus.PROVISIONING, RouteStatus.DEGRADED)
+
     def termination_priority(self) -> int:
         priority_map = {
             RouteStatus.UNHEALTHY: 1,
@@ -148,16 +152,15 @@ class RouteTrafficStatus(enum.StrEnum):
 # ========== Status Transition Types (BEP-1030) ==========


-class DeploymentSubStatus(enum.StrEnum):
-    """Base class for deployment lifecycle sub-statuses.
+class DeploymentLifecycleSubStep(enum.StrEnum):
+    """Base class for deployment lifecycle sub-steps.

-    Each lifecycle type can define its own sub-status enum by
-    inheriting from this class. For example, DEPLOYING handlers
-    use ``DeploymentSubStep`` (provisioning, rolling_back, …).
+    Each lifecycle type can define its own sub-step enum by
+    inheriting from this class.
     """


-class DeploymentSubStep(DeploymentSubStatus):
+class DeployingSubStep(DeploymentLifecycleSubStep):
     """Sub-steps for the DEPLOYING lifecycle phase.

     - PROVISIONING: New revision routes are being provisioned and old routes
@@ -175,18 +178,17 @@ class DeploymentSubStep(DeploymentSubStatus):
 class DeploymentLifecycleStatus:
     """Target lifecycle state for a deployment status transition.

-    Pairs an EndpointLifecycle with an optional sub-status to provide
+    Pairs an EndpointLifecycle with an optional sub-step to provide
     context about which sub-step led to this transition.

     Attributes:
         lifecycle: The target endpoint lifecycle state
-        sub_status: Optional sub-status indicating what determined this
-            transition. Concrete values come from DeploymentSubStatus
-            subclasses (e.g. DeploymentSubStep for DEPLOYING handlers).
+        sub_step: Optional sub-step indicating what determined this
+            transition (e.g. DeployingSubStep for DEPLOYING handlers).
     """

     lifecycle: EndpointLifecycle
-    sub_status: DeploymentSubStatus | None = None
+    sub_step: DeploymentLifecycleSubStep | None = None


 @dataclass(frozen=True)
@@ -376,7 +378,7 @@ class DeploymentInfo:
     current_revision_id: UUID | None = None
     policy: DeploymentPolicyData | None = None
     deploying_revision_id: UUID | None = None
-    sub_step: DeploymentSubStep | None = None
+    sub_step: str | None = None

     def resolve_revision_spec(self, revision_id: UUID) -> ModelRevisionSpec | None:
         """Find a ModelRevisionSpec by revision_id from model_revisions."""
@@ -569,6 +571,7 @@ class ModelDeploymentData:
     created_user_id: UUID
     policy: DeploymentPolicyData | None = None
     access_token_ids: list[UUID] | None = None
+    sub_step: str | None = None


 class DeploymentOrderField(enum.StrEnum):

src/ai/backend/manager/event_dispatcher/handlers/schedule.py

Lines changed: 3 additions & 3 deletions
@@ -17,7 +17,7 @@
 from ai.backend.common.events.hub.hub import EventHub
 from ai.backend.common.types import AgentId
 from ai.backend.logging.utils import BraceStyleAdapter
-from ai.backend.manager.data.deployment.types import DeploymentSubStep
+from ai.backend.manager.data.deployment.types import DeployingSubStep
 from ai.backend.manager.scheduler.types import ScheduleType
 from ai.backend.manager.sokovan.deployment.coordinator import DeploymentCoordinator
 from ai.backend.manager.sokovan.deployment.route.coordinator import RouteCoordinator
@@ -93,15 +93,15 @@ async def handle_do_deployment_lifecycle_if_needed(
     ) -> None:
         """Handle deployment lifecycle if needed event (checks marks)."""
         lifecycle_type = DeploymentLifecycleType(ev.lifecycle_type)
-        sub_step = DeploymentSubStep(ev.sub_step) if ev.sub_step else None
+        sub_step = DeployingSubStep(ev.sub_step) if ev.sub_step else None
         await self._deployment_coordinator.process_if_needed(lifecycle_type, sub_step)

     async def handle_do_deployment_lifecycle(
         self, _context: None, _agent_id: str, ev: DoDeploymentLifecycleEvent
     ) -> None:
         """Handle deployment lifecycle event (unconditional)."""
         lifecycle_type = DeploymentLifecycleType(ev.lifecycle_type)
-        sub_step = DeploymentSubStep(ev.sub_step) if ev.sub_step else None
+        sub_step = DeployingSubStep(ev.sub_step) if ev.sub_step else None
         await self._deployment_coordinator.process_deployment_lifecycle(lifecycle_type, sub_step)

     async def handle_do_route_lifecycle_if_needed(

src/ai/backend/manager/models/endpoint/row.py

Lines changed: 2 additions & 3 deletions
@@ -67,7 +67,6 @@
     DeploymentMetadata,
     DeploymentNetworkSpec,
     DeploymentState,
-    DeploymentSubStep,
     ExecutionSpec,
     ModelDeploymentAutoScalingRuleData,
     ModelRevisionSpec,
@@ -315,9 +314,9 @@ class EndpointRow(Base): # type: ignore[misc]
     deploying_revision: Mapped[UUID | None] = mapped_column(
         "deploying_revision", GUID, nullable=True
     )
-    sub_step: Mapped[DeploymentSubStep | None] = mapped_column(
+    sub_step: Mapped[str | None] = mapped_column(
         "sub_step",
-        StrEnumType(DeploymentSubStep),
+        sa.String,
         nullable=True,
         default=None,
     )

src/ai/backend/manager/repositories/deployment/creators/deployment.py

Lines changed: 1 addition & 2 deletions
@@ -16,7 +16,6 @@
     RuntimeVariant,
     VFolderMount,
 )
-from ai.backend.manager.data.deployment.types import DeploymentSubStatus
 from ai.backend.manager.models.endpoint import EndpointLifecycle, EndpointRow
 from ai.backend.manager.repositories.base import CreatorSpec
 from ai.backend.manager.repositories.base.updater import BatchUpdaterSpec
@@ -184,7 +183,7 @@ class EndpointLifecycleBatchUpdaterSpec(BatchUpdaterSpec[EndpointRow]):
     """

     lifecycle_stage: EndpointLifecycle
-    sub_step: DeploymentSubStatus | None = None
+    sub_step: str | None = None

     @property
     @override
