lablup
diff --git a/changes/9997.feature.md b/changes/9997.feature.md
Lines changed: 1 addition & 0 deletions
diff --git a/proposals/BEP-1049-deployment-strategy-handler.md b/proposals/BEP-1049-deployment-strategy-handler.md
Lines changed: 60 additions & 96 deletions
diff --git a/proposals/BEP-1049/rolling-update.md b/proposals/BEP-1049/rolling-update.md
Lines changed: 70 additions & 22 deletions
diff --git a/src/ai/backend/common/dto/manager/deployment/response.py b/src/ai/backend/common/dto/manager/deployment/response.py
Lines changed: 3 additions & 0 deletions
diff --git a/src/ai/backend/manager/api/rest/deployment/adapter.py b/src/ai/backend/manager/api/rest/deployment/adapter.py
Lines changed: 1 addition & 0 deletions
diff --git a/src/ai/backend/manager/data/deployment/types.py b/src/ai/backend/manager/data/deployment/types.py
Lines changed: 15 additions & 12 deletions
diff --git a/src/ai/backend/manager/event_dispatcher/handlers/schedule.py b/src/ai/backend/manager/event_dispatcher/handlers/schedule.py
Lines changed: 3 additions & 3 deletions
diff --git a/src/ai/backend/manager/models/endpoint/row.py b/src/ai/backend/manager/models/endpoint/row.py
Lines changed: 2 additions & 3 deletions
diff --git a/src/ai/backend/manager/repositories/deployment/creators/deployment.py b/src/ai/backend/manager/repositories/deployment/creators/deployment.py
Lines changed: 1 addition & 2 deletions
@@ -0,0 +1 @@
+Implement Rolling Update deployment strategy
@@ -28,7 +28,7 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Rolling Update dep
 
 ```
   ┌──────────────────────────────────────┐
-  │  Any New routes PROVISIONING?        │──Yes──→ provisioning
+  │  Any New routes PROVISIONING?        │──Yes──→ provisioning (wait)
   └──────────────────┬───────────────────┘
                      No
                      ▼
@@ -53,15 +53,38 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Rolling Update dep
                 progressing
 ```
 
-### Sub-Step Variants
+Rollback is **not** decided by the FSM itself. If all new routes fail, the FSM will keep attempting to create new routes via the surge/unavailable calculation. Eventually the DEPLOYING timeout (30 min) is exceeded and the coordinator transitions the deployment to ROLLING_BACK via the `expired` path.
 
-Each cycle evaluation directly returns one of the shared sub-step variants. Completion is not a sub-step but a signal on `CycleEvaluationResult(sub_step=PROGRESSING, completed=True)` — the coordinator handles revision swap and READY transition directly.
+### Route Classification
 
-| Sub-Step | Condition | Handler Action |
-|----------|-----------|----------------|
-| **provisioning** | New routes are PROVISIONING | DeployingProvisioningHandler → DEPLOYING→DEPLOYING, reschedule |
-| **progressing** | Calculated surge/unavailable, created/terminated routes | DeployingProgressingHandler → DEPLOYING→DEPLOYING, reschedule |
-| **progressing** (`completed=True`) | No Old routes and New healthy >= desired_replicas | Coordinator → atomic revision swap + DEPLOYING→READY |
+Routes are classified by revision and status:
+
+| Category | Condition | Description |
+|----------|-----------|-------------|
+| `old_active` | revision != deploying_revision, is_active() | Old routes currently serving traffic |
+| `new_provisioning` | revision == deploying_revision, PROVISIONING | New routes being created |
+| `new_healthy` | revision == deploying_revision, HEALTHY | New routes ready to serve |
+| `new_unhealthy` | revision == deploying_revision, UNHEALTHY/DEGRADED | New routes with issues |
+| `new_failed` | revision == deploying_revision, FAILED/TERMINATED | New routes that failed |
+
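The classification table above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluator's actual code: the `Route` dataclass, the `classify` helper, and the reduced `RouteStatus` enum are hypothetical stand-ins, and the `is_active()` rule (everything except FAILED/TERMINATED) is an assumption. Note that the `is_provisioning()` helper added to `types.py` in this same PR groups DEGRADED with PROVISIONING, so the real grouping may differ from the table followed here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RouteStatus(str, Enum):
    # Hypothetical subset of statuses, named after the table above.
    PROVISIONING = "provisioning"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"
    FAILED = "failed"
    TERMINATED = "terminated"

    def is_active(self) -> bool:
        # Assumption: every status except FAILED/TERMINATED counts as active.
        return self not in (RouteStatus.FAILED, RouteStatus.TERMINATED)


@dataclass(frozen=True)
class Route:
    revision: str
    status: RouteStatus


# Category for routes that belong to the deploying revision.
_NEW_CATEGORIES = {
    RouteStatus.PROVISIONING: "new_provisioning",
    RouteStatus.HEALTHY: "new_healthy",
    RouteStatus.UNHEALTHY: "new_unhealthy",
    RouteStatus.DEGRADED: "new_unhealthy",
    RouteStatus.FAILED: "new_failed",
    RouteStatus.TERMINATED: "new_failed",
}


def classify(route: Route, deploying_revision: str) -> Optional[str]:
    """Return the table category for a route, or None for inactive old routes."""
    if route.revision == deploying_revision:
        return _NEW_CATEGORIES[route.status]
    return "old_active" if route.status.is_active() else None
```

Old routes that are no longer active fall outside the table, so the sketch returns `None` for them rather than inventing a sixth category.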
+### Handler Flow
+
+All DEPLOYING deployments are handled by `DeployingProvisioningHandler`, which stays in the PROVISIONING sub-step throughout the entire deployment lifecycle. The handler runs the strategy evaluator each cycle:
+
+| Result | Condition | Handler Action |
+|--------|-----------|----------------|
+| **success** | Evaluator returns COMPLETED (no Old routes, New healthy >= desired) | Coordinator transitions to READY |
+| **need_retry** | Route mutations executed (create/drain) | Stays in DEPLOYING/PROVISIONING, history recorded |
+| **skipped** | No changes — routes still provisioning or waiting | No transition; coordinator checks for timeout |
+| **expired** | Skipped deployment exceeds DEPLOYING timeout (30 min) | Coordinator transitions to DEPLOYING/ROLLING_BACK |
+
+When a deployment transitions to ROLLING_BACK, the `DeployingRollingBackHandler` clears `deploying_revision` and transitions directly to READY.
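
The result table above reads as a small decision function. The sketch below is one way to express it, assuming the handler can summarize a cycle as "completed", "mutated routes", or neither; `cycle_result` and its parameters are hypothetical names, and in the actual design the `expired` check is done by the coordinator on skipped deployments rather than inside the handler.

```python
from datetime import timedelta

# DEPLOYING timeout as stated in the proposal text (30 min).
DEPLOYING_TIMEOUT = timedelta(minutes=30)


def cycle_result(completed: bool, mutated: bool, elapsed: timedelta) -> str:
    """Map one evaluation cycle to a result bucket from the table above."""
    if completed:
        # No old routes and enough healthy new routes.
        return "success"
    if mutated:
        # Routes were created or drained this cycle; timeout is not checked.
        return "need_retry"
    # Nothing changed: only skipped deployments are checked for timeout.
    if elapsed > DEPLOYING_TIMEOUT:
        return "expired"
    return "skipped"
```

For example, `cycle_result(False, False, timedelta(minutes=31))` yields `"expired"`, while the same elapsed time with `mutated=True` still yields `"need_retry"`, because the table only applies the timeout to skipped deployments.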
+
+### Safety Guards
+
+- **Zero-downtime protection**: When `max_unavailable < desired`, the strategy never terminates all old routes until at least one new route is healthy
+- **Deadlock prevention**: `RollingUpdateSpec` validator ensures at least one of `max_surge` or `max_unavailable` is positive
+- **Timeout-based rollback**: The FSM does not detect failure — the coordinator's timeout mechanism handles it. If the deployment cannot complete within the DEPLOYING timeout (30 min), the coordinator transitions to ROLLING_BACK via the `expired` path
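
The first two guards can be illustrated as follows. This is a sketch under stated assumptions: the `RollingUpdateSpec` below is a plain dataclass stand-in (the real class and its validator are not shown in this diff), and `cap_termination` is a hypothetical helper showing one way to keep at least one old route alive until a new route is healthy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollingUpdateSpec:
    # Stand-in for the real spec; only the two fields named in the text.
    max_surge: int
    max_unavailable: int

    def __post_init__(self) -> None:
        if self.max_surge < 0 or self.max_unavailable < 0:
            raise ValueError("max_surge and max_unavailable must be >= 0")
        # Deadlock prevention: with both at zero, no route could ever be
        # created or terminated, so the rollout would never progress.
        if self.max_surge == 0 and self.max_unavailable == 0:
            raise ValueError("max_surge or max_unavailable must be positive")


def cap_termination(
    can_terminate: int,
    old_active: int,
    new_healthy: int,
    desired: int,
    max_unavailable: int,
) -> int:
    """Zero-downtime guard: keep one old route until a new one is healthy."""
    if max_unavailable < desired and new_healthy == 0:
        return min(can_terminate, max(old_active - 1, 0))
    return can_terminate
```

With `desired=3`, `max_unavailable=2`, and no healthy new routes, a request to terminate all three old routes is capped at two.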
 
 ## max_surge / max_unavailable Calculation
 
@@ -107,6 +130,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
   │  healthy=3, min_available=2 → can_terminate=1       │
   │                                                     │
   │  → Create 1 New, Terminate 1 Old                    │
+  │  → need_retry (route mutations executed)             │
   └─────────────────────────────────────────────────────┘
                           │
                           ▼
@@ -115,7 +139,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
   │  Old: [■ ■]    (2 healthy)                          │
   │  New: [◇]      (1 provisioning)                     │
   │                                                     │
-  │  → PROVISIONING exists → wait                       │
+  │  → PROVISIONING exists → skipped (wait)             │
   └─────────────────────────────────────────────────────┘
                           │
                           ▼
@@ -129,6 +153,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
   │  healthy=3, min_available=2 → can_terminate=1       │
   │                                                     │
   │  → Create 1 New, Terminate 1 Old                    │
+  │  → need_retry (route mutations executed)             │
   └─────────────────────────────────────────────────────┘
                           │
                           ▼
@@ -137,7 +162,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
   │  Old: [■]      (1 healthy)                          │
   │  New: [■ ◇]    (1 healthy, 1 provisioning)          │
   │                                                     │
-  │  → PROVISIONING exists → wait                       │
+  │  → PROVISIONING exists → skipped (wait)             │
   └─────────────────────────────────────────────────────┘
                           │
                           ▼
@@ -151,6 +176,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
   │  healthy=3, min_available=2 → can_terminate=1       │
   │                                                     │
   │  → Create 1 New, Terminate 1 Old                    │
+  │  → need_retry (route mutations executed)             │
   └─────────────────────────────────────────────────────┘
                           │
                           ▼
@@ -159,7 +185,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
   │  Old: []                                            │
   │  New: [■ ■ ◇]  (2 healthy, 1 provisioning)          │
   │                                                     │
-  │  → PROVISIONING exists → wait                       │
+  │  → PROVISIONING exists → skipped (wait)             │
   └─────────────────────────────────────────────────────┘
                           │
                           ▼
@@ -170,12 +196,25 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
   │                                                     │
   │  No Old and New >= desired_replicas → completed     │
   │  → deploying_revision → current_revision swap       │
-  │  → DEPLOYING → READY state transition               │
+  │  → success → coordinator transitions to READY       │
   └─────────────────────────────────────────────────────┘
 
   Legend: ■ = healthy, ◇ = provisioning
 ```
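
The walk-through above can be reproduced with a short simulation. The formulas below are an assumption chosen to match the numbers in the diagram (`healthy=3, min_available=2 → can_terminate=1`); the evaluator's real calculation is not shown in this diff, and `plan_cycle` and `simulate` are hypothetical names. The simulation also assumes every provisioning route becomes healthy by the next cycle.

```python
def plan_cycle(desired, max_surge, max_unavailable, old, new_healthy, new_prov):
    """Return (routes_to_create, routes_to_terminate) for one cycle."""
    if new_prov > 0:
        return 0, 0  # PROVISIONING exists: wait (skipped)
    total = old + new_healthy
    # Surge budget: never exceed desired + max_surge routes in total,
    # and never create more new routes than are still missing.
    can_create = max(0, min(desired + max_surge - total, desired - new_healthy))
    # Availability budget: keep at least desired - max_unavailable healthy.
    min_available = max(0, desired - max_unavailable)
    can_terminate = max(0, min(total - min_available, old))
    return can_create, can_terminate


def simulate(desired, max_surge, max_unavailable, max_cycles=100):
    """Run cycles until no old routes remain and new healthy >= desired."""
    old, new_healthy, new_prov = desired, 0, 0
    trace = []
    for _ in range(max_cycles):
        if old == 0 and new_healthy >= desired:
            break  # completed
        create, terminate = plan_cycle(
            desired, max_surge, max_unavailable, old, new_healthy, new_prov
        )
        trace.append((old, new_healthy, new_prov, create, terminate))
        if new_prov:
            # Assumption: provisioning routes turn healthy before the next cycle.
            new_healthy, new_prov = new_healthy + new_prov, 0
        else:
            old, new_prov = old - terminate, new_prov + create
    return trace
```

Calling `simulate(3, 1, 1)` produces six cycles that alternate between "create 1, terminate 1" and "wait", matching the boxes in the diagram before the final completed state.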
 
+## Timeout and Rollback
+
+Deploying timeout is handled through the coordinator's generic `expired` transition mechanism:
+
+1. `DeployingProvisioningHandler` declares `expired → DEPLOYING/ROLLING_BACK` in `status_transitions()`
+2. Each cycle, the coordinator checks `result.skipped` deployments against the DEPLOYING timeout (30 min)
+3. Timeout is measured using `phase_started_at` from `DeploymentWithHistory` — the `created_at` of the first scheduling history record for this handler phase
+4. `phase_started_at` is stable across retries: history records with the same phase/error_code/to_status are merged (only `attempts` is incremented; `created_at` is unchanged)
+5. Timed-out deployments transition to DEPLOYING/ROLLING_BACK
+6. `DeployingRollingBackHandler` clears `deploying_revision` and transitions to READY
+
+No separate timeout handler or periodic task is needed — timeout checking is built into the coordinator's standard transition handling.
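
Steps 3 and 4 above explain why retries do not reset the timeout clock. The sketch below illustrates that idea with an in-memory history list; `HistoryRecord`, `record_attempt`, `phase_started_at`, and `is_expired` are hypothetical names, and the real scheduling history lives in the database rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

DEPLOYING_TIMEOUT = timedelta(minutes=30)  # value stated in the proposal


@dataclass
class HistoryRecord:
    phase: str
    error_code: Optional[str]
    to_status: str
    created_at: datetime
    attempts: int = 1


def record_attempt(history: List[HistoryRecord], phase: str,
                   error_code: Optional[str], to_status: str,
                   now: datetime) -> None:
    """Append a record, or merge into the last one if the key matches."""
    last = history[-1] if history else None
    key = (phase, error_code, to_status)
    if last and (last.phase, last.error_code, last.to_status) == key:
        last.attempts += 1  # created_at is left unchanged on merge
        return
    history.append(HistoryRecord(phase, error_code, to_status, now))


def phase_started_at(history: List[HistoryRecord], phase: str) -> Optional[datetime]:
    """created_at of the first history record for the given phase."""
    for rec in history:
        if rec.phase == phase:
            return rec.created_at
    return None


def is_expired(history: List[HistoryRecord], phase: str, now: datetime) -> bool:
    started = phase_started_at(history, phase)
    return started is not None and now - started > DEPLOYING_TIMEOUT
```

Because merged records keep their original `created_at`, a deployment that retries every few minutes still expires 30 minutes after the phase first began.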
+
 ## Component Structure
 
 ```
@@ -209,7 +248,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
   │  │  old_active:       old + is_active()               │      │
   │  └────────────────────────────────────────────────────┘      │
   │                                                              │
-  │  Route changes returned (applied by coordinator):            │
+  │  Route changes returned (applied by applier):                │
   │  ┌────────────────────────────────────────────────────┐      │
   │  │  rollout_specs: RouteCreatorSpec(                  │      │
   │  │    revision_id = deploying_revision,               │      │
@@ -223,14 +262,18 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
                              │
                              ▼
   ┌──────────────────────────────────────────────────────────────┐
-  │  Per-Sub-Step Handlers (coordinator generic path)            │
+  │  DeployingProvisioningHandler                                │
+  │  (single handler for entire DEPLOYING lifecycle)             │
   │                                                              │
-  │  PROVISIONING → DeployingProvisioningHandler                  │
-  │    next_status: DEPLOYING → coordinator records history      │
+  │  completed → success → coordinator transitions to READY      │
+  │  route mutations → need_retry → stays in PROVISIONING        │
+  │  no changes → skipped → coordinator checks timeout           │
+  │  evaluation errors → errors → classified by coordinator      │
   │                                                              │
-  │  PROGRESSING → DeployingProgressingHandler                   │
-  │    next_status: DEPLOYING → coordinator records history      │
-  │    completed=True → coordinator atomic revision swap + READY │
+  │  DeployingRollingBackHandler                                 │
+  │  (cleanup on timeout)                                        │
+  │                                                              │
+  │  clear deploying_revision → success → READY                  │
   └──────────────────────────────────────────────────────────────┘
 ```
 
@@ -242,11 +285,16 @@ When all Old routes are removed and New routes reach desired_replicas or above a
   completed determination (evaluator)
        │
        ▼
-  Coordinator._transition_completed_deployments()
+  StrategyResultApplier.apply()
     → Atomic transaction:
       1. complete_deployment_revision_swap(ids)
          current_revision = deploying_revision
          deploying_revision = NULL
-      2. DEPLOYING → READY lifecycle transition
-      3. History recording
+      2. Returns completed_ids in StrategyApplyResult
+       │
+       ▼
+  DeployingProvisioningHandler
+    → completed_ids → successes
+    → coordinator transitions DEPLOYING → READY
+    → History recording
 ```
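
The completion flow above, and the rollback cleanup described earlier, differ only in which revision survives. The following in-memory sketch shows that contrast; the `Deployment` dataclass and both functions are illustrative stand-ins (the real `complete_deployment_revision_swap` runs inside a database transaction and its signature is not shown in this diff).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Deployment:
    current_revision: Optional[str]
    deploying_revision: Optional[str]
    lifecycle: str = "DEPLOYING"


def complete_deployment_revision_swap(
    deployments: Dict[str, Deployment], ids: Iterable[str]
) -> List[str]:
    """Promote deploying_revision to current_revision; return completed ids."""
    completed = []
    for dep_id in ids:
        dep = deployments[dep_id]
        dep.current_revision = dep.deploying_revision
        dep.deploying_revision = None
        completed.append(dep_id)
    return completed


def roll_back(deployments: Dict[str, Deployment], ids: Iterable[str]) -> None:
    """Rollback cleanup: drop deploying_revision, keep current_revision."""
    for dep_id in ids:
        dep = deployments[dep_id]
        dep.deploying_revision = None
        dep.lifecycle = "READY"


def mark_ready(deployments: Dict[str, Deployment], ids: Iterable[str]) -> None:
    """Stand-in for the coordinator's DEPLOYING -> READY transition."""
    for dep_id in ids:
        deployments[dep_id].lifecycle = "READY"
```

In both paths the deployment ends in READY with `deploying_revision` cleared; on success `current_revision` points at the new revision, and on rollback it still points at the old one.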
@@ -134,6 +134,9 @@ class DeploymentDTO(BaseModel):
     deployment_policy: DeploymentPolicyDTO | None = Field(
         default=None, description="Deployment rollout policy"
     )
+    sub_step: str | None = Field(
+        default=None, description="Current deployment sub-step (e.g. provisioning, rolling_back)"
+    )
 
 
 class CreateDeploymentResponse(BaseResponseModel):
 
@@ -141,6 +141,7 @@ def convert_to_dto(self, data: ModelDeploymentData) -> DeploymentDTO:
             default_deployment_strategy=data.default_deployment_strategy,
             current_revision=current_revision,
             deployment_policy=deployment_policy,
+            sub_step=data.sub_step,
         )
 
     def build_querier(self, request: SearchDeploymentsRequest) -> BatchQuerier:
 
@@ -121,6 +121,10 @@ def is_active(self) -> bool:
     def is_inactive(self) -> bool:
         return self in self.inactive_route_statuses()
 
+    def is_provisioning(self) -> bool:
+        """PROVISIONING or DEGRADED (still warming up, health checks not yet passing)."""
+        return self in (RouteStatus.PROVISIONING, RouteStatus.DEGRADED)
+
     def termination_priority(self) -> int:
         priority_map = {
             RouteStatus.UNHEALTHY: 1,
@@ -148,16 +152,15 @@ class RouteTrafficStatus(enum.StrEnum):
 # ========== Status Transition Types (BEP-1030) ==========
 
 
-class DeploymentSubStatus(enum.StrEnum):
-    """Base class for deployment lifecycle sub-statuses.
+class DeploymentLifecycleSubStep(enum.StrEnum):
+    """Base class for deployment lifecycle sub-steps.
 
-    Each lifecycle type can define its own sub-status enum by
-    inheriting from this class.  For example, DEPLOYING handlers
-    use ``DeploymentSubStep`` (provisioning, rolling_back, …).
+    Each lifecycle type can define its own sub-step enum by
+    inheriting from this class.
     """
 
 
-class DeploymentSubStep(DeploymentSubStatus):
+class DeployingSubStep(DeploymentLifecycleSubStep):
     """Sub-steps for the DEPLOYING lifecycle phase.
 
     - PROVISIONING: New revision routes are being provisioned and old routes
@@ -175,18 +178,17 @@ class DeploymentSubStep(DeploymentSubStatus):
 class DeploymentLifecycleStatus:
     """Target lifecycle state for a deployment status transition.
 
-    Pairs an EndpointLifecycle with an optional sub-status to provide
+    Pairs an EndpointLifecycle with an optional sub-step to provide
     context about which sub-step led to this transition.
 
     Attributes:
         lifecycle: The target endpoint lifecycle state
-        sub_status: Optional sub-status indicating what determined this
-            transition. Concrete values come from DeploymentSubStatus
-            subclasses (e.g. DeploymentSubStep for DEPLOYING handlers).
+        sub_step: Optional sub-step indicating what determined this
+            transition (e.g. DeployingSubStep for DEPLOYING handlers).
     """
 
     lifecycle: EndpointLifecycle
-    sub_status: DeploymentSubStatus | None = None
+    sub_step: DeploymentLifecycleSubStep | None = None
 
 
 @dataclass(frozen=True)
@@ -376,7 +378,7 @@ class DeploymentInfo:
     current_revision_id: UUID | None = None
     policy: DeploymentPolicyData | None = None
     deploying_revision_id: UUID | None = None
-    sub_step: DeploymentSubStep | None = None
+    sub_step: str | None = None
 
     def resolve_revision_spec(self, revision_id: UUID) -> ModelRevisionSpec | None:
         """Find a ModelRevisionSpec by revision_id from model_revisions."""
@@ -569,6 +571,7 @@ class ModelDeploymentData:
     created_user_id: UUID
     policy: DeploymentPolicyData | None = None
     access_token_ids: list[UUID] | None = None
+    sub_step: str | None = None
 
 
 class DeploymentOrderField(enum.StrEnum):
 
@@ -17,7 +17,7 @@
 from ai.backend.common.events.hub.hub import EventHub
 from ai.backend.common.types import AgentId
 from ai.backend.logging.utils import BraceStyleAdapter
-from ai.backend.manager.data.deployment.types import DeploymentSubStep
+from ai.backend.manager.data.deployment.types import DeployingSubStep
 from ai.backend.manager.scheduler.types import ScheduleType
 from ai.backend.manager.sokovan.deployment.coordinator import DeploymentCoordinator
 from ai.backend.manager.sokovan.deployment.route.coordinator import RouteCoordinator
@@ -93,15 +93,15 @@ async def handle_do_deployment_lifecycle_if_needed(
     ) -> None:
         """Handle deployment lifecycle if needed event (checks marks)."""
         lifecycle_type = DeploymentLifecycleType(ev.lifecycle_type)
-        sub_step = DeploymentSubStep(ev.sub_step) if ev.sub_step else None
+        sub_step = DeployingSubStep(ev.sub_step) if ev.sub_step else None
         await self._deployment_coordinator.process_if_needed(lifecycle_type, sub_step)
 
     async def handle_do_deployment_lifecycle(
         self, _context: None, _agent_id: str, ev: DoDeploymentLifecycleEvent
     ) -> None:
         """Handle deployment lifecycle event (unconditional)."""
         lifecycle_type = DeploymentLifecycleType(ev.lifecycle_type)
-        sub_step = DeploymentSubStep(ev.sub_step) if ev.sub_step else None
+        sub_step = DeployingSubStep(ev.sub_step) if ev.sub_step else None
         await self._deployment_coordinator.process_deployment_lifecycle(lifecycle_type, sub_step)
 
     async def handle_do_route_lifecycle_if_needed(
 
@@ -67,7 +67,6 @@
     DeploymentMetadata,
     DeploymentNetworkSpec,
     DeploymentState,
-    DeploymentSubStep,
     ExecutionSpec,
     ModelDeploymentAutoScalingRuleData,
     ModelRevisionSpec,
@@ -315,9 +314,9 @@ class EndpointRow(Base):  # type: ignore[misc]
     deploying_revision: Mapped[UUID | None] = mapped_column(
         "deploying_revision", GUID, nullable=True
     )
-    sub_step: Mapped[DeploymentSubStep | None] = mapped_column(
+    sub_step: Mapped[str | None] = mapped_column(
         "sub_step",
-        StrEnumType(DeploymentSubStep),
+        sa.String,
         nullable=True,
         default=None,
     )
 
@@ -16,7 +16,6 @@
     RuntimeVariant,
     VFolderMount,
 )
-from ai.backend.manager.data.deployment.types import DeploymentSubStatus
 from ai.backend.manager.models.endpoint import EndpointLifecycle, EndpointRow
 from ai.backend.manager.repositories.base import CreatorSpec
 from ai.backend.manager.repositories.base.updater import BatchUpdaterSpec
@@ -184,7 +183,7 @@ class EndpointLifecycleBatchUpdaterSpec(BatchUpdaterSpec[EndpointRow]):
     """
 
     lifecycle_stage: EndpointLifecycle
-    sub_step: DeploymentSubStatus | None = None
+    sub_step: str | None = None
 
     @property
     @override