proposals/BEP-1049/rolling-update.md
70 additions & 22 deletions
@@ -28,7 +28,7 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Rolling Update dep

 ```
 ┌──────────────────────────────────────┐
-│ Any New routes PROVISIONING? │──Yes──→ provisioning
+│ Any New routes PROVISIONING? │──Yes──→ provisioning (wait)
 └──────────────────┬───────────────────┘
 No
 ▼
@@ -53,15 +53,38 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Rolling Update dep
 progressing
 ```

-### Sub-Step Variants
+Rollback is **not** decided by the FSM itself. If all new routes fail, the FSM will keep attempting to create new routes via the surge/unavailable calculation. Eventually the DEPLOYING timeout (30 min) is exceeded and the coordinator transitions the deployment to ROLLING_BACK via the `expired` path.

-Each cycle evaluation directly returns one of the shared sub-step variants. Completion is not a sub-step but a signal on `CycleEvaluationResult(sub_step=PROGRESSING, completed=True)`; the coordinator handles revision swap and READY transition directly.
+### Route Classification

-| Sub-Step | Condition | Handler Action |
-|----------|-----------|----------------|
-| **provisioning** | New routes are PROVISIONING | DeployingProvisioningHandler → DEPLOYING→DEPLOYING, reschedule |
-| **progressing** (`completed=True`) | No Old routes and New healthy >= desired_replicas | Coordinator → atomic revision swap + DEPLOYING→READY |
+Routes are classified by revision and status:
+
+| Category | Condition | Description |
+|----------|-----------|-------------|
+| `old_active` | revision != deploying_revision, is_active() | Old routes currently serving traffic |
+| `new_provisioning` | revision == deploying_revision, PROVISIONING | New routes being created |
+| `new_healthy` | revision == deploying_revision, HEALTHY | New routes ready to serve |
+| `new_unhealthy` | revision == deploying_revision, UNHEALTHY/DEGRADED | New routes with issues |
+| `new_failed` | revision == deploying_revision, FAILED/TERMINATED | New routes that failed |
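
Reviewer note: the classification table above can be sketched as a small pure function. This is a minimal illustration, not the code in this PR. `Route`, `RouteStatus`, `classify`, and `count_categories` are hypothetical names, and the definition of `is_active()` (anything that is not FAILED or TERMINATED) is an assumption, since the diff does not show it.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class RouteStatus(Enum):
    PROVISIONING = "provisioning"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"
    FAILED = "failed"
    TERMINATED = "terminated"


@dataclass(frozen=True)
class Route:
    """Hypothetical stand-in for a route row."""
    revision: str
    status: RouteStatus

    def is_active(self) -> bool:
        # Assumption: a route is active unless it has failed or terminated.
        return self.status not in (RouteStatus.FAILED, RouteStatus.TERMINATED)


# Category for routes that belong to the deploying (new) revision.
_NEW_CATEGORY = {
    RouteStatus.PROVISIONING: "new_provisioning",
    RouteStatus.HEALTHY: "new_healthy",
    RouteStatus.UNHEALTHY: "new_unhealthy",
    RouteStatus.DEGRADED: "new_unhealthy",
    RouteStatus.FAILED: "new_failed",
    RouteStatus.TERMINATED: "new_failed",
}


def classify(route: Route, deploying_revision: str) -> Optional[str]:
    """Return the table category for a route, or None for inactive old routes."""
    if route.revision != deploying_revision:
        return "old_active" if route.is_active() else None
    return _NEW_CATEGORY[route.status]


def count_categories(routes: Iterable[Route], deploying_revision: str) -> Counter:
    """Count routes per category, skipping routes that fall in no category."""
    counts: Counter = Counter()
    for route in routes:
        category = classify(route, deploying_revision)
        if category is not None:
            counts[category] += 1
    return counts
```

Keeping classification free of side effects would let the evaluator be unit-tested against plain lists of routes.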
+
+### Handler Flow
+
+All DEPLOYING deployments are handled by `DeployingProvisioningHandler`, which stays in the PROVISIONING sub-step throughout the entire deployment lifecycle. The handler runs the strategy evaluator each cycle:
+
+| Result | Condition | Handler Action |
+|--------|-----------|----------------|
+| **success** | Evaluator returns COMPLETED (no Old routes, New healthy >= desired) | Coordinator transitions to READY |
+| **need_retry** | Route mutations executed (create/drain) | Stays in DEPLOYING/PROVISIONING, history recorded |
+| **skipped** | No changes: routes still provisioning or waiting | No transition; coordinator checks for timeout |
+When a deployment transitions to ROLLING_BACK, the `DeployingRollingBackHandler` clears `deploying_revision` and transitions directly to READY.
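
Reviewer note: the three result buckets in the Handler Flow table can be read as a simple decision order. The sketch below is only an illustration under assumptions. `HandlerResult`, `CycleSummary`, and `bucket_for` are hypothetical names, and checking completion before mutations is an assumed ordering that the diff does not state.

```python
from dataclasses import dataclass
from enum import Enum


class HandlerResult(Enum):
    SUCCESS = "success"
    NEED_RETRY = "need_retry"
    SKIPPED = "skipped"


@dataclass(frozen=True)
class CycleSummary:
    """Hypothetical summary of one evaluator cycle."""
    old_active: int
    new_healthy: int
    desired_replicas: int
    routes_created: int = 0
    routes_drained: int = 0


def bucket_for(summary: CycleSummary) -> HandlerResult:
    """Map one cycle's outcome to the result bucket from the table."""
    # success: no old routes remain and enough new routes are healthy.
    if summary.old_active == 0 and summary.new_healthy >= summary.desired_replicas:
        return HandlerResult.SUCCESS
    # need_retry: the cycle created or drained at least one route.
    if summary.routes_created > 0 or summary.routes_drained > 0:
        return HandlerResult.NEED_RETRY
    # skipped: nothing changed, so the coordinator checks the timeout.
    return HandlerResult.SKIPPED
```

Only the `skipped` bucket feeds the timeout check, which is why a deployment that keeps mutating routes is not expired by that path in this sketch.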
+
+### Safety Guards
+
+- **Zero-downtime protection**: When `max_unavailable < desired`, never terminates ALL old routes until at least one new route is healthy
+- **Deadlock prevention**: `RollingUpdateSpec` validator ensures at least one of `max_surge` or `max_unavailable` is positive
+- **Timeout-based rollback**: The FSM does not detect failure; the coordinator's timeout mechanism handles it. If the deployment cannot complete within the DEPLOYING timeout (30 min), the coordinator transitions to ROLLING_BACK via the `expired` path
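
Reviewer note: the first two guards lend themselves to a short sketch. The code below is an assumption-laden illustration, not the PR's implementation. The dataclass is a stand-in for the real `RollingUpdateSpec` (whose validator framework is not shown in the diff), and `cap_old_terminations` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollingUpdateSpec:
    """Stand-in for the real spec; only the deadlock check is modeled."""
    max_surge: int = 1
    max_unavailable: int = 0

    def __post_init__(self) -> None:
        if self.max_surge < 0 or self.max_unavailable < 0:
            raise ValueError("max_surge and max_unavailable must be non-negative")
        # Deadlock prevention: with both at zero, no route could ever be
        # created or drained, so the rollout would never make progress.
        if self.max_surge == 0 and self.max_unavailable == 0:
            raise ValueError("at least one of max_surge or max_unavailable must be positive")


def cap_old_terminations(
    planned: int,
    *,
    old_active: int,
    new_healthy: int,
    max_unavailable: int,
    desired: int,
) -> int:
    """Limit how many old routes may be terminated in this cycle."""
    planned = max(0, min(planned, old_active))
    # Zero-downtime protection: keep one old route serving until at least
    # one new route is healthy, unless full unavailability is allowed.
    if max_unavailable < desired and new_healthy == 0:
        return min(planned, max(old_active - 1, 0))
    return planned
```

For example, with `desired = 3` and `max_unavailable = 1`, a plan to terminate all three old routes while no new route is healthy would be capped at two.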

 ## max_surge / max_unavailable Calculation

@@ -107,6 +130,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
Deploying timeout is handled through the coordinator's generic `expired` transition mechanism:
+
+1. `DeployingProvisioningHandler` declares `expired → DEPLOYING/ROLLING_BACK` in `status_transitions()`
+2. Each cycle, the coordinator checks `result.skipped` deployments against the DEPLOYING timeout (30 min)
+3. Timeout is measured using `phase_started_at` from `DeploymentWithHistory`, i.e. the `created_at` of the first scheduling history record for this handler phase
+4. `phase_started_at` is stable across retries: history records with same phase/error_code/to_status are merged (only `attempts` incremented, `created_at` unchanged)
+5. Timed-out deployments transition to DEPLOYING/ROLLING_BACK
+6. `DeployingRollingBackHandler` clears `deploying_revision` and transitions to READY
+
+No separate timeout handler or periodic task is needed; timeout checking is built into the coordinator's standard transition handling.
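
Reviewer note: steps 2 to 4 depend on `phase_started_at` staying fixed while retries are merged. The sketch below illustrates that property under stated assumptions. `HistoryRecord`, `record_attempt`, `phase_started_at`, and `is_expired` are hypothetical names, and merging only into the most recent record is an assumption about how the merge works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

DEPLOYING_TIMEOUT = timedelta(minutes=30)  # value taken from the proposal text


@dataclass
class HistoryRecord:
    """Hypothetical scheduling history row."""
    phase: str
    error_code: Optional[str]
    to_status: Optional[str]
    created_at: datetime
    attempts: int = 1


def record_attempt(
    history: List[HistoryRecord],
    phase: str,
    error_code: Optional[str],
    to_status: Optional[str],
    now: datetime,
) -> None:
    """Append a record, or merge into the last one when the key matches."""
    if history:
        last = history[-1]
        if (last.phase, last.error_code, last.to_status) == (phase, error_code, to_status):
            last.attempts += 1  # created_at is deliberately left unchanged
            return
    history.append(HistoryRecord(phase, error_code, to_status, now))


def phase_started_at(history: List[HistoryRecord], phase: str) -> Optional[datetime]:
    """created_at of the first record for the given handler phase."""
    for record in history:
        if record.phase == phase:
            return record.created_at
    return None


def is_expired(
    history: List[HistoryRecord],
    phase: str,
    now: datetime,
    timeout: timedelta = DEPLOYING_TIMEOUT,
) -> bool:
    """True once the phase has been running strictly longer than the timeout."""
    started = phase_started_at(history, phase)
    return started is not None and now - started > timeout
```

Because merged retries never touch `created_at`, the clock keeps running from the first record of the phase regardless of how many cycles are retried.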
+
 ## Component Structure

 ```
@@ -209,7 +248,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`: