170 changes: 170 additions & 0 deletions docs/proposals/20240807-in-place-updates-implementation-notes.md
@@ -86,3 +86,173 @@ sequenceDiagram
MS2 (NewMS)-->>MS Controller: Yes, M1!
MS Controller->>M1: Remove annotation ".../pending-acknowledge-move": ""
```

## Notes about in-place update implementation for KubeadmControlPlane

- In-place updates respect the existing control plane update strategy:
  - The KCP controller uses the `rollingUpdate` strategy with `maxSurge` (0 or 1)
  - When `maxSurge` is 0, no new Machines are created during rollout; Machines are only updated in-place or scaled down
  - When `maxSurge` is 1:
    - The controller first scales up by creating one new Machine to maximize fault tolerance
    - Once at `maxReplicas` (desiredReplicas + 1), it evaluates whether to update old Machines in-place or scale them down
    - For each old Machine needing rollout: if it is eligible for in-place update, it is updated in-place; otherwise it is scaled down
    - This pattern repeats until all Machines are up-to-date, then the control plane scales back to the desired replica count

- The implementation respects the existing set of responsibilities:
  - KCP controller manages control plane Machines directly
  - KCP controller enforces `maxSurge` limits during rolling updates
  - KCP controller decides when to scale up, scale down, or perform in-place updates
  - KCP controller runs preflight checks to ensure control plane is stable before in-place updates
  - KCP controller calls `CanUpdateMachine` hook to verify if extensions can handle the changes
  - When in-place update is possible, KCP controller triggers the update by writing desired state

- The in-place update decision flow:
  - If `currentReplicas < maxReplicas` (desiredReplicas + maxSurge), scale up first to maximize fault tolerance
  - If `currentReplicas >= maxReplicas`, select a Machine needing rollout and evaluate the options:
    - Check if the selected Machine is eligible for in-place update (determined by the `UpToDate` function)
    - Check if there are already enough up-to-date replicas (if `currentUpToDateReplicas >= desiredReplicas`, skip in-place update and scale down)
    - Run preflight checks to ensure control plane stability
    - Call the `CanUpdateMachine` hook on registered runtime extensions
    - If all checks pass, trigger the in-place update; otherwise, fall back to scale down/recreate
  - This flow repeats on each reconciliation until all Machines are up-to-date
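
The decision flow above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the KCP controller's code: the `RolloutState` struct, the `Action` type, and `nextAction` are hypothetical names, and the three boolean inputs stand in for the results of the `UpToDate` eligibility check, the preflight checks, and the `CanUpdateMachine` hook.

```go
package main

import "fmt"

// Action is the next rollout step chosen for the control plane (hypothetical type).
type Action string

const (
	ScaleUp       Action = "ScaleUp"
	InPlaceUpdate Action = "InPlaceUpdate"
	ScaleDown     Action = "ScaleDown"
)

// RolloutState is a hypothetical snapshot of the inputs named in the notes.
// It assumes at least one Machine still needs rollout.
type RolloutState struct {
	CurrentReplicas         int
	DesiredReplicas         int
	MaxSurge                int  // 0 or 1
	CurrentUpToDateReplicas int  // replicas that are already up-to-date
	EligibleForInPlace      bool // selected Machine is eligible for in-place update
	PreflightChecksPassed   bool // control plane is stable
	CanUpdateMachine        bool // extensions report they can handle the changes
}

// nextAction mirrors the bullet list: scale up first, then try in-place,
// and fall back to scale down when any check fails.
func nextAction(s RolloutState) Action {
	maxReplicas := s.DesiredReplicas + s.MaxSurge
	if s.CurrentReplicas < maxReplicas {
		return ScaleUp
	}
	if !s.EligibleForInPlace {
		return ScaleDown
	}
	if s.CurrentUpToDateReplicas >= s.DesiredReplicas {
		// Enough up-to-date replicas already exist: skip in-place, scale down.
		return ScaleDown
	}
	if !s.PreflightChecksPassed || !s.CanUpdateMachine {
		return ScaleDown
	}
	return InPlaceUpdate
}

func main() {
	// maxSurge=1, not yet at maxReplicas: the surge Machine is created first.
	fmt.Println(nextAction(RolloutState{CurrentReplicas: 3, DesiredReplicas: 3, MaxSurge: 1}))
	// At maxReplicas with every check passing: update in-place.
	fmt.Println(nextAction(RolloutState{
		CurrentReplicas: 4, DesiredReplicas: 3, MaxSurge: 1,
		CurrentUpToDateReplicas: 1, EligibleForInPlace: true,
		PreflightChecksPassed: true, CanUpdateMachine: true,
	}))
}
```

With `MaxSurge: 0` the first branch is never taken once `CurrentReplicas` equals `DesiredReplicas`, which matches the note that no new Machines are created in that mode.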

- Orchestration of in-place updates uses two key annotations:
  - `in-place-updates.internal.cluster.x-k8s.io/update-in-progress` - Marks Machine as undergoing in-place update
  - `runtime.cluster.x-k8s.io/pending-hooks` - Tracks pending `UpdateMachine` runtime hook

The following diagrams provide an overview of the in-place update workflow for KCP.

Workflow #1: KCP controller determines that a Machine can be updated in-place and triggers the update.

```mermaid
sequenceDiagram
autonumber
participant KCP Controller
participant RX as Runtime Extension
participant M1 as Machine
participant IM1 as InfraMachine
participant KC1 as KubeadmConfig

KCP Controller->>KCP Controller: Select Machine for rollout
KCP Controller->>KCP Controller: Run preflight checks on control plane
KCP Controller->>RX: CanUpdateMachine(current, desired)?
RX-->>KCP Controller: Yes, with patches to indicate supported changes

KCP Controller->>M1: Set annotation "update-in-progress": ""
KCP Controller->>IM1: Apply desired InfraMachine spec<br/>Set annotation "update-in-progress": ""
KCP Controller->>KC1: Apply desired KubeadmConfig spec<br/>Set annotation "update-in-progress": ""
KCP Controller->>M1: Apply desired Machine spec<br/>Set annotation "pending-hooks": "UpdateMachine"
```

Workflow #2: Machine controller detects the pending `UpdateMachine` hook and calls the runtime extension to perform the update.

```mermaid
sequenceDiagram
autonumber
participant Machine Controller
participant RX as Runtime Extension
participant M1 as Machine
participant IM1 as InfraMachine
participant KC1 as KubeadmConfig

Machine Controller-->>M1: Has "update-in-progress" and "pending-hooks: UpdateMachine"?
M1-->>Machine Controller: Yes!

Machine Controller->>RX: UpdateMachine(desired state)
RX-->>Machine Controller: Status: InProgress, RetryAfterSeconds: 30

Note over Machine Controller: Wait and retry

Machine Controller->>RX: UpdateMachine(desired state)
RX-->>Machine Controller: Status: Done

Machine Controller->>IM1: Remove annotation "update-in-progress"
Machine Controller->>KC1: Remove annotation "update-in-progress"
Machine Controller->>M1: Remove annotation "update-in-progress"<br/>Remove "UpdateMachine" from "pending-hooks"
```
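
One reconciliation pass of Workflow #2 might look roughly like the sketch below. It is a simplified model, not the Machine controller's implementation: the `UpdateMachineResponse` struct and the `reconcileInPlaceUpdate` helper are hypothetical, only the Machine's annotations are modeled (the InfraMachine and KubeadmConfig cleanup is omitted), and the `pending-hooks` value is assumed to be a comma-separated list of hook names.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const (
	updateInProgressAnnotation = "in-place-updates.internal.cluster.x-k8s.io/update-in-progress"
	pendingHooksAnnotation     = "runtime.cluster.x-k8s.io/pending-hooks"
	updateMachineHook          = "UpdateMachine"
)

// UpdateMachineResponse is a hypothetical, simplified response shape
// covering the two outcomes shown in the diagram.
type UpdateMachineResponse struct {
	Done              bool
	RetryAfterSeconds int
}

// hasHook reports whether hook is listed in the comma-separated value.
func hasHook(value, hook string) bool {
	for _, h := range strings.Split(value, ",") {
		if h == hook {
			return true
		}
	}
	return false
}

// removeHook drops one hook name from the comma-separated value.
func removeHook(value, hook string) string {
	var kept []string
	for _, h := range strings.Split(value, ",") {
		if h != "" && h != hook {
			kept = append(kept, h)
		}
	}
	return strings.Join(kept, ",")
}

// reconcileInPlaceUpdate performs one pass over the Machine annotations and
// returns how long to wait before retrying (0 means no retry is needed).
func reconcileInPlaceUpdate(machine map[string]string, callHook func() UpdateMachineResponse) time.Duration {
	_, inProgress := machine[updateInProgressAnnotation]
	if !inProgress || !hasHook(machine[pendingHooksAnnotation], updateMachineHook) {
		return 0 // nothing to do for this Machine
	}
	resp := callHook()
	if !resp.Done {
		return time.Duration(resp.RetryAfterSeconds) * time.Second
	}
	// Update finished: clear both markers so the owner can proceed.
	delete(machine, updateInProgressAnnotation)
	if rest := removeHook(machine[pendingHooksAnnotation], updateMachineHook); rest == "" {
		delete(machine, pendingHooksAnnotation)
	} else {
		machine[pendingHooksAnnotation] = rest
	}
	return 0
}

func main() {
	machine := map[string]string{
		updateInProgressAnnotation: "",
		pendingHooksAnnotation:     updateMachineHook,
	}
	calls := 0
	hook := func() UpdateMachineResponse {
		calls++
		if calls < 2 {
			return UpdateMachineResponse{RetryAfterSeconds: 30} // still in progress
		}
		return UpdateMachineResponse{Done: true}
	}
	fmt.Println("first pass, retry after:", reconcileInPlaceUpdate(machine, hook))
	fmt.Println("second pass, retry after:", reconcileInPlaceUpdate(machine, hook))
	fmt.Println("annotations left:", len(machine))
}
```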

Workflow #3: KCP controller waits for in-place update to complete before proceeding with further operations.

```mermaid
sequenceDiagram
autonumber
participant KCP Controller
participant M1 as Machine

KCP Controller-->>M1: Is in-place update in progress?
M1-->>KCP Controller: Yes! ("update-in-progress" or "pending-hooks: UpdateMachine")

Note over KCP Controller: Wait for update to complete<br/>Requeue on Machine changes

KCP Controller-->>M1: Is in-place update in progress?
M1-->>KCP Controller: No! (annotations removed)

Note over KCP Controller: Continue with next Machine rollout or other operations
```
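
The "is an in-place update in progress?" question from Workflow #3 reduces to a check on the two annotations. The helper below is a minimal sketch with a hypothetical name (`isInPlaceUpdateInProgress`); it assumes the `pending-hooks` value is a comma-separated list of hook names.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	updateInProgressAnnotation = "in-place-updates.internal.cluster.x-k8s.io/update-in-progress"
	pendingHooksAnnotation     = "runtime.cluster.x-k8s.io/pending-hooks"
)

// isInPlaceUpdateInProgress returns true if either marker is still present:
// the update-in-progress annotation, or UpdateMachine in pending-hooks.
func isInPlaceUpdateInProgress(annotations map[string]string) bool {
	if _, ok := annotations[updateInProgressAnnotation]; ok {
		return true
	}
	for _, h := range strings.Split(annotations[pendingHooksAnnotation], ",") {
		if strings.TrimSpace(h) == "UpdateMachine" {
			return true
		}
	}
	return false
}

func main() {
	updating := map[string]string{updateInProgressAnnotation: ""}
	finished := map[string]string{}
	fmt.Println(isInPlaceUpdateInProgress(updating), isInPlaceUpdateInProgress(finished))
}
```

Checking both annotations means the KCP controller keeps waiting for the whole time between triggering the update and the Machine controller's final cleanup.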

## Notes about managedFields refactoring for in-place updates (KCP/MS)

To enable correct in-place updates of BootstrapConfigs and InfraMachines, CAPI v1.12 introduced a refactored managedFields structure. This change was necessary because:

- **Previously** (CAPI <= v1.11): BootstrapConfigs/InfraMachines were only created, never updated
- **Now** (CAPI >= v1.12): BootstrapConfigs/InfraMachines need to be updated during in-place updates using Server-Side Apply (SSA)
- **Why SSA**: Required for proper handling of co-ownership of fields and to enable unsetting fields during updates

### Two field managers approach

The refactoring uses **two separate field managers** to enable different responsibilities:

1. **Metadata manager** (`capi-kubeadmcontrolplane-metadata` / `capi-machineset-metadata`):
   - Continuously syncs labels and annotations
   - Updates on every reconciliation via `syncMachines`

2. **Spec manager** (`capi-kubeadmcontrolplane` / `capi-machineset`):
   - Manages the spec and in-place update specific annotations
   - Updates only when creating objects or triggering in-place updates

### ManagedFields structure comparison

**CAPI <= v1.11** (legacy):
- Machine:
  - spec + labels + annotations => `capi-kubeadmcontrolplane` / `capi-machineset` (Apply)
- BootstrapConfig / InfraMachine:
  - labels + annotations => `capi-kubeadmcontrolplane` / `capi-machineset` (Apply)
  - spec => `manager` (Update)

**CAPI >= v1.12** (new):
- Machine (unchanged):
  - spec + labels + annotations => `capi-kubeadmcontrolplane` / `capi-machineset` (Apply)
- BootstrapConfig / InfraMachine:
  - labels + annotations => `capi-kubeadmcontrolplane-metadata` / `capi-machineset-metadata` (Apply)
  - spec => `capi-kubeadmcontrolplane` / `capi-machineset` (Apply)

### Object creation workflow (CAPI >= v1.12)

When creating new BootstrapConfig/InfraMachine:

1. **Initial creation**:
   - Apply BootstrapConfig/InfraMachine with spec (manager: `capi-kubeadmcontrolplane` / `capi-machineset`)
   - Remove managedFields for labels + annotations
   - Result: labels/annotations are orphaned (no field manager owns them), spec is owned

2. **First syncMachines call** (happens immediately after):
   - Apply labels + annotations (manager: `capi-kubeadmcontrolplane-metadata` / `capi-machineset-metadata`)
   - Result: the final desired managedFields structure is established

3. **Ready for operations**:
   - Continuous `syncMachines` calls update labels/annotations without affecting spec
   - In-place updates can now properly update spec fields and unset fields as needed

### In-place update object modifications

When triggering in-place updates:

1. Apply BootstrapConfig/InfraMachine with:
   - Updated spec (owned by spec manager)
   - `update-in-progress` annotation (owned by spec manager)
   - For InfraMachine: `cloned-from` annotations (owned by spec manager)

2. Result after in-place update trigger:
   - labels + annotations => metadata manager
   - spec => spec manager
   - in-progress / cloned-from annotations => spec manager
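
To make the ownership split concrete, here is a toy model of how two field managers interact. It is not Server-Side Apply itself: real SSA also handles co-ownership and conflicts, which this sketch ignores. The `Object` type, its `Apply` method, and the field paths (`spec.instanceType`, `spec.extraDisk`, `metadata.labels.cluster`) are hypothetical. It models only the behavior the notes rely on: a manager owns the fields it applies, and a field it previously applied but now omits is unset.

```go
package main

import (
	"fmt"
	"sort"
)

// Object is a toy stand-in for a BootstrapConfig/InfraMachine:
// field path -> value, and field path -> owning field manager.
type Object struct {
	Fields map[string]string
	Owners map[string]string
}

func NewObject() *Object {
	return &Object{Fields: map[string]string{}, Owners: map[string]string{}}
}

// Apply mimics one aspect of Server-Side Apply: the manager takes ownership
// of every field it sends, and fields it owned before but no longer sends
// are removed. Fields owned by other managers are left untouched.
func (o *Object) Apply(manager string, fields map[string]string) {
	for path, owner := range o.Owners {
		if owner != manager {
			continue
		}
		if _, stillSent := fields[path]; !stillSent {
			delete(o.Fields, path)
			delete(o.Owners, path)
		}
	}
	for path, value := range fields {
		o.Fields[path] = value
		o.Owners[path] = manager
	}
}

func main() {
	const specMgr, metaMgr = "capi-kubeadmcontrolplane", "capi-kubeadmcontrolplane-metadata"
	infra := NewObject()

	// Initial creation: the spec manager ends up owning only the spec.
	infra.Apply(specMgr, map[string]string{"spec.instanceType": "small", "spec.extraDisk": "true"})
	// First syncMachines call: the metadata manager takes labels/annotations.
	infra.Apply(metaMgr, map[string]string{"metadata.labels.cluster": "c1"})
	// In-place update trigger: new spec (extraDisk dropped) plus the marker annotation.
	infra.Apply(specMgr, map[string]string{
		"spec.instanceType":                       "large",
		"metadata.annotations.update-in-progress": "",
	})

	paths := make([]string, 0, len(infra.Owners))
	for p := range infra.Owners {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	for _, p := range paths {
		fmt.Printf("%s => %s\n", p, infra.Owners[p])
	}
}
```

In this model the dropped `spec.extraDisk` field disappears on the third apply while the label owned by the metadata manager survives, which is the separation the two-manager design is meant to provide.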