Adding Capacity Reservation during VM Resize #4707
rh-returners wants to merge 4 commits into master from
Conversation
@microsoft-github-policy-service agree

@copilot review
Switched from reflection mode to source mode (`-source`) for the new CapacityReservationGroups, CapacityReservations, and VirtualMachines mock clients. This ensures deterministic mock generation across Go versions (reflection mode requires Go 1.25, which is only available in CI, not locally).

Changes:
- generate.go: revert to ResourceSKUsClient-only reflection mode; add separate source-mode generate directives for the 3 new clients
- armcompute.go: revert to original single-interface version
- Add new separate mock files per interface (source-mode generated)

Co-authored-by: tuxerrante <8364469+tuxerrante@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/ARO-RP/sessions/7f872d11-3b82-4401-bed7-8af1e5550ea9
Reviewed and addressed the CI failure.

Root cause: the new `armcompute.go` was regenerated differently by CI's Go 1.25 vs what was committed (generated with Go 1.24).

Fix: switched to source mode (`-source ./file.go`) for the new interfaces, keeping reflection mode only for `ResourceSKUsClient` (unchanged baseline), and reverted `armcompute.go` to the original single-interface format.
```go
for _, vm := range masterVMs {
	zone := vmZone(vm)
	crName := fmt.Sprintf(targetReservationNameFmt, zone)
	err = a.armCapacityReservations.CreateOrUpdateAndWait(ctx, clusterRG, capacityReservationGroupName, crName,
```
Potential retry/idempotency issue here: if a zone is already on the target SKU (e.g. rerun after partial success), step 3a creates cr-current-z{zone} with that SKU and step 3b creates cr-target-z{zone} with the same SKU in the same zone. Azure CRG semantics allow only one reservation per VM size per zone, so this can fail before resize starts.
Could we skip/merge target reservation creation when zoneCurrentSKU[zone] == targetVMSize?
Refs:
- Azure docs: Work with capacity reservation
- Current-SKU create path:
ARO-RP/pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Lines 91 to 98 in ee0ecbd
- Target-SKU create path:
ARO-RP/pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Lines 113 to 118 in ee0ecbd
- PR: Adding Capacity Reservation during VM Resize #4707
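The suggested guard could look like this minimal sketch. The function and map names (`reservationsForZone`, mirroring the comment's `zoneCurrentSKU`/`targetVMSize`) are illustrative, not the PR's actual identifiers:

```go
package main

import "fmt"

// reservationsForZone returns the distinct SKUs that need a reservation in a
// zone. When the zone's current SKU already equals the target SKU, a single
// reservation suffices; creating a second one for the same VM size would
// violate the one-reservation-per-size-per-zone rule of a CRG.
func reservationsForZone(currentSKU, targetSKU string) []string {
	if currentSKU == targetSKU {
		return []string{currentSKU}
	}
	return []string{currentSKU, targetSKU}
}

func main() {
	// mixed zone: both SKUs need a reservation
	fmt.Println(reservationsForZone("Standard_D8s_v5", "Standard_D16s_v5"))
	// rerun after partial success: the zone is already on the target SKU
	fmt.Println(reservationsForZone("Standard_D16s_v5", "Standard_D16s_v5"))
}
```

This keeps the retry path idempotent: step 3b simply becomes a no-op for zones already on the target size.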
```go
	a.log.Warnf("cleanup after target-SKU reservation failure also failed: %v", cleanupErr)
}
return fmt.Errorf(
	"target SKU %s has insufficient capacity in zone %s — consider choosing a different VM family: %w",
```
This message currently classifies every failure here as insufficient capacity, but CreateOrUpdate can also fail for unsupported reservation SKU/series, quota, RBAC, invalid params, etc. Could we preserve the concrete ARM reason and only add the capacity hint when the backend error actually indicates capacity exhaustion?
Refs:
- Azure limitations/restrictions: https://learn.microsoft.com/en-us/azure/virtual-machines/capacity-reservation-overview#limitations-and-restrictions
- Current message:
ARO-RP/pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Lines 123 to 125 in ee0ecbd
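A hedged sketch of that classification, using a locally defined error type to stand in for the SDK's `*azcore.ResponseError` (which carries a similar `ErrorCode` field); the list of capacity-related error codes is an assumption based on commonly seen allocation failures, not an authoritative set:

```go
package main

import (
	"errors"
	"fmt"
)

// armError stands in for the SDK's response error so the sketch is self-contained.
type armError struct {
	Code    string
	Message string
}

func (e *armError) Error() string { return e.Code + ": " + e.Message }

// isCapacityExhausted reports whether the ARM error actually indicates
// capacity exhaustion, as opposed to quota, RBAC, or invalid-parameter failures.
func isCapacityExhausted(err error) bool {
	var ae *armError
	if !errors.As(err, &ae) {
		return false
	}
	switch ae.Code {
	case "SkuNotAvailable", "AllocationFailed", "ZonalAllocationFailed", "OverconstrainedAllocationRequest":
		return true
	}
	return false
}

// wrapReservationError preserves the concrete ARM reason and only adds the
// capacity hint when the backend error warrants it.
func wrapReservationError(err error, sku, zone string) error {
	if isCapacityExhausted(err) {
		return fmt.Errorf("target SKU %s has insufficient capacity in zone %s, consider choosing a different VM family: %w", sku, zone, err)
	}
	return fmt.Errorf("creating reservation for SKU %s in zone %s: %w", sku, zone, err)
}

func main() {
	capErr := &armError{Code: "ZonalAllocationFailed", Message: "no capacity"}
	authErr := &armError{Code: "AuthorizationFailed", Message: "missing role"}
	fmt.Println(wrapReservationError(capErr, "Standard_D16s_v5", "2"))
	fmt.Println(wrapReservationError(authErr, "Standard_D16s_v5", "2"))
}
```

Either way the `%w` wrap keeps the concrete ARM reason visible to the caller.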
```go
// Step 6: success — disassociate VMs and delete all reservation resources.
// Errors are returned: lingering reservations incur ongoing Azure costs.
a.log.Info("resize complete, cleaning up capacity reservation resources")
```
Inline cleanup here is very useful, but there is still a failure mode if the process is interrupted (network/process crash) after CRG/CR creation and before this block executes. Since these resources are billable while present, would it make sense to add a separate recover/cleanup action that can be triggered independently when a run aborts unexpectedly?
Refs:
- Pricing/billing semantics: https://learn.microsoft.com/en-us/azure/virtual-machines/capacity-reservation-overview#pricing-and-billing
- Cleanup entry point:
ARO-RP/pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Lines 181 to 186 in ee0ecbd
- Deterministic resource names:
ARO-RP/pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Lines 18 to 22 in ee0ecbd
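Because the reservation names are deterministic, a standalone recover action could enumerate and delete them without needing any state from the aborted run. A sketch, with illustrative name formats (the real constants live at the top of azureactions_vmresize_capacity.go and may differ):

```go
package main

import "fmt"

// Assumed name formats; the real constants are defined in
// azureactions_vmresize_capacity.go.
const (
	currentReservationFmt = "cr-current-z%s"
	targetReservationFmt  = "cr-target-z%s"
)

// orphanedReservationNames lists every reservation a previous aborted run
// could have left behind: the names are derivable from the cluster's zones
// alone, so a recover action can be triggered independently at any time.
func orphanedReservationNames(zones []string) []string {
	var names []string
	for _, z := range zones {
		names = append(names,
			fmt.Sprintf(currentReservationFmt, z),
			fmt.Sprintf(targetReservationFmt, z))
	}
	return names
}

func main() {
	fmt.Println(orphanedReservationNames([]string{"1", "2", "3"}))
}
```

The recover action would delete each named reservation (ignoring not-found errors) and then the CRG itself.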
```diff
@@ -80,6 +81,20 @@ func TestAdminVMResize(t *testing.T) {
 	},
```
Could we add focused unit tests for VMResizeWithCapacityReservation itself (in pkg/frontend/adminactions) in addition to this routing-level test? The high-risk paths seem to be: mixed-SKU zone retries, reservation-create failures, and cleanup ordering when association/resize/start fail.
Refs:
- New case in this file:
ARO-RP/pkg/frontend/admin_openshiftcluster_vmresize_test.go
Lines 81 to 95 in ee0ecbd
- New orchestration implementation: https://github.com/Azure/ARO-RP/blob/ee0ecbdcd63657cc725670aef2f5a1751264d985/pkg/frontend/adminactions/azureactions_vmresize_capacity.go
```go
// Step 2: create the Capacity Reservation Group with the same zones as the master VMs.
// Azure requires the CRG to declare all zones it will serve.
a.log.Infof("creating capacity reservation group %q in zones %v", capacityReservationGroupName, zones)
crg, err := a.armCapacityReservationGroups.CreateOrUpdate(ctx, clusterRG, capacityReservationGroupName,
```
Can we add explicit handling/remediation for authorization failures on CRG/CR operations (AuthorizationFailed / missing Microsoft.Compute/capacityReservationGroups/* actions)? This path will run against customer RGs and permission gaps here would currently surface as a generic create error.
Possible follow-up: preflight auth probe + actionable error message before starting resize.
Refs:
- CRG create call:
ARO-RP/pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Lines 65 to 69 in ee0ecbd
- PR: Adding Capacity Reservation during VM Resize #4707
```go
// 3. Create current-SKU and target-SKU reservations per zone before touching any VM.
//    If target capacity is unavailable the rollback is simple: delete reservations + CRG.
// 4. Associate all master VMs with the CRG.
// 5. Resize each VM one at a time (deallocate → resize → start) to preserve quorum.
```
We should not do this from this action. When an entire control plane is resized, several discrete operations are performed against the machines and VMs in sequential order. The current process is, machine by machine:
- Cordon the node (there is a specific action and code for that).
- Drain the node (also an existing action and code).
- Stop the VM with or without deallocation (again, there is already existing code and an action for this).
- Resize that specific VM (this is what this PR should support with reservations).
- Start the resized VM (also an existing action and code).
- Uncordon the node (also an existing action and code).

From what I've read in this PR (and this comment is a good place to state this), the code here is being added only to the resize step, yet it changes all machines at once. This will not have the intended effect: the overall process performs all of the above steps on every single machine of the cluster, so the result would be a total resize of the cluster performed three times.

As we are planning to create a "meta admin action for resize" that calls the different functions involved in a resize, I would suggest changing this PR to support that philosophy by creating support functions such as:
- a function to create the CRG, to be called before the overall start of the operation,
- a function for the target reservations,
- a new resize operation that uses the above for a single VM, associates it to the CRG and performs the resize,
- a function to remove all VMs from the CRG.

Those would be called from the meta function. This also saves work here (for example, looking up the master nodes; we already have code covering that). The resize action should not be concerned with finding the appropriate machines, as this is handled in another part of the process.