
Adding Capacity Reservation during VM Resize#4707

Open
rh-returners wants to merge 4 commits into master from mmancebo/ARO-23134

Conversation

@rh-returners
Collaborator

@rh-returners rh-returners commented Mar 24, 2026

  • Investigate CI failure in generate-check job
  • Root cause: armcompute.go was regenerated differently by CI's Go 1.25 vs what was committed (generated with Go 1.24)
  • Fix: use source mode (-source ./file.go) for new interfaces, keeping reflection mode only for ResourceSKUsClient (unchanged baseline)
  • Revert armcompute.go to original single-interface format
  • Add separate mock files per new interface (source-mode generated, deterministic across Go versions)
  • All tests pass
  • Code compiles successfully
  • Code review passed


@rh-returners
Collaborator Author

@rh-mmancebo please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree

@tuxerrante
Collaborator

@copilot review

Switched from reflection mode to source mode (-source) for the new
CapacityReservationGroups, CapacityReservations, and VirtualMachines
mock clients. This ensures deterministic mock generation across Go
versions (reflection mode requires Go 1.25 which is only available
in CI, not locally).

Changes:
- generate.go: revert to ResourceSKUsClient-only reflection mode;
  add separate source-mode generate directives for the 3 new clients
- armcompute.go: revert to original single-interface version
- Add new separate mock files per interface (source-mode generated)

Co-authored-by: tuxerrante <8364469+tuxerrante@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/ARO-RP/sessions/7f872d11-3b82-4401-bed7-8af1e5550ea9

Copilot AI commented Mar 25, 2026

@copilot review

Reviewed and addressed the CI failure in the generate-check job (commit ee0ecbd).

Root cause: The new armcompute.go mock (with 4 interfaces) was generated using reflection mode, which requires Go 1.25 (from go.mod). Locally with Go 1.24, I couldn't regenerate it identically to what CI would produce with Go 1.25, causing the isClean.sh check to fail.

Fix: Switched to source mode (-source ./file.go) for the 3 new client mocks (CapacityReservationGroups, CapacityReservations, VirtualMachines). Source mode parses the Go AST directly and is deterministic across Go versions. Each interface now has its own separate mock file:

  • capacityreservationgroups.go
  • capacityreservations.go
  • virtualmachines.go

The original armcompute.go (ResourceSKUsClient only, reflection mode) is reverted to its pre-PR state. All tests pass.
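For illustration, the resulting go:generate directives might look roughly like this (destination paths and import paths are guesses at the layout, not the actual ARO-RP generate.go):

```go
// Reflection mode, unchanged baseline: kept only for ResourceSKUsClient,
// whose generated output predates the Go 1.25 bump.
//go:generate mockgen -destination=../../../../util/mocks/$GOPACKAGE/armcompute.go github.com/Azure/ARO-RP/pkg/util/azureclient/azuresdk/armcompute ResourceSKUsClient

// Source mode for the three new clients: mockgen parses the Go AST directly,
// so the output does not depend on the local Go toolchain version.
//go:generate mockgen -source ./capacityreservationgroups.go -destination=../../../../util/mocks/$GOPACKAGE/capacityreservationgroups.go
//go:generate mockgen -source ./capacityreservations.go -destination=../../../../util/mocks/$GOPACKAGE/capacityreservations.go
//go:generate mockgen -source ./virtualmachines.go -destination=../../../../util/mocks/$GOPACKAGE/virtualmachines.go
```

The `-source` and `-destination` flags are standard mockgen options; only the file paths above are illustrative.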

for _, vm := range masterVMs {
	zone := vmZone(vm)
	crName := fmt.Sprintf(targetReservationNameFmt, zone)
	err = a.armCapacityReservations.CreateOrUpdateAndWait(ctx, clusterRG, capacityReservationGroupName, crName,
Collaborator

Potential retry/idempotency issue here: if a zone is already on the target SKU (e.g. rerun after partial success), step 3a creates cr-current-z{zone} with that SKU and step 3b creates cr-target-z{zone} with the same SKU in the same zone. Azure CRG semantics allow only one reservation per VM size per zone, so this can fail before resize starts.

Could we skip/merge target reservation creation when zoneCurrentSKU[zone] == targetVMSize?


	a.log.Warnf("cleanup after target-SKU reservation failure also failed: %v", cleanupErr)
}
return fmt.Errorf(
	"target SKU %s has insufficient capacity in zone %s — consider choosing a different VM family: %w",
Collaborator

This message currently classifies every failure here as insufficient capacity, but CreateOrUpdate can also fail for unsupported reservation SKU/series, quota, RBAC, invalid params, etc. Could we preserve the concrete ARM reason and only add the capacity hint when the backend error actually indicates capacity exhaustion?



// Step 6: success — disassociate VMs and delete all reservation resources.
// Errors are returned: lingering reservations incur ongoing Azure costs.
a.log.Info("resize complete, cleaning up capacity reservation resources")
Collaborator

Inline cleanup here is very useful, but there is still a failure mode if the process is interrupted (network/process crash) after CRG/CR creation and before this block executes. Since these resources are billable while present, would it make sense to add a separate recover/cleanup action that can be triggered independently when a run aborts unexpectedly?


@@ -80,6 +81,20 @@ func TestAdminVMResize(t *testing.T) {
},
Collaborator

Could we add focused unit tests for VMResizeWithCapacityReservation itself (in pkg/frontend/adminactions) in addition to this routing-level test? The high-risk paths seem to be: mixed-SKU zone retries, reservation-create failures, and cleanup ordering when association/resize/start fail.
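A table-driven skeleton for those paths might look like this (function name taken from the comment above; case bodies left to the implementer):

```go
package adminactions

import "testing"

// Skeleton of focused unit tests for VMResizeWithCapacityReservation; each
// case would drive the CRG/CR/VM mocks through one high-risk path.
func TestVMResizeWithCapacityReservation(t *testing.T) {
	for _, tt := range []struct {
		name    string
		wantErr string
	}{
		{name: "mixed-SKU zone rerun after partial success"},
		{name: "target reservation create fails, CRG cleaned up"},
		{name: "resize fails mid-loop, reservations still deleted"},
		{name: "VM start fails after resize, error surfaced"},
	} {
		t.Run(tt.name, func(t *testing.T) {
			// arrange mock expectations, call the action,
			// assert the returned error and recorded cleanup calls
		})
	}
}
```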


// Step 2: create the Capacity Reservation Group with the same zones as the master VMs.
// Azure requires the CRG to declare all zones it will serve.
a.log.Infof("creating capacity reservation group %q in zones %v", capacityReservationGroupName, zones)
crg, err := a.armCapacityReservationGroups.CreateOrUpdate(ctx, clusterRG, capacityReservationGroupName,
Collaborator

Can we add explicit handling/remediation for authorization failures on CRG/CR operations (AuthorizationFailed / missing Microsoft.Compute/capacityReservationGroups/* actions)? This path will run against customer RGs and permission gaps here would currently surface as a generic create error.

Possible follow-up: preflight auth probe + actionable error message before starting resize.


// 3. Create current-SKU and target-SKU reservations per zone before touching any VM.
// If target capacity is unavailable the rollback is simple: delete reservations + CRG.
// 4. Associate all master VMs with the CRG.
// 5. Resize each VM one at a time (deallocate → resize → start) to preserve quorum.
Collaborator

We should not do this from this action. When an entire control plane is resized, several discrete operations are performed against the machines and VMs in sequential order. The current process is, machine per machine:

  1. Cordon a node (there is a specific action and code for that).
  2. Drain a node (also an existing action and code).
  3. Stop a VM with/without deallocation (again, there is already an existing action and code for this).
  4. Resize that specific VM (this is what this PR should support with reservations).
  5. Start the resized VM (also an existing action and code).
  6. Uncordon the node (also an existing action and code).

From what I've read in this PR (and this comment is a good point to state this), the code here is being added to the Resize step and changes all machines at once. This will not have the intended effect: the overall process performs all of the above steps on every single machine of the cluster, so the result would be a total resize of the cluster three times.

As we are planning to create a "meta admin action for resize" that calls the different functions involved in a resize, I would suggest changing this PR to something that supports this philosophy by creating support functions such as:

  • a function to create the CRG, that will be called before the overall start of the operation,
  • a function for the target reservations
  • a new resize operation that uses the above for a single VM, associates it with the CRG and performs the resize
  • a function to remove all VMs from the CRG

Those will be called from the meta function. This also saves work here (for example, we already have code covering the lookup of master nodes). The resize action should not be concerned with finding the appropriate machines; that is handled in another part of the process.
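The decomposition proposed above might look like the following interface sketch (all names are hypothetical, intended only to illustrate the split):

```go
package adminactions

import "context"

// Sketch of per-step support functions, orchestrated by the planned
// "meta admin action for resize" rather than by the resize action itself.
type capacityReservationSupport interface {
	// Called once before the overall operation starts.
	EnsureCapacityReservationGroup(ctx context.Context, clusterRG string, zones []string) error
	// Called per zone to reserve the target SKU up front.
	EnsureTargetReservation(ctx context.Context, clusterRG, zone, targetVMSize string) error
	// Resizes a single VM: associate it with the CRG, then resize.
	ResizeVMWithReservation(ctx context.Context, clusterRG, vmName, targetVMSize string) error
	// Called once after the operation: disassociate all VMs and delete
	// the CRG and its reservations.
	ReleaseCapacityReservations(ctx context.Context, clusterRG string) error
}
```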
