
Adding Capacity Reservation during VM Resize#4707

Open
rh-returners wants to merge 4 commits into master from mmancebo/ARO-23134

Conversation

@rh-returners
Collaborator

@rh-returners rh-returners commented Mar 24, 2026

  • Investigate CI failure in generate-check job
  • Root cause: armcompute.go was regenerated differently by CI's Go 1.25 vs what was committed (generated with Go 1.24)
  • Fix: use source mode (-source ./file.go) for new interfaces, keeping reflection mode only for ResourceSKUsClient (unchanged baseline)
  • Revert armcompute.go to original single-interface format
  • Add separate mock files per new interface (source-mode generated, deterministic across Go versions)
  • All tests pass
  • Code compiles successfully
  • Code review passed


@rh-returners
Collaborator Author

@rh-mmancebo please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree

@tuxerrante
Collaborator

@copilot review

Switched from reflection mode to source mode (-source) for the new
CapacityReservationGroups, CapacityReservations, and VirtualMachines
mock clients. This ensures deterministic mock generation across Go
versions (reflection mode requires Go 1.25 which is only available
in CI, not locally).

Changes:
- generate.go: revert to ResourceSKUsClient-only reflection mode;
  add separate source-mode generate directives for the 3 new clients
- armcompute.go: revert to original single-interface version
- Add new separate mock files per interface (source-mode generated)

Co-authored-by: tuxerrante <8364469+tuxerrante@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/ARO-RP/sessions/7f872d11-3b82-4401-bed7-8af1e5550ea9

Copilot AI commented Mar 25, 2026

@copilot review

Reviewed and addressed the CI failure in the generate-check job (commit ee0ecbd).

Root cause: The new armcompute.go mock (with 4 interfaces) was generated using reflection mode, which requires Go 1.25 (from go.mod). Locally with Go 1.24, I couldn't regenerate it identically to what CI would produce with Go 1.25, causing the isClean.sh check to fail.

Fix: Switched to source mode (-source ./file.go) for the 3 new client mocks (CapacityReservationGroups, CapacityReservations, VirtualMachines). Source mode parses the Go AST directly and is deterministic across Go versions. Each interface now has its own separate mock file:

  • capacityreservationgroups.go
  • capacityreservations.go
  • virtualmachines.go

The original armcompute.go (ResourceSKUsClient only, reflection mode) is reverted to its pre-PR state. All tests pass.
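For illustration, the resulting go:generate directives might look roughly like this (destination paths and import paths are guesses at the layout, not the actual ARO-RP generate.go):

```go
// Reflection mode, unchanged baseline: kept only for ResourceSKUsClient,
// whose generated output predates the Go 1.25 bump.
//go:generate mockgen -destination=../../../../util/mocks/$GOPACKAGE/armcompute.go github.com/Azure/ARO-RP/pkg/util/azureclient/azuresdk/armcompute ResourceSKUsClient

// Source mode for the three new clients: mockgen parses the Go AST directly,
// so the output does not depend on the local Go toolchain version.
//go:generate mockgen -source ./capacityreservationgroups.go -destination=../../../../util/mocks/$GOPACKAGE/capacityreservationgroups.go
//go:generate mockgen -source ./capacityreservations.go -destination=../../../../util/mocks/$GOPACKAGE/capacityreservations.go
//go:generate mockgen -source ./virtualmachines.go -destination=../../../../util/mocks/$GOPACKAGE/virtualmachines.go
```

The `-source` and `-destination` flags are standard mockgen options; only the file paths above are illustrative.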

for _, vm := range masterVMs {
	zone := vmZone(vm)
	crName := fmt.Sprintf(targetReservationNameFmt, zone)
	err = a.armCapacityReservations.CreateOrUpdateAndWait(ctx, clusterRG, capacityReservationGroupName, crName,
Collaborator

Potential retry/idempotency issue here: if a zone is already on the target SKU (e.g. rerun after partial success), step 3a creates cr-current-z{zone} with that SKU and step 3b creates cr-target-z{zone} with the same SKU in the same zone. Azure CRG semantics allow only one reservation per VM size per zone, so this can fail before resize starts.

Could we skip/merge target reservation creation when zoneCurrentSKU[zone] == targetVMSize?


	a.log.Warnf("cleanup after target-SKU reservation failure also failed: %v", cleanupErr)
}
return fmt.Errorf(
	"target SKU %s has insufficient capacity in zone %s — consider choosing a different VM family: %w",
Collaborator

This message currently classifies every failure here as insufficient capacity, but CreateOrUpdate can also fail for unsupported reservation SKU/series, quota, RBAC, invalid params, etc. Could we preserve the concrete ARM reason and only add the capacity hint when the backend error actually indicates capacity exhaustion?



// Step 6: success — disassociate VMs and delete all reservation resources.
// Errors are returned: lingering reservations incur ongoing Azure costs.
a.log.Info("resize complete, cleaning up capacity reservation resources")
Collaborator

Inline cleanup here is very useful, but there is still a failure mode if the process is interrupted (network/process crash) after CRG/CR creation and before this block executes. Since these resources are billable while present, would it make sense to add a separate recover/cleanup action that can be triggered independently when a run aborts unexpectedly?


@@ -80,6 +81,20 @@ func TestAdminVMResize(t *testing.T) {
},
Collaborator

Could we add focused unit tests for VMResizeWithCapacityReservation itself (in pkg/frontend/adminactions) in addition to this routing-level test? The high-risk paths seem to be: mixed-SKU zone retries, reservation-create failures, and cleanup ordering when association/resize/start fail.
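A table-driven skeleton for those paths might look like this (function name taken from the comment above; case bodies left to the implementer):

```go
package adminactions

import "testing"

// Skeleton of focused unit tests for VMResizeWithCapacityReservation; each
// case would drive the CRG/CR/VM mocks through one high-risk path.
func TestVMResizeWithCapacityReservation(t *testing.T) {
	for _, tt := range []struct {
		name    string
		wantErr string
	}{
		{name: "mixed-SKU zone rerun after partial success"},
		{name: "target reservation create fails, CRG cleaned up"},
		{name: "resize fails mid-loop, reservations still deleted"},
		{name: "VM start fails after resize, error surfaced"},
	} {
		t.Run(tt.name, func(t *testing.T) {
			// arrange mock expectations, call the action,
			// assert the returned error and recorded cleanup calls
		})
	}
}
```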


// Step 2: create the Capacity Reservation Group with the same zones as the master VMs.
// Azure requires the CRG to declare all zones it will serve.
a.log.Infof("creating capacity reservation group %q in zones %v", capacityReservationGroupName, zones)
crg, err := a.armCapacityReservationGroups.CreateOrUpdate(ctx, clusterRG, capacityReservationGroupName,
Collaborator

Can we add explicit handling/remediation for authorization failures on CRG/CR operations (AuthorizationFailed / missing Microsoft.Compute/capacityReservationGroups/* actions)? This path will run against customer RGs and permission gaps here would currently surface as a generic create error.

Possible follow-up: preflight auth probe + actionable error message before starting resize.


// 3. Create current-SKU and target-SKU reservations per zone before touching any VM.
// If target capacity is unavailable the rollback is simple: delete reservations + CRG.
// 4. Associate all master VMs with the CRG.
// 5. Resize each VM one at a time (deallocate → resize → start) to preserve quorum.
Collaborator

We should not do this from this action. When an entire control plane is resized, several discrete operations are performed against the machines and VMs in sequential order. The current process is, machine per machine:

  1. Cordon a node (there is a specific action and code for that).
  2. Drain a node (also an existing action and code).
  3. Stop a VM with/without deallocation (again, there is already an existing action and code for this).
  4. Resize that specific VM (this is what this PR should support with reservations).
  5. Start the resized VM (also an existing action and code).
  6. Uncordon the node (also an existing action and code).

From what I've read in this PR (and this comment is a good point to state this), the code here is being added to the Resize step and changes all machines at once. This will not have the intended effect: the overall process performs all of the above steps on every single machine of the cluster, so the result would be a total resize of the cluster three times.

As we are planning to create a "meta admin action for resize" that calls the different functions involved in a resize, I would suggest changing this PR to something that supports this philosophy by creating support functions such as:

  • a function to create the CRG, that will be called before the overall start of the operation,
  • a function for the target reservations
  • a new resize operation that uses the above for a single VM, associates it with the CRG and performs the resize
  • a function to remove all VMs from the CRG

Those will be called from the meta function. This also saves work here (for example, we already have code covering the lookup of master nodes). The resize action should not be concerned with finding the appropriate machines; that is handled in another part of the process.
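The decomposition proposed above might look like the following interface sketch (all names are hypothetical, intended only to illustrate the split):

```go
package adminactions

import "context"

// Sketch of per-step support functions, orchestrated by the planned
// "meta admin action for resize" rather than by the resize action itself.
type capacityReservationSupport interface {
	// Called once before the overall operation starts.
	EnsureCapacityReservationGroup(ctx context.Context, clusterRG string, zones []string) error
	// Called per zone to reserve the target SKU up front.
	EnsureTargetReservation(ctx context.Context, clusterRG, zone, targetVMSize string) error
	// Resizes a single VM: associate it with the CRG, then resize.
	ResizeVMWithReservation(ctx context.Context, clusterRG, vmName, targetVMSize string) error
	// Called once after the operation: disassociate all VMs and delete
	// the CRG and its reservations.
	ReleaseCapacityReservations(ctx context.Context, clusterRG string) error
}
```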
