Conversation

@landreasyan
Contributor

@landreasyan landreasyan commented Aug 1, 2025

What type of PR is this?
/kind bug

What this PR does / why we need it:

When the node disk attach limit is reached, the driver should return ResourceExhausted errors.

Which issue(s) this PR fixes:

Fixes #

Requirements:

Special notes for your reviewer:

Release note:

none

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 1, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: landreasyan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 1, 2025
@landreasyan landreasyan force-pushed the batching-fix-disk-limit-processing branch from d69464a to 81f5d4e on August 1, 2025 23:28
@nearora-msft
Contributor

Does this fix any issues, or is this just to ensure that we return the right error to K8s?

@landreasyan
Contributor Author

landreasyan commented Aug 2, 2025

Does this fix any issues, or is this just to ensure that we return the right error to K8s?

It fixes the logic for removing disks queued for batching. Disks that cannot be processed in the current batch are left in the queue, to be processed with the next request if there is capacity; otherwise, the right error is reported.
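
To illustrate the intended queue behavior, here is a minimal sketch (the `AttachDiskOptions` stand-in and the `takeBatch` helper are illustrative, not the driver's actual code):

```go
package main

import "fmt"

// AttachDiskOptions is an illustrative stand-in for provider.AttachDiskOptions.
type AttachDiskOptions struct{ DiskName string }

// takeBatch (hypothetical) moves at most capacity disks out of the queued
// batch and reports what is left in the queue for the next request.
func takeBatch(queued map[string]*AttachDiskOptions, capacity int) (process, remaining map[string]*AttachDiskOptions) {
	process = make(map[string]*AttachDiskOptions, capacity)
	remaining = make(map[string]*AttachDiskOptions)
	for diskURI, opts := range queued {
		if len(process) < capacity {
			process[diskURI] = opts // room on the node: attach in this batch
		} else {
			remaining[diskURI] = opts // stays queued for the next request
		}
	}
	return process, remaining
}

func main() {
	queued := map[string]*AttachDiskOptions{
		"disk-1": {DiskName: "disk-1"},
		"disk-2": {DiskName: "disk-2"},
		"disk-3": {DiskName: "disk-3"},
	}
	process, remaining := takeBatch(queued, 2)
	fmt.Printf("%d processed, %d left queued\n", len(process), len(remaining))
}
```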

@andyzhangx
Member

@landreasyan A ResourceExhausted error won't trigger a reschedule, so it won't mitigate the original MaxDataDiskNumExceeded issue unless the user deletes the pod manually to force a reschedule.

@landreasyan
Contributor Author

@landreasyan A ResourceExhausted error won't trigger a reschedule, so it won't mitigate the original MaxDataDiskNumExceeded issue unless the user deletes the pod manually to force a reschedule.

Yes, it won't, but regardless we should return the right error. This is called out in the CSI spec as well: https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerpublishvolume-errors
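
For reference, a minimal sketch of returning the spec-mandated code (the `checkAttachLimit` helper and its parameters are hypothetical; the real check lives in the driver's ControllerPublishVolume path):

```go
package main

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkAttachLimit returns the CSI-spec-mandated ResourceExhausted
// status once a node can take no more data disks.
func checkAttachLimit(nodeName string, attachedCount, maxDisks int) error {
	if attachedCount >= maxDisks {
		return status.Errorf(codes.ResourceExhausted,
			"node %q has reached its maximum attached disk count (%d)", nodeName, maxDisks)
	}
	return nil
}

func main() {
	_ = checkAttachLimit("aks-nodepool1-0", 8, 8) // non-nil: ResourceExhausted
}
```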

```go
klog.V(2).Infof("no attach disk(%s) request on node(%s), diskMap len:%d, %+v", diskURI, nodeName, len(diskMap), diskMap)
return nil, nil
}
c.attachDiskMap.Store(nodeName, make(map[string]*provider.AttachDiskOptions))
```
Member

The original logic always clears the queue: if removeDisks > 0, keeping the remaining disks in the queue would not succeed in the end, and the CSI driver has retry logic. Clearing the queue makes the logic more straightforward.

Contributor Author

I really wanted to avoid dropping the disks that haven't been processed. Even though I agree that k8s will eventually retry, more changes would be needed to do this safely, so I am adding the queue-clearing behavior back.

Contributor

With this approach, would the next attach request add to the batch of already existing disks in the attachDiskMap? So we wouldn't create a fresh batch, but would reuse the originally dropped disks?
Is that the correct understanding?

@landreasyan landreasyan force-pushed the batching-fix-disk-limit-processing branch from abf84ac to 0384dab on August 5, 2025 23:01
@k8s-ci-robot
Contributor

@landreasyan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-azuredisk-csi-driver-external-e2e-single-az | 0384dab | link | true | /test pull-azuredisk-csi-driver-external-e2e-single-az |
| pull-azuredisk-csi-driver-e2e-capz-multi-az | 0384dab | link | true | /test pull-azuredisk-csi-driver-e2e-capz-multi-az |
| pull-azuredisk-csi-driver-e2e-capz | 0384dab | link | true | /test pull-azuredisk-csi-driver-e2e-capz |
| pull-azuredisk-csi-driver-e2e-capz-vmssflex | 0384dab | link | true | /test pull-azuredisk-csi-driver-e2e-capz-vmssflex |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@andyzhangx
Member

This PR made the following test fail:

Summarizing 1 Failure:
  [FAIL] Dynamic Provisioning [single-az] [It] should succeed without MaximumDataDisksExceeded
  /home/prow/go/src/sigs.k8s.io/azuredisk-csi-driver/test/e2e/testsuites/dynamically_provisioned_job_attach_batch_tester.go:97

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_azuredisk-csi-driver/3237/pull-azuredisk-csi-driver-e2e-capz/1952867569819979776

@andyzhangx andyzhangx requested a review from Copilot August 9, 2025 14:55
Copilot AI left a comment

Pull Request Overview

This PR addresses a bug in the Azure disk CSI driver's batching logic, ensuring that only processable disks are removed from batches and that proper errors are returned when node disk limits are exceeded. The changes implement early detection of disk limit violations and return the appropriate ResourceExhausted errors per the CSI specification.

  • Adds logic to check node disk limits before processing attach requests and to return ResourceExhausted errors early
  • Modifies batch retrieval to only remove disks that can actually be processed, keeping the requested disk in the batch
  • Updates error handling to return ResourceExhausted status codes when disk limits are exceeded

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/azuredisk/controllerserver.go | Adds ResourceExhausted error handling for maximum disk exceeded scenarios |
| pkg/azuredisk/azure_controller_common.go | Refactors batching logic to check disk limits early and only remove processable disks from batches |
| pkg/azuredisk/azure_controller_common_test.go | Updates tests to validate new batching behavior and adds a test for clearing attach disk requests |

```go
numDisksAllowed = 0
currentNodeDisks := len(attachedDisks)
maxNodeDisks := int(maxNodeDisks)
if currentNodeDisks > maxNodeDisks {
```
Copilot AI Aug 9, 2025

The condition should use >= instead of > since reaching the maximum number of disks should also trigger the error. If currentNodeDisks equals maxNodeDisks, there's no room for additional disks.

Suggested change:

```diff
-if currentNodeDisks > maxNodeDisks {
+if currentNodeDisks >= maxNodeDisks {
```
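To make the boundary concrete, here is a small sketch using the hunk's variable names (the standalone function and example values are hypothetical):

```go
package main

import "fmt"

// numDisksAllowed mirrors the reviewed hunk's intent: how many more
// disks fit on the node. With >=, a full node yields zero capacity.
func numDisksAllowed(currentNodeDisks, maxNodeDisks int) int {
	if currentNodeDisks >= maxNodeDisks {
		return 0 // node is full: caller should return ResourceExhausted
	}
	return maxNodeDisks - currentNodeDisks
}

func main() {
	fmt.Println(numDisksAllowed(30, 32)) // 2: two more attaches fit
	fmt.Println(numDisksAllowed(32, 32)) // 0: limit reached
}
```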

klog.Info("First call to CreateOrUpdate succeeded", "VM Name:", name, "Params:", params)
time.Sleep(100 * time.Millisecond) // Simulate some processing time to hold the node lock while the 3rd detach request is made
return nil, nil // First call succeeds
time.Sleep(1000 * time.Millisecond) // Simulate some processing time to hold the node lock while the 3rd detach request is made
Copilot AI Aug 9, 2025

The sleep duration was increased from 100ms to 1000ms (10x increase) without explanation. This significantly slows down the test. Consider using a more reasonable duration or explaining why such a long delay is necessary.

Suggested change:

```diff
-time.Sleep(1000 * time.Millisecond) // Simulate some processing time to hold the node lock while the 3rd detach request is made
+time.Sleep(100 * time.Millisecond) // Simulate processing time to hold the node lock while the 3rd detach request is made; 100ms is sufficient for concurrency in tests
```

@landreasyan landreasyan changed the title chore: only remove disks that can be processed from the batch chore: when node disk attach limit is reached, the driver should return ResourceExhausted errors. Aug 20, 2025