
@mweibel (Contributor) commented Jun 25, 2024

What type of PR is this?
/kind bug

What this PR does / why we need it:
As outlined in the linked issue: when a Machine has the delete annotation, CAPZ needs to prioritize those Machines or AzureMachinePoolMachines for downscaling. This is achieved by fetching the DeleteMachine annotation from the owner Machine before starting to look at scaling decisions.

Which issue(s) this PR fixes:
Fixes #4941

Special notes for your reviewer:

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

ensure Machines with delete-machine annotation are deleted first

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 25, 2024
@k8s-ci-robot (Contributor)

Hi @mweibel. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jackfrancis (Contributor)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 25, 2024
codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 84.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 51.17%. Comparing base (711d861) to head (0b71e08).
Report is 53 commits behind head on main.

Files with missing lines Patch % Lines
azure/scope/machinepool.go 71.42% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4949      +/-   ##
==========================================
+ Coverage   51.14%   51.17%   +0.02%     
==========================================
  Files         274      274              
  Lines       24669    24690      +21     
==========================================
+ Hits        12617    12634      +17     
- Misses      11264    11267       +3     
- Partials      788      789       +1     


@Jont828 (Contributor) commented Jun 26, 2024

Thanks for looking into this. The logic for sorting the delete priority is actually here, and it's worth noting that it will prioritize deleting failed Machines or Machines without the latest model over the delete annotation. Here's the part that sorts the AzureMachines as well if that's helpful.

@nawazkh (Member) commented Jun 26, 2024

Thanks for looking into this. The logic for sorting the delete priority is actually here, and it's worth noting that it will prioritize deleting failed Machines or Machines without the latest model over the delete annotation. Here's the part that sorts the AzureMachines as well if that's helpful.

Looks good to me overall, but I am curious to know if your approach changes with Jonathan's comment 👆🏼, @mweibel.

@mweibel (Contributor, Author) commented Jun 27, 2024

Thanks for the reviews!

Thanks for looking into this. The logic for sorting the delete priority is actually here, and it's worth noting that it will prioritize deleting failed Machines or Machines without the latest model over the delete annotation. Here's the part that sorts the AzureMachines as well if that's helpful.

That's a very good question.

Prioritizing failed machines: that's debatable. It can make sense; however, in the case of Windows nodes, for example, we sometimes have temporarily failed machines that reboot for some reason but come back online. Maybe prioritizing machines marked for deletion would make more sense.

Prioritizing machines without the latest model: this is another question. In our use case it's even debatable whether we want to delete those at all, because we run batch workloads (each on its own machine). This means that machines without the latest model aren't bothering us and we'd rather keep them online. That sounds like a separate issue which needs consideration, though.

I'd happily adjust the code to prioritize machines with the delete annotation. What are your opinions about this, given the additional context I provided?

@mweibel (Contributor, Author) commented Jun 28, 2024

About my previous comment:

This means that machines without the latest model aren't bothering us and we'd rather keep them online.

Just figured out that it actually does bother us: we're currently facing an issue with machines not getting updated, and the VMSS refusing to scale because we have too many VM models at the same time (10 is the limit, apparently). Will file a separate issue for that.

@mweibel (Contributor, Author) commented Jul 1, 2024

Thanks for looking into this. The logic for sorting the delete priority is actually here, and it's worth noting that it will prioritize deleting failed Machines or Machines without the latest model over the delete annotation. Here's the part that sorts the AzureMachines as well if that's helpful.

Reading the code again, I believe this is not what the code does. Or do I misunderstand something?

// Order AzureMachinePoolMachines with the clusterv1.DeleteMachineAnnotation to the front so that they have delete priority.
// This allows MachinePool Machines to work with the autoscaler.
failedMachines = orderByDeleteMachineAnnotation(failedMachines)
deletingMachines = orderByDeleteMachineAnnotation(deletingMachines)
readyMachines = orderByDeleteMachineAnnotation(readyMachines)
machinesWithoutLatestModel = orderByDeleteMachineAnnotation(machinesWithoutLatestModel)

This code ensures that, for each of those machine groups (failed, deleting, ready, without latest model), machines with the delete machine annotation are prioritized over unannotated ones. This is because each machine group is sorted using orderByDeleteMachineAnnotation:

// orderByDeleteMachineAnnotation will sort AzureMachinePoolMachines with the clusterv1.DeleteMachineAnnotation to the front of the list.
// It will otherwise preserve the existing order of the list so that it respects the existing delete priority.
func orderByDeleteMachineAnnotation(machines []infrav1exp.AzureMachinePoolMachine) []infrav1exp.AzureMachinePoolMachine {
	sort.SliceStable(machines, func(i, j int) bool {
		_, iHasAnnotation := machines[i].Annotations[clusterv1.DeleteMachineAnnotation]
		return iHasAnnotation
	})
	return machines
}

Though indeed, we could think about adjusting this code to first delete machines with the annotations, regardless of the group, and then only afterwards potentially delete other machines (maybe in the next reconcile iteration only).

Would that be what we want? I think that would make sense.

@mweibel mweibel force-pushed the propagate-deleteMachine-annotation branch from e6f1522 to eb033f2 Compare July 1, 2024 14:10
@mweibel mweibel changed the title fix: propagate deleteMachine annotation fix: fetch deleteMachine annotation from owner Machine before scaling decision Jul 1, 2024
@mweibel (Contributor, Author) commented Jul 1, 2024

/retest

The failure ./scripts/../hack/../hack/ensure-azcli.sh: line 28: AZURE_CLIENT_SECRET: unbound variable is not related to this change.

@mweibel mweibel requested review from Jont828 and nawazkh July 1, 2024 14:15
@mweibel (Contributor, Author) commented Jul 2, 2024

I assume this needs #4939 to be merged to make all tests work.

@nawazkh

This comment was marked as outdated.

@nawazkh (Member) commented Jul 8, 2024

Though indeed, we could think about adjusting this code to first delete machines with the annotations, regardless of the group, and then only afterwards potentially delete other machines (maybe in the next reconcile iteration only).

Would that be what we want? I think that would make sense.

Coming to this thought; it makes sense to me that "delete annotations" should be respected over cleanup. I vote for that change. Will you be adding that change via this PR?

@nawazkh (Member) commented Jul 8, 2024

I assume this needs #4939 to be merged to make all tests work.

We are migrating CAPZ tests onto community clusters, and as part of that effort pull-cluster-api-provider-azure-e2e-with-wi-optional was created to test the migration. Please disregard any failures of pull-cluster-api-provider-azure-e2e-with-wi-optional for the time being. I will also put out a post on the community channel to convey the same.

@nawazkh (Member) commented Jul 8, 2024


Created a PR kubernetes/test-infra#32926 to remove pull-cluster-api-provider-azure-e2e-with-wi-optional from automatically running on CAPZ PRs.

@nawazkh (Member) commented Jul 8, 2024

/test pull-cluster-api-provider-azure-e2e-aks

@mweibel mweibel force-pushed the propagate-deleteMachine-annotation branch 2 times, most recently from e9bdf9d to 4447721 Compare July 30, 2024 12:52
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 30, 2024
@mweibel mweibel changed the title fix: fetch deleteMachine annotation from owner Machine before scaling decision fix: ensure Machines with delete-machine annotation are deleted first Jul 30, 2024
@mweibel mweibel force-pushed the propagate-deleteMachine-annotation branch from 4447721 to b074410 Compare July 30, 2024 13:10
@k8s-ci-robot (Contributor) commented Jul 30, 2024

@mweibel: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-api-provider-azure-e2e-with-wi-optional (commit eb033f2, required: false)
Rerun command: /test pull-cluster-api-provider-azure-e2e-with-wi-optional

Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Comment on lines 158 to 165
// if we have machines annotated with delete machine, remove them
if len(deleteAnnotatedMachines) > 0 {
	log.Info("delete annotated machines", "desiredReplicaCount", desiredReplicaCount, "maxUnavailable", maxUnavailable, "deleteAnnotatedMachines", getProviderIDs(deleteAnnotatedMachines))
	return deleteAnnotatedMachines, nil
}

// if we have failed or deleting machines, remove them
if len(failedMachines) > 0 || len(deletingMachines) > 0 {
	log.Info("failed or deleting machines", "desiredReplicaCount", desiredReplicaCount, "maxUnavailable", maxUnavailable, "failedMachines", getProviderIDs(failedMachines), "deletingMachines", getProviderIDs(deletingMachines))
@mweibel (Contributor, Author):

I wasn't sure if we should include the annotated machines in this condition (instead of making a separate one).

I decided against this approach because it would require a bit more code (filtering out duplicates) and the benefit didn't seem huge; in the next reconciliation the failed/deleting machines would be sorted out, too.

Please let me know what you think about this, thanks!

@mweibel (Contributor, Author) commented Jul 30, 2024

@nawazkh @Jont828 Please review again.

I updated the implementation to always prefer machines annotated for deletion and removed the ordering of the other kinds of machines.

Some e2e tests currently fail, but they look a bit flaky today. I'll wait until you've reviewed before I retest and, if needed, investigate whether my changes could have an impact on the e2e tests.

"deletingMachines", len(deletingMachines),
)

// if we have machines annotated with delete machine, remove them
Suggested change
// if we have machines annotated with delete machine, remove them
// If we have machines with the delete machine annotation, remove them.

@Jont828 (Contributor) commented Jul 31, 2024

@nawazkh @Jont828 Please review again.

I updated the implementation to always prefer machines annotated for deletion and removed the ordering of the other kinds of machines.

Some e2e tests currently fail, but they look a bit flaky today. I'll wait until you've reviewed before I retest and, if needed, investigate whether my changes could have an impact on the e2e tests.

Overall the PR looks solid to me; I think the only open question is how to prioritize machines with the annotation vs. failed/outdated machines. I remember when I implemented this, we had some discussion about whether to prefer failed/outdated machines over the annotated ones, and it wasn't super clear which course was best. I ended up deleting outdated machines first to be consistent with the DockerMachinePool implementation, and to ensure that we don't get into a situation where we are unable to meet the ready replica count because a failed or outdated machine can't get deleted.

If I understand correctly, does your use case benefit from giving priority to annotated machines instead?

@nawazkh (Member) commented Aug 2, 2024

/retest

@mweibel (Contributor, Author) commented Aug 5, 2024

If I understand correctly, does your use case benefit from giving priority to annotated machines instead?

The code as it is in this PR would give priority to those, yes.

I'm not certain about this either. I guess there are three ways to handle it:

  • delete annotated machines first, then failing/deleting (next reconcile)
  • delete failing/deleting machines first, then annotated (next reconcile)
  • merge those three lists together (making them unique)

Do you have any preference regarding which path to take?

In certain circumstances the deleteMachine annotation isn't yet
propagated from the Machine to the AzureMachinePoolMachine. This can
result in deleting the wrong machines. This change ensures the owner
Machine's deleteMachine annotation is always considered and those
machines are deleted before other kinds of machines.
@mweibel mweibel force-pushed the propagate-deleteMachine-annotation branch from b074410 to 0b71e08 Compare August 8, 2024 07:15
@mweibel (Contributor, Author) commented Aug 8, 2024

@nawazkh @Jont828 after testing this change in production under high load (>800 Windows machines), I noticed an issue related to the discussion/open question: when there are too many VMSS VMs in a Failed state, the VMSS is marked as Unhealthy, preventing further scale up/down.

Therefore, I adjusted this PR to first delete failed/deleting machines (sorted by delete annotation) and only afterwards delete those with the delete machine annotation.

@nawazkh (Member) commented Aug 9, 2024

Looks good to me, thank you for putting this together.
Will wait for @Jont828 inputs.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 9, 2024
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 8d4532c838a358d39fbb0cc38faf598e94f3c2f7

@nojnhuh (Contributor) commented Aug 22, 2024

Looks good to me, thank you for putting this together. Will wait for @Jont828 inputs. /lgtm

/assign @Jont828

@Jont828 (Contributor) commented Aug 29, 2024

@nawazkh @Jont828 after testing this change in production under high load (>800 Windows machines), I noticed an issue related to the discussion/open question: when there are too many VMSS VMs in a Failed state, the VMSS is marked as Unhealthy, preventing further scale up/down.

Therefore, I adjusted this PR to first delete failed/deleting machines (sorted by delete annotation) and only afterwards delete those with the delete machine annotation.

Thanks for your hard work! And yep, that's why I originally had it prioritize deleting failing machines over the annotation. We want to make sure we can get out of an unhealthy state before worrying about autoscaling behavior.

/lgtm
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jont828

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 29, 2024
@k8s-ci-robot k8s-ci-robot merged commit 51310aa into kubernetes-sigs:main Aug 29, 2024
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Aug 29, 2024


Development

Successfully merging this pull request may close these issues.

Downscaling with cluster-autoscaler (provider clusterapi) scales down wrong MachinePoolMachines

6 participants