
@mweibel (Contributor) commented Jun 25, 2024

What type of PR is this?
/kind bug

What this PR does / why we need it:
As outlined in the linked issue: when a Machine has the delete annotation, CAPZ needs to prioritize those Machines or AzureMachinePoolMachines for downscaling. This is achieved by fetching the DeleteMachine annotation from the owner Machine before starting to look at scaling decisions.

Which issue(s) this PR fixes:
Fixes #4941

Special notes for your reviewer:

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

ensure Machines with delete-machine annotation are deleted first

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 25, 2024
@k8s-ci-robot (Contributor)

Hi @mweibel. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jackfrancis (Contributor)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 25, 2024
codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 84.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 51.17%. Comparing base (711d861) to head (0b71e08).
Report is 53 commits behind head on main.

Files with missing lines Patch % Lines
azure/scope/machinepool.go 71.42% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4949      +/-   ##
==========================================
+ Coverage   51.14%   51.17%   +0.02%     
==========================================
  Files         274      274              
  Lines       24669    24690      +21     
==========================================
+ Hits        12617    12634      +17     
- Misses      11264    11267       +3     
- Partials      788      789       +1     


@Jont828 (Contributor) commented Jun 26, 2024

Thanks for looking into this. The logic for sorting the delete priority is actually here, and it's worth noting that it will prioritize deleting failed Machines or Machines without the latest model over the delete annotation. Here's the part that sorts the AzureMachines as well if that's helpful.

@nawazkh (Member) commented Jun 26, 2024

Thanks for looking into this. The logic for sorting the delete priority is actually here, and it's worth noting that it will prioritize deleting failed Machines or Machines without the latest model over the delete annotation. Here's the part that sorts the AzureMachines as well if that's helpful.

Looks good to me overall, but I am curious to know if your approach changes with Jonathan's comment 👆🏼, @mweibel.

@mweibel (Contributor, Author) commented Jun 27, 2024

Thanks for the reviews!

Thanks for looking into this. The logic for sorting the delete priority is actually here, and it's worth noting that it will prioritize deleting failed Machines or Machines without the latest model over the delete annotation. Here's the part that sorts the AzureMachines as well if that's helpful.

That's a very good question.

Prioritizing failed machines: that's debatable. It can make sense; however, in the case of Windows nodes, for example, we sometimes have temporarily failed machines that reboot for some reason but come back online. Maybe prioritizing machines marked for deletion would make more sense.

Prioritizing machines without the latest model: this is another question. In our use case it's even debatable whether we want to delete those at all, because we run batch workloads (each on its own machine). This means that machines without the latest model aren't bothering us and we'd rather keep them online. That sounds like a separate issue which needs consideration, though.

I'd happily adjust the code to prioritize machines with the delete annotation. What are your opinions about this, given the additional context I provided?

@mweibel (Contributor, Author) commented Jun 28, 2024

About my previous comment:

This means that machines without the latest model aren't bothering us and we'd rather keep them online.

Just figured out that it actually does bother us: we're currently facing an issue with machines not getting updated, and the VMSS refusing to scale because we have too many VM models at the same time (10 is the limit, apparently). Will file a separate issue for that.

@mweibel (Contributor, Author) commented Jul 1, 2024

Thanks for looking into this. The logic for sorting the delete priority is actually here, and it's worth noting that it will prioritize deleting failed Machines or Machines without the latest model over the delete annotation. Here's the part that sorts the AzureMachines as well if that's helpful.

Reading the code again, I believe this is not what the code does. Or do I misunderstand something?

// Order AzureMachinePoolMachines with the clusterv1.DeleteMachineAnnotation to the front so that they have delete priority.
// This allows MachinePool Machines to work with the autoscaler.
failedMachines = orderByDeleteMachineAnnotation(failedMachines)
deletingMachines = orderByDeleteMachineAnnotation(deletingMachines)
readyMachines = orderByDeleteMachineAnnotation(readyMachines)
machinesWithoutLatestModel = orderByDeleteMachineAnnotation(machinesWithoutLatestModel)

This code ensures that, for each of those machine groups (failed, deleting, ready, without latest model), machines with the delete machine annotation are prioritized over unannotated ones. This is because each machine group is sorted using orderByDeleteMachineAnnotation:

// orderByDeleteMachineAnnotation will sort AzureMachinePoolMachines with the clusterv1.DeleteMachineAnnotation to the front of the list.
// It will otherwise preserve the existing order of the list so that it respects the existing delete priority.
func orderByDeleteMachineAnnotation(machines []infrav1exp.AzureMachinePoolMachine) []infrav1exp.AzureMachinePoolMachine {
	sort.SliceStable(machines, func(i, j int) bool {
		_, iHasAnnotation := machines[i].Annotations[clusterv1.DeleteMachineAnnotation]
		return iHasAnnotation
	})
	return machines
}

Though indeed, we could think about adjusting this code to first delete machines with the annotations, regardless of the group, and then only afterwards potentially delete other machines (maybe in the next reconcile iteration only).

Would that be what we want? I think that would make sense.

@mweibel mweibel force-pushed the propagate-deleteMachine-annotation branch from e6f1522 to eb033f2 Compare July 1, 2024 14:10
@mweibel mweibel changed the title fix: propagate deleteMachine annotation fix: fetch deleteMachine annotation from owner Machine before scaling decision Jul 1, 2024
@mweibel (Contributor, Author) commented Jul 1, 2024

/retest

The failure ./scripts/../hack/../hack/ensure-azcli.sh: line 28: AZURE_CLIENT_SECRET: unbound variable is not related to this change.

@mweibel mweibel requested review from Jont828 and nawazkh July 1, 2024 14:15
@mweibel (Contributor, Author) commented Jul 2, 2024

I assume this needs #4939 to be merged to make all tests work.

@nawazkh

This comment was marked as outdated.

@nawazkh (Member) commented Jul 8, 2024

Though indeed, we could think about adjusting this code to first delete machines with the annotations, regardless of the group, and then only afterwards potentially delete other machines (maybe in the next reconcile iteration only).

Would that be what we want? I think that would make sense.

Coming to this thought; it makes sense to me that "delete annotations" should be respected over cleanup. I vote for that change. Will you be adding that change via this PR?

@nawazkh (Member) commented Jul 8, 2024

I assume this needs #4939 to be merged to make all tests work.

We are migrating CAPZ tests onto community clusters, and as part of that effort pull-cluster-api-provider-azure-e2e-with-wi-optional was created to test the migration. Please disregard any failures of pull-cluster-api-provider-azure-e2e-with-wi-optional for the time being. I will also put out a post on the community channel to convey the same.

@nawazkh (Member) commented Jul 8, 2024


Created a PR kubernetes/test-infra#32926 to remove pull-cluster-api-provider-azure-e2e-with-wi-optional from automatically running on CAPZ PRs.

@nawazkh (Member) commented Jul 8, 2024

/test pull-cluster-api-provider-azure-e2e-aks

@mweibel mweibel force-pushed the propagate-deleteMachine-annotation branch 2 times, most recently from e9bdf9d to 4447721 Compare July 30, 2024 12:52
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 30, 2024
@mweibel mweibel changed the title fix: fetch deleteMachine annotation from owner Machine before scaling decision fix: ensure Machines with delete-machine annotation are deleted first Jul 30, 2024
@mweibel mweibel force-pushed the propagate-deleteMachine-annotation branch from 4447721 to b074410 Compare July 30, 2024 13:10
@k8s-ci-robot (Contributor) commented Jul 30, 2024

@mweibel: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-api-provider-azure-e2e-with-wi-optional (commit eb033f2, required: false)
Rerun command: /test pull-cluster-api-provider-azure-e2e-with-wi-optional

Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Comment on lines 158 to 165
// if we have machines annotated with delete machine, remove them
if len(deleteAnnotatedMachines) > 0 {
	log.Info("delete annotated machines", "desiredReplicaCount", desiredReplicaCount, "maxUnavailable", maxUnavailable, "deleteAnnotatedMachines", getProviderIDs(deleteAnnotatedMachines))
	return deleteAnnotatedMachines, nil
}

// if we have failed or deleting machines, remove them
if len(failedMachines) > 0 || len(deletingMachines) > 0 {
	log.Info("failed or deleting machines", "desiredReplicaCount", desiredReplicaCount, "maxUnavailable", maxUnavailable, "failedMachines", getProviderIDs(failedMachines), "deletingMachines", getProviderIDs(deletingMachines))
@mweibel (Contributor, Author):

I wasn't sure if we should include the annotated machines in this condition (instead of making a separate one).

I decided against this approach because it would require a bit more code (filtering out duplicates) and the benefit didn't seem huge; in the next reconciliation the failed/deleting machines would be sorted out, too.

Please let me know what you think about this, thanks!

@mweibel (Contributor, Author) commented Jul 30, 2024

@nawazkh @Jont828 Please review again.

I updated the implementation to always prefer machines annotated for deletion and removed the ordering of the other kinds of machines.

Some e2e tests currently fail, but they look a bit flaky today. I'll wait until you've reviewed before I retest and, if needed, investigate whether my changes could have an impact on the e2e tests.

"deletingMachines", len(deletingMachines),
)

// if we have machines annotated with delete machine, remove them
Suggested change
// if we have machines annotated with delete machine, remove them
// If we have machines with the delete machine annotation, remove them.

@Jont828 (Contributor) commented Jul 31, 2024

@nawazkh @Jont828 Please review again.

I updated the implementation to always prefer machines annotated for deletion and removed the ordering of the other kinds of machines.

Some e2e tests currently fail, but they look a bit flaky today. I'll wait until you've reviewed before I retest and, if needed, investigate whether my changes could have an impact on the e2e tests.

Overall the PR looks solid to me; I think the only open question is how to prioritize machines with the annotation vs. failed/outdated machines. I remember when I implemented this, we had some discussion about whether to prefer failed/outdated machines over the annotated ones, and it wasn't super clear which course was best. I ended up deleting outdated machines first to be consistent with the DockerMachinePool implementation, and to ensure that we don't get into a situation where we are unable to meet the ready replica count because a failed or outdated machine can't get deleted.

If I understand correctly, does your use case benefit from giving priority to annotated machines instead?

@nawazkh (Member) commented Aug 2, 2024

/retest

@mweibel (Contributor, Author) commented Aug 5, 2024

If I understand correctly, does your use case benefit from giving priority to annotated machines instead?

The code as it is in this PR would give priority to those, yes.

I'm not certain about this either. I guess there are three ways to handle it:

  • delete annotated machines first, then failing/deleting (next reconcile)
  • delete failing/deleting machines first, then annotated (next reconcile)
  • merge those three lists together (making them unique)

Do you have any preference regarding which path to take?

In certain circumstances the deleteMachine annotation isn't yet
propagated from the Machine to the AzureMachinePoolMachine. This can
result in deleting the wrong machines. This change ensures the owner
Machine's deleteMachine annotation is always considered and those
machines are deleted before other kinds of machines.
@mweibel mweibel force-pushed the propagate-deleteMachine-annotation branch from b074410 to 0b71e08 Compare August 8, 2024 07:15
@mweibel (Contributor, Author) commented Aug 8, 2024

@nawazkh @Jont828 after testing this change in production under high load (>800 Windows machines), I noticed an issue related to the discussion/open question: when there are too many VMSS VMs in a Failed state, the VMSS is marked as Unhealthy, preventing further scale up/down.

Therefore, I adjusted this PR to first delete failed/deleting machines (sorted by delete annotation) and only afterwards delete those with the delete machine annotation.

@nawazkh (Member) commented Aug 9, 2024

Looks good to me, thank you for putting this together.
Will wait for @Jont828 inputs.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 9, 2024
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 8d4532c838a358d39fbb0cc38faf598e94f3c2f7

@nojnhuh (Contributor) commented Aug 22, 2024

Looks good to me, thank you for putting this together. Will wait for @Jont828 inputs. /lgtm

/assign @Jont828

@Jont828 (Contributor) commented Aug 29, 2024

@nawazkh @Jont828 after testing this change in production under high load (>800 Windows machines), I noticed an issue related to the discussion/open question: when there are too many VMSS VMs in a Failed state, the VMSS is marked as Unhealthy, preventing further scale up/down.

Therefore, I adjusted this PR to first delete failed/deleting machines (sorted by delete annotation) and only afterwards delete those with the delete machine annotation.

Thanks for your hard work! And yep, that's why I originally had it prioritize deleting failing machines over the annotation. We want to make sure we can get out of an unhealthy state before worrying about autoscaling behavior.

/lgtm
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jont828

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 29, 2024
@k8s-ci-robot k8s-ci-robot merged commit 51310aa into kubernetes-sigs:main Aug 29, 2024
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Aug 29, 2024


Development

Successfully merging this pull request may close these issues.

Downscaling with cluster-autoscaler (provider clusterapi) scales down wrong MachinePoolMachines

6 participants