-
Notifications
You must be signed in to change notification settings - Fork 4.2k
fix(cluster-autoscaler): prevent panic in SimulateNodeRemoval by handling missing node info #8449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix(cluster-autoscaler): prevent panic in SimulateNodeRemoval by handling missing node info #8449
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Hi @kincoy. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: kincoy The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for the PR and adding the test too.
the changes here look generally good to me, but i am not overly familiar with the simulator code. in specific, the returns for the failure condition.
would it be possible to add a unit test for the other return value as well? (i see the test for the unremovable node with the NoNodeInfo
, but can we also test for UnexpectedError
?)
Thanks! I looked into If you have a clean way to test this case, I’m happy to give it a try! |
thanks for investigating, i was worried it might take a complicated mock to make it work. i don't think it's worth the effort to create a test with a mock just for the error condition. /lgtm would be good to get another review from a maintainer |
@@ -151,6 +155,11 @@ func (r *RemovalSimulator) SimulateNodeRemoval( | |||
nodeInfo, err := r.clusterSnapshot.GetNodeInfo(nodeName) | |||
if err != nil { | |||
klog.Errorf("Can't retrieve node %s from snapshot, err: %v", nodeName, err) | |||
ghostNode := &apiv1.Node{ObjectMeta: metav1.ObjectMeta{Name: nodeName}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a slight preference for doing something like this
unremovableReason := UnexpectedError
if errors.Is(err, clustersnapshot.ErrNodeNotFound) {
unremovableReason = NoNodeInfo
}
unremovableNode := &UnremovableNode{Node: &apiv1.Node{ObjectMeta: metav1.ObjectMeta{Name: nodeName}}, Reason: unremovableReason}
return nil, unremovableNode
- to avoid the multiple return statements
- to actuate what the the if condition is really trying to determine
- to avoid wrapping the node object in an "opinonated" variable name like "ghostNode", which may indicate something more interesting going on than actually is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense — updated as you suggested. Also applied consistent naming in both implementation and tests.Thanks for the helpful review!
3d61fed
to
ed37b25
Compare
New changes are detected. LGTM label has been removed. |
…ling missing node info
ed37b25
to
6fcb503
Compare
Friendly ping @jackfrancis — this PR has been idle for a while. Would appreciate a review when convenient 🙏 |
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a potential
nil
dereference issue inSimulateNodeRemoval
when a node is missing from theclusterSnapshot
.Previously, if
clusterSnapshot.GetNodeInfo
failed, the function would continue and potentially panic when accessingnodeInfo.Pods()
.This fix introduces:
UnremovableReason
:NoNodeInfo
, used to mark such nodes as unremovable.SimulateNodeRemoval
when the node is missing..Name
set) to maintain observability and event compatibility.Which issue(s) this PR fixes:
N/A
Special notes for your reviewer:
This PR fixes a potential nil dereference in
SimulateNodeRemoval
when a node is missing from the cluster snapshot.It adds a new
UnremovableReason
(NoNodeInfo
) to capture this edge case, and add the test coverage.Does this PR introduce a user-facing change?
NONE