Conversation

kincoy
Contributor

@kincoy kincoy commented Aug 18, 2025

What type of PR is this?

/kind bug


What this PR does / why we need it:

This PR fixes a potential nil dereference issue in SimulateNodeRemoval when a node is missing from the clusterSnapshot.

Previously, if clusterSnapshot.GetNodeInfo failed, the function would continue and potentially panic when accessing nodeInfo.Pods().

This fix introduces:

  • A new UnremovableReason: NoNodeInfo, used to mark such nodes as unremovable.
  • An early return from SimulateNodeRemoval when the node is missing.
  • A defensive placeholder node (with only .Name set) to maintain observability and event compatibility.
  • A dedicated unit test case to verify this scenario is correctly handled.
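The early return and placeholder node described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual cluster-autoscaler code: `snapshot`, `simulateNodeRemoval`, `errNodeNotFound`, and the simplified `Node`, `NodeInfo`, and `UnremovableNode` types are hypothetical stand-ins, and the toy lookup can only fail with a not-found error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound is a hypothetical stand-in for the snapshot's not-found error.
var errNodeNotFound = errors.New("node not found")

// Simplified, hypothetical versions of the simulator's types.
type Node struct{ Name string }

type NodeInfo struct{ pods []string }

// Pods reads a field through the receiver, so calling it on a nil *NodeInfo panics.
func (n *NodeInfo) Pods() []string { return n.pods }

type UnremovableReason string

const NoNodeInfo UnremovableReason = "NoNodeInfo"

type UnremovableNode struct {
	Node   *Node
	Reason UnremovableReason
}

// snapshot is a toy cluster snapshot keyed by node name.
type snapshot map[string]*NodeInfo

func (s snapshot) GetNodeInfo(name string) (*NodeInfo, error) {
	info, ok := s[name]
	if !ok {
		return nil, errNodeNotFound
	}
	return info, nil
}

// simulateNodeRemoval returns the pods that would need rescheduling, or an
// UnremovableNode explaining why the node cannot be considered.
func simulateNodeRemoval(s snapshot, nodeName string) ([]string, *UnremovableNode) {
	nodeInfo, err := s.GetNodeInfo(nodeName)
	if err != nil {
		// nodeInfo is nil here, so return before anything calls nodeInfo.Pods().
		// The placeholder node carries only the name, which is enough for
		// logging and event reporting.
		return nil, &UnremovableNode{Node: &Node{Name: nodeName}, Reason: NoNodeInfo}
	}
	return nodeInfo.Pods(), nil
}

func main() {
	s := snapshot{"n1": &NodeInfo{pods: []string{"p1", "p2"}}}
	for _, name := range []string{"n1", "missing"} {
		pods, unremovable := simulateNodeRemoval(s, name)
		if unremovable != nil {
			fmt.Printf("%s: unremovable (%s)\n", unremovable.Node.Name, unremovable.Reason)
			continue
		}
		fmt.Printf("%s: %d pods to reschedule\n", name, len(pods))
	}
}
```

Without the early return, the missing-node case would reach `nodeInfo.Pods()` with a nil pointer, which is the panic this PR guards against.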

Which issue(s) this PR fixes:

N/A


Special notes for your reviewer:

This PR fixes a potential nil dereference in SimulateNodeRemoval when a node is missing from the cluster snapshot.
It adds a new UnremovableReason (NoNodeInfo) to capture this edge case and adds test coverage for it.


Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 18, 2025
@k8s-ci-robot
Contributor

Hi @kincoy. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kincoy
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from elmiko August 18, 2025 09:35
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Aug 18, 2025
Contributor

@elmiko elmiko left a comment

thank you for the PR and adding the test too.

the changes here look generally good to me, but i am not overly familiar with the simulator code. in specific, the returns for the failure condition.

would it be possible to add a unit test for the other return value as well? (i see the test for the unremovable node with the NoNodeInfo, but can we also test for UnexpectedError?)

@kincoy
Contributor Author

kincoy commented Aug 22, 2025

> thank you for the PR and adding the test too.
>
> the changes here look generally good to me, but i am not overly familiar with the simulator code. in specific, the returns for the failure condition.
>
> would it be possible to add a unit test for the other return value as well? (i see the test for the unremovable node with the NoNodeInfo, but can we also test for UnexpectedError?)

Thanks! I looked into GetNodeInfo — aside from ErrNodeNotFound, other errors only happen when draEnabled is true and WrapSchedulerNodeInfo fails, which is rare and hard to simulate without artificial mocks.

If you have a clean way to test this case, I’m happy to give it a try!
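For reference, one way the non-ErrNodeNotFound path could be exercised is a small hand-written stub that returns an injected error. The sketch below is only an illustration under assumptions: `nodeInfoGetter`, `stubSnapshot`, `reasonFor`, `errNodeNotFound`, and `errInjected` are hypothetical names, and the real ClusterSnapshot interface is considerably larger than the single method shown here, which is part of why such a stub was judged costly.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the snapshot's sentinel error and an arbitrary failure.
var (
	errNodeNotFound = errors.New("node not found")
	errInjected     = errors.New("injected failure")
)

type nodeInfo struct{ name string }

// nodeInfoGetter is a hypothetical one-method slice of the snapshot interface.
type nodeInfoGetter interface {
	GetNodeInfo(nodeName string) (*nodeInfo, error)
}

// stubSnapshot returns err for every lookup, or a trivial nodeInfo when err is nil.
type stubSnapshot struct{ err error }

func (s stubSnapshot) GetNodeInfo(nodeName string) (*nodeInfo, error) {
	if s.err != nil {
		return nil, s.err
	}
	return &nodeInfo{name: nodeName}, nil
}

// reasonFor maps the lookup outcome to a reason name, as the simulator might.
func reasonFor(g nodeInfoGetter, nodeName string) string {
	_, err := g.GetNodeInfo(nodeName)
	switch {
	case err == nil:
		return "NoReason"
	case errors.Is(err, errNodeNotFound):
		return "NoNodeInfo"
	default:
		return "UnexpectedError"
	}
}

func main() {
	cases := []struct {
		name string
		err  error
		want string
	}{
		{name: "node present", err: nil, want: "NoReason"},
		{name: "node missing", err: errNodeNotFound, want: "NoNodeInfo"},
		{name: "other failure", err: errInjected, want: "UnexpectedError"},
	}
	for _, c := range cases {
		got := reasonFor(stubSnapshot{err: c.err}, "n1")
		fmt.Printf("%-14s got=%s want=%s\n", c.name, got, c.want)
	}
}
```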

@elmiko
Contributor

elmiko commented Aug 27, 2025

> Thanks! I looked into GetNodeInfo — aside from ErrNodeNotFound, other errors only happen when draEnabled is true and WrapSchedulerNodeInfo fails, which is rare and hard to simulate without artificial mocks.

thanks for investigating, i was worried it might take a complicated mock to make it work. i don't think it's worth the effort to create a test with a mock just for the error condition.

/lgtm

would be good to get another review from a maintainer

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 27, 2025
@@ -151,6 +155,11 @@ func (r *RemovalSimulator) SimulateNodeRemoval(
nodeInfo, err := r.clusterSnapshot.GetNodeInfo(nodeName)
if err != nil {
klog.Errorf("Can't retrieve node %s from snapshot, err: %v", nodeName, err)
ghostNode := &apiv1.Node{ObjectMeta: metav1.ObjectMeta{Name: nodeName}}
Contributor

I have a slight preference for doing something like this

		unremovableReason := UnexpectedError
		if errors.Is(err, clustersnapshot.ErrNodeNotFound) {
			unremovableReason = NoNodeInfo
		}
		unremovableNode := &UnremovableNode{Node: &apiv1.Node{ObjectMeta: metav1.ObjectMeta{Name: nodeName}}, Reason: unremovableReason}
		return nil, unremovableNode
  1. to avoid the multiple return statements
  2. to make explicit what the if condition is really trying to determine
  3. to avoid wrapping the node object in an "opinionated" variable name like "ghostNode", which may suggest that something more interesting is going on than actually is
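The suggestion relies on `errors.Is`, which also matches a sentinel error that has been wrapped with `%w`, whereas a plain `==` comparison does not. A minimal sketch of that classification, assuming hypothetical stand-ins (`ErrNodeNotFound`, `errUnrelated`, `wrapLookupError`, `reasonForLookupError`, and string-valued reasons chosen here only for readable output):

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNodeNotFound is a hypothetical stand-in for clustersnapshot.ErrNodeNotFound.
var ErrNodeNotFound = errors.New("node not found")

// errUnrelated represents any other lookup failure (illustrative only).
var errUnrelated = errors.New("unrelated failure")

// UnremovableReason uses strings here for readable output; this is an assumption.
type UnremovableReason string

const (
	NoNodeInfo      UnremovableReason = "NoNodeInfo"
	UnexpectedError UnremovableReason = "UnexpectedError"
)

// wrapLookupError adds context while keeping the sentinel reachable via %w.
func wrapLookupError(nodeName string) error {
	return fmt.Errorf("snapshot lookup for %q: %w", nodeName, ErrNodeNotFound)
}

// reasonForLookupError defaults to UnexpectedError and narrows to NoNodeInfo
// only when the error chain contains ErrNodeNotFound.
func reasonForLookupError(err error) UnremovableReason {
	reason := UnexpectedError
	if errors.Is(err, ErrNodeNotFound) {
		reason = NoNodeInfo
	}
	return reason
}

func main() {
	wrapped := wrapLookupError("n1")
	fmt.Println("equal by ==:", wrapped == ErrNodeNotFound)
	fmt.Println("wrapped not-found:", reasonForLookupError(wrapped))
	fmt.Println("unrelated error:", reasonForLookupError(errUnrelated))
}
```

Defaulting to the broader reason and narrowing it in a single `if` keeps one return statement, which matches the first point above.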

Contributor Author

Makes sense, updated as you suggested. Also applied consistent naming in both implementation and tests. Thanks for the helpful review!

@kincoy kincoy force-pushed the fix/simulate-removal-nonexistent-node branch from 3d61fed to ed37b25 Compare August 28, 2025 02:10
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 28, 2025
@kincoy kincoy force-pushed the fix/simulate-removal-nonexistent-node branch from ed37b25 to 6fcb503 Compare August 28, 2025 02:14
@kincoy
Contributor Author

kincoy commented Sep 1, 2025

Friendly ping @jackfrancis — this PR has been idle for a while. Would appreciate a review when convenient 🙏

Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
4 participants