
Conversation

chizhang21
Contributor

@chizhang21 chizhang21 commented Sep 24, 2025

- What I did
Ensure the node passed to RunCordonOrUncordon comes from the latest updated state

- How to verify it
Trigger a node cordon/uncordon. When the first cordon/uncordon request fails, subsequent RunCordonOrUncordon calls skip PatchOrReplaceWithContext and return nil, so the MCC hangs waiting for the node state to change.
With this change, the MCC works as expected.

- Description for the changelog
CordonHelper.PatchOrReplaceWithContext() always sets the node's in-memory Unschedulable field to the desired value, even if the PATCH/UPDATE request fails
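
For context, a minimal sketch of the failure mode described above, assuming a retry loop similar to the MCC's drain controller (the package, function name, backoff values, and client wiring are illustrative placeholders, not the actual MCO code):

package cordonexample

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
    "k8s.io/kubectl/pkg/drain"
)

// buggyCordon reuses the same in-memory node on every retry. If the first
// PATCH/UPDATE is rejected (for example by an admission webhook), the drain
// helper has already set node.Spec.Unschedulable to the desired value in
// memory, so every later RunCordonOrUncordon call decides nothing needs to
// change, returns nil without contacting the API server, and this loop keeps
// polling a cluster state that never changes.
func buggyCordon(ctx context.Context, client kubernetes.Interface, helper *drain.Helper, node *corev1.Node) error {
    backoff := wait.Backoff{Duration: 10 * time.Second, Factor: 2, Steps: 5} // illustrative values
    return wait.ExponentialBackoff(backoff, func() (bool, error) {
        if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
            return false, nil // request failed, retry
        }
        // Re-read the node to confirm the cordon actually landed in the cluster.
        updated, err := client.CoreV1().Nodes().Get(ctx, node.Name, metav1.GetOptions{})
        if err != nil {
            return false, nil
        }
        return updated.Spec.Unschedulable, nil // stays false forever after a failed first attempt
    })
}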

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 24, 2025
@openshift-ci
Contributor

openshift-ci bot commented Sep 24, 2025

Hi @chizhang21. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chizhang21
Contributor Author

Hi @yuqi-zhang @pablintino , could you please take a look at this PR? I’m available for any follow-ups. Thanks!

}

// make sure local node's Unschedulable state is updated
node.Spec.Unschedulable = updatedNode.Spec.Unschedulable
Contributor

@pablintino pablintino Sep 24, 2025

While I think what you propose does make sense from a functional point of view, I have a few comments:

  • The node variable is not local; we are modifying a reference and thus changing it for the caller too, and I wouldn't expect this function to write to the reference I pass. The underlying cordon/uncordon logic only does reads, and that's what I'd expect from this function too.
  • Following from the previous point: why don't we always consume the updated reference inside the ExponentialBackoff? In the first loop it can be initialized with the node reference to make it work (roughly as sketched below).
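
A minimal sketch of that suggestion, under the same assumptions as the sketch in the PR description (function name and backoff values are placeholders, not the actual MCO code): keep a local reference that starts as the caller's node and is refreshed from the API server on every iteration, so the caller's copy is never written to and RunCordonOrUncordon always sees the latest state.

// Same package and imports as the earlier sketch.
func cordonWithFreshNode(ctx context.Context, client kubernetes.Interface, helper *drain.Helper, node *corev1.Node, desired bool) error {
    backoff := wait.Backoff{Duration: 10 * time.Second, Factor: 2, Steps: 5} // illustrative values
    current := node // the first iteration uses the reference passed by the caller
    return wait.ExponentialBackoff(backoff, func() (bool, error) {
        cordonErr := drain.RunCordonOrUncordon(helper, current, desired)
        // Always refresh from the API server so the next attempt (and the
        // success check) operates on the latest state, never on a stale copy.
        updated, getErr := client.CoreV1().Nodes().Get(ctx, current.Name, metav1.GetOptions{})
        if getErr == nil {
            current = updated
        }
        if cordonErr != nil || getErr != nil {
            return false, nil // retry
        }
        return current.Spec.Unschedulable == desired, nil
    })
}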

Contributor Author

That makes sense. I’ll update it a bit later.

Contributor Author

Thanks for the feedback—updates pushed. Could you take another look?

return false, nil
}

// make sure local node's Unschedulable state is updated
Contributor

Do we have an OCPBUGS ticket to track this? To which versions does this apply?

Contributor Author

@chizhang21 chizhang21 Sep 25, 2025

A related bug hasn't been created yet. We encountered this issue on version 4.19, so I expect it to be fixed in 4.19 and later versions.

Contributor Author

Would you like me to create an OCPBUGS ticket?

Contributor

Yes please, because our QE will need to test it, and our automation requires it in order to include the fix in release branches.

Contributor Author

Hi Pablo, ticket OCPBUGS-62341 has been created: https://issues.redhat.com/browse/OCPBUGS-62341
Please review when convenient. Thank you!

@chizhang21 chizhang21 force-pushed the cordon_state_enhance branch 4 times, most recently from f049755 to 75aef33 on September 27, 2025 13:09
@chizhang21 chizhang21 changed the title from "Ensure node's Unschedulable state stays in sync" to "OCPBUGS-62341: Ensure node's Unschedulable state stays in sync" on Sep 29, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 29, 2025
@openshift-ci-robot
Contributor

@chizhang21: This pull request references Jira Issue OCPBUGS-62341, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did
Update node's Unschedulable state to keep it consistent with the node status in the cluster

- How to verify it
Trigger a node cordon/uncordon. When the first cordon/uncordon request fails, subsequent RunCordonOrUncordon calls skip PatchOrReplaceWithContext and return nil, so the MCC hangs waiting for the node state to change.
With this change, the MCC works as expected.

- Description for the changelog
CordonHelper.PatchOrReplaceWithContext() always sets the node's in-memory Unschedulable field to the desired value, even if the PATCH/UPDATE request fails

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@pablintino
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 30, 2025
@chizhang21
Contributor Author

/retest

1 similar comment
@chizhang21
Contributor Author

/retest

@sergiordlr
Contributor

Verified using IPI on AWS

  1. Create a webhook that fails any attempt to change the .spec.unschedulable value on a node, so that all cordon/uncordon operations fail (a sketch of such a handler is included after these verification steps)

This is an example of a webhook failing all cordon/uncordon operations: https://github.com/sergiordlr/temp-testfiles/tree/master/webhook_example

  2. Apply a MachineConfig to make the MCO cordon/uncordon the nodes to apply the config:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-machine-config-0
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,dGVzdA==
        path: /etc/test-file-0.test

  3. Check that the MCO controller cannot cordon the node and starts retrying
I1001 08:57:06.698182       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: cordoning
I1001 08:57:06.698230       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: true)
I1001 08:57:06.720010       1 drain_controller.go:580] cordon failed with: cordon error: admission webhook "unschedulable-webhook.default.svc" denied the request: Changing .spec.unschedulable on node is forbidden., retrying
I1001 08:57:16.720564       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: false)
I1001 08:57:16.723557       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: RunCordonOrUncordon() succeeded but node is still not in cordon state, retrying
I1001 08:57:36.724415       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: true)
I1001 08:57:36.746432       1 drain_controller.go:580] cordon failed with: cordon error: admission webhook "unschedulable-webhook.default.svc" denied the request: Changing .spec.unschedulable on node is forbidden., retrying
I1001 08:58:16.747298       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: false)
  4. Remove the MutatingWebhookConfiguration created in step 1 to allow cordon/uncordon operations to succeed again
  5. Check that the controller can now cordon the node and start applying the config
I1001 08:57:36.724415       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: true)
I1001 08:57:36.746432       1 drain_controller.go:580] cordon failed with: cordon error: admission webhook "unschedulable-webhook.default.svc" denied the request: Changing .spec.unschedulable on node is forbidden., retrying
I1001 08:58:16.747298       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: false)
I1001 08:58:16.751118       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: RunCordonOrUncordon() succeeded but node is still not in cordon state, retrying
I1001 08:59:36.751425       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: true)
I1001 08:59:36.761928       1 node_controller.go:676] Pool worker[zone=us-east-2a]: node ip-10-0-26-72.us-east-2.compute.internal: Reporting unready: node ip-10-0-26-72.us-east-2.compute.internal is reporting Unschedulable
I1001 08:59:36.766330       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: cordon succeeded (currently schedulable: false)
I1001 08:59:36.777843       1 node_controller.go:676] Pool worker[zone=us-east-2a]: node ip-10-0-26-72.us-east-2.compute.internal: changed taints
I1001 08:59:36.782568       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating drain
E1001 08:59:37.816907       1 drain_controller.go:162] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-qwlzc, openshift-cluster-node-tuning-operator/tuned-6756p, openshift-dns/dns-default-5csfg, openshift-dns/node-resolver-5hchp, openshift-image-registry/node-ca-rqpq5, openshift-ingress-canary/ingress-canary-lr2n4, openshift-insights/insights-runtime-extractor-d7f6r, openshift-machine-config-operator/machine-config-daemon-hd87q, openshift-monitoring/node-exporter-8tlt4, openshift-multus/multus-5fn5z, openshift-multus/multus-additional-cni-plugins-4zpdp, openshift-multus/network-metrics-daemon-dclzs, openshift-network-diagnostics/network-check-target-jmvcq, openshift-network-operator/iptables-alerter-vf9f8, openshift-ovn-kubernetes/ovnkube-node-2cwvr
I1001 08:59:37.818345       1 drain_controller.go:162] evicting pod openshift-image-registry/image-registry-f7f4655fb-pnt6c
  6. The configuration is properly applied on all nodes

If we repeat the steps in a cluster without the fix, the controller is not able to cordon the node after removing the webhook.
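
For reference, a minimal sketch of what the denying webhook handler from step 1 could look like (hypothetical package and names; the actual example used for this verification is the linked repository above):

package webhookexample

import (
    "encoding/json"
    "net/http"

    admissionv1 "k8s.io/api/admission/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// denyUnschedulableChange rejects any node UPDATE that changes
// .spec.unschedulable, producing the kind of denial message seen in the
// controller logs above. Error handling is deliberately minimal.
func denyUnschedulableChange(w http.ResponseWriter, r *http.Request) {
    var review admissionv1.AdmissionReview
    if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
        http.Error(w, "invalid AdmissionReview", http.StatusBadRequest)
        return
    }
    var oldNode, newNode corev1.Node
    _ = json.Unmarshal(review.Request.OldObject.Raw, &oldNode)
    _ = json.Unmarshal(review.Request.Object.Raw, &newNode)

    resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
    if oldNode.Spec.Unschedulable != newNode.Spec.Unschedulable {
        resp.Allowed = false
        resp.Result = &metav1.Status{Message: "Changing .spec.unschedulable on node is forbidden."}
    }
    review.Response = resp
    _ = json.NewEncoder(w).Encode(review)
}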

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Oct 1, 2025
@pablintino
Contributor

pablintino commented Oct 1, 2025

/jira refresh
/lgtm

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 1, 2025
@openshift-ci-robot
Contributor

@pablintino: This pull request references Jira Issue OCPBUGS-62341, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

/jira refresh
/lgtm
/verified by @sergiordlr as per #5305 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 1, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chizhang21, pablintino

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2025
@pablintino
Contributor

/verified by @sergiordlr as per #5305 (comment)

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 1, 2025
@openshift-ci-robot
Contributor

@pablintino: This PR has been marked as verified by @sergiordlr as per https://github.com/openshift/machine-config-operator/pull/5305#issuecomment-3355556420.

In response to this:

/verified by @sergiordlr as per #5305 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@chizhang21 chizhang21 changed the title from "OCPBUGS-62341: Ensure node's Unschedulable state stays in sync" to "OCPBUGS-62341: Ensure the node passed to RunCordonOrUncordon comes from the latest updated state" on Oct 1, 2025
@openshift-ci-robot
Contributor

@chizhang21: This pull request references Jira Issue OCPBUGS-62341, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did
Ensure the node passed to RunCordonOrUncordon comes from the latest updated state

- How to verify it
Trigger a node cordon/uncordon. When the first cordon/uncordon request fails, subsequent RunCordonOrUncordon calls skip PatchOrReplaceWithContext and return nil, so the MCC hangs waiting for the node state to change.
With this change, the MCC works as expected.

- Description for the changelog
CordonHelper.PatchOrReplaceWithContext() always sets the node's in-memory Unschedulable field to the desired value, even if the PATCH/UPDATE request fails

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci bot commented Oct 1, 2025

@chizhang21: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-mco-disruptive 0c647cc link false /test e2e-gcp-mco-disruptive
ci/prow/e2e-gcp-op-ocl 0c647cc link false /test e2e-gcp-op-ocl
ci/prow/e2e-aws-mco-disruptive 0c647cc link false /test e2e-aws-mco-disruptive
ci/prow/e2e-azure-ovn-upgrade-out-of-change 0c647cc link false /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/bootstrap-unit 0c647cc link false /test bootstrap-unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 1c25a90 into openshift:main Oct 1, 2025
17 of 22 checks passed
@openshift-ci-robot
Contributor

@chizhang21: Jira Issue Verification Checks: Jira Issue OCPBUGS-62341
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-62341 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

In response to this:

- What I did
Ensure the node passed to RunCordonOrUncordon comes from the latest updated state

- How to verify it
Trigger a node cordon/uncordon. When the first cordon/uncordon request fails, subsequent RunCordonOrUncordon calls skip PatchOrReplaceWithContext and return nil, so the MCC hangs waiting for the node state to change.
With this change, the MCC works as expected.

- Description for the changelog
CordonHelper.PatchOrReplaceWithContext() always sets the node's in-memory Unschedulable field to the desired value, even if the PATCH/UPDATE request fails

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.21.0-0.nightly-2025-10-02-002727

@chizhang21
Contributor Author

Hi Pablo, could we backport this fix to 4.19.z and 4.20.z?

@chizhang21
Contributor Author

/cherry-pick release-4.19

@chizhang21
Contributor Author

/cherry-pick release-4.20

@openshift-cherrypick-robot

@chizhang21: only openshift org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually.

In response to this:

/cherry-pick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-cherrypick-robot

@chizhang21: only openshift org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually.

In response to this:

/cherry-pick release-4.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pablintino
Contributor

Sorry @chizhang21, I missed your comment, I'll do the cherry-pick

@pablintino
Contributor

/cherry-pick release-4.19
/cherry-pick release-4.20

@openshift-cherrypick-robot

@pablintino: new pull request created: #5348

In response to this:

/cherry-pick release-4.19
/cherry-pick release-4.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-cherrypick-robot

@pablintino: new pull request created: #5349

In response to this:

/cherry-pick release-4.19
/cherry-pick release-4.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chizhang21
Contributor Author

Sorry @chizhang21, I missed your comment, I'll do the cherry-pick

Thank you!
