
Conversation

chizhang21
Contributor

@chizhang21 chizhang21 commented Sep 24, 2025

- What I did
Ensure the node passed to RunCordonOrUncordon comes from the latest updated state

- How to verify it
Trigger a node cordon/uncordon. When the first cordon/uncordon request fails, subsequent RunCordonOrUncordon calls skip PatchOrReplaceWithContext and return nil, so the MCC hangs waiting for the node state to change.
With this change, the MCC works as expected.

- Description for the changelog
CordonHelper.PatchOrReplaceWithContext() always sets the node's in-memory Unschedulable field to the desired value, even if the PATCH/UPDATE request fails
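
For context, a minimal sketch of the failure mode described above, assuming a retry loop similar to the MCC's drain controller (the package, function name, backoff values, and client wiring are illustrative placeholders, not the actual MCO code):

package cordonexample

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
    "k8s.io/kubectl/pkg/drain"
)

// buggyCordon reuses the same in-memory node on every retry. If the first
// PATCH/UPDATE is rejected (for example by an admission webhook), the drain
// helper has already set node.Spec.Unschedulable to the desired value in
// memory, so every later RunCordonOrUncordon call decides nothing needs to
// change, returns nil without contacting the API server, and this loop keeps
// polling a cluster state that never changes.
func buggyCordon(ctx context.Context, client kubernetes.Interface, helper *drain.Helper, node *corev1.Node) error {
    backoff := wait.Backoff{Duration: 10 * time.Second, Factor: 2, Steps: 5} // illustrative values
    return wait.ExponentialBackoff(backoff, func() (bool, error) {
        if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
            return false, nil // request failed, retry
        }
        // Re-read the node to confirm the cordon actually landed in the cluster.
        updated, err := client.CoreV1().Nodes().Get(ctx, node.Name, metav1.GetOptions{})
        if err != nil {
            return false, nil
        }
        return updated.Spec.Unschedulable, nil // stays false forever after a failed first attempt
    })
}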

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 24, 2025
@openshift-ci
Contributor

openshift-ci bot commented Sep 24, 2025

Hi @chizhang21. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chizhang21
Contributor Author

Hi @yuqi-zhang @pablintino , could you please take a look at this PR? I’m available for any follow-ups. Thanks!

}

// make sure local node's Unschedulable state is updated
node.Spec.Unschedulable = updatedNode.Spec.Unschedulable
Contributor

@pablintino pablintino Sep 24, 2025

While I think what you propose does make sense from a functional point of view, I have a few comments:

  • The node variable is not local; we are modifying a reference and thus changing it for the caller too, and I wouldn't expect this function to write to the reference I pass. The underlying cordon/uncordon logic only does reads, and that's what I'd expect from this function too.
  • Following from the previous point: why don't we always consume the updated reference inside the ExponentialBackoff? In the first loop it can be initialized with the node reference to make it work (roughly as sketched below).
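
A minimal sketch of that suggestion, under the same assumptions as the sketch in the PR description (function name and backoff values are placeholders, not the actual MCO code): keep a local reference that starts as the caller's node and is refreshed from the API server on every iteration, so the caller's copy is never written to and RunCordonOrUncordon always sees the latest state.

// Same package and imports as the earlier sketch.
func cordonWithFreshNode(ctx context.Context, client kubernetes.Interface, helper *drain.Helper, node *corev1.Node, desired bool) error {
    backoff := wait.Backoff{Duration: 10 * time.Second, Factor: 2, Steps: 5} // illustrative values
    current := node // the first iteration uses the reference passed by the caller
    return wait.ExponentialBackoff(backoff, func() (bool, error) {
        cordonErr := drain.RunCordonOrUncordon(helper, current, desired)
        // Always refresh from the API server so the next attempt (and the
        // success check) operates on the latest state, never on a stale copy.
        updated, getErr := client.CoreV1().Nodes().Get(ctx, current.Name, metav1.GetOptions{})
        if getErr == nil {
            current = updated
        }
        if cordonErr != nil || getErr != nil {
            return false, nil // retry
        }
        return current.Spec.Unschedulable == desired, nil
    })
}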

Contributor Author

That makes sense. I’ll update it a bit later.

Contributor Author

Thanks for the feedback—updates pushed. Could you take another look?

return false, nil
}

// make sure local node's Unschedulable state is updated
Contributor

Do we have an OCPBUGS ticket to track this? To which versions does this apply?

Contributor Author

@chizhang21 chizhang21 Sep 25, 2025

A related bug hasn't been created yet. We encountered this issue on version 4.19, so I expect it to be fixed in 4.19 and later versions.

Contributor Author

Would you like me to create an OCPBUGS ticket?

Contributor

Yes please, because our QE will need to test it, and our automation requires it in order to include the fix in release branches.

Contributor Author

Hi Pablo, ticket OCPBUGS-62341 has been created: https://issues.redhat.com/browse/OCPBUGS-62341
Please review when convenient. Thank you!

@chizhang21 chizhang21 force-pushed the cordon_state_enhance branch 4 times, most recently from f049755 to 75aef33 on September 27, 2025 13:09
@chizhang21 chizhang21 changed the title from "Ensure node's Unschedulable state stays in sync" to "OCPBUGS-62341: Ensure node's Unschedulable state stays in sync" on Sep 29, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 29, 2025
@openshift-ci-robot
Contributor

@chizhang21: This pull request references Jira Issue OCPBUGS-62341, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did
Update node's Unschedulable state to keep it consistent with the node status in the cluster

- How to verify it
Trigger a node cordon/uncordon. When the first cordon/uncordon request fails, subsequent RunCordonOrUncordon calls skip PatchOrReplaceWithContext and return nil, so the MCC hangs waiting for the node state to change.
With this change, the MCC works as expected.

- Description for the changelog
CordonHelper.PatchOrReplaceWithContext() always sets the node's in-memory Unschedulable field to the desired value, even if the PATCH/UPDATE request fails

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@pablintino
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 30, 2025
@chizhang21
Contributor Author

/retest

1 similar comment
@chizhang21
Contributor Author

/retest

@sergiordlr
Contributor

Verified using IPI on AWS

  1. Create a webhook that fails any attempt to change the .spec.unschedulable value on a node, so that all cordon/uncordon operations fail (a sketch of such a handler is included after these verification steps)

This is an example of a webhook failing all cordon/uncordon operations: https://github.com/sergiordlr/temp-testfiles/tree/master/webhook_example

  2. Apply a MachineConfig to make the MCO cordon/uncordon the nodes to apply the config:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-machine-config-0
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,dGVzdA==
        path: /etc/test-file-0.test

  3. Check that the MCO controller cannot cordon the node and starts retrying
I1001 08:57:06.698182       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: cordoning
I1001 08:57:06.698230       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: true)
I1001 08:57:06.720010       1 drain_controller.go:580] cordon failed with: cordon error: admission webhook "unschedulable-webhook.default.svc" denied the request: Changing .spec.unschedulable on node is forbidden., retrying
I1001 08:57:16.720564       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: false)
I1001 08:57:16.723557       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: RunCordonOrUncordon() succeeded but node is still not in cordon state, retrying
I1001 08:57:36.724415       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: true)
I1001 08:57:36.746432       1 drain_controller.go:580] cordon failed with: cordon error: admission webhook "unschedulable-webhook.default.svc" denied the request: Changing .spec.unschedulable on node is forbidden., retrying
I1001 08:58:16.747298       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: false)
  4. Remove the MutatingWebhookConfiguration created in step 1 to allow cordon/uncordon operations to succeed again
  5. Check that the controller can now cordon the node and start applying the config
I1001 08:57:36.724415       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: true)
I1001 08:57:36.746432       1 drain_controller.go:580] cordon failed with: cordon error: admission webhook "unschedulable-webhook.default.svc" denied the request: Changing .spec.unschedulable on node is forbidden., retrying
I1001 08:58:16.747298       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: false)
I1001 08:58:16.751118       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: RunCordonOrUncordon() succeeded but node is still not in cordon state, retrying
I1001 08:59:36.751425       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating cordon (currently schedulable: true)
I1001 08:59:36.761928       1 node_controller.go:676] Pool worker[zone=us-east-2a]: node ip-10-0-26-72.us-east-2.compute.internal: Reporting unready: node ip-10-0-26-72.us-east-2.compute.internal is reporting Unschedulable
I1001 08:59:36.766330       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: cordon succeeded (currently schedulable: false)
I1001 08:59:36.777843       1 node_controller.go:676] Pool worker[zone=us-east-2a]: node ip-10-0-26-72.us-east-2.compute.internal: changed taints
I1001 08:59:36.782568       1 drain_controller.go:192] node ip-10-0-26-72.us-east-2.compute.internal: initiating drain
E1001 08:59:37.816907       1 drain_controller.go:162] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-qwlzc, openshift-cluster-node-tuning-operator/tuned-6756p, openshift-dns/dns-default-5csfg, openshift-dns/node-resolver-5hchp, openshift-image-registry/node-ca-rqpq5, openshift-ingress-canary/ingress-canary-lr2n4, openshift-insights/insights-runtime-extractor-d7f6r, openshift-machine-config-operator/machine-config-daemon-hd87q, openshift-monitoring/node-exporter-8tlt4, openshift-multus/multus-5fn5z, openshift-multus/multus-additional-cni-plugins-4zpdp, openshift-multus/network-metrics-daemon-dclzs, openshift-network-diagnostics/network-check-target-jmvcq, openshift-network-operator/iptables-alerter-vf9f8, openshift-ovn-kubernetes/ovnkube-node-2cwvr
I1001 08:59:37.818345       1 drain_controller.go:162] evicting pod openshift-image-registry/image-registry-f7f4655fb-pnt6c
  6. The configuration is properly applied on all nodes

If we repeat the steps in a cluster without the fix, the controller is not able to cordon the node after removing the webhook.
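
For reference, a minimal sketch of what the denying webhook handler from step 1 could look like (hypothetical package and names; the actual example used for this verification is the linked repository above):

package webhookexample

import (
    "encoding/json"
    "net/http"

    admissionv1 "k8s.io/api/admission/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// denyUnschedulableChange rejects any node UPDATE that changes
// .spec.unschedulable, producing the kind of denial message seen in the
// controller logs above. Error handling is deliberately minimal.
func denyUnschedulableChange(w http.ResponseWriter, r *http.Request) {
    var review admissionv1.AdmissionReview
    if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
        http.Error(w, "invalid AdmissionReview", http.StatusBadRequest)
        return
    }
    var oldNode, newNode corev1.Node
    _ = json.Unmarshal(review.Request.OldObject.Raw, &oldNode)
    _ = json.Unmarshal(review.Request.Object.Raw, &newNode)

    resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
    if oldNode.Spec.Unschedulable != newNode.Spec.Unschedulable {
        resp.Allowed = false
        resp.Result = &metav1.Status{Message: "Changing .spec.unschedulable on node is forbidden."}
    }
    review.Response = resp
    _ = json.NewEncoder(w).Encode(review)
}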

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Oct 1, 2025
@pablintino
Contributor

pablintino commented Oct 1, 2025

/jira refresh
/lgtm

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 1, 2025
@openshift-ci-robot
Contributor

@pablintino: This pull request references Jira Issue OCPBUGS-62341, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

/jira refresh
/lgtm
/verified by @sergiordlr as per #5305 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 1, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chizhang21, pablintino

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2025
@pablintino
Contributor

/verified by @sergiordlr as per #5305 (comment)

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 1, 2025
@openshift-ci-robot
Contributor

@pablintino: This PR has been marked as verified by @sergiordlr as per https://github.com/openshift/machine-config-operator/pull/5305#issuecomment-3355556420.

In response to this:

/verified by @sergiordlr as per #5305 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@chizhang21 chizhang21 changed the title from "OCPBUGS-62341: Ensure node's Unschedulable state stays in sync" to "OCPBUGS-62341: Ensure the node passed to RunCordonOrUncordon comes from the latest updated state" on Oct 1, 2025
@openshift-ci-robot
Contributor

@chizhang21: This pull request references Jira Issue OCPBUGS-62341, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did
Ensure the node passed to RunCordonOrUncordon comes from the latest updated state

- How to verify it
Trigger a node cordon/uncordon. When the first cordon/uncordon request fails, subsequent RunCordonOrUncordon calls skip PatchOrReplaceWithContext and return nil, so the MCC hangs waiting for the node state to change.
With this change, the MCC works as expected.

- Description for the changelog
CordonHelper.PatchOrReplaceWithContext() always sets the node's in-memory Unschedulable field to the desired value, even if the PATCH/UPDATE request fails

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci bot commented Oct 1, 2025

@chizhang21: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-mco-disruptive 0c647cc link false /test e2e-gcp-mco-disruptive
ci/prow/e2e-gcp-op-ocl 0c647cc link false /test e2e-gcp-op-ocl
ci/prow/e2e-aws-mco-disruptive 0c647cc link false /test e2e-aws-mco-disruptive
ci/prow/e2e-azure-ovn-upgrade-out-of-change 0c647cc link false /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/bootstrap-unit 0c647cc link false /test bootstrap-unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 1c25a90 into openshift:main Oct 1, 2025
17 of 22 checks passed
@openshift-ci-robot
Contributor

@chizhang21: Jira Issue Verification Checks: Jira Issue OCPBUGS-62341
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-62341 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

In response to this:

- What I did
Ensure the node passed to RunCordonOrUncordon comes from the latest updated state

- How to verify it
Trigger a node cordon/uncordon. When the first cordon/uncordon request fails, subsequent RunCordonOrUncordon calls skip PatchOrReplaceWithContext and return nil, so the MCC hangs waiting for the node state to change.
With this change, the MCC works as expected.

- Description for the changelog
CordonHelper.PatchOrReplaceWithContext() always sets the node's in-memory Unschedulable field to the desired value, even if the PATCH/UPDATE request fails

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.21.0-0.nightly-2025-10-02-002727

@chizhang21
Contributor Author

Hi Pablo, could we backport this fix to 4.19.z and 4.20.z?

@chizhang21
Contributor Author

/cherry-pick release-4.19

@chizhang21
Contributor Author

/cherry-pick release-4.20

@openshift-cherrypick-robot

@chizhang21: only openshift org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually.

In response to this:

/cherry-pick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-cherrypick-robot

@chizhang21: only openshift org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually.

In response to this:

/cherry-pick release-4.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pablintino
Contributor

Sorry @chizhang21, I missed your comment, I'll do the cherry-pick

@pablintino
Contributor

/cherry-pick release-4.19
/cherry-pick release-4.20

@openshift-cherrypick-robot

@pablintino: new pull request created: #5348

In response to this:

/cherry-pick release-4.19
/cherry-pick release-4.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-cherrypick-robot

@pablintino: new pull request created: #5349

In response to this:

/cherry-pick release-4.19
/cherry-pick release-4.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chizhang21
Contributor Author

Sorry @chizhang21, I missed your comment, I'll do the cherry-pick

Thank you!
