[release-4.18] OCPBUGS-58450: Failing=Unknown upon long CO updating #1211
Conversation
When it takes too long (90m+ for machine-config and 30m+ for others) to upgrade a cluster operator, ClusterVersion now shows a message indicating that the upgrade might have hit an issue. This covers the case in the related OCPBUGS-23538: for some reason, the pod under the deployment that manages the CO hits CrashLoopBackOff. The Deployment controller does not surface useful conditions in this situation [1]; otherwise, checkDeploymentHealth [2] would detect it. Rather than having the CVO figure out the underlying pod's CrashLoopBackOff (which is arguably better implemented by the Deployment controller), the expectation is that the cluster admin starts digging into the cluster when such a message pops up. In addition to the condition's message, we propagate Failing=Unknown to make the state available to other automation, such as the update-status command.
[1] kubernetes/kubernetes#106054
[2] https://github.com/openshift/cluster-version-operator/blob/08c0459df5096e9f16fad3af2831b62d06d415ee/lib/resourcebuilder/apps.go#L79-L136
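As a quick way to see the new state from the CLI, the condition can be read straight off the ClusterVersion object; a minimal sketch (the status/reason values mirror the ones exercised in the testing notes further down):

```console
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].status} {.status.conditions[?(@.type=="Failing")].reason}{"\n"}'
Unknown SlowClusterOperator
```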
@openshift-cherrypick-robot: Jira Issue OCPBUGS-23514 has been cloned as Jira Issue OCPBUGS-58450. Will retitle bug to link to clone. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-58450, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: openshift-cherrypick-robot, wking The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@openshift-cherrypick-robot: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Test Scenario: Failing=Unknown when a slow update (>30 minutes) happens for the image-registry operator.
Step 2: Create a build release with the image-registry magic PR (openshift/machine-config-operator#5170).
Step 3: Upgrade to the newly created build.
Step 4: The upgrade should get triggered and proceed normally.
Step 5: After some time (approximately >30 minutes for image-registry), CVO should report Failing=Unknown, reason=SlowClusterOperator, and Progressing=True.
The upgrade should never complete and should keep progressing.
Test Scenario: Failing=Unknown when a slow update (>90 minutes) happens for the MCO operator.
Step 2: Create a build release with the MCO magic PR (openshift/machine-config-operator#5170).
Step 3: Upgrade to the newly created build.
Step 4: The upgrade should get triggered and proceed normally.
Step 5: After some time (approximately >90 minutes for machine-config), CVO should report Failing=Unknown, reason=SlowClusterOperator, and Progressing=True (see the example commands after these steps).
The upgrade should never complete and should keep progressing.
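To automate the Step 5 check in either scenario, a blocking wait along these lines should work; this is a sketch, assuming the `oc wait` client in use supports condition-value matching, with the timeout padded beyond the 30-/90-minute thresholds:

```console
$ oc wait clusterversion/version --for='condition=Failing=Unknown' --timeout=100m
$ oc wait clusterversion/version --for='condition=Progressing=True' --timeout=1m
```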
```go
if failure != nil &&
	strings.HasPrefix(progressReason, slowCOUpdatePrefix) {
	failingCondition.Status = configv1.ConditionUnknown
	failingCondition.Reason = "SlowClusterOperator"
```
More testing notes, since we had some questions about metrics handling. `launch 4.18 aws` to Cluster Bot (logs) and a new payload with this change via `build 4.18,openshift/cluster-version-operator#1211` (logs). Cordon compute, to block ingress, registry, etc. from being able to roll out new workloads:

```console
$ oc adm cordon -l node-role.kubernetes.io/worker=
```
Then launch the update, using `--force` to blast through the use of a by-tag `:latest` pullspec (it's a throwaway cluster, I'll accept the risk that the registry changes what `latest` points at) and the lack of a signature (no production signatures on CI-built release images):

```console
$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
```
I didn't set `no-spot` when requesting the cluster, so run a cordon again once this new Machine comes up:
```console
$ oc -n openshift-machine-api get machines
NAME                                                PHASE         TYPE         REGION      ZONE         AGE
ci-ln-lxqwzpk-76ef8-95mmn-master-0                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-1                  Running       m6a.xlarge   us-east-2   us-east-2a   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-2                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2a-zlzqq   Running       m6a.xlarge   us-east-2   us-east-2a   31m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-78rc8   Provisioned   m6a.xlarge   us-east-2   us-east-2b   4m23s
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-s9nm7   Running       m6a.xlarge   us-east-2   us-east-2b   31m
$ oc -n openshift-machine-api get machines
NAME                                                PHASE         TYPE         REGION      ZONE         AGE
ci-ln-lxqwzpk-76ef8-95mmn-master-0                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-1                  Running       m6a.xlarge   us-east-2   us-east-2a   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-2                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2a-zlzqq   Running       m6a.xlarge   us-east-2   us-east-2a   32m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-78rc8   Running       m6a.xlarge   us-east-2   us-east-2b   4m51s
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-s9nm7   Running       m6a.xlarge   us-east-2   us-east-2b   32m
$ oc adm cordon -l node-role.kubernetes.io/worker=
node/ip-10-0-124-53.us-east-2.compute.internal cordoned
node/ip-10-0-27-221.us-east-2.compute.internal already cordoned
node/ip-10-0-64-26.us-east-2.compute.internal already cordoned
```
And the update gets stuck on the cordon, but the image-registry operator complains about the sticking (good for admins, but it means we don't exercise this `SlowClusterOperator` logic):
```console
$ oc adm upgrade
Failing=True:
  Reason: ClusterOperatorDegraded
  Message: Cluster operator image-registry is degraded
info: An upgrade is in progress. Unable to apply 4.18.0-0-2025-07-09-171308-test-ci-ln-k5kkkjt-latest: an unknown error has occurred: MultipleErrors
```
```console
$ oc get -o json clusteroperator image-registry | jq '.status.conditions[] | select(.type == "Degraded")'
{
  "lastTransitionTime": "2025-07-09T18:28:57Z",
  "message": "Degraded: Registry deployment has timed out progressing: ReplicaSet \"image-registry-54757dcb9f\" has timed out progressing.",
  "reason": "ProgressDeadlineExceeded",
  "status": "True",
  "type": "Degraded"
}
```
I might need to use openshift/cluster-image-registry-operator#1227, or try again with ClusterVersion `overrides`.
Ok, trying on a fresh Cluster Bot `launch 4.17.27 aws` (logs) with an `overrides` entry instead of the cordoning, to try and get the stale-but-nominally-happy ClusterOperator content needed to trigger `SlowClusterOperator`:

```console
$ oc patch clusterversion.config.openshift.io version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps", "namespace": "openshift-image-registry", "name": "cluster-image-registry-operator", "unmanaged": true}]}]'
```
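As a quick sanity check (not part of the original notes) that the override landed before launching the update:

```console
$ oc get clusterversion version -o jsonpath='{.spec.overrides[0].unmanaged} {.spec.overrides[0].name}{"\n"}'
true cluster-image-registry-operator
```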
Then launch the update, using `--force` to blast through the use of a by-tag `:latest` pullspec (it's a throwaway cluster, I'll accept the risk that the registry changes what `latest` points at) and the lack of a signature (no production signatures on CI-built release images):

```console
$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
```
And some time later:
```console
$ oc adm upgrade
Failing=Unknown:
  Reason: SlowClusterOperator
  Message: waiting on image-registry over 30 minutes which is longer than expected
info: An upgrade is in progress. Working towards 4.18.0-0-2025-07-09-171308-test-ci-ln-k5kkkjt-latest: 708 of 901 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected
...
```
```console
$ oc get -o json clusterversion version | jq '.status.conditions[] | select(.type == "Failing")'
{
  "lastTransitionTime": "2025-07-09T20:33:52Z",
  "message": "waiting on image-registry over 30 minutes which is longer than expected",
  "reason": "SlowClusterOperator",
  "status": "Unknown",
  "type": "Failing"
}
```
And in platform monitoring:

`max by (reason) (cluster_operator_conditions{name="version",condition="Failing"}) + 0.1`

shows the `Failing` metric going away when it moves from `Failing=False` to `Failing=Unknown`, ~30m after `group by (version) (cluster_operator_up{name="console"})` shows the also-runlevel-50 `console` ClusterOperator finishing its update (by which point, we'll also have been waiting on the `image-registry` ClusterOperator for ~30m). That's also when `ClusterOperatorDegraded` starts firing for `name="version"`:

`group by (alertname, name, reason) (ALERTS{alertname="ClusterOperatorDegraded"})`

And while we don't have a great `reason` like "because the metric we expect to see has disappeared", the alert description is probably still workable:
Hi @wking, thanks for the testing. Do we need to take some action for the "metric we expect to see has disappeared" issue, or is it OK to proceed with the current metric representation?
Thanks @wking, saw the Slack comments: https://redhat-internal.slack.com/archives/CJ1J9C3V4/p1752094519486299?thread_ts=1751905557.779399&cid=CJ1J9C3V4
Test Scenario: Regression test that the Failing=True condition still works while an operator (authentication) is degraded.
Step 2: Trigger an upgrade to a version which contains the changes.
Step 3: The upgrade is triggered, CVO throws an error after some time for the authentication operator, and the upgrade gets stuck with "waiting up to 40 minutes on authentication".
Step 4: Check the CVO status: Failing!=Unknown and Progressing=True.
Step 5: Enable upgrade status and check that oc adm upgrade status reports the authentication operator as degraded (see the sketch after these steps).
Step 6: Un-degrade the authentication CO.
Step 7: The CVO error should now disappear and the upgrade should resume.
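A sketch of the checks in Steps 4 and 5, assuming the release in use still gates the status command behind the `OC_ENABLE_CMD_UPGRADE_STATUS` environment variable:

```console
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].status}{"\n"}'
True
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status
```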
/label qe-approved
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-58450, which is valid. 7 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cc @jiajliu
/assign @petr-muller
@PraffulKapse Unit tests cover the scenarios below related to slow operators, so we don't have to execute the tests manually, which would just be repeated work.
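For reference, those unit tests can be run locally with something like the following; the package path and test-name filter are assumptions about where the slow-operator logic lives, not confirmed in this PR:

```console
$ go test ./pkg/cvo/... -run SlowClusterOperator -count=1
```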
This is an automated cherry-pick of #1165
/assign petr-muller