[release-4.18] OCPBUGS-58450: Failing=Unknown upon long CO updating #1211

Open · wants to merge 3 commits into base: release-4.18

Conversation

openshift-cherrypick-robot

This is an automated cherry-pick of #1165

/assign petr-muller

When it takes too long (90+ minutes for machine-config and 30+ minutes
for the others) to upgrade a cluster operator, ClusterVersion shows a
message indicating that the upgrade might have hit an issue.

This covers the case in the related OCPBUGS-23538: for some reason,
the pod under the deployment that manages the CO hits
CrashLoopBackOff. The deployment controller does not surface useful
conditions in this situation [1]; otherwise, checkDeploymentHealth [2]
would detect it.

Rather than having the CVO figure out the underlying pod's
CrashLoopBackOff, which is arguably better implemented by the
deployment controller, the expectation is that a cluster admin
starts digging into the cluster when such a message pops up.

In addition to the condition's message, we propagate Failing=Unknown
to make the state available to other automation, such as the
update-status command.

[1]. kubernetes/kubernetes#106054

[2]. https://github.com/openshift/cluster-version-operator/blob/08c0459df5096e9f16fad3af2831b62d06d415ee/lib/resourcebuilder/apps.go#L79-L136
@openshift-ci-robot (Contributor)

@openshift-cherrypick-robot: Jira Issue OCPBUGS-23514 has been cloned as Jira Issue OCPBUGS-58450. Will retitle bug to link to clone.
/retitle [release-4.18] OCPBUGS-58450: Failing=Unknown upon long CO updating

In response to this:

This is an automated cherry-pick of #1165

/assign petr-muller

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot changed the title [release-4.18] OCPBUGS-23514: Failing=Unknown upon long CO updating [release-4.18] OCPBUGS-58450: Failing=Unknown upon long CO updating Jul 7, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jul 7, 2025
@openshift-ci-robot (Contributor)

@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-58450, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.z) matches configured target version for branch (4.18.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-23514 is in the state Closed (Done-Errata), which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-23514 targets the "4.19.0" version, which is one of the valid target versions: 4.19.0, 4.19.z
  • bug has dependents

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This is an automated cherry-pick of #1165

/assign petr-muller


@openshift-ci openshift-ci bot requested a review from jiajliu July 7, 2025 16:19
@wking wking (Member) left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 7, 2025

openshift-ci bot commented Jul 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: openshift-cherrypick-robot, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 7, 2025

openshift-ci bot commented Jul 7, 2025

@openshift-cherrypick-robot: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-hypershift 1ce0def link true /test e2e-hypershift

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


dis016 commented Jul 9, 2025

Test Scenario: Failing=Unknown when a slow update happens for the image-registry operator for >30 minutes.
Test Status: Pass
Step1: Install a cluster with 4.18.19 and check the status of CVO.

dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.19   True        False         13m     Cluster version is 4.18.19
dinesh@Dineshs-MacBook-Pro ~ % 
dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion version -o yaml | yq '.status.conditions[]|select(.type=="Failing" or .type=="Progressing")'
lastTransitionTime: "2025-07-09T05:58:59Z"
status: "False"
type: Failing
lastTransitionTime: "2025-07-09T05:58:59Z"
message: Cluster version is 4.18.19
status: "False"
type: Progressing
dinesh@Dineshs-MacBook-Pro ~ % 

Step2: Create a release build with the image-registry magic PR (openshift/cluster-image-registry-operator#1227)

build 4.18,openshift/cluster-image-registry-operator#1227,openshift/cluster-version-operator#1211

Step3: Upgrade to newly created build

dinesh@Dineshs-MacBook-Pro ~ % oc adm upgrade --to-image=registry.build06.ci.openshift.org/ci-ln-72wbfn2/release@sha256:5325f54137fbfde8f9945d2f02dea735b2eb49357412ca971a7e21f936645194  --allow-explicit-upgrade --force --allow-upgrade-with-warnings
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-72wbfn2/release@sha256:5325f54137fbfde8f9945d2f02dea735b2eb49357412ca971a7e21f936645194
dinesh@Dineshs-MacBook-Pro ~ % 

Step4: Upgrade should get triggered and proceed normally


dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.19   True        True          34s     Working towards 4.18.0-0-2025-07-08-110128-test-ci-ln-72wbfn2-latest: 1 of 901 done (0% complete)
dinesh@Dineshs-MacBook-Pro ~ % export OC_ENABLE_CMD_UPGRADE_STATUS=true 
dinesh@Dineshs-MacBook-Pro ~ % oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.18.0-0-2025-07-08-110128-test-ci-ln-72wbfn2-latest (from 4.18.19)
Completion:      3% (1 operators updated, 0 updating, 33 waiting)
Duration:        54s (Est. Time Remaining: 1h11m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT   PHASE     VERSION   EST   MESSAGE
ip-10-0-101-12.us-east-2.compute.internal   Outdated     Pending   4.18.19   ?     
ip-10-0-29-0.us-east-2.compute.internal     Outdated     Pending   4.18.19   ?     
ip-10-0-87-207.us-east-2.compute.internal   Outdated     Pending   4.18.19   ?     

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                         ASSESSMENT   PHASE     VERSION   EST   MESSAGE
ip-10-0-109-168.us-east-2.compute.internal   Outdated     Pending   4.18.19   ?     
ip-10-0-122-73.us-east-2.compute.internal    Outdated     Pending   4.18.19   ?     
ip-10-0-2-89.us-east-2.compute.internal      Outdated     Pending   4.18.19   ?     

= Update Health =
SINCE   LEVEL   IMPACT   MESSAGE
54s     Info    None     Update is proceeding well
dinesh@Dineshs-MacBook-Pro ~ % 

Step5: After some time (approximately >30 minutes for image-registry), CVO should report Failing=Unknown with reason=SlowClusterOperator, and Progressing=True

 
dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing" or .type=="Progressing")'
{
  "lastTransitionTime": "2025-07-09T07:43:58Z",
  "message": "waiting on image-registry over 30 minutes which is longer than expected",
  "reason": "SlowClusterOperator",
  "status": "Unknown",
  "type": "Failing"
}
{
  "lastTransitionTime": "2025-07-09T06:49:08Z",
  "message": "Working towards 4.18.0-0-2025-07-08-110128-test-ci-ln-72wbfn2-latest: 709 of 901 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected",
  "reason": "ClusterOperatorUpdating",
  "status": "True",
  "type": "Progressing"
}
dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.19   True        True          64m     Working towards 4.18.0-0-2025-07-08-110128-test-ci-ln-72wbfn2-latest: 709 of 901 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected
dinesh@Dineshs-MacBook-Pro ~ % oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.18.0-0-2025-07-08-110128-test-ci-ln-72wbfn2-latest (from 4.18.19)
Completion:      85% (29 operators updated, 0 updating, 5 waiting)
Duration:        1h4m (Est. Time Remaining: <10m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT   PHASE     VERSION   EST   MESSAGE
ip-10-0-101-12.us-east-2.compute.internal   Outdated     Pending   4.18.19   ?     
ip-10-0-29-0.us-east-2.compute.internal     Outdated     Pending   4.18.19   ?     
ip-10-0-87-207.us-east-2.compute.internal   Outdated     Pending   4.18.19   ?     

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                         ASSESSMENT   PHASE     VERSION   EST   MESSAGE
ip-10-0-109-168.us-east-2.compute.internal   Outdated     Pending   4.18.19   ?     
ip-10-0-122-73.us-east-2.compute.internal    Outdated     Pending   4.18.19   ?     
ip-10-0-2-89.us-east-2.compute.internal      Outdated     Pending   4.18.19   ?     

= Update Health =
SINCE   LEVEL     IMPACT           MESSAGE
9m20s   Warning   Update Stalled   Cluster Version version is failing to proceed with the update (SlowClusterOperator)

Run with --details=health for additional description and links to related online documentation
dinesh@Dineshs-MacBook-Pro ~ %

The upgrade should never complete and should keep progressing.


dis016 commented Jul 9, 2025

Test Scenario: Failing=Unknown when a slow update happens for the machine-config operator for >90 minutes.
Test Status: Passed
Step1: Install a cluster with 4.18.19 and check the status of CVO.

dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.19   True        False         111s    Cluster version is 4.18.19
dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion version -o yaml | yq '.status.conditions[]|select(.type=="Failing" or .type=="Progressing")'
lastTransitionTime: "2025-07-09T13:13:03Z"
status: "False"
type: Failing
lastTransitionTime: "2025-07-09T13:13:03Z"
message: Cluster version is 4.18.19
status: "False"
type: Progressing
dinesh@Dineshs-MacBook-Pro ~ %

Step2: Create a release build with the MCO magic PR (openshift/machine-config-operator#5170)

build 4.18,openshift/machine-config-operator#5170,openshift/cluster-version-operator#1211

Step3: Upgrade to newly created build

dinesh@Dineshs-MacBook-Pro ~ % oc adm upgrade --to-image=registry.build06.ci.openshift.org/ci-ln-v6spdst/release@sha256:1a307203a2a1042576cc82bf76312fa8a99a02848b9c87b141180d6534c4d008  --allow-explicit-upgrade --force --allow-upgrade-with-warnings
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-v6spdst/release@sha256:1a307203a2a1042576cc82bf76312fa8a99a02848b9c87b141180d6534c4d008
dinesh@Dineshs-MacBook-Pro ~ % 

Step4: Upgrade should get triggered and proceed normally

 
dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion                                                                                             
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.19   True        True          7s      Working towards 4.18.0-0-2025-07-09-053449-test-ci-ln-v6spdst-latest: 11 of 900 done (1% complete)
dinesh@Dineshs-MacBook-Pro ~ % export OC_ENABLE_CMD_UPGRADE_STATUS=true 
dinesh@Dineshs-MacBook-Pro ~ % oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.18.0-0-2025-07-09-053449-test-ci-ln-v6spdst-latest (from 4.18.19)
Completion:      0% (0 operators updated, 0 updating, 34 waiting)
Duration:        21s (Est. Time Remaining: 1h12m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT   PHASE     VERSION   EST   MESSAGE
ip-10-0-54-233.us-west-1.compute.internal   Outdated     Pending   4.18.19   ?     
ip-10-0-76-13.us-west-1.compute.internal    Outdated     Pending   4.18.19   ?     
ip-10-0-82-206.us-west-1.compute.internal   Outdated     Pending   4.18.19   ?     

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                        ASSESSMENT   PHASE     VERSION   EST   MESSAGE
ip-10-0-100-88.us-west-1.compute.internal   Outdated     Pending   4.18.19   ?     
ip-10-0-116-19.us-west-1.compute.internal   Outdated     Pending   4.18.19   ?     
ip-10-0-16-139.us-west-1.compute.internal   Outdated     Pending   4.18.19   ?     

= Update Health =
SINCE   LEVEL   IMPACT   MESSAGE
21s     Info    None     Update is proceeding well
dinesh@Dineshs-MacBook-Pro ~ % 

Step5: After some time (approximately >90 minutes for machine-config), CVO should report Failing=Unknown with reason=SlowClusterOperator, and Progressing=True


dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.19   True        True          148m    Working towards 4.18.0-0-2025-07-09-053449-test-ci-ln-v6spdst-latest: 773 of 900 done (85% complete), waiting on machine-config over 90 minutes which is longer than expected
dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing" or .type=="Progressing")'
{
  "lastTransitionTime": "2025-07-09T15:42:21Z",
  "message": "waiting on machine-config over 90 minutes which is longer than expected",
  "reason": "SlowClusterOperator",
  "status": "Unknown",
  "type": "Failing"
}
{
  "lastTransitionTime": "2025-07-09T13:15:43Z",
  "message": "Working towards 4.18.0-0-2025-07-09-053449-test-ci-ln-v6spdst-latest: 773 of 900 done (85% complete), waiting on machine-config over 90 minutes which is longer than expected",
  "reason": "ClusterOperatorUpdating",
  "status": "True",
  "type": "Progressing"
}
dinesh@Dineshs-MacBook-Pro ~ % oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Stalled
Target Version:  4.18.0-0-2025-07-09-053449-test-ci-ln-v6spdst-latest (from 4.18.19)
Completion:      97% (33 operators updated, 0 updating, 1 waiting)
Duration:        2h29m (Est. Time Remaining: N/A; estimate duration was 1h48m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT   PHASE     VERSION   EST   MESSAGE
ip-10-0-54-233.us-west-1.compute.internal   Outdated     Pending   4.18.19   ?     
ip-10-0-76-13.us-west-1.compute.internal    Outdated     Pending   4.18.19   ?     
ip-10-0-82-206.us-west-1.compute.internal   Outdated     Pending   4.18.19   ?     

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                        ASSESSMENT   PHASE     VERSION   EST   MESSAGE
ip-10-0-100-88.us-west-1.compute.internal   Outdated     Pending   4.18.19   ?     
ip-10-0-116-19.us-west-1.compute.internal   Outdated     Pending   4.18.19   ?     
ip-10-0-16-139.us-west-1.compute.internal   Outdated     Pending   4.18.19   ?     

= Update Health =
SINCE   LEVEL     IMPACT           MESSAGE
2m      Warning   Update Stalled   Cluster Version version is failing to proceed with the update (SlowClusterOperator)

Run with --details=health for additional description and links to related online documentation
dinesh@Dineshs-MacBook-Pro ~ % oc get co machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.18.19   True        False         False      173m    
dinesh@Dineshs-MacBook-Pro ~ %

The upgrade should never complete and should keep progressing.

```go
if failure != nil &&
	strings.HasPrefix(progressReason, slowCOUpdatePrefix) {
	failingCondition.Status = configv1.ConditionUnknown
	failingCondition.Reason = "SlowClusterOperator"
}
```
@wking wking (Member) Jul 9, 2025

More testing notes, since we had some questions about metrics handling. Sent launch 4.18 aws to Cluster Bot (logs) and built a new payload with this change via build 4.18,openshift/cluster-version-operator#1211 (logs). Cordon compute, to block ingress, registry, etc. from being able to roll out new workloads:

$ oc adm cordon -l node-role.kubernetes.io/worker=

Then launch the update, using --force to blast through the use of a by-tag :latest pullspec (it's a throw away cluster, I'll accept the risk that the registry changes what latest points at) and the lack of signature (no production signatures on CI-built release images):

$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest

I didn't set no-spot when requesting the cluster, so I run the cordon again once the new Machine comes up:

$ oc -n openshift-machine-api get machines
NAME                                                PHASE         TYPE         REGION      ZONE         AGE
ci-ln-lxqwzpk-76ef8-95mmn-master-0                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-1                  Running       m6a.xlarge   us-east-2   us-east-2a   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-2                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2a-zlzqq   Running       m6a.xlarge   us-east-2   us-east-2a   31m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-78rc8   Provisioned   m6a.xlarge   us-east-2   us-east-2b   4m23s
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-s9nm7   Running       m6a.xlarge   us-east-2   us-east-2b   31m
$ oc -n openshift-machine-api get machines
NAME                                                PHASE     TYPE         REGION      ZONE         AGE
ci-ln-lxqwzpk-76ef8-95mmn-master-0                  Running   m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-1                  Running   m6a.xlarge   us-east-2   us-east-2a   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-2                  Running   m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2a-zlzqq   Running   m6a.xlarge   us-east-2   us-east-2a   32m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-78rc8   Running   m6a.xlarge   us-east-2   us-east-2b   4m51s
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-s9nm7   Running   m6a.xlarge   us-east-2   us-east-2b   32m
$ oc adm cordon -l node-role.kubernetes.io/worker=
node/ip-10-0-124-53.us-east-2.compute.internal cordoned
node/ip-10-0-27-221.us-east-2.compute.internal already cordoned
node/ip-10-0-64-26.us-east-2.compute.internal already cordoned

And the update gets stuck on the cordon, but the image-registry operator complains about the sticking (good for admins, but it means we don't exercise this SlowClusterOperator logic).

$ oc adm upgrade
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator image-registry is degraded

info: An upgrade is in progress. Unable to apply 4.18.0-0-2025-07-09-171308-test-ci-ln-k5kkkjt-latest: an unknown error has occurred: MultipleErrors

```console
$ oc get -o json clusteroperator image-registry | jq '.status.conditions[] | select(.type == "Degraded")'
{
  "lastTransitionTime": "2025-07-09T18:28:57Z",
  "message": "Degraded: Registry deployment has timed out progressing: ReplicaSet \"image-registry-54757dcb9f\" has timed out progressing.",
  "reason": "ProgressDeadlineExceeded",
  "status": "True",
  "type": "Degraded"
}
```

I might need to use openshift/cluster-image-registry-operator#1227, or try again with ClusterVersion overrides.

@wking (Member)

Ok, trying on a fresh Cluster Bot launch 4.17.27 aws (logs) with an overrides entry instead of the cordoning, to try and get the stale-but-nominally-happy ClusterOperator content needed to trigger SlowClusterOperator:

$ oc patch clusterversion.config.openshift.io version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps", "namespace": "openshift-image-registry", "name": "cluster-image-registry-operator", "unmanaged": true}]}]'

Then launch the update, using --force to blast through the use of a by-tag :latest pullspec (it's a throw away cluster, I'll accept the risk that the registry changes what latest points at) and the lack of signature (no production signatures on CI-built release images):

$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest

And some time later:

$ oc adm upgrade
Failing=Unknown:

  Reason: SlowClusterOperator
  Message: waiting on image-registry over 30 minutes which is longer than expected

info: An upgrade is in progress. Working towards 4.18.0-0-2025-07-09-171308-test-ci-ln-k5kkkjt-latest: 708 of 901 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

...
$ oc get -o json clusterversion version | jq '.status.conditions[] | select(.type == "Failing")'
{
  "lastTransitionTime": "2025-07-09T20:33:52Z",
  "message": "waiting on image-registry over 30 minutes which is longer than expected",
  "reason": "SlowClusterOperator",
  "status": "Unknown",
  "type": "Failing"
}

And in platform monitoring:

max by (reason) (cluster_operator_conditions{name="version",condition="Failing"}) + 0.1

shows the Failing metric going away when it moves from Failing=False to Failing=Unknown, ~30m after group by (version) (cluster_operator_up{name="console"}) shows the also-runlevel-50 console ClusterOperator finishing its update (by which point, we'll also have been waiting on the image-registry ClusterOperator for ~30m).

[screenshot]
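
A plausible model for the disappearing series (a hypothetical sketch; the CVO's actual metric export code may differ): the cluster_operator_conditions gauge only publishes a sample for True (1) and False (0), so a condition that moves to Unknown simply stops being exported:

```go
package main

import "fmt"

type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// conditionValue maps a condition status to a metric sample. The boolean
// reports whether a sample is published at all; Unknown yields none, which
// would explain why the Failing series vanishes in the graph above.
func conditionValue(status ConditionStatus) (float64, bool) {
	switch status {
	case ConditionTrue:
		return 1, true
	case ConditionFalse:
		return 0, true
	default: // Unknown: no sample published
		return 0, false
	}
}

func main() {
	for _, s := range []ConditionStatus{ConditionTrue, ConditionFalse, ConditionUnknown} {
		v, ok := conditionValue(s)
		fmt.Println(s, v, ok)
	}
}
```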

That's also when ClusterOperatorDegraded starts firing for name="version":

group by (alertname, name, reason) (ALERTS{alertname="ClusterOperatorDegraded"})

[screenshot]

And while we don't have a great reason like "because the metric we expect to see has disappeared", the alert description is probably still workable:

[screenshot]

@dis016 dis016 Jul 10, 2025

Hi @wking, thanks for the testing. Do we need to take some action for the "metric we expect to see has disappeared" case, or is it OK to proceed with the current metric representation?



dis016 commented Jul 10, 2025

Test Scenario: Regression test that the Failing=True condition still works during operator degradation (authentication).
Test Status: Passed
Step1: Install a cluster and degrade the authentication CO


dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.nightly-2025-07-05-213851   True        False         21m     Cluster version is 4.18.0-0.nightly-2025-07-05-213851
dinesh@Dineshs-MacBook-Pro ~ % 
dinesh@Dineshs-MacBook-Pro ~ % cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp 
    mappingMethod: claim 
    type: OpenID
    openID:
      clientID: test
      clientSecret: 
        name: test
      claims: 
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com
dinesh@Dineshs-MacBook-Pro ~ % 
dinesh@Dineshs-MacBook-Pro ~ % oc apply -f oauth.yaml 
Warning: resource oauths/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
oauth.config.openshift.io/cluster configured
dinesh@Dineshs-MacBook-Pro ~ % 


dinesh@Dineshs-MacBook-Pro ~ % oc get co authentication 
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.18.0-0.nightly-2025-07-05-213851   True        False         True       26m     OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
dinesh@Dineshs-MacBook-Pro ~ % 

Step2: Trigger an upgrade to a version that contains the changes.

dinesh@Dineshs-MacBook-Pro ~ % oc adm upgrade --to-image=registry.build06.ci.openshift.org/ci-ln-83f28bb/release@sha256:986f2b451e59238eeb6fdc1ba7aa01789391cc2529efd9fe571f772577dc7f05 --allow-explicit-upgrade --force --allow-upgrade-with-warnings

warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is experiencing an error reconciling "4.18.0-0.nightly-2025-07-05-213851":

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded
Requested update to release image registry.build06.ci.openshift.org/ci-ln-83f28bb/release@sha256:986f2b451e59238eeb6fdc1ba7aa01789391cc2529efd9fe571f772577dc7f05
dinesh@Dineshs-MacBook-Pro ~ % 

Step3: The upgrade is triggered; after some time CVO throws an error for the authentication operator, and the upgrade gets stuck waiting up to 40 minutes on authentication.

 
dinesh@Dineshs-MacBook-Pro ~ % while true; do oc adm upgrade; oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")' ; sleep 60; done
info: An upgrade is in progress. Working towards 4.18.0-0-2025-07-09-082113-test-ci-ln-83f28bb-latest: 111 of 901 done (12% complete), waiting on etcd, kube-apiserver

Upgradeable=False

  Reason: AdminAckRequired
  Message: Kubernetes 1.32 and therefore OpenShift 4.19 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/7112216 for details and instructions. This cluster is GCP or AWS but lacks a boot image configuration. OCP will automatically opt this cluster into boot image management in 4.19. Please add a configuration to disable boot image updates if this is not desired. See https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/machine_configuration/mco-update-boot-images#mco-update-boot-images-disable_machine-configs-configure for more details.

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2025-07-10T05:08:57Z",
  "status": "False",
  "type": "Failing"
}
...
...
...
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

info: An upgrade is in progress. Unable to apply 4.18.0-0-2025-07-09-082113-test-ci-ln-83f28bb-latest: an unknown error has occurred: MultipleErrors

Upgradeable=False

  Reason: AdminAckRequired
  Message: Kubernetes 1.32 and therefore OpenShift 4.19 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/7112216 for details and instructions. This cluster is GCP or AWS but lacks a boot image configuration. OCP will automatically opt this cluster into boot image management in 4.19. Please add a configuration to disable boot image updates if this is not desired. See https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/machine_configuration/mco-update-boot-images#mco-update-boot-images-disable_machine-configs-configure for more details.

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2025-07-10T05:33:39Z",
  "message": "Cluster operator authentication is degraded",
  "reason": "ClusterOperatorDegraded",
  "status": "True",
  "type": "Failing"
}
info: An upgrade is in progress. Working towards 4.18.0-0-2025-07-09-082113-test-ci-ln-83f28bb-latest: 708 of 901 done (78% complete), waiting up to 40 minutes on authentication

Upgradeable=False

  Reason: AdminAckRequired
  Message: Kubernetes 1.32 and therefore OpenShift 4.19 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/7112216 for details and instructions. This cluster is GCP or AWS but lacks a boot image configuration. OCP will automatically opt this cluster into boot image management in 4.19. Please add a configuration to disable boot image updates if this is not desired. See https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/machine_configuration/mco-update-boot-images#mco-update-boot-images-disable_machine-configs-configure for more details.

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2025-07-10T05:41:54Z",
  "status": "False",
  "type": "Failing"
}

Step4: Check that the CVO status reports Failing!=Unknown and Progressing=True

 
dinesh@Dineshs-MacBook-Pro ~ % oc get clusterversion version -o yaml | yq '.status.conditions[]|select(.type=="Failing" or .type=="Progressing")'
lastTransitionTime: "2025-07-10T05:41:54Z"
status: "False"
type: Failing
lastTransitionTime: "2025-07-10T05:08:53Z"
message: 'Working towards 4.18.0-0-2025-07-09-082113-test-ci-ln-83f28bb-latest: 708 of 901 done (78% complete), waiting up to 40 minutes on authentication'
reason: ClusterOperatorDegraded
status: "True"
type: Progressing
dinesh@Dineshs-MacBook-Pro ~ % 
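The Step4 check can also be scripted. Below is a minimal sketch that parses condition output shaped like the yq result above; the sample data is embedded here rather than fetched from a cluster, and the variable names are illustrative:

```shell
# Sample conditions, mirroring the `oc get clusterversion ... | yq` output above.
# Against a live cluster you would pipe the real command output instead.
out='lastTransitionTime: "2025-07-10T05:41:54Z"
status: "False"
type: Failing
lastTransitionTime: "2025-07-10T05:08:53Z"
status: "True"
type: Progressing'

# Each status line precedes its type line, so grab it with grep -B1.
failing=$(printf '%s\n' "$out" | grep -B1 'type: Failing' | grep 'status:' | cut -d'"' -f2)
progressing=$(printf '%s\n' "$out" | grep -B1 'type: Progressing' | grep 'status:' | cut -d'"' -f2)

echo "Failing=$failing Progressing=$progressing"
# The step passes when Failing is anything but "Unknown" and Progressing is "True".
[ "$failing" != "Unknown" ] && [ "$progressing" = "True" ] && echo "check passed"
```

On the sample data this prints Failing=False Progressing=True followed by "check passed", matching the transcript above.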

Step5: Enable the upgrade status command and check that oc adm upgrade status reports the authentication operator as degraded.


dinesh@Dineshs-MacBook-Pro ~ % OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=all
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.18.0-0-2025-07-09-082113-test-ci-ln-83f28bb-latest (from 4.18.0-0.nightly-2025-07-05-213851)
Completion:      88% (30 operators updated, 0 updating, 4 waiting)
Duration:        36m (Est. Time Remaining: 36m)
Operator Health: 33 Healthy, 1 Available but degraded

Control Plane Nodes
NAME                                         ASSESSMENT   PHASE     VERSION                              EST   MESSAGE
ip-10-0-127-246.us-west-1.compute.internal   Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     
ip-10-0-43-80.us-west-1.compute.internal     Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     
ip-10-0-46-27.us-west-1.compute.internal     Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                         ASSESSMENT   PHASE     VERSION                              EST   MESSAGE
ip-10-0-119-222.us-west-1.compute.internal   Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     
ip-10-0-14-152.us-west-1.compute.internal    Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     
ip-10-0-7-48.us-west-1.compute.internal      Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     

= Update Health =
Message: Cluster Operator authentication is degraded (OAuthServerConfigObservation_Error)
  Since:       48m35s
  Level:       Error
  Impact:      API Availability
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusteroperators.config.openshift.io: authentication
  Description: OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
dinesh@Dineshs-MacBook-Pro ~ % 

Step6: Un-degrade the authentication cluster operator by resetting the OAuth configuration

 
dinesh@Dineshs-MacBook-Pro ~ % cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec: {}
dinesh@Dineshs-MacBook-Pro ~ % oc apply -f oauth.yaml 
oauth.config.openshift.io/cluster configured
dinesh@Dineshs-MacBook-Pro ~ %
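For reference, the reset can be scripted as well; this sketch writes the same minimal manifest as above and sanity-checks that the spec is empty before applying (the oc apply step is left commented since it needs a live cluster and kubeconfig):

```shell
# Write the same minimal OAuth manifest as above; an empty spec drops the
# broken IDP configuration that degraded the authentication operator.
cat > oauth.yaml <<'EOF'
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec: {}
EOF

# Sanity check: the spec must be empty before applying.
grep -q '^spec: {}$' oauth.yaml && echo "spec is empty"

# oc apply -f oauth.yaml   # apply against the cluster
```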

Step7: The CVO error should now clear and the upgrade should resume.

 
dinesh@Dineshs-MacBook-Pro ~ % oc get co authentication 
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.18.0-0-2025-07-09-082113-test-ci-ln-83f28bb-latest   True        False         False      75m     
dinesh@Dineshs-MacBook-Pro ~ % 
dinesh@Dineshs-MacBook-Pro ~ % oc adm upgrade           
info: An upgrade is in progress. Working towards 4.18.0-0-2025-07-09-082113-test-ci-ln-83f28bb-latest: 720 of 901 done (79% complete), waiting on olm

Upgradeable=False

  Reason: AdminAckRequired
  Message: Kubernetes 1.32 and therefore OpenShift 4.19 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/7112216 for details and instructions. This cluster is GCP or AWS but lacks a boot image configuration. OCP will automatically opt this cluster into boot image management in 4.19. Please add a configuration to disable boot image updates if this is not desired. See https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/machine_configuration/mco-update-boot-images#mco-update-boot-images-disable_machine-configs-configure for more details.

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

dinesh@Dineshs-MacBook-Pro ~ % oc adm upgrade status 
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.18.0-0-2025-07-09-082113-test-ci-ln-83f28bb-latest (from 4.18.0-0.nightly-2025-07-05-213851)
Completion:      88% (30 operators updated, 0 updating, 4 waiting)
Duration:        39m (Est. Time Remaining: 33m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                                         ASSESSMENT   PHASE     VERSION                              EST   MESSAGE
ip-10-0-127-246.us-west-1.compute.internal   Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     
ip-10-0-43-80.us-west-1.compute.internal     Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     
ip-10-0-46-27.us-west-1.compute.internal     Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                         ASSESSMENT   PHASE     VERSION                              EST   MESSAGE
ip-10-0-119-222.us-west-1.compute.internal   Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     
ip-10-0-14-152.us-west-1.compute.internal    Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     
ip-10-0-7-48.us-west-1.compute.internal      Outdated     Pending   4.18.0-0.nightly-2025-07-05-213851   ?     

= Update Health =
SINCE    LEVEL   IMPACT   MESSAGE
38m53s   Info    None     Update is proceeding well
dinesh@Dineshs-MacBook-Pro ~ % 

@dis016

dis016 commented Jul 11, 2025

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Jul 11, 2025
@openshift-ci-robot

@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-58450, which is valid.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.z) matches configured target version for branch (4.18.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-23514 is in the state Closed (Done-Errata), which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-23514 targets the "4.19.0" version, which is one of the valid target versions: 4.19.0, 4.19.z
  • bug has dependents

Requesting review from QA contact:
/cc @dis016

In response to this:

This is an automated cherry-pick of #1165

/assign petr-muller

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from dis016 July 11, 2025 09:16
@dis016

dis016 commented Jul 15, 2025

/cc @jiajliu

@dis016

dis016 commented Jul 16, 2025

/assign @petr-muller

@dis016

dis016 commented Jul 30, 2025

@PraffulKapse The unit tests below cover the slow-operator scenarios, so we don't need to execute these cases manually; doing so would be repetitive.

<testcase classname="github.com/openshift/cluster-version-operator/pkg/cvo" name="TestUpdateClusterVersionStatus_FilteringMultipleErrorsForFailingCondition/single_UpdateEffectNone_error_and_timeout" time="0.000000"></testcase>
<testcase classname="github.com/openshift/cluster-version-operator/pkg/cvo" name="TestUpdateClusterVersionStatus_FilteringMultipleErrorsForFailingCondition/single_UpdateEffectNone_error_and_machine-config_timeout" time="0.000000"></testcase>
<testcase classname="github.com/openshift/cluster-version-operator/pkg/cvo" name="TestUpdateClusterVersionStatus_FilteringMultipleErrorsForFailingCondition/MultipleErrors:_all_updating_and_one_slow" time="0.000000"></testcase>
<testcase classname="github.com/openshift/cluster-version-operator/pkg/cvo" name="TestUpdateClusterVersionStatus_FilteringMultipleErrorsForFailingCondition/MultipleErrors:_all_updating_and_some_slow" time="0.000000"></testcase>
<testcase classname="github.com/openshift/cluster-version-operator/pkg/cvo" name="TestUpdateClusterVersionStatus_FilteringMultipleErrorsForFailingCondition/MultipleErrors:_all_updating_and_all_slow" time="0.000000"></testcase>
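These cases can be confirmed from a CI junit report without re-running them by hand. A sketch, where the report filename and sample lines are illustrative (mirroring two of the entries above):

```shell
# Sample junit fragment mirroring the testcase entries above; a real CI run
# would produce a full junit XML report instead.
cat > junit-sample.xml <<'EOF'
<testcase classname="github.com/openshift/cluster-version-operator/pkg/cvo" name="TestUpdateClusterVersionStatus_FilteringMultipleErrorsForFailingCondition/single_UpdateEffectNone_error_and_timeout" time="0.000000"></testcase>
<testcase classname="github.com/openshift/cluster-version-operator/pkg/cvo" name="TestUpdateClusterVersionStatus_FilteringMultipleErrorsForFailingCondition/MultipleErrors:_all_updating_and_one_slow" time="0.000000"></testcase>
EOF

# Count the slow-operator cases present in the report.
grep -c 'FilteringMultipleErrorsForFailingCondition' junit-sample.xml

# To re-run just these subtests locally (assumes a cluster-version-operator checkout):
# go test ./pkg/cvo/ -run 'TestUpdateClusterVersionStatus_FilteringMultipleErrorsForFailingCondition'
```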

cc @jhou1 @jiajliu
