[release-4.16] OCPBUGS-58452: Failing=Unknown upon long CO updating #1213
base: release-4.16
When a cluster operator takes too long to upgrade (90m+ for machine-config and 30m+ for others), clusterversion shows a message indicating that the upgrade might have hit an issue.

This covers the case in the related OCPBUGS-23538: for some reason, the pod under the deployment that manages the CO hit CrashLoopBackOff. The deployment controller does not give useful conditions in this situation [1]. Otherwise, checkDeploymentHealth [2] would detect it.

Instead of having the CVO figure out the underlying pod's CrashLoopBackOff, which might be better implemented by the deployment controller, we expect the cluster admin to start digging into the cluster when such a message pops up.

In addition to the condition's message, we propagate Failing=Unknown to make it available to other automation, such as the update-status command.

[1] kubernetes/kubernetes#106054
[2] https://github.com/openshift/cluster-version-operator/blob/08c0459df5096e9f16fad3af2831b62d06d415ee/lib/resourcebuilder/apps.go#L79-L136
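The threshold rule described above can be sketched roughly as follows. This is an illustration in Python, not the CVO's Go implementation: the function and constant names are hypothetical, the strict "greater than" comparison is an assumption, and the reason and message strings are copied from the QE transcript later in this conversation.

```python
from datetime import timedelta
from typing import Optional

# Thresholds quoted in the PR description; the names are hypothetical.
SLOW_THRESHOLDS = {"machine-config": timedelta(minutes=90)}
DEFAULT_THRESHOLD = timedelta(minutes=30)


def failing_condition(operator: str, waited: timedelta) -> Optional[dict]:
    """Return a Failing=Unknown condition when an operator update is slow.

    Returns None while the wait is still within the expected window.
    """
    threshold = SLOW_THRESHOLDS.get(operator, DEFAULT_THRESHOLD)
    if waited <= threshold:  # assumption: the boundary itself is not "slow"
        return None
    minutes = int(threshold.total_seconds() // 60)
    return {
        "type": "Failing",
        "status": "Unknown",
        "reason": "SlowClusterOperator",
        "message": (
            f"waiting on {operator} over {minutes} minutes "
            "which is longer than expected"
        ),
    }


if __name__ == "__main__":
    print(failing_condition("image-registry", timedelta(minutes=31)))
    print(failing_condition("machine-config", timedelta(minutes=45)))
```

Keeping the per-operator exceptions in a lookup table means only machine-config needs an entry; every other operator falls back to the 30-minute default.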
@openshift-cherrypick-robot: Jira Issue OCPBUGS-23514 has been cloned as Jira Issue OCPBUGS-58452. Will retitle bug to link to clone. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-58452, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
/jira refresh
@petr-muller: This pull request references Jira Issue OCPBUGS-58452, which is invalid:
Comment In response to this:
/jira refresh
@petr-muller: This pull request references Jira Issue OCPBUGS-58452, which is invalid:
Comment In response to this:
/lgtm
Feel free to add backport-risk-assessed when you're ready, depending on however long you want to soak #1212.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: openshift-cherrypick-robot, wking The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@openshift-cherrypick-robot: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Test Scenario: Regression test that the Failing=True condition still works while an operator (authentication) is degraded.

Step 1: Degrade the authentication operator.

pkapse@pkapse-mac Downloads % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2025-07-28-124528   True        False         40m     Cluster version is 4.16.0-0.nightly-2025-07-28-124528

pkapse@pkapse-mac Downloads % oc get co authentication
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.16.0-0.nightly-2025-07-28-124528   True        False         False      41m

pkapse@pkapse-mac Downloads % cat auth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: test
      clientSecret:
        name: test
      claims:
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com

pkapse@pkapse-mac Downloads % oc apply -f auth.yaml
Warning: resource oauths/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
oauth.config.openshift.io/cluster configured

pkapse@pkapse-mac Downloads % oc get co authentication
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.16.0-0.nightly-2025-07-28-124528   True        False         True       49m     OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host

Step 2: Trigger an upgrade to an image which contains the new changes.

pkapse@pkapse-mac Downloads % oc adm upgrade --to-image=registry.build10.ci.openshift.org/ci-ln-40w96fb/release:latest --allow-explicit-upgrade --force --allow-upgrade-with-warnings
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is experiencing an error reconciling "4.16.0-0.nightly-2025-07-28-124528":

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

Requested update to release image registry.build10.ci.openshift.org/ci-ln-40w96fb/release:latest

pkapse@pkapse-mac Downloads % export OC_ENABLE_CMD_UPGRADE_STATUS=true
pkapse@pkapse-mac Downloads % oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.16.0-0-2025-07-27-054505-test-ci-ln-40w96fb-latest (from 4.16.0-0.nightly-2025-07-28-124528)
Updating:        etcd, kube-apiserver
Completion:      3% (1 operators updated, 2 updating, 30 waiting)
Duration:        1m27s (Est. Time Remaining: 1h10m)
Operator Health: 32 Healthy, 1 Available but degraded

Control Plane Nodes
NAME                                        ASSESSMENT   PHASE     VERSION                              EST   MESSAGE
ip-10-0-41-207.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-82-186.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-90-213.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                        ASSESSMENT   PHASE     VERSION                              EST   MESSAGE
ip-10-0-55-182.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-77-73.us-east-2.compute.internal    Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-93-56.us-east-2.compute.internal    Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?

= Update Health =
SINCE   LEVEL     IMPACT             MESSAGE
7m55s   Warning   API Availability   Cluster Operator authentication is degraded (OAuthServerConfigObservation_Error)

Step 3: The upgrade is triggered. After some time the CVO reports an error for the authentication operator, and the upgrade is stuck waiting up to 40 minutes on authentication.

pkapse@pkapse-mac Downloads % oc adm upgrade
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

info: An upgrade is in progress. Unable to apply 4.16.0-0-2025-07-27-054505-test-ci-ln-40w96fb-latest: an unknown error has occurred: MultipleErrors

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

pkapse@pkapse-mac Downloads % oc adm upgrade status --details=all
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.16.0-0-2025-07-27-054505-test-ci-ln-40w96fb-latest (from 4.16.0-0.nightly-2025-07-28-124528)
Completion:      91% (30 operators updated, 0 updating, 3 waiting)
Duration:        44m (Est. Time Remaining: 12m)
Operator Health: 32 Healthy, 1 Available but degraded

Control Plane Nodes
NAME                                        ASSESSMENT   PHASE     VERSION                              EST   MESSAGE
ip-10-0-41-207.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-82-186.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-90-213.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                        ASSESSMENT   PHASE     VERSION                              EST   MESSAGE
ip-10-0-55-182.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-77-73.us-east-2.compute.internal    Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-93-56.us-east-2.compute.internal    Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?

= Update Health =
Message: Cluster Operator authentication is degraded (OAuthServerConfigObservation_Error)
  Since:       50m48s
  Level:       Error
  Impact:      API Availability
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusteroperators.config.openshift.io: authentication
  Description: OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host

Step 4: Check the CVO status: Failing!=Unknown and Progressing=True.

pkapse@pkapse-mac Downloads % oc get clusterversion version -o yaml | yq '.status.conditions[]|select(.type=="Failing" or .type=="Progressing")'
lastTransitionTime: "2025-07-30T11:22:17Z"
status: "False"
type: Failing
lastTransitionTime: "2025-07-30T10:50:04Z"
message: 'Working towards 4.16.0-0-2025-07-27-054505-test-ci-ln-40w96fb-latest: 706 of 902 done (78% complete), waiting up to 40 minutes on authentication'
reason: ClusterOperatorDegraded
status: "True"
type: Progressing

Step 5: Un-degrade the authentication CO.

pkapse@pkapse-mac Downloads % cat un_auth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec: {}

pkapse@pkapse-mac Downloads % oc apply -f un_auth.yaml
oauth.config.openshift.io/cluster configured

Step 6: The CVO error should now disappear and the upgrade should resume.

pkapse@pkapse-mac Downloads % oc adm upgrade
info: An upgrade is in progress. Working towards 4.16.0-0-2025-07-27-054505-test-ci-ln-40w96fb-latest: 728 of 902 done (80% complete), waiting on dns, network

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

pkapse@pkapse-mac Downloads % oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.16.0-0-2025-07-27-054505-test-ci-ln-40w96fb-latest (from 4.16.0-0.nightly-2025-07-28-124528)
Updating:        dns, network
Completion:      91% (30 operators updated, 2 updating, 1 waiting)
Duration:        58m (Est. Time Remaining: 59m)
Operator Health: 33 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT   PHASE     VERSION                              EST   MESSAGE
ip-10-0-41-207.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-82-186.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-90-213.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                        ASSESSMENT   PHASE     VERSION                              EST   MESSAGE
ip-10-0-55-182.us-east-2.compute.internal   Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-77-73.us-east-2.compute.internal    Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?
ip-10-0-93-56.us-east-2.compute.internal    Outdated     Pending   4.16.0-0.nightly-2025-07-28-124528   ?

= Update Health =
SINCE    LEVEL   IMPACT   MESSAGE
58m12s   Info    None     Update is proceeding well
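For automation that wants the same check as the yq filter used in this scenario, the selection can be sketched in Python. This is an illustrative sketch only: the function name is hypothetical, and it assumes the ClusterVersion object has already been fetched as JSON (for example with `oc get clusterversion version -o json`). The sample document is trimmed from the transcript above, with an Available entry added purely to show the filtering.

```python
import json


def select_conditions(clusterversion: dict, wanted=("Failing", "Progressing")) -> list:
    """Return the status conditions whose type is in `wanted`, in original order."""
    conditions = clusterversion.get("status", {}).get("conditions", [])
    return [c for c in conditions if c.get("type") in wanted]


# Trimmed sample; the Available entry is illustrative, not copied from the transcript.
sample = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Available", "status": "True"},
      {"type": "Failing", "status": "False",
       "lastTransitionTime": "2025-07-30T11:22:17Z"},
      {"type": "Progressing", "status": "True",
       "reason": "ClusterOperatorDegraded",
       "lastTransitionTime": "2025-07-30T10:50:04Z"}
    ]
  }
}
""")

if __name__ == "__main__":
    for condition in select_conditions(sample):
        print(condition["type"], condition["status"], condition.get("reason", ""))
```

A regression check for this scenario would then assert that the Failing condition is not Unknown while Progressing stays True.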
Test Scenario: Failing=Unknown when the image-registry operator update is slow for more than 30 minutes.

pkapse@pkapse-mac Downloads % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2025-07-24-234922   True        False         6m10s   Cluster version is 4.16.0-0.nightly-2025-07-24-234922

pkapse@pkapse-mac Downloads % oc patch clusterversion.config.openshift.io version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps", "namespace": "openshift-image-registry", "name": "cluster-image-registry-operator", "unmanaged": true}]}]'
clusterversion.config.openshift.io/version patched

pkapse@pkapse-mac Downloads % oc adm upgrade --force --allow-explicit-upgrade --allow-upgrade-with-warnings --to-image registry.build10.ci.openshift.org/ci-ln-8nljj5k/release:latest
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build10.ci.openshift.org/ci-ln-8nljj5k/release:latest

pkapse@pkapse-mac Downloads % oc adm upgrade
Failing=Unknown:

  Reason: SlowClusterOperator
  Message: waiting on image-registry over 30 minutes which is longer than expected

info: An upgrade is in progress. Working towards 4.16.0-0-2025-07-24-052605-test-ci-ln-8nljj5k-latest: 706 of 902 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected
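The override patch in this scenario marks the image-registry operator Deployment as unmanaged, presumably so that the CVO stops updating it and the operator stays behind long enough to cross the 30-minute threshold. To reuse the setup for another operator, the patch document can be built programmatically. The helper name below is hypothetical, and the sketch only prints the `oc patch` command from the transcript rather than running it.

```python
import json
import shlex


def override_patch(kind: str, group: str, namespace: str, name: str) -> list:
    """Build a JSON Patch that sets spec.overrides to one unmanaged resource."""
    override = {
        "kind": kind,
        "group": group,
        "namespace": namespace,
        "name": name,
        "unmanaged": True,
    }
    # Note: "add" on an existing /spec/overrides replaces the whole list.
    return [{"op": "add", "path": "/spec/overrides", "value": [override]}]


patch = override_patch(
    "Deployment",
    "apps",
    "openshift-image-registry",
    "cluster-image-registry-operator",
)
command = [
    "oc", "patch", "clusterversion.config.openshift.io", "version",
    "--type", "json", "-p", json.dumps(patch),
]

if __name__ == "__main__":
    # Print a shell-quoted command line instead of executing it.
    print(shlex.join(command))
```

Because the patch replaces the entire overrides list, a cluster that already carries overrides would need them merged into the value first.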
/label qe-approved
This is an automated cherry-pick of #1165
/assign petr-muller