[release-4.18] OCPBUGS-58450: Failing=Unknown upon long CO updating #1211
Conversation
When it takes too long (90m+ for machine-config and 30m+ for others) to upgrade a cluster operator, ClusterVersion now shows a message indicating that the upgrade might have hit an issue. This covers the case in the related OCPBUGS-23538: for some reason, the pod under the deployment that manages the CO hits CrashLoopBackOff. The Deployment controller does not surface useful conditions in this situation [1]; otherwise, checkDeploymentHealth [2] would detect it. Rather than having the CVO figure out the underlying pod's CrashLoopBackOff (which is arguably better implemented by the Deployment controller), the expectation is that the cluster admin starts digging into the cluster when such a message pops up. In addition to the condition's message, we propagate Failing=Unknown to make the state available to other automation, such as the update-status command.
[1] kubernetes/kubernetes#106054
[2] https://github.com/openshift/cluster-version-operator/blob/08c0459df5096e9f16fad3af2831b62d06d415ee/lib/resourcebuilder/apps.go#L79-L136
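As a quick way to see the new state from the CLI, the condition can be read straight off the ClusterVersion object; a minimal sketch (the status/reason values mirror the ones exercised in the testing notes further down):

```console
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].status} {.status.conditions[?(@.type=="Failing")].reason}{"\n"}'
Unknown SlowClusterOperator
```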
@openshift-cherrypick-robot: Jira Issue OCPBUGS-23514 has been cloned as Jira Issue OCPBUGS-58450. Will retitle bug to link to clone. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-58450, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: openshift-cherrypick-robot, wking The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@openshift-cherrypick-robot: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Test Scenario: Failing=Unknown when a slow update (>30 minutes) happens for the image-registry operator.
Step 2: Create a build release with the image-registry magic PR (openshift/machine-config-operator#5170).
Step 3: Upgrade to the newly created build.
Step 4: The upgrade should get triggered and proceed normally.
Step 5: After some time (approximately >30 minutes for image-registry), CVO should report Failing=Unknown, reason=SlowClusterOperator, and Progressing=True.
The upgrade should never complete and should keep progressing.
Test Scenario: Failing=Unknown when a slow update (>90 minutes) happens for the MCO operator.
Step 2: Create a build release with the MCO magic PR (openshift/machine-config-operator#5170).
Step 3: Upgrade to the newly created build.
Step 4: The upgrade should get triggered and proceed normally.
Step 5: After some time (approximately >90 minutes for machine-config), CVO should report Failing=Unknown, reason=SlowClusterOperator, and Progressing=True (see the example commands after these steps).
The upgrade should never complete and should keep progressing.
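To automate the Step 5 check in either scenario, a blocking wait along these lines should work; this is a sketch, assuming the `oc wait` client in use supports condition-value matching, with the timeout padded beyond the 30-/90-minute thresholds:

```console
$ oc wait clusterversion/version --for='condition=Failing=Unknown' --timeout=100m
$ oc wait clusterversion/version --for='condition=Progressing=True' --timeout=1m
```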
```go
if failure != nil &&
	strings.HasPrefix(progressReason, slowCOUpdatePrefix) {
	failingCondition.Status = configv1.ConditionUnknown
	failingCondition.Reason = "SlowClusterOperator"
```
More testing notes, since we had some questions about metrics handling. `launch 4.18 aws` to Cluster Bot (logs) and a new payload with this change via `build 4.18,openshift/cluster-version-operator#1211` (logs). Cordon compute, to block ingress, registry, etc. from being able to roll out new workloads:

```console
$ oc adm cordon -l node-role.kubernetes.io/worker=
```
Then launch the update, using `--force` to blast through the use of a by-tag `:latest` pullspec (it's a throwaway cluster, I'll accept the risk that the registry changes what `latest` points at) and the lack of a signature (no production signatures on CI-built release images):

```console
$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
```
I didn't set `no-spot` when requesting the cluster, so run a cordon again once this new Machine comes up:
```console
$ oc -n openshift-machine-api get machines
NAME                                                PHASE         TYPE         REGION      ZONE         AGE
ci-ln-lxqwzpk-76ef8-95mmn-master-0                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-1                  Running       m6a.xlarge   us-east-2   us-east-2a   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-2                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2a-zlzqq   Running       m6a.xlarge   us-east-2   us-east-2a   31m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-78rc8   Provisioned   m6a.xlarge   us-east-2   us-east-2b   4m23s
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-s9nm7   Running       m6a.xlarge   us-east-2   us-east-2b   31m
$ oc -n openshift-machine-api get machines
NAME                                                PHASE         TYPE         REGION      ZONE         AGE
ci-ln-lxqwzpk-76ef8-95mmn-master-0                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-1                  Running       m6a.xlarge   us-east-2   us-east-2a   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-2                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2a-zlzqq   Running       m6a.xlarge   us-east-2   us-east-2a   32m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-78rc8   Running       m6a.xlarge   us-east-2   us-east-2b   4m51s
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-s9nm7   Running       m6a.xlarge   us-east-2   us-east-2b   32m
$ oc adm cordon -l node-role.kubernetes.io/worker=
node/ip-10-0-124-53.us-east-2.compute.internal cordoned
node/ip-10-0-27-221.us-east-2.compute.internal already cordoned
node/ip-10-0-64-26.us-east-2.compute.internal already cordoned
```
And the update gets stuck on the cordon, but the image-registry operator complains about the sticking (good for admins, but it means we don't exercise this `SlowClusterOperator` logic):
```console
$ oc adm upgrade
Failing=True:
  Reason: ClusterOperatorDegraded
  Message: Cluster operator image-registry is degraded
info: An upgrade is in progress. Unable to apply 4.18.0-0-2025-07-09-171308-test-ci-ln-k5kkkjt-latest: an unknown error has occurred: MultipleErrors
```
```console
$ oc get -o json clusteroperator image-registry | jq '.status.conditions[] | select(.type == "Degraded")'
{
  "lastTransitionTime": "2025-07-09T18:28:57Z",
  "message": "Degraded: Registry deployment has timed out progressing: ReplicaSet \"image-registry-54757dcb9f\" has timed out progressing.",
  "reason": "ProgressDeadlineExceeded",
  "status": "True",
  "type": "Degraded"
}
```
I might need to use openshift/cluster-image-registry-operator#1227, or try again with ClusterVersion `overrides`.
Ok, trying on a fresh Cluster Bot `launch 4.17.27 aws` (logs) with an `overrides` entry instead of the cordoning, to try and get the stale-but-nominally-happy ClusterOperator content needed to trigger `SlowClusterOperator`:

```console
$ oc patch clusterversion.config.openshift.io version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps", "namespace": "openshift-image-registry", "name": "cluster-image-registry-operator", "unmanaged": true}]}]'
```
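As a quick sanity check (not part of the original notes) that the override landed before launching the update:

```console
$ oc get clusterversion version -o jsonpath='{.spec.overrides[0].unmanaged} {.spec.overrides[0].name}{"\n"}'
true cluster-image-registry-operator
```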
Then launch the update, using `--force` to blast through the use of a by-tag `:latest` pullspec (it's a throwaway cluster, I'll accept the risk that the registry changes what `latest` points at) and the lack of a signature (no production signatures on CI-built release images):

```console
$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
```
And some time later:
```console
$ oc adm upgrade
Failing=Unknown:
  Reason: SlowClusterOperator
  Message: waiting on image-registry over 30 minutes which is longer than expected
info: An upgrade is in progress. Working towards 4.18.0-0-2025-07-09-171308-test-ci-ln-k5kkkjt-latest: 708 of 901 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected
...
```
```console
$ oc get -o json clusterversion version | jq '.status.conditions[] | select(.type == "Failing")'
{
  "lastTransitionTime": "2025-07-09T20:33:52Z",
  "message": "waiting on image-registry over 30 minutes which is longer than expected",
  "reason": "SlowClusterOperator",
  "status": "Unknown",
  "type": "Failing"
}
```
And in platform monitoring:

`max by (reason) (cluster_operator_conditions{name="version",condition="Failing"}) + 0.1`

shows the `Failing` metric going away when it moves from `Failing=False` to `Failing=Unknown`, ~30m after `group by (version) (cluster_operator_up{name="console"})` shows the also-runlevel-50 `console` ClusterOperator finishing its update (by which point, we'll also have been waiting on the `image-registry` ClusterOperator for ~30m). That's also when `ClusterOperatorDegraded` starts firing for `name="version"`:

`group by (alertname, name, reason) (ALERTS{alertname="ClusterOperatorDegraded"})`

And while we don't have a great `reason` like "because the metric we expect to see has disappeared", the alert description is probably still workable:
Hi @wking, thanks for the testing. Do we need to take some action for the "metric we expect to see has disappeared" issue, or is it OK to proceed with the current metric representation?
Thanks @wking, saw the Slack comments: https://redhat-internal.slack.com/archives/CJ1J9C3V4/p1752094519486299?thread_ts=1751905557.779399&cid=CJ1J9C3V4
Test Scenario: Regression test that the Failing=True condition still works while an operator (authentication) is degraded.
Step 2: Trigger an upgrade to a version which contains the changes.
Step 3: The upgrade is triggered, CVO throws an error after some time for the authentication operator, and the upgrade gets stuck with "waiting up to 40 minutes on authentication".
Step 4: Check the CVO status: Failing!=Unknown and Progressing=True.
Step 5: Enable upgrade status and check that oc adm upgrade status reports the authentication operator as degraded (see the sketch after these steps).
Step 6: Un-degrade the authentication CO.
Step 7: The CVO error should now disappear and the upgrade should resume.
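A sketch of the checks in Steps 4 and 5, assuming the release in use still gates the status command behind the `OC_ENABLE_CMD_UPGRADE_STATUS` environment variable:

```console
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].status}{"\n"}'
True
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status
```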
/label qe-approved
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-58450, which is valid. 7 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cc @jiajliu
/assign @petr-muller
@PraffulKapse Unit tests cover the scenarios below related to slow operators, so we don't have to execute the tests manually, which would just be repeated work.
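For reference, those unit tests can be run locally with something like the following; the package path and test-name filter are assumptions about where the slow-operator logic lives, not confirmed in this PR:

```console
$ go test ./pkg/cvo/... -run SlowClusterOperator -count=1
```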
This is an automated cherry-pick of #1165
/assign petr-muller