77 changes: 63 additions & 14 deletions pkg/cvo/status.go
@@ -327,6 +327,14 @@ func updateClusterVersionStatus(cvStatus *configv1.ClusterVersionStatus, status
 		failingCondition.Reason = failingReason
 		failingCondition.Message = failingMessage
 	}
+	if failure != nil &&
+		strings.HasPrefix(progressReason, slowCOUpdatePrefix) {
+		failingCondition.Status = configv1.ConditionUnknown
+		failingCondition.Reason = "SlowClusterOperator"
@wking (Member) commented on Jul 9, 2025:


More testing notes, since we had some questions about metrics handling. Launched `launch 4.18 aws` in Cluster Bot (logs) and built a new payload with this change via `build 4.18,openshift/cluster-version-operator#1211` (logs). Cordon the compute nodes, to block ingress, the registry, etc. from being able to roll out new workloads:

$ oc adm cordon -l node-role.kubernetes.io/worker=

Then launch the update, using --force to blast through the use of a by-tag :latest pullspec (it's a throwaway cluster, so I'll accept the risk that the registry changes what :latest points at) and the lack of a signature (there are no production signatures on CI-built release images):

$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest

I didn't set `no-spot` when requesting the cluster, so I run the cordon again once this new Machine comes up:

$ oc -n openshift-machine-api get machines
NAME                                                PHASE         TYPE         REGION      ZONE         AGE
ci-ln-lxqwzpk-76ef8-95mmn-master-0                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-1                  Running       m6a.xlarge   us-east-2   us-east-2a   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-2                  Running       m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2a-zlzqq   Running       m6a.xlarge   us-east-2   us-east-2a   31m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-78rc8   Provisioned   m6a.xlarge   us-east-2   us-east-2b   4m23s
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-s9nm7   Running       m6a.xlarge   us-east-2   us-east-2b   31m
$ oc -n openshift-machine-api get machines
NAME                                                PHASE     TYPE         REGION      ZONE         AGE
ci-ln-lxqwzpk-76ef8-95mmn-master-0                  Running   m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-1                  Running   m6a.xlarge   us-east-2   us-east-2a   40m
ci-ln-lxqwzpk-76ef8-95mmn-master-2                  Running   m6a.xlarge   us-east-2   us-east-2b   40m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2a-zlzqq   Running   m6a.xlarge   us-east-2   us-east-2a   32m
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-78rc8   Running   m6a.xlarge   us-east-2   us-east-2b   4m51s
ci-ln-lxqwzpk-76ef8-95mmn-worker-us-east-2b-s9nm7   Running   m6a.xlarge   us-east-2   us-east-2b   32m
$ oc adm cordon -l node-role.kubernetes.io/worker=
node/ip-10-0-124-53.us-east-2.compute.internal cordoned
node/ip-10-0-27-221.us-east-2.compute.internal already cordoned
node/ip-10-0-64-26.us-east-2.compute.internal already cordoned

And the update gets stuck on the cordon, but the image-registry operator complains about the sticking (good for admins, but it means we don't exercise this SlowClusterOperator logic):

$ oc adm upgrade
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator image-registry is degraded

info: An upgrade is in progress. Unable to apply 4.18.0-0-2025-07-09-171308-test-ci-ln-k5kkkjt-latest: an unknown error has occurred: MultipleErrors

```console
$ oc get -o json clusteroperator image-registry | jq '.status.conditions[] | select(.type == "Degraded")'
{
  "lastTransitionTime": "2025-07-09T18:28:57Z",
  "message": "Degraded: Registry deployment has timed out progressing: ReplicaSet \"image-registry-54757dcb9f\" has timed out progressing.",
  "reason": "ProgressDeadlineExceeded",
  "status": "True",
  "type": "Degraded"
}
```

I might need to use openshift/cluster-image-registry-operator#1227, or try again with ClusterVersion overrides.

@wking (Member) followed up:

Ok, trying again on a fresh Cluster Bot `launch 4.17.27 aws` cluster (logs), with an overrides entry instead of the cordoning, to try to get the stale-but-nominally-happy ClusterOperator content needed to trigger SlowClusterOperator:

$ oc patch clusterversion.config.openshift.io version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps", "namespace": "openshift-image-registry", "name": "cluster-image-registry-operator", "unmanaged": true}]}]'

Then launch the update, using --force to blast through the use of a by-tag :latest pullspec (it's a throwaway cluster, so I'll accept the risk that the registry changes what :latest points at) and the lack of a signature (there are no production signatures on CI-built release images):

$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build06.ci.openshift.org/ci-ln-k5kkkjt/release:latest

And some time later:

$ oc adm upgrade
Failing=Unknown:

  Reason: SlowClusterOperator
  Message: waiting on image-registry over 30 minutes which is longer than expected

info: An upgrade is in progress. Working towards 4.18.0-0-2025-07-09-171308-test-ci-ln-k5kkkjt-latest: 708 of 901 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

...
$ oc get -o json clusterversion version | jq '.status.conditions[] | select(.type == "Failing")'
{
  "lastTransitionTime": "2025-07-09T20:33:52Z",
  "message": "waiting on image-registry over 30 minutes which is longer than expected",
  "reason": "SlowClusterOperator",
  "status": "Unknown",
  "type": "Failing"
}

And in platform monitoring:

`max by (reason) (cluster_operator_conditions{name="version",condition="Failing"}) + 0.1`

shows the Failing metric going away when it moves from Failing=False to Failing=Unknown, ~30m after `group by (version) (cluster_operator_up{name="console"})` shows the also-runlevel-50 console ClusterOperator finishing its update (by which point, we'll also have been waiting on the image-registry ClusterOperator for ~30m).

[screenshot: graph of the Failing-condition metric]

That's also when ClusterOperatorDegraded starts firing for name="version":

`group by (alertname, name, reason) (ALERTS{alertname="ClusterOperatorDegraded"})`

[screenshot: ClusterOperatorDegraded alerts graph]

And while we don't have a great reason like "because the metric we expect to see has disappeared", the alert description is probably still workable:

[screenshot: ClusterOperatorDegraded alert description]

@dis016 commented on Jul 10, 2025:


Hi @wking, thanks for the testing. Do we need to take some action for the "metric we expect to see has disappeared" case, or is it ok to proceed with the current metric representation?


+		failingCondition.Message = progressMessage
+	}
+	progressReason = strings.TrimPrefix(progressReason, slowCOUpdatePrefix)
+
 	resourcemerge.SetOperatorStatusCondition(&cvStatus.Conditions, failingCondition)
 
 	// update progressing
@@ -537,6 +545,8 @@ func setDesiredReleaseAcceptedCondition(cvStatus *configv1.ClusterVersionStatus,
 	}
 }
 
+const slowCOUpdatePrefix = "Slow::"
+
 // convertErrorToProgressing returns true if the provided status indicates a failure condition can be interpreted as
 // still making internal progress. The general error we try to suppress is an operator or operators still being
 // progressing AND the general payload task making progress towards its goal. The error's UpdateEffect determines
@@ -555,28 +565,67 @@ func convertErrorToProgressing(now time.Time, statusFailure error) (reason strin
 	case payload.UpdateEffectReport:
 		return uErr.Reason, uErr.Error(), false
 	case payload.UpdateEffectNone:
-		return uErr.Reason, fmt.Sprintf("waiting on %s", uErr.Name), true
+		return convertErrorToProgressingForUpdateEffectNone(uErr, now)
 	case payload.UpdateEffectFail:
 		return "", "", false
 	case payload.UpdateEffectFailAfterInterval:
-		var exceeded []string
-		threshold := now.Add(-(40 * time.Minute))
-		names := uErr.Names
-		if len(names) == 0 {
-			names = []string{uErr.Name}
-		}
-		for _, name := range names {
-			if payload.COUpdateStartTimesGet(name).Before(threshold) {
-				exceeded = append(exceeded, name)
-			}
-		}
-		if len(exceeded) > 0 {
-			return uErr.Reason, fmt.Sprintf("wait has exceeded 40 minutes for these operators: %s", strings.Join(exceeded, ", ")), false
-		} else {
-			return uErr.Reason, fmt.Sprintf("waiting up to 40 minutes on %s", uErr.Name), true
-		}
+		return convertErrorToProgressingForUpdateEffectFailAfterInterval(uErr, now)
 	}
 	return "", "", false
 }
 
+func convertErrorToProgressingForUpdateEffectNone(uErr *payload.UpdateError, now time.Time) (string, string, bool) {
+	var exceeded []string
+	names := uErr.Names
+	if len(names) == 0 {
+		names = []string{uErr.Name}
+	}
+	var machineConfig bool
+	for _, name := range names {
+		m := 30 * time.Minute
+		// It takes longer to upgrade MCO
+		if name == "machine-config" {
+			m = 3 * m
+		}
+		t := payload.COUpdateStartTimesGet(name)
+		if (!t.IsZero()) && t.Before(now.Add(-(m))) {
+			if name == "machine-config" {
+				machineConfig = true
+			} else {
+				exceeded = append(exceeded, name)
+			}
+		}
+	}
+	// returns true in those slow cases because it is still only a suspicion
+	if len(exceeded) > 0 && !machineConfig {
+		return slowCOUpdatePrefix + uErr.Reason, fmt.Sprintf("waiting on %s over 30 minutes which is longer than expected", strings.Join(exceeded, ", ")), true
+	}
+	if len(exceeded) > 0 && machineConfig {
+		return slowCOUpdatePrefix + uErr.Reason, fmt.Sprintf("waiting on %s over 30 minutes and machine-config over 90 minutes which is longer than expected", strings.Join(exceeded, ", ")), true
+	}
+	if len(exceeded) == 0 && machineConfig {
+		return slowCOUpdatePrefix + uErr.Reason, "waiting on machine-config over 90 minutes which is longer than expected", true
+	}
+	return uErr.Reason, fmt.Sprintf("waiting on %s", strings.Join(names, ", ")), true
+}
+
+func convertErrorToProgressingForUpdateEffectFailAfterInterval(uErr *payload.UpdateError, now time.Time) (string, string, bool) {
+	var exceeded []string
+	threshold := now.Add(-(40 * time.Minute))
+	names := uErr.Names
+	if len(names) == 0 {
+		names = []string{uErr.Name}
+	}
+	for _, name := range names {
+		if payload.COUpdateStartTimesGet(name).Before(threshold) {
+			exceeded = append(exceeded, name)
+		}
+	}
+	if len(exceeded) > 0 {
+		return uErr.Reason, fmt.Sprintf("wait has exceeded 40 minutes for these operators: %s", strings.Join(exceeded, ", ")), false
+	} else {
+		return uErr.Reason, fmt.Sprintf("waiting up to 40 minutes on %s", uErr.Name), true
+	}
+}
 
 // syncFailingStatus handles generic errors in the cluster version. It tries to preserve
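
Since the 30-minute / 90-minute expectations above carry most of the behavior change, here is a minimal standalone Go sketch of the classification that convertErrorToProgressingForUpdateEffectNone performs. This is not code from this PR: it leaves out the payload.UpdateError plumbing and COUpdateStartTimesGet, and the operator start times and the "ClusterOperatorUpdating" reason are made-up inputs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

const slowCOUpdatePrefix = "Slow::"

// classify mirrors the thresholds in convertErrorToProgressingForUpdateEffectNone:
// operators waiting longer than 30 minutes (90 minutes for machine-config) are called
// out as slow, and the reason gains the Slow:: prefix so updateClusterVersionStatus
// can report Failing=Unknown with Reason=SlowClusterOperator.
func classify(reason string, started map[string]time.Time, now time.Time) (string, string) {
	var exceeded []string
	machineConfig := false
	for name, t := range started {
		limit := 30 * time.Minute
		if name == "machine-config" {
			limit = 90 * time.Minute // MCO rollouts are expected to take longer
		}
		if !t.IsZero() && t.Before(now.Add(-limit)) {
			if name == "machine-config" {
				machineConfig = true
			} else {
				exceeded = append(exceeded, name)
			}
		}
	}
	sort.Strings(exceeded) // map iteration order is random; keep the message stable
	switch {
	case len(exceeded) > 0 && !machineConfig:
		return slowCOUpdatePrefix + reason, fmt.Sprintf("waiting on %s over 30 minutes which is longer than expected", strings.Join(exceeded, ", "))
	case len(exceeded) > 0 && machineConfig:
		return slowCOUpdatePrefix + reason, fmt.Sprintf("waiting on %s over 30 minutes and machine-config over 90 minutes which is longer than expected", strings.Join(exceeded, ", "))
	case machineConfig:
		return slowCOUpdatePrefix + reason, "waiting on machine-config over 90 minutes which is longer than expected"
	default:
		return reason, "still waiting within the expected time"
	}
}

func main() {
	now := time.Now()
	started := map[string]time.Time{
		"image-registry": now.Add(-45 * time.Minute), // past the 30m expectation
		"machine-config": now.Add(-50 * time.Minute), // still within its 90m expectation
	}
	reason, message := classify("ClusterOperatorUpdating", started, now)
	fmt.Println(reason)  // Slow::ClusterOperatorUpdating
	fmt.Println(message) // waiting on image-registry over 30 minutes which is longer than expected
}
```

With image-registry at 45 minutes and machine-config at 50 minutes, the sketch reproduces the same "waiting on image-registry over 30 minutes which is longer than expected" message that the `oc adm upgrade` output above reports under Failing=Unknown.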