From 2659b6680285f6b9f8e810a5226d87ba4f17392a Mon Sep 17 00:00:00 2001
From: Hongkai Liu
Date: Thu, 11 Sep 2025 17:45:17 -0400
Subject: [PATCH] NO-JIRA: New rules about CO's conditions

This is to sync up the changes from
[api#2469](https://github.com/openshift/api/pull/2469) to dev-guide.
---
 .../dev/clusteroperator.md | 42 +++++++++++++++----
 1 file changed, 33 insertions(+), 9 deletions(-)

diff --git a/dev-guide/cluster-version-operator/dev/clusteroperator.md b/dev-guide/cluster-version-operator/dev/clusteroperator.md
index 495d70e0f3..6f34baf152 100644
--- a/dev-guide/cluster-version-operator/dev/clusteroperator.md
+++ b/dev-guide/cluster-version-operator/dev/clusteroperator.md
@@ -169,15 +169,39 @@ Refer [the godocs](https://godoc.org/github.com/openshift/api/config/v1#ClusterS
 
 In general, ClusterOperators should contain at least three core conditions:
 
-* `Progressing` must be true if the operator is actually making change to the operand.
-The change may be anything: desired user state, desired user configuration, observed configuration, version update, etc.
-If this is false, it means the operator is not trying to apply any new state.
-If it remains true for an extended period of time, it suggests something is wrong in the cluster. It can probably wait until Monday.
-* `Available` must be true if the operand is functional and available in the cluster at the level in status.
-If this is false, it means there is an outage. Someone is probably getting paged.
-* `Degraded` should be true if the operator has encountered an error that is preventing it or its operand from working properly.
-The operand may still be available, but intent may not have been fulfilled.
-If this is true, it means that the operand is at risk of an outage or improper configuration. It can probably wait until the morning, but someone needs to look at it.
+* `Progressing` indicates that the component (operator and all configured operands)
+  is actively rolling out new code, propagating config changes (e.g., a version change), or otherwise
+  moving from one steady state to another. Operators should not report
+  Progressing when they are reconciling (without action) a previously known
+  state. Operators should not report Progressing only because DaemonSets owned by them
+  are adjusting to a new node from cluster scaleup or a node rebooting from cluster upgrade.
+  If the observed cluster state has changed and the component is
+  reacting to it (updated proxy configuration for instance), Progressing should become true
+  since it is moving from one steady state to another.
+  A component in a cluster with fewer than 250 nodes must complete a version
+  change within a limited period of time: 90 minutes for Machine Config Operator and 20 minutes for others.
+  Machine Config Operator is given more time as it needs to restart control plane nodes.
+* `Available` indicates that the component (operator and all configured operands)
+  is functional and available in the cluster. Available=False means at least
+  part of the component is non-functional, and that the condition requires
+  immediate administrator intervention.
+  A component must not report Available=False during the course of a normal upgrade.
+* `Degraded` indicates that the component (operator and all configured operands)
+  does not match its desired state over a period of time resulting in a lower
+  quality of service. The period of time may vary by component, but a Degraded
+  state represents persistent observation of a condition. As a result, a
+  component should not oscillate in and out of Degraded state. A component may
+  be Available even if it is degraded. For example, a component may desire 3
+  running pods, but 1 pod is crash-looping. The component is Available but
+  Degraded because it may have a lower quality of service. A component may be
+  Progressing but not Degraded because the transition from one state to
+  another does not persist over a long enough period to report Degraded. A
+  component must not report Degraded during the course of a normal upgrade.
+  A component may report Degraded in response to a persistent infrastructure
+  failure that requires eventual administrator intervention. For example,
+  a control plane host is unhealthy and must be replaced. A component should
+  report Degraded if unexpected errors occur over a period, but the
+  expectation is that all unexpected errors are handled as operators mature.
 
 The message reported for each of these conditions is important.
 All messages should start with a capital letter (like a sentence) and be written for an end user / admin to debug the problem.