42 changes: 33 additions & 9 deletions dev-guide/cluster-version-operator/dev/clusteroperator.md
@@ -169,15 +169,39 @@ Refer [the godocs](https://godoc.org/github.com/openshift/api/config/v1#ClusterS

Member Author

The link I was talking about is already there.
I did not notice it.

Eventually I copied everything over and then ran:

$ sed -i 's|// ||g' dev-guide/cluster-version-operator/dev/clusteroperator.md

The main thing we lose here is when to handle it for Degraded and long Progressing, which is OK as discussed in openshift/api#2469 (comment).

In general, ClusterOperators should contain at least three core conditions:

* `Progressing` must be true if the operator is actually making a change to the operand.
The change may be anything: desired user state, desired user configuration, observed configuration, version update, etc.
If this is false, it means the operator is not trying to apply any new state.
If it remains true for an extended period of time, it suggests something is wrong in the cluster. It can probably wait until Monday.
* `Available` must be true if the operand is functional and available in the cluster at the level in status.
If this is false, it means there is an outage. Someone is probably getting paged.
* `Degraded` should be true if the operator has encountered an error that is preventing it or its operand from working properly.
The operand may still be available, but intent may not have been fulfilled.
If this is true, it means that the operand is at risk of an outage or improper configuration. It can probably wait until the morning, but someone needs to look at it.
* `Progressing` indicates that the component (operator and all configured operands)
is actively rolling out new code, propagating config changes (e.g., a version change), or otherwise
moving from one steady state to another. Operators should not report
Progressing when they are reconciling (without action) a previously known
state. Operators should not report Progressing only because DaemonSets owned by them
are adjusting to a new node from cluster scaleup or a node rebooting from cluster upgrade.
If the observed cluster state has changed and the component is
reacting to it (updated proxy configuration for instance), Progressing should become true
since it is moving from one steady state to another.
A component in a cluster with fewer than 250 nodes must complete a version
change within a limited period of time: 90 minutes for Machine Config Operator and 20 minutes for others.
Machine Config Operator is given more time as it needs to restart control plane nodes.
* `Available` indicates that the component (operator and all configured operands)
is functional and available in the cluster. Available=False means at least
part of the component is non-functional, and that the condition requires
immediate administrator intervention.
A component must not report Available=False during the course of a normal upgrade.
* `Degraded` indicates that the component (operator and all configured operands)
does not match its desired state over a period of time resulting in a lower
quality of service. The period of time may vary by component, but a Degraded
state represents persistent observation of a condition. As a result, a
component should not oscillate in and out of Degraded state. A component may
be Available even if it is degraded. For example, a component may desire 3
running pods, but 1 pod is crash-looping. The component is Available but
Degraded because it may have a lower quality of service. A component may be
Progressing but not Degraded because the transition from one state to
another does not persist over a long enough period to report Degraded. A
component must not report Degraded during the course of a normal upgrade.
A component may report Degraded in response to a persistent infrastructure
failure that requires eventual administrator intervention, for example, if
a control plane host is unhealthy and must be replaced. A component should
report Degraded if unexpected errors occur over a period, but the
expectation is that all unexpected errors are handled as operators mature.

The message reported for each of these conditions is important.
All messages should start with a capital letter (like a sentence) and be written for an end user / admin to debug the problem.