Commit 757214d

Merge pull request #40133 from mburke5678/OSDOCS-3040-drift
OSDOCS-3040: Check for config drift regularly
2 parents 8e6e39a + d847cf7 commit 757214d

6 files changed: 122 additions and 1 deletion

architecture/control-plane.adoc

Lines changed: 7 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -9,6 +9,10 @@ include::modules/understanding-control-plane.adoc[leveloffset=+1]
99

1010
include::modules/architecture-machine-config-pools.adoc[leveloffset=+2]
1111

12+
.Additional resources
13+
14+
* For more information on configuration drift detection, see xref:../post_installation_configuration/machine-configuration-tasks.adoc#machine-config-drift-detection_post-install-machine-configuration-tasks[Understanding configuration drift detection].
15+
1216
include::modules/architecture-machine-roles.adoc[leveloffset=+2]
1317

1418
include::modules/operators-overview.adoc[leveloffset=+2]
@@ -19,4 +23,6 @@ include::modules/understanding-machine-config-operator.adoc[leveloffset=+3]
1923

2024
.Additional information
2125

22-
For information on preventing the control plane machines from rebooting after the Machine Config Operator makes changes to the machine config, see xref:../support/troubleshooting/troubleshooting-operator-issues.adoc#troubleshooting-disabling-autoreboot-mco_troubleshooting-operator-issues[Disabling Machine Config Operator from automatically rebooting].
26+
* For more information on detecting configuration drift, see xref:../post_installation_configuration/machine-configuration-tasks.adoc#machine-config-drift-detection_post-install-machine-configuration-tasks[Understanding configuration drift detection].
27+
28+
* For information on preventing the control plane machines from rebooting after the Machine Config Operator makes changes to the machine config, see xref:../support/troubleshooting/troubleshooting-operator-issues.adoc#troubleshooting-disabling-autoreboot-mco_troubleshooting-operator-issues[Disabling Machine Config Operator from automatically rebooting].

modules/architecture-machine-config-pools.adoc

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,3 +24,6 @@ Any node labeled with the `infra` role that is only running infra workloads is n
2424
====
2525

2626
The MCO applies updates for pools independently; for example, if there is an update that affects all pools, nodes from each pool update in parallel with each other. If you add a custom pool, nodes from that pool also attempt to update concurrently with the master and worker nodes.
27+
28+
There might be situations where the configuration on a node does not fully match what the currently applied machine config specifies. This state is called _configuration drift_. The Machine Config Daemon (MCD) regularly checks the nodes for configuration drift. If the MCD detects configuration drift, the MCO marks the node `Degraded` until an administrator corrects the node configuration. A degraded node is online and operational, but it cannot be updated.
29+

modules/machine-config-drift-detection.adoc

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * post_installation_configuration/machine-configuration-tasks.adoc
4+
5+
[id="machine-config-drift-detection_{context}"]
6+
= Understanding configuration drift detection
7+
8+
There might be situations when the on-disk state of a node differs from what is configured in the machine config. This is known as _configuration drift_. For example, a cluster admin might manually modify a file, a systemd unit file, or a file permission that was configured through a machine config. Configuration drift can cause problems between nodes in a machine config pool or when the machine configs are updated.
9+
10+
The Machine Config Operator (MCO) uses the Machine Config Daemon (MCD) to check nodes for configuration drift on a regular basis. If the MCD detects configuration drift, the MCO sets the node and the machine config pool (MCP) to `Degraded` and reports the error. A degraded node is online and operational, but it cannot be updated.
11+
12+
The MCD performs configuration drift detection under each of the following conditions:
13+
14+
* When a node boots.
15+
* After any of the files (Ignition files and systemd drop-in units) specified in the machine config are modified outside of the machine config.
16+
* Before a new machine config is applied.
17+
+
18+
[NOTE]
19+
====
20+
If you apply a new machine config to the nodes, the MCD temporarily shuts down configuration drift detection. This shutdown is needed because the new machine config necessarily differs from the machine config on the nodes. After the new machine config is applied, the MCD restarts detecting configuration drift using the new machine config.
21+
====
22+
23+
When performing configuration drift detection, the MCD validates that the file contents and permissions fully match what the currently applied machine config specifies. Typically, the MCD detects configuration drift in less than a second after the detection is triggered.
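
The kind of check described here can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the MCD's implementation: the function name `check_file_drift` is hypothetical, and only the `content mismatch` wording is taken from the example output later in this module. The other two messages are invented for the example.

```python
import os
import stat


def check_file_drift(path, expected_contents, expected_mode):
    """Return a drift description, or None if the file matches.

    expected_contents: bytes that the machine config specifies for the file.
    expected_mode: permission bits that it specifies, for example 0o644.
    """
    try:
        with open(path, "rb") as f:
            actual_contents = f.read()
    except FileNotFoundError:
        # Hypothetical wording; the source does not show this case.
        return f'missing file "{path}"'

    if actual_contents != expected_contents:
        # Same wording as the message in the example output.
        return f'content mismatch for file "{path}"'

    # Compare only the permission bits, not the file type bits.
    actual_mode = stat.S_IMODE(os.stat(path).st_mode)
    if actual_mode != expected_mode:
        # Hypothetical wording; the source does not show this case.
        return f'mode mismatch for file "{path}"'

    return None
```

A call such as `check_file_drift("/etc/mco-test-file", b"expected\n", 0o644)` returns `None` when the file matches and a short description otherwise.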
24+
25+
If the MCD detects configuration drift, the MCD performs the following tasks:
26+
27+
* Emits an error to the console logs
28+
* Emits a Kubernetes event
29+
* Stops further detection on the node
30+
* Sets the node and MCP to `Degraded`
31+
32+
You can check if you have a degraded node by listing the MCPs:
33+
34+
[source,terminal]
35+
----
36+
$ oc get mcp worker
37+
----
38+
39+
If you have a degraded MCP, the `DEGRADEDMACHINECOUNT` field is non-zero, similar to the following output:
40+
41+
.Example output
42+
[source,terminal]
43+
----
44+
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
45+
worker   rendered-worker-404caf3180818d8ac1f50c32f14b57c3   False     True       True       2              1                   1                     1                      5h51m
46+
----
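
If you want to script this check, the following Python sketch shows one possible approach. It assumes that the output of `oc get mcp` is available as text with the column headers shown above. The function name `degraded_pools` is hypothetical.

```python
def degraded_pools(table_text):
    """Return the names of pools whose DEGRADEDMACHINECOUNT is non-zero.

    table_text: output in the whitespace-separated column format shown
    in the example output, including the header row.
    """
    lines = [line for line in table_text.splitlines() if line.strip()]
    if not lines:
        return []

    # Locate the columns by header name instead of by fixed position.
    header = lines[0].split()
    name_col = header.index("NAME")
    count_col = header.index("DEGRADEDMACHINECOUNT")

    pools = []
    for line in lines[1:]:
        columns = line.split()
        if int(columns[count_col]) > 0:
            pools.append(columns[name_col])
    return pools
```

Looking up the columns by header name keeps the sketch working if the column order changes, but it still depends on the header names in the example output.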
47+
48+
You can determine if the problem is caused by configuration drift by examining the machine config pool:
49+
50+
[source,terminal]
51+
----
52+
$ oc describe mcp worker
53+
----
54+
55+
.Example output
56+
[source,terminal]
57+
----
58+
...
59+
Last Transition Time: 2021-12-20T18:54:00Z
60+
Message: Node ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4 is reporting: "content mismatch for file \"/etc/mco-test-file\"" <1>
61+
Reason: 1 nodes are reporting degraded status on sync
62+
Status: True
63+
Type: NodeDegraded <2>
64+
...
65+
----
66+
<1> This message shows that a node's `/etc/mco-test-file` file, which was added by the machine config, has changed outside of the machine config.
67+
<2> The `NodeDegraded` condition is `True`, which indicates that at least one node in the pool is degraded.
68+
69+
Or, if you know which node is degraded, examine that node:
70+
71+
[source,terminal]
72+
----
73+
$ oc describe node/ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4
74+
----
75+
76+
.Example output
77+
[source,terminal]
78+
----
79+
...
80+
81+
Annotations: cloud.network.openshift.io/egress-ipconfig: [{"interface":"nic0","ifaddr":{"ipv4":"10.0.128.0/17"},"capacity":{"ip":10}}]
82+
csi.volume.kubernetes.io/nodeid:
83+
{"pd.csi.storage.gke.io":"projects/openshift-gce-devel-ci/zones/us-central1-a/instances/ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4"}
84+
machine.openshift.io/machine: openshift-machine-api/ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4
85+
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
86+
machineconfiguration.openshift.io/currentConfig: rendered-worker-67bd55d0b02b0f659aef33680693a9f9
87+
machineconfiguration.openshift.io/desiredConfig: rendered-worker-67bd55d0b02b0f659aef33680693a9f9
88+
machineconfiguration.openshift.io/reason: content mismatch for file "/etc/mco-test-file" <1>
89+
machineconfiguration.openshift.io/state: Degraded <2>
90+
...
91+
----
92+
<1> The error message indicating that configuration drift was detected between the node and the listed machine config. Here the error message indicates that the contents of the `/etc/mco-test-file` file, which was added by the machine config, have changed outside of the machine config.
93+
<2> The state of the node is `Degraded`.
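
The same two annotations can be read programmatically. The following Python sketch assumes that you have the node object as JSON text, for example from `oc get node <node_name> -o json`, and that the annotations use the keys shown in the example output. The function name `node_mcd_state` is hypothetical.

```python
import json

# Annotation keys as they appear in the example output.
STATE_KEY = "machineconfiguration.openshift.io/state"
REASON_KEY = "machineconfiguration.openshift.io/reason"


def node_mcd_state(node_json):
    """Return a (state, reason) tuple read from the node annotations.

    node_json: the node object serialized as JSON text.
    A missing annotation is returned as None.
    """
    node = json.loads(node_json)
    metadata = node.get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    return annotations.get(STATE_KEY), annotations.get(REASON_KEY)
```

For the node in the example output, the state is `Degraded` and the reason is the `content mismatch` message.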
94+
95+
You can correct configuration drift and return the node to the `Ready` state by performing one of the following remediations:
96+
97+
* Ensure that the contents and file permissions of the files on the node match what is configured in the machine config. You can manually rewrite the file
98+
contents or change the file permissions.
99+
* Generate a link:https://access.redhat.com/solutions/5414371[force file] on the degraded node. The force file causes the MCD to bypass the usual configuration drift detection and reapply the current machine config.
100+
+
101+
[NOTE]
102+
====
103+
Generating a force file on a node causes that node to reboot.
104+
====
105+

modules/machine-config-overview.adoc

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -66,6 +66,8 @@ The MCO is not the only Operator that can change operating system components on
6666

6767
Tasks for the MCO configuration that can be done post-installation are included in the following procedures. See descriptions of {op-system} bare metal installation for system configuration tasks that must be done during or before {product-title} installation.
6868

69+
There might be situations where the configuration on a node does not fully match what the currently applied machine config specifies. This state is called _configuration drift_. The Machine Config Daemon (MCD) regularly checks the nodes for configuration drift. If the MCD detects configuration drift, the MCO marks the node `Degraded` until an administrator corrects the node configuration. A degraded node is online and operational, but it cannot be updated. For more information on configuration drift, see _Understanding configuration drift detection_.
70+
6971
== Project
7072

7173
See the link:https://github.com/openshift/machine-config-operator[openshift-machine-config-operator] GitHub site for details.

modules/understanding-machine-config-operator.adoc

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -55,3 +55,5 @@ The following modifications do not trigger a node reboot:
5555
5656
* When the MCO detects changes to the `/etc/containers/registries.conf` file, such as adding or editing an `ImageContentSourcePolicy` object, it drains the corresponding nodes, applies the changes, and uncordons the nodes.
5757
====
58+
59+
There might be situations where the configuration on a node does not fully match what the currently applied machine config specifies. This state is called _configuration drift_. The Machine Config Daemon (MCD) regularly checks the nodes for configuration drift. If the MCD detects configuration drift, the MCO marks the node `Degraded` until an administrator corrects the node configuration. A degraded node is online and operational, but it cannot be updated.

post_installation_configuration/machine-configuration-tasks.adoc

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@ Tasks in this section describe how to use features of the Machine Config Operato
1616

1717
include::modules/machine-config-operator.adoc[leveloffset=+2]
1818
include::modules/machine-config-overview.adoc[leveloffset=+2]
19+
include::modules/machine-config-drift-detection.adoc[leveloffset=+2]
1920
include::modules/checking-mco-status.adoc[leveloffset=+2]
2021

2122
[id="using-machineconfigs-to-change-machines"]
@@ -26,6 +27,8 @@ link:https://access.redhat.com/solutions/4510281[updating] SSH authorized keys,
2627

2728
{product-title} supports link:https://coreos.github.io/ignition/configuration-v3_2/[Ignition specification version 3.2]. All new machine configs you create going forward should be based on Ignition specification version 3.2. If you are upgrading your {product-title} cluster, any existing Ignition specification version 2.x machine configs will be translated automatically to specification version 3.2.
2829

30+
There might be situations where the configuration on a node does not fully match what the currently applied machine config specifies. This state is called _configuration drift_. The Machine Config Daemon (MCD) regularly checks the nodes for configuration drift. If the MCD detects configuration drift, the MCO marks the node `Degraded` until an administrator corrects the node configuration. A degraded node is online and operational, but it cannot be updated. For more information on configuration drift, see xref:../post_installation_configuration/machine-configuration-tasks.adoc#machine-config-drift-detection_post-install-machine-configuration-tasks[Understanding configuration drift detection].
31+
2932
include::modules/installation-special-config-chrony.adoc[leveloffset=+2]
3033
include::modules/cnf-disable-chronyd.adoc[leveloffset=+2]
3134
include::modules/nodes-nodes-kernel-arguments.adoc[leveloffset=+2]
