|
| 1 | +// Module included in the following assemblies: |
| 2 | +// |
| 3 | +// * post_installation_configuration/machine-configuration-tasks.adoc |
| 4 | + |
| 5 | +[id="machine-config-drift-detection_{context}"] |
| 6 | += Understanding configuration drift detection |
| 7 | + |
| 8 | +Sometimes the on-disk state of a node differs from what is configured in the machine config. This is known as _configuration drift_. For example, a cluster admin might manually modify a file, a systemd unit file, or a file permission that was configured through a machine config. Configuration drift can cause problems between nodes in a machine config pool or when the machine configs are updated. |
| 9 | + |
| 10 | +The Machine Config Operator (MCO) uses the Machine Config Daemon (MCD) to check nodes for configuration drift on a regular basis. If configuration drift is detected, the MCO sets the node and the machine config pool (MCP) to `Degraded` and reports the error. A degraded node is online and operational, but it cannot be updated. |
| 11 | + |
| 12 | +The MCD performs configuration drift detection upon each of the following conditions: |
| 13 | + |
| 14 | +* When a node boots. |
| 15 | +* After any of the files (Ignition files and systemd drop-in units) specified in the machine config are modified outside of the machine config. |
| 16 | +* Before a new machine config is applied. |
| 17 | ++ |
| 18 | +[NOTE] |
| 19 | +==== |
| 20 | +If you apply a new machine config to the nodes, the MCD temporarily shuts down configuration drift detection. This shutdown is needed because the new machine config necessarily differs from the machine config on the nodes. After the new machine config is applied, the MCD restarts detecting configuration drift using the new machine config. |
| 21 | +==== |
| 22 | + |
| 23 | +When performing configuration drift detection, the MCD validates that the file contents and permissions fully match what the currently applied machine config specifies. Typically, the MCD detects configuration drift in less than a second after the detection is triggered. |
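As an illustration only, the validation described above can be approximated by comparing a file's contents and permission bits against expected values. This sketch is not the MCD implementation: the function name `check_drift`, its arguments, and the wording of the mode message are hypothetical, and it assumes that `cmp` and GNU `stat` are available.

```shell
# check_drift FILE EXPECTED_CONTENT_FILE EXPECTED_MODE
# Prints a message and returns 1 if FILE differs from what is expected.
# Returns 0 if both the contents and the permission bits match.
check_drift() {
  file=$1
  expected=$2
  mode=$3

  # Compare the contents byte for byte.
  if ! cmp -s "$file" "$expected"; then
    echo "content mismatch for file \"$file\""
    return 1
  fi

  # Compare the permission bits in octal (GNU stat syntax).
  actual_mode=$(stat -c '%a' "$file")
  if [ "$actual_mode" != "$mode" ]; then
    echo "mode mismatch for file \"$file\": expected $mode, got $actual_mode"
    return 1
  fi

  return 0
}
```

The content check runs first, so a file that differs in both contents and mode reports only the content mismatch.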
| 24 | + |
| 25 | +If the MCD detects configuration drift, the MCD performs the following tasks: |
| 26 | + |
| 27 | +* Emits an error to the console logs |
| 28 | +* Emits a Kubernetes event |
| 29 | +* Stops further detection on the node |
| 30 | +* Sets the node and MCP to `Degraded` |
| 31 | + |
| 32 | +You can check if you have a degraded node by listing the MCPs: |
| 33 | + |
| 34 | +[source,terminal] |
| 35 | +---- |
| 36 | +$ oc get mcp worker |
| 37 | +---- |
| 38 | + |
| 39 | +If you have a degraded MCP, the `DEGRADEDMACHINECOUNT` field is non-zero, similar to the following output: |
| 40 | + |
| 41 | +.Example output |
| 42 | +[source,terminal] |
| 43 | +---- |
| 44 | +NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE |
| 45 | +worker rendered-worker-404caf3180818d8ac1f50c32f14b57c3 False True True 2 1 1 1 5h51m |
| 46 | +---- |
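To avoid reading the table by eye, the following minimal sketch filters `oc get mcp` output for pools whose `DEGRADEDMACHINECOUNT` column is non-zero. The function name `degraded_pools` is hypothetical. The sketch assumes the default table layout shown above, with a header row and no empty columns.

```shell
# Read `oc get mcp` table output on stdin and print the name and
# degraded machine count of every pool with a non-zero count.
degraded_pools() {
  awk '
    # Header row: locate the DEGRADEDMACHINECOUNT column by name.
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "DEGRADEDMACHINECOUNT") col = i
      next
    }
    # Data rows: print pools that report at least one degraded machine.
    col && $col + 0 > 0 { print $1, $col }
  '
}

# Example usage against a live cluster:
#   oc get mcp | degraded_pools
```

No output means that no pool currently reports degraded machines.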
| 47 | + |
| 48 | +You can determine if the problem is caused by configuration drift by examining the machine config pool: |
| 49 | + |
| 50 | +[source,terminal] |
| 51 | +---- |
| 52 | +$ oc describe mcp worker |
| 53 | +---- |
| 54 | + |
| 55 | +.Example output |
| 56 | +[source,terminal] |
| 57 | +---- |
| 58 | + ... |
| 59 | + Last Transition Time: 2021-12-20T18:54:00Z |
| 60 | + Message: Node ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4 is reporting: "content mismatch for file \"/etc/mco-test-file\"" <1> |
| 61 | + Reason: 1 nodes are reporting degraded status on sync |
| 62 | + Status: True |
| 63 | + Type: NodeDegraded <2> |
| 64 | + ... |
| 65 | +---- |
| 66 | +<1> This message shows that a node's `/etc/mco-test-file` file, which was added by the machine config, has changed outside of the machine config. |
| 67 | +<2> The `NodeDegraded` condition type indicates that at least one node in the pool is reporting a degraded status. |
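If you only need the message, a JSONPath query can extract it instead of scanning the full `oc describe` output. This is a sketch: the function name `node_degraded_message` is hypothetical, and it assumes the condition is stored under `.status.conditions` with the type `NodeDegraded`, as the example output suggests.

```shell
# Print the message of the NodeDegraded condition for the given
# machine config pool, for example: node_degraded_message worker
node_degraded_message() {
  oc get mcp "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}'
}
```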
| 68 | + |
| 69 | +Or, if you know which node is degraded, examine that node: |
| 70 | + |
| 71 | +[source,terminal] |
| 72 | +---- |
| 73 | +$ oc describe node/ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4 |
| 74 | +---- |
| 75 | + |
| 76 | +.Example output |
| 77 | +[source,terminal] |
| 78 | +---- |
| 79 | + ... |
| 80 | + |
| 81 | +Annotations: cloud.network.openshift.io/egress-ipconfig: [{"interface":"nic0","ifaddr":{"ipv4":"10.0.128.0/17"},"capacity":{"ip":10}}] |
| 82 | + csi.volume.kubernetes.io/nodeid: |
| 83 | + {"pd.csi.storage.gke.io":"projects/openshift-gce-devel-ci/zones/us-central1-a/instances/ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4"} |
| 84 | + machine.openshift.io/machine: openshift-machine-api/ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4 |
| 85 | + machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable |
| 86 | + machineconfiguration.openshift.io/currentConfig: rendered-worker-67bd55d0b02b0f659aef33680693a9f9 |
| 87 | + machineconfiguration.openshift.io/desiredConfig: rendered-worker-67bd55d0b02b0f659aef33680693a9f9 |
| 88 | + machineconfiguration.openshift.io/reason: content mismatch for file "/etc/mco-test-file" <1> |
| 89 | + machineconfiguration.openshift.io/state: Degraded <2> |
| 90 | + ... |
| 91 | +---- |
| 92 | +<1> The error message indicating that configuration drift was detected between the node and the listed machine config. Here the error message indicates that the contents of the `/etc/mco-test-file` file, which was added by the machine config, have changed outside of the machine config. |
| 93 | +<2> The state of the node is `Degraded`. |
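The two annotations called out above can also be read directly. The following sketch prints the state on the first line and the reason on the second. The function name `node_mcd_state` is hypothetical. The dots inside the annotation keys are escaped so that JSONPath treats each key as a single field.

```shell
# Print the machine config state and reason annotations for a node,
# for example: node_mcd_state ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4
node_mcd_state() {
  oc get node "$1" \
    -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}{.metadata.annotations.machineconfiguration\.openshift\.io/reason}{"\n"}'
}
```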
| 94 | + |
| 95 | +You can correct configuration drift and return the node to the `Ready` state by performing one of the following remediations: |
| 96 | + |
| 97 | +* Ensure that the contents and file permissions of the files on the node match what is configured in the machine config. You can manually rewrite the file |
| 98 | +contents or change the file permissions. |
| 99 | +* Generate a link:https://access.redhat.com/solutions/5414371[force file] on the degraded node. The force file causes the MCD to bypass the usual configuration drift detection and reapply the current machine config. |
| 100 | ++ |
| 101 | +[NOTE] |
| 102 | +==== |
| 103 | +Generating a force file on a node causes that node to reboot. |
| 104 | +==== |
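As a sketch of the second remediation, the force file can be created from a debug pod on the node. This assumes that the force file path is `/run/machine-config-daemon-force`; confirm the path against the linked article before you rely on it. The function name `force_reapply` is hypothetical.

```shell
# Create the force file on the given node so that the MCD reapplies
# the current machine config. Expect the node to reboot afterward.
force_reapply() {
  oc debug "node/$1" -- chroot /host touch /run/machine-config-daemon-force
}

# Example usage:
#   force_reapply ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4
```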
| 105 | + |