Improve node-{join,left} logging for troubleshooting (#92742)

DaveCTurner · web-flow · commit 518274831859 · 2023-01-09T04:34:41.000-05:00
Today to troubleshoot an unstable cluster we ask users to parse the rather complex `node-join` and `node-left` messages emitted by the `MasterService`. These messages may refer to many nodes, may be truncated, and are generally pretty hard to work with. With this commit we start to emit a simplified log message about each node added and removed.

It also renames the respective executor classes:

- `JoinTaskExecutor` -> `NodeJoinExecutor`
- `NodeRemovalClusterStateTaskExecutor` -> `NodeLeftExecutor`

This brings their names in line with each other and with the messages that they emit, whilst preserving the older `node-join` and `node-left` terminology as reported by the `MasterService`.

Finally, it updates the troubleshooting docs to reflect these new and simplified logs.

Relates #92741
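Because each new message describes exactly one node, it should be straightforward to pull the node name and reason out of a log line mechanically. The following is a minimal sketch of that idea, not part of this commit: `NodeEventParser` and `NodeEvent` are hypothetical names, and the pattern assumes the single-line shape produced by the `logger.info` format strings in this change (`node-left: [{}] with reason [{}]` and the `node-join` equivalent), with the node name as the first brace-delimited field of the node description.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of Elasticsearch.
public class NodeEventParser {

    /** One parsed membership event. The record and its field names are assumptions. */
    record NodeEvent(String kind, String nodeName, String reason) {}

    // Assumed single-line shape: "... node-left: [{name}{id}{...}] with reason [text]".
    // The reason may itself contain brackets, so it is matched greedily up to the final "]".
    private static final Pattern EVENT = Pattern.compile(
        "(node-join|node-left): \\[((?:\\{[^}]*\\})+)\\] with reason \\[(.*)\\]\\s*$"
    );

    static Optional<NodeEvent> parse(String line) {
        Matcher m = EVENT.matcher(line);
        if (m.find() == false) {
            return Optional.empty(); // not a node-join or node-left message
        }
        String description = m.group(2);
        // The first brace-delimited field of the description is taken to be the node name.
        String nodeName = description.substring(1, description.indexOf('}'));
        return Optional.of(new NodeEvent(m.group(1), nodeName, m.group(3)));
    }

    public static void main(String[] args) {
        String line = "[2022-03-21T11:02:59,892][INFO ][o.e.c.c.NodeJoinExecutor] [instance-0000000000] "
            + "node-join: [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{UNw_RuazQCSBskWZV8ID_w}"
            + "{172.27.47.21}{172.27.47.21:19054}{m}] "
            + "with reason [joining after restart, removed [24s] ago with reason [disconnected]]";
        parse(line).ifPresent(System.out::println);
    }
}
```

Matching the reason greedily keeps nested brackets, such as the `removed [24s] ago` detail in a `node-join` reason, inside a single captured group. Lines that have been wrapped for readability would need to be joined before parsing.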
diff --git a/docs/changelog/92742.yaml b/docs/changelog/92742.yaml
@@ -0,0 +1,5 @@
+pr: 92742
+summary: "Improve node-{join,left} logging for troubleshooting"
+area: Cluster Coordination
+type: enhancement
+issues: []
diff --git a/docs/reference/modules/discovery/fault-detection.asciidoc b/docs/reference/modules/discovery/fault-detection.asciidoc
@@ -52,7 +52,7 @@ logs.
 * The master may appear busy due to frequent cluster state updates.
 
 To troubleshoot a cluster in this state, first ensure the cluster has a
-<<modules-discovery-troubleshooting,stable master>>. Next, focus on the nodes
+<<discovery-troubleshooting,stable master>>. Next, focus on the nodes
 unexpectedly leaving the cluster ahead of all other issues. It will not be
 possible to solve other issues until the cluster has a stable master node and
 stable node membership.
@@ -62,23 +62,33 @@ tools only offer a view of the state of the cluster at a single point in time.
 Instead, look at the cluster logs to see the pattern of behaviour over time.
 Focus particularly on logs from the elected master. When a node leaves the
 cluster, logs for the elected master include a message like this (with line
-breaks added for clarity):
+breaks added to make it easier to read):
 
 [source,text]
 ----
-[2022-03-21T11:02:35,513][INFO ][o.e.c.s.MasterService    ]
-    [instance-0000000000] node-left[
-        {instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}
-            reason: disconnected,
-        {tiebreaker-0000000003}{UNw_RuazQCSBskWZV8ID_w}{bltyVOQ-RNu20OQfTHSLtA}{172.27.161.154}{172.27.161.154:19251}{mv}
-            reason: disconnected
-        ], term: 14, version: 1653415, ...
+[2022-03-21T11:02:35,513][INFO ][o.e.c.c.NodeLeftExecutor] [instance-0000000000] node-left:
+    [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}]
+    with reason [disconnected]
+----
+
+This message says that the `NodeLeftExecutor` on the elected master
+(`instance-0000000000`) processed a `node-left` task, identifying the node that
+was removed and the reason for its removal. When the node joins the cluster
+again, logs for the elected master will include a message like this (with line
+breaks added to make it easier to read):
+
+[source,text]
+----
+[2022-03-21T11:02:59,892][INFO ][o.e.c.c.NodeJoinExecutor] [instance-0000000000] node-join:
+    [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{UNw_RuazQCSBskWZV8ID_w}{172.27.47.21}{172.27.47.21:19054}{m}]
+    with reason [joining after restart, removed [24s] ago with reason [disconnected]]
 ----
 
-This message says that the `MasterService` on the elected master
-(`instance-0000000000`) is processing a `node-left` task. It lists the nodes
-that are being removed and the reasons for their removal. Other nodes may log
-similar messages, but report fewer details:
+This message says that the `NodeJoinExecutor` on the elected master
+(`instance-0000000000`) processed a `node-join` task, identifying the node that
+was added to the cluster and the reason for the task.
+
+Other nodes may log similar messages, but report fewer details:
 
 [source,text]
 ----
@@ -89,9 +99,10 @@ similar messages, but report fewer details:
     }, term: 14, version: 1653415, reason: Publication{term=14, version=1653415}
 ----
 
-Focus on the one from the `MasterService` which is only emitted on the elected
-master, since it contains more details. If you don't see the messages from the
-`MasterService`, check that:
+These messages are not especially useful for troubleshooting, so focus on the
+ones from the `NodeLeftExecutor` and `NodeJoinExecutor` which are only emitted
+on the elected master and which contain more details. If you don't see the
+messages from the `NodeLeftExecutor` and `NodeJoinExecutor`, check that:
 
 * You're looking at the logs for the elected master node.
 
@@ -104,18 +115,14 @@ start or stop following the elected master. You can use these messages to
 determine each node's view of the state of the master over time.
 
 If a node restarts, it will leave the cluster and then join the cluster again.
-When it rejoins, the `MasterService` will log that it is processing a
-`node-join` task. You can tell from the master logs that the node was restarted
-because the `node-join` message will indicate that it is
-`joining after restart`. In older {es} versions, you can also determine that a
-node restarted by looking at the second "ephemeral" ID in the `node-left` and
-subsequent `node-join` messages. This ephemeral ID is different each time the
-node starts up. If a node is unexpectedly restarting, you'll need to look at
-the node's logs to see why it is shutting down.
+When it rejoins, the `NodeJoinExecutor` will log that it processed a
+`node-join` task indicating that the node is `joining after restart`. If a node
+is unexpectedly restarting, look at the node's logs to see why it is shutting
+down.
 
 If the node did not restart then you should look at the reason for its
-departure in the `node-left` message, which is reported after each node. There
-are three possible reasons:
+departure more closely. Each reason has different troubleshooting steps,
+described below. There are three possible reasons:
 
 * `disconnected`: The connection from the master node to the removed node was
 closed.
@@ -134,6 +141,10 @@ control this mechanism.
 
 ===== Diagnosing `disconnected` nodes
 
+Nodes typically leave the cluster with reason `disconnected` when they shut
+down, but if they rejoin the cluster without restarting then there is some
+other problem.
+
 {es} is designed to run on a fairly reliable network. It opens a number of TCP
 connections between nodes and expects these connections to remain open forever.
 If a connection is closed then {es} will try and reconnect, so the occasional
@@ -194,6 +205,10 @@ the logs on the elected master.
 
 ===== Diagnosing `follower check retry count exceeded` nodes
 
+Nodes sometimes leave the cluster with reason `follower check retry count
+exceeded` when they shut down, but if they rejoin the cluster without
+restarting then there is some other problem.
+
 {es} needs every node to respond to network messages successfully and
 reasonably quickly. If a node rejects requests or does not respond at all then
 it can be harmful to the cluster. If enough consecutive checks fail then the
diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
@@ -137,7 +137,7 @@ public class Coordinator extends AbstractLifecycleComponent implements ClusterSt
     private final AllocationService allocationService;
     private final JoinHelper joinHelper;
     private final JoinValidationService joinValidationService;
-    private final NodeRemovalClusterStateTaskExecutor nodeRemovalExecutor;
+    private final NodeLeftExecutor nodeLeftExecutor;
     private final Supplier<CoordinationState.PersistedState> persistedStateSupplier;
     private final NoMasterBlockService noMasterBlockService;
     final Object mutex = new Object(); // package-private to allow tests to call methods that assert that the mutex is held
@@ -205,7 +205,7 @@ public Coordinator(
         this.transportService = transportService;
         this.masterService = masterService;
         this.allocationService = allocationService;
-        this.onJoinValidators = JoinTaskExecutor.addBuiltInJoinValidators(onJoinValidators);
+        this.onJoinValidators = NodeJoinExecutor.addBuiltInJoinValidators(onJoinValidators);
         this.singleNodeDiscovery = DiscoveryModule.isSingleNodeDiscovery(settings);
         this.electionStrategy = electionStrategy;
         this.joinReasonService = new JoinReasonService(transportService.getThreadPool()::relativeTimeInMillis);
@@ -272,7 +272,7 @@ public Coordinator(
             this::removeNode,
             nodeHealthService
         );
-        this.nodeRemovalExecutor = new NodeRemovalClusterStateTaskExecutor(allocationService);
+        this.nodeLeftExecutor = new NodeLeftExecutor(allocationService);
         this.clusterApplier = clusterApplier;
         masterService.setClusterStateSupplier(this::getStateForMasterService);
         this.reconfigurator = new Reconfigurator(settings, clusterSettings);
@@ -339,16 +339,11 @@ private void onLeaderFailure(Supplier<String> message, Exception e) {
     private void removeNode(DiscoveryNode discoveryNode, String reason) {
         synchronized (mutex) {
             if (mode == Mode.LEADER) {
-                var task = new NodeRemovalClusterStateTaskExecutor.Task(
-                    discoveryNode,
-                    reason,
-                    () -> joinReasonService.onNodeRemoved(discoveryNode, reason)
-                );
                 masterService.submitStateUpdateTask(
                     "node-left",
-                    task,
+                    new NodeLeftExecutor.Task(discoveryNode, reason, () -> joinReasonService.onNodeRemoved(discoveryNode, reason)),
                     ClusterStateTaskConfig.build(Priority.IMMEDIATE),
-                    nodeRemovalExecutor
+                    nodeLeftExecutor
                 );
             }
         }
@@ -664,7 +659,7 @@ private void validateJoinRequest(JoinRequest joinRequest, ActionListener<Void> v
             if (stateForJoinValidation.getBlocks().hasGlobalBlock(STATE_NOT_RECOVERED_BLOCK) == false) {
                 // We do this in a couple of places including the cluster update thread. This one here is really just best effort to ensure
                 // we fail as fast as possible.
-                JoinTaskExecutor.ensureVersionBarrier(
+                NodeJoinExecutor.ensureVersionBarrier(
                     joinRequest.getSourceNode().getVersion(),
                     stateForJoinValidation.getNodes().getMinNodeVersion()
                 );
diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java b/server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java
@@ -67,7 +67,7 @@ public class JoinHelper {
     private final MasterService masterService;
     private final ClusterApplier clusterApplier;
     private final TransportService transportService;
-    private final JoinTaskExecutor joinTaskExecutor;
+    private final NodeJoinExecutor nodeJoinExecutor;
     private final LongSupplier currentTermSupplier;
     private final NodeHealthService nodeHealthService;
     private final JoinReasonService joinReasonService;
@@ -94,7 +94,7 @@ public class JoinHelper {
         this.clusterApplier = clusterApplier;
         this.transportService = transportService;
         this.circuitBreakerService = circuitBreakerService;
-        this.joinTaskExecutor = new JoinTaskExecutor(allocationService, rerouteService);
+        this.nodeJoinExecutor = new NodeJoinExecutor(allocationService, rerouteService);
         this.currentTermSupplier = currentTermSupplier;
         this.nodeHealthService = nodeHealthService;
         this.joinReasonService = joinReasonService;
@@ -389,13 +389,17 @@ default void close(Mode newMode) {}
     class LeaderJoinAccumulator implements JoinAccumulator {
         @Override
         public void handleJoinRequest(DiscoveryNode sender, ActionListener<Void> joinListener) {
-            final JoinTask task = JoinTask.singleNode(
-                sender,
-                joinReasonService.getJoinReason(sender, Mode.LEADER),
-                joinListener,
-                currentTermSupplier.getAsLong()
+            masterService.submitStateUpdateTask(
+                "node-join",
+                JoinTask.singleNode(
+                    sender,
+                    joinReasonService.getJoinReason(sender, Mode.LEADER),
+                    joinListener,
+                    currentTermSupplier.getAsLong()
+                ),
+                ClusterStateTaskConfig.build(Priority.URGENT),
+                nodeJoinExecutor
             );
-            masterService.submitStateUpdateTask("node-join", task, ClusterStateTaskConfig.build(Priority.URGENT), joinTaskExecutor);
         }
 
         @Override
@@ -461,7 +465,7 @@ public void close(Mode newMode) {
                     "elected-as-master ([" + joinTask.nodeCount() + "] nodes joined)",
                     joinTask,
                     ClusterStateTaskConfig.build(Priority.URGENT),
-                    joinTaskExecutor
+                    nodeJoinExecutor
 
                 );
             } else {
diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/NodeJoinExecutor.java b/server/src/main/java/org/elasticsearch/cluster/coordination/NodeJoinExecutor.java
@@ -36,14 +36,14 @@
 
 import static org.elasticsearch.gateway.GatewayService.STATE_NOT_RECOVERED_BLOCK;
 
-public class JoinTaskExecutor implements ClusterStateTaskExecutor<JoinTask> {
+public class NodeJoinExecutor implements ClusterStateTaskExecutor<JoinTask> {
 
-    private static final Logger logger = LogManager.getLogger(JoinTaskExecutor.class);
+    private static final Logger logger = LogManager.getLogger(NodeJoinExecutor.class);
 
     private final AllocationService allocationService;
     private final RerouteService rerouteService;
 
-    public JoinTaskExecutor(AllocationService allocationService, RerouteService rerouteService) {
+    public NodeJoinExecutor(AllocationService allocationService, RerouteService rerouteService) {
         this.allocationService = allocationService;
         this.rerouteService = rerouteService;
     }
@@ -135,7 +135,14 @@ public ClusterState execute(BatchExecutionContext<JoinTask> batchExecutionContex
                         continue;
                     }
                 }
-                onTaskSuccess.add(() -> nodeJoinTask.listener().onResponse(null));
+                onTaskSuccess.add(() -> {
+                    logger.info(
+                        "node-join: [{}] with reason [{}]",
+                        nodeJoinTask.node().descriptionWithoutAttributes(),
+                        nodeJoinTask.reason()
+                    );
+                    nodeJoinTask.listener().onResponse(null);
+                });
             }
             joinTaskContext.success(() -> {
                 for (Runnable joinCompleter : onTaskSuccess) {
diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/NodeLeftExecutor.java b/server/src/main/java/org/elasticsearch/cluster/coordination/NodeLeftExecutor.java
@@ -19,9 +19,9 @@
 import org.elasticsearch.cluster.service.MasterService;
 import org.elasticsearch.persistent.PersistentTasksCustomMetadata;
 
-public class NodeRemovalClusterStateTaskExecutor implements ClusterStateTaskExecutor<NodeRemovalClusterStateTaskExecutor.Task> {
+public class NodeLeftExecutor implements ClusterStateTaskExecutor<NodeLeftExecutor.Task> {
 
-    private static final Logger logger = LogManager.getLogger(NodeRemovalClusterStateTaskExecutor.class);
+    private static final Logger logger = LogManager.getLogger(NodeLeftExecutor.class);
 
     private final AllocationService allocationService;
 
@@ -41,7 +41,7 @@ public String toString() {
         }
     }
 
-    public NodeRemovalClusterStateTaskExecutor(AllocationService allocationService) {
+    public NodeLeftExecutor(AllocationService allocationService) {
         this.allocationService = allocationService;
     }
 
@@ -52,13 +52,21 @@ public ClusterState execute(BatchExecutionContext<Task> batchExecutionContext) t
         boolean removed = false;
         for (final var taskContext : batchExecutionContext.taskContexts()) {
             final var task = taskContext.getTask();
+            final String reason;
             if (initialState.nodes().nodeExists(task.node())) {
                 remainingNodesBuilder.remove(task.node());
                 removed = true;
+                reason = task.reason();
             } else {
                 logger.debug("node [{}] does not exist in cluster state, ignoring", task);
+                reason = null;
             }
-            taskContext.success(task.onClusterStateProcessed::run);
+            taskContext.success(() -> {
+                if (reason != null) {
+                    logger.info("node-left: [{}] with reason [{}]", task.node().descriptionWithoutAttributes(), reason);
+                }
+                task.onClusterStateProcessed.run();
+            });
         }
 
         if (removed == false) {
diff --git a/server/src/main/java/org/elasticsearch/cluster/metadata/DesiredNodeWithStatus.java b/server/src/main/java/org/elasticsearch/cluster/metadata/DesiredNodeWithStatus.java
@@ -47,7 +47,7 @@ public record DesiredNodeWithStatus(DesiredNode desiredNode, Status status)
             ),
             // An unknown status is expected during upgrades to versions >= STATUS_TRACKING_SUPPORT_VERSION
             // the desired node status would be populated when a node in the newer version is elected as
-            // master, the desired nodes status update happens in JoinTaskExecutor.
+            // master, the desired nodes status update happens in NodeJoinExecutor.
             args[6] == null ? Status.PENDING : (Status) args[6]
         )
     );
@@ -84,7 +84,7 @@ public static DesiredNodeWithStatus readFrom(StreamInput in) throws IOException
             // since it's impossible to know if a node that was supposed to
             // join the cluster, it joined. The status will be updated
             // once the master node is upgraded to a version >= STATUS_TRACKING_SUPPORT_VERSION
-            // in JoinTaskExecutor or when the desired nodes are upgraded to a new version.
+            // in NodeJoinExecutor or when the desired nodes are upgraded to a new version.
             status = Status.PENDING;
         }
         return new DesiredNodeWithStatus(desiredNode, status);
diff --git a/server/src/main/java/org/elasticsearch/cluster/metadata/DesiredNodes.java b/server/src/main/java/org/elasticsearch/cluster/metadata/DesiredNodes.java
@@ -8,7 +8,9 @@
 
 package org.elasticsearch.cluster.metadata;
 
+import org.elasticsearch.action.admin.cluster.desirednodes.TransportUpdateDesiredNodesAction;
 import org.elasticsearch.cluster.ClusterState;
+import org.elasticsearch.cluster.coordination.NodeJoinExecutor;
 import org.elasticsearch.cluster.node.DiscoveryNode;
 import org.elasticsearch.cluster.node.DiscoveryNodes;
 import org.elasticsearch.common.io.stream.StreamInput;
@@ -97,8 +99,7 @@
  *  </ul>
  *
  * <p>
- *  See {@code JoinTaskExecutor} and {@code TransportUpdateDesiredNodesAction} for more details about
- *  desired nodes status tracking.
+ *  See {@link NodeJoinExecutor} and {@link TransportUpdateDesiredNodesAction} for more details about desired nodes status tracking.
  * </p>
  *
  * <p>
diff --git a/server/src/test/java/org/elasticsearch/cluster/coordination/NodeJoinExecutorTests.java b/server/src/test/java/org/elasticsearch/cluster/coordination/NodeJoinExecutorTests.java
diff --git a/server/src/test/java/org/elasticsearch/cluster/coordination/NodeLeftExecutorTests.java b/server/src/test/java/org/elasticsearch/cluster/coordination/NodeLeftExecutorTests.java
diff --git a/server/src/test/java/org/elasticsearch/indices/cluster/ClusterStateChanges.java b/server/src/test/java/org/elasticsearch/indices/cluster/ClusterStateChanges.java