Skip to content

Commit 5182748

Browse files
authored
Improve node-{join,left} logging for troubleshooting (#92742)
Today to troubleshoot an unstable cluster we ask the users to parse the rather complex `node-join` and `node-left` messages emitted by the `MasterService`. These messages may refer to many nodes, may be truncated, and are generally pretty hard to work with. With this commit we start to emit a simplified log message about each node added and removed. It also renames the respective executor classes: - `JoinTaskExecutor` -> `NodeJoinExecutor` - `NodeRemovalClusterStateTaskExecutor` -> `NodeLeftExecutor` This brings their names in line with each other, and the messages that they emit, whilst preserving the older `node-join` and `node-left` terminology as reported by the `MasterService`. Finally, it updates the troubleshooting logs to reflect these new and simplified logs. Relates #92741
1 parent bd0b090 commit 5182748

File tree

11 files changed

+287
-130
lines changed

11 files changed

+287
-130
lines changed

docs/changelog/92742.yaml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
pr: 92742
2+
summary: "Improve node-{join,left} logging for troubleshooting"
3+
area: Cluster Coordination
4+
type: enhancement
5+
issues: []

docs/reference/modules/discovery/fault-detection.asciidoc

Lines changed: 41 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -52,7 +52,7 @@ logs.
5252
* The master may appear busy due to frequent cluster state updates.
5353

5454
To troubleshoot a cluster in this state, first ensure the cluster has a
55-
<<modules-discovery-troubleshooting,stable master>>. Next, focus on the nodes
55+
<<discovery-troubleshooting,stable master>>. Next, focus on the nodes
5656
unexpectedly leaving the cluster ahead of all other issues. It will not be
5757
possible to solve other issues until the cluster has a stable master node and
5858
stable node membership.
@@ -62,23 +62,33 @@ tools only offer a view of the state of the cluster at a single point in time.
6262
Instead, look at the cluster logs to see the pattern of behaviour over time.
6363
Focus particularly on logs from the elected master. When a node leaves the
6464
cluster, logs for the elected master include a message like this (with line
65-
breaks added for clarity):
65+
breaks added to make it easier to read):
6666

6767
[source,text]
6868
----
69-
[2022-03-21T11:02:35,513][INFO ][o.e.c.s.MasterService ]
70-
[instance-0000000000] node-left[
71-
{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}
72-
reason: disconnected,
73-
{tiebreaker-0000000003}{UNw_RuazQCSBskWZV8ID_w}{bltyVOQ-RNu20OQfTHSLtA}{172.27.161.154}{172.27.161.154:19251}{mv}
74-
reason: disconnected
75-
], term: 14, version: 1653415, ...
69+
[2022-03-21T11:02:35,513][INFO ][o.e.c.c.NodeLeftExecutor] [instance-0000000000] node-left:
70+
removed [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}]
71+
with reason [test reason]
72+
----
73+
74+
This message says that the `NodeLeftExecutor` on the elected master
75+
(`instance-0000000000`) processed a `node-left` task, identifying the node that
76+
was removed and the reason for its removal. When the node joins the cluster
77+
again, logs for the elected master will include a message like this (with line
78+
breaks added to make it easier to read):
79+
80+
[source,text]
81+
----
82+
[2022-03-21T11:02:59,892][INFO ][o.e.c.c.NodeJoinExecutor] [instance-0000000000] node-join:
83+
added [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{UNw_RuazQCSBskWZV8ID_w}{172.27.47.21}{172.27.47.21:19054}{m}]
84+
with reason [joining after restart, removed [24s] ago with reason [disconnected]]
7685
----
7786

78-
This message says that the `MasterService` on the elected master
79-
(`instance-0000000000`) is processing a `node-left` task. It lists the nodes
80-
that are being removed and the reasons for their removal. Other nodes may log
81-
similar messages, but report fewer details:
87+
This message says that the `NodeJoinExecutor` on the elected master
88+
(`instance-0000000000`) processed a `node-join` task, identifying the node that
89+
was added to the cluster and the reason for the task.
90+
91+
Other nodes may log similar messages, but report fewer details:
8292

8393
[source,text]
8494
----
@@ -89,9 +99,10 @@ similar messages, but report fewer details:
8999
}, term: 14, version: 1653415, reason: Publication{term=14, version=1653415}
90100
----
91101

92-
Focus on the one from the `MasterService` which is only emitted on the elected
93-
master, since it contains more details. If you don't see the messages from the
94-
`MasterService`, check that:
102+
These messages are not especially useful for troubleshooting, so focus on the
103+
ones from the `NodeLeftExecutor` and `NodeJoinExecutor` which are only emitted
104+
on the elected master and which contain more details. If you don't see the
105+
messages from the `NodeLeftExecutor` and `NodeJoinExecutor`, check that:
95106

96107
* You're looking at the logs for the elected master node.
97108

@@ -104,18 +115,14 @@ start or stop following the elected master. You can use these messages to
104115
determine each node's view of the state of the master over time.
105116

106117
If a node restarts, it will leave the cluster and then join the cluster again.
107-
When it rejoins, the `MasterService` will log that it is processing a
108-
`node-join` task. You can tell from the master logs that the node was restarted
109-
because the `node-join` message will indicate that it is
110-
`joining after restart`. In older {es} versions, you can also determine that a
111-
node restarted by looking at the second "ephemeral" ID in the `node-left` and
112-
subsequent `node-join` messages. This ephemeral ID is different each time the
113-
node starts up. If a node is unexpectedly restarting, you'll need to look at
114-
the node's logs to see why it is shutting down.
118+
When it rejoins, the `NodeJoinExecutor` will log that it processed a
119+
`node-join` task indicating that the node is `joining after restart`. If a node
120+
is unexpectedly restarting, look at the node's logs to see why it is shutting
121+
down.
115122

116123
If the node did not restart then you should look at the reason for its
117-
departure in the `node-left` message, which is reported after each node. There
118-
are three possible reasons:
124+
departure more closely. Each reason has different troubleshooting steps,
125+
described below. There are three possible reasons:
119126

120127
* `disconnected`: The connection from the master node to the removed node was
121128
closed.
@@ -134,6 +141,10 @@ control this mechanism.
134141

135142
===== Diagnosing `disconnected` nodes
136143

144+
Nodes typically leave the cluster with reason `disconnected` when they shut
145+
down, but if they rejoin the cluster without restarting then there is some
146+
other problem.
147+
137148
{es} is designed to run on a fairly reliable network. It opens a number of TCP
138149
connections between nodes and expects these connections to remain open forever.
139150
If a connection is closed then {es} will try and reconnect, so the occasional
@@ -194,6 +205,10 @@ the logs on the elected master.
194205

195206
===== Diagnosing `follower check retry count exceeded` nodes
196207

208+
Nodes sometimes leave the cluster with reason `follower check retry count
209+
exceeded` when they shut down, but if they rejoin the cluster without
210+
restarting then there is some other problem.
211+
197212
{es} needs every node to respond to network messages successfully and
198213
reasonably quickly. If a node rejects requests or does not respond at all then
199214
it can be harmful to the cluster. If enough consecutive checks fail then the

server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java

Lines changed: 6 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -137,7 +137,7 @@ public class Coordinator extends AbstractLifecycleComponent implements ClusterSt
137137
private final AllocationService allocationService;
138138
private final JoinHelper joinHelper;
139139
private final JoinValidationService joinValidationService;
140-
private final NodeRemovalClusterStateTaskExecutor nodeRemovalExecutor;
140+
private final NodeLeftExecutor nodeLeftExecutor;
141141
private final Supplier<CoordinationState.PersistedState> persistedStateSupplier;
142142
private final NoMasterBlockService noMasterBlockService;
143143
final Object mutex = new Object(); // package-private to allow tests to call methods that assert that the mutex is held
@@ -205,7 +205,7 @@ public Coordinator(
205205
this.transportService = transportService;
206206
this.masterService = masterService;
207207
this.allocationService = allocationService;
208-
this.onJoinValidators = JoinTaskExecutor.addBuiltInJoinValidators(onJoinValidators);
208+
this.onJoinValidators = NodeJoinExecutor.addBuiltInJoinValidators(onJoinValidators);
209209
this.singleNodeDiscovery = DiscoveryModule.isSingleNodeDiscovery(settings);
210210
this.electionStrategy = electionStrategy;
211211
this.joinReasonService = new JoinReasonService(transportService.getThreadPool()::relativeTimeInMillis);
@@ -272,7 +272,7 @@ public Coordinator(
272272
this::removeNode,
273273
nodeHealthService
274274
);
275-
this.nodeRemovalExecutor = new NodeRemovalClusterStateTaskExecutor(allocationService);
275+
this.nodeLeftExecutor = new NodeLeftExecutor(allocationService);
276276
this.clusterApplier = clusterApplier;
277277
masterService.setClusterStateSupplier(this::getStateForMasterService);
278278
this.reconfigurator = new Reconfigurator(settings, clusterSettings);
@@ -339,16 +339,11 @@ private void onLeaderFailure(Supplier<String> message, Exception e) {
339339
private void removeNode(DiscoveryNode discoveryNode, String reason) {
340340
synchronized (mutex) {
341341
if (mode == Mode.LEADER) {
342-
var task = new NodeRemovalClusterStateTaskExecutor.Task(
343-
discoveryNode,
344-
reason,
345-
() -> joinReasonService.onNodeRemoved(discoveryNode, reason)
346-
);
347342
masterService.submitStateUpdateTask(
348343
"node-left",
349-
task,
344+
new NodeLeftExecutor.Task(discoveryNode, reason, () -> joinReasonService.onNodeRemoved(discoveryNode, reason)),
350345
ClusterStateTaskConfig.build(Priority.IMMEDIATE),
351-
nodeRemovalExecutor
346+
nodeLeftExecutor
352347
);
353348
}
354349
}
@@ -664,7 +659,7 @@ private void validateJoinRequest(JoinRequest joinRequest, ActionListener<Void> v
664659
if (stateForJoinValidation.getBlocks().hasGlobalBlock(STATE_NOT_RECOVERED_BLOCK) == false) {
665660
// We do this in a couple of places including the cluster update thread. This one here is really just best effort to ensure
666661
// we fail as fast as possible.
667-
JoinTaskExecutor.ensureVersionBarrier(
662+
NodeJoinExecutor.ensureVersionBarrier(
668663
joinRequest.getSourceNode().getVersion(),
669664
stateForJoinValidation.getNodes().getMinNodeVersion()
670665
);

server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java

Lines changed: 13 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -67,7 +67,7 @@ public class JoinHelper {
6767
private final MasterService masterService;
6868
private final ClusterApplier clusterApplier;
6969
private final TransportService transportService;
70-
private final JoinTaskExecutor joinTaskExecutor;
70+
private final NodeJoinExecutor nodeJoinExecutor;
7171
private final LongSupplier currentTermSupplier;
7272
private final NodeHealthService nodeHealthService;
7373
private final JoinReasonService joinReasonService;
@@ -94,7 +94,7 @@ public class JoinHelper {
9494
this.clusterApplier = clusterApplier;
9595
this.transportService = transportService;
9696
this.circuitBreakerService = circuitBreakerService;
97-
this.joinTaskExecutor = new JoinTaskExecutor(allocationService, rerouteService);
97+
this.nodeJoinExecutor = new NodeJoinExecutor(allocationService, rerouteService);
9898
this.currentTermSupplier = currentTermSupplier;
9999
this.nodeHealthService = nodeHealthService;
100100
this.joinReasonService = joinReasonService;
@@ -389,13 +389,17 @@ default void close(Mode newMode) {}
389389
class LeaderJoinAccumulator implements JoinAccumulator {
390390
@Override
391391
public void handleJoinRequest(DiscoveryNode sender, ActionListener<Void> joinListener) {
392-
final JoinTask task = JoinTask.singleNode(
393-
sender,
394-
joinReasonService.getJoinReason(sender, Mode.LEADER),
395-
joinListener,
396-
currentTermSupplier.getAsLong()
392+
masterService.submitStateUpdateTask(
393+
"node-join",
394+
JoinTask.singleNode(
395+
sender,
396+
joinReasonService.getJoinReason(sender, Mode.LEADER),
397+
joinListener,
398+
currentTermSupplier.getAsLong()
399+
),
400+
ClusterStateTaskConfig.build(Priority.URGENT),
401+
nodeJoinExecutor
397402
);
398-
masterService.submitStateUpdateTask("node-join", task, ClusterStateTaskConfig.build(Priority.URGENT), joinTaskExecutor);
399403
}
400404

401405
@Override
@@ -461,7 +465,7 @@ public void close(Mode newMode) {
461465
"elected-as-master ([" + joinTask.nodeCount() + "] nodes joined)",
462466
joinTask,
463467
ClusterStateTaskConfig.build(Priority.URGENT),
464-
joinTaskExecutor
468+
nodeJoinExecutor
465469

466470
);
467471
} else {

server/src/main/java/org/elasticsearch/cluster/coordination/JoinTaskExecutor.java renamed to server/src/main/java/org/elasticsearch/cluster/coordination/NodeJoinExecutor.java

Lines changed: 11 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -36,14 +36,14 @@
3636

3737
import static org.elasticsearch.gateway.GatewayService.STATE_NOT_RECOVERED_BLOCK;
3838

39-
public class JoinTaskExecutor implements ClusterStateTaskExecutor<JoinTask> {
39+
public class NodeJoinExecutor implements ClusterStateTaskExecutor<JoinTask> {
4040

41-
private static final Logger logger = LogManager.getLogger(JoinTaskExecutor.class);
41+
private static final Logger logger = LogManager.getLogger(NodeJoinExecutor.class);
4242

4343
private final AllocationService allocationService;
4444
private final RerouteService rerouteService;
4545

46-
public JoinTaskExecutor(AllocationService allocationService, RerouteService rerouteService) {
46+
public NodeJoinExecutor(AllocationService allocationService, RerouteService rerouteService) {
4747
this.allocationService = allocationService;
4848
this.rerouteService = rerouteService;
4949
}
@@ -135,7 +135,14 @@ public ClusterState execute(BatchExecutionContext<JoinTask> batchExecutionContex
135135
continue;
136136
}
137137
}
138-
onTaskSuccess.add(() -> nodeJoinTask.listener().onResponse(null));
138+
onTaskSuccess.add(() -> {
139+
logger.info(
140+
"node-join: [{}] with reason [{}]",
141+
nodeJoinTask.node().descriptionWithoutAttributes(),
142+
nodeJoinTask.reason()
143+
);
144+
nodeJoinTask.listener().onResponse(null);
145+
});
139146
}
140147
joinTaskContext.success(() -> {
141148
for (Runnable joinCompleter : onTaskSuccess) {
Lines changed: 12 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -19,9 +19,9 @@
1919
import org.elasticsearch.cluster.service.MasterService;
2020
import org.elasticsearch.persistent.PersistentTasksCustomMetadata;
2121

22-
public class NodeRemovalClusterStateTaskExecutor implements ClusterStateTaskExecutor<NodeRemovalClusterStateTaskExecutor.Task> {
22+
public class NodeLeftExecutor implements ClusterStateTaskExecutor<NodeLeftExecutor.Task> {
2323

24-
private static final Logger logger = LogManager.getLogger(NodeRemovalClusterStateTaskExecutor.class);
24+
private static final Logger logger = LogManager.getLogger(NodeLeftExecutor.class);
2525

2626
private final AllocationService allocationService;
2727

@@ -41,7 +41,7 @@ public String toString() {
4141
}
4242
}
4343

44-
public NodeRemovalClusterStateTaskExecutor(AllocationService allocationService) {
44+
public NodeLeftExecutor(AllocationService allocationService) {
4545
this.allocationService = allocationService;
4646
}
4747

@@ -52,13 +52,21 @@ public ClusterState execute(BatchExecutionContext<Task> batchExecutionContext) t
5252
boolean removed = false;
5353
for (final var taskContext : batchExecutionContext.taskContexts()) {
5454
final var task = taskContext.getTask();
55+
final String reason;
5556
if (initialState.nodes().nodeExists(task.node())) {
5657
remainingNodesBuilder.remove(task.node());
5758
removed = true;
59+
reason = task.reason();
5860
} else {
5961
logger.debug("node [{}] does not exist in cluster state, ignoring", task);
62+
reason = null;
6063
}
61-
taskContext.success(task.onClusterStateProcessed::run);
64+
taskContext.success(() -> {
65+
if (reason != null) {
66+
logger.info("node-left: [{}] with reason [{}]", task.node().descriptionWithoutAttributes(), reason);
67+
}
68+
task.onClusterStateProcessed.run();
69+
});
6270
}
6371

6472
if (removed == false) {

server/src/main/java/org/elasticsearch/cluster/metadata/DesiredNodeWithStatus.java

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -47,7 +47,7 @@ public record DesiredNodeWithStatus(DesiredNode desiredNode, Status status)
4747
),
4848
// An unknown status is expected during upgrades to versions >= STATUS_TRACKING_SUPPORT_VERSION
4949
// the desired node status would be populated when a node in the newer version is elected as
50-
// master, the desired nodes status update happens in JoinTaskExecutor.
50+
// master, the desired nodes status update happens in NodeJoinExecutor.
5151
args[6] == null ? Status.PENDING : (Status) args[6]
5252
)
5353
);
@@ -84,7 +84,7 @@ public static DesiredNodeWithStatus readFrom(StreamInput in) throws IOException
8484
// since it's impossible to know if a node that was supposed to
8585
// join the cluster, it joined. The status will be updated
8686
// once the master node is upgraded to a version >= STATUS_TRACKING_SUPPORT_VERSION
87-
// in JoinTaskExecutor or when the desired nodes are upgraded to a new version.
87+
// in NodeJoinExecutor or when the desired nodes are upgraded to a new version.
8888
status = Status.PENDING;
8989
}
9090
return new DesiredNodeWithStatus(desiredNode, status);

server/src/main/java/org/elasticsearch/cluster/metadata/DesiredNodes.java

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,9 @@
88

99
package org.elasticsearch.cluster.metadata;
1010

11+
import org.elasticsearch.action.admin.cluster.desirednodes.TransportUpdateDesiredNodesAction;
1112
import org.elasticsearch.cluster.ClusterState;
13+
import org.elasticsearch.cluster.coordination.NodeJoinExecutor;
1214
import org.elasticsearch.cluster.node.DiscoveryNode;
1315
import org.elasticsearch.cluster.node.DiscoveryNodes;
1416
import org.elasticsearch.common.io.stream.StreamInput;
@@ -97,8 +99,7 @@
9799
* </ul>
98100
*
99101
* <p>
100-
* See {@code JoinTaskExecutor} and {@code TransportUpdateDesiredNodesAction} for more details about
101-
* desired nodes status tracking.
102+
* See {@link NodeJoinExecutor} and {@link TransportUpdateDesiredNodesAction} for more details about desired nodes status tracking.
102103
* </p>
103104
*
104105
* <p>

0 commit comments

Comments
 (0)