* The master may appear busy due to frequent cluster state updates.

To troubleshoot a cluster in this state, first ensure the cluster has a
- <<modules-discovery-troubleshooting,stable master>>. Next, focus on the nodes
+ <<discovery-troubleshooting,stable master>>. Next, focus on the nodes
unexpectedly leaving the cluster ahead of all other issues. It will not be
possible to solve other issues until the cluster has a stable master node and
stable node membership.
@@ -62,23 +62,33 @@ tools only offer a view of the state of the cluster at a single point in time.
Instead, look at the cluster logs to see the pattern of behaviour over time.
Focus particularly on logs from the elected master. When a node leaves the
cluster, logs for the elected master include a message like this (with line
- breaks added for clarity):
+ breaks added to make it easier to read):

[source,text]
----
- [2022-03-21T11:02:35,513][INFO ][o.e.c.s.MasterService ]
- [instance-0000000000] node-left[
- {instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}
- reason: disconnected,
- {tiebreaker-0000000003}{UNw_RuazQCSBskWZV8ID_w}{bltyVOQ-RNu20OQfTHSLtA}{172.27.161.154}{172.27.161.154:19251}{mv}
- reason: disconnected
- ], term: 14, version: 1653415, ...
+ [2022-03-21T11:02:35,513][INFO ][o.e.c.c.NodeLeftExecutor] [instance-0000000000] node-left:
+ removed [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}]
+ with reason [test reason]
+ ----
+
+ This message says that the `NodeLeftExecutor` on the elected master
+ (`instance-0000000000`) processed a `node-left` task, identifying the node that
+ was removed and the reason for its removal. When the node joins the cluster
+ again, logs for the elected master will include a message like this (with line
+ breaks added to make it easier to read):
+
+ [source,text]
+ ----
+ [2022-03-21T11:02:59,892][INFO ][o.e.c.c.NodeJoinExecutor] [instance-0000000000] node-join:
+ added [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{UNw_RuazQCSBskWZV8ID_w}{172.27.47.21}{172.27.47.21:19054}{m}]
+ with reason [joining after restart, removed [24s] ago with reason [disconnected]]
----

- This message says that the `MasterService` on the elected master
- (`instance-0000000000`) is processing a `node-left` task. It lists the nodes
- that are being removed and the reasons for their removal. Other nodes may log
- similar messages, but report fewer details:
+ This message says that the `NodeJoinExecutor` on the elected master
+ (`instance-0000000000`) processed a `node-join` task, identifying the node that
+ was added to the cluster and the reason for the task.
+
+ Other nodes may log similar messages, but report fewer details:

[source,text]
----
@@ -89,9 +99,10 @@ similar messages, but report fewer details:
}, term: 14, version: 1653415, reason: Publication{term=14, version=1653415}
----

- Focus on the one from the `MasterService` which is only emitted on the elected
- master, since it contains more details. If you don't see the messages from the
- `MasterService`, check that:
+ These messages are not especially useful for troubleshooting, so focus on the
+ ones from the `NodeLeftExecutor` and `NodeJoinExecutor` which are only emitted
+ on the elected master and which contain more details. If you don't see the
+ messages from the `NodeLeftExecutor` and `NodeJoinExecutor`, check that:

* You're looking at the logs for the elected master node.
@@ -104,18 +115,14 @@ start or stop following the elected master. You can use these messages to
determine each node's view of the state of the master over time.

If a node restarts, it will leave the cluster and then join the cluster again.
- When it rejoins, the `MasterService` will log that it is processing a
- `node-join` task. You can tell from the master logs that the node was restarted
- because the `node-join` message will indicate that it is
- `joining after restart`. In older {es} versions, you can also determine that a
- node restarted by looking at the second "ephemeral" ID in the `node-left` and
- subsequent `node-join` messages. This ephemeral ID is different each time the
- node starts up. If a node is unexpectedly restarting, you'll need to look at
- the node's logs to see why it is shutting down.
+ When it rejoins, the `NodeJoinExecutor` will log that it processed a
+ `node-join` task indicating that the node is `joining after restart`. If a node
+ is unexpectedly restarting, look at the node's logs to see why it is shutting
+ down.

If the node did not restart then you should look at the reason for its
- departure in the `node-left` message, which is reported after each node. There
- are three possible reasons:
+ departure more closely. Each reason has different troubleshooting steps,
+ described below. There are three possible reasons:

* `disconnected`: The connection from the master node to the removed node was
closed.
@@ -134,6 +141,10 @@ control this mechanism.

===== Diagnosing `disconnected` nodes

+ Nodes typically leave the cluster with reason `disconnected` when they shut
+ down, but if they rejoin the cluster without restarting then there is some
+ other problem.
+
{es} is designed to run on a fairly reliable network. It opens a number of TCP
connections between nodes and expects these connections to remain open forever.
If a connection is closed then {es} will try and reconnect, so the occasional
@@ -194,6 +205,10 @@ the logs on the elected master.

===== Diagnosing `follower check retry count exceeded` nodes

+ Nodes sometimes leave the cluster with reason `follower check retry count
+ exceeded` when they shut down, but if they rejoin the cluster without
+ restarting then there is some other problem.
+
{es} needs every node to respond to network messages successfully and
reasonably quickly. If a node rejects requests or does not respond at all then
it can be harmful to the cluster. If enough consecutive checks fail then the
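The revised text tells readers to find the `NodeLeftExecutor` and `NodeJoinExecutor` messages in the elected master's logs and to check whether a `node-join` reports `joining after restart`. A minimal sketch of that triage in Python follows. It assumes each message sits on a single line in the real log (the docs add line breaks for readability) and follows the layout of the two examples in this diff. The names `EVENT_RE`, `MembershipEvent`, `parse_membership_events`, `rejoined_after_restart` and `SAMPLE_LINES` are hypothetical and not part of {es} or its tooling.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumption: the log layout matches the examples in this diff, one message
# per line. Other versions or custom log patterns may need a different regex.
EVENT_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]"                      # [2022-03-21T11:02:35,513]
    r"\[[^\]]*\]"                                      # log level, e.g. [INFO ]
    r"\[o\.e\.c\.c\.Node(?:Left|Join)Executor\s*\]\s*"  # logger name
    r"\[(?P<master>[^\]]+)\]\s*"                       # node that wrote the line
    r"(?P<task>node-left|node-join):\s*"
    r"(?:removed|added)\s*\[(?P<node>(?:\{[^}]*\})+)\]\s*"
    r"with reason \[(?P<reason>.*)\]\s*$"              # reason may nest brackets
)


class MembershipEvent(NamedTuple):
    timestamp: str
    master: str
    task: str       # "node-left" or "node-join"
    node_name: str  # first {...} field of the node description
    reason: str


def parse_membership_events(lines: Iterable[str]) -> Iterator[MembershipEvent]:
    """Yield one event per matching line; all other lines are skipped."""
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match is None:
            continue
        # The node description starts with "{name}", so take the first field.
        node_name = match["node"][1:].split("}", 1)[0]
        yield MembershipEvent(
            match["timestamp"], match["master"], match["task"],
            node_name, match["reason"],
        )


def rejoined_after_restart(event: MembershipEvent) -> bool:
    """True for a node-join whose reason says the node restarted."""
    return (event.task == "node-join"
            and event.reason.startswith("joining after restart"))


# The two example messages from the diff, joined back onto single lines.
SAMPLE_LINES = [
    "[2022-03-21T11:02:35,513][INFO ][o.e.c.c.NodeLeftExecutor] "
    "[instance-0000000000] node-left: removed [{instance-0000000004}"
    "{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}"
    "{172.27.47.21:19054}{m}] with reason [test reason]",
    "[2022-03-21T11:02:59,892][INFO ][o.e.c.c.NodeJoinExecutor] "
    "[instance-0000000000] node-join: added [{instance-0000000004}"
    "{bfcMDTiDRkietFb9v_di7w}{UNw_RuazQCSBskWZV8ID_w}{172.27.47.21}"
    "{172.27.47.21:19054}{m}] with reason [joining after restart, "
    "removed [24s] ago with reason [disconnected]]",
]
```

Passing the lines of a master log file to `parse_membership_events` and filtering with `rejoined_after_restart` separates restarts from nodes that left and rejoined without restarting. The second group is the one the troubleshooting steps for each departure reason apply to.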