diff --git a/docs/reference/modules/discovery/fault-detection.asciidoc b/docs/reference/modules/discovery/fault-detection.asciidoc
index d12985b70597c..21f4ae2317e6a 100644
--- a/docs/reference/modules/discovery/fault-detection.asciidoc
+++ b/docs/reference/modules/discovery/fault-detection.asciidoc
@@ -35,313 +35,30 @@ starting from the beginning of the cluster state update. Refer to
 
 [[cluster-fault-detection-troubleshooting]]
 ==== Troubleshooting an unstable cluster
-//tag::troubleshooting[]
-Normally, a node will only leave a cluster if deliberately shut down. If a node
-leaves the cluster unexpectedly, it's important to address the cause. A cluster
-in which nodes leave unexpectedly is unstable and can create several issues.
-For instance:
-
-* The cluster health may be yellow or red.
-
-* Some shards will be initializing and other shards may be failing.
-
-* Search, indexing, and monitoring operations may fail and report exceptions in
-logs.
-
-* The `.security` index may be unavailable, blocking access to the cluster.
-
-* The master may appear busy due to frequent cluster state updates.
-
-To troubleshoot a cluster in this state, first ensure the cluster has a
-<<discovery-troubleshooting,stable elected master>>. Next, focus on the nodes
-unexpectedly leaving the cluster ahead of all other issues. It will not be
-possible to solve other issues until the cluster has a stable master node and
-stable node membership.
-
-Diagnostics and statistics are usually not useful in an unstable cluster. These
-tools only offer a view of the state of the cluster at a single point in time.
-Instead, look at the cluster logs to see the pattern of behaviour over time.
-Focus particularly on logs from the elected master. When a node leaves the
-cluster, logs for the elected master include a message like this (with line
-breaks added to make it easier to read):
-
-[source,text]
-----
-[2022-03-21T11:02:35,513][INFO ][o.e.c.c.NodeLeftExecutor] [instance-0000000000]
-    node-left: [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}]
-    with reason [disconnected]
-----
-
-This message says that the `NodeLeftExecutor` on the elected master
-(`instance-0000000000`) processed a `node-left` task, identifying the node that
-was removed and the reason for its removal. When the node joins the cluster
-again, logs for the elected master will include a message like this (with line
-breaks added to make it easier to read):
-
-[source,text]
-----
-[2022-03-21T11:02:59,892][INFO ][o.e.c.c.NodeJoinExecutor] [instance-0000000000]
-    node-join: [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{UNw_RuazQCSBskWZV8ID_w}{172.27.47.21}{172.27.47.21:19054}{m}]
-    with reason [joining after restart, removed [24s] ago with reason [disconnected]]
-----
-
-This message says that the `NodeJoinExecutor` on the elected master
-(`instance-0000000000`) processed a `node-join` task, identifying the node that
-was added to the cluster and the reason for the task.
-
-Other nodes may log similar messages, but report fewer details:
-
-[source,text]
-----
-[2020-01-29T11:02:36,985][INFO ][o.e.c.s.ClusterApplierService]
-    [instance-0000000001] removed {
-    {instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}
-    {tiebreaker-0000000003}{UNw_RuazQCSBskWZV8ID_w}{bltyVOQ-RNu20OQfTHSLtA}{172.27.161.154}{172.27.161.154:19251}{mv}
-    }, term: 14, version: 1653415, reason: Publication{term=14, version=1653415}
-----
-
-These messages are not especially useful for troubleshooting, so focus on the
-ones from the `NodeLeftExecutor` and `NodeJoinExecutor` which are only emitted
-on the elected master and which contain more details. If you don't see the
-messages from the `NodeLeftExecutor` and `NodeJoinExecutor`, check that:
-
-* You're looking at the logs for the elected master node.
-
-* The logs cover the correct time period.
-
-* Logging is enabled at `INFO` level.
-
-Nodes will also log a message containing `master node changed` whenever they
-start or stop following the elected master. You can use these messages to
-determine each node's view of the state of the master over time.
-
-If a node restarts, it will leave the cluster and then join the cluster again.
-When it rejoins, the `NodeJoinExecutor` will log that it processed a
-`node-join` task indicating that the node is `joining after restart`. If a node
-is unexpectedly restarting, look at the node's logs to see why it is shutting
-down.
-
-The <<health-api,Health>> API on the affected node will also provide some useful
-information about the situation.
-
-If the node did not restart then you should look at the reason for its
-departure more closely. Each reason has different troubleshooting steps,
-described below. There are three possible reasons:
-
-* `disconnected`: The connection from the master node to the removed node was
-closed.
-
-* `lagging`: The master published a cluster state update, but the removed node
-did not apply it within the permitted timeout. By default, this timeout is 2
-minutes. Refer to <<modules-discovery-settings>> for information about the
-settings which control this mechanism.
-
-* `followers check retry count exceeded`: The master sent a number of
-consecutive health checks to the removed node. These checks were rejected or
-timed out. By default, each health check times out after 10 seconds and {es}
-removes the node removed after three consecutively failed health checks. Refer
-to <<modules-discovery-settings>> for information about the settings which
-control this mechanism.
+See <<troubleshooting-unstable-cluster>>.
 
 [discrete]
 ===== Diagnosing `disconnected` nodes
 
-Nodes typically leave the cluster with reason `disconnected` when they shut
-down, but if they rejoin the cluster without restarting then there is some
-other problem.
-
-{es} is designed to run on a fairly reliable network. It opens a number of TCP
-connections between nodes and expects these connections to remain open
-<<long-lived-connections,indefinitely>>. If a connection is closed then {es} will
-try and reconnect, so the occasional blip may fail some in-flight operations
-but should otherwise have limited impact on the cluster. In contrast,
-repeatedly-dropped connections will severely affect its operation.
-
-The connections from the elected master node to every other node in the cluster
-are particularly important. The elected master never spontaneously closes its
-outbound connections to other nodes. Similarly, once an inbound connection is
-fully established, a node never spontaneously it unless the node is shutting
-down.
-
-If you see a node unexpectedly leave the cluster with the `disconnected`
-reason, something other than {es} likely caused the connection to close. A
-common cause is a misconfigured firewall with an improper timeout or another
-policy that's <<long-lived-connections,incompatible with long-lived idle connections>>. It could also
-be caused by general connectivity issues, such as packet loss due to faulty
-hardware or network congestion. If you're an advanced user, configure the
-following loggers to get more detailed information about network exceptions:
-
-[source,yaml]
-----
-logger.org.elasticsearch.transport.TcpTransport: DEBUG
-logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
-----
-
-If these logs do not show enough information to diagnose the problem, obtain a
-packet capture simultaneously from the nodes at both ends of an unstable
-connection and analyse it alongside the {es} logs from those nodes to determine
-if traffic between the nodes is being disrupted by another device on the
-network.
+See <<troubleshooting-unstable-cluster-disconnected>>.
 
 [discrete]
 ===== Diagnosing `lagging` nodes
 
-{es} needs every node to process cluster state updates reasonably quickly. If a
-node takes too long to process a cluster state update, it can be harmful to the
-cluster. The master will remove these nodes with the `lagging` reason. Refer to
-<<modules-discovery-settings>> for information about the settings which control
-this mechanism.
-
-Lagging is typically caused by performance issues on the removed node. However,
-a node may also lag due to severe network delays. To rule out network delays,
-ensure that `net.ipv4.tcp_retries2` is <<system-config-tcpretries,configured properly>>. Log messages that contain `warn threshold` may provide more
-information about the root cause.
-
-If you're an advanced user, you can get more detailed information about what
-the node was doing when it was removed by configuring the following logger:
-
-[source,yaml]
-----
-logger.org.elasticsearch.cluster.coordination.LagDetector: DEBUG
-----
-
-When this logger is enabled, {es} will attempt to run the
-<<cluster-nodes-hot-threads>> API on the faulty node and report the results in
-the logs on the elected master. The results are compressed, encoded, and split
-into chunks to avoid truncation:
-
-[source,text]
-----
-[DEBUG][o.e.c.c.LagDetector      ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 1]: H4sIAAAAAAAA/x...
-[DEBUG][o.e.c.c.LagDetector      ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 2]: p7x3w1hmOQVtuV...
-[DEBUG][o.e.c.c.LagDetector      ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 3]: v7uTboMGDbyOy+...
-[DEBUG][o.e.c.c.LagDetector      ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 4]: 4tse0RnPnLeDNN...
-[DEBUG][o.e.c.c.LagDetector      ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] (gzip compressed, base64-encoded, and split into 4 parts on preceding log lines)
-----
-
-To reconstruct the output, base64-decode the data and decompress it using
-`gzip`. For instance, on Unix-like systems:
-
-[source,sh]
-----
-cat lagdetector.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
-----
+See <<troubleshooting-unstable-cluster-lagging>>.
 
 [discrete]
 ===== Diagnosing `follower check retry count exceeded` nodes
 
-Nodes sometimes leave the cluster with reason `follower check retry count
-exceeded` when they shut down, but if they rejoin the cluster without
-restarting then there is some other problem.
-
-{es} needs every node to respond to network messages successfully and
-reasonably quickly. If a node rejects requests or does not respond at all then
-it can be harmful to the cluster. If enough consecutive checks fail then the
-master will remove the node with reason `follower check retry count exceeded`
-and will indicate in the `node-left` message how many of the consecutive
-unsuccessful checks failed and how many of them timed out. Refer to
-<<modules-discovery-settings>> for information about the settings which control
-this mechanism.
-
-Timeouts and failures may be due to network delays or performance problems on
-the affected nodes. Ensure that `net.ipv4.tcp_retries2` is
-<<system-config-tcpretries,configured properly>> to eliminate network delays as
-a possible cause for this kind of instability. Log messages containing
-`warn threshold` may give further clues about the cause of the instability.
-
-If the last check failed with an exception then the exception is reported, and
-typically indicates the problem that needs to be addressed. If any of the
-checks timed out then narrow down the problem as follows.
-
-include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
-
-include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-fault-detection]
-
-include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
-
-By default the follower checks will time out after 30s, so if node departures
-are unpredictable then capture stack dumps every 15s to be sure that at least
-one stack dump was taken at the right time.
+See <<troubleshooting-unstable-cluster-follower-check>>.
 
 [discrete]
 ===== Diagnosing `ShardLockObtainFailedException` failures
 
-If a node leaves and rejoins the cluster then {es} will usually shut down and
-re-initialize its shards. If the shards do not shut down quickly enough then
-{es} may fail to re-initialize them due to a `ShardLockObtainFailedException`.
-
-To gather more information about the reason for shards shutting down slowly,
-configure the following logger:
-
-[source,yaml]
-----
-logger.org.elasticsearch.env.NodeEnvironment: DEBUG
-----
-
-When this logger is enabled, {es} will attempt to run the
-<<cluster-nodes-hot-threads>> API whenever it encounters a
-`ShardLockObtainFailedException`. The results are compressed, encoded, and
-split into chunks to avoid truncation:
-
-[source,text]
-----
-[DEBUG][o.e.e.NodeEnvironment    ] [master] hot threads while failing to obtain shard lock for [index][0] [part 1]: H4sIAAAAAAAA/x...
-[DEBUG][o.e.e.NodeEnvironment    ] [master] hot threads while failing to obtain shard lock for [index][0] [part 2]: p7x3w1hmOQVtuV...
-[DEBUG][o.e.e.NodeEnvironment    ] [master] hot threads while failing to obtain shard lock for [index][0] [part 3]: v7uTboMGDbyOy+...
-[DEBUG][o.e.e.NodeEnvironment    ] [master] hot threads while failing to obtain shard lock for [index][0] [part 4]: 4tse0RnPnLeDNN...
-[DEBUG][o.e.e.NodeEnvironment    ] [master] hot threads while failing to obtain shard lock for [index][0] (gzip compressed, base64-encoded, and split into 4 parts on preceding log lines)
-----
-
-To reconstruct the output, base64-decode the data and decompress it using
-`gzip`. For instance, on Unix-like systems:
-
-[source,sh]
-----
-cat shardlock.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
-----
+See <<troubleshooting-unstable-cluster-shardlockobtainfailedexception>>.
 
 [discrete]
 ===== Diagnosing other network disconnections
 
-{es} is designed to run on a fairly reliable network. It opens a number of TCP
-connections between nodes and expects these connections to remain open
-<<long-lived-connections,indefinitely>>. If a connection is closed then {es} will
-try and reconnect, so the occasional blip may fail some in-flight operations
-but should otherwise have limited impact on the cluster. In contrast,
-repeatedly-dropped connections will severely affect its operation.
-
-{es} nodes will only actively close an outbound connection to another node if
-the other node leaves the cluster. See
-<> for further information about
-identifying and troubleshooting this situation. If an outbound connection
-closes for some other reason, nodes will log a message such as the following:
-
-[source,text]
-----
-[INFO ][o.e.t.ClusterConnectionManager] [node-1] transport connection to [{node-2}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] closed by remote
-----
-
-Similarly, once an inbound connection is fully established, a node never
-spontaneously closes it unless the node is shutting down.
-
-Therefore if you see a node report that a connection to another node closed
-unexpectedly, something other than {es} likely caused the connection to close.
-A common cause is a misconfigured firewall with an improper timeout or another
-policy that's <<long-lived-connections,incompatible with long-lived idle connections>>. It could also
-be caused by general connectivity issues, such as packet loss due to faulty
-hardware or network congestion. If you're an advanced user, configure the
-following loggers to get more detailed information about network exceptions:
-
-[source,yaml]
-----
-logger.org.elasticsearch.transport.TcpTransport: DEBUG
-logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
-----
-
-If these logs do not show enough information to diagnose the problem, obtain a
-packet capture simultaneously from the nodes at both ends of an unstable
-connection and analyse it alongside the {es} logs from those nodes to determine
-if traffic between the nodes is being disrupted by another device on the
-network.
-//end::troubleshooting[]
+See <<troubleshooting-unstable-cluster-network>>.
diff --git a/docs/reference/troubleshooting/troubleshooting-unstable-cluster.asciidoc b/docs/reference/troubleshooting/troubleshooting-unstable-cluster.asciidoc
index 387ebcdcd43c0..cbb35f7731034 100644
--- a/docs/reference/troubleshooting/troubleshooting-unstable-cluster.asciidoc
+++ b/docs/reference/troubleshooting/troubleshooting-unstable-cluster.asciidoc
@@ -1,4 +1,316 @@
 [[troubleshooting-unstable-cluster]]
 == Troubleshooting an unstable cluster
 
-include::../modules/discovery/fault-detection.asciidoc[tag=troubleshooting,leveloffset=-2]
\ No newline at end of file
+Normally, a node will only leave a cluster if deliberately shut down. If a node
+leaves the cluster unexpectedly, it's important to address the cause. A cluster
+in which nodes leave unexpectedly is unstable and can create several issues.
+For instance:
+
+* The cluster health may be yellow or red.
+
+* Some shards will be initializing and other shards may be failing.
+
+* Search, indexing, and monitoring operations may fail and report exceptions in
+logs.
+
+* The `.security` index may be unavailable, blocking access to the cluster.
+
+* The master may appear busy due to frequent cluster state updates.
+
+To troubleshoot a cluster in this state, first ensure the cluster has a
+<<discovery-troubleshooting,stable elected master>>. Next, focus on the nodes
+unexpectedly leaving the cluster ahead of all other issues. It will not be
+possible to solve other issues until the cluster has a stable master node and
+stable node membership.
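+
+To check which node is currently elected as master, and whether that changes
+over time, you can poll the `_cat/master` endpoint. This is a minimal sketch
+using `curl` against a hypothetical `localhost:9200` endpoint; add
+authentication and TLS options as your deployment requires:
+
+[source,sh]
+----
+# Print the elected master once every ten seconds to spot repeated elections.
+while true; do curl -s "localhost:9200/_cat/master"; sleep 10; done
+----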
+
+Diagnostics and statistics are usually not useful in an unstable cluster. These
+tools only offer a view of the state of the cluster at a single point in time.
+Instead, look at the cluster logs to see the pattern of behaviour over time.
+Focus particularly on logs from the elected master. When a node leaves the
+cluster, logs for the elected master include a message like this (with line
+breaks added to make it easier to read):
+
+[source,text]
+----
+[2022-03-21T11:02:35,513][INFO ][o.e.c.c.NodeLeftExecutor] [instance-0000000000]
+    node-left: [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}]
+    with reason [disconnected]
+----
+
+This message says that the `NodeLeftExecutor` on the elected master
+(`instance-0000000000`) processed a `node-left` task, identifying the node that
+was removed and the reason for its removal. When the node joins the cluster
+again, logs for the elected master will include a message like this (with line
+breaks added to make it easier to read):
+
+[source,text]
+----
+[2022-03-21T11:02:59,892][INFO ][o.e.c.c.NodeJoinExecutor] [instance-0000000000]
+    node-join: [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{UNw_RuazQCSBskWZV8ID_w}{172.27.47.21}{172.27.47.21:19054}{m}]
+    with reason [joining after restart, removed [24s] ago with reason [disconnected]]
+----
+
+This message says that the `NodeJoinExecutor` on the elected master
+(`instance-0000000000`) processed a `node-join` task, identifying the node that
+was added to the cluster and the reason for the task.
+
+Other nodes may log similar messages, but report fewer details:
+
+[source,text]
+----
+[2020-01-29T11:02:36,985][INFO ][o.e.c.s.ClusterApplierService]
+    [instance-0000000001] removed {
+    {instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}
+    {tiebreaker-0000000003}{UNw_RuazQCSBskWZV8ID_w}{bltyVOQ-RNu20OQfTHSLtA}{172.27.161.154}{172.27.161.154:19251}{mv}
+    }, term: 14, version: 1653415, reason: Publication{term=14, version=1653415}
+----
+
+These messages are not especially useful for troubleshooting, so focus on the
+ones from the `NodeLeftExecutor` and `NodeJoinExecutor` which are only emitted
+on the elected master and which contain more details. If you don't see the
+messages from the `NodeLeftExecutor` and `NodeJoinExecutor`, check that:
+
+* You're looking at the logs for the elected master node.
+
+* The logs cover the correct time period.
+
+* Logging is enabled at `INFO` level.
+
+Nodes will also log a message containing `master node changed` whenever they
+start or stop following the elected master. You can use these messages to
+determine each node's view of the state of the master over time.
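+
+For example, you might survey these events across a log file with standard
+shell tools (a sketch only; the log path is hypothetical and depends on how
+{es} is installed):
+
+[source,sh]
+----
+# Extract node-left, node-join, and master change events from the server logs.
+grep -hE "node-left|node-join|master node changed" /var/log/elasticsearch/*.log
+----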
+
+If a node restarts, it will leave the cluster and then join the cluster again.
+When it rejoins, the `NodeJoinExecutor` will log that it processed a
+`node-join` task indicating that the node is `joining after restart`. If a node
+is unexpectedly restarting, look at the node's logs to see why it is shutting
+down.
+
+The <<health-api,Health>> API on the affected node will also provide some useful
+information about the situation.
+
+If the node did not restart then you should look at the reason for its
+departure more closely. Each reason has different troubleshooting steps,
+described below. There are three possible reasons:
+
+* `disconnected`: The connection from the master node to the removed node was
+closed.
+
+* `lagging`: The master published a cluster state update, but the removed node
+did not apply it within the permitted timeout. By default, this timeout is 2
+minutes. Refer to <<modules-discovery-settings>> for information about the
+settings which control this mechanism.
+
+* `followers check retry count exceeded`: The master sent a number of
+consecutive health checks to the removed node. These checks were rejected or
+timed out. By default, each health check times out after 10 seconds and {es}
+removes the node after three consecutively failed health checks. Refer
+to <<modules-discovery-settings>> for information about the settings which
+control this mechanism.
+
+[discrete]
+[[troubleshooting-unstable-cluster-disconnected]]
+=== Diagnosing `disconnected` nodes
+
+Nodes typically leave the cluster with reason `disconnected` when they shut
+down, but if they rejoin the cluster without restarting then there is some
+other problem.
+
+{es} is designed to run on a fairly reliable network. It opens a number of TCP
+connections between nodes and expects these connections to remain open
+<<long-lived-connections,indefinitely>>. If a connection is closed then {es} will
+try and reconnect, so the occasional blip may fail some in-flight operations
+but should otherwise have limited impact on the cluster. In contrast,
+repeatedly-dropped connections will severely affect its operation.
+
+The connections from the elected master node to every other node in the cluster
+are particularly important. The elected master never spontaneously closes its
+outbound connections to other nodes. Similarly, once an inbound connection is
+fully established, a node never spontaneously closes it unless the node is
+shutting down.
+
+If you see a node unexpectedly leave the cluster with the `disconnected`
+reason, something other than {es} likely caused the connection to close. A
+common cause is a misconfigured firewall with an improper timeout or another
+policy that's <<long-lived-connections,incompatible with long-lived idle connections>>. It could also
+be caused by general connectivity issues, such as packet loss due to faulty
+hardware or network congestion. If you're an advanced user, configure the
+following loggers to get more detailed information about network exceptions:
+
+[source,yaml]
+----
+logger.org.elasticsearch.transport.TcpTransport: DEBUG
+logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
+----
+
+If these logs do not show enough information to diagnose the problem, obtain a
+packet capture simultaneously from the nodes at both ends of an unstable
+connection and analyse it alongside the {es} logs from those nodes to determine
+if traffic between the nodes is being disrupted by another device on the
+network.
+
+[discrete]
+[[troubleshooting-unstable-cluster-lagging]]
+=== Diagnosing `lagging` nodes
+
+{es} needs every node to process cluster state updates reasonably quickly. If a
+node takes too long to process a cluster state update, it can be harmful to the
+cluster. The master will remove these nodes with the `lagging` reason. Refer to
+<<modules-discovery-settings>> for information about the settings which control
+this mechanism.
+
+Lagging is typically caused by performance issues on the removed node. However,
+a node may also lag due to severe network delays. To rule out network delays,
+ensure that `net.ipv4.tcp_retries2` is <<system-config-tcpretries,configured properly>>. Log messages that contain `warn threshold` may provide more
+information about the root cause.
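+
+For example, on Linux you can inspect and adjust this kernel setting with
+`sysctl` (a sketch; `5` is a commonly recommended value, and you must also
+persist the change across reboots, for example in `/etc/sysctl.conf`):
+
+[source,sh]
+----
+sysctl net.ipv4.tcp_retries2            # inspect the current value
+sudo sysctl -w net.ipv4.tcp_retries2=5  # apply a lower value until next reboot
+----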
+
+If you're an advanced user, you can get more detailed information about what
+the node was doing when it was removed by configuring the following logger:
+
+[source,yaml]
+----
+logger.org.elasticsearch.cluster.coordination.LagDetector: DEBUG
+----
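+
+Logger levels can also usually be adjusted dynamically via the cluster settings
+API, without editing `elasticsearch.yml` or restarting the node. A minimal
+sketch using `curl` against a hypothetical `localhost:9200` endpoint:
+
+[source,sh]
+----
+# Enable DEBUG logging for the lag detector on all nodes in the cluster.
+curl -X PUT "localhost:9200/_cluster/settings" \
+  -H 'Content-Type: application/json' \
+  -d '{"persistent":{"logger.org.elasticsearch.cluster.coordination.LagDetector":"DEBUG"}}'
+----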
+
+When this logger is enabled, {es} will attempt to run the
+<<cluster-nodes-hot-threads>> API on the faulty node and report the results in
+the logs on the elected master. The results are compressed, encoded, and split
+into chunks to avoid truncation:
+
+[source,text]
+----
+[DEBUG][o.e.c.c.LagDetector      ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 1]: H4sIAAAAAAAA/x...
+[DEBUG][o.e.c.c.LagDetector      ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 2]: p7x3w1hmOQVtuV...
+[DEBUG][o.e.c.c.LagDetector      ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 3]: v7uTboMGDbyOy+...
+[DEBUG][o.e.c.c.LagDetector      ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 4]: 4tse0RnPnLeDNN...
+[DEBUG][o.e.c.c.LagDetector      ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] (gzip compressed, base64-encoded, and split into 4 parts on preceding log lines)
+----
+
+To reconstruct the output, base64-decode the data and decompress it using
+`gzip`. For instance, on Unix-like systems:
+
+[source,sh]
+----
+cat lagdetector.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
+----
+
+[discrete]
+[[troubleshooting-unstable-cluster-follower-check]]
+=== Diagnosing `follower check retry count exceeded` nodes
+
+Nodes sometimes leave the cluster with reason `follower check retry count
+exceeded` when they shut down, but if they rejoin the cluster without
+restarting then there is some other problem.
+
+{es} needs every node to respond to network messages successfully and
+reasonably quickly. If a node rejects requests or does not respond at all then
+it can be harmful to the cluster. If enough consecutive checks fail then the
+master will remove the node with reason `follower check retry count exceeded`
+and will indicate in the `node-left` message how many of the consecutive
+unsuccessful checks failed and how many of them timed out. Refer to
+<<modules-discovery-settings>> for information about the settings which control
+this mechanism.
+
+Timeouts and failures may be due to network delays or performance problems on
+the affected nodes. Ensure that `net.ipv4.tcp_retries2` is
+<<system-config-tcpretries,configured properly>> to eliminate network delays as
+a possible cause for this kind of instability. Log messages containing
+`warn threshold` may give further clues about the cause of the instability.
+
+If the last check failed with an exception then the exception is reported, and
+typically indicates the problem that needs to be addressed. If any of the
+checks timed out then narrow down the problem as follows.
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-fault-detection]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
+
+By default the follower checks will time out after 30s, so if node departures
+are unpredictable then capture stack dumps every 15s to be sure that at least
+one stack dump was taken at the right time.
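+
+For example, you might capture periodic stack dumps with the `jstack` utility
+from the JDK (a sketch; `$ES_PID` stands for the {es} process ID and the
+output location is arbitrary):
+
+[source,sh]
+----
+# Capture a stack dump of the Elasticsearch process every 15 seconds.
+while true; do jstack "$ES_PID" > "jstack-$(date +%s).txt"; sleep 15; done
+----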
+
+[discrete]
+[[troubleshooting-unstable-cluster-shardlockobtainfailedexception]]
+=== Diagnosing `ShardLockObtainFailedException` failures
+
+If a node leaves and rejoins the cluster then {es} will usually shut down and
+re-initialize its shards. If the shards do not shut down quickly enough then
+{es} may fail to re-initialize them due to a `ShardLockObtainFailedException`.
+
+To gather more information about the reason for shards shutting down slowly,
+configure the following logger:
+
+[source,yaml]
+----
+logger.org.elasticsearch.env.NodeEnvironment: DEBUG
+----
+
+When this logger is enabled, {es} will attempt to run the
+<<cluster-nodes-hot-threads>> API whenever it encounters a
+`ShardLockObtainFailedException`. The results are compressed, encoded, and
+split into chunks to avoid truncation:
+
+[source,text]
+----
+[DEBUG][o.e.e.NodeEnvironment    ] [master] hot threads while failing to obtain shard lock for [index][0] [part 1]: H4sIAAAAAAAA/x...
+[DEBUG][o.e.e.NodeEnvironment    ] [master] hot threads while failing to obtain shard lock for [index][0] [part 2]: p7x3w1hmOQVtuV...
+[DEBUG][o.e.e.NodeEnvironment    ] [master] hot threads while failing to obtain shard lock for [index][0] [part 3]: v7uTboMGDbyOy+...
+[DEBUG][o.e.e.NodeEnvironment    ] [master] hot threads while failing to obtain shard lock for [index][0] [part 4]: 4tse0RnPnLeDNN...
+[DEBUG][o.e.e.NodeEnvironment    ] [master] hot threads while failing to obtain shard lock for [index][0] (gzip compressed, base64-encoded, and split into 4 parts on preceding log lines)
+----
+
+To reconstruct the output, base64-decode the data and decompress it using
+`gzip`. For instance, on Unix-like systems:
+
+[source,sh]
+----
+cat shardlock.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
+----
+
+[discrete]
+[[troubleshooting-unstable-cluster-network]]
+=== Diagnosing other network disconnections
+
+{es} is designed to run on a fairly reliable network. It opens a number of TCP
+connections between nodes and expects these connections to remain open
+<<long-lived-connections,indefinitely>>. If a connection is closed then {es} will
+try and reconnect, so the occasional blip may fail some in-flight operations
+but should otherwise have limited impact on the cluster. In contrast,
+repeatedly-dropped connections will severely affect its operation.
+
+{es} nodes will only actively close an outbound connection to another node if
+the other node leaves the cluster. See
+<<troubleshooting-unstable-cluster-disconnected>> for further information about
+identifying and troubleshooting this situation. If an outbound connection
+closes for some other reason, nodes will log a message such as the following:
+
+[source,text]
+----
+[INFO ][o.e.t.ClusterConnectionManager] [node-1] transport connection to [{node-2}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] closed by remote
+----
+
+Similarly, once an inbound connection is fully established, a node never
+spontaneously closes it unless the node is shutting down.
+
+Therefore if you see a node report that a connection to another node closed
+unexpectedly, something other than {es} likely caused the connection to close.
+A common cause is a misconfigured firewall with an improper timeout or another
+policy that's <<long-lived-connections,incompatible with long-lived idle connections>>. It could also
+be caused by general connectivity issues, such as packet loss due to faulty
+hardware or network congestion. If you're an advanced user, configure the
+following loggers to get more detailed information about network exceptions:
+
+[source,yaml]
+----
+logger.org.elasticsearch.transport.TcpTransport: DEBUG
+logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
+----
+
+If these logs do not show enough information to diagnose the problem, obtain a
+packet capture simultaneously from the nodes at both ends of an unstable
+connection and analyse it alongside the {es} logs from those nodes to determine
+if traffic between the nodes is being disrupted by another device on the
+network.
diff --git a/server/src/main/resources/org/elasticsearch/common/reference-docs-links.json b/server/src/main/resources/org/elasticsearch/common/reference-docs-links.json
index 3eb8939c22a65..cc0bc5e2257c8 100644
--- a/server/src/main/resources/org/elasticsearch/common/reference-docs-links.json
+++ b/server/src/main/resources/org/elasticsearch/common/reference-docs-links.json
@@ -2,8 +2,8 @@
   "INITIAL_MASTER_NODES": "important-settings.html#initial_master_nodes",
   "DISCOVERY_TROUBLESHOOTING": "discovery-troubleshooting.html",
   "UNSTABLE_CLUSTER_TROUBLESHOOTING": "troubleshooting-unstable-cluster.html",
-  "LAGGING_NODE_TROUBLESHOOTING": "troubleshooting-unstable-cluster.html#_diagnosing_lagging_nodes_2",
-  "SHARD_LOCK_TROUBLESHOOTING": "troubleshooting-unstable-cluster.html#_diagnosing_shardlockobtainfailedexception_failures_2",
+  "LAGGING_NODE_TROUBLESHOOTING": "troubleshooting-unstable-cluster.html#troubleshooting-unstable-cluster-lagging",
+  "SHARD_LOCK_TROUBLESHOOTING": "troubleshooting-unstable-cluster.html#troubleshooting-unstable-cluster-shardlockobtainfailedexception",
   "CONCURRENT_REPOSITORY_WRITERS": "diagnosing-corrupted-repositories.html",
   "ARCHIVE_INDICES": "archive-indices.html",
   "HTTP_TRACER": "modules-network.html#http-rest-request-tracer",