Commit 8858839

Merge pull request ceph#54417 from zdover23/wip-doc-2023-11-08-rados-troubleshooting-mon-common-issues-2-of-x
doc/rados: edit t-mon "common issues" (2 of x)

Reviewed-by: Anthony D'Atri <[email protected]>
2 parents dd3a082 + 7dcfa91 commit 8858839

1 file changed, 32 additions and 25 deletions

doc/rados/troubleshooting/troubleshooting-mon.rst

Lines changed: 32 additions & 25 deletions
@@ -218,31 +218,38 @@ detail`` returns a message similar to the following::
 the documentation.
 
 
-**What if the state is ``probing``?**
-
-This means the monitor is still looking for the other monitors. Every time
-you start a monitor, the monitor will stay in this state for some time while
-trying to connect the rest of the monitors specified in the ``monmap``. The
-time a monitor will spend in this state can vary. For instance, when on a
-single-monitor cluster (never do this in production), the monitor will pass
-through the probing state almost instantaneously. In a multi-monitor
-cluster, the monitors will stay in this state until they find enough monitors
-to form a quorum |---| this means that if you have 2 out of 3 monitors down, the
-one remaining monitor will stay in this state indefinitely until you bring
-one of the other monitors up.
-
-If you have a quorum the starting daemon should be able to find the
-other monitors quickly, as long as they can be reached. If your
-monitor is stuck probing and you have gone through with all the communication
-troubleshooting, then there is a fair chance that the monitor is trying
-to reach the other monitors on a wrong address. ``mon_status`` outputs the
-``monmap`` known to the monitor: check if the other monitor's locations
-match reality. If they don't, jump to
-`Recovering a Monitor's Broken monmap`_; if they do, then it may be related
-to severe clock skews amongst the monitor nodes and you should refer to
-`Clock Skews`_ first, but if that doesn't solve your problem then it is
-the time to prepare some logs and reach out to the community (please refer
-to `Preparing your logs`_ on how to best prepare your logs).
+**What does it mean if a Monitor's state is ``probing``?**
+
+If ``ceph health detail`` shows that a Monitor's state is
+``probing``, then the Monitor is still looking for the other Monitors. Every
+Monitor remains in this state for some time when it is started. When a
+Monitor has connected to the other Monitors specified in the ``monmap``, it
+ceases to be in the ``probing`` state. The amount of time that a Monitor is
+in the ``probing`` state depends upon the parameters of the cluster of which
+it is a part. For example, when a Monitor is a part of a single-monitor
+cluster (never do this in production), the monitor passes through the probing
+state almost instantaneously. In a multi-monitor cluster, the Monitors stay
+in the ``probing`` state until they find enough monitors to form a quorum
+|---| this means that if two out of three Monitors in the cluster are
+``down``, the one remaining Monitor stays in the ``probing`` state
+indefinitely until you bring one of the other monitors up.
+
+If quorum has been established, then the Monitor daemon should be able to
+find the other Monitors quickly, as long as they can be reached. If a Monitor
+is stuck in the ``probing`` state and you have exhausted the procedures above
+that describe the troubleshooting of communications between the Monitors,
+then it is possible that the problem Monitor is trying to reach the other
+Monitors at a wrong address. ``mon_status`` outputs the ``monmap`` that is
+known to the monitor: determine whether the other Monitors' locations as
+specified in the ``monmap`` match the locations of the Monitors in the
+network. If they do not, see `Recovering a Monitor's Broken monmap`_.
+If the locations of the Monitors as specified in the ``monmap`` match the
+locations of the Monitors in the network, then the persistent
+``probing`` state could be related to severe clock skews amongst the monitor
+nodes. See `Clock Skews`_. If the information in `Clock Skews`_ does not
+bring the Monitor out of the ``probing`` state, then prepare your system logs
+and ask the Ceph community for help. See `Preparing your logs`_ for
+information about the proper preparation of logs.
 
 
 **What if state is ``electing``?**
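
Note on the ``monmap`` check described in the changed text: the comparison between the addresses in a Monitor's ``monmap`` and the addresses actually in use on the network can be scripted. The following is a minimal sketch, not part of this commit. It assumes that the ``mon_status`` JSON has already been captured (for example from the Monitor's admin socket) and that it contains a ``monmap`` object with a ``mons`` list whose entries carry ``name`` and ``addr`` fields. The function name, the sample data, and the expected-address table are hypothetical; verify the field names against the output of your Ceph release.

```python
from typing import Dict, Optional, Tuple


def monmap_mismatches(
    mon_status: dict, expected_addrs: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {mon_name: (addr_in_monmap, expected_addr)} for each mismatch.

    Assumption: mon_status["monmap"]["mons"] is a list of dicts with
    "name" and "addr" keys, where "addr" looks like "ip:port/nonce".
    """
    seen = {}
    for mon in mon_status.get("monmap", {}).get("mons", []):
        # Drop the trailing "/nonce" so only "ip:port" is compared.
        seen[mon["name"]] = mon.get("addr", "").split("/")[0]

    mismatches = {}
    for name, expected in expected_addrs.items():
        actual = seen.get(name)  # None if the monmap does not list this Monitor
        if actual != expected:
            mismatches[name] = (actual, expected)
    return mismatches


# Hypothetical mon_status excerpt from a Monitor stuck in the probing state.
sample_status = {
    "name": "a",
    "state": "probing",
    "monmap": {
        "mons": [
            {"name": "a", "addr": "10.0.0.1:6789/0"},
            {"name": "b", "addr": "10.0.0.2:6789/0"},
            {"name": "c", "addr": "10.0.0.30:6789/0"},
        ]
    },
}

# Hypothetical addresses at which the Monitors really listen.
expected = {
    "a": "10.0.0.1:6789",
    "b": "10.0.0.2:6789",
    "c": "10.0.0.3:6789",
}

if __name__ == "__main__":
    for name, (actual, wanted) in monmap_mismatches(sample_status, expected).items():
        print(f"mon.{name}: monmap has {actual}, expected {wanted}")
```

An empty result suggests that the ``monmap`` matches the network, in which case the changed text points to `Clock Skews`_ as the next thing to check. A non-empty result points to `Recovering a Monitor's Broken monmap`_.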
