Commit 912e809

Merge pull request ceph#60917 from zdover23/wip-doc-2024-12-03-rados-ops-health-checks

doc/rados: make sentences agree in health-checks.rst

Reviewed-by: Anthony D'Atri <[email protected]>

2 parents c273264 + aec87b9

1 file changed: doc/rados/operations/health-checks.rst (60 additions, 62 deletions)
@@ -29,58 +29,57 @@ Monitor

DAEMON_OLD_VERSION
__________________

One or more Ceph daemons are running an old Ceph release. A health check is
raised if multiple versions are detected. This condition must exist for a
period of time greater than ``mon_warn_older_version_delay`` (set to one week
by default) in order for the health check to be raised. This allows most
upgrades to proceed without raising a warning that is both expected and
ephemeral. If the upgrade is paused for an extended time, ``health mute`` can
be used by running ``ceph health mute DAEMON_OLD_VERSION --sticky``. Be sure,
however, to run ``ceph health unmute DAEMON_OLD_VERSION`` after the upgrade has
finished so that any future, unexpected instances are not masked.
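
The releases that the cluster's daemons are currently running can be surveyed
with the following command, whose output groups daemons by version (a useful
sanity check before and after an upgrade):

.. prompt:: bash $

   ceph versions

Multiple versions are expected while an upgrade is in progress; this health
check concerns versions that persist after the upgrade should have completed.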

MON_DOWN
________

One or more Ceph Monitor daemons are down. The cluster requires a majority
(more than one-half) of the provisioned monitors to be available. When one or
more monitors are down, clients may have a harder time forming their initial
connection to the cluster, as they may need to try additional IP addresses
before they reach an operating monitor.
5150

52-
Down monitor daemons should be restored or restarted as soon as possible to reduce the
53-
risk that an additional monitor failure may cause a service outage.
51+
Down monitor daemons should be restored or restarted as soon as possible to
52+
reduce the risk that an additional monitor failure may cause a service outage.
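
To see which monitors are currently in quorum (and, by omission, which are
down), run the following commands:

.. prompt:: bash $

   ceph mon stat
   ceph quorum_status --format json-pretty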

MON_CLOCK_SKEW
______________

The clocks on hosts running Ceph Monitor daemons are not well-synchronized.
This health check is raised if the cluster detects a clock skew greater than
``mon_clock_drift_allowed``.

This issue is best resolved by synchronizing the clocks by using a tool like
the legacy ``ntpd`` or the newer ``chrony``. It is ideal to configure NTP
daemons to sync against multiple internal and external sources for resilience;
the protocol will adaptively determine the best available source. It is also
beneficial to have the NTP daemons on Ceph Monitor hosts sync against each
other, as it is even more important that Monitors be synchronized with each
other than it is for them to be *correct* with respect to reference time.

If it is impractical to keep the clocks closely synchronized, the
``mon_clock_drift_allowed`` threshold can be increased. However, this value
must stay significantly below the ``mon_lease`` interval in order for the
monitor cluster to function properly. It is not difficult with a quality NTP
or PTP configuration to have sub-millisecond synchronization, so there are
very, very few occasions when it is appropriate to change this value.
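
The skew that the monitors currently observe can be inspected with ``ceph
time-sync-status``; if ``chrony`` is the time daemon in use, ``chronyc
tracking`` reports the local host's offset from its time sources:

.. prompt:: bash $

   ceph time-sync-status
   chronyc tracking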

MON_MSGR2_NOT_ENABLED
_____________________

The :confval:`ms_bind_msgr2` option is enabled but one or more monitors are not
configured in the cluster's monmap to bind to a v2 port. This means that
features specific to the msgr2 protocol (for example, encryption) are
unavailable on some or all connections.

In most cases this can be corrected by running the following command:

@@ -100,32 +99,32 @@ manually.

MON_DISK_LOW
____________

One or more monitors are low on storage space. This health check is raised if
the percentage of available space on the file system used by the monitor
database (normally ``/var/lib/ceph/mon``) drops below the percentage value
``mon_data_avail_warn`` (default: 30%).

This alert might indicate that some other process or user on the system is
filling up the file system used by the monitor. It might also indicate that the
monitor database is too large (see ``MON_DISK_BIG`` below). Another common
scenario is that Ceph logging subsystem levels have been raised for
troubleshooting purposes without subsequent return to default levels. Ongoing
verbose logging can easily fill up the file system containing ``/var/log``. If
you trim logs that are currently open, remember to restart or instruct your
syslog or other daemon to re-open the log file.

If space cannot be freed, the monitor's data directory might need to be moved
to another storage device or file system (this relocation process must be
carried out while the monitor daemon is not running).

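To distinguish a full file system from an oversized monitor database, compare
the file system's free space against the size of the store itself (the paths
below assume the default data directory location):

.. prompt:: bash $

   df -h /var/lib/ceph/mon
   du -sch /var/lib/ceph/mon/*
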
MON_DISK_CRIT
_____________

One or more monitors are critically low on storage space. This health check is
raised if the percentage of available space on the file system used by the
monitor database (normally ``/var/lib/ceph/mon``) drops below the percentage
value ``mon_data_avail_crit`` (default: 5%). See ``MON_DISK_LOW``, above.

MON_DISK_BIG
____________
@@ -235,8 +234,8 @@ this alert can be temporarily silenced by running the following command:

   ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w # 1 week

Although we do NOT recommend doing so, you can also disable this alert
indefinitely by running the following command:

.. prompt:: bash $

@@ -258,8 +257,8 @@ However, the cluster will still be able to perform client I/O operations and
recover from failures.

The down manager daemon(s) should be restarted as soon as possible to ensure
that the cluster can be monitored (for example, so that ``ceph -s`` information
is available and up to date, and so that metrics can be scraped by Prometheus).

MGR_MODULE_DEPENDENCY
@@ -300,9 +299,8 @@ ________

One or more OSDs are marked ``down``. The ceph-osd daemon(s) or their host(s)
may have crashed or been stopped, or peer OSDs might be unable to reach the OSD
over the public or private network. Common causes include a stopped or crashed
daemon, a "down" host, or a network failure.

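The affected OSDs, and their position in the CRUSH hierarchy (which often hints
at a common failed host or rack), can be listed with:

.. prompt:: bash $

   ceph osd tree down
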
Verify that the host is healthy, the daemon is started, and the network is
functioning. If the daemon has crashed, the daemon log file
@@ -513,9 +511,9 @@ or newer to start. To safely set the flag, run the following command:

OSD_FILESTORE
__________________

Warn if OSDs are running the old Filestore back end. The Filestore OSD back end
is deprecated; the BlueStore back end has been the default object store since
the Ceph Luminous release.

The 'mclock_scheduler' is not supported for Filestore OSDs. For this reason,
the default 'osd_op_queue' is set to 'wpq' for Filestore OSDs and is enforced
@@ -545,17 +543,17 @@ of any update to Reef or to later releases.

OSD_UNREACHABLE
_______________

The registered v1/v2 public address of one or more OSDs is outside the defined
``public_network`` subnet, which prevents these unreachable OSDs from
communicating properly with Ceph clients.

Even though these unreachable OSDs are in the ``up`` state, RADOS clients will
hang until the TCP timeout before erroring out due to this inconsistency.
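
The expected subnet and an OSD's registered addresses can be compared with the
following commands (``0`` is an example OSD id):

.. prompt:: bash $

   ceph config get mon public_network
   ceph osd find 0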

POOL_FULL
_________

One or more pools have reached their quota and no longer allow writes.

To see pool quotas and utilization, run the following command:
