Commit 54e8f23

Merge pull request ceph#59546 from anthonyeleven/improve-rados-operations-health-checks.rst
doc/rados/operations: Improve health-checks.rst
2 parents a71318b + 2aa8253 commit 54e8f23

1 file changed: doc/rados/operations/health-checks.rst (+85 additions, -60 deletions)
@@ -7,19 +7,18 @@
 Overview
 ========

-There is a finite set of health messages that a Ceph cluster can raise. These
-messages are known as *health checks*. Each health check has a unique
-identifier.
+There is a set of health states that a Ceph cluster can raise. These
+are known as *health checks*. Each health check has a unique identifier.

 The identifier is a terse human-readable string -- that is, the identifier is
 readable in much the same way as a typical variable name. It is intended to
-enable tools (for example, UIs) to make sense of health checks and present them
+enable tools (for example, monitoring and UIs) to make sense of health checks and present them
 in a way that reflects their meaning.

 This page lists the health checks that are raised by the monitor and manager
-daemons. In addition to these, you might see health checks that originate
-from MDS daemons (see :ref:`cephfs-health-messages`), and health checks
-that are defined by ``ceph-mgr`` python modules.
+daemons. In addition to these, you may see health checks that originate
+from CephFS MDS daemons (see :ref:`cephfs-health-messages`), and health checks
+that are defined by ``ceph-mgr`` modules.

 Definitions
 ===========
@@ -30,47 +29,56 @@ Monitor
 DAEMON_OLD_VERSION
 __________________

-Warn if one or more old versions of Ceph are running on any daemons. A health
+Warn if one or more Ceph daemons are running an old Ceph release. A health
 check is raised if multiple versions are detected. This condition must exist
 for a period of time greater than ``mon_warn_older_version_delay`` (set to one
 week by default) in order for the health check to be raised. This allows most
-upgrades to proceed without the occurrence of a false warning. If the upgrade
-is paused for an extended time period, ``health mute`` can be used by running
+upgrades to proceed without raising a warning that is both expected and
+ephemeral. If the upgrade
+is paused for an extended time, ``health mute`` can be used by running
 ``ceph health mute DAEMON_OLD_VERSION --sticky``. Be sure, however, to run
-``ceph health unmute DAEMON_OLD_VERSION`` after the upgrade has finished.
+``ceph health unmute DAEMON_OLD_VERSION`` after the upgrade has finished so
+that any future, unexpected instances are not masked.

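For reference, a hedged sketch of the mute/unmute flow during a prolonged upgrade; ``ceph versions`` is shown only as a way to check the version spread and implies no particular release:

    # Show how many daemons are running each Ceph release
    ceph versions
    # Silence the warning while the upgrade is intentionally paused;
    # --sticky keeps the mute even if the check briefly clears and returns
    ceph health mute DAEMON_OLD_VERSION --sticky
    # After every daemon has been upgraded, re-enable the check
    ceph health unmute DAEMON_OLD_VERSION
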
 MON_DOWN
 ________

-One or more monitor daemons are currently down. The cluster requires a majority
-(more than one-half) of the monitors to be available. When one or more monitors
-are down, clients might have a harder time forming their initial connection to
-the cluster, as they might need to try more addresses before they reach an
+One or more Ceph Monitor daemons are down. The cluster requires a majority
+(more than one-half) of the provisioned monitors to be available. When one or more monitors
+are down, clients may have a harder time forming their initial connection to
+the cluster, as they may need to try additional IP addresses before they reach an
 operating monitor.

-The down monitor daemon should be restarted as soon as possible to reduce the
-risk of a subsequent monitor failure leading to a service outage.
+Down monitor daemons should be restored or restarted as soon as possible to reduce the
+risk that an additional monitor failure may cause a service outage.

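A hedged sketch of triage for this check; the monitor name ``mon-a`` and the systemd unit naming are assumptions that vary by deployment (cephadm clusters use fsid-scoped unit names):

    # Identify which monitors are out of quorum
    ceph health detail
    ceph mon stat
    # On the affected host, restart the monitor daemon
    systemctl restart ceph-mon@mon-a
    # Or, on cephadm-managed clusters with a responsive orchestrator:
    ceph orch daemon restart mon.mon-a
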
 MON_CLOCK_SKEW
 ______________

-The clocks on the hosts running the ceph-mon monitor daemons are not
+The clocks on hosts running Ceph Monitor daemons are not
 well-synchronized. This health check is raised if the cluster detects a clock
 skew greater than ``mon_clock_drift_allowed``.

 This issue is best resolved by synchronizing the clocks by using a tool like
-``ntpd`` or ``chrony``.
+the legacy ``ntpd`` or the newer ``chrony``. It is ideal to configure
+NTP daemons to sync against multiple internal and external sources for resilience;
+the protocol will adaptively determine the best available source. It is also
+beneficial to have the NTP daemons on Ceph Monitor hosts sync against each other,
+as it is even more important that Monitors be synchronized with each other than it
+is for them to be *correct* with respect to reference time.

 If it is impractical to keep the clocks closely synchronized, the
-``mon_clock_drift_allowed`` threshold can also be increased. However, this
+``mon_clock_drift_allowed`` threshold can be increased. However, this
 value must stay significantly below the ``mon_lease`` interval in order for the
-monitor cluster to function properly.
+monitor cluster to function properly. It is not difficult with a quality NTP
+or PTP configuration to have sub-millisecond synchronization, so there are very
+few occasions when it is appropriate to change this value.

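A hedged sketch of checking and, only as a last resort, loosening the skew threshold; the 0.1 s value is illustrative and command availability may vary slightly by release:

    # Report per-monitor clock offsets as seen by the cluster
    ceph time-sync-status
    # On each monitor host, confirm chrony is locked to healthy sources
    chronyc tracking
    chronyc sources -v
    # Last resort: raise the threshold slightly (must stay well below mon_lease)
    ceph config set mon mon_clock_drift_allowed 0.1
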
 MON_MSGR2_NOT_ENABLED
 _____________________

 The :confval:`ms_bind_msgr2` option is enabled but one or more monitors are
-not configured to bind to a v2 port in the cluster's monmap. This
+not configured in the cluster's monmap to bind to a v2 port. This
 means that features specific to the msgr2 protocol (for example, encryption)
 are unavailable on some or all connections.

@@ -85,22 +93,26 @@ port (6789) will continue to listen for v1 connections on 6789 and begin to
 listen for v2 connections on the new default port 3300.

 If a monitor is configured to listen for v1 connections on a non-standard port
-(that is, a port other than 6789), then the monmap will need to be modified
+(that is, a port other than 6789), the monmap will need to be modified
 manually.


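For monitors listening on the standard ports, the usual remedy (hedged here, since the relevant unchanged lines are not part of this hunk) is to let the monitors register their v2 addresses:

    # Update the monmap so each monitor binds to both the v1 (6789)
    # and v2 (3300) ports
    ceph mon enable-msgr2
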
 MON_DISK_LOW
 ____________

-One or more monitors are low on disk space. This health check is raised if the
+One or more monitors are low on storage space. This health check is raised if the
 percentage of available space on the file system used by the monitor database
 (normally ``/var/lib/ceph/mon``) drops below the percentage value
 ``mon_data_avail_warn`` (default: 30%).

 This alert might indicate that some other process or user on the system is
 filling up the file system used by the monitor. It might also
 indicate that the monitor database is too large (see ``MON_DISK_BIG``
-below).
+below). Another common scenario is that Ceph logging subsystem levels have
+been raised for troubleshooting purposes without subsequent return to default
+levels. Ongoing verbose logging can easily fill up the file system containing
+``/var/log``. If you trim logs that are currently open, remember to restart or
+instruct your syslog or other daemon to re-open the log file.

 If space cannot be freed, the monitor's data directory might need to be
 moved to another storage device or file system (this relocation process must be carried out while the monitor
@@ -110,7 +122,7 @@ daemon is not running).
 MON_DISK_CRIT
 _____________

-One or more monitors are critically low on disk space. This health check is raised if the
+One or more monitors are critically low on storage space. This health check is raised if the
 percentage of available space on the file system used by the monitor database
 (normally ``/var/lib/ceph/mon``) drops below the percentage value
 ``mon_data_avail_crit`` (default: 5%). See ``MON_DISK_LOW``, above.
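A hedged sketch of the first checks for either of the two alerts above; the paths are the defaults named in the text and may differ on your hosts:

    # How full is the file system holding the monitor database?
    df -h /var/lib/ceph/mon
    # How large is the monitor store itself?
    du -sh /var/lib/ceph/mon/*
    # Verbose logging is a common culprit; check the log volume too
    du -sh /var/log/ceph
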
@@ -124,14 +136,15 @@ raised if the size of the monitor database is larger than

 A large database is unusual, but does not necessarily indicate a problem.
 Monitor databases might grow in size when there are placement groups that have
-not reached an ``active+clean`` state in a long time.
+not reached an ``active+clean`` state in a long time, or when extensive cluster
+recovery, expansion, or topology changes have recently occurred.

-This alert might also indicate that the monitor's database is not properly
+This alert may also indicate that the monitor's database is not properly
 compacting, an issue that has been observed with some older versions of
-RocksDB. Forcing a compaction with ``ceph daemon mon.<id> compact`` might
-shrink the database's on-disk size.
+RocksDB. Forcing compaction with ``ceph daemon mon.<id> compact`` may suffice
+to shrink the database's storage usage.

-This alert might also indicate that the monitor has a bug that prevents it from
+This alert may also indicate that the monitor has a bug that prevents it from
 pruning the cluster metadata that it stores. If the problem persists, please
 report a bug.

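A hedged sketch of forcing compaction; the monitor id ``a`` and the ``store.db`` path are assumptions, and ``ceph daemon`` uses the admin socket, so it must be run on the monitor's own host:

    # Measure the store before and after compaction
    du -sh /var/lib/ceph/mon/*/store.db
    # Ask the monitor to compact its RocksDB store
    ceph daemon mon.a compact
    du -sh /var/lib/ceph/mon/*/store.db
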
@@ -236,17 +249,17 @@ Manager
 MGR_DOWN
 ________

-All manager daemons are currently down. The cluster should normally have at
-least one running manager (``ceph-mgr``) daemon. If no manager daemon is
-running, the cluster's ability to monitor itself will be compromised, and parts
-of the management API will become unavailable (for example, the dashboard will
-not work, and most CLI commands that report metrics or runtime state will
-block). However, the cluster will still be able to perform all I/O operations
-and to recover from failures.
+All Ceph Manager daemons are currently down. The cluster should normally have
+at least one running manager (``ceph-mgr``) daemon. If no manager daemon is
+running, the cluster's ability to monitor itself will be compromised, parts of
+the management API will become unavailable (for example, the dashboard will not
+work, and most CLI commands that report metrics or runtime state will block).
+However, the cluster will still be able to perform client I/O operations and
+recover from failures.

-The "down" manager daemon should be restarted as soon as possible to ensure
-that the cluster can be monitored (for example, so that the ``ceph -s``
-information is up to date, or so that metrics can be scraped by Prometheus).
+The down manager daemon(s) should be restarted as soon as possible to ensure
+that the cluster can be monitored (for example, so that ``ceph -s``
+information is available and up to date, and so that metrics can be scraped by Prometheus).


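A hedged sketch of recovery; the hostname and systemd unit name are assumptions, and note that ``ceph orch`` itself runs inside a manager module, so orchestrator commands are unlikely to help while every manager is down:

    # Confirm that no manager is active
    ceph -s
    ceph mgr stat
    # On a host that should run a manager, restart the daemon directly
    systemctl restart ceph-mgr@host1
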
 MGR_MODULE_DEPENDENCY
@@ -285,14 +298,15 @@ OSDs
 OSD_DOWN
 ________

-One or more OSDs are marked "down". The ceph-osd daemon might have been
-stopped, or peer OSDs might be unable to reach the OSD over the network.
+One or more OSDs are marked ``down``. The ceph-osd daemon(s) or their host(s)
+may have crashed or been stopped, or peer OSDs might be unable to reach the OSD
+over the public or private network.
 Common causes include a stopped or crashed daemon, a "down" host, or a network
-outage.
+failure.

 Verify that the host is healthy, the daemon is started, and the network is
 functioning. If the daemon has crashed, the daemon log file
-(``/var/log/ceph/ceph-osd.*``) might contain debugging information.
+(``/var/log/ceph/ceph-osd.*``) may contain troubleshooting information.

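A hedged sketch of first-pass triage; the OSD id ``12`` and unit name are placeholders, and the log path follows the default named above:

    # Which OSDs are down, and where do they sit in the CRUSH tree?
    ceph health detail
    ceph osd tree down
    # On the affected host, check and restart the daemon
    systemctl status ceph-osd@12
    systemctl restart ceph-osd@12
    # If it crashed, look for the reason in its log
    less /var/log/ceph/ceph-osd.12.log
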
 OSD_<crush type>_DOWN
 _____________________
@@ -319,7 +333,7 @@ _____________________
 The utilization thresholds for `nearfull`, `backfillfull`, `full`, and/or
 `failsafe_full` are not ascending. In particular, the following pattern is
 expected: `nearfull < backfillfull`, `backfillfull < full`, and `full <
-failsafe_full`.
+failsafe_full`. This can result in unexpected cluster behavior.

 To adjust these utilization thresholds, run the following commands:

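The adjustment commands themselves live in unchanged lines outside this hunk; as a hedged illustration, restoring the expected ascending order might look like this (the ratios shown are the usual defaults, not a recommendation):

    # Current values appear in the OSD map
    ceph osd dump | grep ratio
    # Restore nearfull < backfillfull < full
    ceph osd set-nearfull-ratio 0.85
    ceph osd set-backfillfull-ratio 0.90
    ceph osd set-full-ratio 0.95
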
@@ -355,8 +369,14 @@ threshold by a small amount. To do so, run the following command:

    ceph osd set-full-ratio <ratio>

-Additional OSDs should be deployed in order to add new storage to the cluster,
-or existing data should be deleted in order to free up space in the cluster.
+Additional OSDs should be deployed within appropriate CRUSH failure domains
+in order to increase capacity, and / or existing data should be deleted
+in order to free up space in the cluster. One subtle situation is that the
+``rados bench`` tool may have been used to test one or more pools' performance,
+and the resulting RADOS objects were not subsequently cleaned up. You may
+check for this by invoking ``rados ls`` against each pool and looking for
+objects with names beginning with ``bench`` or other job names. These may
+then be manually but very, very carefully deleted in order to reclaim capacity.

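A hedged sketch of checking for leftover benchmark objects; the pool name ``testpool`` and the object name are placeholders, so verify every name before deleting anything:

    # Look for objects left behind by `rados bench`
    rados -p testpool ls | grep '^bench' | head
    # If the benchmark metadata object still exists, rados can clean up after itself
    rados -p testpool cleanup
    # Otherwise remove stray objects one at a time, very carefully
    rados -p testpool rm benchmark_data_host1_12345_object0
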
 OSD_BACKFILLFULL
 ________________
@@ -493,7 +513,7 @@ or newer to start. To safely set the flag, run the following command:
 OSD_FILESTORE
 __________________

-Warn if OSDs are running Filestore. The Filestore OSD back end has been
+Warn if OSDs are running the old Filestore back end. The Filestore OSD back end is
 deprecated; the BlueStore back end has been the default object store since the
 Ceph Luminous release.

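A hedged way to see whether any Filestore OSDs remain; the OSD id ``12`` is a placeholder:

    # Count OSDs by object-store back end
    ceph osd count-metadata osd_objectstore
    # Inspect a specific OSD
    ceph osd metadata 12 | grep osd_objectstore
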
@@ -518,9 +538,9 @@ temporarily silence this alert by running the following command:

    ceph health mute OSD_FILESTORE

-Since this migration can take a considerable amount of time to complete, we
-recommend that you begin the process well in advance of any update to Reef or
-to later releases.
+Since migration of Filestore OSDs to BlueStore can take a considerable amount
+of time to complete, we recommend that you begin the process well in advance
+of any update to Reef or to later releases.

 OSD_UNREACHABLE
 _______________
@@ -778,10 +798,10 @@ about the source of the problem.
 BLUESTORE_SPURIOUS_READ_ERRORS
 ______________________________

-One or more BlueStore OSDs detect spurious read errors on the main device.
+One or more BlueStore OSDs detect read errors on the main device.
 BlueStore has recovered from these errors by retrying disk reads. This alert
 might indicate issues with underlying hardware, issues with the I/O subsystem,
-or something similar. In theory, such issues can cause permanent data
+or something similar. Such issues can cause permanent data
 corruption. Some observations on the root cause of spurious read errors can be
 found here: https://tracker.ceph.com/issues/22464

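A hedged sketch of narrowing down the affected hardware; the device name is a placeholder and the exact health wording may differ by release:

    # Which OSDs are reporting retried reads?
    ceph health detail
    # On the OSD's host, check the kernel log and the drive itself
    dmesg --level=err,warn | grep -i -E 'i/o|ata|nvme'
    smartctl -a /dev/sdX
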
@@ -805,7 +825,8 @@ BLOCK_DEVICE_STALLED_READ_ALERT
 _______________________________

 There are certain BlueStore log messages that surface storage drive issues
-that can lead to performance degradation and can cause bad disk.
+that can cause performance degradation and potentially data unavailability or
+loss.

 ``read stalled read 0x29f40370000~100000 (buffered) since 63410177.290546s, timeout is 5.000000s``

@@ -815,7 +836,7 @@ can be found here: https://tracker.ceph.com/issues/62500

 As there can be false positive ``stalled read`` instances, a mechanism
 has been added for more reliability. If in last ``bdev_stalled_read_warn_lifetime``
-duration ``stalled read`` indications are found more than or equal to
+duration the number of ``stalled read`` indications is found to be greater than or equal to
 ``bdev_stalled_read_warn_threshold`` for a given BlueStore block device, this
 warning will be reported in ``ceph health detail``.

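A hedged example of tuning this mechanism with the options named above; the values are illustrative only:

    # Require more stalled-read events before warning
    ceph config set osd bdev_stalled_read_warn_threshold 10
    # Count them over a window of this many seconds
    ceph config set osd bdev_stalled_read_warn_lifetime 3600
    # Or scope the change to a single suspect OSD
    ceph config set osd.12 bdev_stalled_read_warn_threshold 10
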
@@ -861,7 +882,7 @@ BLUESTORE_SLOW_OP_ALERT
 _______________________

 There are certain BlueStore log messages that surface storage drive issues
-that can lead to performance degradation and can cause bad disk.
+that can lead to performance degradation and data unavailability or loss.

 ``log_latency_fn slow operation observed for _txc_committed_kv, latency = 12.028621219s, txc = 0x55a107c30f00``
 ``log_latency_fn slow operation observed for upper_bound, latency = 6.25955s``
@@ -907,16 +928,20 @@ appropriate response to this expected failure is (1) to mark the OSD ``out`` so
 that data is migrated off of the OSD, and then (2) to remove the hardware from
 the system. Note that this marking ``out`` is normally done automatically if
 ``mgr/devicehealth/self_heal`` is enabled (as determined by
-``mgr/devicehealth/mark_out_threshold``).
+``mgr/devicehealth/mark_out_threshold``). If an OSD device is compromised but
+the OSD(s) on that device are still ``up``, recovery can be degraded. In such
+cases it may be advantageous to forcibly stop the OSD daemon(s) in question so
+that recovery can proceed from surviving healthy OSDs. This should only be
+done with extreme care so that data availability is not compromised.

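A hedged sketch of doing this carefully; the OSD id and unit name are placeholders:

    # Verify that stopping this OSD will not make any data unavailable
    ceph osd ok-to-stop osd.12
    # Migrate data away from the failing device
    ceph osd out osd.12
    # Only once the checks above are clean, stop the daemon on its host
    systemctl stop ceph-osd@12
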
 To check device health, run the following command:

 .. prompt:: bash $

    ceph device info <device-id>

-Device life expectancy is set either by a prediction model that the mgr runs or
-by an external tool that is activated by running the following command:
+Device life expectancy is set either by a prediction model that the Manager
+runs or by an external tool that is activated by running the following command:

 .. prompt:: bash $
