Overview
========

- There is a finite set of health messages that a Ceph cluster can raise. These
- messages are known as *health checks*. Each health check has a unique
- identifier.
+ There is a set of health states that a Ceph cluster can raise. These
+ are known as *health checks*. Each health check has a unique identifier.

The identifier is a terse human-readable string -- that is, the identifier is
readable in much the same way as a typical variable name. It is intended to
- enable tools (for example, UIs) to make sense of health checks and present them
+ enable tools (for example, monitoring and UIs) to make sense of health checks and present them
in a way that reflects their meaning.

This page lists the health checks that are raised by the monitor and manager
- daemons. In addition to these, you might see health checks that originate
- from MDS daemons (see :ref:`cephfs-health-messages`), and health checks
- that are defined by ``ceph-mgr`` python modules.
+ daemons. In addition to these, you may see health checks that originate
+ from CephFS MDS daemons (see :ref:`cephfs-health-messages`), and health checks
+ that are defined by ``ceph-mgr`` modules.

Definitions
===========
@@ -30,47 +29,56 @@ Monitor
DAEMON_OLD_VERSION
__________________

- Warn if one or more old versions of Ceph are running on any daemons. A health
+ Warn if one or more Ceph daemons are running an old Ceph release. A health
check is raised if multiple versions are detected. This condition must exist
for a period of time greater than ``mon_warn_older_version_delay`` (set to one
week by default) in order for the health check to be raised. This allows most
- upgrades to proceed without the occurrence of a false warning. If the upgrade
- is paused for an extended time period, ``health mute`` can be used by running
+ upgrades to proceed without raising a warning that is both expected and
+ ephemeral. If the upgrade
+ is paused for an extended time, ``health mute`` can be used by running
``ceph health mute DAEMON_OLD_VERSION --sticky``. Be sure, however, to run
- ``ceph health unmute DAEMON_OLD_VERSION`` after the upgrade has finished.
+ ``ceph health unmute DAEMON_OLD_VERSION`` after the upgrade has finished so
+ that any future unexpected instances are not masked.
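+
+ To see which release each daemon is running, and therefore whether this check
+ merely reflects an upgrade that is still in progress, one convenient check is
+ the cluster's version summary:
+
+ .. prompt:: bash $
+
+    ceph versions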

MON_DOWN
________

- One or more monitor daemons are currently down. The cluster requires a majority
- (more than one-half) of the monitors to be available. When one or more monitors
- are down, clients might have a harder time forming their initial connection to
- the cluster, as they might need to try more addresses before they reach an
+ One or more Ceph Monitor daemons are down. The cluster requires a majority
+ (more than one-half) of the provisioned monitors to be available. When one or more monitors
+ are down, clients may have a harder time forming their initial connection to
+ the cluster, as they may need to try additional IP addresses before they reach an
operating monitor.

- The down monitor daemon should be restarted as soon as possible to reduce the
- risk of a subsequent monitor failure leading to a service outage.
+ Down monitor daemons should be restored or restarted as soon as possible to reduce the
+ risk that an additional monitor failure may cause a service outage.
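+
+ To determine which monitors are out of quorum, inspect the health detail and
+ the current quorum status, for example:
+
+ .. prompt:: bash $
+
+    ceph health detail
+    ceph quorum_status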

MON_CLOCK_SKEW
______________

- The clocks on the hosts running the ceph-mon monitor daemons are not
+ The clocks on hosts running Ceph Monitor daemons are not
well-synchronized. This health check is raised if the cluster detects a clock
skew greater than ``mon_clock_drift_allowed``.

This issue is best resolved by synchronizing the clocks by using a tool like
- ``ntpd`` or ``chrony``.
+ the legacy ``ntpd`` or the newer ``chrony``. It is ideal to configure
+ NTP daemons to sync against multiple internal and external sources for resilience;
+ the protocol will adaptively determine the best available source. It is also
+ beneficial to have the NTP daemons on Ceph Monitor hosts sync against each other,
+ as it is even more important that Monitors be synchronized with each other than it
+ is for them to be *correct* with respect to reference time.

If it is impractical to keep the clocks closely synchronized, the
- ``mon_clock_drift_allowed`` threshold can also be increased. However, this
+ ``mon_clock_drift_allowed`` threshold can be increased. However, this
value must stay significantly below the ``mon_lease`` interval in order for the
- monitor cluster to function properly.
+ monitor cluster to function properly. With a quality NTP or PTP
+ configuration it is not difficult to achieve sub-millisecond synchronization,
+ so there are very few situations in which it is appropriate to change this
+ value.
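+
+ The skew that each monitor observes can be checked from the cluster itself,
+ and the state of each host's time source can be checked with the NTP
+ daemon's own tooling (the ``chronyc`` commands below assume that ``chrony``
+ is in use):
+
+ .. prompt:: bash $
+
+    ceph time-sync-status
+    chronyc tracking
+    chronyc sources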

MON_MSGR2_NOT_ENABLED
_____________________

The :confval:`ms_bind_msgr2` option is enabled but one or more monitors are
- not configured to bind to a v2 port in the cluster's monmap. This
+ not configured in the cluster's monmap to bind to a v2 port. This
means that features specific to the msgr2 protocol (for example, encryption)
are unavailable on some or all connections.

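+ To see which address(es) each monitor is currently bound to, inspect the
+ monmap, for example:
+
+ .. prompt:: bash $
+
+    ceph mon dump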
@@ -85,22 +93,26 @@ port (6789) will continue to listen for v1 connections on 6789 and begin to
listen for v2 connections on the new default port 3300.

If a monitor is configured to listen for v1 connections on a non-standard port
- (that is, a port other than 6789), then the monmap will need to be modified
+ (that is, a port other than 6789), the monmap will need to be modified
manually.


MON_DISK_LOW
____________

- One or more monitors are low on disk space. This health check is raised if the
+ One or more monitors are low on storage space. This health check is raised if the
percentage of available space on the file system used by the monitor database
(normally ``/var/lib/ceph/mon``) drops below the percentage value
``mon_data_avail_warn`` (default: 30%).

This alert might indicate that some other process or user on the system is
filling up the file system used by the monitor. It might also
indicate that the monitor database is too large (see ``MON_DISK_BIG``
- below).
+ below). Another common scenario is that Ceph logging subsystem levels have
+ been raised for troubleshooting purposes without subsequent return to default
+ levels. Ongoing verbose logging can easily fill up the file system containing
+ ``/var/log``. If you trim logs that are currently open, remember to restart or
+ instruct your syslog or other daemon to re-open the log file.
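+
+ To see what is consuming space, check the relevant file systems, for example:
+
+ .. prompt:: bash $
+
+    df -h /var/lib/ceph/mon /var/log
+    du -sh /var/lib/ceph/mon
+
+ If raised debug levels are the cause, any logging overrides can be returned
+ to their defaults with ``ceph config rm``, for example ``ceph config rm mon
+ debug_mon`` (``debug_mon`` is only an example; revert whichever subsystems
+ were raised).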

If space cannot be freed, the monitor's data directory might need to be
moved to another storage device or file system (this relocation process must be carried out while the monitor
@@ -110,7 +122,7 @@ daemon is not running).
MON_DISK_CRIT
_____________

- One or more monitors are critically low on disk space. This health check is raised if the
+ One or more monitors are critically low on storage space. This health check is raised if the
percentage of available space on the file system used by the monitor database
(normally ``/var/lib/ceph/mon``) drops below the percentage value
``mon_data_avail_crit`` (default: 5%). See ``MON_DISK_LOW``, above.
@@ -124,14 +136,15 @@ raised if the size of the monitor database is larger than

A large database is unusual, but does not necessarily indicate a problem.
Monitor databases might grow in size when there are placement groups that have
- not reached an ``active+clean`` state in a long time.
+ not reached an ``active+clean`` state in a long time, or when extensive cluster
+ recovery, expansion, or topology changes have recently occurred.

- This alert might also indicate that the monitor's database is not properly
+ This alert may also indicate that the monitor's database is not properly
compacting, an issue that has been observed with some older versions of
- RocksDB. Forcing a compaction with ``ceph daemon mon.<id> compact`` might
- shrink the database's on-disk size.
+ RocksDB. Forcing compaction with ``ceph daemon mon.<id> compact`` may suffice
+ to shrink the database's storage usage.
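+
+ For example, to check the database size on a monitor host and then trigger
+ compaction (the data path shown is typical of package-based deployments and
+ may differ on yours):
+
+ .. prompt:: bash $
+
+    du -sh /var/lib/ceph/mon/*/store.db
+    ceph daemon mon.<id> compact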

- This alert might also indicate that the monitor has a bug that prevents it from
+ This alert may also indicate that the monitor has a bug that prevents it from
pruning the cluster metadata that it stores. If the problem persists, please
report a bug.

@@ -236,17 +249,17 @@ Manager
MGR_DOWN
________

- All manager daemons are currently down. The cluster should normally have at
- least one running manager (``ceph-mgr``) daemon. If no manager daemon is
- running, the cluster's ability to monitor itself will be compromised, and parts
- of the management API will become unavailable (for example, the dashboard will
- not work, and most CLI commands that report metrics or runtime state will
- block). However, the cluster will still be able to perform all I/O operations
- and to recover from failures.
+ All Ceph Manager daemons are currently down. The cluster should normally have
+ at least one running manager (``ceph-mgr``) daemon. If no manager daemon is
+ running, the cluster's ability to monitor itself will be compromised, and
+ parts of the management API will become unavailable (for example, the
+ dashboard will not work, and most CLI commands that report metrics or runtime
+ state will block). However, the cluster will still be able to perform client
+ I/O operations and recover from failures.

- The "down" manager daemon should be restarted as soon as possible to ensure
- that the cluster can be monitored (for example, so that the ``ceph -s``
- information is up to date, or so that metrics can be scraped by Prometheus).
+ The down manager daemon(s) should be restarted as soon as possible to ensure
+ that the cluster can be monitored (for example, so that ``ceph -s``
+ information is available and up to date, and so that metrics can be scraped
+ by Prometheus).
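+
+ Because orchestrator and most other management commands are themselves served
+ by the Manager, a down ``ceph-mgr`` daemon usually must be restarted directly
+ on its host with ``systemctl``. The unit name depends on how the cluster was
+ deployed; the commands below are only sketches, one for a package-based
+ daemon named ``mgr.x`` and one for a cephadm-managed daemon (substitute the
+ real daemon name and cluster FSID):
+
+ .. prompt:: bash #
+
+    systemctl restart ceph-mgr@x
+    systemctl restart ceph-<fsid>@mgr.<name>.service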


MGR_MODULE_DEPENDENCY
@@ -285,14 +298,15 @@ OSDs
OSD_DOWN
________

- One or more OSDs are marked "down". The ceph-osd daemon might have been
- stopped, or peer OSDs might be unable to reach the OSD over the network.
+ One or more OSDs are marked ``down``. The ``ceph-osd`` daemon(s) or their host(s)
+ may have crashed or been stopped, or peer OSDs might be unable to reach the OSD
+ over the public or private network.
Common causes include a stopped or crashed daemon, a "down" host, or a network
- outage.
+ failure.

Verify that the host is healthy, the daemon is started, and the network is
functioning. If the daemon has crashed, the daemon log file
- (``/var/log/ceph/ceph-osd.*``) might contain debugging information.
+ (``/var/log/ceph/ceph-osd.*``) may contain troubleshooting information.
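+
+ To identify the affected OSDs and see where they sit in the CRUSH hierarchy,
+ you can, for example, run:
+
+ .. prompt:: bash $
+
+    ceph health detail
+    ceph osd tree down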

OSD_<crush type>_DOWN
_____________________
@@ -319,7 +333,7 @@ _____________________
The utilization thresholds for `nearfull`, `backfillfull`, `full`, and/or
`failsafe_full` are not ascending. In particular, the following pattern is
expected: `nearfull < backfillfull`, `backfillfull < full`, and `full <
- failsafe_full`.
+ failsafe_full`. This can result in unexpected cluster behavior.
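+
+ The `nearfull`, `backfillfull`, and `full` ratios currently in effect appear
+ in the OSD map and can be inspected with, for example:
+
+ .. prompt:: bash $
+
+    ceph osd dump | grep ratio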

To adjust these utilization thresholds, run the following commands:

@@ -355,8 +369,14 @@ threshold by a small amount. To do so, run the following command:

   ceph osd set-full-ratio <ratio>

- Additional OSDs should be deployed in order to add new storage to the cluster,
- or existing data should be deleted in order to free up space in the cluster.
+ Additional OSDs should be deployed within appropriate CRUSH failure domains
+ in order to increase capacity, and/or existing data should be deleted
+ in order to free up space in the cluster. One subtle situation is that the
+ ``rados bench`` tool may have been used to test one or more pools' performance,
+ and the resulting RADOS objects were not subsequently cleaned up. You may
+ check for this by invoking ``rados ls`` against each pool and looking for
+ objects with names beginning with ``bench`` or other job names. These may
+ then be deleted manually, with great care, in order to reclaim capacity.
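+
+ A sketch of that check and cleanup, assuming the default ``rados bench``
+ object naming (a ``benchmark_data`` prefix); double-check the pool name
+ before removing anything:
+
+ .. prompt:: bash $
+
+    rados -p <pool-name> ls | grep benchmark_data | head
+    rados -p <pool-name> cleanup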

OSD_BACKFILLFULL
________________
@@ -493,7 +513,7 @@ or newer to start. To safely set the flag, run the following command:
OSD_FILESTORE
__________________

- Warn if OSDs are running Filestore. The Filestore OSD back end has been
+ Warn if OSDs are running the old Filestore back end. The Filestore OSD back end is
deprecated; the BlueStore back end has been the default object store since the
Ceph Luminous release.

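+ To see how many OSDs use each object store back end, you can, for example,
+ run:
+
+ .. prompt:: bash $
+
+    ceph osd count-metadata osd_objectstore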
@@ -518,9 +538,9 @@ temporarily silence this alert by running the following command:

   ceph health mute OSD_FILESTORE

- Since this migration can take a considerable amount of time to complete, we
- recommend that you begin the process well in advance of any update to Reef or
- to later releases.
+ Since migration of Filestore OSDs to BlueStore can take a considerable amount
+ of time to complete, we recommend that you begin the process well in advance
+ of any update to Reef or to later releases.

OSD_UNREACHABLE
_______________
@@ -778,10 +798,10 @@ about the source of the problem.
BLUESTORE_SPURIOUS_READ_ERRORS
______________________________

- One or more BlueStore OSDs detect spurious read errors on the main device.
+ One or more BlueStore OSDs detect read errors on the main device.
BlueStore has recovered from these errors by retrying disk reads. This alert
might indicate issues with underlying hardware, issues with the I/O subsystem,
- or something similar. In theory, such issues can cause permanent data
+ or something similar. Such issues can cause permanent data
corruption. Some observations on the root cause of spurious read errors can be
found here: https://tracker.ceph.com/issues/22464

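+ When investigating, it can also be useful to review kernel logs and drive
+ SMART data on the affected host; for example (the device name here is only a
+ placeholder):
+
+ .. prompt:: bash #
+
+    dmesg -T | grep -i 'i/o error'
+    smartctl -a /dev/sdX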
@@ -805,7 +825,8 @@ BLOCK_DEVICE_STALLED_READ_ALERT
_______________________________

There are certain BlueStore log messages that surface storage drive issues
- that can lead to performance degradation and can cause bad disk.
+ that can cause performance degradation and potentially data unavailability or
+ loss.

``read stalled read 0x29f40370000~100000 (buffered) since 63410177.290546s, timeout is 5.000000s``

@@ -815,7 +836,7 @@ can be found here: https://tracker.ceph.com/issues/62500

As there can be false positive ``stalled read`` instances, a mechanism
has been added for more reliability. If in last ``bdev_stalled_read_warn_lifetime``
- duration ``stalled read`` indications are found more than or equal to
+ duration the number of ``stalled read`` indications is found to be greater than or equal to
``bdev_stalled_read_warn_threshold`` for a given BlueStore block device, this
warning will be reported in ``ceph health detail``.

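+ If a particular environment produces frequent false positives, these
+ reporting thresholds can be tuned like other OSD options; for example (the
+ values shown are illustrative only):
+
+ .. prompt:: bash $
+
+    ceph config set osd bdev_stalled_read_warn_lifetime 86400
+    ceph config set osd bdev_stalled_read_warn_threshold 16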
@@ -861,7 +882,7 @@ BLUESTORE_SLOW_OP_ALERT
_______________________

There are certain BlueStore log messages that surface storage drive issues
- that can lead to performance degradation and can cause bad disk.
+ that can lead to performance degradation and data unavailability or loss.
``log_latency_fn slow operation observed for _txc_committed_kv, latency = 12.028621219s, txc = 0x55a107c30f00``
``log_latency_fn slow operation observed for upper_bound, latency = 6.25955s``
``log_latency slow operation observed for submit_transact, latency = 5.197036s``
@@ -907,16 +928,20 @@ appropriate response to this expected failure is (1) to mark the OSD ``out`` so
that data is migrated off of the OSD, and then (2) to remove the hardware from
the system. Note that this marking ``out`` is normally done automatically if
``mgr/devicehealth/self_heal`` is enabled (as determined by
- ``mgr/devicehealth/mark_out_threshold``).
+ ``mgr/devicehealth/mark_out_threshold``). If an OSD device is compromised but
+ the OSD(s) on that device are still ``up``, recovery can be degraded. In such
+ cases it may be advantageous to forcibly stop the OSD daemon(s) in question so
+ that recovery can proceed from surviving healthy OSDs. This should only be
+ done with extreme care so that data availability is not compromised.
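+
+ For example, ``ceph osd ok-to-stop`` can confirm that stopping a given OSD
+ will not immediately make data unavailable, after which the OSD can be
+ marked ``out``:
+
+ .. prompt:: bash $
+
+    ceph osd ok-to-stop <osd-id>
+    ceph osd out <osd-id>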

To check device health, run the following command:

.. prompt:: bash $

   ceph device info <device-id>

- Device life expectancy is set either by a prediction model that the mgr runs or
- by an external tool that is activated by running the following command:
+ Device life expectancy is set either by a prediction model that the Manager
+ runs or by an external tool that is activated by running the following command:

.. prompt:: bash $
