@@ -99,23 +99,25 @@ manually.
 MON_DISK_LOW
 ____________
 
-One or more monitors are low on storage space. This health check is raised if
-the percentage of available space on the file system used by the monitor
-database (normally ``/var/lib/ceph/mon``) drops below the percentage value
+One or more Monitors are low on storage space. This health check is raised when
+the percentage of available space on the file system used by the Monitor
+database (normally ``/var/lib/ceph/<fsid>/mon.<monid>``) drops below the threshold
 ``mon_data_avail_warn`` (default: 30%).
 
 This alert might indicate that some other process or user on the system is
-filling up the file system used by the monitor. It might also indicate that the
-monitor database is too large (see ``MON_DISK_BIG`` below). Another common
+filling up the file system used by the Monitor. It might also indicate that the
+Monitor database is too large (see ``MON_DISK_BIG`` below). Another common
 scenario is that Ceph logging subsystem levels have been raised for
 troubleshooting purposes without subsequent return to default levels. Ongoing
 verbose logging can easily fill up the file system containing ``/var/log``. If
 you trim logs that are currently open, remember to restart or instruct your
-syslog or other daemon to re-open the log file.
+syslog or other daemon to re-open the log file. Users or processes may also
+have written a large amount of data to ``/tmp`` or ``/var/tmp``, which may
+reside on the same file system.
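+
+To see how much space remains on the Monitor's file system and what is
+consuming it, standard utilities suffice. The paths below are illustrative
+and assume a default cephadm layout; substitute your cluster's ``fsid`` and
+Monitor name:
+
+.. prompt:: bash $
+
+   df -h /var/lib/ceph
+   du -sh /var/lib/ceph/<fsid>/mon.<monid>
+   du -sh /var/log/ceph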
 
 If space cannot be freed, the monitor's data directory might need to be moved
-to another storage device or file system (this relocation process must be
-carried out while the monitor daemon is not running).
+to another storage device or file system. This relocation process must be
+carried out while the Monitor daemon is not running.
 
 
 MON_DISK_CRIT
@@ -136,10 +138,12 @@ raised if the size of the monitor database is larger than
 A large database is unusual, but does not necessarily indicate a problem.
 Monitor databases might grow in size when there are placement groups that have
 not reached an ``active+clean`` state in a long time, or when extensive cluster
-recovery, expansion, or topology changes have recently occurred.
+recovery, expansion, or topology changes have recently occurred. When carrying
+out large-scale cluster changes, it is thus recommended to let the cluster
+"rest" for at least a few hours each week so that these transient states can clear.
 
 This alert may also indicate that the monitor's database is not properly
-compacting, an issue that has been observed with some older versions of
+compacting, an issue that has been observed with older versions of
 RocksDB. Forcing compaction with ``ceph daemon mon.<id> compact`` may suffice
 to shrink the database's storage usage.
 
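+For example, to force compaction of a Monitor's RocksDB store, run the
+following on that Monitor's host (``ceph daemon`` uses the local admin
+socket), substituting the Monitor's ID:
+
+.. prompt:: bash $
+
+   ceph daemon mon.<id> compact
+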
@@ -909,6 +913,8 @@ potentially replaced.
 ``log_latency_fn slow operation observed for upper_bound, latency = 6.25955s``
 ``log_latency slow operation observed for submit_transaction..``
 
+This may also be reflected by the ``BLUESTORE_SLOW_OP_ALERT`` cluster health flag.
+
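+One quick way to check whether this flag is currently raised is to filter the
+cluster health output, for example:
+
+.. prompt:: bash $
+
+   ceph health detail | grep -i slow
+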
 As there can be false positive ``slow ops`` instances, a mechanism has
 been added for more reliability. If in the last ``bluestore_slow_ops_warn_lifetime``
 seconds the number of ``slow ops`` indications is found greater than or equal to
@@ -920,20 +926,20 @@ The defaults for :confval:`bluestore_slow_ops_warn_lifetime` and
 :confval:`bluestore_slow_ops_warn_threshold` may be overridden globally or for
 specific OSDs.
 
-To change this, run the following command:
+To change this, run a command of the following form:
 
 .. prompt:: bash $
 
-   ceph config set global bluestore_slow_ops_warn_lifetime 10
+   ceph config set global bluestore_slow_ops_warn_lifetime 300
    ceph config set global bluestore_slow_ops_warn_threshold 5
 
 This may be done for specific OSDs or a given mask, for example:
 
 .. prompt:: bash $
 
-   ceph config set osd.123 bluestore_slow_ops_warn_lifetime 10
+   ceph config set osd.123 bluestore_slow_ops_warn_lifetime 300
    ceph config set osd.123 bluestore_slow_ops_warn_threshold 5
-   ceph config set class:ssd bluestore_slow_ops_warn_lifetime 10
+   ceph config set class:ssd bluestore_slow_ops_warn_lifetime 300
    ceph config set class:ssd bluestore_slow_ops_warn_threshold 5
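+
+To verify the values in effect for a given OSD afterward, commands of the
+following form may be used:
+
+.. prompt:: bash $
+
+   ceph config get osd.123 bluestore_slow_ops_warn_lifetime
+   ceph config get osd.123 bluestore_slow_ops_warn_threshold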
 
 Device health
@@ -957,7 +963,7 @@ that recovery can proceed from surviving healthy OSDs. This must be
 done with extreme care and attention to failure domains so that data availability
 is not compromised.
 
-To check device health, run the following command:
+To check device health, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1036,7 +1042,7 @@ command:
 In most cases, the root cause of this issue is that one or more OSDs are
 currently ``down``: see ``OSD_DOWN`` above.
 
-To see the state of a specific problematic PG, run the following command:
+To see the state of a specific problematic PG, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1064,7 +1070,7 @@ command:
 In most cases, the root cause of this issue is that one or more OSDs are
 currently "down": see ``OSD_DOWN`` above.
 
-To see the state of a specific problematic PG, run the following command:
+To see the state of a specific problematic PG, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1145,7 +1151,7 @@ can be caused by RGW-bucket index objects that do not have automatic resharding
 enabled. For more information on resharding, see :ref:`RGW Dynamic Bucket Index
 Resharding <rgw_dynamic_bucket_index_resharding>`.
 
-To adjust the thresholds mentioned above, run the following commands:
+To adjust the thresholds mentioned above, run commands of the following form:
 
 .. prompt:: bash $
 
@@ -1161,7 +1167,7 @@ target threshold, write requests to the pool might block while data is flushed
 and evicted from the cache. This state normally leads to very high latencies
 and poor performance.
 
-To adjust the cache pool's target size, run the following commands:
+To adjust the cache pool's target size, run commands of the following form:
 
 .. prompt:: bash $
 
@@ -1190,12 +1196,11 @@ POOL_PG_NUM_NOT_POWER_OF_TWO
 ____________________________
 
 One or more pools have a ``pg_num`` value that is not a power of two. Although
-this is not strictly incorrect, it does lead to a less balanced distribution of
-data because some Placement Groups will have roughly twice as much data as
-others have.
+this is not fatal, it does lead to a less balanced distribution of
+data because some placement groups will hold much more data than others.
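+
+To review each pool's current ``pg_num``, one convenient way is to list pool
+details, for example:
+
+.. prompt:: bash $
+
+   ceph osd pool ls detail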
 
 This is easily corrected by setting the ``pg_num`` value for the affected
-pool(s) to a nearby power of two. To do so, run the following command:
+pool(s) to a nearby power of two. Enable the PG Autoscaler or run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1207,6 +1212,9 @@ To disable this health check, run the following command:
 
    ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
 
+Note that disabling this health check is not recommended.
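+
+If the check was previously disabled, the override can be removed (re-enabling
+the warning) with, for example:
+
+.. prompt:: bash $
+
+   ceph config rm global mon_warn_on_pool_pg_num_not_power_of_two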
+
+
 POOL_TOO_FEW_PGS
 ________________
 
@@ -1224,14 +1232,14 @@ running the following command:
    ceph osd pool set <pool-name> pg_autoscale_mode off
 
 To allow the cluster to automatically adjust the number of PGs for the pool,
-run the following command:
+run a command of the following form:
 
 .. prompt:: bash $
 
    ceph osd pool set <pool-name> pg_autoscale_mode on
 
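+To review the autoscaler's recommendations for each pool before or after
+enabling it, the following overview command can help (it requires the
+``pg_autoscaler`` manager module, which is enabled by default in recent
+releases):
+
+.. prompt:: bash $
+
+   ceph osd pool autoscale-status
+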
 Alternatively, to manually set the number of PGs for the pool to the
-recommended amount, run the following command:
+recommended amount, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1256,7 +1264,7 @@ The simplest way to mitigate the problem is to increase the number of OSDs in
 the cluster by adding more hardware. Note that, because the OSD count that is
 used for the purposes of this health check is the number of ``in`` OSDs,
 marking ``out`` OSDs ``in`` (if there are any ``out`` OSDs available) can also
-help. To do so, run the following command:
+help. To do so, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1282,14 +1290,14 @@ running the following command:
    ceph osd pool set <pool-name> pg_autoscale_mode off
 
 To allow the cluster to automatically adjust the number of PGs for the pool,
-run the following command:
+run a command of the following form:
 
 .. prompt:: bash $
 
    ceph osd pool set <pool-name> pg_autoscale_mode on
 
 Alternatively, to manually set the number of PGs for the pool to the
-recommended amount, run the following command:
+recommended amount, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1329,7 +1337,7 @@ in order to estimate the expected size of the pool. Only one of these
 properties should be non-zero. If both are set to a non-zero value, then
 ``target_size_ratio`` takes precedence and ``target_size_bytes`` is ignored.
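+
+To inspect the values currently set on a pool, ``ceph osd pool get`` can be
+queried for each property, for example:
+
+.. prompt:: bash $
+
+   ceph osd pool get <pool-name> target_size_bytes
+   ceph osd pool get <pool-name> target_size_ratio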
 
-To reset ``target_size_bytes`` to zero, run the following command:
+To reset ``target_size_bytes`` to zero, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1356,7 +1364,7 @@ out the `split` step when the PG count is adjusted from the data migration that
 is needed when ``pgp_num`` is changed.
 
 This issue is normally resolved by setting ``pgp_num`` to match ``pg_num``, so
-as to trigger the data migration, by running the following command:
+as to trigger the data migration, by running a command of the following form:
 
 .. prompt:: bash $
 
@@ -1387,7 +1395,7 @@ A pool exists but the pool has not been tagged for use by a particular
 application.
 
 To resolve this issue, tag the pool for use by an application. For
-example, if the pool is used by RBD, run the following command:
+example, if the pool is used by RBD, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1409,15 +1417,15 @@ One or more pools have reached (or are very close to reaching) their quota. The
 threshold to raise this health check is determined by the
 ``mon_pool_quota_crit_threshold`` configuration option.
 
-Pool quotas can be adjusted up or down (or removed) by running the following
-commands:
+Pool quotas can be adjusted up or down (or removed) by running commands of the following
+forms:
 
 .. prompt:: bash $
 
    ceph osd pool set-quota <pool> max_bytes <bytes>
   ceph osd pool set-quota <pool> max_objects <objects>
 
-To disable a quota, set the quota value to 0.
+To disable a quota, set the quota value to ``0``.
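+
+For example, to remove an existing byte quota from a pool:
+
+.. prompt:: bash $
+
+   ceph osd pool set-quota <pool> max_bytes 0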
 
 POOL_NEAR_FULL
 ______________
@@ -1427,8 +1435,8 @@ One or more pools are approaching a configured fullness threshold.
 
 One of the several thresholds that can raise this health check is determined by
 the ``mon_pool_quota_warn_threshold`` configuration option.
-Pool quotas can be adjusted up or down (or removed) by running the following
-commands:
+Pool quotas can be adjusted up or down (or removed) by running commands of the following
+forms:
 
 .. prompt:: bash $
 
@@ -1463,8 +1471,8 @@ Read or write requests to unfound objects will block.
 
 Ideally, a "down" OSD that has a more recent copy of the unfound object can be
 brought back online. To identify candidate OSDs, check the peering state of the
-PG(s) responsible for the unfound object. To see the peering state, run the
-following command:
+PG(s) responsible for the unfound object. To see the peering state, run a command
+of the following form:
 
 .. prompt:: bash $
 
@@ -1488,13 +1496,13 @@ following command from the daemon's host:
 
    ceph daemon osd.<id> ops
 
-To see a summary of the slowest recent requests, run the following command:
+To see a summary of the slowest recent requests, run a command of the following form:
 
 .. prompt:: bash $
 
    ceph daemon osd.<id> dump_historic_ops
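+
+If the daemon's host is not convenient to access, recent Ceph releases also
+expose these admin-socket commands through ``ceph tell``, which can be run
+from any host with an appropriate keyring, for example:
+
+.. prompt:: bash $
+
+   ceph tell osd.<id> dump_historic_ops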
 
-To see the location of a specific OSD, run the following command:
+To see the location of a specific OSD, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1517,7 +1525,7 @@ they are to be cleaned, and not that they have been examined and found to be
 clean). Misplaced or degraded PGs will not be flagged as ``clean`` (see
 *PG_AVAILABILITY* and *PG_DEGRADED* above).
 
-To manually initiate a scrub of a clean PG, run the following command:
+To manually initiate a scrub of a clean PG, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1546,7 +1554,7 @@ the Manager daemon.
 First Method
 ~~~~~~~~~~~~
 
-To manually initiate a deep scrub of a clean PG, run the following command:
+To manually initiate a deep scrub of a clean PG, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1580,7 +1588,7 @@ See `Redmine tracker issue #44959 <https://tracker.ceph.com/issues/44959>`_.
 Second Method
 ~~~~~~~~~~~~~
 
-To manually initiate a deep scrub of a clean PG, run the following command:
+To manually initiate a deep scrub of a clean PG, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1723,14 +1731,14 @@ To list recent crashes, run the following command:
 
    ceph crash ls-new
 
-To examine information about a specific crash, run the following command:
+To examine information about a specific crash, run a command of the following form:
 
 .. prompt:: bash $
 
    ceph crash info <crash-id>
 
 To silence this alert, you can archive the crash (perhaps after the crash
-has been examined by an administrator) by running the following command:
+has been examined by an administrator) by running a command of the following form:
 
 .. prompt:: bash $
 
@@ -1772,7 +1780,7 @@ running the following command:
    ceph crash info <crash-id>
 
 To silence this alert, you can archive the crash (perhaps after the crash has
-been examined by an administrator) by running the following command:
+been examined by an administrator) by running a command of the following form:
 
 .. prompt:: bash $
 
@@ -1842,7 +1850,7 @@ were set with an older version of Ceph that did not properly validate the
 syntax of those capabilities, or if (2) the syntax of the capabilities has
 changed.
 
-To remove the user(s) in question, run the following command:
+To remove the user(s) in question, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1851,8 +1859,8 @@ To remove the user(s) in question, run the following command:
 (This resolves the health check, but it prevents clients from being able to
 authenticate as the removed user.)
 
-Alternatively, to update the capabilities for the user(s), run the following
-command:
+Alternatively, to update the capabilities for the user(s), run a command of the following
+form:
 
 .. prompt:: bash $
 