
Commit ddb7be3

Merge pull request ceph#65230 from anthonyeleven/bs-slow-op-alert

doc/rados/operations: Improve health-checks.rst

Reviewed-by: Zac Dover <[email protected]>

2 parents eb85a4b + ba5cb7b

1 file changed: +57 -49 lines changed

doc/rados/operations/health-checks.rst

Lines changed: 57 additions & 49 deletions
@@ -99,23 +99,25 @@ manually.
 MON_DISK_LOW
 ____________
 
-One or more monitors are low on storage space. This health check is raised if
-the percentage of available space on the file system used by the monitor
-database (normally ``/var/lib/ceph/mon``) drops below the percentage value
+One or more Monitors are low on storage space. This health check is raised when
+available space on the file system used by the Monitor
+database (normally ``/var/lib/ceph/<fsid>/mon.<monid>``) drops below the threshold
 ``mon_data_avail_warn`` (default: 30%).
 
 This alert might indicate that some other process or user on the system is
-filling up the file system used by the monitor. It might also indicate that the
-monitor database is too large (see ``MON_DISK_BIG`` below). Another common
+filling up the file system used by the Monitor. It might also indicate that the
+Monitor database is too large (see ``MON_DISK_BIG`` below). Another common
 scenario is that Ceph logging subsystem levels have been raised for
 troubleshooting purposes without subsequent return to default levels. Ongoing
 verbose logging can easily fill up the file system containing ``/var/log``. If
 you trim logs that are currently open, remember to restart or instruct your
-syslog or other daemon to re-open the log file.
+syslog or other daemon to re-open the log file. Another common dynamic is
+that users or processes have written a large amount of data to ``/tmp`` or
+``/var/tmp``, which may be on the same file system.
 
 If space cannot be freed, the monitor's data directory might need to be moved
-to another storage device or file system (this relocation process must be
-carried out while the monitor daemon is not running).
+to another storage device or file system. This relocation process must be
+carried out while the Monitor daemon is not running.
 
 
 MON_DISK_CRIT
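
For the ``MON_DISK_LOW`` check above, free space on the Monitor's file system
can be inspected and the warning threshold tuned with commands such as the
following (the path and the new threshold value are illustrative):

.. prompt:: bash $

   df -h /var/lib/ceph
   ceph config set mon mon_data_avail_warn 25
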
@@ -136,10 +138,12 @@ raised if the size of the monitor database is larger than
 A large database is unusual, but does not necessarily indicate a problem.
 Monitor databases might grow in size when there are placement groups that have
 not reached an ``active+clean`` state in a long time, or when extensive cluster
-recovery, expansion, or topology changes have recently occurred.
+recovery, expansion, or topology changes have recently occurred. When
+conducting large-scale cluster changes, it is thus recommended that the
+cluster be left to "rest" for at least a few hours once each week.
 
 This alert may also indicate that the monitor's database is not properly
-compacting, an issue that has been observed with some older versions of
+compacting, an issue that has been observed with older versions of
 RocksDB. Forcing compaction with ``ceph daemon mon.<id> compact`` may suffice
 to shrink the database's storage usage.
 
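For the ``MON_DISK_BIG`` check above, the database's on-disk size can be
checked and a compaction forced with commands of the following form, run on
the Monitor's host (``mon.a`` and the store path are illustrative):

.. prompt:: bash $

   du -sh /var/lib/ceph/mon/ceph-a/store.db
   ceph daemon mon.a compact
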
@@ -909,6 +913,8 @@ potentially replaced.
 ``log_latency_fn slow operation observed for upper_bound, latency = 6.25955s``
 ``log_latency slow operation observed for submit_transaction..``
 
+This may also be reflected by the ``BLUESTORE_SLOW_OP_ALERT`` cluster health flag.
+
 As there can be false positive ``slow ops`` instances, a mechanism has
 been added for more reliability. If in the last ``bluestore_slow_ops_warn_lifetime``
 seconds the number of ``slow ops`` indications is found greater than or equal to
@@ -920,20 +926,20 @@ The defaults for :confval:`bluestore_slow_ops_warn_lifetime` and
 :confval:`bluestore_slow_ops_warn_threshold` may be overridden globally or for
 specific OSDs.
 
-To change this, run the following command:
+To change this, run a command of the following form:
 
 .. prompt:: bash $
 
-   ceph config set global bluestore_slow_ops_warn_lifetime 10
+   ceph config set global bluestore_slow_ops_warn_lifetime 300
    ceph config set global bluestore_slow_ops_warn_threshold 5
 
 This may be done for specific OSDs or a given mask, for example:
 
 .. prompt:: bash $
 
-   ceph config set osd.123 bluestore_slow_ops_warn_lifetime 10
+   ceph config set osd.123 bluestore_slow_ops_warn_lifetime 300
    ceph config set osd.123 bluestore_slow_ops_warn_threshold 5
-   ceph config set class:ssd bluestore_slow_ops_warn_lifetime 10
+   ceph config set osd/class:ssd bluestore_slow_ops_warn_lifetime 300
    ceph config set osd/class:ssd bluestore_slow_ops_warn_threshold 5
 
 Device health
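
After such a change, the values in effect can be confirmed with ``ceph config
get``, for example:

.. prompt:: bash $

   ceph config get osd.123 bluestore_slow_ops_warn_lifetime
   ceph config get osd.123 bluestore_slow_ops_warn_threshold
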
@@ -957,7 +963,7 @@ that recovery can proceed from surviving healthy OSDs. This must be
 done with extreme care and attention to failure domains so that data availability
 is not compromised.
 
-To check device health, run the following command:
+To check device health, run a command of the following form:
 
 .. prompt:: bash $
 
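Device-health commands of this kind include, for example, ``ceph device ls``
to list known devices and ``ceph device get-health-metrics`` to dump the
collected metrics for one device:

.. prompt:: bash $

   ceph device ls
   ceph device get-health-metrics <devid>
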
@@ -1036,7 +1042,7 @@ command:
 In most cases, the root cause of this issue is that one or more OSDs are
 currently ``down``: see ``OSD_DOWN`` above.
 
-To see the state of a specific problematic PG, run the following command:
+To see the state of a specific problematic PG, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1064,7 +1070,7 @@ command:
 In most cases, the root cause of this issue is that one or more OSDs are
 currently "down": see ``OSD_DOWN`` above.
 
-To see the state of a specific problematic PG, run the following command:
+To see the state of a specific problematic PG, run a command of the following form:
 
 .. prompt:: bash $
 
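A specific PG's state can be queried with ``ceph tell``, for example (the PG
ID is illustrative):

.. prompt:: bash $

   ceph tell 2.5 query
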
@@ -1145,7 +1151,7 @@ can be caused by RGW-bucket index objects that do not have automatic resharding
 enabled. For more information on resharding, see :ref:`RGW Dynamic Bucket Index
 Resharding <rgw_dynamic_bucket_index_resharding>`.
 
-To adjust the thresholds mentioned above, run the following commands:
+To adjust the thresholds mentioned above, run a command of the following form:
 
 .. prompt:: bash $
 
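Assuming the large-omap options current at the time of writing, such an
adjustment looks like the following (the value is illustrative):

.. prompt:: bash $

   ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 200000
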
@@ -1161,7 +1167,7 @@ target threshold, write requests to the pool might block while data is flushed
 and evicted from the cache. This state normally leads to very high latencies
 and poor performance.
 
-To adjust the cache pool's target size, run the following commands:
+To adjust the cache pool's target size, run a command of the following form:
 
 .. prompt:: bash $
 
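For example, assuming a cache pool named ``hot-storage``, its targets can be
raised with:

.. prompt:: bash $

   ceph osd pool set hot-storage target_max_bytes 1099511627776
   ceph osd pool set hot-storage target_max_objects 1000000
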
@@ -1190,12 +1196,11 @@ POOL_PG_NUM_NOT_POWER_OF_TWO
 ____________________________
 
 One or more pools have a ``pg_num`` value that is not a power of two. Although
-this is not strictly incorrect, it does lead to a less balanced distribution of
-data because some Placement Groups will have roughly twice as much data as
-others have.
+this is not fatal, it does lead to a less balanced distribution of
+data because some placement groups will contain much more data than others.
 
 This is easily corrected by setting the ``pg_num`` value for the affected
-pool(s) to a nearby power of two. To do so, run the following command:
+pool(s) to a nearby power of two. Enable the PG Autoscaler or run a command of the following form:
 
 .. prompt:: bash $
 
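For example, a pool whose ``pg_num`` is 1000 would typically be moved to the
nearby power of two 1024:

.. prompt:: bash $

   ceph osd pool set <pool-name> pg_num 1024
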
@@ -1207,6 +1212,9 @@ To disable this health check, run the following command:
 
    ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
 
+Note that disabling this health check is not recommended.
+
+
 POOL_TOO_FEW_PGS
 ________________
 
@@ -1224,14 +1232,14 @@ running the following command:
    ceph osd pool set <pool-name> pg_autoscale_mode off
 
 To allow the cluster to automatically adjust the number of PGs for the pool,
-run the following command:
+run a command of the following form:
 
 .. prompt:: bash $
 
    ceph osd pool set <pool-name> pg_autoscale_mode on
 
 Alternatively, to manually set the number of PGs for the pool to the
-recommended amount, run the following command:
+recommended amount, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1256,7 +1264,7 @@ The simplest way to mitigate the problem is to increase the number of OSDs in
 the cluster by adding more hardware. Note that, because the OSD count that is
 used for the purposes of this health check is the number of ``in`` OSDs,
 marking ``out`` OSDs ``in`` (if there are any ``out`` OSDs available) can also
-help. To do so, run the following command:
+help. To do so, run a command of the following form:
 
 .. prompt:: bash $
 
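For example, to mark OSD 123 ``in`` again:

.. prompt:: bash $

   ceph osd in 123
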
@@ -1282,14 +1290,14 @@ running the following command:
    ceph osd pool set <pool-name> pg_autoscale_mode off
 
 To allow the cluster to automatically adjust the number of PGs for the pool,
-run the following command:
+run a command of the following form:
 
 .. prompt:: bash $
 
    ceph osd pool set <pool-name> pg_autoscale_mode on
 
 Alternatively, to manually set the number of PGs for the pool to the
-recommended amount, run the following command:
+recommended amount, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1329,7 +1337,7 @@ in order to estimate the expected size of the pool. Only one of these
 properties should be non-zero. If both are set to a non-zero value, then
 ``target_size_ratio`` takes precedence and ``target_size_bytes`` is ignored.
 
-To reset ``target_size_bytes`` to zero, run the following command:
+To reset ``target_size_bytes`` to zero, run a command of the following form:
 
 .. prompt:: bash $
 
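The reset is an ordinary pool-property change, for example:

.. prompt:: bash $

   ceph osd pool set <pool-name> target_size_bytes 0
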
@@ -1356,7 +1364,7 @@ out the `split` step when the PG count is adjusted from the data migration that
 is needed when ``pgp_num`` is changed.
 
 This issue is normally resolved by setting ``pgp_num`` to match ``pg_num``, so
-as to trigger the data migration, by running the following command:
+as to trigger the data migration, by running a command of the following form:
 
 .. prompt:: bash $
 
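For example, for a pool whose ``pg_num`` is 128:

.. prompt:: bash $

   ceph osd pool set <pool-name> pgp_num 128
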
@@ -1387,7 +1395,7 @@ A pool exists but the pool has not been tagged for use by a particular
 application.
 
 To resolve this issue, tag the pool for use by an application. For
-example, if the pool is used by RBD, run the following command:
+example, if the pool is used by RBD, run a command of the following form:
 
 .. prompt:: bash $
 
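One way to apply such a tag is ``ceph osd pool application enable``; for an
RBD pool this looks like:

.. prompt:: bash $

   ceph osd pool application enable <pool-name> rbd
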
@@ -1409,15 +1417,15 @@ One or more pools have reached (or are very close to reaching) their quota. The
 threshold to raise this health check is determined by the
 ``mon_pool_quota_crit_threshold`` configuration option.
 
-Pool quotas can be adjusted up or down (or removed) by running the following
-commands:
+Pool quotas can be adjusted up or down (or removed) by running commands of the
+following forms:
 
 .. prompt:: bash $
 
    ceph osd pool set-quota <pool> max_bytes <bytes>
    ceph osd pool set-quota <pool> max_objects <objects>
 
-To disable a quota, set the quota value to 0.
+To disable a quota, set the quota value to ``0``.
 
 POOL_NEAR_FULL
 ______________
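
As a concrete instance of the quota commands above, a 10 GiB byte quota can be
set and later removed (removal meaning "set to ``0``") like so:

.. prompt:: bash $

   ceph osd pool set-quota <pool> max_bytes 10737418240
   ceph osd pool set-quota <pool> max_bytes 0
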
@@ -1427,8 +1435,8 @@ One or more pools are approaching a configured fullness threshold.
 One of the several thresholds that can raise this health check is determined by
 the ``mon_pool_quota_warn_threshold`` configuration option.
 
-Pool quotas can be adjusted up or down (or removed) by running the following
-commands:
+Pool quotas can be adjusted up or down (or removed) by running commands of the following
+forms:
 
 .. prompt:: bash $
 
@@ -1463,8 +1471,8 @@ Read or write requests to unfound objects will block.
 
 Ideally, a "down" OSD that has a more recent copy of the unfound object can be
 brought back online. To identify candidate OSDs, check the peering state of the
-PG(s) responsible for the unfound object. To see the peering state, run the
-following command:
+PG(s) responsible for the unfound object. To see the peering state, run a command
+of the following form:
 
 .. prompt:: bash $
 
@@ -1488,13 +1496,13 @@ following command from the daemon's host:
 
    ceph daemon osd.<id> ops
 
-To see a summary of the slowest recent requests, run the following command:
+To see a summary of the slowest recent requests, run a command of the following form:
 
 .. prompt:: bash $
 
    ceph daemon osd.<id> dump_historic_ops
 
-To see the location of a specific OSD, run the following command:
+To see the location of a specific OSD, run a command of the following form:
 
 .. prompt:: bash $
 
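The location lookup takes an OSD ID, for example:

.. prompt:: bash $

   ceph osd find 123
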
@@ -1517,7 +1525,7 @@ they are to be cleaned, and not that they have been examined and found to be
 clean). Misplaced or degraded PGs will not be flagged as ``clean`` (see
 *PG_AVAILABILITY* and *PG_DEGRADED* above).
 
-To manually initiate a scrub of a clean PG, run the following command:
+To manually initiate a scrub of a clean PG, run a command of the following form:
 
 .. prompt:: bash $
 
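Assuming the long-standing ``ceph pg scrub`` interface, a scrub of a specific
PG can be initiated like so:

.. prompt:: bash $

   ceph pg scrub <pgid>
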
@@ -1546,7 +1554,7 @@ the Manager daemon.
 First Method
 ~~~~~~~~~~~~
 
-To manually initiate a deep scrub of a clean PG, run the following command:
+To manually initiate a deep scrub of a clean PG, run a command of the following form:
 
 .. prompt:: bash $
 
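For example, assuming the ``ceph pg deep-scrub`` interface, a deep scrub of PG
``2.ab`` would be initiated with:

.. prompt:: bash $

   ceph pg deep-scrub 2.ab
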
@@ -1580,7 +1588,7 @@ See `Redmine tracker issue #44959 <https://tracker.ceph.com/issues/44959>`_.
 Second Method
 ~~~~~~~~~~~~~
 
-To manually initiate a deep scrub of a clean PG, run the following command:
+To manually initiate a deep scrub of a clean PG, run a command of the following form:
 
 .. prompt:: bash $
 
@@ -1723,14 +1731,14 @@ To list recent crashes, run the following command:
 
    ceph crash ls-new
 
-To examine information about a specific crash, run the following command:
+To examine information about a specific crash, run a command of the following form:
 
 .. prompt:: bash $
 
    ceph crash info <crash-id>
 
 To silence this alert, you can archive the crash (perhaps after the crash
-has been examined by an administrator) by running the following command:
+has been examined by an administrator) by running a command of the following form:
 
 .. prompt:: bash $
 
@@ -1772,7 +1780,7 @@ running the following command:
    ceph crash info <crash-id>
 
 To silence this alert, you can archive the crash (perhaps after the crash has
-been examined by an administrator) by running the following command:
+been examined by an administrator) by running a command of the following form:
 
 .. prompt:: bash $
 
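The archiving command takes the crash ID reported by ``ceph crash ls-new``,
for example:

.. prompt:: bash $

   ceph crash archive <crash-id>

To archive all new crashes at once, ``ceph crash archive-all`` can be used
instead.
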
@@ -1842,7 +1850,7 @@ were set with an older version of Ceph that did not properly validate the
 syntax of those capabilities, or if (2) the syntax of the capabilities has
 changed.
 
-To remove the user(s) in question, run the following command:
+To remove the user(s) in question, run a command of the following form:
 
 .. prompt:: bash $
 
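Assuming the standard ``ceph auth rm`` command, removal looks like this (the
entity name is illustrative):

.. prompt:: bash $

   ceph auth rm client.bad-user
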
@@ -1851,8 +1859,8 @@ To remove the user(s) in question, run the following command:
 (This resolves the health check, but it prevents clients from being able to
 authenticate as the removed user.)
 
-Alternatively, to update the capabilities for the user(s), run the following
-command:
+Alternatively, to update the capabilities for the user(s), run a command of the following
+form:
 
 .. prompt:: bash $
 
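Capabilities are rewritten in full with ``ceph auth caps``; a sketch, with
illustrative capability strings:

.. prompt:: bash $

   ceph auth caps client.<name> mon 'allow r' osd 'allow rw pool=<pool>'
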