Commit 85ab08a

doc: detailed explanation of set_choose_tries
- specifically call the *crushtool* output a histogram
- include a surface explanation of how PG placement calculation works
- more info on `choose_total_tries`
- small but complete example for explanatory purposes
- that way people can follow along locally and test out things

Signed-off-by: benaryorg <[email protected]>
1 parent 365106c commit 85ab08a

1 file changed

+132
-62
lines changed

doc/rados/troubleshooting/troubleshooting-pg.rst

Lines changed: 132 additions & 62 deletions
@@ -736,11 +736,16 @@ cluster and that they operate only on local files:
736736

737737
Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule
738738
needs, ``--rule`` is the value of the ``rule_id`` field that was displayed by
739-
``ceph osd crush rule dump``. This test will attempt to map one million values
740-
(in this example, the range defined by ``[--min-x,--max-x]``) and must display
741-
at least one bad mapping. If this test outputs nothing, all mappings have been
742-
successful and you can be assured that the problem with your cluster is not
743-
caused by bad mappings.
739+
``ceph osd crush rule dump``. This test will simulate a number of PG placements
740+
based on the CRUSH map. The count is set by the inclusive range ``[--min-x,--max-x]``. PG
741+
placements are independent of each other, based only on the hash and bucket
742+
algorithms. Any placement may fail on its own. If this test outputs nothing
743+
then all mappings have been successful, indicating an issue other than CRUSH
744+
mappings. If it does output bad mappings, as shown above, Ceph is unable to
745+
consistently place PGs in the current topology. As long as not all mappings are
746+
considered bad, the CRUSH rule can be configured to search longer for a viable
747+
placement.
748+
744749

745750
Changing the value of set_choose_tries
746751
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -752,19 +757,91 @@ Changing the value of set_choose_tries
752757

753758
crushtool --decompile crush.map > crush.txt
754759

760+
For illustrative purposes a simplified CRUSH map will be used in this
761+
example, simulating a single host with four disks of sizes 3×1TiB and
762+
1×200GiB. The settings below are chosen specifically for this example and
763+
will diverge from the :ref:`CRUSH Map Tunables <crush-map-tunables>`
764+
generally found in production clusters. As defaults may change, please refer
765+
to the correct version of the documentation for your release of Ceph.
766+
767+
_
768+
769+
::
770+
771+
tunable choose_local_tries 0
772+
tunable choose_local_fallback_tries 0
773+
# artificially low total tries, for illustration
774+
tunable choose_total_tries 10
775+
tunable chooseleaf_descend_once 1
776+
tunable chooseleaf_vary_r 1
777+
tunable chooseleaf_stable 1
778+
tunable straw_calc_version 1
779+
tunable allowed_bucket_algs 54
780+
781+
# devices
782+
device 0 osd.0
783+
device 1 osd.1
784+
device 2 osd.2
785+
device 3 osd.3
786+
787+
# types
788+
type 0 osd
789+
type 1 host
790+
type 2 chassis
791+
type 3 rack
792+
type 4 row
793+
type 5 pdu
794+
type 6 pod
795+
type 7 room
796+
type 8 datacenter
797+
type 9 zone
798+
type 10 region
799+
type 11 root
800+
801+
# buckets
802+
host example {
803+
id -2
804+
alg straw2
805+
hash 0 # rjenkins1
806+
item osd.0 weight 1.00000
807+
item osd.1 weight 1.00000
808+
item osd.2 weight 1.00000
809+
item osd.3 weight 0.20000
810+
}
811+
root default {
812+
id -1
813+
alg straw2
814+
hash 0 # rjenkins1
815+
item example weight 3.20000
816+
}
817+
818+
# rules
819+
rule ec {
820+
id 0
821+
type erasure
822+
step set_chooseleaf_tries 5
823+
# artificially low tries, for illustration
824+
step set_choose_tries 5
825+
step take default
826+
step choose indep 0 type osd
827+
step emit
828+
}
829+
755830
#. Add the following line to the rule::
756831

757832
step set_choose_tries 100
758833

759-
The relevant part of the ``crush.txt`` file will resemble this::
834+
If the line already exists, as in this example, modify only the value.
835+
Ensure that the rule in ``crush.txt`` resembles the following after the
836+
change::
760837

761-
rule erasurepool {
762-
id 1
838+
rule ec {
839+
id 0
763840
type erasure
764841
step set_chooseleaf_tries 5
765842
step set_choose_tries 100
766843
step take default
767-
step chooseleaf indep 0 type host
844+
step choose indep 0 type osd
768845
step emit
769846
}
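The edit described above (change the value if the line exists, otherwise add the line) can also be scripted. The following is only a minimal sketch: the helper name ``set_choose_tries`` and the choice to insert a missing line directly before the first ``step take`` are assumptions made for illustration, not part of Ceph or ``crushtool``. It works on the text of a decompiled map such as ``crush.txt``.

```python
import re


def set_choose_tries(crush_text: str, rule_name: str, tries: int) -> str:
    """Return crush_text with `step set_choose_tries <tries>` set in one rule.

    Hypothetical helper: an existing line is rewritten in place; if the rule
    has no such line, one is added before its first `step take`.
    """
    out = []
    in_rule = False
    done = False
    for line in crush_text.splitlines():
        stripped = line.strip()
        if re.fullmatch(rf"rule\s+{re.escape(rule_name)}\s*\{{", stripped):
            in_rule, done = True, False
        elif in_rule and stripped == "}":
            in_rule = False
        elif in_rule and not done:
            indent = line[: len(line) - len(line.lstrip())]
            if stripped.startswith("step set_choose_tries "):
                # The line exists already: only modify the value.
                line = f"{indent}step set_choose_tries {tries}"
                done = True
            elif stripped.startswith("step take "):
                # No existing line: add one ahead of the first `step take`.
                out.append(f"{indent}step set_choose_tries {tries}")
                done = True
        out.append(line)
    return "\n".join(out) + "\n"


# The example rule from this section, before the change.
example_rule = """\
rule ec {
    id 0
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 5
    step take default
    step choose indep 0 type osd
    step emit
}
"""

updated = set_choose_tries(example_rule, "ec", 100)
print(updated)
```

The result still has to be compiled and tested with ``crushtool`` as described in the surrounding steps; the sketch only rewrites text and does not validate the map.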
770847

@@ -783,59 +860,52 @@ Changing the value of set_choose_tries
783860

784861
crushtool -i better-crush.map --test --show-bad-mappings \
785862
--show-choose-tries \
786-
--rule 1 \
787-
--num-rep 9 \
788-
--min-x 1 --max-x $((1024 * 1024))
789-
...
790-
11: 42
791-
12: 44
792-
13: 54
793-
14: 45
794-
15: 35
795-
16: 34
796-
17: 30
797-
18: 25
798-
19: 19
799-
20: 22
800-
21: 20
801-
22: 17
802-
23: 13
803-
24: 16
804-
25: 13
805-
26: 11
806-
27: 11
807-
28: 13
808-
29: 11
809-
30: 10
810-
31: 6
811-
32: 5
812-
33: 10
813-
34: 3
814-
35: 7
815-
36: 5
816-
37: 2
817-
38: 5
818-
39: 5
819-
40: 2
820-
41: 5
821-
42: 4
822-
43: 1
823-
44: 2
824-
45: 2
825-
46: 3
826-
47: 1
827-
48: 0
828-
...
829-
102: 0
830-
103: 1
831-
104: 0
832-
...
833-
834-
This output indicates that it took eleven tries to map forty-two PGs, twelve
835-
tries to map forty-four PGs etc. The highest number of tries is the minimum
836-
value of ``set_choose_tries`` that prevents bad mappings (for example,
837-
``103`` in the above output, because it did not take more than 103 tries for
838-
any PG to be mapped).
863+
--rule 0 \
864+
--num-rep 3 \
865+
--min-x 1 --max-x 10
866+
::
867+
868+
0: 0
869+
1: 0
870+
2: 4
871+
3: 3
872+
4: 1
873+
5: 1
874+
6: 1
875+
7: 0
876+
8: 0
877+
9: 0
878+
879+
.. note:: The total number of lines displayed equals the ``choose_total_tries``
880+
value of the CRUSH map. However, the calculation done by ``crushtool`` will
881+
not be affected by the setting, only the output will be truncated. The
882+
``--set-choose-total-tries`` flag can be used to modify the value without
883+
modifying the CRUSH map.
884+
885+
The output is a histogram of the tries required for each placement. For
886+
``--min-x 1`` and ``--max-x 10`` this totals 10 PG placements. All of these
887+
placements have been successful, as is evident from the absence of bad mapping
888+
diagnostic messages. This output indicates that four PGs could be placed within
889+
two tries, while one PG was only placed after four tries. Any failed placement
890+
would be counted in the bucket in which it failed; for example, in the
891+
original ``crush.txt`` the eighth placement failed after the fifth try and
892+
would have been counted in the fifth bucket together with one other mapping
893+
which succeeded on the fifth try, visible in the histogram of the updated map
894+
showing exactly one entry for five and six tries. As mentioned above, PG
895+
placement is based solely on the CRUSH topology and the hash and bucket
896+
algorithms. Running the original ``crush.txt`` with just ``--x 8`` instead of
897+
the range will fail deterministically. This means that for evaluation of an
898+
appropriate value for production much larger ranges should be used such as the
899+
``1024 * 1024`` from an earlier example.
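To make the reading of the histogram concrete, the sketch below parses the ten lines shown above and recomputes the figures quoted in the text. The function name ``parse_histogram`` is hypothetical, and the parser assumes only that each histogram line has the form ``<tries>: <count>``; any other line is skipped.

```python
def parse_histogram(output: str) -> dict[int, int]:
    """Map number of tries to number of placements, from `<tries>: <count>` lines."""
    hist = {}
    for line in output.splitlines():
        left, sep, right = line.partition(":")
        # Keep only lines of the form "<int>: <int>"; skip everything else.
        if sep and left.strip().isdigit() and right.strip().isdigit():
            hist[int(left)] = int(right)
    return hist


# Histogram copied from the example output in this section.
sample = """\
0: 0
1: 0
2: 4
3: 3
4: 1
5: 1
6: 1
7: 0
8: 0
9: 0
"""

hist = parse_histogram(sample)
total_placements = sum(hist.values())
highest_tries = max(tries for tries, count in hist.items() if count > 0)

print(total_placements)  # 10, one per x in [--min-x, --max-x] = [1, 10]
print(hist[2])           # 4 placements succeeded on the second try
print(highest_tries)     # 6, the last non-zero entry
```

The sum of all counts equals the number of simulated placements, which is a quick consistency check when reading output from a much larger ``x`` range.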
900+
901+
To find an appropriate value for tries, or to determine whether this is the
902+
underlying issue with placement to begin with, setting a very high value such
903+
as ``500`` and testing with a large sample size (large ``x`` range) can be used
904+
to show the general distribution. From a statistical point of view, taking the
905+
last non-zero value as the maximum is very unlikely to cause any failed
906+
placements in practice. If a lower value is desired, it can be used at the
907+
risk of hitting one of the rare cases in which placement fails and manual
908+
intervention is required.
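The trade-off described above can be estimated from a histogram. The sketch below is illustrative only: the function names are hypothetical, and the histogram is the small ten-placement example from this section rather than the large sample recommended for a real evaluation. It assumes the histogram was produced with a tries limit high enough that no placement failed, so that every count reflects a successful placement.

```python
def last_nonzero_tries(hist: dict[int, int]) -> int:
    """Highest number of tries that any placement in the sample needed."""
    return max((tries for tries, count in hist.items() if count > 0), default=0)


def share_needing_more(hist: dict[int, int], tries: int) -> float:
    """Fraction of sampled placements that needed more than `tries` tries."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    over = sum(count for t, count in hist.items() if t > tries)
    return over / total


# Example histogram from this section (tries -> number of placements).
hist = {0: 0, 1: 0, 2: 4, 3: 3, 4: 1, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0}

safe_value = last_nonzero_tries(hist)
print(safe_value)                           # 6
print(share_needing_more(hist, safe_value)) # 0.0
print(share_needing_more(hist, 4))          # 0.2, two of ten placements needed more
```

A share of zero only means that no sampled placement needed more tries; placements outside the sampled ``x`` range may still need more, which is why a large range matters.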
839909

840910
.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
841911
.. _Placement Groups: ../../operations/placement-groups
