@@ -736,11 +736,16 @@ cluster and that they operate only on local files:
736736
737737Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule
738738needs, ``--rule`` is the value of the ``rule_id`` field that was displayed by
739- ``ceph osd crush rule dump ``. This test will attempt to map one million values
740- (in this example, the range defined by ``[--min-x,--max-x] ``) and must display
741- at least one bad mapping. If this test outputs nothing, all mappings have been
742- successful and you can be assured that the problem with your cluster is not
743- caused by bad mappings.
739+ ``ceph osd crush rule dump``. This test simulates a number of PG placements
740+ based on the CRUSH map. The exact count is determined by the range
741+ ``[--min-x,--max-x]``. PG placements are independent of each other and depend
742+ only on the hash and bucket algorithms, so any placement may fail on its own.
743+ If this test outputs nothing, all mappings were successful, which indicates
744+ an issue other than CRUSH mappings. If it does output bad mappings, as shown
745+ above, Ceph is unable to consistently place PGs in the current topology. As
746+ long as not all mappings are bad, the CRUSH rule can be configured to search
747+ longer for a viable placement.
748+
744749
745750Changing the value of set_choose_tries
746751~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -752,19 +757,91 @@ Changing the value of set_choose_tries
752757
753758 crushtool --decompile crush.map > crush.txt
754759
760+ For illustrative purposes a simplified CRUSH map will be used in this
761+ example, simulating a single host with four disks of sizes 3×1TiB and
762+ 1×200GiB. The settings below are chosen specifically for this example and
763+ will diverge from the :ref:`CRUSH Map Tunables <crush-map-tunables>`
764+ generally found in production clusters. As defaults may change, please refer
765+ to the correct version of the documentation for your release of Ceph.
766+
767+ _
768+
769+ ::
770+
771+ tunable choose_local_tries 0
772+ tunable choose_local_fallback_tries 0
773+ # artificially low total tries, for illustration
774+ tunable choose_total_tries 10
775+ tunable chooseleaf_descend_once 1
776+ tunable chooseleaf_vary_r 1
777+ tunable chooseleaf_stable 1
778+ tunable straw_calc_version 1
779+ tunable allowed_bucket_algs 54
780+
781+ # devices
782+ device 0 osd.0
783+ device 1 osd.1
784+ device 2 osd.2
785+ device 3 osd.3
786+
787+ # types
788+ type 0 osd
789+ type 1 host
790+ type 2 chassis
791+ type 3 rack
792+ type 4 row
793+ type 5 pdu
794+ type 6 pod
795+ type 7 room
796+ type 8 datacenter
797+ type 9 zone
798+ type 10 region
799+ type 11 root
800+
801+ # buckets
802+ host example {
803+ id -2
804+ alg straw2
805+ hash 0 # rjenkins1
806+ item osd.0 weight 1.00000
807+ item osd.1 weight 1.00000
808+ item osd.2 weight 1.00000
809+ item osd.3 weight 0.20000
810+ }
811+ root default {
812+ id -1
813+ alg straw2
814+ hash 0 # rjenkins1
815+ item example weight 3.20000
816+ }
817+
818+ # rules
819+ rule ec {
820+ id 0
821+ type erasure
822+ step set_chooseleaf_tries 5
823+ # artificially low tries, for illustration
824+ step set_choose_tries 5
825+ step take default
826+ step choose indep 0 type osd
827+ step emit
828+ }
829+
755830#. Add the following line to the rule::
756831
757832 step set_choose_tries 100
758833
759- The relevant part of the ``crush.txt`` file will resemble this::
834+ If the line already exists, as in this example, modify only its value.
835+ Ensure that the rule in ``crush.txt`` resembles the following after the
836+ change::
760837
761- rule erasurepool {
762- id 1
838+ rule ec {
839+ id 0
763840 type erasure
764841 step set_chooseleaf_tries 5
765842 step set_choose_tries 100
766843 step take default
767- step chooseleaf indep 0 type host
844+ step choose indep 0 type osd
768845 step emit
769846 }
770847
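The manual edit described in this step can also be sketched as a small script. This is only an illustration: ``set_choose_tries`` below is a hypothetical helper, not part of Ceph or ``crushtool``. It assumes that the decompiled ``crush.txt`` lays out rules as shown above and, when the step is missing, places the new line directly before the rule's first ``step take`` line, as in the example rule.

```python
import re


def set_choose_tries(crush_text: str, rule_name: str, tries: int) -> str:
    """Set 'step set_choose_tries' in one rule of a decompiled CRUSH map.

    Hypothetical helper for illustration; it only edits text and does not
    validate the resulting map.
    """
    out = []
    in_rule = False   # True while inside the rule named rule_name
    done = False      # True once the step has been replaced or inserted
    for line in crush_text.splitlines():
        stripped = line.strip()
        if re.fullmatch(rf"rule\s+{re.escape(rule_name)}\s*\{{", stripped):
            in_rule = True
        elif in_rule and stripped == "}":
            in_rule = False
        elif in_rule and not done:
            indent = line[: len(line) - len(line.lstrip())]
            if stripped.startswith("step set_choose_tries "):
                # The step already exists: only change its value.
                line = f"{indent}step set_choose_tries {tries}"
                done = True
            elif stripped.startswith("step take "):
                # The step is missing: add it before 'step take'.
                out.append(f"{indent}step set_choose_tries {tries}")
                done = True
        out.append(line)
    return "\n".join(out) + "\n"


example_rule = """rule ec {
        id 0
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 5
        step take default
        step choose indep 0 type osd
        step emit
}
"""

print(set_choose_tries(example_rule, "ec", 100))
```

The edited text still has to be compiled and tested with ``crushtool`` as described in the following steps. The script only replaces the manual text edit.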
@@ -783,59 +860,52 @@ Changing the value of set_choose_tries
783860
784861 crushtool -i better-crush.map --test --show-bad-mappings \
785862 --show-choose-tries \
786- --rule 1 \
787- --num-rep 9 \
788- --min-x 1 --max-x $((1024 * 1024))
789- ...
790- 11: 42
791- 12: 44
792- 13: 54
793- 14: 45
794- 15: 35
795- 16: 34
796- 17: 30
797- 18: 25
798- 19: 19
799- 20: 22
800- 21: 20
801- 22: 17
802- 23: 13
803- 24: 16
804- 25: 13
805- 26: 11
806- 27: 11
807- 28: 13
808- 29: 11
809- 30: 10
810- 31: 6
811- 32: 5
812- 33: 10
813- 34: 3
814- 35: 7
815- 36: 5
816- 37: 2
817- 38: 5
818- 39: 5
819- 40: 2
820- 41: 5
821- 42: 4
822- 43: 1
823- 44: 2
824- 45: 2
825- 46: 3
826- 47: 1
827- 48: 0
828- ...
829- 102: 0
830- 103: 1
831- 104: 0
832- ...
833-
834- This output indicates that it took eleven tries to map forty-two PGs, twelve
835- tries to map forty-four PGs etc. The highest number of tries is the minimum
836- value of ``set_choose_tries `` that prevents bad mappings (for example,
837- ``103 `` in the above output, because it did not take more than 103 tries for
838- any PG to be mapped).
863+ --rule 0 \
864+ --num-rep 3 \
865+ --min-x 1 --max-x 10
866+ ::
867+
868+ 0: 0
869+ 1: 0
870+ 2: 4
871+ 3: 3
872+ 4: 1
873+ 5: 1
874+ 6: 1
875+ 7: 0
876+ 8: 0
877+ 9: 0
878+
879+ .. note:: The number of lines displayed equals the ``choose_total_tries``
880+ value of the CRUSH map. This setting does not affect the calculation done by
881+ ``crushtool``; it only truncates the output. The
882+ ``--set-choose-total-tries`` flag can be used to modify the value without
883+ modifying the CRUSH map.
884+
885+ The output is a histogram of the tries required for each placement. For
886+ ``--min-x 1`` and ``--max-x 10`` this totals 10 PG placements. All of these
887+ placements were successful, as is evident from the absence of bad-mapping
888+ diagnostic messages. The output indicates that four PGs were placed within
889+ two tries, while one PG was placed only after four tries. Any failed placement
890+ is counted in the bucket in which it failed. For example, with the original
891+ ``crush.txt`` the eighth placement failed after the fifth try and would have
892+ been counted in the fifth bucket together with one other mapping that
893+ succeeded on the fifth try. This is visible in the histogram of the updated
894+ map, which shows exactly one entry each for five and six tries. As mentioned
895+ above, PG placement is based solely on the CRUSH topology and the hash and
896+ bucket algorithms, so running the original ``crush.txt`` with just ``--x 8``
897+ instead of the range will fail deterministically. This means that much larger
898+ ranges, such as the ``1024 * 1024`` from an earlier example, should be used
899+ when evaluating an appropriate value for production.
900+
901+ To find an appropriate value for tries, or to determine whether tries are
902+ the underlying placement issue to begin with, set a very high value such as
903+ ``500`` and test with a large sample size (a large ``x`` range) to show the
904+ general distribution. Statistically, taking the last non-zero value as the
905+ maximum is very unlikely to cause any failed placements in practice. A
906+ lower value can be used if desired, at the risk of hitting one of the rare
907+ cases in which placement fails, which would then require manual
908+ intervention.
839909
840910.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
841911.. _Placement Groups: ../../operations/placement-groups