@@ -27,15 +27,15 @@ Autoscaling placement groups
2727Placement groups (PGs) are an internal implementation detail of how Ceph
2828distributes data. Autoscaling provides a way to manage PGs, and especially to
2929manage the number of PGs present in different pools. When *pg-autoscaling * is
30- enabled, the cluster is allowed to make recommendations or automatic
30+ enabled, the cluster makes recommendations or automatic
3131adjustments with respect to the number of PGs for each pool (``pgp_num ``) in
32- accordance with expected cluster utilization and expected pool utilization.
32+ accordance with observed and expected pool utilization.
3333
3434Each pool has a ``pg_autoscale_mode `` property that can be set to ``off ``,
3535``on ``, or ``warn ``:
3636
3737* ``off ``: Disable autoscaling for this pool. It is up to the administrator to
38- choose an appropriate ``pgp_num `` for each pool. For more information, see
38+ choose an appropriate ``pg_num `` for each pool. For more information, see
3939 :ref: `choosing-number-of-placement-groups `.
4040* ``on ``: Enable automated adjustments of the PG count for the given pool.
4141* ``warn ``: Raise health checks when the PG count is in need of adjustment.
@@ -53,7 +53,8 @@ For example, to enable autoscaling on pool ``foo``, run the following command:
5353
5454 ceph osd pool set foo pg_autoscale_mode on
5555
56- There is also a ``pg_autoscale_mode `` setting for any pools that are created
56+ There is also a central config ``pg_autoscale_mode `` option that controls the
57+ autoscale mode for pools that are created
5758after the initial setup of the cluster. To change this setting, run a command
5859of the following form:
5960
@@ -75,7 +76,7 @@ To set the ``noautoscale`` flag to ``off``, run the following command:
7576
7677 ceph osd pool unset noautoscale
7778
78- To get the value of the flag, run the following command:
79+ To get the current value of the flag, run the following command:
7980
8081.. prompt :: bash #
8182
@@ -104,53 +105,57 @@ The output will resemble the following::
104105
105106- **TARGET SIZE ** (if present) is the amount of data that is expected to be
106107 stored in the pool, as specified by the administrator. The system uses the
107- greater of the two values for its calculation .
108+ greater of ** SIZE ** and ** TARGET SIZE ** for its calculations .
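
  For example, an administrator who knows that a hypothetical pool named
  ``mypool`` will eventually store roughly 100 TiB of data could record that
  expectation with a command of the following form:

  .. prompt:: bash #

     ceph osd pool set mypool target_size_bytes 100T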
108109
109- - **RATE ** is the multiplier for the pool that determines how much raw storage
110- capacity is consumed. For example, a three-replica pool will have a ratio of
111- 3.0, and a ``k=4 m=2 `` erasure-coded pool will have a ratio of 1.5.
110+ - **RATE ** is the space amplification factor for the pool that indicates how much raw storage
111+ capacity is consumed for a given amount of user data . For example, a three-replica pool
112+ will show a value of 3.0, and a ``k=4 m=2 `` erasure-coded pool will have a value of 1.5.
112113
113114- **RAW CAPACITY ** is the total amount of raw storage capacity on the specific
114- OSDs that are responsible for storing the data of the pool (and perhaps the
115- data of other pools).
115+ OSDs available to the pool. Note that in many cases this capacity is shared
116+ among multiple pools.
116117
117118- **RATIO ** is the ratio of (1) the storage consumed by the pool to (2) the
118119 total raw storage capacity. In other words, RATIO is defined as
119- (SIZE * RATE) / RAW CAPACITY.
120+ (SIZE * RATE) / RAW CAPACITY and may be thought of as a fullness percentage .
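
  For example, a three-replica pool that stores 10 TiB of data on OSDs that
  provide 300 TiB of raw capacity has a RATIO of (10 * 3) / 300 = 0.10.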
120121
121122- **TARGET RATIO ** (if present) is the ratio of the expected storage of this
122- pool (that is, the amount of storage that this pool is expected to consume,
123- as specified by the administrator) to the expected storage of all other pools
123+ pool relative to the expected storage of all other pools
124124 that have target ratios set. If both ``target_size_bytes `` and
125125 ``target_size_ratio `` are specified, then ``target_size_ratio `` takes
126- precedence.
126+ precedence. Note that when the BIAS value is other than 1, notably for
127+ CephFS metadata and RGW index pools, the target ratio is best left unset,
128+ as adjusting both compounds the adjustments and can result in an inappropriate ``pg_num `` value.
127129
128130- **EFFECTIVE RATIO ** is the result of making two adjustments to the target
129131 ratio:
130132
131133 #. Subtracting any capacity expected to be used by pools that have target
132134 size set.
133135
134- #. Normalizing the target ratios among pools that have target ratio set so
136+ #. Normalizing the target ratios among pools that have a target ratio set so
135137 that collectively they target cluster capacity. For example, four pools
136- with target_ratio 1.0 would have an effective ratio of 0.25.
138+ with target_ratio 1.0 would each have an effective ratio of 0.25.
137139
138140 The system's calculations use whichever of these two ratios (that is, the
139141 target ratio and the effective ratio) is greater.
140142
141143- **BIAS ** is used as a multiplier to manually adjust a pool's PG count in accordance
142144 with prior information about how many PGs a specific pool is expected to
143- have.
145+ have. This is important for pools whose data is stored primarily in omaps
146+ rather than in RADOS object payloads, notably RGW index and CephFS / RBD EC
147+ metadata pools. When a bias other than 1.0 is set for a pool, it is advised
148+ not to set a target ratio as well.
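
  For example, to give a hypothetical CephFS metadata pool named
  ``cephfs.meta`` a bias of 4.0 (an illustrative value; the appropriate bias
  depends on the pool's expected omap load), run a command of the following
  form:

  .. prompt:: bash #

     ceph osd pool set cephfs.meta pg_autoscale_bias 4.0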
144149
145150- **PG_NUM ** is either the current number of PGs associated with the pool or,
146- if a ``pg_num `` change is in progress, the current number of PGs that the
147- pool is working towards.
151+ if a ``pg_num `` change is in progress, the target value.
148152
149153- **NEW PG_NUM ** (if present) is the value that the system recommends that the
150154 ``pg_num `` of the pool should be. It is always a power of two, and it
151155 is present only if the recommended value varies from the current value by
152- more than the default factor of ``3 ``. To adjust this multiple (in the
153- following example, it is changed to ``2 ``), run the following command:
156+ more than the default factor of ``3 ``.
157+ To adjust this multiple (in the following example, it is changed
158+ to ``2 ``), run a command of the following form:
154159
155160 .. prompt :: bash #
156161
@@ -161,7 +166,7 @@ The output will resemble the following::
161166
162167- **BULK ** determines whether the pool is ``bulk ``. It has a value of ``True ``
163168 or ``False ``. A ``bulk `` pool is expected to be large and should initially
164- have a large number of PGs so that performance does not suffer] . On the other
169+ have a large number of PGs so that performance does not suffer. On the other
165170 hand, a pool that is not ``bulk `` is expected to be small (for example, a
166171 ``.mgr `` pool or a meta pool).
167172
@@ -185,16 +190,16 @@ The output will resemble the following::
185190 ceph osd pool set .mgr crush_rule replicated-ssd
186191
187192 This intervention will result in a small amount of backfill, but
188- typically this traffic completes quickly.
193+ typically this is not disruptive and completes quickly.
189194
190195
191196Automated scaling
192197-----------------
193198
194199In the simplest approach to automated scaling, the cluster is allowed to
195- automatically scale `` pgp_num `` in accordance with usage. Ceph considers the
196- total available storage and the target number of PGs for the whole system ,
197- considers how much data is stored in each pool, and apportions PGs accordingly.
200+ automatically scale each pool's `` pg_num `` in accordance with usage. Ceph considers the
201+ total available storage, the target number of PG replicas for each OSD ,
202+ and how much data is stored in each pool, then apportions PGs accordingly.
198203The system is conservative with its approach, making changes to a pool only
199204when the current number of PGs (``pg_num ``) varies by more than a factor of 3
200205from the recommended number.
@@ -207,12 +212,16 @@ command:
207212
208213 ceph config set global mon_target_pg_per_osd 100
209214
215+ For all but the very smallest deployments, a value of 200 is recommended.
216+ A value above 500 may result in excessive peering traffic and RAM usage.
217+
210218The autoscaler analyzes pools and adjusts on a per-subtree basis. Because each
211219pool might map to a different CRUSH rule, and each rule might distribute data
212- across different devices, Ceph will consider the utilization of each subtree of
213- the hierarchy independently. For example, a pool that maps to OSDs of class
214- ``ssd `` and a pool that maps to OSDs of class ``hdd `` will each have optimal PG
215- counts that are determined by how many of these two different device types
220+ across different and possibly overlapping sets of devices,
221+ Ceph will consider the utilization of each subtree of
222+ the CRUSH hierarchy independently. For example, a pool that maps to OSDs of class
223+ ``ssd `` and a pool that maps to OSDs of class ``hdd `` will each have calculated PG
224+ counts that are determined by how many OSDs of these two different device types
216225there are.
217226
218227If a pool uses OSDs under two or more CRUSH roots (for example, shadow trees
@@ -221,8 +230,8 @@ user in the manager log. The warning states the name of the pool and the set of
221230roots that overlap each other. The autoscaler does not scale any pools with
222231overlapping roots because this condition can cause problems with the scaling
223232process. We recommend constraining each pool so that it belongs to only one
224- root (that is, one OSD class) to silence the warning and ensure a successful
225- scaling process .
233+ root (that is, one OSD device class) to silence the warning and ensure successful
234+ scaling.
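
For example, a hypothetical pool ``foo`` could be constrained to OSDs of class
``ssd`` by assigning it a replicated CRUSH rule (here named ``fast``) that
specifies that device class:

.. prompt:: bash #

   ceph osd crush rule create-replicated fast default host ssd
   ceph osd pool set foo crush_rule fast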
226235
227236.. _managing_bulk_flagged_pools :
228237
@@ -233,7 +242,8 @@ If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
233242complement of PGs and then scales down the number of PGs only if the usage
234243ratio across the pool is uneven. However, if a pool is not flagged ``bulk ``,
235244then the autoscaler starts the pool with minimal PGs and creates additional PGs
236- only if there is more usage in the pool.
245+ only if there is more usage in the pool. This flag should be used with care,
246+ as the resulting PG counts may not be what one expects.
237247
238248To create a pool that will be flagged ``bulk ``, run the following command:
239249
@@ -265,7 +275,8 @@ a small number of PGs. However, in some cases, cluster administrators know
265275which pools are likely to consume most of the system capacity in the long run.
266276When Ceph is provided with this information, a more appropriate number of PGs
267277can be used from the beginning, obviating subsequent changes in ``pg_num `` and
268- the associated overhead cost of relocating data.
278+ the associated overhead cost of relocating data. This also helps with performance
279+ and data uniformity by ensuring that PGs are placed on all available OSDs.
269280
270281The *target size * of a pool can be specified in two ways: either in relation to
271282the absolute size (in bytes) of the pool, or as a weight relative to all other
@@ -305,17 +316,22 @@ pool, then the latter will be ignored, the former will be used in system
305316calculations, and a health check (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO ``)
306317will be raised.
307318
319+ Note that in most cases it is advised not to set both a bias value other than 1.0
320+ and a target ratio on the same pool. Use a higher bias value for metadata /
321+ omap-rich pools and a target ratio for RADOS data-heavy pools.
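
For example, a hypothetical RADOS data-heavy pool named ``mydata`` could be
given a target ratio (a weight relative to other pools that have target ratios
set) with a command of the following form:

.. prompt:: bash #

   ceph osd pool set mydata target_size_ratio 1.0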
322+
323+
308324Specifying bounds on a pool's PGs
309325---------------------------------
310326
311327It is possible to specify both the minimum number and the maximum number of PGs
312328for a pool.
313329
314- Setting a Minimum Number of PGs and a Maximum Number of PGs
315- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
330+ Setting a Minimum Number of PGs or a Maximum Number of PGs
331+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
316332
317- If a minimum is set, then Ceph will not itself reduce (nor recommend that you
318- reduce) the number of PGs to a value below the configured value. Setting a
333+ If a minimum is set on a pool , then Ceph will not itself reduce (nor recommend that you
334+ reduce) the `` pg_num `` for that pool to a value below the configured value. Setting a
319335minimum serves to establish a lower bound on the amount of parallelism enjoyed
320336by a client during I/O, even if a pool is mostly empty.
321337
@@ -365,16 +381,18 @@ running a command of the following form:
365381
366382 ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)
367383
368- Without the balancer, the suggested target is approximately 100 PG replicas on
369- each OSD. With the balancer, an initial target of 50 PG replicas on each OSD is
370- reasonable.
384+ Without the balancer, the suggested (but not default) target for most clusters is
385+ 200 PG replicas on each OSD. With the balancer on and default values, an initial
386+ result of roughly 50-70 PG replicas on each OSD is expected. This is the value
387+ reported under the ``PGS `` column in the output of ``ceph osd df `` and is notably
388+ not the cluster's total number of PGs divided by the number of OSDs.
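
The per-OSD PG replica counts can be inspected by running the following command:

.. prompt:: bash #

   ceph osd df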
371389
372390The autoscaler attempts to satisfy the following conditions:
373391
374- - the number of PGs per OSD should be proportional to the amount of data in the
375- pool
376- - there should be 50-100 PGs per pool, taking into account the replication
377- overhead or erasure-coding fan-out of each PG's replicas across OSDs
392+ - The number of PG replicas per OSD should be proportional to the amount of data in the
393+ pool.
394+ - There should by default be 50-100 PGs per pool, taking into account the replication
395+ overhead or erasure-coding fan-out of each PG's replicas across OSDs.
378396
379397Use of Placement Groups
380398=======================
@@ -447,10 +465,14 @@ from the old ones.
447465Factors Relevant To Specifying pg_num
448466=====================================
449467
450- On the one hand, the criteria of data durability and even distribution across
451- OSDs weigh in favor of a high number of PGs. On the other hand, the criteria of
452- saving CPU resources and minimizing memory usage weigh in favor of a low number
453- of PGs.
468+ Performance and even data distribution across
469+ OSDs weigh in favor of a higher number of PGs. Conserving CPU resources and
470+ minimizing memory usage weigh in favor of a lower number of PGs.
471+ The latter was more of a concern before Filestore OSDs were deprecated, so
472+ most modern clusters with BlueStore OSDs can favor the former by
473+ configuring a value of 200-250 for ``mon_target_pg_per_osd `` and
474+ 500 for ``mon_max_pg_per_osd ``. Note that ``mon_max_pg_per_osd `` is only a
475+ failsafe and does not itself influence ``pg_num `` calculations.
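
Assuming the values suggested above are appropriate for a given cluster, they
can be applied centrally with commands of the following form:

.. prompt:: bash #

   ceph config set global mon_target_pg_per_osd 250
   ceph config set global mon_max_pg_per_osd 500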
454476
455477.. _data durability :
456478
@@ -478,12 +500,15 @@ let's imagine a scenario that results in permanent data loss in a single PG:
478500 OSD happened to contain the only remaining copy of an object, the object is
479501 permanently lost.
480502
481- In a cluster containing 10 OSDs with 512 PGs in a three-replica pool, CRUSH
482- will give each PG three OSDs. Ultimately, each OSD hosts :math: `\frac {(512 *
483- 3 )}{10 } = ~150 ` PGs. So when the first OSD fails in the above scenario,
484- recovery will begin for all 150 PGs at the same time.
503+ This is one of the subtle reasons why replicated pools with ``size=2 `` and
504+ EC pools with ``m=1 `` are risky and generally not recommended.
505+
506+ In a cluster containing 10 OSDs and 512 PGs in a three-replica pool, CRUSH
507+ will place each PG on three OSDs. Ultimately, each OSD hosts :math: `\frac {(512 *
508+ 3 )}{10 } = ~150 ` PGs. So when one OSD fails in the above scenario,
509+ recovery will be triggered for all ~150 PGs that were placed on that OSD.
485510
486- The 150 PGs that are being recovered are likely to be homogeneously distributed
511+ The 150 PGs to be recovered are likely to be evenly distributed
487512across the 9 remaining OSDs. Each remaining OSD is therefore likely to send
488513copies of objects to all other OSDs and also likely to receive some new objects
489514to be stored because it has become part of a new PG.
@@ -506,7 +531,8 @@ participates in only ~75 PGs rather than ~150 PGs. All 19 remaining OSDs will
506531still be required to replicate the same number of objects in order to recover.
507532But instead of there being only 10 OSDs that have to copy ~100 GB each, there
508533are now 20 OSDs that have to copy only 50 GB each. If the network had
509- previously been a bottleneck, recovery now happens twice as fast.
534+ previously been a bottleneck, recovery now happens twice as fast: the work is
535+ spread across more OSDs, each subject to the same per-OSD limit on parallel recovery operations.
510536
511537Similarly, suppose that our cluster grows to 40 OSDs. Each OSD will host only
512538~38 PGs. And if an OSD dies, recovery will take place faster than before unless
@@ -582,11 +608,11 @@ Memory, CPU and network usage
582608-----------------------------
583609
584610Every PG in the cluster imposes memory, network, and CPU demands upon OSDs and
585- MONs . These needs must be met at all times and are increased during recovery.
586- Indeed, one of the main reasons PGs were developed was to share this overhead
587- by clustering objects together .
611+ Monitors . These needs must be met at all times and are increased during recovery.
612+ Indeed, one of the main reasons PGs were developed was to decrease this overhead
587613+ by aggregating RADOS objects into sets of a manageable size.
588614
589- For this reason, minimizing the number of PGs saves significant resources.
615+ For this reason, limiting the number of PGs saves significant resources.
590616
591617.. _choosing-number-of-placement-groups :
592618
@@ -598,7 +624,7 @@ Choosing the Number of PGs
598624 with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. For
599625 more information, see :ref:`pg-autoscaler`.
600626
601- If you have more than 50 OSDs, we recommend approximately 50- 100 PGs per OSD in
627+ If you have more than 50 OSDs, we recommend approximately 100-250 PG replicas per OSD in
602628order to balance resource usage, data durability, and data distribution. If you
603629have fewer than 50 OSDs, follow the guidance in the `preselection `_ section.
604630For a single pool, use the following formula to get a baseline value:
@@ -648,40 +674,40 @@ Setting the Number of PGs
648674
649675:ref: `Placement Group Link <pgcalc >`
650676
651- Setting the initial number of PGs in a pool must be done at the time you create
652- the pool. See `Create a Pool `_ for details.
677+ Setting the initial number of PGs in a pool is done implicitly or explicitly
678+ at the time a pool is created . See `Create a Pool `_ for details.
653679
654- However, even after a pool is created, if the ``pg_autoscaler `` is not being
680+ However, after a pool is created, if the ``pg_autoscaler `` is not being
655681used to manage ``pg_num `` values, you can change the number of PGs by running a
656682command of the following form:
657683
658684.. prompt :: bash #
659685
660686 ceph osd pool set {pool-name} pg_num {pg_num}
661687
662- Since the Nautilus release, Ceph automatically steps ``pgp_num `` for a pool
688+ Since the Nautilus release, Ceph automatically and incrementally steps ``pgp_num `` for a pool
663689whenever ``pg_num `` is changed, either by the PG autoscaler or manually. Admins
664690generally do not need to touch ``pgp_num `` directly, but can monitor progress
665691with ``watch ceph osd pool ls detail ``. When ``pg_num `` is changed, the value
666692of ``pgp_num `` is stepped slowly so that the cost of splitting or merging PGs
667693is amortized over time to minimize performance impact.
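
For example, the gradual stepping of ``pgp_num`` after a ``pg_num`` change can
be observed by running the following command:

.. prompt:: bash #

   watch ceph osd pool ls detail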
668694
669- Increasing ``pg_num `` splits the PGs in your cluster , but data will not be
670- migrated to the newer PGs until ``pgp_num `` is increased.
695+ Increasing ``pg_num `` for a pool splits some PGs in that pool , but data will not be
696+ migrated to the new PGs via backfill operations until the pool's ``pgp_num `` is increased.
671697
672- It is possible to manually set the ``pgp_num `` parameter. The ``pgp_num ``
698+ It is possible but rarely appropriate to manually set the ``pgp_num `` parameter. The ``pgp_num ``
673699parameter should be equal to the ``pg_num `` parameter. To increase the number
674700of PGs for placement, run a command of the following form:
675701
676702.. prompt :: bash #
677703
678704 ceph osd pool set {pool-name} pgp_num {pgp_num}
679705
680- If you decrease or increase the number of PGs , then ``pgp_num `` is adjusted
681- automatically. In releases of Ceph that are Nautilus and later (inclusive) ,
706+ If you decrease or increase `` pg_num `` for a pool , then ``pgp_num `` is adjusted
707+ automatically. In releases of Ceph beginning with Nautilus,
682708when the ``pg_autoscaler `` is not used, ``pgp_num `` is automatically stepped to
683709match ``pg_num ``. This process manifests as periods of remapping of PGs and of
684- backfill, and is expected behavior and normal .
710+ backfill, which is expected behavior.
685711
686712.. _rados_ops_pgs_get_pg_num :
687713
0 commit comments