
Commit 505a151

Merge pull request ceph#63645 from zdover23/wip-doc-2025-06-03-backport-63618-to-tentacle
tentacle: doc/rados/operations: Improve placement-groups.rst
Reviewed-by: Anthony D'Atri <[email protected]>
2 parents 4769930 + f7509be commit 505a151

1 file changed

doc/rados/operations/placement-groups.rst

Lines changed: 97 additions & 71 deletions
@@ -27,15 +27,15 @@ Autoscaling placement groups
 Placement groups (PGs) are an internal implementation detail of how Ceph
 distributes data. Autoscaling provides a way to manage PGs, and especially to
 manage the number of PGs present in different pools. When *pg-autoscaling* is
-enabled, the cluster is allowed to make recommendations or automatic
+enabled, the cluster makes recommendations or automatic
 adjustments with respect to the number of PGs for each pool (``pgp_num``) in
-accordance with expected cluster utilization and expected pool utilization.
+accordance with observed and expected pool utilization.

 Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``,
 ``on``, or ``warn``:

 * ``off``: Disable autoscaling for this pool. It is up to the administrator to
-choose an appropriate ``pgp_num`` for each pool. For more information, see
+choose an appropriate ``pg_num`` for each pool. For more information, see
 :ref:`choosing-number-of-placement-groups`.
 * ``on``: Enable automated adjustments of the PG count for the given pool.
 * ``warn``: Raise health checks when the PG count is in need of adjustment.
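The per-pool values that the changed text below describes (**SIZE**, **RATE**, **RATIO**, **BIAS**, **PG_NUM**, and so on) are reported by the autoscaler status command. As a minimal, hedged illustration (assuming the ``pg_autoscaler`` manager module is enabled on the cluster):

.. prompt:: bash #

   ceph osd pool autoscale-status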
@@ -53,7 +53,8 @@ For example, to enable autoscaling on pool ``foo``, run the following command:

 ceph osd pool set foo pg_autoscale_mode on

-There is also a ``pg_autoscale_mode`` setting for any pools that are created
+There is also a central config ``pg_autoscale_mode`` option that controls the
+autoscale mode for pools that are created
 after the initial setup of the cluster. To change this setting, run a command
 of the following form:

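As a hedged aside, the per-cluster default referred to above is normally adjusted through the central config database; the option name ``osd_pool_default_pg_autoscale_mode`` used here is an assumption to be checked against your release, not a value taken from this change:

.. prompt:: bash #

   # assumed option name; verify with `ceph config help osd_pool_default_pg_autoscale_mode`
   ceph config set global osd_pool_default_pg_autoscale_mode warn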
@@ -75,7 +76,7 @@ To set the ``noautoscale`` flag to ``off``, run the following command:

 ceph osd pool unset noautoscale

-To get the value of the flag, run the following command:
+To get the current value of the flag, run the following command:

 .. prompt:: bash #

@@ -104,53 +105,57 @@ The output will resemble the following::

 - **TARGET SIZE** (if present) is the amount of data that is expected to be
 stored in the pool, as specified by the administrator. The system uses the
-greater of the two values for its calculation.
+greater of **SIZE** and **TARGET SIZE** for its calculations.

-- **RATE** is the multiplier for the pool that determines how much raw storage
-capacity is consumed. For example, a three-replica pool will have a ratio of
-3.0, and a ``k=4 m=2`` erasure-coded pool will have a ratio of 1.5.
+- **RATE** is the space amplification factor for the pool that indicates how much raw storage
+capacity is consumed for a given amount of user data. For example, a three-replica pool
+will show a value of 3.0, and a ``k=4 m=2`` erasure-coded pool will have a value of 1.5.

 - **RAW CAPACITY** is the total amount of raw storage capacity on the specific
-OSDs that are responsible for storing the data of the pool (and perhaps the
-data of other pools).
+OSDs available to the pool. Note that in many cases this capacity is shared
+among multiple pools.

 - **RATIO** is the ratio of (1) the storage consumed by the pool to (2) the
 total raw storage capacity. In other words, RATIO is defined as
-(SIZE * RATE) / RAW CAPACITY.
+(SIZE * RATE) / RAW CAPACITY and may be thought of as a fullness percentage.

 - **TARGET RATIO** (if present) is the ratio of the expected storage of this
-pool (that is, the amount of storage that this pool is expected to consume,
-as specified by the administrator) to the expected storage of all other pools
+pool relative to the expected storage of all other pools
 that have target ratios set. If both ``target_size_bytes`` and
 ``target_size_ratio`` are specified, then ``target_size_ratio`` takes
-precedence.
+precedence. Note that when the BIAS value is other than 1, notably for
+CephFS metadata and RGW index pools, the target ratio is best left alone,
+as adjusting both can result in inappropriate ``pg_num`` values via double-dipping.

 - **EFFECTIVE RATIO** is the result of making two adjustments to the target
 ratio:

 #. Subtracting any capacity expected to be used by pools that have target
 size set.

-#. Normalizing the target ratios among pools that have target ratio set so
+#. Normalizing the target ratios among pools that have a target ratio set so
 that collectively they target cluster capacity. For example, four pools
-with target_ratio 1.0 would have an effective ratio of 0.25.
+with target_ratio 1.0 would each have an effective ratio of 0.25.

 The system's calculations use whichever of these two ratios (that is, the
 target ratio and the effective ratio) is greater.

 - **BIAS** is used as a multiplier to manually adjust a pool's PG in accordance
 with prior information about how many PGs a specific pool is expected to
-have.
+have. This is important for pools that primarily store data in omaps vs
+RADOS objects, notably RGW index and CephFS / RBD EC metadata pools. When
+a bias other than 1.0 is set for a pool, it is advised to not set
+a target ratio.

 - **PG_NUM** is either the current number of PGs associated with the pool or,
-if a ``pg_num`` change is in progress, the current number of PGs that the
-pool is working towards.
+if a ``pg_num`` change is in progress, the target value.

 - **NEW PG_NUM** (if present) is the value that the system recommends that the
 ``pg_num`` of the pool should be. It is always a power of two, and it
 is present only if the recommended value varies from the current value by
-more than the default factor of ``3``. To adjust this multiple (in the
-following example, it is changed to ``2``), run the following command:
+more than the default factor of ``3``.
+To adjust this multiple (in the following example, it is changed
+to ``2``), run a command of the following form:

 .. prompt:: bash #

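To make the **RATIO** definition above concrete, here is a small worked example with invented numbers (not taken from this change): a pool holding 10 TiB of user data (SIZE) in a three-replica pool (RATE 3.0) whose OSDs provide 100 TiB of RAW CAPACITY yields::

   RATIO = (SIZE * RATE) / RAW CAPACITY
         = (10 TiB * 3.0) / 100 TiB
         = 0.30   (the pool accounts for roughly 30% of its subtree's raw space)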
@@ -161,7 +166,7 @@ The output will resemble the following::

 - **BULK** determines whether the pool is ``bulk``. It has a value of ``True``
 or ``False``. A ``bulk`` pool is expected to be large and should initially
-have a large number of PGs so that performance does not suffer]. On the other
+have a large number of PGs so that performance does not suffer. On the other
 hand, a pool that is not ``bulk`` is expected to be small (for example, a
 ``.mgr`` pool or a meta pool).

@@ -185,16 +190,16 @@ The output will resemble the following::
 ceph osd pool set .mgr crush_rule replicated-ssd

 This intervention will result in a small amount of backfill, but
-typically this traffic completes quickly.
+typically this is not disruptive and completes quickly.


 Automated scaling
 -----------------

 In the simplest approach to automated scaling, the cluster is allowed to
-automatically scale ``pgp_num`` in accordance with usage. Ceph considers the
-total available storage and the target number of PGs for the whole system,
-considers how much data is stored in each pool, and apportions PGs accordingly.
+automatically scale each pool's ``pg_num`` in accordance with usage. Ceph considers the
+total available storage, the target number of PG replicas for each OSD,
+and how much data is stored in each pool, then apportions PGs accordingly.
 The system is conservative with its approach, making changes to a pool only
 when the current number of PGs (``pg_num``) varies by more than a factor of 3
 from the recommended number.
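As a rough, hedged sketch of the apportioning described above (invented numbers; the real autoscaler also applies bias, target ratios, and per-subtree accounting)::

   PG replica budget = 10 OSDs * 100 target replicas per OSD = ~1000
   pool's share      = ~100% of the stored data
   suggested pg_num  = 1000 / 3 (replica count) ≈ 333 -> a nearby power of two, e.g. 256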
@@ -207,12 +212,16 @@ command:

 ceph config set global mon_target_pg_per_osd 100

+For all but the very smallest deployments, a value of 200 is recommended.
+A value above 500 may result in excessive peering traffic and RAM usage.
+
 The autoscaler analyzes pools and adjusts on a per-subtree basis. Because each
 pool might map to a different CRUSH rule, and each rule might distribute data
-across different devices, Ceph will consider the utilization of each subtree of
-the hierarchy independently. For example, a pool that maps to OSDs of class
-``ssd`` and a pool that maps to OSDs of class ``hdd`` will each have optimal PG
-counts that are determined by how many of these two different device types
+across different and possibly overlapping sets of devices,
+Ceph will consider the utilization of each subtree of
+the CRUSH hierarchy independently. For example, a pool that maps to OSDs of class
+``ssd`` and a pool that maps to OSDs of class ``hdd`` will each have calculated PG
+counts that are determined by how many OSDs of these two different device types
 there are.

 If a pool uses OSDs under two or more CRUSH roots (for example, shadow trees
@@ -221,8 +230,8 @@ user in the manager log. The warning states the name of the pool and the set of
 roots that overlap each other. The autoscaler does not scale any pools with
 overlapping roots because this condition can cause problems with the scaling
 process. We recommend constraining each pool so that it belongs to only one
-root (that is, one OSD class) to silence the warning and ensure a successful
-scaling process.
+root (that is, one OSD device class) to silence the warning and ensure successful
+scaling.

 .. _managing_bulk_flagged_pools:

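One common way to constrain a pool to a single device class, as recommended above, is to assign it a CRUSH rule that selects only that class. A hedged sketch; the rule name ``replicated-ssd`` (matching the example earlier in this file) and the pool name ``foo`` are illustrative:

.. prompt:: bash #

   ceph osd crush rule create-replicated replicated-ssd default host ssd
   ceph osd pool set foo crush_rule replicated-ssd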
@@ -233,7 +242,8 @@ If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
 complement of PGs and then scales down the number of PGs only if the usage
 ratio across the pool is uneven. However, if a pool is not flagged ``bulk``,
 then the autoscaler starts the pool with minimal PGs and creates additional PGs
-only if there is more usage in the pool.
+only if there is more usage in the pool. This flag should be used with care,
+as it may not have the results one expects.

 To create a pool that will be flagged ``bulk``, run the following command:

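Independently of pool creation, the ``bulk`` flag of an existing pool can be inspected or changed; a minimal sketch, with ``foo`` standing in for a real pool name:

.. prompt:: bash #

   ceph osd pool get foo bulk
   ceph osd pool set foo bulk true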
@@ -265,7 +275,8 @@ a small number of PGs. However, in some cases, cluster administrators know
 which pools are likely to consume most of the system capacity in the long run.
 When Ceph is provided with this information, a more appropriate number of PGs
 can be used from the beginning, obviating subsequent changes in ``pg_num`` and
-the associated overhead cost of relocating data.
+the associated overhead cost of relocating data. This also helps with performance
+and data uniformity by ensuring that PGs are placed on all available OSDs.

 The *target size* of a pool can be specified in two ways: either in relation to
 the absolute size (in bytes) of the pool, or as a weight relative to all other
@@ -305,17 +316,22 @@ pool, then the latter will be ignored, the former will be used in system
 calculations, and a health check (``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``)
 will be raised.

+Note that in most cases it is advised to not set both a bias value other than 1.0
+and a target ratio on the same pool. Use a higher bias value for metadata /
+omap-rich pools and a target ratio for RADOS data-heavy pools.
+
+
 Specifying bounds on a pool's PGs
 ---------------------------------

 It is possible to specify both the minimum number and the maximum number of PGs
 for a pool.

-Setting a Minimum Number of PGs and a Maximum Number of PGs
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Setting a Minimum Number of PGs or a Maximum Number of PGs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-If a minimum is set, then Ceph will not itself reduce (nor recommend that you
-reduce) the number of PGs to a value below the configured value. Setting a
+If a minimum is set on a pool, then Ceph will not itself reduce (nor recommend that you
+reduce) the ``pg_num`` for that pool to a value below the configured value. Setting a
 minimum serves to establish a lower bound on the amount of parallelism enjoyed
 by a client during I/O, even if a pool is mostly empty.

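For illustration, the lower and upper bounds discussed above correspond to the ``pg_num_min`` and ``pg_num_max`` pool properties; a hedged sketch using the same placeholder style as the rest of this file:

.. prompt:: bash #

   ceph osd pool set {pool-name} pg_num_min 32
   ceph osd pool set {pool-name} pg_num_max 256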
@@ -365,16 +381,18 @@ running a command of the following form:

 ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

-Without the balancer, the suggested target is approximately 100 PG replicas on
-each OSD. With the balancer, an initial target of 50 PG replicas on each OSD is
-reasonable.
+Without the balancer, the suggested (but not default) target for most clusters is
+200 PG replicas on each OSD. With the balancer on and default values, an initial
+result of roughly 50-70 PG replicas on each OSD is expected. This is the value
+reported under the ``PGS`` column in the output of ``ceph osd df`` and is notably
+not the cluster's total number of PGs divided by the number of OSDs.

 The autoscaler attempts to satisfy the following conditions:

-- the number of PGs per OSD should be proportional to the amount of data in the
-pool
-- there should be 50-100 PGs per pool, taking into account the replication
-overhead or erasure-coding fan-out of each PG's replicas across OSDs
+- The number of PG replicas per OSD should be proportional to the amount of data in the
+pool.
+- There should by default be 50-100 PGs per pool, taking into account the replication
+overhead or erasure-coding fan-out of each PG's replicas across OSDs.

 Use of Placement Groups
 =======================
@@ -447,10 +465,14 @@ from the old ones.
 Factors Relevant To Specifying pg_num
 =====================================

-On the one hand, the criteria of data durability and even distribution across
-OSDs weigh in favor of a high number of PGs. On the other hand, the criteria of
-saving CPU resources and minimizing memory usage weigh in favor of a low number
-of PGs.
+Performance and even data distribution across
+OSDs weigh in favor of a higher number of PGs. Conserving CPU resources and
+minimizing memory usage weigh in favor of a lower number of PGs.
+The latter was more of a concern before Filestore OSDs were deprecated, so
+most modern clusters with BlueStore OSDs can favor the former by
+configuring a value of 200-250 for ``mon_target_pg_per_osd`` and
+500 for ``mon_max_pg_per_osd``. Note that the latter is only a failsafe
+and does not itself influence ``pg_num`` calculations.

 .. _data durability:

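Applying the values suggested above comes down to two central config settings; the specific numbers should be chosen for your hardware, but a sketch matching the recommendation reads:

.. prompt:: bash #

   ceph config set global mon_target_pg_per_osd 250
   ceph config set global mon_max_pg_per_osd 500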
@@ -478,12 +500,15 @@ let's imagine a scenario that results in permanent data loss in a single PG:
 OSD happened to contain the only remaining copy of an object, the object is
 permanently lost.

-In a cluster containing 10 OSDs with 512 PGs in a three-replica pool, CRUSH
-will give each PG three OSDs. Ultimately, each OSD hosts :math:`\frac{(512 *
-3)}{10} = ~150` PGs. So when the first OSD fails in the above scenario,
-recovery will begin for all 150 PGs at the same time.
+This is one of the subtle reasons why replicated pools with ``size=2`` and
+EC pools with ``m=1`` are risky and generally not recommended.
+
+In a cluster containing 10 OSDs and 512 PGs in a three-replica pool, CRUSH
+will place each PG on three OSDs. Ultimately, each OSD hosts :math:`\frac{(512 *
+3)}{10} = ~150` PGs. So when one OSD fails in the above scenario,
+recovery will be triggered for all ~150 PGs that were placed on that OSD.

-The 150 PGs that are being recovered are likely to be homogeneously distributed
+The 150 PGs to be recovered are likely to be evenly distributed
 across the 9 remaining OSDs. Each remaining OSD is therefore likely to send
 copies of objects to all other OSDs and also likely to receive some new objects
 to be stored because it has become part of a new PG.
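The arithmetic above generalizes: ignoring CRUSH imbalance, each OSD hosts roughly ``pg_num * replica count / number of OSDs`` PG replicas, which is why adding OSDs spreads recovery work more thinly::

   10 OSDs:  (512 * 3) / 10 ≈ 150 PG replicas per OSD
   20 OSDs:  (512 * 3) / 20 ≈  75 PG replicas per OSD
   40 OSDs:  (512 * 3) / 40 ≈  38 PG replicas per OSD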
@@ -506,7 +531,8 @@ participates in only ~75 PGs rather than ~150 PGs. All 19 remaining OSDs will
 still be required to replicate the same number of objects in order to recover.
 But instead of there being only 10 OSDs that have to copy ~100 GB each, there
 are now 20 OSDs that have to copy only 50 GB each. If the network had
-previously been a bottleneck, recovery now happens twice as fast.
+previously been a bottleneck, recovery now happens twice as fast: the per-OSD
+limits on parallel recovery operations are unchanged, but twice as many OSDs work in parallel.

 Similarly, suppose that our cluster grows to 40 OSDs. Each OSD will host only
 ~38 PGs. And if an OSD dies, recovery will take place faster than before unless
@@ -582,11 +608,11 @@ Memory, CPU and network usage
 -----------------------------

 Every PG in the cluster imposes memory, network, and CPU demands upon OSDs and
-MONs. These needs must be met at all times and are increased during recovery.
-Indeed, one of the main reasons PGs were developed was to share this overhead
-by clustering objects together.
+Monitors. These needs must be met at all times and are increased during recovery.
+Indeed, one of the main reasons PGs were developed was to decrease this overhead
+by aggregating RADOS objects into sets of a manageable size.

-For this reason, minimizing the number of PGs saves significant resources.
+For this reason, limiting the number of PGs saves significant resources.

 .. _choosing-number-of-placement-groups:

@@ -598,7 +624,7 @@ Choosing the Number of PGs
 with the ``target_size_bytes`` or ``target_size_ratio`` pool properties. For
 more information, see :ref:`pg-autoscaler`.

-If you have more than 50 OSDs, we recommend approximately 50-100 PGs per OSD in
+If you have more than 50 OSDs, we recommend approximately 100-250 PG replicas per OSD in
 order to balance resource usage, data durability, and data distribution. If you
 have fewer than 50 OSDs, follow the guidance in the `preselection`_ section.
 For a single pool, use the following formula to get a baseline value:
@@ -648,40 +674,40 @@ Setting the Number of PGs

 :ref:`Placement Group Link <pgcalc>`

-Setting the initial number of PGs in a pool must be done at the time you create
-the pool. See `Create a Pool`_ for details.
+Setting the initial number of PGs in a pool is done implicitly or explicitly
+at the time a pool is created. See `Create a Pool`_ for details.

-However, even after a pool is created, if the ``pg_autoscaler`` is not being
+However, after a pool is created, if the ``pg_autoscaler`` is not being
 used to manage ``pg_num`` values, you can change the number of PGs by running a
 command of the following form:

 .. prompt:: bash #

 ceph osd pool set {pool-name} pg_num {pg_num}

-Since the Nautilus release, Ceph automatically steps ``pgp_num`` for a pool
+Since the Nautilus release, Ceph automatically and incrementally steps ``pgp_num`` for a pool
 whenever ``pg_num`` is changed, either by the PG autoscaler or manually. Admins
 generally do not need to touch ``pgp_num`` directly, but can monitor progress
 with ``watch ceph osd pool ls detail``. When ``pg_num`` is changed, the value
 of ``pgp_num`` is stepped slowly so that the cost of splitting or merging PGs
 is amortized over time to minimize performance impact.

-Increasing ``pg_num`` splits the PGs in your cluster, but data will not be
-migrated to the newer PGs until ``pgp_num`` is increased.
+Increasing ``pg_num`` for a pool splits some PGs in that pool, but data will not be
+migrated to the new PGs via backfill operations until the pool's ``pgp_num`` is increased.

-It is possible to manually set the ``pgp_num`` parameter. The ``pgp_num``
+It is possible but rarely appropriate to manually set the ``pgp_num`` parameter. The ``pgp_num``
 parameter should be equal to the ``pg_num`` parameter. To increase the number
 of PGs for placement, run a command of the following form:

 .. prompt:: bash #

 ceph osd pool set {pool-name} pgp_num {pgp_num}

-If you decrease or increase the number of PGs, then ``pgp_num`` is adjusted
-automatically. In releases of Ceph that are Nautilus and later (inclusive),
+If you decrease or increase ``pg_num`` for a pool, then ``pgp_num`` is adjusted
+automatically. In releases of Ceph beginning with Nautilus,
 when the ``pg_autoscaler`` is not used, ``pgp_num`` is automatically stepped to
 match ``pg_num``. This process manifests as periods of remapping of PGs and of
-backfill, and is expected behavior and normal.
+backfill, which is expected behavior.

 .. _rados_ops_pgs_get_pg_num:

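While a ``pg_num`` change is being applied, the stepping of ``pgp_num`` described above can be observed with the command already cited in the hunk, plus the per-pool getters; ``{pool-name}`` is a placeholder:

.. prompt:: bash #

   watch ceph osd pool ls detail
   ceph osd pool get {pool-name} pg_num
   ceph osd pool get {pool-name} pgp_num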