Commit 0d81e72

common/options: Change HDD OSD shard configuration defaults for mClock
Based on tests performed at scale on a HDD based cluster, it was found
that scheduling with mClock was not optimal with multiple OSD shards. For
example, in the scaled cluster with multiple OSD node failures, the client
throughput was found to be inconsistent across test runs coupled with
multiple reported slow requests. However, the same test with a single OSD
shard and with multiple worker threads yielded significantly better results
in terms of consistency of client and recovery throughput across multiple
test runs. For more details see https://tracker.ceph.com/issues/66289.

Therefore, as an interim measure until the issue with multiple OSD shards
(or multiple mClock queues per OSD) is investigated and fixed, the following
change to the default HDD OSD shard configuration is made:

- osd_op_num_shards_hdd = 1 (was 5)
- osd_op_num_threads_per_shard_hdd = 5 (was 1)

The other changes in this commit include:

- Doc change to the OSD and mClock config reference describing this change.
- OSD troubleshooting entry on the procedure to change the shard
  configuration for clusters affected by this issue running on older releases.
- Add release note for this change.

Fixes: https://tracker.ceph.com/issues/66289
Signed-off-by: Sridhar Seshasayee <[email protected]>

# Conflicts:
#   doc/rados/troubleshooting/troubleshooting-osd.rst
1 parent: b0d8273
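Not part of the commit itself, but useful when reading it: the shard values an OSD is actually running with can be checked over the admin socket. A minimal sketch, assuming an example daemon osd.0 and that the commands are run on that OSD's host:

    # Values currently in effect on a running OSD (admin socket query)
    ceph daemon osd.0 config get osd_op_num_shards_hdd
    ceph daemon osd.0 config get osd_op_num_threads_per_shard_hdd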

5 files changed: +127, -2 lines


PendingReleaseNotes

Lines changed: 13 additions & 0 deletions
@@ -12,6 +12,19 @@
   of the column showing the state of a group snapshot in the unformatted CLI
   output is changed from 'STATUS' to 'STATE'. The state of a group snapshot
   that was shown as 'ok' is now shown as 'complete', which is more descriptive.
+* Based on tests performed at scale on a HDD based Ceph cluster, it was found
+  that scheduling with mClock was not optimal with multiple OSD shards. For
+  example, in the test cluster with multiple OSD node failures, the client
+  throughput was found to be inconsistent across test runs coupled with multiple
+  reported slow requests. However, the same test with a single OSD shard and
+  with multiple worker threads yielded significantly better results in terms of
+  consistency of client and recovery throughput across multiple test runs.
+  Therefore, as an interim measure until the issue with multiple OSD shards
+  (or multiple mClock queues per OSD) is investigated and fixed, the following
+  change to the default HDD OSD shard configuration is made:
+  - osd_op_num_shards_hdd = 1 (was 5)
+  - osd_op_num_threads_per_shard_hdd = 5 (was 1)
+  For more details see https://tracker.ceph.com/issues/66289.

 >=19.0.0
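As a side note for readers of this release note (not something this diff adds): the default and flags that an installed release ships for these options can be inspected with the standard ceph config help command, for example:

    ceph config help osd_op_num_shards_hdd
    ceph config help osd_op_num_threads_per_shard_hdd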

doc/rados/configuration/mclock-config-ref.rst

Lines changed: 54 additions & 0 deletions
@@ -164,6 +164,60 @@ parameters. This profile should be used with caution and is meant for advanced
 users, who understand mclock and Ceph related configuration options.


+.. index:: mclock; shard config for HDD clusters
+
+.. _mclock-hdd-cfg:
+
+OSD Shard Configuration For HDD Based Clusters With mClock
+==========================================================
+Each OSD is configured with one or more shards to perform tasks. Each shard
+comprises a unique queue to handle various types of OSD specific operations
+like client I/O, recovery, scrub and so on. The scheduling of these operations
+in the queue is performed by a scheduler - in this case the mClock scheduler.
+
+For HDD based OSDs, the number of shards is controlled by the configuration
+option :confval:`osd_op_num_shards_hdd`. Items are queued and dequeued by one
+or more worker threads, controlled by the configuration option
+:confval:`osd_op_num_threads_per_shard_hdd`.
+
+As described in :ref:`dmclock-qos-caveats`, the number of OSD shards employed
+determines the impact of the mClock queue. In general, a lower number of shards
+increases the impact of the mClock queues with respect to scheduling accuracy,
+provided there are enough worker threads per shard
+to help process the items in the mClock queue.
+
+Based on tests performed at scale with small objects in the range
+[1 KiB - 256 KiB] on a HDD based cluster (192 OSDs, 8 nodes,
+150 million objects), it was found that scheduling with mClock was not optimal
+with multiple OSD shards. For example, in this cluster with multiple OSD node
+failures, the client throughput was found to be inconsistent across test runs
+coupled with multiple reported slow requests. For more details
+see https://tracker.ceph.com/issues/66289. With multiple shards, the situation
+was exacerbated when the MAX limit was allocated to both the client and
+background recovery classes of operations. During the OSD failure phase, since
+both client and recovery ops were in direct competition to utilize the full
+bandwidth of the OSDs, there was no predictability with respect to the
+throughput of either class of service.
+
+However, the same test with a single OSD shard and with multiple worker threads
+yielded significantly better results in terms of consistency of client and
+recovery throughput across multiple test runs. Please refer to the tracker
+above for more details. For sanity, the same test executed using this shard
+configuration with large objects in the range [1 MiB - 256 MiB] yielded similar
+results.
+
+Therefore, as an interim measure until the issue with multiple OSD shards
+(or multiple mClock queues per OSD) is investigated and fixed, the following
+change to the default HDD OSD shard configuration is made:
+
++----------------------------------------------+-------------+-------------+
+| Config Option                                | Old Default | New Default |
++==============================================+=============+=============+
+| :confval:`osd_op_num_shards_hdd`             | 5           | 1           |
++----------------------------------------------+-------------+-------------+
+| :confval:`osd_op_num_threads_per_shard_hdd`  | 1           | 5           |
++----------------------------------------------+-------------+-------------+
+
 .. index:: mclock; built-in profiles

 mClock Built-in Profiles - Locked Config Options
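One point implied by the table above is worth spelling out: the new defaults keep the total number of HDD shard worker threads per OSD unchanged and only reduce the number of shards (and hence mClock queues). A worked example using the values from the table (no extra configuration implied):

    # total HDD shard worker threads per OSD
    #   = osd_op_num_shards_hdd x osd_op_num_threads_per_shard_hdd
    # old defaults: 5 shards x 1 thread/shard  = 5 threads, 5 mClock queues
    # new defaults: 1 shard  x 5 threads/shard = 5 threads, 1 mClock queue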

doc/rados/configuration/osd-config-ref.rst

Lines changed: 11 additions & 0 deletions
@@ -189,6 +189,9 @@ Operations
 .. confval:: osd_op_num_shards
 .. confval:: osd_op_num_shards_hdd
 .. confval:: osd_op_num_shards_ssd
+.. confval:: osd_op_num_threads_per_shard
+.. confval:: osd_op_num_threads_per_shard_hdd
+.. confval:: osd_op_num_threads_per_shard_ssd
 .. confval:: osd_op_queue
 .. confval:: osd_op_queue_cut_off
 .. confval:: osd_client_op_priority
@@ -292,6 +295,9 @@ of the current time. The ultimate lesson is that values for weight
 should not be too large. They should be under the number of requests
 one expects to be serviced each second.

+
+.. _dmclock-qos-caveats:
+
 Caveats
 ```````

@@ -303,6 +309,11 @@ number of shards can be controlled with the configuration options
 :confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and
 :confval:`osd_op_num_shards_ssd`. A lower number of shards will increase the
 impact of the mClock queues, but may have other deleterious effects.
+This is especially the case if there are insufficient shard worker
+threads. The number of shard worker threads can be controlled with the
+configuration options :confval:`osd_op_num_threads_per_shard`,
+:confval:`osd_op_num_threads_per_shard_hdd` and
+:confval:`osd_op_num_threads_per_shard_ssd`.

 Second, requests are transferred from the operation queue to the
 operation sequencer, in which they go through the phases of
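Per the option descriptions added in this commit, the generic osd_op_num_threads_per_shard overrides the _hdd and _ssd variants when non-zero, and these shard options take effect only at OSD startup (the troubleshooting entry added in this commit also notes the restart requirement). A hedged sketch of setting and then dropping such an override; the value 4 is purely illustrative:

    ceph config set osd osd_op_num_threads_per_shard 4   # overrides both the _hdd and _ssd variants
    ceph config rm osd osd_op_num_threads_per_shard      # drop the override; media-specific values apply again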

doc/rados/troubleshooting/troubleshooting-osd.rst

Lines changed: 40 additions & 0 deletions
@@ -618,6 +618,7 @@ Possible causes include:
 - A bug in the kernel file system (check ``dmesg`` output)
 - An overloaded cluster (check system load, iostat, etc.)
 - A bug in the ``ceph-osd`` daemon.
+- Suboptimal OSD shard configuration (on HDD based cluster with mClock scheduler)

 Possible solutions:

@@ -626,6 +627,8 @@ Possible solutions:
 - Upgrade Ceph
 - Restart OSDs
 - Replace failed or failing components
+- Override OSD shard configuration (on HDD based cluster with mClock scheduler)
+- See :ref:`mclock-tblshoot-hdd-shard-config` for resolution

 Debugging Slow Requests
 -----------------------
@@ -680,6 +683,43 @@ Although some of these events may appear redundant, they cross important
 boundaries in the internal code (such as passing data across locks into new
 threads).

+.. _mclock-tblshoot-hdd-shard-config:
+
+Slow Requests or Slow Recovery With mClock Scheduler
+----------------------------------------------------
+
+.. note:: This troubleshooting is applicable only for HDD based clusters running
+   the mClock scheduler and with the following OSD shard configuration:
+   ``osd_op_num_shards_hdd`` = 5 and ``osd_op_num_threads_per_shard_hdd`` = 1.
+   Also, see :ref:`mclock-hdd-cfg` for details on the reason for the change
+   made to the default OSD HDD shard configuration for mClock.
+
+On scaled HDD based clusters with the mClock scheduler enabled and under a
+multiple OSD node failure condition, the following could be reported or observed:
+
+- slow requests: This also manifests as degraded client I/O performance.
+- slow background recoveries: Lower than expected recovery throughput.
+
+**Troubleshooting Steps:**
+
+#. Verify from OSD events that the slow requests are predominantly of type
+   ``queued_for_pg``.
+#. Verify whether the reported recovery rate is significantly lower than the
+   expected rate, considering the QoS allocations for the background recovery
+   service.
+
+If either of the above is true, then the following resolution may be applied.
+Note that this is disruptive as it involves OSD restarts. Run the following
+commands to change the default OSD shard configuration for HDDs:
+
+.. prompt:: bash
+
+   ceph config set osd osd_op_num_shards_hdd 1
+   ceph config set osd osd_op_num_threads_per_shard_hdd 5
+
+The above configuration will not take effect immediately and requires a restart
+of the OSDs in the environment. For the process to be least disruptive, the OSDs
+may be restarted in a carefully staggered manner.

 .. _rados_tshooting_flapping_osd:

 Flapping OSDs
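The troubleshooting entry above leaves the restart strategy open. One possible sketch of a staggered restart, assuming a cephadm-managed cluster (ceph orch); clusters managed differently would use systemctl restart ceph-osd@<id> on each host instead, and the health gate should be adapted to the situation (for example, a cluster mid-recovery may not return to HEALTH_OK between restarts):

    for id in $(ceph osd ls); do
        # Proceed only when stopping this OSD will not make any PG unavailable.
        until ceph osd ok-to-stop "$id"; do sleep 60; done
        ceph orch daemon restart "osd.$id"
        # Wait for the cluster to settle before restarting the next OSD.
        until ceph health | grep -q HEALTH_OK; do sleep 60; done
    done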

src/common/options/osd.yaml.in

Lines changed: 9 additions & 2 deletions
@@ -834,14 +834,19 @@ options:
 - name: osd_op_num_threads_per_shard
   type: int
   level: advanced
+  fmt_desc: The number of worker threads spawned per OSD shard for a given OSD.
+    Each worker thread when operational processes items in the shard queue.
+    This setting overrides _ssd and _hdd if non-zero.
   default: 0
   flags:
   - startup
   with_legacy: true
 - name: osd_op_num_threads_per_shard_hdd
   type: int
   level: advanced
-  default: 1
+  fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+    (for rotational media).
+  default: 5
   see_also:
   - osd_op_num_threads_per_shard
   flags:
@@ -850,6 +855,8 @@ options:
 - name: osd_op_num_threads_per_shard_ssd
   type: int
   level: advanced
+  fmt_desc: The number of worker threads spawned per OSD shard for a given OSD
+    (for solid state media).
   default: 2
   see_also:
   - osd_op_num_threads_per_shard
@@ -870,7 +877,7 @@ options:
   type: int
   level: advanced
   fmt_desc: the number of shards allocated for a given OSD (for rotational media).
-  default: 5
+  default: 1
   see_also:
   - osd_op_num_shards
   flags:
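For completeness: on clusters where these options are not managed through the centralized configuration database, the same values shown in the troubleshooting section could instead be pinned in ceph.conf on the OSD hosts, followed by OSD restarts. A minimal sketch (section and option names only; adjust to the environment):

    [osd]
        # Interim HDD shard configuration for mClock (see tracker 66289)
        osd_op_num_shards_hdd = 1
        osd_op_num_threads_per_shard_hdd = 5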
