Commit da4b85c

common,osd: Use last valid OSD IOPS value if measured IOPS is unrealistic
The OSD's IOPS capacity is used by the mClock scheduler to determine the quantum of bandwidth allocation for the various operations on the OSD.

Prior to this commit, maybe_override_max_osd_capacity_for_qos() only checked whether the measured IOPS capacity exceeded the higher threshold defined by 'osd_mclock_iops_capacity_threshold_[hdd|ssd]' and, if so, fell back to the last valid or the default IOPS capacity as defined by osd_mclock_max_capacity_iops_[hdd|ssd].

It's quite possible that the reported IOPS is unrealistically low. This could be due to transient factors on the underlying device, or it could indicate bad health of the device. Either way, the safer option is to fall back to the last valid or the default IOPS setting for that OSD in order to avoid cluster performance issues (slow or stalled ops) down the line.

Therefore, to handle this case, the commit introduces additional config options, viz.:
- osd_mclock_iops_capacity_low_threshold_hdd - set to 50 IOPS and
- osd_mclock_iops_capacity_low_threshold_ssd - set to 1000 IOPS

If the measured IOPS capacity doesn't fall within the low and high threshold range, the default or the last valid IOPS capacity is used. The existing cluster log warning is suitably modified to convey the reason.

Additionally, for a couple of valgrind-related teuthology tests, the cluster warning is added to the ignorelist since the reported IOPS can be very low due to slowness.

Fixes: https://tracker.ceph.com/issues/67421
Signed-off-by: Sridhar Seshasayee <[email protected]>
1 parent b1d5705 commit da4b85c
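
The fallback rule described in the commit message can be sketched in isolation. The helper names below (`IopsThresholds`, `iops_in_range`, `select_iops_capacity`) are hypothetical and do not exist in Ceph; the sketch only assumes, as the OSD.cc hunk suggests, that both bounds are inclusive and that a rejected measurement leaves the previous value in place.

```cpp
// Hypothetical illustration of the range check; not Ceph code.
struct IopsThresholds {
  double low;   // stands in for osd_mclock_iops_capacity_low_threshold_[hdd|ssd]
  double high;  // stands in for osd_mclock_iops_capacity_threshold_[hdd|ssd]
};

// True when the measured value lies inside [low, high]. The bounds are
// inclusive because the rejection test uses strict '<' and '>'.
inline bool iops_in_range(double measured, const IopsThresholds& t) {
  return !(measured < t.low || measured > t.high);
}

// Returns the capacity to use: the fresh measurement when it is plausible,
// otherwise the last valid (or default) value that was already in effect.
inline double select_iops_capacity(double measured, double last_valid,
                                   const IopsThresholds& t) {
  return iops_in_range(measured, t) ? measured : last_valid;
}
```

With HDD-style bounds of 50 and 500, a measurement of 20 or 900 IOPS would be discarded in favour of the previous value, while 300 IOPS would be accepted.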

5 files changed: 68 additions, 16 deletions

doc/rados/configuration/mclock-config-ref.rst

Lines changed: 2 additions & 0 deletions
@@ -748,6 +748,8 @@ mClock Config Options
 .. confval:: osd_mclock_skip_benchmark
 .. confval:: osd_mclock_override_recovery_settings
 .. confval:: osd_mclock_iops_capacity_threshold_hdd
+.. confval:: osd_mclock_iops_capacity_low_threshold_hdd
 .. confval:: osd_mclock_iops_capacity_threshold_ssd
+.. confval:: osd_mclock_iops_capacity_low_threshold_ssd
 
 .. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf

qa/suites/rados/valgrind-leaks/1-start.yaml

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ overrides:
       - overall HEALTH_
       - \(PG_
       - \(POOL_APP_NOT_ENABLED\)
+      - OSD bench result
     conf:
       global:
         osd heartbeat grace: 40

qa/suites/rados/verify/validater/valgrind.yaml

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ overrides:
       - \(MON_DOWN\)
       - \(SLOW_OPS\)
       - slow request
+      - OSD bench result
     valgrind:
       mon: [--tool=memcheck, --leak-check=full, --show-reachable=yes]
       osd: [--tool=memcheck]

src/common/options/osd.yaml.in

Lines changed: 50 additions & 8 deletions
@@ -1293,25 +1293,67 @@ options:
   level: basic
   desc: The threshold IOPs capacity (at 4KiB block size) beyond which to ignore
     the OSD bench results for an OSD (for rotational media)
-  long_desc: This option specifies the threshold IOPS capacity for an OSD under
-    which the OSD bench results can be considered for QoS calculations. Only
-    considered for osd_op_queue = mclock_scheduler
+  long_desc: This option specifies the high threshold IOPS capacity for an OSD
+    below which the OSD bench results can be considered for QoS calculations.
+    Only considered when osd_op_queue = mclock_scheduler
   fmt_desc: The threshold IOPS capacity (at 4KiB block size) beyond which to
-    ignore OSD bench results for an OSD (for rotational media)
+    ignore OSD bench results for an OSD (for rotational media) and fall back to
+    the last valid or default IOPS capacity defined by
+    ``osd_mclock_max_capacity_iops_hdd``.
   default: 500
+  see_also:
+  - osd_mclock_max_capacity_iops_hdd
+  flags:
+  - runtime
+- name: osd_mclock_iops_capacity_low_threshold_hdd
+  type: float
+  level: basic
+  desc: The threshold IOPs capacity (at 4KiB block size) below which to ignore
+    the OSD bench results for an OSD (for rotational media)
+  long_desc: This option specifies the low threshold IOPS capacity of an OSD
+    above which the OSD bench results can be considered for QoS calculations.
+    Only considered when osd_op_queue = mclock_scheduler
+  fmt_desc: The threshold IOPS capacity (at 4KiB block size) below which to
+    ignore OSD bench results for an OSD (for rotational media) and fall back to
+    the last valid or default IOPS capacity defined by
+    ``osd_mclock_max_capacity_iops_hdd``.
+  default: 50
+  see_also:
+  - osd_mclock_max_capacity_iops_hdd
   flags:
   - runtime
 - name: osd_mclock_iops_capacity_threshold_ssd
   type: float
   level: basic
   desc: The threshold IOPs capacity (at 4KiB block size) beyond which to ignore
     the OSD bench results for an OSD (for solid state media)
-  long_desc: This option specifies the threshold IOPS capacity for an OSD under
-    which the OSD bench results can be considered for QoS calculations. Only
-    considered for osd_op_queue = mclock_scheduler
+  long_desc: This option specifies the high threshold IOPS capacity for an OSD
+    below which the OSD bench results can be considered for QoS calculations.
+    Only considered when osd_op_queue = mclock_scheduler
   fmt_desc: The threshold IOPS capacity (at 4KiB block size) beyond which to
-    ignore OSD bench results for an OSD (for solid state media)
+    ignore OSD bench results for an OSD (for solid state media) and fall back to
+    the last valid or default IOPS capacity defined by
+    ``osd_mclock_max_capacity_iops_ssd``.
   default: 80000
+  see_also:
+  - osd_mclock_max_capacity_iops_ssd
+  flags:
+  - runtime
+- name: osd_mclock_iops_capacity_low_threshold_ssd
+  type: float
+  level: basic
+  desc: The threshold IOPs capacity (at 4KiB block size) below which to ignore
+    the OSD bench results for an OSD (for solid state media)
+  long_desc: This option specifies the low threshold IOPS capacity for an OSD
+    above which the OSD bench results can be considered for QoS calculations.
+    Only considered when osd_op_queue = mclock_scheduler
+  fmt_desc: The threshold IOPS capacity (at 4KiB block size) below which to
+    ignore OSD bench results for an OSD (for solid state media) and fall back to
+    the last valid or default IOPS capacity defined by
+    ``osd_mclock_max_capacity_iops_ssd``.
+  default: 1000
+  see_also:
+  - osd_mclock_max_capacity_iops_ssd
   flags:
   - runtime
 # Set to true for testing. Users should NOT set this.
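
Taken together, the four defaults in this file give an accepted range of 50 to 500 IOPS for rotational media and 1000 to 80000 IOPS for solid state media. The following sketch hard-codes those defaults purely for illustration; the function names are hypothetical, and a real OSD reads the values from its configuration at runtime rather than from constants.

```cpp
#include <utility>

// Default [low, high] ranges copied from the option definitions in this
// diff. Hypothetical helper; operators may override any of these values.
inline std::pair<double, double> default_threshold_range(bool rotational) {
  return rotational ? std::make_pair(50.0, 500.0)
                    : std::make_pair(1000.0, 80000.0);
}

// True when a bench result would be accepted under the default ranges.
inline bool bench_result_accepted(double iops, bool rotational) {
  const std::pair<double, double> range = default_threshold_range(rotational);
  return iops >= range.first && iops <= range.second;
}
```

Under these defaults, a result of 600 IOPS is rejected for both device types: it is above the HDD high threshold and below the SSD low threshold.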

src/osd/OSD.cc

Lines changed: 14 additions & 8 deletions
@@ -10154,22 +10154,28 @@ void OSD::maybe_override_max_osd_capacity_for_qos()
             << dendl;
 
     // Get the threshold IOPS set for the underlying hdd/ssd.
-    double threshold_iops = 0.0;
+    double hi_threshold_iops = 0.0;
+    double lo_threshold_iops = 0.0;
     if (store_is_rotational) {
-      threshold_iops = cct->_conf.get_val<double>(
+      hi_threshold_iops = cct->_conf.get_val<double>(
         "osd_mclock_iops_capacity_threshold_hdd");
+      lo_threshold_iops = cct->_conf.get_val<double>(
+        "osd_mclock_iops_capacity_low_threshold_hdd");
     } else {
-      threshold_iops = cct->_conf.get_val<double>(
+      hi_threshold_iops = cct->_conf.get_val<double>(
         "osd_mclock_iops_capacity_threshold_ssd");
+      lo_threshold_iops = cct->_conf.get_val<double>(
+        "osd_mclock_iops_capacity_low_threshold_ssd");
     }
 
     // Persist the iops value to the MON store or throw cluster warning
-    // if the measured iops exceeds the set threshold. If the iops exceed
-    // the threshold, the default value is used.
-    if (iops > threshold_iops) {
+    // if the measured iops is not in the threshold range. If the iops is
+    // not within the threshold range, the current/default value is retained.
+    if (iops < lo_threshold_iops || iops > hi_threshold_iops) {
       clog->warn() << "OSD bench result of " << std::to_string(iops)
-                   << " IOPS exceeded the threshold limit of "
-                   << std::to_string(threshold_iops) << " IOPS for osd."
+                   << " IOPS is not within the threshold limit range of "
+                   << std::to_string(lo_threshold_iops) << " IOPS and "
+                   << std::to_string(hi_threshold_iops) << " IOPS for osd."
                    << std::to_string(whoami) << ". IOPS capacity is unchanged"
                    << " at " << std::to_string(cur_iops) << " IOPS. The"
                    << " recommendation is to establish the osd's IOPS capacity"
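
To see roughly what the reworded warning looks like in the cluster log, the visible part of the message can be rebuilt as a plain string. This is a sketch under assumptions: `bench_warning_prefix` is a hypothetical function, it stops where the hunk's context ends (the rest of the message is not shown in this diff), and it writes to a `std::string` rather than Ceph's `clog`.

```cpp
#include <string>

// Rebuilds only the portion of the warning visible in the hunk above.
// std::to_string formats a double with six digits after the decimal point,
// so every IOPS figure appears as, for example, "50.000000".
inline std::string bench_warning_prefix(double iops, double lo, double hi,
                                        int whoami, double cur_iops) {
  return "OSD bench result of " + std::to_string(iops)
       + " IOPS is not within the threshold limit range of "
       + std::to_string(lo) + " IOPS and "
       + std::to_string(hi) + " IOPS for osd."
       + std::to_string(whoami) + ". IOPS capacity is unchanged"
       + " at " + std::to_string(cur_iops) + " IOPS.";
}
```

The message still begins with "OSD bench result", which is the substring the two valgrind QA suites add to their log-ignorelist in this commit.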
