Skip to content

Commit 73b80a9

Browse files
committed
Warning added for slow operations and stalled read in BlueStore. User can control how much time the warning should persist after last occurence and maximum number of operations as a threshold will be considered for the warning.
Fixes: https://tracker.ceph.com/issues/62500 Signed-off-by: Md Mahamudur Rahaman Sajib <[email protected]>
1 parent 08f2aaf commit 73b80a9

File tree

10 files changed

+226
-9
lines changed

10 files changed

+226
-9
lines changed

doc/rados/operations/health-checks.rst

Lines changed: 92 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -801,6 +801,98 @@ Or, to disable this alert on a specific OSD, run the following command:
801801

802802
ceph config set osd.123 bluestore_warn_on_spurious_read_errors false
803803

804+
BLOCK_DEVICE_STALLED_READ_ALERT
805+
_______________________________
806+
807+
There are certain BlueStore log messages that surface storage drive issues
808+
that can lead to performance degradation and can cause bad disk.
809+
810+
``read stalled read 0x29f40370000~100000 (buffered) since 63410177.290546s, timeout is 5.000000s``
811+
812+
However, this is difficult to spot as there's no discernible warning (a
813+
health warning or info in ``ceph health detail`` for example). More observations
814+
can be found here: https://tracker.ceph.com/issues/62500
815+
816+
As there can be false positive ``stalled read`` instances, a mechanism
817+
has been added for more reliability. If in last ``bdev_stalled_read_warn_lifetime``
818+
duration ``stalled read`` indications are found more than or equal to
819+
``bdev_stalled_read_warn_threshold`` for a given BlueStore block device, this
820+
warning will be reported in ``ceph health detail``.
821+
822+
By default value of ``bdev_stalled_read_warn_lifetime = 86400s`` and
823+
``bdev_stalled_read_warn_threshold = 1``. But user can configure it for
824+
individual OSDs.
825+
826+
To change this, run the following command:
827+
828+
.. prompt:: bash $
829+
830+
ceph config set global bdev_stalled_read_warn_lifetime 10
831+
ceph config set global bdev_stalled_read_warn_threshold 5
832+
833+
this may be done surgically for individual OSDs or a given mask
834+
835+
.. prompt:: bash $
836+
837+
ceph config set osd.123 bdev_stalled_read_warn_lifetime 10
838+
ceph config set osd.123 bdev_stalled_read_warn_threshold 5
839+
ceph config set class:ssd bdev_stalled_read_warn_lifetime 10
840+
ceph config set class:ssd bdev_stalled_read_warn_threshold 5
841+
842+
WAL_DEVICE_STALLED_READ_ALERT
843+
_____________________________
844+
845+
A similar warning like ``BLOCK_DEVICE_STALLED_READ_ALERT`` will be raised to
846+
identify ``stalled read`` instances on a given BlueStore OSD's ``WAL_DEVICE``.
847+
This warning can be configured via ``bdev_stalled_read_warn_lifetime`` and
848+
``bdev_stalled_read_warn_threshold`` parameters similarly described in the
849+
``BLOCK_DEVICE_STALLED_READ_ALERT`` warning section.
850+
851+
DB_DEVICE_STALLED_READ_ALERT
852+
____________________________
853+
854+
A similar warning like ``BLOCK_DEVICE_STALLED_READ_ALERT`` will be raised to
855+
identify ``stalled read`` instances on a given BlueStore OSD's ``WAL_DEVICE``.
856+
This warning can be configured via ``bdev_stalled_read_warn_lifetime`` and
857+
``bdev_stalled_read_warn_threshold`` parameters similarly described in the
858+
``BLOCK_DEVICE_STALLED_READ_ALERT`` warning section.
859+
860+
BLUESTORE_SLOW_OP_ALERT
861+
_______________________
862+
863+
There are certain BlueStore log messages that surface storage drive issues
864+
that can lead to performance degradation and can cause bad disk.
865+
866+
``log_latency_fn slow operation observed for _txc_committed_kv, latency = 12.028621219s, txc = 0x55a107c30f00``
867+
``log_latency_fn slow operation observed for upper_bound, latency = 6.25955s``
868+
``log_latency slow operation observed for submit_transaction..``
869+
870+
As there can be false positive ``slow ops`` instances, a mechanism has
871+
been added for more reliability. If in last ``bluestore_slow_ops_warn_lifetime``
872+
duration ``slow ops`` indications are found more than or equal to
873+
``bluestore_slow_ops_warn_threshold`` for a given BlueStore OSD, this warning
874+
will be reported in ``ceph health detail``.
875+
876+
By default value of ``bluestore_slow_ops_warn_lifetime = 86400s`` and
877+
``bluestore_slow_ops_warn_threshold = 1``. But user can configure it for
878+
individual OSDs.
879+
880+
To change this, run the following command:
881+
882+
.. prompt:: bash $
883+
884+
ceph config set global bluestore_slow_ops_warn_lifetime 10
885+
ceph config set global bluestore_slow_ops_warn_threshold 5
886+
887+
this may be done surgically for individual OSDs or a given mask
888+
889+
.. prompt:: bash $
890+
891+
ceph config set osd.123 bluestore_slow_ops_warn_lifetime 10
892+
ceph config set osd.123 bluestore_slow_ops_warn_threshold 5
893+
ceph config set class:ssd bluestore_slow_ops_warn_lifetime 10
894+
ceph config set class:ssd bluestore_slow_ops_warn_threshold 5
895+
804896
Device health
805897
-------------
806898

src/blk/BlockDevice.cc

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -42,6 +42,7 @@
4242
#define dout_prefix *_dout << "bdev "
4343

4444
using std::string;
45+
using ceph::mono_clock;
4546

4647

4748
blk_access_mode_t buffermode(bool buffered)
@@ -191,3 +192,39 @@ bool BlockDevice::is_valid_io(uint64_t off, uint64_t len) const {
191192
}
192193
return ret;
193194
}
195+
196+
size_t BlockDevice::trim_stalled_read_event_queue(mono_clock::time_point cur_time) {
197+
std::lock_guard lock(stalled_read_event_queue_lock);
198+
auto warn_duration = std::chrono::seconds(cct->_conf->bdev_stalled_read_warn_lifetime);
199+
while (!stalled_read_event_queue.empty() &&
200+
((stalled_read_event_queue.front() < cur_time - warn_duration) ||
201+
(stalled_read_event_queue.size() > cct->_conf->bdev_stalled_read_warn_threshold))) {
202+
stalled_read_event_queue.pop();
203+
}
204+
return stalled_read_event_queue.size();
205+
}
206+
207+
void BlockDevice::add_stalled_read_event() {
208+
if (!cct->_conf->bdev_stalled_read_warn_threshold) {
209+
return;
210+
}
211+
auto cur_time = mono_clock::now();
212+
{
213+
std::lock_guard lock(stalled_read_event_queue_lock);
214+
stalled_read_event_queue.push(cur_time);
215+
}
216+
trim_stalled_read_event_queue(cur_time);
217+
}
218+
219+
void BlockDevice::collect_alerts(osd_alert_list_t& alerts, const std::string& device_name) {
220+
if (cct->_conf->bdev_stalled_read_warn_threshold) {
221+
size_t qsize = trim_stalled_read_event_queue(mono_clock::now());
222+
if (qsize >= cct->_conf->bdev_stalled_read_warn_threshold) {
223+
std::ostringstream ss;
224+
ss << "observed stalled read indications in "
225+
<< device_name << " device";
226+
alerts.emplace(device_name + "_DEVICE_STALLED_READ_ALERT", ss.str());
227+
}
228+
}
229+
}
230+

src/blk/BlockDevice.h

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -25,11 +25,13 @@
2525
#include <set>
2626
#include <string>
2727
#include <vector>
28+
#include <queue>
2829

2930
#include "acconfig.h"
3031
#include "common/ceph_mutex.h"
3132
#include "include/common_fwd.h"
3233
#include "extblkdev/ExtBlkDevInterface.h"
34+
#include "osd/osd_types.h"
3335

3436
#if defined(HAVE_LIBAIO) || defined(HAVE_POSIXAIO)
3537
#include "aio/aio.h"
@@ -148,6 +150,8 @@ class BlockDevice {
148150
public:
149151
CephContext* cct;
150152
typedef void (*aio_callback_t)(void *handle, void *aio);
153+
void collect_alerts(osd_alert_list_t& alerts, const std::string& device_name);
154+
151155
private:
152156
ceph::mutex ioc_reap_lock = ceph::make_mutex("BlockDevice::ioc_reap_lock");
153157
std::vector<IOContext*> ioc_reap_queue;
@@ -164,12 +168,14 @@ class BlockDevice {
164168
pmem,
165169
#endif
166170
};
171+
std::queue <ceph::mono_clock::time_point> stalled_read_event_queue;
172+
ceph::mutex stalled_read_event_queue_lock = ceph::make_mutex("BlockDevice::stalled_read_event_queue_lock");
173+
size_t trim_stalled_read_event_queue(mono_clock::time_point cur_time);
167174
static block_device_t detect_device_type(const std::string& path);
168175
static block_device_t device_type_from_name(const std::string& blk_dev_name);
169176
static BlockDevice *create_with_type(block_device_t device_type,
170177
CephContext* cct, const std::string& path, aio_callback_t cb,
171178
void *cbpriv, aio_callback_t d_cb, void *d_cbpriv);
172-
173179
protected:
174180
uint64_t size = 0;
175181
uint64_t block_size = 0;
@@ -187,6 +193,7 @@ class BlockDevice {
187193
// of the drive. The zones 524-52155 are sequential zones.
188194
uint64_t conventional_region_size = 0;
189195
uint64_t zone_size = 0;
196+
void add_stalled_read_event();
190197

191198
public:
192199
aio_callback_t aio_callback;

src/blk/kernel/KernelDevice.cc

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1310,6 +1310,7 @@ int KernelDevice::read(uint64_t off, uint64_t len, bufferlist *pbl,
13101310
<< " since " << start1 << ", timeout is "
13111311
<< age
13121312
<< "s" << dendl;
1313+
add_stalled_read_event();
13131314
}
13141315
if (r < 0) {
13151316
if (ioc->allow_eio && is_expected_ioerr(-errno)) {
@@ -1383,6 +1384,7 @@ int KernelDevice::direct_read_unaligned(uint64_t off, uint64_t len, char *buf)
13831384
<< " since " << start1 << ", timeout is "
13841385
<< age
13851386
<< "s" << dendl;
1387+
add_stalled_read_event();
13861388
}
13871389

13881390
if (r < 0) {
@@ -1446,6 +1448,7 @@ int KernelDevice::read_random(uint64_t off, uint64_t len, char *buf,
14461448
<< " (buffered) since " << start1 << ", timeout is "
14471449
<< age
14481450
<< "s" << dendl;
1451+
add_stalled_read_event();
14491452
}
14501453
} else {
14511454
//direct and aligned read
@@ -1456,6 +1459,7 @@ int KernelDevice::read_random(uint64_t off, uint64_t len, char *buf,
14561459
<< " (direct) since " << start1 << ", timeout is "
14571460
<< age
14581461
<< "s" << dendl;
1462+
add_stalled_read_event();
14591463
}
14601464
if (r < 0) {
14611465
r = -errno;

src/common/options/global.yaml.in

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5365,6 +5365,18 @@ options:
53655365
desc: Enable health indication when spurious read errors are observed by OSD
53665366
default: true
53675367
with_legacy: true
5368+
- name: bluestore_slow_ops_warn_lifetime
5369+
type: uint
5370+
level: advanced
5371+
desc: A configurable duration for slow ops warning to be appeared if number of occurence pass `bluestore_slow_ops_warn_threshold` in `bluestore_slow_ops_warn_lifetime` seconds
5372+
default: 86400
5373+
with_legacy: true
5374+
- name: bluestore_slow_ops_warn_threshold
5375+
type: uint
5376+
level: advanced
5377+
desc: A configurable number for slow ops warning to be appeared if number of occurence pass `bluestore_slow_ops_warn_threshold` in `bluestore_slow_ops_warn_lifetime` seconds
5378+
default: 1
5379+
with_legacy: true
53685380
- name: bluestore_fsck_error_on_no_per_pool_omap
53695381
type: bool
53705382
level: advanced
@@ -6442,6 +6454,18 @@ options:
64426454
- aio
64436455
- spdk
64446456
- pmem
6457+
- name: bdev_stalled_read_warn_lifetime
6458+
type: uint
6459+
level: advanced
6460+
desc: A configurable duration for stalled read warning to be appeared if number of stalled read occurence pass `bdev_stalled_read_warn_threshold` in `bdev_stalled_read_warn_lifetime` seconds
6461+
default: 86400
6462+
with_legacy: true
6463+
- name: bdev_stalled_read_warn_threshold
6464+
type: uint
6465+
level: advanced
6466+
desc: A configurable number for stalled read warning to be appeared if number of stalled read occurence pass `bdev_stalled_read_warn_threshold` in `bdev_stalled_read_warn_lifetime` seconds
6467+
default: 1
6468+
with_legacy: true
64456469
- name: bluestore_cleaner_sleep_interval
64466470
type: float
64476471
level: advanced

src/mon/PGMap.cc

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3238,6 +3238,14 @@ void PGMap::get_health_checks(
32383238
summary += " reporting legacy (not per-pool) BlueStore omap usage stats";
32393239
} else if (asum.first == "BLUESTORE_SPURIOUS_READ_ERRORS") {
32403240
summary += " have spurious read errors";
3241+
} else if (asum.first == "BLUESTORE_SLOW_OP_ALERT") {
3242+
summary += " experiencing slow operations in BlueStore";
3243+
} else if (asum.first == "BLOCK_DEVICE_STALLED_READ_ALERT") {
3244+
summary += " experiencing stalled read in block device of BlueStore";
3245+
} else if (asum.first == "WAL_DEVICE_STALLED_READ_ALERT") {
3246+
summary += " experiencing stalled read in wal device of BlueFS";
3247+
} else if (asum.first == "DB_DEVICE_STALLED_READ_ALERT") {
3248+
summary += " experiencing stalled read in db device of BlueFS";
32413249
}
32423250

32433251
auto& d = checks->add(asum.first, HEALTH_WARN, summary, asum.second.first);

src/os/bluestore/BlueFS.cc

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4214,6 +4214,15 @@ bool BlueFS::debug_get_is_dev_dirty(FileWriter *h, uint8_t dev)
42144214
return h->dirty_devs[dev];
42154215
}
42164216

4217+
void BlueFS::collect_alerts(osd_alert_list_t& alerts) {
4218+
if (bdev[BDEV_DB]) {
4219+
bdev[BDEV_DB]->collect_alerts(alerts, "DB");
4220+
}
4221+
if (bdev[BDEV_WAL]) {
4222+
bdev[BDEV_WAL]->collect_alerts(alerts, "WAL");
4223+
}
4224+
}
4225+
42174226
int BlueFS::open_for_read(
42184227
std::string_view dirname,
42194228
std::string_view filename,

src/os/bluestore/BlueFS.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -231,6 +231,7 @@ class BlueFS {
231231
WRITER_WAL,
232232
WRITER_SST,
233233
};
234+
void collect_alerts(osd_alert_list_t& alerts);
234235

235236
struct File : public RefCountedObject {
236237
MEMPOOL_CLASS_HELPERS();
@@ -433,7 +434,6 @@ class BlueFS {
433434
FileRef file;
434435
explicit FileLock(FileRef f) : file(std::move(f)) {}
435436
};
436-
437437
private:
438438
PerfCounters *logger = nullptr;
439439

src/os/bluestore/BlueStore.cc

Lines changed: 36 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -12185,7 +12185,7 @@ int BlueStore::_do_read(
1218512185
int BlueStore::_verify_csum(OnodeRef& o,
1218612186
const bluestore_blob_t* blob, uint64_t blob_xoffset,
1218712187
const bufferlist& bl,
12188-
uint64_t logical_offset) const
12188+
uint64_t logical_offset)
1218912189
{
1219012190
int bad;
1219112191
uint64_t bad_csum;
@@ -17865,13 +17865,34 @@ int BlueStore::_merge_collection(
1786517865
return r;
1786617866
}
1786717867

17868+
size_t BlueStore::_trim_slow_op_event_queue(mono_clock::time_point cur_time) {
17869+
ceph_assert(ceph_mutex_is_locked(qlock));
17870+
auto warn_duration = std::chrono::seconds(cct->_conf->bluestore_slow_ops_warn_lifetime);
17871+
while (!slow_op_event_queue.empty() &&
17872+
((slow_op_event_queue.front() < cur_time - warn_duration) ||
17873+
(slow_op_event_queue.size() > cct->_conf->bluestore_slow_ops_warn_threshold))) {
17874+
slow_op_event_queue.pop();
17875+
}
17876+
return slow_op_event_queue.size();
17877+
}
17878+
17879+
void BlueStore::_add_slow_op_event() {
17880+
if (!cct->_conf->bluestore_slow_ops_warn_threshold) {
17881+
return;
17882+
}
17883+
std::lock_guard lock(qlock);
17884+
auto cur_time = mono_clock::now();
17885+
slow_op_event_queue.push(cur_time);
17886+
_trim_slow_op_event_queue(cur_time);
17887+
}
17888+
1786817889
void BlueStore::log_latency(
1786917890
const char* name,
1787017891
int idx,
1787117892
const ceph::timespan& l,
1787217893
double lat_threshold,
1787317894
const char* info,
17874-
int idx2) const
17895+
int idx2)
1787517896
{
1787617897
logger->tinc(idx, l);
1787717898
if (lat_threshold > 0.0 &&
@@ -17880,6 +17901,7 @@ void BlueStore::log_latency(
1788017901
<< ", latency = " << l
1788117902
<< info
1788217903
<< dendl;
17904+
_add_slow_op_event();
1788317905
if (idx2 > l_bluestore_first && idx2 < l_bluestore_last) {
1788417906
logger->inc(idx2);
1788517907
}
@@ -17892,7 +17914,7 @@ void BlueStore::log_latency_fn(
1789217914
const ceph::timespan& l,
1789317915
double lat_threshold,
1789417916
std::function<string (const ceph::timespan& lat)> fn,
17895-
int idx2) const
17917+
int idx2)
1789617918
{
1789717919
logger->tinc(idx, l);
1789817920
if (lat_threshold > 0.0 &&
@@ -17901,6 +17923,7 @@ void BlueStore::log_latency_fn(
1790117923
<< ", latency = " << l
1790217924
<< fn(l)
1790317925
<< dendl;
17926+
_add_slow_op_event();
1790417927
if (idx2 > l_bluestore_first && idx2 < l_bluestore_last) {
1790517928
logger->inc(idx2);
1790617929
}
@@ -18299,7 +18322,16 @@ void BlueStore::_log_alerts(osd_alert_list_t& alerts)
1829918322
} else if (!spillover_alert.empty()){
1830018323
spillover_alert.clear();
1830118324
}
18302-
18325+
if (cct->_conf->bluestore_slow_ops_warn_threshold) {
18326+
size_t qsize = _trim_slow_op_event_queue(mono_clock::now());
18327+
if (qsize >= cct->_conf->bluestore_slow_ops_warn_threshold) {
18328+
ostringstream ss;
18329+
ss << "observed slow operation indications in BlueStore";
18330+
alerts.emplace("BLUESTORE_SLOW_OP_ALERT", ss.str());
18331+
}
18332+
}
18333+
bdev->collect_alerts(alerts, "BLOCK");
18334+
bluefs->collect_alerts(alerts);
1830318335
if (!spurious_read_errors_alert.empty() &&
1830418336
cct->_conf->bluestore_warn_on_spurious_read_errors) {
1830518337
alerts.emplace(

0 commit comments

Comments
 (0)