Skip to content

Commit 103cd8e

Browse files
authored
Merge pull request ceph#57722 from sajibreadd/wip-62500
os/bluestore: Warning added for slow operations and stalled read
2 parents 3e4201c + 73b80a9 commit 103cd8e

File tree

10 files changed

+226
-9
lines changed

10 files changed

+226
-9
lines changed

doc/rados/operations/health-checks.rst

Lines changed: 92 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -801,6 +801,98 @@ Or, to disable this alert on a specific OSD, run the following command:
801801

802802
ceph config set osd.123 bluestore_warn_on_spurious_read_errors false
803803

804+
BLOCK_DEVICE_STALLED_READ_ALERT
805+
_______________________________
806+
807+
There are certain BlueStore log messages that surface storage drive issues
808+
that can lead to performance degradation and can cause bad disk.
809+
810+
``read stalled read 0x29f40370000~100000 (buffered) since 63410177.290546s, timeout is 5.000000s``
811+
812+
However, this is difficult to spot as there's no discernible warning (a
813+
health warning or info in ``ceph health detail`` for example). More observations
814+
can be found here: https://tracker.ceph.com/issues/62500
815+
816+
As there can be false positive ``stalled read`` instances, a mechanism
817+
has been added for more reliability. If in last ``bdev_stalled_read_warn_lifetime``
818+
duration ``stalled read`` indications are found more than or equal to
819+
``bdev_stalled_read_warn_threshold`` for a given BlueStore block device, this
820+
warning will be reported in ``ceph health detail``.
821+
822+
By default value of ``bdev_stalled_read_warn_lifetime = 86400s`` and
823+
``bdev_stalled_read_warn_threshold = 1``. But user can configure it for
824+
individual OSDs.
825+
826+
To change this, run the following command:
827+
828+
.. prompt:: bash $
829+
830+
ceph config set global bdev_stalled_read_warn_lifetime 10
831+
ceph config set global bdev_stalled_read_warn_threshold 5
832+
833+
this may be done surgically for individual OSDs or a given mask
834+
835+
.. prompt:: bash $
836+
837+
ceph config set osd.123 bdev_stalled_read_warn_lifetime 10
838+
ceph config set osd.123 bdev_stalled_read_warn_threshold 5
839+
ceph config set class:ssd bdev_stalled_read_warn_lifetime 10
840+
ceph config set class:ssd bdev_stalled_read_warn_threshold 5
841+
842+
WAL_DEVICE_STALLED_READ_ALERT
843+
_____________________________
844+
845+
A similar warning like ``BLOCK_DEVICE_STALLED_READ_ALERT`` will be raised to
846+
identify ``stalled read`` instances on a given BlueStore OSD's ``WAL_DEVICE``.
847+
This warning can be configured via ``bdev_stalled_read_warn_lifetime`` and
848+
``bdev_stalled_read_warn_threshold`` parameters similarly described in the
849+
``BLOCK_DEVICE_STALLED_READ_ALERT`` warning section.
850+
851+
DB_DEVICE_STALLED_READ_ALERT
852+
____________________________
853+
854+
A similar warning like ``BLOCK_DEVICE_STALLED_READ_ALERT`` will be raised to
855+
identify ``stalled read`` instances on a given BlueStore OSD's ``WAL_DEVICE``.
856+
This warning can be configured via ``bdev_stalled_read_warn_lifetime`` and
857+
``bdev_stalled_read_warn_threshold`` parameters similarly described in the
858+
``BLOCK_DEVICE_STALLED_READ_ALERT`` warning section.
859+
860+
BLUESTORE_SLOW_OP_ALERT
861+
_______________________
862+
863+
There are certain BlueStore log messages that surface storage drive issues
864+
that can lead to performance degradation and can cause bad disk.
865+
866+
``log_latency_fn slow operation observed for _txc_committed_kv, latency = 12.028621219s, txc = 0x55a107c30f00``
867+
``log_latency_fn slow operation observed for upper_bound, latency = 6.25955s``
868+
``log_latency slow operation observed for submit_transaction..``
869+
870+
As there can be false positive ``slow ops`` instances, a mechanism has
871+
been added for more reliability. If in last ``bluestore_slow_ops_warn_lifetime``
872+
duration ``slow ops`` indications are found more than or equal to
873+
``bluestore_slow_ops_warn_threshold`` for a given BlueStore OSD, this warning
874+
will be reported in ``ceph health detail``.
875+
876+
By default value of ``bluestore_slow_ops_warn_lifetime = 86400s`` and
877+
``bluestore_slow_ops_warn_threshold = 1``. But user can configure it for
878+
individual OSDs.
879+
880+
To change this, run the following command:
881+
882+
.. prompt:: bash $
883+
884+
ceph config set global bluestore_slow_ops_warn_lifetime 10
885+
ceph config set global bluestore_slow_ops_warn_threshold 5
886+
887+
this may be done surgically for individual OSDs or a given mask
888+
889+
.. prompt:: bash $
890+
891+
ceph config set osd.123 bluestore_slow_ops_warn_lifetime 10
892+
ceph config set osd.123 bluestore_slow_ops_warn_threshold 5
893+
ceph config set class:ssd bluestore_slow_ops_warn_lifetime 10
894+
ceph config set class:ssd bluestore_slow_ops_warn_threshold 5
895+
804896
Device health
805897
-------------
806898

src/blk/BlockDevice.cc

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -42,6 +42,7 @@
4242
#define dout_prefix *_dout << "bdev "
4343

4444
using std::string;
45+
using ceph::mono_clock;
4546

4647

4748
blk_access_mode_t buffermode(bool buffered)
@@ -191,3 +192,39 @@ bool BlockDevice::is_valid_io(uint64_t off, uint64_t len) const {
191192
}
192193
return ret;
193194
}
195+
196+
size_t BlockDevice::trim_stalled_read_event_queue(mono_clock::time_point cur_time) {
197+
std::lock_guard lock(stalled_read_event_queue_lock);
198+
auto warn_duration = std::chrono::seconds(cct->_conf->bdev_stalled_read_warn_lifetime);
199+
while (!stalled_read_event_queue.empty() &&
200+
((stalled_read_event_queue.front() < cur_time - warn_duration) ||
201+
(stalled_read_event_queue.size() > cct->_conf->bdev_stalled_read_warn_threshold))) {
202+
stalled_read_event_queue.pop();
203+
}
204+
return stalled_read_event_queue.size();
205+
}
206+
207+
void BlockDevice::add_stalled_read_event() {
208+
if (!cct->_conf->bdev_stalled_read_warn_threshold) {
209+
return;
210+
}
211+
auto cur_time = mono_clock::now();
212+
{
213+
std::lock_guard lock(stalled_read_event_queue_lock);
214+
stalled_read_event_queue.push(cur_time);
215+
}
216+
trim_stalled_read_event_queue(cur_time);
217+
}
218+
219+
void BlockDevice::collect_alerts(osd_alert_list_t& alerts, const std::string& device_name) {
220+
if (cct->_conf->bdev_stalled_read_warn_threshold) {
221+
size_t qsize = trim_stalled_read_event_queue(mono_clock::now());
222+
if (qsize >= cct->_conf->bdev_stalled_read_warn_threshold) {
223+
std::ostringstream ss;
224+
ss << "observed stalled read indications in "
225+
<< device_name << " device";
226+
alerts.emplace(device_name + "_DEVICE_STALLED_READ_ALERT", ss.str());
227+
}
228+
}
229+
}
230+

src/blk/BlockDevice.h

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -25,11 +25,13 @@
2525
#include <set>
2626
#include <string>
2727
#include <vector>
28+
#include <queue>
2829

2930
#include "acconfig.h"
3031
#include "common/ceph_mutex.h"
3132
#include "include/common_fwd.h"
3233
#include "extblkdev/ExtBlkDevInterface.h"
34+
#include "osd/osd_types.h"
3335

3436
#if defined(HAVE_LIBAIO) || defined(HAVE_POSIXAIO)
3537
#include "aio/aio.h"
@@ -148,6 +150,8 @@ class BlockDevice {
148150
public:
149151
CephContext* cct;
150152
typedef void (*aio_callback_t)(void *handle, void *aio);
153+
void collect_alerts(osd_alert_list_t& alerts, const std::string& device_name);
154+
151155
private:
152156
ceph::mutex ioc_reap_lock = ceph::make_mutex("BlockDevice::ioc_reap_lock");
153157
std::vector<IOContext*> ioc_reap_queue;
@@ -164,12 +168,14 @@ class BlockDevice {
164168
pmem,
165169
#endif
166170
};
171+
std::queue <ceph::mono_clock::time_point> stalled_read_event_queue;
172+
ceph::mutex stalled_read_event_queue_lock = ceph::make_mutex("BlockDevice::stalled_read_event_queue_lock");
173+
size_t trim_stalled_read_event_queue(mono_clock::time_point cur_time);
167174
static block_device_t detect_device_type(const std::string& path);
168175
static block_device_t device_type_from_name(const std::string& blk_dev_name);
169176
static BlockDevice *create_with_type(block_device_t device_type,
170177
CephContext* cct, const std::string& path, aio_callback_t cb,
171178
void *cbpriv, aio_callback_t d_cb, void *d_cbpriv);
172-
173179
protected:
174180
uint64_t size = 0;
175181
uint64_t block_size = 0;
@@ -187,6 +193,7 @@ class BlockDevice {
187193
// of the drive. The zones 524-52155 are sequential zones.
188194
uint64_t conventional_region_size = 0;
189195
uint64_t zone_size = 0;
196+
void add_stalled_read_event();
190197

191198
public:
192199
aio_callback_t aio_callback;

src/blk/kernel/KernelDevice.cc

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1325,6 +1325,7 @@ int KernelDevice::read(uint64_t off, uint64_t len, bufferlist *pbl,
13251325
<< " since " << start1 << ", timeout is "
13261326
<< age
13271327
<< "s" << dendl;
1328+
add_stalled_read_event();
13281329
}
13291330
if (r < 0) {
13301331
if (ioc->allow_eio && is_expected_ioerr(-errno)) {
@@ -1398,6 +1399,7 @@ int KernelDevice::direct_read_unaligned(uint64_t off, uint64_t len, char *buf)
13981399
<< " since " << start1 << ", timeout is "
13991400
<< age
14001401
<< "s" << dendl;
1402+
add_stalled_read_event();
14011403
}
14021404

14031405
if (r < 0) {
@@ -1461,6 +1463,7 @@ int KernelDevice::read_random(uint64_t off, uint64_t len, char *buf,
14611463
<< " (buffered) since " << start1 << ", timeout is "
14621464
<< age
14631465
<< "s" << dendl;
1466+
add_stalled_read_event();
14641467
}
14651468
} else {
14661469
//direct and aligned read
@@ -1471,6 +1474,7 @@ int KernelDevice::read_random(uint64_t off, uint64_t len, char *buf,
14711474
<< " (direct) since " << start1 << ", timeout is "
14721475
<< age
14731476
<< "s" << dendl;
1477+
add_stalled_read_event();
14741478
}
14751479
if (r < 0) {
14761480
r = -errno;

src/common/options/global.yaml.in

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5391,6 +5391,18 @@ options:
53915391
desc: Enable health indication when spurious read errors are observed by OSD
53925392
default: true
53935393
with_legacy: true
5394+
- name: bluestore_slow_ops_warn_lifetime
5395+
type: uint
5396+
level: advanced
5397+
desc: A configurable duration for slow ops warning to be appeared if number of occurence pass `bluestore_slow_ops_warn_threshold` in `bluestore_slow_ops_warn_lifetime` seconds
5398+
default: 86400
5399+
with_legacy: true
5400+
- name: bluestore_slow_ops_warn_threshold
5401+
type: uint
5402+
level: advanced
5403+
desc: A configurable number for slow ops warning to be appeared if number of occurence pass `bluestore_slow_ops_warn_threshold` in `bluestore_slow_ops_warn_lifetime` seconds
5404+
default: 1
5405+
with_legacy: true
53945406
- name: bluestore_fsck_error_on_no_per_pool_omap
53955407
type: bool
53965408
level: advanced
@@ -6468,6 +6480,18 @@ options:
64686480
- aio
64696481
- spdk
64706482
- pmem
6483+
- name: bdev_stalled_read_warn_lifetime
6484+
type: uint
6485+
level: advanced
6486+
desc: A configurable duration for stalled read warning to be appeared if number of stalled read occurence pass `bdev_stalled_read_warn_threshold` in `bdev_stalled_read_warn_lifetime` seconds
6487+
default: 86400
6488+
with_legacy: true
6489+
- name: bdev_stalled_read_warn_threshold
6490+
type: uint
6491+
level: advanced
6492+
desc: A configurable number for stalled read warning to be appeared if number of stalled read occurence pass `bdev_stalled_read_warn_threshold` in `bdev_stalled_read_warn_lifetime` seconds
6493+
default: 1
6494+
with_legacy: true
64716495
- name: bluestore_cleaner_sleep_interval
64726496
type: float
64736497
level: advanced

src/mon/PGMap.cc

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3238,6 +3238,14 @@ void PGMap::get_health_checks(
32383238
summary += " reporting legacy (not per-pool) BlueStore omap usage stats";
32393239
} else if (asum.first == "BLUESTORE_SPURIOUS_READ_ERRORS") {
32403240
summary += " have spurious read errors";
3241+
} else if (asum.first == "BLUESTORE_SLOW_OP_ALERT") {
3242+
summary += " experiencing slow operations in BlueStore";
3243+
} else if (asum.first == "BLOCK_DEVICE_STALLED_READ_ALERT") {
3244+
summary += " experiencing stalled read in block device of BlueStore";
3245+
} else if (asum.first == "WAL_DEVICE_STALLED_READ_ALERT") {
3246+
summary += " experiencing stalled read in wal device of BlueFS";
3247+
} else if (asum.first == "DB_DEVICE_STALLED_READ_ALERT") {
3248+
summary += " experiencing stalled read in db device of BlueFS";
32413249
}
32423250

32433251
auto& d = checks->add(asum.first, HEALTH_WARN, summary, asum.second.first);

src/os/bluestore/BlueFS.cc

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4254,6 +4254,15 @@ bool BlueFS::debug_get_is_dev_dirty(FileWriter *h, uint8_t dev)
42544254
return h->dirty_devs[dev];
42554255
}
42564256

4257+
void BlueFS::collect_alerts(osd_alert_list_t& alerts) {
4258+
if (bdev[BDEV_DB]) {
4259+
bdev[BDEV_DB]->collect_alerts(alerts, "DB");
4260+
}
4261+
if (bdev[BDEV_WAL]) {
4262+
bdev[BDEV_WAL]->collect_alerts(alerts, "WAL");
4263+
}
4264+
}
4265+
42574266
int BlueFS::open_for_read(
42584267
std::string_view dirname,
42594268
std::string_view filename,

src/os/bluestore/BlueFS.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -231,6 +231,7 @@ class BlueFS {
231231
WRITER_WAL,
232232
WRITER_SST,
233233
};
234+
void collect_alerts(osd_alert_list_t& alerts);
234235

235236
struct File : public RefCountedObject {
236237
MEMPOOL_CLASS_HELPERS();
@@ -433,7 +434,6 @@ class BlueFS {
433434
FileRef file;
434435
explicit FileLock(FileRef f) : file(std::move(f)) {}
435436
};
436-
437437
private:
438438
PerfCounters *logger = nullptr;
439439

src/os/bluestore/BlueStore.cc

Lines changed: 36 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -12234,7 +12234,7 @@ int BlueStore::_do_read(
1223412234
int BlueStore::_verify_csum(OnodeRef& o,
1223512235
const bluestore_blob_t* blob, uint64_t blob_xoffset,
1223612236
const bufferlist& bl,
12237-
uint64_t logical_offset) const
12237+
uint64_t logical_offset)
1223812238
{
1223912239
int bad;
1224012240
uint64_t bad_csum;
@@ -17914,13 +17914,34 @@ int BlueStore::_merge_collection(
1791417914
return r;
1791517915
}
1791617916

17917+
size_t BlueStore::_trim_slow_op_event_queue(mono_clock::time_point cur_time) {
17918+
ceph_assert(ceph_mutex_is_locked(qlock));
17919+
auto warn_duration = std::chrono::seconds(cct->_conf->bluestore_slow_ops_warn_lifetime);
17920+
while (!slow_op_event_queue.empty() &&
17921+
((slow_op_event_queue.front() < cur_time - warn_duration) ||
17922+
(slow_op_event_queue.size() > cct->_conf->bluestore_slow_ops_warn_threshold))) {
17923+
slow_op_event_queue.pop();
17924+
}
17925+
return slow_op_event_queue.size();
17926+
}
17927+
17928+
void BlueStore::_add_slow_op_event() {
17929+
if (!cct->_conf->bluestore_slow_ops_warn_threshold) {
17930+
return;
17931+
}
17932+
std::lock_guard lock(qlock);
17933+
auto cur_time = mono_clock::now();
17934+
slow_op_event_queue.push(cur_time);
17935+
_trim_slow_op_event_queue(cur_time);
17936+
}
17937+
1791717938
void BlueStore::log_latency(
1791817939
const char* name,
1791917940
int idx,
1792017941
const ceph::timespan& l,
1792117942
double lat_threshold,
1792217943
const char* info,
17923-
int idx2) const
17944+
int idx2)
1792417945
{
1792517946
logger->tinc(idx, l);
1792617947
if (lat_threshold > 0.0 &&
@@ -17929,6 +17950,7 @@ void BlueStore::log_latency(
1792917950
<< ", latency = " << l
1793017951
<< info
1793117952
<< dendl;
17953+
_add_slow_op_event();
1793217954
if (idx2 > l_bluestore_first && idx2 < l_bluestore_last) {
1793317955
logger->inc(idx2);
1793417956
}
@@ -17941,7 +17963,7 @@ void BlueStore::log_latency_fn(
1794117963
const ceph::timespan& l,
1794217964
double lat_threshold,
1794317965
std::function<string (const ceph::timespan& lat)> fn,
17944-
int idx2) const
17966+
int idx2)
1794517967
{
1794617968
logger->tinc(idx, l);
1794717969
if (lat_threshold > 0.0 &&
@@ -17950,6 +17972,7 @@ void BlueStore::log_latency_fn(
1795017972
<< ", latency = " << l
1795117973
<< fn(l)
1795217974
<< dendl;
17975+
_add_slow_op_event();
1795317976
if (idx2 > l_bluestore_first && idx2 < l_bluestore_last) {
1795417977
logger->inc(idx2);
1795517978
}
@@ -18352,7 +18375,16 @@ void BlueStore::_log_alerts(osd_alert_list_t& alerts)
1835218375
} else if (!spillover_alert.empty()){
1835318376
spillover_alert.clear();
1835418377
}
18355-
18378+
if (cct->_conf->bluestore_slow_ops_warn_threshold) {
18379+
size_t qsize = _trim_slow_op_event_queue(mono_clock::now());
18380+
if (qsize >= cct->_conf->bluestore_slow_ops_warn_threshold) {
18381+
ostringstream ss;
18382+
ss << "observed slow operation indications in BlueStore";
18383+
alerts.emplace("BLUESTORE_SLOW_OP_ALERT", ss.str());
18384+
}
18385+
}
18386+
bdev->collect_alerts(alerts, "BLOCK");
18387+
bluefs->collect_alerts(alerts);
1835618388
if (!spurious_read_errors_alert.empty() &&
1835718389
cct->_conf->bluestore_warn_on_spurious_read_errors) {
1835818390
alerts.emplace(

0 commit comments

Comments
 (0)