Commit 52915db
osd: Optimised EC avoids ever reading more than K shards (if plugin supports it).
Plugins which support partial reads should never need more than k shards to read the data, even if some shards have failed. However, rebalancing commonly requests k + m shards, because very frequently all shards are moved. If this occurs and all k + m shards are online, the read is satisfied by reading ALL shards rather than just k shards. This commit fixes that issue.

The problem is that we do not want to change the API of the old EC code, so we cannot update the plugin behaviour here. Instead, the EC code itself reduces the number of shards it passes to minimum_to_decode.

A comment notes that bitset_set performance could be improved using _pdep_u64. This would require fiddly platform-specific code and would likely not show any performance improvement for most applications. The majority of calls to this function will be with a bitset that has <= n set bits and will never enter this if statement. When more than n bits are set we are going to save one or more read I/Os, so the cost of the for loop is insignificant compared with that saving. The comment is left in as a hint to future users of this function.

Further notes were made in a review comment that are worth recording:

- If performance is limited by the drives, then fewer read I/Os is a clear advantage.
- If performance is limited by the network, then fewer remote read I/Os is a clear advantage.
- If performance is limited by the CPU, then the CPU cost of m unnecessary remote read I/Os (messenger + bluestore) is almost certainly higher than the cost of doing an extra encode operation to calculate the coding parities.
- If performance is limited by system memory bandwidth, then encode + crc generation has less overhead than the read + bluestore crc check + messenger overheads.

Longer term this logic should probably be pushed into the plugins, in particular to give LRC the opportunity to optimise for locality of the shards. The reason for not doing this now is that it would be messy, because the legacy EC code cannot support this optimisation and LRC is not yet optimising for locality.

Signed-off-by: Alex Ainscow <[email protected]>
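As an illustration of the trimming idea described above, the following standalone sketch keeps only the first k wanted shards before asking the plugin what to read. It uses std::set<int> as a stand-in for Ceph's shard_id_set and a hypothetical trim_want_to_k() helper; it is not code from this commit.

// Illustrative sketch only (assumed names): std::set<int> stands in for
// Ceph's shard_id_set and trim_want_to_k() is a hypothetical helper.
#include <cassert>
#include <set>

// Keep only the first k shards of "want"; a partial-read capable plugin
// never needs more than k shards to reconstruct the data.
std::set<int> trim_want_to_k(const std::set<int> &want, unsigned k) {
  std::set<int> trimmed;
  for (int shard : want) {
    if (trimmed.size() >= k) {
      break;  // drop everything from the k-th wanted shard onwards
    }
    trimmed.insert(shard);
  }
  return trimmed;
}

int main() {
  // k = 4 data shards, m = 2 parity shards; rebalancing asked for all 6.
  std::set<int> want = {0, 1, 2, 3, 4, 5};
  std::set<int> trimmed = trim_want_to_k(want, 4);
  assert(trimmed == (std::set<int>{0, 1, 2, 3}));  // only k shards are read
  return 0;
}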
1 parent 9e6960d commit 52915db

3 files changed: +90 −1 lines changed

src/common/bitset_set.h

Lines changed: 23 additions & 0 deletions
@@ -283,6 +283,29 @@ class bitset_set {
     return end();
   }
 
+  /** @return a const_iterator to the nth key or end if it does not exist.
+   *
+   * This is called "find_nth" rather than overloading find, as it is clearer
+   * what it is doing: find(4) may imply "find(Key(4))".
+   */
+  const_iterator find_nth(unsigned int n) const {
+    for (size_t i = 0; i < word_count; ++i) {
+      unsigned int bits_set = std::popcount(words[i]);
+      if (bits_set > n) {
+        uint64_t tmp = words[i];
+        // This could be optimised with BMI _pdep_u64
+        for (unsigned int j = 0; j < n; ++j) {
+          // This clears the least significant bit that is set to 1.
+          tmp &= tmp - 1;
+        }
+        return const_iterator(this,
+                              std::countr_zero(tmp) + i * bits_per_uint64_t);
+      }
+      n -= bits_set;
+    }
+    return end();
+  }
+
   /** @return number of keys in the container. O(1) complexity on most
    * modern CPUs.
    */
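The comment above hints that the bit-scan loop could be replaced with BMI2's _pdep_u64. As a rough sketch of what that would look like (platform-specific, requires a BMI2-capable x86-64 CPU and compiling with -mbmi2; not part of this commit):

#include <bit>          // std::countr_zero (C++20)
#include <cstdint>
#include <immintrin.h>  // _pdep_u64 (BMI2)

// Return the position of the n-th (0-based) set bit of `word`.
// _pdep_u64 deposits the single bit (1 << n) into the position of the
// n-th set bit of `word`; countr_zero then recovers that position.
// The caller must guarantee that `word` has more than n bits set.
inline unsigned nth_set_bit_pdep(uint64_t word, unsigned n) {
  return std::countr_zero(_pdep_u64(uint64_t{1} << n, word));
}

The committed portable loop was preferred because, as the commit message notes, this path only runs when a read I/O is about to be saved, so the cost of the loop is negligible.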

src/osd/ECCommon.cc

Lines changed: 16 additions & 1 deletion
@@ -220,8 +220,23 @@ int ECCommon::ReadPipeline::get_min_avail_to_read_shards(
 
   read_request.shard_want_to_read.populate_shard_id_set(want);
 
-  int r = ec_impl->minimum_to_decode(want, have, need_set,
+  int r = 0;
+  auto kth_iter = want.find_nth(sinfo.get_k());
+  if (kth_iter != want.end()) {
+    // If we support partial reads, we are making the assumption that only
+    // K shards need to be read to recover data. We opt here for minimising
+    // the number of reads over minimising the amount of parity calculations
+    // that are needed.
+    shard_id_set want_for_plugin = want;
+    shard_id_t kth = *kth_iter;
+    want_for_plugin.erase_range(kth, sinfo.get_k_plus_m() - (int)kth);
+    r = ec_impl->minimum_to_decode(want_for_plugin, have, need_set,
                                      need_sub_chunks.get());
+  } else {
+    r = ec_impl->minimum_to_decode(want, have, need_set,
+                                   need_sub_chunks.get());
+  }
+
   if (r < 0) {
     dout(20) << "minimum_to_decode_failed r: " << r << "want: " << want
              << " have: " << have << " need: " << need_set << dendl;

src/test/common/test_bitset_set.cc

Lines changed: 51 additions & 0 deletions
@@ -211,3 +211,54 @@ TEST(bitset_set, fmt_formatting) {
   oss << bitset;
   EXPECT_EQ(using_fmt, oss.str());
 }
+
+TEST(bitset_set, find_nth) {
+  constexpr size_t range = 128;
+  bitset_set<range, Key> bitset;
+
+  ASSERT_EQ(bitset.end(), bitset.find_nth(0));
+  ASSERT_EQ(bitset.end(), bitset.find_nth(1));
+  ASSERT_EQ(bitset.end(), bitset.find_nth(range));
+
+  bitset.insert(0);
+  ASSERT_EQ(Key(0), *bitset.find_nth(0));
+  ASSERT_EQ(bitset.end(), bitset.find_nth(1));
+  ASSERT_EQ(bitset.end(), bitset.find_nth(range));
+
+  // Single bit set
+  for (unsigned int i = 0; i < range; i++) {
+    bitset.clear();
+    bitset.insert(i);
+    ASSERT_EQ(Key(i), *bitset.find_nth(0));
+    ASSERT_EQ(bitset.end(), bitset.find_nth(1));
+    ASSERT_EQ(bitset.end(), bitset.find_nth(range));
+  }
+
+  /* Alt bits set */
+  bitset.clear();
+  for (unsigned int i = 0; i < range; i += 2) {
+    bitset.insert(i);
+  }
+  for (unsigned int i = 0; i < range / 2; i++) {
+    ASSERT_EQ(Key(i * 2), *bitset.find_nth(i));
+  }
+  ASSERT_EQ(bitset.end(), bitset.find_nth(range / 2));
+
+  /* Other alt bits set */
+  bitset.clear();
+  for (unsigned int i = 1; i < range; i += 2) {
+    bitset.insert(i);
+  }
+  for (unsigned int i = 0; i < range / 2; i++) {
+    ASSERT_EQ(Key(i * 2 + 1), *bitset.find_nth(i));
+  }
+  ASSERT_EQ(bitset.end(), bitset.find_nth(range / 2));
+
+  /* All bits set */
+  bitset.clear();
+  bitset.insert_range(Key(0), range);
+  for (unsigned int i = 0; i < range; i++) {
+    ASSERT_EQ(Key(i), *bitset.find_nth(i));
+  }
+  ASSERT_EQ(bitset.end(), bitset.find_nth(range));
+}
