
Commit ac4e092

bill-scales authored and aainscow committed
osd: Optimized EC incorrectly rolled backwards write
osd: Optimized EC incorrectly rolled backwards write

A bug in choose_acting manifests in this scenario:

* The current primary shard has been absent, so it has missed the latest few writes.
* All the recent writes are partial writes that have not updated shard X.
* All the recent writes have completed.

The authoritative shard is chosen from the set of primary-capable shards that have the highest last epoch started; these all have log entries for the recent writes. The get log shard is chosen from the set of all shards that have the highest last epoch started; this chooses shard X because it is furthest behind.

The primary shard's last_update is not less than the get log shard's last_update, so this if statement decides that it has a good enough log:

    if ((repeat_getlog != nullptr) &&
        get_log_shard != all_info.end() &&
        (info.last_update < get_log_shard->second.last_update) &&
        pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {

We then proceed through peering using the primary log and the log from shard X. Neither has details about the recent writes, which are then incorrectly rolled back.

The if statement should instead look at last_update for the authoritative shard rather than the get_log_shard. The code would then realize that it needs to get the log from the authoritative shard first, and then have a second pass where it gets the log from the get log shard. Peering would then have information about the partial writes (obtained from the authoritative shard's log) and could correctly roll these writes forward, by deducing that the get_log_shard does not have these log entries because they were partial writes.

Signed-off-by: Bill Scales <[email protected]>
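To make the failure mode concrete, here is a minimal standalone sketch, not Ceph code: ShardInfo and the numeric last_update values are invented stand-ins for pg_info_t and eversion_t, and only the two compared conditions mirror the real code in PeeringState::choose_acting. With the shard states described above, the old comparison never schedules the repeat pass while the fixed comparison does:

    // Hypothetical model of the scenario described in the commit message.
    #include <cassert>
    #include <iostream>
    #include <map>
    #include <string>

    struct ShardInfo {
      int last_update;   // stand-in for pg_info_t::last_update (eversion_t)
      bool nonprimary;   // stand-in for pool.info.is_nonprimary_shard()
    };

    int main() {
      // The primary and shard X missed the recent partial writes; the
      // authoritative shard saw every write.
      std::map<std::string, ShardInfo> all_info = {
        {"primary", {10, false}},  // absent for the recent writes
        {"auth",    {15, false}},  // complete log
        {"X",       {10, true}},   // partial writes skipped it; sparse log
      };

      const ShardInfo &info           = all_info["primary"];
      const ShardInfo &auth_log_shard = all_info["auth"];
      const ShardInfo &get_log_shard  = all_info["X"];

      // Buggy test: the primary is not behind shard X, so no second pass
      // is scheduled and the completed writes get rolled back.
      bool old_check = (info.last_update < get_log_shard.last_update) &&
                       get_log_shard.nonprimary;

      // Fixed test: the primary *is* behind the authoritative shard, so
      // fetch the complete log from it first and repeat choose_acting.
      bool new_check = (info.last_update < auth_log_shard.last_update) &&
                       get_log_shard.nonprimary;

      std::cout << "old check repeats getlog: " << old_check << "\n";  // 0
      std::cout << "new check repeats getlog: " << new_check << "\n";  // 1
      assert(!old_check && new_check);
      return 0;
    }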
1 parent 6365803 commit ac4e092

File tree

1 file changed: +17 −20 lines changed

src/osd/PeeringState.cc

Lines changed: 17 additions & 20 deletions
@@ -2511,26 +2511,6 @@ bool PeeringState::choose_acting(pg_shard_t &get_log_shard_id,
   auto get_log_shard = find_best_info(all_info, restrict_to_up_acting,
                                       false, history_les_bound);
 
-  if ((repeat_getlog != nullptr) &&
-      get_log_shard != all_info.end() &&
-      (info.last_update < get_log_shard->second.last_update) &&
-      pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {
-    // Only EC pools with ec_optimizations enabled:
-    // Our log is behind that of the get_log_shard which is a
-    // non-primary shard and hence may have a sparse log,
-    // get a complete log from the auth_log_shard first then
-    // repeat this step in the state machine to work out what
-    // has to be rolled backwards
-    psdout(10) << "get_log_shard " << get_log_shard->first
-               << " is ahead but is a non_primary shard" << dendl;
-    if (auth_log_shard != all_info.end()) {
-      psdout(10) << "auth_log_shard " << auth_log_shard->first
-                 << " selected instead" << dendl;
-      get_log_shard = auth_log_shard;
-      *repeat_getlog = true;
-    }
-  }
-
   if ((auth_log_shard == all_info.end()) ||
       (get_log_shard == all_info.end())) {
     if (up != acting) {
@@ -2546,6 +2526,23 @@ bool PeeringState::choose_acting(pg_shard_t &get_log_shard_id,
     return false;
   }
 
+  if ((repeat_getlog != nullptr) &&
+      (info.last_update < auth_log_shard->second.last_update) &&
+      pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {
+    // Only EC pools with ec_optimizations enabled:
+    // Our log is behind that of the auth_log_shard and
+    // the get log shard is a non-primary shard and hence may have
+    // a sparse log, get a complete log from the auth_log_shard
+    // first then repeat this step in the state machine to work
+    // out what has to be rolled backwards
+    psdout(10) << "get_log_shard " << get_log_shard->first
+               << " is ahead but is a non_primary shard" << dendl;
+    psdout(10) << "auth_log_shard " << auth_log_shard->first
+               << " selected instead" << dendl;
+    get_log_shard = auth_log_shard;
+    *repeat_getlog = true;
+  }
+
   ceph_assert(!auth_log_shard->second.is_incomplete());
   get_log_shard_id = get_log_shard->first;
