Skip to content

Commit 7491c6c

Browse files
BenHuddlestondaverigby
authored andcommitted
MB-34996: Do not abort prepares with infinite timeout
If we warmup or promote a replica to active then we create SyncWrite objects with an infinite timeout as these prepares MUST be committed as they could have been committed pre-warmup or by the old active. Currently, the code will attempt to abort these prepares if a subsequent topology change makes durability impossible. This breaks durability though as we MUST commit these prepares. Update the code to never attempt to abort a SyncWrite with an infinite timeout. Change-Id: I206ec581e827f1ca12ff5637c0e2fb9aabac4284 Reviewed-on: http://review.couchbase.org/111861 Reviewed-by: Dave Rigby <[email protected]> Tested-by: Build Bot <[email protected]>
1 parent b1c2721 commit 7491c6c

File tree

2 files changed

+49
-6
lines changed

2 files changed

+49
-6
lines changed

engines/ep/src/durability/active_durability_monitor.cc

Lines changed: 17 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1178,16 +1178,28 @@ void ActiveDurabilityMonitor::State::setReplicationTopology(
11781178
}
11791179

11801180
// If durability is not possible for the new chains, then we should abort
1181-
// the in-flight SyncWrites so that the client can decide what to do. We
1182-
// have already reset the topology of the in flight SyncWrites so that they
1183-
// do not contain any invalid pointers post topology change.
1181+
// any in-flight SyncWrites that do not have an infinite timeout so that the
1182+
// client can decide what to do. We do not abort and infinite timeout
1183+
// SyncWrites as we MUST complete them as they exist due to a warmup or
1184+
// Passive->Active transition. We have already reset the topology of the in
1185+
// flight SyncWrites so that they do not contain any invalid pointers post
1186+
// topology change.
11841187
if (!(newFirstChain && newFirstChain->isDurabilityPossible() &&
11851188
(!newSecondChain || newSecondChain->isDurabilityPossible()))) {
11861189
// We can't use a for loop with iterators here because they will be
11871190
// modified to point to invalid memory as we use std::list.splice in
11881191
// removeSyncWrite.
1189-
while (!trackedWrites.empty()) {
1190-
toAbort.enqueue(*this, removeSyncWrite(trackedWrites.begin()));
1192+
auto itr = trackedWrites.begin();
1193+
while (itr != trackedWrites.end()) {
1194+
if (!itr->getDurabilityReqs().getTimeout().isInfinite()) {
1195+
// Grab the next itr before we overwrite ours to point to a
1196+
// different list.
1197+
auto next = std::next(itr);
1198+
toAbort.enqueue(*this, removeSyncWrite(trackedWrites.begin()));
1199+
itr = next;
1200+
} else {
1201+
itr++;
1202+
}
11911203
}
11921204
}
11931205

engines/ep/tests/module_tests/durability_monitor_test.cc

Lines changed: 32 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2719,7 +2719,7 @@ TEST_P(ActiveDurabilityMonitorTest, HPSResetOnTopologyChange) {
27192719
*
27202720
* We have 1 replica that we failover. Topology is changed from
27212721
* {{active, replica1}} to {{active, undefined}}. In this case we should abort
2722-
* any in-flight SyncWrites and not throw assertions.
2722+
* any in-flight SyncWrites with a non-infinite and not throw assertions.
27232723
*/
27242724
TEST_P(ActiveDurabilityMonitorTest,
27252725
DurabilityImpossibleTopologyChangeAbortsInFlightSyncWrites) {
@@ -2742,6 +2742,37 @@ TEST_P(ActiveDurabilityMonitorTest,
27422742
EXPECT_EQ(1, getActiveDM().getNumAborted());
27432743
}
27442744

2745+
/**
2746+
* Failover scenario:
2747+
*
2748+
* We have 1 replica that we failover. Topology is changed from
2749+
* {{active, replica1}} to {{active, undefined}}. In this case we should not
2750+
* abort any in-flight SyncWrites with an infinite timeout as this breaks
2751+
* durability. We create SyncWrites with an infinite timeout at warmup and at
2752+
* promotion from replica to active as we MUST commit these SyncWrites.
2753+
*/
2754+
TEST_P(ActiveDurabilityMonitorTest,
2755+
DurabilityImpossibleTopologyChangeDoesNotAbortsInfiniteTimeoutSyncWrites) {
2756+
// To start, we have 1 chain with active and replica1
2757+
using namespace cb::durability;
2758+
addSyncWrite(1, Requirements{Level::Majority, Timeout::Infinity()});
2759+
{
2760+
SCOPED_TRACE("");
2761+
assertNumTrackedAndHPSAndHCS(1, 1, 0);
2762+
}
2763+
2764+
// Failover
2765+
EXPECT_NO_THROW(getActiveDM().setReplicationTopology(
2766+
nlohmann::json::array({{active, nullptr}})));
2767+
2768+
{
2769+
SCOPED_TRACE("");
2770+
assertNumTrackedAndHPSAndHCS(1, 1, 0);
2771+
}
2772+
EXPECT_EQ(0, getActiveDM().getNumCommitted());
2773+
EXPECT_EQ(0, getActiveDM().getNumAborted());
2774+
}
2775+
27452776
INSTANTIATE_TEST_CASE_P(AllBucketTypes,
27462777
ActiveDurabilityMonitorTest,
27472778
STParameterizedBucketTest::allConfigValues(),

0 commit comments

Comments
 (0)