MB-50874: Reset snap start if less than lastSeqno on new checkpoint creation

daverigby · daverigby · commit bfa0dd847026 · 2022-02-15T14:12:06.000Z
+Problem+

If a replica vBucket is promoted to active, and the last DCP message it
received was a Snapshot Marker which had the first mutation
de-duplicated, then the snapshot start of the newly-promoted active ends
up greater than its last seqno (lastBySeqno). Upon the next Flusher run
(i.e. next mutation to the vBucket), the Flusher throws an exception
when trying to fetch items, which terminates KV-Engine (as the exception
is thrown on a BG thread):

    CheckpointManager::queueDirty: lastBySeqno not in snapshot range. vb:0 state:active snapshotStart:12 lastBySeqno:11 snapshotEnd:11 genSeqno:Yes checkpointList.size():2

+Details+

When streaming data from an Active to a Replica vBucket, the extent of
the Checkpoint is sent via DCP using a SnapshotMarker message, followed
by N Mutation / Deletion messages. The snapshot marker may be
discontinuous compared to the previous one if any de-duplication
occurred within the Checkpoint - for example if document "key" was
written sufficient times in quick succession, one could end up with the
following two Checkpoints on the active and subsequent DCP
SnapshotMarker sent to the replica:

    CheckpointManager[0x108a03080] with numItems:6 checkpoints:2
        Checkpoint[0x10891f000] with id:2 seqno:{1,10} snap:{0,10, visible:10} state:CHECKPOINT_CLOSED numCursors:1 type:Memory hcs:-- items:[
            {10,mutation,cid:0x0:deduplicated_key,119,}
            {11,checkpoint_end,cid:0x1:checkpoint_end,119,[m]}
        ]
        Checkpoint[0x10891fa00] with id:3 seqno:{11,12} snap:{10,12, visible:12} state:CHECKPOINT_OPEN numCursors:1 type:Memory hcs:-- items:[
            {11,checkpoint_start,cid:0x1:checkpoint_start,121,[m]}
            {12,mutation,cid:0x0:deduplicated_key,130,}
        ]

Note how there are just two mutations remaining (at seqnos 10 and 12),
and that there is a seqno "gap" at 11 (ignore meta-items, which are not
sent over DCP).

When this is replicated over DCP it will be sent as:

* DCP_SNAPSHOT_MARKER(start:0, end:10, flags=CHK)
* DCP_MUTATION(seqno:10, ...)
* DCP_SNAPSHOT_MARKER(start:12, end:12, flags=CHK)
* DCP_MUTATION(seqno:12, ...)

Note that the second SnapshotMarker being flagged as "CHK" (Checkpoint)
is essential - we need the replica to end up creating a new Checkpoint
with the start and end controlled by the active. A SnapshotMarker
without that flag is insufficient, as it just extends the existing
checkpoint, increasing the checkpoint end but leaving the start
unaffected.

Once these messages are replicated over DCP the replica vBucket should
have equivalent state to the active. However, if the last DCP_MUTATION
is not received - for example if the active node is being failed over
and the stream is closed before the DCP_MUTATION - then the state of the
replica, crucially the Open checkpoint, is as follows:

    Checkpoint[0x10cecde00] with id:2 seqno:{11,11} snap:{12,12, visible:12} state:CHECKPOINT_OPEN numCursors:0 type:Memory hcs:-- items:[
        {11,checkpoint_start,cid:0x1:checkpoint_start,121,[m]}
    ]

When this sequence occurs, the seqno range (11,11) in the open
Checkpoint is less than the snapshot range (12,12). This is problematic
as we have essentially broken an invariant on Checkpoints - that all
items within them are between the snapshot start and end.

This doesn't immediately cause a problem, but if this vBucket is
converted to Active and starts accepting mutations itself, it will start
generating seqnos from the last seqno received - 10 in this case. This
results in the next mutation being assigned seqno 11, which is below the
snapshot start of 12; when the flusher is woken and attempts to flush,
an exception is thrown on the BG thread and KV-Engine crashes.

+Solution+

The cleanest way to solve this would be to ensure that the
SnapshotMarker has a start equal to the start of the source Checkpoint -
i.e. 11 instead of 12. That is indeed what has been done to address
MB-50333, which is a SyncWrite variant of this issue. However that is a
medium-sized change and affects the actual data sent over the wire, so
it is more risky for a maintenance fix.

Instead this patch takes a more targeted approach - when we create a
new Checkpoint during setVBucketState, we reset the snapshot start to
lastBySeqno + 1 if it is greater than lastBySeqno.

Change-Id: Icc6176a3634944800271be0d9d05949c63b29bf4
Reviewed-on: https://review.couchbase.org/c/kv_engine/+/170268
Well-Formed: Restriction Checker
Tested-by: Build Bot <build@couchbase.com>
Reviewed-by: Ben Huddleston <ben.huddleston@couchbase.com>
Reviewed-by: Paolo Cocchi <paolo.cocchi@couchbase.com>
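
The broken invariant and the adjustment made by this patch can be illustrated with a small standalone model. This is only a sketch, not KV-Engine code: `OpenCheckpoint`, `MiniCheckpointManager`, `resetSnapStartIfAhead()` and the simplified `queueDirty()` are hypothetical names introduced for illustration. It also assumes that an active vBucket moves the snapshot end to each newly generated seqno, which is inferred from the logged state (snapshotStart:12 lastBySeqno:11 snapshotEnd:11) rather than taken from the real implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an open Checkpoint; only the snapshot range
// is modelled.
struct OpenCheckpoint {
    uint64_t snapStart;
    uint64_t snapEnd;
};

// Hypothetical, heavily simplified stand-in for CheckpointManager.
struct MiniCheckpointManager {
    int64_t lastBySeqno;
    OpenCheckpoint open;

    // Mirrors the check this patch adds to createNewCheckpoint(): if the
    // snapshot start is ahead of lastBySeqno, pull it back to the next
    // seqno that will be generated.
    void resetSnapStartIfAhead() {
        assert(lastBySeqno >= 0);
        if (static_cast<uint64_t>(lastBySeqno) < open.snapStart) {
            open.snapStart = static_cast<uint64_t>(lastBySeqno) + 1;
        }
    }

    // Simplified model of an active vBucket queueing one mutation.
    // Assumption: the active generates lastBySeqno + 1 and moves the
    // snapshot end to that seqno before checking the snapshot range.
    uint64_t queueDirty() {
        assert(lastBySeqno >= 0);
        const auto seqno = static_cast<uint64_t>(lastBySeqno) + 1;
        lastBySeqno = static_cast<int64_t>(seqno);
        open.snapEnd = seqno;
        if (seqno < open.snapStart || seqno > open.snapEnd) {
            throw std::logic_error(
                    "lastBySeqno not in snapshot range. snapshotStart:" +
                    std::to_string(open.snapStart) + " lastBySeqno:" +
                    std::to_string(seqno) + " snapshotEnd:" +
                    std::to_string(open.snapEnd));
        }
        return seqno;
    }
};
```

Starting from lastBySeqno 10 and an open snapshot of {12,12} (the replica state described above), calling `queueDirty()` directly throws in this model, whereas calling `resetSnapStartIfAhead()` first moves the snapshot start to 11 so the mutation at seqno 11 falls inside the range.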
diff --git a/engines/ep/src/checkpoint_config.h b/engines/ep/src/checkpoint_config.h
@@ -62,6 +62,8 @@ class CheckpointConfig {
         return persistenceEnabled;
     }
 
+    static void addConfigChangeListener(EventuallyPersistentEngine& engine);
+
 protected:
     friend class CheckpointConfigChangeListener;
     friend class EventuallyPersistentEngine;
@@ -82,8 +84,6 @@ class CheckpointConfig {
         keepClosedCheckpoints = value;
     }
 
-    static void addConfigChangeListener(EventuallyPersistentEngine& engine);
-
 private:
     class ChangeListener;
 
diff --git a/engines/ep/src/checkpoint_manager.cc b/engines/ep/src/checkpoint_manager.cc
@@ -1239,6 +1239,11 @@ size_t CheckpointManager::getNumOpenChkItems() const {
     return getOpenCheckpoint_UNLOCKED(lh).getNumItems();
 }
 
+size_t CheckpointManager::getNumCheckpoints() const {
+    LockHolder lh(queueLock);
+    return checkpointList.size();
+}
+
 uint64_t CheckpointManager::checkOpenCheckpoint_UNLOCKED(const LockHolder& lh,
                                                          bool forceCreation,
                                                          bool timeBound) {
@@ -1456,13 +1461,30 @@ uint64_t CheckpointManager::createNewCheckpoint(bool force) {
     LockHolder lh(queueLock);
 
     const auto& openCkpt = getOpenCheckpoint_UNLOCKED(lh);
+    if (openCkpt.getNumItems() > 0 || force) {
+        addNewCheckpoint_UNLOCKED(openCkpt.getId() + 1);
+    }
 
-    if (openCkpt.getNumItems() == 0 && !force) {
-        return openCkpt.getId();
+    auto& openCkpt2 = getOpenCheckpoint_UNLOCKED(lh);
+
+    /* MB-50874: Ensure that the snapshot start of our newly-active
+     * checkpoint is not greater than CheckpointManager::lastBySeqno + 1.
+     * Note in Neo this issue no longer occurs as the snap_start is sent
+     * correctly - see MB-50333.
+     */
+    if (static_cast<uint64_t>(lastBySeqno) <
+        openCkpt2.getSnapshotStartSeqno()) {
+        EP_LOG_INFO(
+                "CheckpointManager::createNewCheckpoint(): {} Found "
+                "lastBySeqno:{} less than snapStart:{}, adjusting snapStart to {}",
+                vbucketId,
+                lastBySeqno,
+                openCkpt2.getSnapshotStartSeqno(),
+                lastBySeqno + 1);
+        openCkpt2.setSnapshotStartSeqno(lastBySeqno + 1);
     }
 
-    addNewCheckpoint_UNLOCKED(openCkpt.getId() + 1);
-    return getOpenCheckpointId_UNLOCKED(lh);
+    return openCkpt2.getId();
 }
 
 uint64_t CheckpointManager::getPersistenceCursorPreChkId() {
@@ -1682,4 +1704,4 @@ FlushHandle::~FlushHandle() {
     }
     // Flush-success path
     manager.removeBackupPersistenceCursor();
-}
+}
diff --git a/engines/ep/src/checkpoint_manager.h b/engines/ep/src/checkpoint_manager.h
@@ -331,6 +331,9 @@ class CheckpointManager {
      */
     size_t getNumOpenChkItems() const;
 
+    /// @returns the number of Checkpoints this Manager has.
+    size_t getNumCheckpoints() const;
+
     /* WARNING! This method can return inaccurate counts - see MB-28431. It
      * at *least* can suffer from overcounting by at least 1 (in scenarios as
      * yet not clear).
diff --git a/engines/ep/tests/mock/mock_synchronous_ep_engine.cc b/engines/ep/tests/mock/mock_synchronous_ep_engine.cc
@@ -67,6 +67,7 @@ SynchronousEPEngine::SynchronousEPEngine(std::string extra_config)
 
     // checkpointConfig is needed by CheckpointManager (via EPStore).
     checkpointConfig = std::make_unique<CheckpointConfig>(*this);
+    CheckpointConfig::addConfigChangeListener(*this);
 
     dcpFlowControlManager_ = std::make_unique<DcpFlowControlManager>(*this);
 
diff --git a/engines/ep/tests/module_tests/dcp_reflection_test.cc b/engines/ep/tests/module_tests/dcp_reflection_test.cc
@@ -152,6 +152,9 @@ class DCPLoopbackStreamTest : public SingleThreadedKVBucketTest {
 
         void transferResponseMessage();
 
+        /// Inject a CloseStream message into the consumer side of the route.
+        void closeStreamAtConsumer();
+
         std::pair<ActiveStream*, MockPassiveStream*> getStreams();
 
         Vbid vbid;
@@ -485,6 +488,10 @@ void DCPLoopbackStreamTest::DcpRoute::transferResponseMessage() {
     }
 }
 
+void DCPLoopbackStreamTest::DcpRoute::closeStreamAtConsumer() {
+    this->consumer->closeStream(0, vbid, {});
+}
+
 std::pair<cb::engine_errc, uint64_t>
 DCPLoopbackStreamTest::DcpRoute::doStreamRequest(int flags) {
     // Do the add_stream
@@ -996,6 +1003,71 @@ TEST_F(DCPLoopbackStreamTest, MB_36948_SnapshotEndsOnPrepare) {
     EXPECT_EQ(2, replicaVB->checkpointManager->getVisibleSnapshotEndSeqno());
 }
 
+/**
+ * Regression test for MB-50874 - a scenario where a replica:
+ *    1. receives a DCP snapshot marker which has the first seqno de-duplicated
+ *    2. DCP stream is closed (e.g. ns_server failing over the active)
+ *    3. vbucket is promoted to active
+ *
+ * This results in a Checkpoint where the snapshot start - updated from
+ * SnapshotMarker at (1) - is greater than the lastBySeqno and this ends
+ * up throwing an exception in the Flusher when we next persist anything.
+ */
+TEST_F(DCPLoopbackStreamTest, MB50874_DeDuplicatedMutationsReplicaToActive) {
+    // We need a new checkpoint (MARKER_FLAG_CHK set) when the active node
+    // generates markers - reduce chkMaxItems to the minimum to simplify this.
+    engines[Node0]->getConfiguration().setChkMaxItems(MIN_CHECKPOINT_ITEMS);
+
+    // Setup - fill up the initial checkpoint with items, so when we
+    // queue the next mutation a new checkpoint is created.
+    for (int i = 0; i < MIN_CHECKPOINT_ITEMS; i++) {
+        auto key = makeStoredDocKey("key_" + std::to_string(i));
+        ASSERT_EQ(ENGINE_SUCCESS, storeSet(key));
+    }
+    auto srcVB = engines[Node0]->getVBucket(vbid);
+    ASSERT_EQ(1, srcVB->checkpointManager->getNumCheckpoints());
+
+    // Now modify one more key, which should create a new Checkpoint.
+    auto key = makeStoredDocKey("deduplicated_key");
+    ASSERT_EQ(ENGINE_SUCCESS, storeSet(key));
+    // ... and modify again so we de-duplicate and have a seqno gap.
+    ASSERT_EQ(ENGINE_SUCCESS, storeSet(key));
+
+    // Sanity check our state - should have a 2nd checkpoint now.
+    ASSERT_EQ(2, srcVB->checkpointManager->getNumCheckpoints());
+
+    // Create a DCP connection between node0 and 1, and stream the initial
+    // marker and the 10 mutations.
+    auto route0_1 = createDcpRoute(Node0, Node1);
+    ASSERT_EQ(cb::engine_errc::success, route0_1.doStreamRequest().first);
+    route0_1.transferSnapshotMarker(
+            0, 10, MARKER_FLAG_MEMORY | MARKER_FLAG_CHK);
+    for (int i = 0; i < MIN_CHECKPOINT_ITEMS; i++) {
+        route0_1.transferMessage(DcpResponse::Event::Mutation);
+    }
+
+    // Test - transfer the snapshot marker (but no mutations), then close stream
+    // and promote to active; and try to accept a new mutation.
+    route0_1.transferSnapshotMarker(
+            12, 12, MARKER_FLAG_MEMORY | MARKER_FLAG_CHK);
+
+    route0_1.closeStreamAtConsumer();
+    engines[Node1]->getKVBucket()->setVBucketState(vbid, vbucket_state_active);
+
+    // Prior to the fix, this check fails.
+    auto& dstCkptMgr = *engines[Node1]->getVBucket(vbid)->checkpointManager;
+    EXPECT_LE(dstCkptMgr.getOpenSnapshotStartSeqno(),
+              dstCkptMgr.getHighSeqno() + 1)
+            << "Checkpoint start should be less than or equal to next seqno to "
+               "be assigned (highSeqno + 1)";
+
+    // Prior to the fix, this throws std::logic_error from
+    // CheckpointManager::queueDirty as lastBySeqno is outside snapshot range.
+    EXPECT_EQ(ENGINE_SUCCESS,
+              engines[Node1]->getKVBucket()->set(
+                      *makeCommittedItem(key, "value"), cookie));
+}
+
 TEST_F(DCPLoopbackStreamTest, MB_41255_dcp_delete_evicted_xattr) {
     auto k1 = makeStoredDocKey("k1");
     EXPECT_EQ(ENGINE_SUCCESS, storeSet(k1, true /*xattr*/));
@@ -1183,4 +1255,4 @@ TEST_P(DCPLoopbackSnapshots, testSnapshots) {
 
 INSTANTIATE_TEST_CASE_P(DCPLoopbackSnapshot,
                         DCPLoopbackSnapshots,
-                        ::testing::Range(1, 10), );
+                        ::testing::Range(1, 10), );
