
Commit f406696

jimwwalker authored and daverigby committed
MB-35003: Set fail-over seqno to be the end seqno of previous complete snapshot
This commit changes how the flushVBucket function updates the current snapshot range for pending/replica vbuckets only. Note the flusher is shared between active/pending/replica vbuckets and as such these changes are live for all vbucket states; however, active vbuckets always overwrite the snapshot range with the high flushed seqno.

This description first covers some examples of how the flusher managed the snapshot range prior to the change. *Note*: in all of these examples the current/final snapshot is partially received.

ex1: Before this commit the flusher very much followed what the checkpoint manager told it, e.g. a partially received snapshot {4,8} resulted in the following disk state:

    vbstate.range = {4,8}
    seqno index = 1,2,3,4,5,6

ex2: Before this commit multiple checkpoint snapshots could be flushed as a combined set of items, e.g. receipt of snapshots {1,3} and {4,8} followed by a flush resulted in the following disk state:

    vbstate.range = {1,8}
    seqno index = 1,2,3,4,5,6

ex3: A further important example comes from MB-35003 itself: when the producer switches from 'backfill' to 'in-memory', the first in-memory snapshot is now tagged with the 'checkpoint' flag, which was introduced in MB-35001. Before MB-35001 a disk snapshot followed by a memory snapshot looked as follows: {1,3, disk} and {4,8, memory}. When a snapshot is received without the checkpoint flag, the snapshot items just enter the current checkpoint, so once the flusher is past the {1,3} snapshot the subsequent snapshot is simply an extension of the current one. Tagging snapshots with the checkpoint flag means a new checkpoint is opened, which can yield a new outcome. With MB-35001, the addition of the checkpoint flag means the following is now a possible outcome of the flusher:

    vbstate.range = {5,8}
    seqno index = 1,2,3,4,5,6

An important outcome of this commit is what happens during replica to active promotion, and where we set the seqno of the fail-over table entry. The logic is as follows:

    if (highSeqno == vbstate.range.end) {
        newEntry.seqno = highSeqno;
    } else {
        newEntry.seqno = vbstate.range.start;
    }

With the examples above, promotion to active yields the following new fail-over entry seqno:

    ex1: 4
    ex2: 1
    ex3: 5

In all of these examples, because of the partial snapshot, the fail-over entry seqno is always the start of a snapshot, and in ex3 it is the start of an artificial checkpoint.

This commit changes the outcome of all of these examples: instead of the start of the partial snapshot, the fail-over entry seqno becomes the end of the last complete snapshot. To achieve this the flusher now gets more information about the set of items it is flushing. The checkpoint manager is changed so that the flusher receives:

* The entire set of items to flush.
* A vector of snapshot_range_t, one for each individual checkpoint that makes up the set of items.

As the flusher iterates through the set of items to flush, the seqno of each flushed item is compared against the end seqno of each snapshot. If there is a match, the flusher concludes it has all the items of that particular snapshot and can now change the start seqno of the vbucket's range to be that of the completed snapshot. Each example from above now changes to have the following outcomes.
    ex1: vbstate.range = {3,8}  seqno index = 1,2,3,4,5,6  fail-over = 3
    ex2: vbstate.range = {3,8}  seqno index = 1,2,3,4,5,6  fail-over = 3
    ex3: vbstate.range = {3,8}  seqno index = 1,2,3,4,5,6  fail-over = 3

A final notable difference that this commit makes is that once the flusher has absolutely flushed to the very end of the range, the state now looks as follows:

    vbstate.range = {8,8}
    seqno index = 1,2,3,4,5,6,7,8

I.e. start = end = high-seqno. There seems to be no advantage to adapting the flusher further to maintain the prior behaviour, where in the fully flushed scenario range.start reflected some lower value: in all failure and fail-over scenarios, when the high-seqno matches range.end, range.start is not used.

Change-Id: I54e3851378a9e19ad350fc17741fa19dfa69b2fa
Reviewed-on: http://review.couchbase.org/113433
Reviewed-by: Dave Rigby <[email protected]>
Tested-by: Dave Rigby <[email protected]>
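The promotion logic quoted in the commit message can be illustrated with a small standalone C++ sketch. The names below (Range, failoverEntrySeqno) are illustrative only, not the actual VBucket or failover-table API:

    #include <cstdint>
    #include <iostream>

    // Hypothetical, simplified view of the persisted snapshot range.
    struct Range {
        uint64_t start;
        uint64_t end;
    };

    // Sketch of the fail-over entry seqno selection described above: if the
    // vbucket has flushed right up to the end of the persisted range, the
    // high seqno itself is a safe fail-over point; otherwise fall back to
    // range.start, which after this commit is the end of the last complete
    // snapshot.
    uint64_t failoverEntrySeqno(uint64_t highSeqno, Range persisted) {
        return (highSeqno == persisted.end) ? highSeqno : persisted.start;
    }

    int main() {
        // ex1-ex3 after this commit: range {3,8}, high seqno 6 -> entry 3.
        std::cout << failoverEntrySeqno(6, {3, 8}) << "\n";
        // Fully flushed case: range {8,8}, high seqno 8 -> entry 8.
        std::cout << failoverEntrySeqno(8, {8, 8}) << "\n";
        return 0;
    }

Before this change the same promotion logic applied to the old persisted ranges would have produced 4, 1 and 5 respectively, as listed in the commit message.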
1 parent 3ad8757 commit f406696

13 files changed: +489 −108 lines changed


docs/dcp/documentation/commands/snapshot-marker.md

Lines changed: 44 additions & 0 deletions

@@ -127,3 +127,47 @@ If data in this packet is malformed or incomplete then this error is returned.
 **(Disconnect)**
 
 If this message is sent to a connection that is not a consumer.
+
+### Implementation notes
+
+The implementation of DCP has led to some inconsistencies in the way that the
+snapshot marker assigns the value of "Start Seqno", depending on the context.
+
+Note that [stream-request](stream-request.md) defines "Start Seqno" to be the
+maximum sequence number that the client has received. A request with a start
+seqno of X means "I have X, please start my stream at the sequence number
+after X".
+
+#### Memory snapshot-marker.start-seqno equals seqno of first transmitted mutation
+
+A stream which is transferring in-memory checkpoint data sets the
+`snapshot-marker.start-seqno` to the seqno of the first mutation that will
+follow the marker. This matches the semantics of stream-request, where the
+start-seqno is something the client already has.
+
+Thus a client which performs a stream-request with a start-seqno of X, where
+due to de-duplication X+n is the first sequence number available (from
+memory), will receive:
+
+* TX `stream-request{start-seqno=X}`
+* RX `stream-request-response{success}`
+* RX `snapshot-marker{start=X+n, end=Y, flags=0x1}`
+* RX `mutation{seqno:X+n}`
+
+#### Disk snapshot-marker.start-seqno equals stream-request.start-seqno
+
+The difference here is that when a stream-request has to backfill from disk,
+the `0x02 disk` snapshot marker has its start-seqno set to the client's
+requested start-seqno. The returned mutations are correct per the definition
+of stream-request, but the snapshot-marker could be viewed as inconsistent
+with the stream-request definition and with the in-memory case. For example:
+
+* TX `stream-request{start-seqno=X}`
+* RX `stream-request-response{success}`
+* RX `snapshot-marker{start=X, end=Y, flags=0x2}`
+* RX `mutation{seqno:X+n}`
+
+Note: A stream could at any time switch from memory to disk if the client is
+deemed to be slow.
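A minimal sketch of how a consumer might interpret the marker's start-seqno for the two cases documented above. This is not the actual DCP client code; the SnapshotMarker struct and describeMarker function are assumed names, only the 0x1 (memory) and 0x2 (disk) flag values come from the documentation:

    #include <cstdint>
    #include <iostream>

    // Hypothetical marker representation; flag values match the documented
    // 0x1 (memory) and 0x2 (disk) snapshot-marker flags.
    struct SnapshotMarker {
        uint64_t start;
        uint64_t end;
        uint32_t flags;
    };

    // Given the start-seqno the client asked for, report how the marker's
    // start-seqno should be read per the notes above.
    void describeMarker(uint64_t requestedStart, const SnapshotMarker& m) {
        if (m.flags & 0x2) {
            // Disk backfill: start-seqno echoes the client's requested
            // seqno, which the client already has.
            std::cout << "disk snapshot, start==requested (" << m.start
                      << ")\n";
        } else {
            // Memory: start-seqno is the seqno of the first mutation that
            // will follow the marker (may be > requested due to
            // de-duplication).
            std::cout << "memory snapshot, first mutation at seqno "
                      << m.start << " (requested " << requestedStart
                      << ")\n";
        }
    }

    int main() {
        describeMarker(10, {13, 20, 0x1}); // memory: first mutation is 13
        describeMarker(10, {10, 20, 0x2}); // disk: start echoes the request
        return 0;
    }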

engines/ep/src/checkpoint_manager.cc

Lines changed: 29 additions & 21 deletions

@@ -833,9 +833,7 @@ void CheckpointManager::queueSetVBState(VBucket& vb) {
 
 CheckpointManager::ItemsForCursor CheckpointManager::getNextItemsForCursor(
         CheckpointCursor* cursor, std::vector<queued_item>& items) {
-    auto result = getItemsForCursor(
-            cursor, items, std::numeric_limits<size_t>::max());
-    return result;
+    return getItemsForCursor(cursor, items, std::numeric_limits<size_t>::max());
 }
 
 CheckpointManager::ItemsForCursor CheckpointManager::getItemsForCursor(
@@ -844,20 +842,20 @@ CheckpointManager::ItemsForCursor CheckpointManager::getItemsForCursor(
         size_t approxLimit) {
     LockHolder lh(queueLock);
     if (!cursorPtr) {
-        EP_LOG_WARN("getNextItemsForCursor(): Caller had a null cursor {}",
+        EP_LOG_WARN("getItemsForCursor(): Caller had a null cursor {}",
                     vbucketId);
-        return {0, 0};
+        return {};
    }
 
     auto& cursor = *cursorPtr;
 
     // Fetch whole checkpoints; as long as we don't exceed the approx item
     // limit.
-    ItemsForCursor result((*cursor.currentCheckpoint)->getSnapshotStartSeqno(),
-                          (*cursor.currentCheckpoint)->getSnapshotEndSeqno(),
-                          (*cursor.currentCheckpoint)->getCheckpointType());
+    ItemsForCursor result((*cursor.currentCheckpoint)->getCheckpointType(),
+                          (*cursor.currentCheckpoint)->getHighCompletedSeqno());
 
     size_t itemCount = 0;
+    bool enteredNewCp = true;
     while ((result.moreAvailable = incrCursor(cursor))) {
         // We only want to return items from contiguous checkpoints with the
         // same type. We should not return Memory checkpoint items followed by
@@ -868,12 +866,20 @@ CheckpointManager::ItemsForCursor CheckpointManager::getItemsForCursor(
             result.checkpointType) {
             break;
         }
+        if (enteredNewCp) {
+            result.ranges.push_back(
+                    {(*cursor.currentCheckpoint)->getSnapshotStartSeqno(),
+                     (*cursor.currentCheckpoint)->getSnapshotEndSeqno()});
+            enteredNewCp = false;
+        }
 
         queued_item& qi = *(cursor.currentPos);
         items.push_back(qi);
         itemCount++;
 
         if (qi->getOperation() == queue_op::checkpoint_end) {
+            enteredNewCp = true; // the next incrCursor will move to a new CP
+
             // Only move the HCS at checkpoint end (don't want to flush a
             // HCS mid-checkpoint).
             result.highCompletedSeqno =
@@ -883,8 +889,6 @@ CheckpointManager::ItemsForCursor CheckpointManager::getItemsForCursor(
             // our limit.
             if (itemCount >= approxLimit) {
                 // Reached our limit - don't want any more items.
-                result.range.setEnd(
-                        (*cursor.currentCheckpoint)->getSnapshotEndSeqno());
 
                 // However, we *do* want to move the cursor into the next
                 // checkpoint if possible; as that means the checkpoint we just
@@ -894,19 +898,23 @@ CheckpointManager::ItemsForCursor CheckpointManager::getItemsForCursor(
                 break;
             }
         }
-        // May have moved into a new checkpoint - update range.end.
-        result.range.setEnd((*cursor.currentCheckpoint)->getSnapshotEndSeqno());
     }
 
-    EP_LOG_DEBUG(
-            "CheckpointManager::getNextItemsForCursor() "
-            "cursor:{} result:{{#items:{} range:{{{}, {}}} "
-            "moreAvailable:{}}}",
-            cursor.name,
-            uint64_t(itemCount),
-            result.range.getStart(),
-            result.range.getEnd(),
-            result.moreAvailable ? "true" : "false");
+    if (globalBucketLogger->should_log(spdlog::level::debug)) {
+        std::stringstream ranges;
+        for (const auto& range : result.ranges) {
+            ranges << "{" << range.getStart() << "," << range.getEnd() << "}";
+        }
+        EP_LOG_DEBUG(
+                "CheckpointManager::getItemsForCursor() "
+                "cursor:{} result:{{#items:{} ranges:size:{} {} "
+                "moreAvailable:{}}}",
+                cursor.name,
+                uint64_t(itemCount),
+                result.ranges.size(),
+                ranges.str(),
+                result.moreAvailable ? "true" : "false");
+    }
 
     cursor.numVisits++;
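The per-checkpoint range collection added above can be pictured with a simplified standalone sketch; Range, Checkpoint and collectRanges below are stand-ins for illustration, not the real CheckpointManager types:

    #include <cstdint>
    #include <vector>

    // Simplified stand-ins: only what is needed to show the idea of one
    // snapshot_range_t being recorded per checkpoint visited.
    struct Range {
        uint64_t start;
        uint64_t end;
    };
    struct Checkpoint {
        Range snapshot;
        std::vector<uint64_t> itemSeqnos; // last item is the checkpoint_end
    };

    // Mirrors the accumulation getItemsForCursor() now performs: a range is
    // pushed the first time the cursor enters each checkpoint, so the caller
    // receives one range per checkpoint contributing to the item set.
    std::vector<Range> collectRanges(const std::vector<Checkpoint>& cps,
                                     std::vector<uint64_t>& items) {
        std::vector<Range> ranges;
        for (const auto& cp : cps) {
            ranges.push_back(cp.snapshot); // entered a new checkpoint
            for (auto seqno : cp.itemSeqnos) {
                items.push_back(seqno);
            }
        }
        return ranges;
    }

    int main() {
        std::vector<uint64_t> items;
        // Checkpoints {1,3} and {4,8}: the caller gets both ranges plus
        // all six items, rather than a single merged {1,8} range.
        auto ranges = collectRanges(
                {{{1, 3}, {1, 2, 3}}, {{4, 8}, {4, 5, 6}}}, items);
        return ranges.size() == 2 ? 0 : 1;
    }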

engines/ep/src/checkpoint_manager.h

Lines changed: 7 additions & 8 deletions

@@ -55,17 +55,16 @@ class CheckpointManager {
 public:
     typedef std::shared_ptr<Callback<Vbid>> FlusherCallback;
 
-    /// Return type of getItemsForCursor()
+    /// Return type of getNextItemsForCursor()
     struct ItemsForCursor {
-        ItemsForCursor(uint64_t start,
-                       uint64_t end,
-                       CheckpointType checkpointType = CheckpointType::Memory,
-                       boost::optional<uint64_t> highCompletedSeqno = {})
-            : range(start, end),
-              checkpointType(checkpointType),
+        ItemsForCursor() {
+        }
+        ItemsForCursor(CheckpointType checkpointType,
+                       boost::optional<uint64_t> highCompletedSeqno)
+            : checkpointType(checkpointType),
              highCompletedSeqno(highCompletedSeqno) {
         }
-        snapshot_range_t range;
+        std::vector<snapshot_range_t> ranges;
        bool moreAvailable = {false};
        CheckpointType checkpointType = CheckpointType::Memory;
engines/ep/src/ep_bucket.cc

Lines changed: 36 additions & 8 deletions

@@ -364,7 +364,8 @@ std::pair<bool, size_t> EPBucket::flushVBucket(Vbid vbid) {
         // a single flush.
         auto toFlush = vb->getItemsToPersist(flusherBatchSplitTrigger);
         auto& items = toFlush.items;
-        auto& range = toFlush.range;
+        // The range becomes initialised only when an item is flushed
+        boost::optional<snapshot_range_t> range;
         moreAvailable = toFlush.moreAvailable;
 
         KVStore* rwUnderlying = getRWUnderlying(vb->getId());
@@ -403,8 +404,6 @@ std::pair<bool, size_t> EPBucket::flushVBucket(Vbid vbid) {
             uint64_t maxSeqno = 0;
             auto minSeqno = std::numeric_limits<uint64_t>::max();
 
-            range.setStart(std::max(range.getStart(), vbstate.lastSnapStart));
-
             bool mustCheckpointVBState = false;
 
             Collections::VB::Flush collectionFlush(vb->getManifest());
@@ -492,6 +491,35 @@ std::pair<bool, size_t> EPBucket::flushVBucket(Vbid vbid) {
                     }
                     ++stats.flusher_todo;
 
+                    if (!range.is_initialized()) {
+                        range = snapshot_range_t{
+                                vbstate.lastSnapStart,
+                                toFlush.ranges.empty()
+                                        ? vbstate.lastSnapEnd
+                                        : toFlush.ranges.back().getEnd()};
+                    }
+
+                    // Is the item the end item of one of the ranges we're
+                    // flushing? Note all the work here only affects replica VBs
+                    auto itr = std::find_if(
+                            toFlush.ranges.begin(),
+                            toFlush.ranges.end(),
+                            [&item](auto& range) {
+                                return uint64_t(item->getBySeqno()) ==
+                                       range.getEnd();
+                            });
+
+                    // If this is the end item, we can adjust the start of our
+                    // flushed range, which would be used for failure purposes.
+                    // Primarily by bringing the start to be a consistent point
+                    // allows for promotion to active to set the fail-over table
+                    // to a consistent point.
+                    if (itr != toFlush.ranges.end()) {
+                        // Use std::max as the flusher is not visiting in seqno
+                        // order.
+                        range->setStart(
+                                std::max(range->getStart(), itr->getEnd()));
+                    }
                 } else {
                     // Item is the same key as the previous[1] one - don't need
                     // to flush to disk.
@@ -521,9 +549,9 @@ std::pair<bool, size_t> EPBucket::flushVBucket(Vbid vbid) {
 
             // only update the snapshot range if items were flushed, i.e.
            // don't appear to be in a snapshot when you have no data for it
-            if (items_flushed) {
-                vbstate.lastSnapStart = range.getStart();
-                vbstate.lastSnapEnd = range.getEnd();
+            if (range) {
+                vbstate.lastSnapStart = range->getStart();
+                vbstate.lastSnapEnd = range->getEnd();
             }
             // Track the lowest seqno written in spock and record it as
             // the HLC epoch, a seqno which we can be sure the value has a
@@ -582,8 +610,8 @@ std::pair<bool, size_t> EPBucket::flushVBucket(Vbid vbid) {
         if (vb->rejectQueue.empty()) {
             // only update the snapshot range if items were flushed, i.e.
             // don't appear to be in a snapshot when you have no data for it
-            if (items_flushed) {
-                vb->setPersistedSnapshot(range.getStart(), range.getEnd());
+            if (range) {
+                vb->setPersistedSnapshot(*range);
             }
             uint64_t highSeqno = rwUnderlying->getLastPersistedSeqno(vbid);
             if (highSeqno > 0 && highSeqno != vb->getPersistenceSeqno()) {
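The new flusher behaviour can be condensed into a standalone sketch; persistedRangeAfterFlush and Range are illustrative names rather than the ep-engine API, and the sketch assumes the caller supplies the per-checkpoint ranges plus the seqnos actually flushed:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Range {
        uint64_t start;
        uint64_t end;
    };

    // Sketch of the logic added above for replica/pending vbuckets: start
    // from {lastSnapStart, end-of-last-range}, then every time an item whose
    // seqno equals the end of one of the flushed checkpoints is persisted,
    // pull range.start forward to that completed snapshot's end.
    Range persistedRangeAfterFlush(uint64_t lastSnapStart,
                                   uint64_t lastSnapEnd,
                                   const std::vector<Range>& flushRanges,
                                   const std::vector<uint64_t>& flushedSeqnos) {
        Range range{lastSnapStart,
                    flushRanges.empty() ? lastSnapEnd
                                        : flushRanges.back().end};
        for (auto seqno : flushedSeqnos) {
            for (const auto& r : flushRanges) {
                if (seqno == r.end) {
                    // std::max because the flusher does not visit in seqno
                    // order.
                    range.start = std::max(range.start, r.end);
                }
            }
        }
        return range;
    }

    int main() {
        // Snapshots {1,3} and {4,8} with seqnos 1..6 flushed: result {3,8},
        // matching ex1-ex3 in the commit message.
        auto r = persistedRangeAfterFlush(
                1, 8, {{1, 3}, {4, 8}}, {1, 2, 3, 4, 5, 6});
        return (r.start == 3 && r.end == 8) ? 0 : 1;
    }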

engines/ep/src/ep_types.cc

Lines changed: 8 additions & 0 deletions

@@ -117,3 +117,11 @@ std::ostream& operator<<(std::ostream& os, TransferVB transfer) {
     throw std::invalid_argument("operator<<(TransferVB) unknown value " +
                                 std::to_string(static_cast<int>(transfer)));
 }
+
+std::ostream& operator<<(std::ostream& os, const snapshot_range_t& range) {
+    return os << "{" << range.getStart() << "," << range.getEnd() << "}";
+}
+
+std::ostream& operator<<(std::ostream& os, const snapshot_info_t& info) {
+    return os << "start:" << info.start << ", range:" << info.range;
+}
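For reference, a tiny self-contained example of the text these operators produce; the Range struct here is a local mirror used only to show the "{start,end}" formatting, not the real snapshot_range_t:

    #include <cstdint>
    #include <iostream>

    // Local stand-in with the same fields, purely to show the "{start,end}"
    // text produced by the new operator<< overloads.
    struct Range {
        uint64_t start;
        uint64_t end;
    };

    std::ostream& operator<<(std::ostream& os, const Range& r) {
        return os << "{" << r.start << "," << r.end << "}";
    }

    int main() {
        std::cout << Range{3, 8} << "\n"; // prints {3,8}
        return 0;
    }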

engines/ep/src/ep_types.h

Lines changed: 4 additions & 0 deletions

@@ -113,6 +113,8 @@ struct snapshot_range_t {
     uint64_t end;
 };
 
+std::ostream& operator<<(std::ostream&, const snapshot_range_t&);
+
 struct snapshot_info_t {
     snapshot_info_t(uint64_t start, snapshot_range_t range)
         : start(start), range(range) {
@@ -121,6 +123,8 @@ struct snapshot_info_t {
     snapshot_range_t range;
 };
 
+std::ostream& operator<<(std::ostream&, const snapshot_info_t&);
+
 /**
  * The following options can be specified
  * for retrieving an item for get calls

engines/ep/src/kv_bucket.cc

Lines changed: 1 addition & 1 deletion

@@ -2008,7 +2008,7 @@ void KVBucket::reset() {
         vb->ht.clear();
         vb->checkpointManager->clear(vb->getState());
         vb->resetStats();
-        vb->setPersistedSnapshot(0, 0);
+        vb->setPersistedSnapshot({0, 0});
         EP_LOG_INFO("KVBucket::reset(): Successfully flushed {}", vbid);
     }
 }

engines/ep/src/vbucket.cc

Lines changed: 9 additions & 19 deletions

@@ -209,8 +209,7 @@ VBucket::VBucket(Vbid i,
       initialState(initState),
       purge_seqno(purgeSeqno),
       takeover_backed_up(false),
-      persisted_snapshot_start(lastSnapStart),
-      persisted_snapshot_end(lastSnapEnd),
+      persistedRange(lastSnapStart, lastSnapEnd),
       receivingInitialDiskSnapshot(false),
       rollbackItemCount(0),
       hlc(maxCas,
@@ -240,17 +239,14 @@ VBucket::VBucket(Vbid i,
     setupSyncReplication(replTopology);
 
     EP_LOG_INFO(
-            "VBucket: created {} with state:{} "
-            "initialState:{} lastSeqno:{} lastSnapshot:{{{},{}}} "
-            "persisted_snapshot:{{{},{}}} max_cas:{} uuid:{} topology:{}",
+            "VBucket: created {} with state:{} initialState:{} lastSeqno:{} "
+            "persistedRange:{{{},{}}} max_cas:{} uuid:{} topology:{}",
             id,
             VBucket::toString(state),
             VBucket::toString(initialState),
             lastSeqno,
-            lastSnapStart,
-            lastSnapEnd,
-            persisted_snapshot_start,
-            persisted_snapshot_end,
+            persistedRange.getStart(),
+            persistedRange.getEnd(),
             getMaxCas(),
             failovers ? std::to_string(failovers->getLatestUUID()) : "<>",
             replicationTopology.rlock()->dump());
@@ -406,19 +402,13 @@ VBucket::ItemsToFlush VBucket::getItemsToPersist(size_t approxLimit) {
         auto _begin_ = std::chrono::steady_clock::now();
         auto ckptItems = checkpointManager->getItemsForPersistence(
                 result.items, ckptMgrLimit);
-        result.range = ckptItems.range;
+        result.ranges = std::move(ckptItems.ranges);
         result.highCompletedSeqno = ckptItems.highCompletedSeqno;
         ckptItemsAvailable = ckptItems.moreAvailable;
         stats.persistenceCursorGetItemsHisto.add(
                 std::chrono::duration_cast<std::chrono::microseconds>(
                         std::chrono::steady_clock::now() - _begin_));
-    } else {
-        // We haven't got sufficient remaining capacity to read items from
-        // CheckpoitnManager, therefore we must assume that there /could/
-        // more data to follow (leaving ckptItemsAvailable true). We also must
-        // ensure the valid snapshot range is returned
-        result.range = checkpointManager->getSnapshotInfo().range;
-    }
+    } // else result.ranges is empty, all items from rejectQueue
 
     // Check if there's any more items remaining.
     result.moreAvailable = !rejectQueue.empty() || ckptItemsAvailable;
@@ -2840,8 +2830,8 @@ void VBucket::postProcessRollback(const RollbackResult& rollbackResult,
                                   uint64_t prevHighSeqno) {
     failovers->pruneEntries(rollbackResult.highSeqno);
     checkpointManager->clear(*this, rollbackResult.highSeqno);
-    setPersistedSnapshot(rollbackResult.snapStartSeqno,
-                         rollbackResult.snapEndSeqno);
+    setPersistedSnapshot(
+            {rollbackResult.snapStartSeqno, rollbackResult.snapEndSeqno});
     incrRollbackItemCount(prevHighSeqno - rollbackResult.highSeqno);
     checkpointManager->setOpenCheckpointId(1);
     setReceivingInitialDiskSnapshot(false);

engines/ep/src/vbucket.h

Lines changed: 5 additions & 7 deletions

@@ -207,15 +207,14 @@ class VBucket : public std::enable_shared_from_this<VBucket> {
         purge_seqno = to;
     }
 
-    void setPersistedSnapshot(uint64_t start, uint64_t end) {
+    void setPersistedSnapshot(const snapshot_range_t& range) {
         LockHolder lh(snapshotMutex);
-        persisted_snapshot_start = start;
-        persisted_snapshot_end = end;
+        persistedRange = range;
     }
 
     snapshot_range_t getPersistedSnapshot() const {
         LockHolder lh(snapshotMutex);
-        return {persisted_snapshot_start, persisted_snapshot_end};
+        return persistedRange;
     }
 
     uint64_t getMaxCas() const {
@@ -469,7 +468,7 @@ class VBucket : public std::enable_shared_from_this<VBucket> {
 
     struct ItemsToFlush {
         std::vector<queued_item> items;
-        snapshot_range_t range{0, 0};
+        std::vector<snapshot_range_t> ranges;
         bool moreAvailable = false;
         boost::optional<uint64_t> highCompletedSeqno = {};
     };
@@ -2281,8 +2280,7 @@ class VBucket : public std::enable_shared_from_this<VBucket> {
     /* snapshotMutex is used to update/read the pair {start, end} atomically,
        but not if reading a single field. */
     mutable std::mutex snapshotMutex;
-    uint64_t persisted_snapshot_start;
-    uint64_t persisted_snapshot_end;
+    snapshot_range_t persistedRange;
 
     /*
     * When a vbucket is in the middle of receiving the initial disk snapshot
