Commit 58937d7

MB-42306 [2/2]: Bias db_data_size by estimate of completed prepares
+Problem+

The couchstore fragmentation calculation does not take completed Prepares into account. As a result, auto-compaction is not run when expected, and hence the completed (no longer needed) prepares are not purged.

This happens because couch_disk_data_size (the size of "valid" data on disk) is calculated directly from couchstore's own count of how much data is in the current B-Tree root. While completed prepares can be purged when compaction runs (and hence are conceptually 'stale' data), completed Prepares are conceptually "valid" data from couchstore's POV: they are just documents with a different key prefix which happen to have a seqno below the high_completed_seqno. As such, couch_disk_data_size includes all prepares, outstanding and completed.

+Solution+

Ideally KV-Engine would track the total size of all on-disk completed Prepares (those which can be purged), and subtract that value from the couchstore-tracked couch_disk_data_size. However, that is costly to track accurately: we would have to perform a B-Tree seqno scan from 0 to the highCompletedSeqno, accumulating the size of all prepares found. That is an O(N) operation, where N is the number of completed prepares, and hence is not suitable for the ~1s polling which ns_server performs.

Alternatively (the approach taken here) we can ignore pending (incomplete) prepares entirely, and assume that all prepares are completed. This is based on the observation that prepares have a maximum timeout of 65s before they are aborted, and most will be Committed much sooner than that.

This patch therefore:

* Exposes new 'prepare_size' stats, which report the number of bytes of all on-disk prepares:
  - ep_db_prepare_size (from the 'all' or 'diskinfo' groups)
  - vb_XXX:db_prepare_size (from the 'vbucket-details' group)
  - vb_XXX:prepare_size (from the 'diskinfo detail' group)

* Uses the newly-tracked on_disk_prepare_bytes to bias the raw couchstore-provided couch_disk_data_size stat. This means the following stats are all reduced by the size of the on-disk prepares:
  - ep_db_data_size (from the 'all' or 'diskinfo' groups)
  - vb_XXX:db_data_size (from the 'vbucket-details' group)
  - vb_XXX:data_size (from the 'diskinfo detail' group)

Change-Id: I4da06fe17a68c96ddd03d027aed9696c5a20def8
Reviewed-on: http://review.couchbase.org/c/kv_engine/+/139312
Tested-by: Dave Rigby <[email protected]>
Reviewed-by: James Harrison <[email protected]>
Well-Formed: Build Bot <[email protected]>
1 parent 1af1507 commit 58937d7

File tree

7 files changed: +78 / -24 lines changed


engines/ep/docs/stats.org

Lines changed: 2 additions & 0 deletions

```diff
@@ -242,6 +242,7 @@ For introductory information on stats within Couchbase, start with the
 |                      | for this bucket                               |
 | ep_db_data_size      | Total size of valid data in db files          |
 | ep_db_file_size      | Total size of the db files                    |
+| ep_db_prepare_size   | Total size of SyncWrite prepares in db files  |
 | ep_degraded_mode     | True if the engine is either warming          |
 |                      | up or data traffic is disabled                |
 | ep_exp_pager_enabled | True if the expiry pager is enabled           |
@@ -560,6 +561,7 @@ The stats below are listed for each vbucket.
 | pending writes  | Total bytes of pending writes             |
 | db_data_size    | Total size of valid data on disk          |
 | db_file_size    | Total size of the db file                 |
+| db_prepare_size | Total size of SyncWrite prepares on disk  |
 | high_seqno      | The last seqno assigned by this vbucket   |
 | purge_seqno     | The last seqno purged by the compactor    |
 | bloom_filter    | Status of the vbucket's bloom filter      |
```

engines/ep/src/couch-kvstore/couch-kvstore.cc

Lines changed: 9 additions & 1 deletion

```diff
@@ -355,6 +355,7 @@ CouchKVStore::CouchKVStore(KVStoreConfig& config,
     cachedDeleteCount.assign(numDbFiles, cb::RelaxedAtomic<size_t>(-1));
     cachedFileSize.assign(numDbFiles, cb::RelaxedAtomic<uint64_t>(0));
     cachedSpaceUsed.assign(numDbFiles, cb::RelaxedAtomic<uint64_t>(0));
+    cachedOnDiskPrepareSize.assign(numDbFiles, 0);
     cachedVBStates.resize(numDbFiles);

     initialize();
@@ -472,6 +473,7 @@ void CouchKVStore::reset(Vbid vbucketId) {
     cachedDeleteCount[vbucketId.get()] = 0;
     cachedFileSize[vbucketId.get()] = 0;
     cachedSpaceUsed[vbucketId.get()] = 0;
+    cachedOnDiskPrepareSize[vbucketId.get()] = 0;

     // Unlink the current revision and then increment it to ensure any
     // pending delete doesn't delete us. Note that the expectation is that
@@ -1242,6 +1244,7 @@ bool CouchKVStore::compactDBInternal(compaction_ctx* hook_ctx,
         state->onDiskPrepares -= hook_ctx->stats.preparesPurged;
         state->updateOnDiskPrepareBytes(
                 -int64_t(hook_ctx->stats.prepareBytesPurged));
+        cachedOnDiskPrepareSize[vbid.get()] = state->getOnDiskPrepareBytes();
     }

     logger.debug("INFO: created new couch db file, name:{} rev:{}",
@@ -2417,6 +2420,7 @@ couchstore_error_t CouchKVStore::saveDocs(Vbid vbid,
     cachedFileSize[vbid.get()] = info.file_size;
     cachedDeleteCount[vbid.get()] = info.deleted_count;
     cachedDocCount[vbid.get()] = info.doc_count;
+    cachedOnDiskPrepareSize[vbid.get()] = state->getOnDiskPrepareBytes();

     // Check seqno if we wrote documents
     if (docs.size() > 0 && maxDBSeqno != info.last_sequence) {
@@ -2891,7 +2895,9 @@ size_t CouchKVStore::getNumPersistedDeletes(Vbid vbid) {

 DBFileInfo CouchKVStore::getDbFileInfo(Vbid vbid) {
     DbInfo info = getDbInfo(vbid);
-    return DBFileInfo{info.file_size, info.space_used};
+    return DBFileInfo{info.file_size,
+                      info.space_used,
+                      cachedOnDiskPrepareSize[vbid.get()]};
 }

 DBFileInfo CouchKVStore::getAggrDbFileInfo() {
@@ -2904,6 +2910,7 @@ DBFileInfo CouchKVStore::getAggrDbFileInfo() {
     for (uint16_t vbid = 0; vbid < numDbFiles; vbid++) {
         kvsFileInfo.fileSize += cachedFileSize[vbid].load();
         kvsFileInfo.spaceUsed += cachedSpaceUsed[vbid].load();
+        kvsFileInfo.prepareBytes += cachedOnDiskPrepareSize[vbid].load();
     }
     return kvsFileInfo;
 }
@@ -3223,6 +3230,7 @@ uint64_t CouchKVStore::prepareToDeleteImpl(Vbid vbid) {
     cachedDeleteCount[vbid.get()] = 0;
     cachedFileSize[vbid.get()] = 0;
     cachedSpaceUsed[vbid.get()] = 0;
+    cachedOnDiskPrepareSize[vbid.get()] = 0;
     return getDbRevision(vbid);
 }
```

engines/ep/src/couch-kvstore/couch-kvstore.h

Lines changed: 3 additions & 0 deletions

```diff
@@ -828,6 +828,9 @@ class CouchKVStore : public KVStore
     std::vector<cb::RelaxedAtomic<size_t>> cachedDeleteCount;
     std::vector<cb::RelaxedAtomic<uint64_t>> cachedFileSize;
     std::vector<cb::RelaxedAtomic<uint64_t>> cachedSpaceUsed;
+    /// Size of on-disk prepares, indexed by vBucket. RelaxedAtomic to allow
+    /// stats access without lock.
+    std::vector<cb::RelaxedAtomic<size_t>> cachedOnDiskPrepareSize;

     /* pending file deletions */
     folly::Synchronized<std::queue<std::string>> pendingFileDeletions;
```

engines/ep/src/ep_bucket.cc

Lines changed: 13 additions & 3 deletions

```diff
@@ -1219,10 +1219,16 @@ ENGINE_ERROR_CODE EPBucket::getFileStats(const void* cookie,
                 getRWUnderlyingByShard(shardId)->getAggrDbFileInfo();
         totalInfo.spaceUsed += dbInfo.spaceUsed;
         totalInfo.fileSize += dbInfo.fileSize;
+        totalInfo.prepareBytes += dbInfo.prepareBytes;
     }

-    add_casted_stat("ep_db_data_size", totalInfo.spaceUsed, add_stat, cookie);
+    add_casted_stat("ep_db_data_size",
+                    totalInfo.getEstimatedLiveData(),
+                    add_stat,
+                    cookie);
     add_casted_stat("ep_db_file_size", totalInfo.fileSize, add_stat, cookie);
+    add_casted_stat(
+            "ep_db_prepare_size", totalInfo.prepareBytes, add_stat, cookie);

     return ENGINE_SUCCESS;
 }
@@ -1244,10 +1250,14 @@ ENGINE_ERROR_CODE EPBucket::getPerVBucketDiskStats(const void* cookie,

             checked_snprintf(
                     buf, sizeof(buf), "vb_%d:data_size", vbid.get());
-            add_casted_stat(buf, dbInfo.spaceUsed, add_stat, cookie);
+            add_casted_stat(
+                    buf, dbInfo.getEstimatedLiveData(), add_stat, cookie);
             checked_snprintf(
                     buf, sizeof(buf), "vb_%d:file_size", vbid.get());
             add_casted_stat(buf, dbInfo.fileSize, add_stat, cookie);
+            checked_snprintf(
+                    buf, sizeof(buf), "vb_%d:prepare_size", vbid.get());
+            add_casted_stat(buf, dbInfo.prepareBytes, add_stat, cookie);
         } catch (std::exception& error) {
             EP_LOG_WARN(
                     "DiskStatVisitor::visitBucket: Failed to build stat: "
@@ -1944,4 +1954,4 @@ bool EPBucket::isValidBucketDurabilityLevel(cb::durability::Level level) const {
         return true;
     }
     folly::assume_unreachable();
-}
+}
```

engines/ep/src/ep_vb.cc

Lines changed: 5 additions & 8 deletions

```diff
@@ -422,16 +422,12 @@ void EPVBucket::addStats(VBucketStatsDetailLevel detail,
     _addStats(detail, add_stat, c);

     if (detail == VBucketStatsDetailLevel::Full) {
-        uint64_t spaceUsed = 0;
-        uint64_t fileSize = 0;
+        DBFileInfo fileInfo;

         // Only try to read disk if we believe the file has been created
         if (!isBucketCreation()) {
             try {
-                DBFileInfo fileInfo =
-                        shard->getRWUnderlying()->getDbFileInfo(getId());
-                spaceUsed = fileInfo.spaceUsed;
-                fileSize = fileInfo.fileSize;
+                fileInfo = shard->getRWUnderlying()->getDbFileInfo(getId());
             } catch (std::runtime_error& e) {
                 EP_LOG_WARN(
                         "VBucket::addStats: Exception caught during "
@@ -441,8 +437,9 @@ void EPVBucket::addStats(VBucketStatsDetailLevel detail,
                         e.what());
             }
         }
-        addStat("db_data_size", spaceUsed, add_stat, c);
-        addStat("db_file_size", fileSize, add_stat, c);
+        addStat("db_data_size", fileInfo.getEstimatedLiveData(), add_stat, c);
+        addStat("db_file_size", fileInfo.fileSize, add_stat, c);
+        addStat("db_prepare_size", fileInfo.prepareBytes, add_stat, c);
     }
 }
```

engines/ep/src/kvstore.h

Lines changed: 36 additions & 7 deletions

```diff
@@ -145,14 +145,43 @@ class NoLookupCallback : public StatusCallback<CacheLookup> {
 };

 struct DBFileInfo {
-    DBFileInfo() :
-        fileSize(0), spaceUsed(0) { }
+    /// Total size of the file (what 'stat()' would return). Includes both
+    /// current data (spaceUsed) plus any previous data which is no longer
+    /// referenced in the current file header.
+    uint64_t fileSize = 0;
+
+    /// Total size of "current" data in the file - the sum of all
+    /// keys+metadata+values (including deleted docs) plus the overheads to
+    /// manage it (indexes such as B-Trees, headers etc).
+    uint64_t spaceUsed = 0;
+
+    /// Total size of all SyncWrite prepares, both completed and pending.
+    /// This can be used to adjust spaceUsed to give an estimate of how much
+    /// data in the file is actually needed - completed prepares are no
+    /// longer needed and can be purged during compaction - as such they can
+    /// be considered part of the "Fragmented" count.
+    uint64_t prepareBytes = 0;

-    DBFileInfo(uint64_t fileSize_, uint64_t spaceUsed_)
-        : fileSize(fileSize_), spaceUsed(spaceUsed_) {}
-
-    uint64_t fileSize;
-    uint64_t spaceUsed;
+    /**
+     * @returns An estimate of the number of bytes which are "live" data and
+     * hence are not subject to being discarded during compaction. This
+     * is calculated as the size of the current data (spaceUsed), minus an
+     * estimate of the size of completed prepares (which will be purged on
+     * compaction).
+     * Note: All prepared SyncWrites (completed and in-progress) are used as
+     * an estimate for completed sync writes, given (a) it's difficult
+     * to track exactly how many prepares have been completed and (b)
+     * in general we expect the overwhelming majority of on-disk prepares
+     * to be completed.
+     */
+    uint64_t getEstimatedLiveData() const {
+        if (spaceUsed > prepareBytes) {
+            // Sanity check - if prepareBytes is somehow larger than
+            // spaceUsed then skip the adjustment.
+            return spaceUsed - prepareBytes;
+        }
+        return spaceUsed;
+    }
 };

 enum scan_error_t {
```
engines/ep/tests/ep_testsuite.cc

Lines changed: 10 additions & 5 deletions

```diff
@@ -7512,9 +7512,11 @@ static enum test_result test_mb19687_fixed(EngineIface* h) {
     // Using explicit initializer lists due to http://stackoverflow
     // .com/questions/36557969/invalid-iterator-range-while-inserting
     // -initializer-list-to-an-stdvector
-    eng_stats.insert(eng_stats.end(),
-                     std::initializer_list<std::string>{"ep_db_data_size",
-                                                        "ep_db_file_size"});
+    eng_stats.insert(
+            eng_stats.end(),
+            std::initializer_list<std::string>{"ep_db_data_size",
+                                               "ep_db_file_size",
+                                               "ep_db_prepare_size"});
     eng_stats.insert(eng_stats.end(),
                      std::initializer_list<std::string>{"ep_flusher_state",
                                                         "ep_flusher_todo"});
@@ -7544,8 +7546,10 @@ static enum test_result test_mb19687_fixed(EngineIface* h) {
     eng_stats.insert(eng_stats.end(), persistentConfig);

     // 'diskinfo and 'diskinfo detail' keys should be present now.
-    statsKeys["diskinfo"] = {"ep_db_data_size", "ep_db_file_size"};
-    statsKeys["diskinfo detail"] = {"vb_0:data_size", "vb_0:file_size"};
+    statsKeys["diskinfo"] = {
+            "ep_db_data_size", "ep_db_file_size", "ep_db_prepare_size"};
+    statsKeys["diskinfo detail"] = {
+            "vb_0:data_size", "vb_0:file_size", "vb_0:prepare_size"};

     // Add stats which are only available for persistent buckets:
     static const char* persistence_stats[] = {
@@ -7562,6 +7566,7 @@ static enum test_result test_mb19687_fixed(EngineIface* h) {
     auto& vb_details = statsKeys.at("vbucket-details 0");
     vb_details.push_back("vb_0:db_data_size");
     vb_details.push_back("vb_0:db_file_size");
+    vb_details.push_back("vb_0:db_prepare_size");

     // Config variables only valid for persistent
     auto& config_stats = statsKeys.at("config");
```
