Blob direct write v1: write-path blob separation with partitioned files (reduced scope)#14535
Blob direct write v1: write-path blob separation with partitioned files (reduced scope)#14535xingbowang wants to merge 9 commits intofacebook:mainfrom
Conversation
|
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98766843. |
1 similar comment
|
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98766843. |
|
| Check | Count |
|---|---|
bugprone-unused-return-value |
2 |
cert-err58-cpp |
1 |
concurrency-mt-unsafe |
12 |
cppcoreguidelines-special-member-functions |
1 |
| Total | 16 |
Details
db/blob/blob_file_cache.cc (2 warning(s))
db/blob/blob_file_cache.cc:169:3: warning: the value returned by this function should not be disregarded; neglecting it may lead to errors [bugprone-unused-return-value]
db/blob/blob_file_cache.cc:211:3: warning: the value returned by this function should not be disregarded; neglecting it may lead to errors [bugprone-unused-return-value]
db/db_impl/db_impl_write.cc (1 warning(s))
db/db_impl/db_impl_write.cc:525:10: warning: class 'BlobWriteRollbackGuard' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator [cppcoreguidelines-special-member-functions]
db_stress_tool/db_stress_gflags.cc (1 warning(s))
db_stress_tool/db_stress_gflags.cc:497:19: warning: initialization of 'FLAGS_blob_direct_write_partitions_dummy' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
db_stress_tool/db_stress_tool.cc (12 warning(s))
db_stress_tool/db_stress_tool.cc:225:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:230:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:236:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:241:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:246:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:252:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:259:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:265:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:271:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:279:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:288:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:294:7: warning: function is not thread safe [concurrency-mt-unsafe]
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 60b4ea8 The final review has been written to Blob Direct Write v1 — Final Synthesized Code ReviewPR: Blob direct write v1: write-path blob separation with partitioned files SummaryThis PR introduces a new feature where large values (>= The implementation is architecturally sound and well-integrated across write, read, and flush paths. Test coverage is good for basic functionality and integration scenarios. However, several correctness and performance issues were identified that warrant attention before merge. HIGH SeverityH1:
|
4d24273 to
4dbcc48
Compare
4dbcc48 to
71b2c50
Compare
TransformBatch iterates the WriteBatch via Iterate(), which can return an error on a corrupted batch before the WAL verification path (MergeBatch -> WriteToWAL -> VerifyChecksum) has a chance to detect it. This caused DbKvChecksumTestMergedBatch tests to fail because the corruption error was returned early from WriteImpl instead of being set as a background error via the WAL write path. Add a quick scan of column families before calling TransformBatch. When no CF has blob_partition_manager (i.e. no BDW enabled), skip the transform entirely.
Summary
This PR introduces blob direct write v1, a reduced-scope write-path optimization where large values (>=
min_blob_size) are written directly to blob files duringPut()and replaced in the memtable with compactBlobIndexreferences. This avoids holding full values in memory until flush time.Motivation
In the existing BlobDB architecture, values are written to the WAL and memtable in their full form and separated into blob files only at flush time. This means:
Blob direct write addresses both: values leave the write path as small
BlobIndexreferences, and multiple partitions (configurable viablob_direct_write_partitions) allow concurrent blob writes with independent locks.Design (v1 — single-writer, WAL-disabled, reduced scope)
The v1 design intentionally keeps scope narrow for correctness and reviewability:
AddRecordcall flushes to the OS immediately.blob_direct_write_partitionsfiles using an atomic counter.New components
BlobFilePartitionManagerBlobWriteBatchTransformerWriteBatch::Handlerthat rewrites qualifyingPutvalues asBlobIndexentries before the batch enters the write group.Write path integration
DBImpl::WriteImplcallsBlobWriteBatchTransformer::TransformBatchbefore entering the writer group (for default write path), or before joining the batch group (for pipelined/unordered write).min_blob_sizeare written to a partition file; the key is stored with aBlobIndexin the transformed batch. A rollback guard marks blob bytes as initial garbage if the write fails.SwitchMemtable,RotateCurrentGenerationmoves active partitions into the next immutable batch.FlushMemTableToOutputFile/AtomicFlushMemTablesToOutputFilescallPrepareFlushAdditionsto seal partition files and collectBlobFileAddition+BlobFileGarbageentries registered to MANIFEST alongside the flush.CancelAllBackgroundWork,WaitForCompactwithclose_db=true) force-flush all CFs with active direct-write managers to ensure blob files are registered before close.Read path
MaybeResolveBlobForWritePathresolvesBlobIndexreferences found in memtable or immutable memtable viaBlobFilePartitionManager::ResolveBlobDirectWriteIndex, which first checks manifest-visible state and falls back to direct blob-file reads viaBlobFileCache.DBIter::BlobReaderis extended with aBlobFilePartitionManager*to resolve direct-write blob indexes during iteration. The unifiedResolveBlobDirectWriteIndexpath handles both manifest-visible and not-yet-flushed files.New options
enable_blob_direct_writefalseenable_blob_files = true. Not dynamically changeable.blob_direct_write_partitions1Feature incompatibilities (reduced v1 scope)
The following features are not supported when
enable_blob_direct_write = true, and are enforced both indb_stress_toolvalidation anddb_crashtest.pysanitization:Write model constraints:
threadsmust be 1 (single writer assumption)allow_concurrent_memtable_write= 0enable_pipelined_write= 0 (transformation done before batch group, but pipelined path supported with pre-transform)two_write_queues= 0unordered_write= 0 (transformation done before batch group, but unordered path supported with pre-transform)WAL and recovery:
disable_wal= 1 (required — WAL replay of unregistered blob files is out of v1 scope)best_efforts_recovery= 0reopen= 0 (no crash-restart with WAL replay)manual_wal_flush_one_in,sync_wal_one_in,lock_wal_one_in,get_sorted_wal_files_one_in,get_current_wal_file_one_in,track_and_verify_wals,rate_limit_auto_wal_flush,recycle_log_file_numBlob GC and dynamic options:
use_blob_db= 0 (stacked BlobDB not supported)allow_setting_blob_options_dynamically= 0enable_blob_garbage_collection= 0blob_compaction_readahead_size= 0blob_file_starting_level= 0Unsupported value types and APIs:
use_merge,use_full_merge_v1) — merge values pass through untransformeduse_put_entity_one_in,use_get_entity,use_multi_get_entity,use_attribute_group)use_timed_put_one_inuser_timestamp_size,persist_user_defined_timestamps,create_timestamped_snapshot_one_in)use_txn,use_optimistic_txn,test_multi_ops_txns,commit_bypass_memtable_one_in) — thoughWriteCommittedTxn::CommitInternalfalls back from bypass-memtable to normal path when BDW is activeIngestWriteBatchWithIndexreturnsNotSupportedinplace_update_support= 0Fault injection:
sync_fault_injection,write_fault_one_in,metadata_write_fault_one_in,read_fault_one_in,metadata_read_fault_one_in,open_*_fault_one_in)Infrastructure/snapshot APIs:
remote_compaction_worker_threads= 0test_secondary= 0backup_one_in= 0checkpoint_one_in= 0get_live_files_apis_one_in= 0ingest_external_file_one_in= 0ingest_wbwi_one_in= 0Tests
db/blob/db_blob_basic_test.cc: ~660 lines of new direct-write unit tests covering basic put/get, multi-partition, flush/compaction, recovery, and error injection.db/blob/blob_file_cache_test.cc: ~96 lines of new tests for direct-write blob file cache behavior.db/write_batch_test.cc: ~96 lines of tests for WriteBatch with blob index entries.utilities/transactions/transaction_test.cc: verifies transaction commit path falls back correctly with direct write enabled.db_stress_tool/: full stress test support with--enable_blob_direct_writeand--blob_direct_write_partitionsflags, integrated intodb_crashtest.pywith 10% random selection alongside regular blob params.Test Plan
Stress test: