
Blob direct write v1: write-path blob separation with partitioned files (reduced scope)#14535

Open
xingbowang wants to merge 9 commits into facebook:main from xingbowang:blob_direct_write_v1

@xingbowang
Contributor

@xingbowang xingbowang commented Mar 30, 2026

Summary

This PR introduces blob direct write v1, a reduced-scope write-path optimization where large values (>= min_blob_size) are written directly to blob files during Put() and replaced in the memtable with compact BlobIndex references. This avoids holding full values in memory until flush time.

Motivation

In the existing BlobDB architecture, values are written to the WAL and memtable in their full form and separated into blob files only at flush time. This means:

  • Large values are held in memory twice (raw in memtable + blob file at flush)
  • Blob I/O is serialized through a single flush thread per column family

Blob direct write addresses both: values leave the write path as small BlobIndex references, and multiple partitions (configurable via blob_direct_write_partitions) allow concurrent blob writes with independent locks.

Design (v1 — single-writer, WAL-disabled, reduced scope)

The v1 design intentionally keeps scope narrow for correctness and reviewability:

  • Single writer thread assumption: no concurrent writes to the same partition file. One logical writer serializes the batch.
  • WAL-disabled: direct-write blob files are only registered in MANIFEST at flush time. WAL replay cannot recover unregistered blob references, so WAL is disabled for this v1.
  • Sync-on-write: each AddRecord call flushes the write buffer to the OS immediately (a buffer flush, not an fsync).
  • FIFO generation batching: each memtable switch creates one generation batch. Direct-write files for that memtable are sealed and registered atomically when the batch is flushed to MANIFEST.
  • Round-robin partitions: blob writes are distributed across blob_direct_write_partitions files using an atomic counter.
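
The round-robin bullet above can be illustrated with a minimal sketch. `RoundRobinPartitionPicker` is a hypothetical name, not a class from this PR; it only shows how an atomic counter can spread writes across a fixed number of partitions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration; not the PR's actual class.
class RoundRobinPartitionPicker {
 public:
  explicit RoundRobinPartitionPicker(uint32_t num_partitions)
      : num_partitions_(num_partitions) {
    assert(num_partitions_ > 0);
  }

  // Returns the partition index to use for the next blob write.
  uint32_t Next() {
    // The counter only spreads load and publishes no other data, so a
    // relaxed increment is enough here.
    const uint64_t ticket = next_.fetch_add(1, std::memory_order_relaxed);
    return static_cast<uint32_t>(ticket % num_partitions_);
  }

 private:
  const uint32_t num_partitions_;
  std::atomic<uint64_t> next_{0};
};
```

With four partitions, successive calls to `Next()` cycle through indexes 0, 1, 2, 3 and then wrap back to 0.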

New components

| Component | Description |
| --- | --- |
| BlobFilePartitionManager | Owns N partition files per CF. Manages open/seal/register lifecycle tied to memtable generations. |
| BlobWriteBatchTransformer | A WriteBatch::Handler that rewrites qualifying Put values as BlobIndex entries before the batch enters the write group. |

Write path integration

  1. DBImpl::WriteImpl calls BlobWriteBatchTransformer::TransformBatch before entering the writer group (for default write path), or before joining the batch group (for pipelined/unordered write).
  2. Values >= min_blob_size are written to a partition file; the key is stored with a BlobIndex in the transformed batch. A rollback guard marks blob bytes as initial garbage if the write fails.
  3. On SwitchMemtable, RotateCurrentGeneration moves active partitions into the next immutable batch.
  4. FlushMemTableToOutputFile / AtomicFlushMemTablesToOutputFiles call PrepareFlushAdditions to seal partition files and collect BlobFileAddition + BlobFileGarbage entries registered to MANIFEST alongside the flush.
  5. Shutdown paths (CancelAllBackgroundWork, WaitForCompact with close_db=true) force-flush all CFs with active direct-write managers to ensure blob files are registered before close.
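
Step 2 of the list above can be sketched in isolation. Everything in this block is a hypothetical stand-in (an in-memory string plays the role of a partition file, and `BlobRef` plays the role of a BlobIndex); it does not use RocksDB types and omits the rollback guard.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a BlobIndex: where the value lives in the file.
struct BlobRef {
  uint64_t offset;  // byte offset of the value in the blob file
  uint64_t size;    // value length in bytes
};

// Hypothetical transformed batch entry.
struct Entry {
  std::string key;
  bool is_blob_ref;   // true: `ref` is valid; false: `value` is inline
  std::string value;  // inline value (small values only)
  BlobRef ref;        // blob reference (large values only)
};

// Appends `value` to the in-memory "blob file" and reports where it landed.
inline BlobRef AppendToBlobFile(std::string* blob_file,
                                const std::string& value) {
  BlobRef ref = {blob_file->size(), value.size()};
  blob_file->append(value);
  return ref;
}

// Rewrites every Put whose value is at least `min_blob_size` bytes into a
// reference; smaller values stay inline.
inline std::vector<Entry> TransformBatch(
    const std::vector<std::pair<std::string, std::string>>& puts,
    uint64_t min_blob_size, std::string* blob_file) {
  assert(blob_file != nullptr);
  std::vector<Entry> out;
  out.reserve(puts.size());
  for (const auto& kv : puts) {
    if (kv.second.size() >= min_blob_size) {
      out.push_back(
          {kv.first, true, "", AppendToBlobFile(blob_file, kv.second)});
    } else {
      out.push_back({kv.first, false, kv.second, BlobRef{0, 0}});
    }
  }
  return out;
}
```

The memtable would then receive only the small `Entry` records, which is the memory saving the PR description aims for.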

Read path

  • Get/MultiGet: MaybeResolveBlobForWritePath resolves BlobIndex references found in memtable or immutable memtable via BlobFilePartitionManager::ResolveBlobDirectWriteIndex, which first checks manifest-visible state and falls back to direct blob-file reads via BlobFileCache.
  • Iterator: DBIter::BlobReader is extended with a BlobFilePartitionManager* to resolve direct-write blob indexes during iteration. The unified ResolveBlobDirectWriteIndex path handles both manifest-visible and not-yet-flushed files.
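
The two-step resolution order described above (manifest-visible state first, then a direct file read) can be modeled with a small sketch. The maps and function names are hypothetical and stand in for the Version state and BlobFileCache; no RocksDB API is used.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical model: blob file number -> file contents.
using BlobFiles = std::unordered_map<uint64_t, std::string>;

struct BlobRef {
  uint64_t file_number;
  uint64_t offset;
  uint64_t size;
};

// Reads the referenced bytes from `files`. Returns false if the file is
// unknown or the requested range is out of bounds.
inline bool ReadFrom(const BlobFiles& files, const BlobRef& ref,
                     std::string* value) {
  assert(value != nullptr);
  auto it = files.find(ref.file_number);
  if (it == files.end()) return false;
  const std::string& data = it->second;
  if (ref.offset > data.size() || ref.size > data.size() - ref.offset) {
    return false;
  }
  value->assign(data, ref.offset, ref.size);
  return true;
}

// Tries files already registered in the manifest first, then falls back to
// direct-write files that have not been registered yet.
inline bool ResolveBlob(const BlobFiles& manifest_visible,
                        const BlobFiles& unregistered, const BlobRef& ref,
                        std::string* value) {
  if (ReadFrom(manifest_visible, ref, value)) return true;
  return ReadFrom(unregistered, ref, value);
}
```

Keeping one resolver for both states mirrors the "unified" path the description mentions: callers do not need to know whether a file has been flushed yet.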

New options

| Option | Default | Description |
| --- | --- | --- |
| enable_blob_direct_write | false | Enable write-path blob separation for this CF. Requires enable_blob_files = true. Not dynamically changeable. |
| blob_direct_write_partitions | 1 | Number of parallel partition files per CF. Not dynamically changeable. |
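
The dependency between the two options can be expressed as a small validation sketch. The struct and function are hypothetical and do not reflect how RocksDB validates options. Only the `enable_blob_files` requirement comes from the table; the lower bound on the partition count is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical subset of per-CF options relevant to blob direct write.
struct CfBlobOptions {
  bool enable_blob_files;
  bool enable_blob_direct_write;
  uint32_t blob_direct_write_partitions;
};

// Returns an empty string if the combination is acceptable, otherwise a
// human-readable reason.
inline std::string ValidateBlobDirectWrite(const CfBlobOptions& opts) {
  if (!opts.enable_blob_direct_write) {
    return "";  // Feature off: nothing to check.
  }
  if (!opts.enable_blob_files) {
    return "enable_blob_direct_write requires enable_blob_files = true";
  }
  if (opts.blob_direct_write_partitions == 0) {
    // Assumed lower bound; the PR description does not state one.
    return "blob_direct_write_partitions must be at least 1";
  }
  return "";
}
```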

Feature incompatibilities (reduced v1 scope)

The following features are not supported when enable_blob_direct_write = true, and are enforced both in db_stress_tool validation and db_crashtest.py sanitization:

Write model constraints:

  • threads must be 1 (single writer assumption)
  • allow_concurrent_memtable_write = 0
  • enable_pipelined_write = 0 (disabled in the stress configuration; the pipelined path itself is handled by transforming the batch before it joins the batch group)
  • two_write_queues = 0
  • unordered_write = 0 (disabled in the stress configuration; the unordered path itself is handled by transforming the batch before it joins the batch group)

WAL and recovery:

  • disable_wal = 1 (required — WAL replay of unregistered blob files is out of v1 scope)
  • best_efforts_recovery = 0
  • reopen = 0 (no crash-restart with WAL replay)
  • All WAL-related stress features disabled: manual_wal_flush_one_in, sync_wal_one_in, lock_wal_one_in, get_sorted_wal_files_one_in, get_current_wal_file_one_in, track_and_verify_wals, rate_limit_auto_wal_flush, recycle_log_file_num

Blob GC and dynamic options:

  • use_blob_db = 0 (stacked BlobDB not supported)
  • allow_setting_blob_options_dynamically = 0
  • enable_blob_garbage_collection = 0
  • blob_compaction_readahead_size = 0
  • blob_file_starting_level = 0

Unsupported value types and APIs:

  • Merge (use_merge, use_full_merge_v1) — merge values pass through untransformed
  • Entity APIs (use_put_entity_one_in, use_get_entity, use_multi_get_entity, use_attribute_group)
  • use_timed_put_one_in
  • User-defined timestamps (user_timestamp_size, persist_user_defined_timestamps, create_timestamped_snapshot_one_in)
  • Transactions (use_txn, use_optimistic_txn, test_multi_ops_txns, commit_bypass_memtable_one_in) — though WriteCommittedTxn::CommitInternal falls back from bypass-memtable to normal path when BDW is active
  • IngestWriteBatchWithIndex returns NotSupported
  • inplace_update_support = 0

Fault injection:

  • All write/read/metadata fault injection disabled (sync_fault_injection, write_fault_one_in, metadata_write_fault_one_in, read_fault_one_in, metadata_read_fault_one_in, open_*_fault_one_in)

Infrastructure/snapshot APIs:

  • remote_compaction_worker_threads = 0
  • test_secondary = 0
  • backup_one_in = 0
  • checkpoint_one_in = 0
  • get_live_files_apis_one_in = 0
  • ingest_external_file_one_in = 0
  • ingest_wbwi_one_in = 0

Tests

  • db/blob/db_blob_basic_test.cc: ~660 lines of new direct-write unit tests covering basic put/get, multi-partition, flush/compaction, recovery, and error injection.
  • db/blob/blob_file_cache_test.cc: ~96 lines of new tests for direct-write blob file cache behavior.
  • db/write_batch_test.cc: ~96 lines of tests for WriteBatch with blob index entries.
  • utilities/transactions/transaction_test.cc: verifies transaction commit path falls back correctly with direct write enabled.
  • db_stress_tool/: full stress test support with --enable_blob_direct_write and --blob_direct_write_partitions flags, integrated into db_crashtest.py with 10% random selection alongside regular blob params.

Test Plan

make -j128 db_blob_basic_test && ./db_blob_basic_test
make -j128 blob_file_cache_test && ./blob_file_cache_test
make -j128 write_batch_test && ./write_batch_test
make -j128 transaction_test && ./transaction_test
make -j128 check

Stress test:

python3 tools/db_crashtest.py blackbox --enable_blob_direct_write=1 \
  --enable_blob_files=1 --blob_direct_write_partitions=4 \
  --disable_wal=1 --threads=1

@meta-cla meta-cla bot added the CLA Signed label Mar 30, 2026
@meta-codesync

meta-codesync bot commented Mar 30, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98766843.

@github-actions

github-actions bot commented Mar 30, 2026

⚠️ clang-tidy: 16 warning(s) on changed lines

Completed in 1334.7s.

Summary by check

| Check | Count |
| --- | --- |
| bugprone-unused-return-value | 2 |
| cert-err58-cpp | 1 |
| concurrency-mt-unsafe | 12 |
| cppcoreguidelines-special-member-functions | 1 |
| Total | 16 |

Details

db/blob/blob_file_cache.cc (2 warning(s))
db/blob/blob_file_cache.cc:169:3: warning: the value returned by this function should not be disregarded; neglecting it may lead to errors [bugprone-unused-return-value]
db/blob/blob_file_cache.cc:211:3: warning: the value returned by this function should not be disregarded; neglecting it may lead to errors [bugprone-unused-return-value]
db/db_impl/db_impl_write.cc (1 warning(s))
db/db_impl/db_impl_write.cc:525:10: warning: class 'BlobWriteRollbackGuard' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator [cppcoreguidelines-special-member-functions]
db_stress_tool/db_stress_gflags.cc (1 warning(s))
db_stress_tool/db_stress_gflags.cc:497:19: warning: initialization of 'FLAGS_blob_direct_write_partitions_dummy' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
db_stress_tool/db_stress_tool.cc (12 warning(s))
db_stress_tool/db_stress_tool.cc:225:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:230:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:236:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:241:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:246:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:252:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:259:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:265:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:271:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:279:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:288:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:294:7: warning: function is not thread safe [concurrency-mt-unsafe]
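
The cppcoreguidelines-special-member-functions warning on `BlobWriteRollbackGuard` is commonly addressed by declaring the copy and move operations explicitly. The sketch below shows one way to do that for a scope guard. `RollbackGuard` is a hypothetical class, not the PR's implementation, and deleting the operations is only one of the acceptable choices.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical scope guard: runs `on_rollback` at scope exit unless
// Dismiss() was called first.
class RollbackGuard {
 public:
  explicit RollbackGuard(std::function<void()> on_rollback)
      : on_rollback_(std::move(on_rollback)) {
    assert(on_rollback_);  // A guard without a callback is a usage error.
  }

  ~RollbackGuard() {
    if (armed_ && on_rollback_) {
      on_rollback_();
    }
  }

  // Declaring all four operations satisfies the check and prevents a copied
  // guard from running the rollback twice.
  RollbackGuard(const RollbackGuard&) = delete;
  RollbackGuard& operator=(const RollbackGuard&) = delete;
  RollbackGuard(RollbackGuard&&) = delete;
  RollbackGuard& operator=(RollbackGuard&&) = delete;

  // Call after the write succeeded so the rollback does not run.
  void Dismiss() { armed_ = false; }

 private:
  std::function<void()> on_rollback_;
  bool armed_ = true;
};
```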

@github-actions

github-actions bot commented Mar 30, 2026

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 60b4ea8


The final review has been written to review-findings.md. Here is the complete review:


Blob Direct Write v1 — Final Synthesized Code Review

PR: Blob direct write v1: write-path blob separation with partitioned files
Author: xingbowang
Scope: 52 files changed, 3621 insertions(+), 139 deletions(-)
Review method: 9-agent parallel review with adversarial debate and manual verification


Summary

This PR introduces a new feature where large values (>= min_blob_size) are written directly to blob files during Put() and replaced in memtables with compact BlobIndex references. The design uses partitioned blob files with round-robin assignment, generation batching tied to memtable switches, and a rollback-as-garbage pattern for failed writes. V1 constraints: single writer thread, WAL disabled at WriteOptions level, buffer flush per record.

The implementation is architecturally sound and well-integrated across write, read, and flush paths. Test coverage is good for basic functionality and integration scenarios. However, several correctness and performance issues were identified that warrant attention before merge.


HIGH Severity

H1: assert(false) fallback in MarkBlobWriteAsGarbage is a no-op in release builds

File: db/blob/blob_file_partition_manager.cc:521

When the rollback function cannot find the target file across partitions, current-generation sealed files, and pending generations, it hits assert(false) followed by a ROCKS_LOG_WARN. In release builds, assert(false) compiles away, making the function silently succeed without marking any garbage. This leaks garbage accounting — the blob bytes are written but never tracked as garbage, causing space amplification.

  assert(false);
  if (info_log_ != nullptr) {
    ROCKS_LOG_WARN(
        info_log_,
        "Could not match failed blob direct-write rollback for file #%" PRIu64,
        file_number);
  }

Recommendation: Return a Status error or at minimum use ROCKS_LOG_ERROR unconditionally. Consider whether this indicates a logic bug that should be a Corruption status.
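
As a rough illustration of this recommendation, the sketch below reports the "file not found" case through the return value, so the outcome is the same in debug and release builds. The map, the enum, and the function signature are hypothetical; the real function searches partitions and pending generations and would likely return a RocksDB Status.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

enum class RollbackResult { kOk, kFileNotFound };

// Hypothetical model: blob file number -> garbage bytes tracked so far.
using GarbageMap = std::unordered_map<uint64_t, uint64_t>;

inline RollbackResult MarkBlobWriteAsGarbage(GarbageMap* garbage_by_file,
                                             uint64_t file_number,
                                             uint64_t bytes) {
  assert(garbage_by_file != nullptr);
  auto it = garbage_by_file->find(file_number);
  if (it == garbage_by_file->end()) {
    // Unlike assert(false), this branch still reports the problem when
    // NDEBUG is defined, so the caller can log or propagate it.
    return RollbackResult::kFileNotFound;
  }
  it->second += bytes;
  return RollbackResult::kOk;
}
```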


H2: Flush failure leaves prepared blob generations in limbo

File: db/db_impl/db_impl_compaction_flush.cc:323-330

When flush_job.Run() fails after PrepareFlushAdditions() has already sealed deferred blob files, the code sets prepared_blob_generations = 0 and skips CommitPreparedGenerations(). This means the sealed generations remain in pending_generations_ but are never consumed. On the next successful flush, PrepareFlushAdditions is called again with a new num_generations count that does not account for the stale entries.

  if (!s.ok() || !unconsumed_additions.empty() ||
      !unconsumed_garbages.empty()) {
    prepared_blob_generations = 0;  // Skip commit — but entries stay in queue
    sealed_blob_numbers.clear();
  }

Consequence: The sealed blob files exist on disk with footers written, but are never registered in MANIFEST. They become orphaned. Subsequent flushes may encounter a count mismatch between expected and available generations.

Recommendation: Add explicit cleanup logic that either re-queues the sealed files for the next flush or marks them as garbage to be cleaned up.
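
One possible shape for the "re-queue" option is sketched below. It is an assumption-laden model, not the PR's code: generations are identified by a hypothetical monotonically increasing id rather than by a count, so a retry after a failed flush naturally picks up generations that were already sealed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical model of the blob files belonging to one memtable generation.
struct Generation {
  uint64_t id;
  std::vector<uint64_t> blob_files;
  bool sealed;
};

class GenerationQueue {
 public:
  // Memtable switch: the active files become the newest pending generation.
  uint64_t Rotate(std::vector<uint64_t> files) {
    Generation g;
    g.id = next_id_++;
    g.blob_files = std::move(files);
    g.sealed = false;
    pending_.push_back(std::move(g));
    return pending_.back().id;
  }

  // Seals every pending generation with id <= up_to_id and returns its
  // files. Safe to call again after a failed flush: generations that are
  // already sealed are simply returned again.
  std::vector<uint64_t> Prepare(uint64_t up_to_id) {
    std::vector<uint64_t> files;
    for (Generation& g : pending_) {
      if (g.id > up_to_id) break;
      g.sealed = true;
      files.insert(files.end(), g.blob_files.begin(), g.blob_files.end());
    }
    return files;
  }

  // Call only after the MANIFEST write succeeded.
  void Commit(uint64_t up_to_id) {
    while (!pending_.empty() && pending_.front().id <= up_to_id) {
      assert(pending_.front().sealed);
      pending_.pop_front();
    }
  }

  size_t PendingCount() const { return pending_.size(); }

 private:
  std::deque<Generation> pending_;
  uint64_t next_id_ = 1;
};
```

In this model a failed flush simply skips `Commit`. The next flush calls `Prepare` with a later id and receives both the stale and the new files, so nothing is orphaned and no count can drift.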


H3: Read-path overhead for non-BDW databases — MaybeResolveBlobForWritePath called unconditionally

File: db/db_impl/db_impl.cc (multiple sites in GetImpl and MultiGetImpl)

Every Get() and MultiGet() call invokes MaybeResolveBlobForWritePath() after memtable and immutable memtable lookups, even when no column family has BDW enabled. The function does exit early when partition_mgr == nullptr, but the function call overhead (parameter setup, stack frame) is imposed on every read operation in every database.

Recommendation: Guard the call sites with if (partition_mgr != nullptr) before calling the function, to eliminate function call overhead entirely when BDW is not enabled for the column family.


H4: Write-path CF scan overhead — linear scan of all column families per write batch

File: db/db_impl/db_impl_write.cc:590-596

The maybe_transform_batch_for_blob_direct_write lambda performs a linear scan of all column families to check if any has a non-null blob_partition_manager(). This runs for every write batch even when BDW is never configured. For databases with many column families, this is O(n) per write.

Recommendation: Cache a DB-level has_any_bdw_cf_ flag, updated only when column families are created/dropped or options are changed. This reduces the check to O(1).
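
A minimal sketch of such a cached check follows. `BdwCfTracker` and its method names are hypothetical. Because the PR description says `enable_blob_direct_write` is not dynamically changeable, this sketch assumes only column family creation and drop need to update the counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical DB-level tracker of how many live column families have blob
// direct write enabled.
class BdwCfTracker {
 public:
  void OnColumnFamilyCreated(bool bdw_enabled) {
    if (bdw_enabled) {
      count_.fetch_add(1);
    }
  }

  void OnColumnFamilyDropped(bool bdw_enabled) {
    if (bdw_enabled) {
      const uint32_t previous = count_.fetch_sub(1);
      assert(previous > 0);
      (void)previous;  // Silences unused-variable warnings under NDEBUG.
    }
  }

  // O(1) check for the write path, replacing a scan over all CFs.
  bool AnyBdwEnabled() const { return count_.load() > 0; }

 private:
  std::atomic<uint32_t> count_{0};
};
```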


MEDIUM Severity

M1: IsValidBlobOffset short-circuit prevents unsigned underflow but relies on || evaluation order

File: db/blob/blob_log_format.h — Safe as written but fragile if refactored. Defensive style would separate into independent if blocks.
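
The defensive style mentioned here can be illustrated as follows. This is not the body of `IsValidBlobOffset`; the function name, parameters, and the two size constants are hypothetical. The point is only that each unsigned subtraction is preceded by its own independent check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sizes, chosen for illustration only.
constexpr uint64_t kHeaderSize = 30;
constexpr uint64_t kFooterSize = 32;

inline bool IsValidOffset(uint64_t value_offset, uint64_t value_size,
                          uint64_t file_size) {
  if (value_offset < kHeaderSize) {
    return false;
  }
  if (file_size < kFooterSize) {
    return false;  // Guards the subtraction on the next line.
  }
  const uint64_t payload_end = file_size - kFooterSize;
  if (value_offset > payload_end) {
    return false;  // Guards the subtraction in the next check.
  }
  if (value_size > payload_end - value_offset) {
    return false;  // Written as a subtraction so the sum cannot overflow.
  }
  return true;
}
```

Because no check depends on another being evaluated first within a single expression, reordering or splitting the conditions during a refactor cannot reintroduce an underflow.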

M2: Compression performed outside mutex, but may be wasted if partition state changes

File: db/blob/blob_file_partition_manager.cc:WriteBlob — Compression runs outside the lock (good), but if the partition's CF/compression type changed before lock acquisition, the compressed data is discarded. Wasted work for large values.

M3: Public API documentation incomplete for restrictions

File: include/rocksdb/advanced_options.h — Missing documentation for: single writer restriction, WAL requirement, MemPurge/timestamp incompatibility, IngestWriteBatchWithIndex restriction.

M4: ReadOnly/Secondary DB nullptr for blob_partition_mgr

Correct for normal operation (flushed data resolved via Version). Severity reduced from agent's "CRITICAL" — should be documented as known limitation.

M5: RefreshBlobFileReader does 3 non-atomic cache operations

File: db/blob/blob_file_cache.cc — Lookup -> Erase -> Insert window allows concurrent cache miss.
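
The window can be closed by performing the replacement in a single critical section. The sketch below uses a plain mutex-protected map as a hypothetical stand-in. RocksDB's cache is a different, sharded structure, so this shows the idea rather than a drop-in fix.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical reader handle; only the cached file size matters here.
struct Reader {
  uint64_t file_size;
};

class ReaderCache {
 public:
  // Returns the cached reader, or nullptr on a miss.
  std::shared_ptr<Reader> Lookup(uint64_t file_number) const {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = readers_.find(file_number);
    return it == readers_.end() ? nullptr : it->second;
  }

  // Replaces the entry in one critical section, so a concurrent Lookup sees
  // either the old reader or the new one, never a missing entry.
  std::shared_ptr<Reader> Refresh(uint64_t file_number,
                                  uint64_t new_file_size) {
    auto fresh = std::make_shared<Reader>(Reader{new_file_size});
    std::lock_guard<std::mutex> lock(mu_);
    readers_[file_number] = fresh;
    return fresh;
  }

 private:
  mutable std::mutex mu_;
  std::unordered_map<uint64_t, std::shared_ptr<Reader>> readers_;
};
```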

M6: Multiple heap allocations per write batch on BDW path

File: db/db_impl/db_impl_write.cc. An unordered_set and multiple vector allocations per batch. Consider SmallVector/InlineVector.


Suggestions

  • S1: Cache any_bdw flag at DB level for O(1) write-path check
  • S2: Clarify kDoFlushEachRecord is buffer flush, not fsync (verified: Flush() not fsync())
  • S3: PutEntitySerialized error path missing save.Rollback()
  • S4: Increase BDW stress test frequency beyond ~10%

Verified Non-Issues (7 False Positives from Agents)

  1. "Missing RotateCurrentGeneration call" — IS called at db_impl_write.cc:2911
  2. "ReadOnly DB broken" — Correctly passes nullptr; flushed data resolves via Version
  3. "IsValidBlobOffset unsigned underflow" — Protected by || short-circuit
  4. "Mutex held during compression" — Compression is OUTSIDE mutex
  5. "fsync per record" — It's Flush() (buffer flush), not fsync()
  6. "WAL not enforced at option level" — Correct for WriteOptions flag
  7. "Transaction DBs bypass BDW" — Intentionally excluded from v1 scope

Design Assessment

The overall architecture is well-conceived. Partitioned files, generation batching, rollback-as-garbage, skip-footer reads, and settings caching are all sound design choices. The v1 constraints are reasonable and the code is structured for incremental improvement.

Recommendation

Approve with required changes for H1 and H2. H3/H4 are performance improvements that should be addressed but are not blocking.

Required before merge:

  1. Fix MarkBlobWriteAsGarbage fallback (H1) — replace assert(false) with proper error handling
  2. Fix flush failure orphan handling (H2) — add cleanup for prepared but uncommitted generations

Strongly recommended:

  1. Add if (partition_mgr != nullptr) guards on read path (H3)
  2. Cache DB-level BDW flag for write path (H4)
  3. Improve public API documentation (M3)

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@xingbowang xingbowang force-pushed the blob_direct_write_v1 branch from 4d24273 to 4dbcc48 Compare March 30, 2026 23:30
@xingbowang xingbowang force-pushed the blob_direct_write_v1 branch from 4dbcc48 to 71b2c50 Compare March 31, 2026 04:46
@xingbowang xingbowang changed the title Add blob direct write with partitioned blob files (v1) Blob direct write v1: write-path blob separation with partitioned files (reduced scope) Mar 31, 2026
TransformBatch iterates the WriteBatch via Iterate(), which can return
an error on a corrupted batch before the WAL verification path
(MergeBatch -> WriteToWAL -> VerifyChecksum) has a chance to detect it.
This caused DbKvChecksumTestMergedBatch tests to fail because the
corruption error was returned early from WriteImpl instead of being
set as a background error via the WAL write path.

Add a quick scan of column families before calling TransformBatch.
When no CF has blob_partition_manager (i.e. no BDW enabled), skip
the transform entirely.