Blob direct write v1: write-path blob separation with partitioned files (reduced scope) by xingbowang · Pull Request #14535 · facebook/rocksdb

xingbowang · 2026-03-30T18:09:03Z

Summary

This PR introduces blob direct write v1, a reduced-scope write-path optimization where large values (>= min_blob_size) are written directly to blob files during Put() and replaced in the memtable with compact BlobIndex references. This avoids holding full values in memory until flush time.

Motivation

In the existing BlobDB architecture, values are written to the WAL and memtable in their full form and separated into blob files only at flush time. This means:

Large values are held in memory twice (raw in memtable + blob file at flush)
Blob I/O is serialized through a single flush thread per column family

Blob direct write addresses both: values leave the write path as small BlobIndex references, and multiple partitions (configurable via blob_direct_write_partitions) allow concurrent blob writes with independent locks.

Design (v1 — single-writer, WAL-disabled, reduced scope)

The v1 design intentionally keeps scope narrow for correctness and reviewability:

Single writer thread assumption: no concurrent writes to the same partition file. One logical writer serializes the batch.
WAL-disabled: direct-write blob files are only registered in MANIFEST at flush time. WAL replay cannot recover unregistered blob references, so WAL is disabled for this v1.
Sync-on-write: each AddRecord call flushes to the OS immediately.
FIFO generation batching: each memtable switch creates one generation batch. Direct-write files for that memtable are sealed and registered atomically when the batch is flushed to MANIFEST.
Round-robin partitions: blob writes are distributed across blob_direct_write_partitions files using an atomic counter.
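
The round-robin assignment can be pictured with a short sketch. `PartitionPicker` is a hypothetical name, not a class from this PR; the sketch assumes only what the description states, namely an atomic counter reduced modulo the partition count.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not the PR's code: picks a partition index by
// incrementing a shared atomic counter and reducing it modulo the
// configured partition count.
class PartitionPicker {
 public:
  explicit PartitionPicker(uint32_t num_partitions)
      : num_partitions_(num_partitions) {
    assert(num_partitions_ > 0);
  }

  // Relaxed ordering is enough here: each caller only needs a distinct
  // ticket, not ordering relative to other memory operations.
  uint32_t Next() {
    const uint64_t ticket = next_.fetch_add(1, std::memory_order_relaxed);
    return static_cast<uint32_t>(ticket % num_partitions_);
  }

 private:
  const uint32_t num_partitions_;
  std::atomic<uint64_t> next_{0};
};
```

With four partitions, successive calls to `Next()` yield 0, 1, 2, 3 and then wrap back to 0.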

New components

| Component | Description |
| --- | --- |
| `BlobFilePartitionManager` | Owns N partition files per CF. Manages open/seal/register lifecycle tied to memtable generations. |
| `BlobWriteBatchTransformer` | A `WriteBatch::Handler` that rewrites qualifying `Put` values as `BlobIndex` entries before the batch enters the write group. |

Write path integration

DBImpl::WriteImpl calls BlobWriteBatchTransformer::TransformBatch before entering the writer group (for default write path), or before joining the batch group (for pipelined/unordered write).
Values >= min_blob_size are written to a partition file; the key is stored with a BlobIndex in the transformed batch. A rollback guard marks blob bytes as initial garbage if the write fails.
On SwitchMemtable, RotateCurrentGeneration moves active partitions into the next immutable batch.
FlushMemTableToOutputFile / AtomicFlushMemTablesToOutputFiles call PrepareFlushAdditions to seal partition files and collect BlobFileAddition + BlobFileGarbage entries registered to MANIFEST alongside the flush.
Shutdown paths (CancelAllBackgroundWork, WaitForCompact with close_db=true) force-flush all CFs with active direct-write managers to ensure blob files are registered before close.
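
The size-threshold rewrite at the heart of the write path can be sketched as below. Every type here (`BlobRef`, `Entry`, `FakeBlobFile`) is an illustrative stand-in invented for the example; the real `BlobIndex` and `WriteBatch` encodings are different, and this free function only borrows the name `TransformBatch` to show the idea.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-in for a blob reference (not the real BlobIndex).
struct BlobRef {
  uint64_t file_number = 0;
  uint64_t offset = 0;
  uint64_t size = 0;
};

// One transformed batch entry: either an inline value or a blob reference.
struct Entry {
  std::string key;
  std::string inline_value;  // used when is_blob == false
  BlobRef ref;               // used when is_blob == true
  bool is_blob = false;
};

// Pretend blob file: an append-only byte buffer with a file number.
struct FakeBlobFile {
  uint64_t file_number;
  std::string bytes;
};

// Values at or above min_blob_size are appended to the blob file and
// replaced by a reference; smaller values stay inline.
std::vector<Entry> TransformBatch(
    const std::vector<std::pair<std::string, std::string>>& puts,
    uint64_t min_blob_size, FakeBlobFile* file) {
  assert(file != nullptr);
  std::vector<Entry> out;
  out.reserve(puts.size());
  for (const auto& kv : puts) {
    Entry e;
    e.key = kv.first;
    if (kv.second.size() >= min_blob_size) {
      e.is_blob = true;
      e.ref.file_number = file->file_number;
      e.ref.offset = file->bytes.size();  // record start before appending
      e.ref.size = kv.second.size();
      file->bytes.append(kv.second);
    } else {
      e.inline_value = kv.second;
    }
    out.push_back(std::move(e));
  }
  return out;
}
```

The sketch omits the rollback guard, partition selection, and generation tracking that the list above describes.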

Read path

Get/MultiGet: MaybeResolveBlobForWritePath resolves BlobIndex references found in memtable or immutable memtable via BlobFilePartitionManager::ResolveBlobDirectWriteIndex, which first checks manifest-visible state and falls back to direct blob-file reads via BlobFileCache.
Iterator: DBIter::BlobReader is extended with a BlobFilePartitionManager* to resolve direct-write blob indexes during iteration. The unified ResolveBlobDirectWriteIndex path handles both manifest-visible and not-yet-flushed files.
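
The two-tier lookup order (manifest-visible state first, then a direct read of a not-yet-flushed file) might look roughly like the following. `BlobStore`, `ReadRange`, and `ResolveSketch` are hypothetical names for this example and do not reflect the signature of `ResolveBlobDirectWriteIndex`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical model: blob bytes keyed by file number, split by whether
// the file is already registered in MANIFEST.
struct BlobStore {
  std::unordered_map<uint64_t, std::string> manifest_visible;
  std::unordered_map<uint64_t, std::string> unflushed;
};

// Copies bytes[offset, offset + size) into *out if the range is in bounds.
bool ReadRange(const std::string& bytes, uint64_t offset, uint64_t size,
               std::string* out) {
  if (offset > bytes.size() || size > bytes.size() - offset) {
    return false;
  }
  out->assign(bytes, offset, size);
  return true;
}

// Checks manifest-visible files first, then falls back to files that were
// written directly but not yet flushed.
bool ResolveSketch(const BlobStore& store, uint64_t file_number,
                   uint64_t offset, uint64_t size, std::string* value) {
  assert(value != nullptr);
  auto it = store.manifest_visible.find(file_number);
  if (it != store.manifest_visible.end()) {
    return ReadRange(it->second, offset, size, value);
  }
  it = store.unflushed.find(file_number);
  if (it != store.unflushed.end()) {
    return ReadRange(it->second, offset, size, value);
  }
  return false;  // unknown file number
}
```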

New options

| Option | Default | Description |
| --- | --- | --- |
| `enable_blob_direct_write` | `false` | Enable write-path blob separation for this CF. Requires `enable_blob_files = true`. Not dynamically changeable. |
| `blob_direct_write_partitions` | `1` | Number of parallel partition files per CF. Not dynamically changeable. |

Feature incompatibilities (reduced v1 scope)

The following features are not supported when enable_blob_direct_write = true, and are enforced both in db_stress_tool validation and db_crashtest.py sanitization:

Write model constraints:

threads must be 1 (single writer assumption)
allow_concurrent_memtable_write = 0
enable_pipelined_write = 0 (WriteImpl does transform the batch before it joins the batch group on the pipelined path, but the option stays disabled in v1 stress runs)
two_write_queues = 0
unordered_write = 0 (WriteImpl does transform the batch before it joins the batch group on the unordered path, but the option stays disabled in v1 stress runs)

WAL and recovery:

disable_wal = 1 (required — WAL replay of unregistered blob files is out of v1 scope)
best_efforts_recovery = 0
reopen = 0 (no crash-restart with WAL replay)
All WAL-related stress features disabled: manual_wal_flush_one_in, sync_wal_one_in, lock_wal_one_in, get_sorted_wal_files_one_in, get_current_wal_file_one_in, track_and_verify_wals, rate_limit_auto_wal_flush, recycle_log_file_num

Blob GC and dynamic options:

use_blob_db = 0 (stacked BlobDB not supported)
allow_setting_blob_options_dynamically = 0
enable_blob_garbage_collection = 0
blob_compaction_readahead_size = 0
blob_file_starting_level = 0

Unsupported value types and APIs:

Merge (use_merge, use_full_merge_v1) — merge values pass through untransformed
Entity APIs (use_put_entity_one_in, use_get_entity, use_multi_get_entity, use_attribute_group)
use_timed_put_one_in
User-defined timestamps (user_timestamp_size, persist_user_defined_timestamps, create_timestamped_snapshot_one_in)
Transactions (use_txn, use_optimistic_txn, test_multi_ops_txns, commit_bypass_memtable_one_in) — though WriteCommittedTxn::CommitInternal falls back from bypass-memtable to normal path when BDW is active
IngestWriteBatchWithIndex returns NotSupported
inplace_update_support = 0

Fault injection:

All write/read/metadata fault injection disabled (sync_fault_injection, write_fault_one_in, metadata_write_fault_one_in, read_fault_one_in, metadata_read_fault_one_in, open_*_fault_one_in)

Infrastructure/snapshot APIs:

remote_compaction_worker_threads = 0
test_secondary = 0
backup_one_in = 0
checkpoint_one_in = 0
get_live_files_apis_one_in = 0
ingest_external_file_one_in = 0
ingest_wbwi_one_in = 0

Tests

db/blob/db_blob_basic_test.cc: ~660 lines of new direct-write unit tests covering basic put/get, multi-partition, flush/compaction, recovery, and error injection.
db/blob/blob_file_cache_test.cc: ~96 lines of new tests for direct-write blob file cache behavior.
db/write_batch_test.cc: ~96 lines of tests for WriteBatch with blob index entries.
utilities/transactions/transaction_test.cc: verifies transaction commit path falls back correctly with direct write enabled.
db_stress_tool/: full stress test support with --enable_blob_direct_write and --blob_direct_write_partitions flags, integrated into db_crashtest.py with 10% random selection alongside regular blob params.

Test Plan

make -j128 db_blob_basic_test && ./db_blob_basic_test
make -j128 blob_file_cache_test && ./blob_file_cache_test
make -j128 write_batch_test && ./write_batch_test
make -j128 transaction_test && ./transaction_test
make -j128 check

Stress test:

python3 tools/db_crashtest.py blackbox --enable_blob_direct_write=1 \
  --enable_blob_files=1 --blob_direct_write_partitions=4 \
  --disable_wal=1 --threads=1

meta-codesync · 2026-03-30T18:18:32Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98766843.

meta-codesync · 2026-03-30T18:21:16Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98766843.

github-actions · 2026-03-30T18:28:23Z

⚠️ clang-tidy: 16 warning(s) on changed lines

Completed in 1334.7s.

Summary by check

| Check | Count |
| --- | --- |
| `bugprone-unused-return-value` | 2 |
| `cert-err58-cpp` | 1 |
| `concurrency-mt-unsafe` | 12 |
| `cppcoreguidelines-special-member-functions` | 1 |
| Total | 16 |

Details

db/blob/blob_file_cache.cc (2 warning(s))

db/blob/blob_file_cache.cc:169:3: warning: the value returned by this function should not be disregarded; neglecting it may lead to errors [bugprone-unused-return-value]
db/blob/blob_file_cache.cc:211:3: warning: the value returned by this function should not be disregarded; neglecting it may lead to errors [bugprone-unused-return-value]

db/db_impl/db_impl_write.cc (1 warning(s))

db/db_impl/db_impl_write.cc:525:10: warning: class 'BlobWriteRollbackGuard' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator [cppcoreguidelines-special-member-functions]
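
One common way to satisfy this check for a scope guard is to delete the copy and move operations explicitly. The class below is a generic sketch under that assumption; it is not the PR's `BlobWriteRollbackGuard`, whose members are not shown on this page.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical scope guard: runs a rollback callback on destruction
// unless Dismiss() was called first.
class RollbackGuard {
 public:
  explicit RollbackGuard(std::function<void()> rollback)
      : rollback_(std::move(rollback)) {}

  // Copying or moving a guard could run the rollback twice, so all four
  // operations are deleted. Declaring them also satisfies
  // cppcoreguidelines-special-member-functions.
  RollbackGuard(const RollbackGuard&) = delete;
  RollbackGuard& operator=(const RollbackGuard&) = delete;
  RollbackGuard(RollbackGuard&&) = delete;
  RollbackGuard& operator=(RollbackGuard&&) = delete;

  ~RollbackGuard() {
    if (armed_ && rollback_) {
      rollback_();
    }
  }

  // Call once the write has succeeded so the destructor does nothing.
  void Dismiss() { armed_ = false; }

 private:
  std::function<void()> rollback_;
  bool armed_ = true;
};
```

A guard that leaves scope without `Dismiss()` runs its callback exactly once; a dismissed guard runs nothing.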

db_stress_tool/db_stress_gflags.cc (1 warning(s))

db_stress_tool/db_stress_gflags.cc:497:19: warning: initialization of 'FLAGS_blob_direct_write_partitions_dummy' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]

db_stress_tool/db_stress_tool.cc (12 warning(s))

db_stress_tool/db_stress_tool.cc:225:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:230:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:236:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:241:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:246:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:252:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:259:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:265:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:271:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:279:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:288:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:294:7: warning: function is not thread safe [concurrency-mt-unsafe]

github-actions · 2026-03-30T19:11:30Z

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 60b4ea8

The final review has been written to review-findings.md. Here is the complete review:

Blob Direct Write v1 — Final Synthesized Code Review

PR: Blob direct write v1: write-path blob separation with partitioned files
Author: xingbowang
Scope: 52 files changed, 3621 insertions(+), 139 deletions(-)
Review method: 9-agent parallel review with adversarial debate and manual verification

Summary

This PR introduces a new feature where large values (>= min_blob_size) are written directly to blob files during Put() and replaced in memtables with compact BlobIndex references. The design uses partitioned blob files with round-robin assignment, generation batching tied to memtable switches, and a rollback-as-garbage pattern for failed writes. V1 constraints: single writer thread, WAL disabled at WriteOptions level, buffer flush per record.

The implementation is architecturally sound and well-integrated across write, read, and flush paths. Test coverage is good for basic functionality and integration scenarios. However, several correctness and performance issues were identified that warrant attention before merge.

HIGH Severity

H1: `assert(false)` fallback in `MarkBlobWriteAsGarbage` is a no-op in release builds

File: db/blob/blob_file_partition_manager.cc:521

When the rollback function cannot find the target file across partitions, current-generation sealed files, and pending generations, it hits assert(false) followed by a ROCKS_LOG_WARN. In release builds, assert(false) compiles away, making the function silently succeed without marking any garbage. This leaks garbage accounting — the blob bytes are written but never tracked as garbage, causing space amplification.

  assert(false);
  if (info_log_ != nullptr) {
    ROCKS_LOG_WARN(
        info_log_,
        "Could not match failed blob direct-write rollback for file #%" PRIu64,
        file_number);
  }

Recommendation: Return a Status error or at minimum use ROCKS_LOG_ERROR unconditionally. Consider whether this indicates a logic bug that should be a Corruption status.
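
A minimal sketch of the recommended shape follows, assuming a simplified status enum in place of RocksDB's `Status` and a hypothetical `GarbageTracker` in place of the partition manager. The point is only that the not-found case is reported through the return value, which survives builds where `NDEBUG` removes `assert`.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Minimal stand-in for a status type; RocksDB's Status class is richer.
enum class RollbackStatus { kOk, kNotFound };

// Hypothetical bookkeeping: garbage bytes tracked per blob file number.
struct GarbageTracker {
  std::unordered_map<uint64_t, uint64_t> garbage_bytes_by_file;

  // Returns kNotFound rather than relying on assert(false), so release
  // builds still surface the failure to the caller.
  RollbackStatus MarkAsGarbage(uint64_t file_number, uint64_t bytes) {
    auto it = garbage_bytes_by_file.find(file_number);
    if (it == garbage_bytes_by_file.end()) {
      return RollbackStatus::kNotFound;
    }
    it->second += bytes;
    return RollbackStatus::kOk;
  }
};
```

The caller can then decide whether an unmatched rollback should be logged, retried, or escalated.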

H2: Flush failure leaves prepared blob generations in limbo

File: db/db_impl/db_impl_compaction_flush.cc:323-330

When flush_job.Run() fails after PrepareFlushAdditions() has already sealed deferred blob files, the code sets prepared_blob_generations = 0 and skips CommitPreparedGenerations(). This means the sealed generations remain in pending_generations_ but are never consumed. On the next successful flush, PrepareFlushAdditions is called again with a new num_generations count that does not account for the stale entries.

  if (!s.ok() || !unconsumed_additions.empty() ||
      !unconsumed_garbages.empty()) {
    prepared_blob_generations = 0;  // Skip commit — but entries stay in queue
    sealed_blob_numbers.clear();
  }

Consequence: The sealed blob files exist on disk with footers written, but are never registered in MANIFEST. They become orphaned. Subsequent flushes may encounter a count mismatch between expected and available generations.

Recommendation: Add explicit cleanup logic that either re-queues the sealed files for the next flush or marks them as garbage to be cleaned up.
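
The re-queue option could take roughly the following shape. `GenerationQueue` is a hypothetical model written for this example, not the PR's data structure: an aborted flush leaves its generations queued and records how many were carried over, so the next `Prepare` includes them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical model of the pending-generation FIFO.
class GenerationQueue {
 public:
  void Push(uint64_t generation_id) { pending_.push_back(generation_id); }

  // Returns the oldest generations to register: any carried over from a
  // failed flush plus `new_generations`. Entries stay queued until Commit().
  std::vector<uint64_t> Prepare(size_t new_generations) {
    const size_t n = stale_ + new_generations;
    assert(n <= pending_.size());
    prepared_ = n;
    return std::vector<uint64_t>(
        pending_.begin(), pending_.begin() + static_cast<std::ptrdiff_t>(n));
  }

  // Flush succeeded: drop the prepared generations from the queue.
  void Commit() {
    pending_.erase(pending_.begin(),
                   pending_.begin() + static_cast<std::ptrdiff_t>(prepared_));
    prepared_ = 0;
    stale_ = 0;
  }

  // Flush failed: keep the generations queued and remember the count so
  // the next Prepare() accounts for them.
  void Abort() {
    stale_ = prepared_;
    prepared_ = 0;
  }

  size_t Pending() const { return pending_.size(); }

 private:
  std::deque<uint64_t> pending_;
  size_t prepared_ = 0;
  size_t stale_ = 0;
};
```

Whether re-queuing or marking the files as garbage is the better fix depends on details of the flush path that this sketch does not model.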

H3: Read-path overhead for non-BDW databases — `MaybeResolveBlobForWritePath` called unconditionally

File: db/db_impl/db_impl.cc (multiple sites in GetImpl and MultiGetImpl)

Every Get() and MultiGet() call invokes MaybeResolveBlobForWritePath() after memtable and immutable memtable lookups, even when no column family has BDW enabled. The function does exit early when partition_mgr == nullptr, but the function call overhead (parameter setup, stack frame) is imposed on every read operation in every database.

Recommendation: Guard the call sites with if (partition_mgr != nullptr) before calling the function, to eliminate function call overhead entirely when BDW is not enabled for the column family.

H4: Write-path CF scan overhead — linear scan of all column families per write batch

File: db/db_impl/db_impl_write.cc:590-596

The maybe_transform_batch_for_blob_direct_write lambda performs a linear scan of all column families to check if any has a non-null blob_partition_manager(). This runs for every write batch even when BDW is never configured. For databases with many column families, this is O(n) per write.

Recommendation: Cache a DB-level has_any_bdw_cf_ flag, updated only when column families are created/dropped or options are changed. This reduces the check to O(1).
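
One way to get the O(1) check is a DB-level counter maintained on column family creation and drop. The class name and hook names below are assumptions for illustration; the review only suggests a cached `has_any_bdw_cf_` flag.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical DB-level counter of column families that have direct
// write enabled.
class BdwCfCounter {
 public:
  // Call when a column family with direct write enabled is created.
  void OnCfAdded() { count_.fetch_add(1, std::memory_order_release); }

  // Call when such a column family is dropped.
  void OnCfDropped() {
    const uint32_t prev = count_.fetch_sub(1, std::memory_order_release);
    assert(prev > 0);
    (void)prev;  // silence unused-variable warnings when NDEBUG is set
  }

  // O(1) check for the write path, replacing a scan over all CFs.
  bool AnyEnabled() const {
    return count_.load(std::memory_order_acquire) != 0;
  }

 private:
  std::atomic<uint32_t> count_{0};
};
```

A counter rather than a plain boolean keeps the answer correct when one of several enabled column families is dropped.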

MEDIUM Severity

M1: `IsValidBlobOffset` short-circuit prevents unsigned underflow but relies on `||` evaluation order

File: db/blob/blob_log_format.h — Safe as written but fragile if refactored. Defensive style would separate into independent if blocks.
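
The "independent if blocks" style might look like the sketch below. The function name, its parameters, and the header and footer sizes are placeholders chosen for this example; the actual `IsValidBlobOffset` signature and constants are not shown on this page.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder layout constants for the sketch only.
constexpr uint64_t kHeaderSize = 30;
constexpr uint64_t kFooterSize = 32;

// Validates that [offset, offset + size) lies in the record area of a
// file of `file_size` bytes. Each check is its own early return, so no
// subtraction runs before its operands are known to be safe.
bool IsValidOffsetSketch(uint64_t offset, uint64_t size, uint64_t file_size) {
  if (file_size < kHeaderSize + kFooterSize) {
    return false;  // file too small to hold any record
  }
  if (offset < kHeaderSize) {
    return false;  // points into the header
  }
  const uint64_t data_end = file_size - kFooterSize;  // cannot underflow
  if (offset > data_end) {
    return false;  // starts inside or past the footer
  }
  return size <= data_end - offset;  // cannot underflow: offset <= data_end
}
```

Because each subtraction is preceded by its own bounds check, reordering or refactoring one condition cannot silently reintroduce an unsigned wraparound.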

M2: Compression performed outside mutex, but may be wasted if partition state changes

File: db/blob/blob_file_partition_manager.cc:WriteBlob — Compression runs outside the lock (good), but if the partition's CF/compression type changed before lock acquisition, the compressed data is discarded. Wasted work for large values.

M3: Public API documentation incomplete for restrictions

File: include/rocksdb/advanced_options.h — Missing documentation for: single writer restriction, WAL requirement, MemPurge/timestamp incompatibility, IngestWriteBatchWithIndex restriction.

M4: ReadOnly/Secondary DB `nullptr` for `blob_partition_mgr`

Correct for normal operation (flushed data resolved via Version). Severity reduced from agent's "CRITICAL" — should be documented as known limitation.

M5: `RefreshBlobFileReader` does 3 non-atomic cache operations

File: db/blob/blob_file_cache.cc — Lookup -> Erase -> Insert window allows concurrent cache miss.
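
To illustrate the window being described, here is a sketch of a refresh that replaces the entry under a single lock, so no other thread can observe the state between erase and insert. `ReaderCache` and `Reader` are hypothetical and much simpler than RocksDB's `Cache` interface, which may not allow this exact approach.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>

// Stand-in for a blob file reader; only records the file size it saw.
struct Reader {
  uint64_t known_file_size;
};

// Hypothetical cache keyed by blob file number.
class ReaderCache {
 public:
  std::shared_ptr<Reader> Lookup(uint64_t file_number) const {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = readers_.find(file_number);
    if (it == readers_.end()) {
      return nullptr;
    }
    return it->second;
  }

  // Replaces the entry in one step while holding the lock. Readers that
  // already hold the old shared_ptr keep it alive until they release it.
  void Refresh(uint64_t file_number, std::shared_ptr<Reader> fresh) {
    std::lock_guard<std::mutex> lock(mu_);
    readers_[file_number] = std::move(fresh);
  }

 private:
  mutable std::mutex mu_;
  std::unordered_map<uint64_t, std::shared_ptr<Reader>> readers_;
};
```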

M6: Multiple heap allocations per write batch on BDW path

File: db/db_impl/db_impl_write.cc — unordered_set, multiple vector allocations per batch. Consider SmallVector/InlineVector.

Suggestions

S1: Cache any_bdw flag at DB level for O(1) write-path check
S2: Clarify kDoFlushEachRecord is buffer flush, not fsync (verified: Flush() not fsync())
S3: PutEntitySerialized error path missing save.Rollback()
S4: Increase BDW stress test frequency beyond ~10%

Verified Non-Issues (7 False Positives from Agents)

"Missing RotateCurrentGeneration call" — IS called at db_impl_write.cc:2911
"ReadOnly DB broken" — Correctly passes nullptr; flushed data resolves via Version
"IsValidBlobOffset unsigned underflow" — Protected by || short-circuit
"Mutex held during compression" — Compression is OUTSIDE mutex
"fsync per record" — It's Flush() (buffer flush), not fsync()
"WAL not enforced at option level" — Correct for WriteOptions flag
"Transaction DBs bypass BDW" — Intentionally excluded from v1 scope

Design Assessment

The overall architecture is well-conceived. Partitioned files, generation batching, rollback-as-garbage, skip-footer reads, and settings caching are all sound design choices. The v1 constraints are reasonable and the code is structured for incremental improvement.

Recommendation

Approve with required changes for H1 and H2. H3/H4 are performance improvements that should be addressed but are not blocking.

Required before merge:

Fix MarkBlobWriteAsGarbage fallback (H1) — replace assert(false) with proper error handling
Fix flush failure orphan handling (H2) — add cleanup for prepared but uncommitted generations

Strongly recommended:

Add if (partition_mgr != nullptr) guards on read path (H3)
Cache DB-level BDW flag for write path (H4)
Improve public API documentation (M3)

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

TransformBatch iterates the WriteBatch via Iterate(), which can return an error on a corrupted batch before the WAL verification path (MergeBatch -> WriteToWAL -> VerifyChecksum) has a chance to detect it. This caused DbKvChecksumTestMergedBatch tests to fail because the corruption error was returned early from WriteImpl instead of being set as a background error via the WAL write path. Add a quick scan of column families before calling TransformBatch. When no CF has blob_partition_manager (i.e. no BDW enabled), skip the transform entirely.

meta-cla bot added the CLA Signed label Mar 30, 2026

xingbowang force-pushed the blob_direct_write_v1 branch from 4d24273 to 4dbcc48 Compare March 30, 2026 23:30

xingbowang added 7 commits March 30, 2026 17:06

Implement reduced-scope blob direct write v1

caf5210

fix stale blob reader cache

b02dca6

improve feature incompatibility handling

88e9680

fix stress test sanitization and assert

3b1b807

fix stress test bugs

7cb1989

fix expect stat in stress test

8bff5c8

fix expect stat the right way

71b2c50

xingbowang force-pushed the blob_direct_write_v1 branch from 4dbcc48 to 71b2c50 Compare March 31, 2026 04:46

xingbowang changed the title ~~Add blob direct write with partitioned blob files (v1)~~ Blob direct write v1: write-path blob separation with partitioned files (reduced scope) Mar 31, 2026

xingbowang added 2 commits March 30, 2026 22:00

fix clang-format in db_blob_basic_test

793b6f2

xingbowang wants to merge 9 commits into facebook:main from xingbowang:blob_direct_write_v1
