# Add include_blob_files option to GetApproximateSizes #14501
Closed
xingbowang wants to merge 5 commits into facebook:main from xingbowang:2026_03_23_GetAppoximateSizes (+207 −16)
Changes from 1 commit
Commits:

1. `b8dcb3b` - Add include_blob_files option to GetApproximateSizes
2. `f304aea` - Address review: remove stray files, use proportional blob sizing
3. `a272eda` - Refactor: deduplicate SST range size computation
4. `e03b58b` - Address review: inline blob ratio, add C/Java API, stress test
5. `5ece175` - Add multi-range test for GetApproximateSizes with blob files

All commits authored by xingbowang.
**.claude/review-8bcfa9c5d-enforce-wbm-recovery/consensus.md** (116 additions & 0 deletions, new file)
# Consensus Document: enforce_write_buffer_manager_during_recovery Review

## Methodology

5 specialized review agents analyzed this change independently, followed by a cross-review debate phase. Findings are classified by agreement level.

---

## HIGH Severity Findings (Must Fix)

### H1: Double-Scheduling Bug — FlushScheduler Debug Assertion Crash

**Reported by:** Correctness (C1), Design (D1), Test Coverage (T1)
**Agreement:** 3/5 agents (HIGH confidence)

**Description:** `InsertLogRecordToMemtable` has two independent paths that can schedule the same CF on `flush_scheduler_`:

1. `WriteBatchInternal::InsertInto` → `CheckMemtableFull()` → `MarkFlushScheduled()` → `ScheduleWork(cfd)` (when the per-CF memtable size limit is hit)
2. New WBM code → iterates all CFs → `ScheduleWork(cfd)` for any with `GetFirstSequenceNumber() != 0`

The second path does NOT check `MarkFlushScheduled()`. If both trigger for the same CF:

- **Debug builds:** `assert(checking_set_.count(cfd) == 0)` at `flush_scheduler.cc:18` **crashes**
- **Release builds:** the CF is consumed twice; the second consumption flushes an empty memtable, creating an unnecessary SST

**Reproduction:** Set `write_buffer_size` small enough that a CF hits its per-CF limit while the WBM limit is also exceeded.

**Fix options:**

1. Add `!cfd->mem()->ShouldScheduleFlush()` or check `flush_state_` in the WBM loop
2. Use `MarkFlushScheduled()` as a dedup guard (same as `CheckMemtableFull`)
3. Also add a `!cfd->IsDropped()` check (minor optimization from C5)

---
## MEDIUM Severity Findings (Should Fix or Document)

### M1: Default `true` Changes Behavior for Existing WBM Users

**Reported by:** Design (D2), API (A1), Performance (P4)
**Agreement:** 3/5 agents

All existing users with a WriteBufferManager will get the new flush-during-recovery behavior on upgrade. This also silently overrides `avoid_flush_during_recovery=true` once any WBM flush triggers (because `flushed=true` cascades to flush all remaining memtables at the end of recovery).

**Recommendation:** Either:

- Default to `false` for backwards compatibility, OR
- Keep `true` but enhance the release note to clearly call out the behavioral change

### M2: Flushing ALL CFs Is Overly Aggressive

**Reported by:** Design (D3), Performance (P2)
**Agreement:** 2/5 agents

The code schedules ALL CFs with non-empty memtables for flush when the WBM triggers. The normal write path only flushes the current CF. This produces unnecessary small L0 files for CFs with minimal data. Acknowledged by a TODO in the code.

**Recommendation:** Acceptable for the initial implementation (the code has a TODO), but consider flushing only the largest CF or adding a minimum size threshold in a follow-up.

### M3: Post-Recovery Compaction Storm

**Reported by:** Performance (P3)
**Agreement:** 1/5 agents (but logically follows from M2)

Extra L0 files from aggressive flushing will trigger compactions after the DB opens. This could cause write stalls if the L0 count approaches `level0_stop_writes_trigger`.

**Recommendation:** Document this behavior. The selective-CF-flush improvement (M2) would mitigate this.

### M4: Missing Java JNI Bindings

**Reported by:** API (A2)
**Agreement:** 1/5 agents

The sibling option `avoid_flush_during_recovery` has Java JNI bindings. The new option does not, so Java users cannot configure it programmatically.

**Recommendation:** Add Java JNI bindings (can be a follow-up PR).

### M5: No Test for Multiple CFs / avoid_flush Interaction / Multiple WALs

**Reported by:** Test Coverage (T4, T5, T7)
**Agreement:** 1/5 agents

Tests only cover single-CF, single-WAL scenarios. Missing coverage for:

- Multiple CFs where only some have data
- `avoid_flush_during_recovery` override behavior
- Multiple WAL files during recovery

**Recommendation:** Add at least a multi-CF test.

---
## LOW Severity Findings (Nice to Have)

### L1: Crash Test Parameter Not a Lambda

**Reported by:** Test Coverage (T8)

`random.randint(0, 1)` is evaluated once at module load, not per run. It should be `lambda: random.randint(0, 1)`.

### L2: No SanitizeOptions for 2PC Interaction

**Reported by:** API (A8)

With `allow_2pc`, `avoid_flush_during_recovery` is forced false. The new option has no similar handling. Not a bug, but it should be documented.

### L3: Release Note Too Terse

**Reported by:** API (A9)

The note should mention the OOM-prevention motivation and the behavioral change for existing users.

### L4: WBM buffer_size=0 Safe (Confirmed)

**Reported by:** Correctness (C8), Performance (P6)

Both reviewers independently confirmed: `ShouldFlush()` returns false when `buffer_size=0` because `enabled()` returns false. The header comment is misleading but pre-existing. **No issue.**

### L5: Thread Safety Confirmed Safe

**Reported by:** Correctness (C7)

Recovery is single-threaded; there are no concurrency concerns.

### L6: Dropped CFs Handled Correctly

**Reported by:** Correctness (C5)

`TakeNextColumnFamily` skips dropped CFs. Minor optimization: add `!cfd->IsDropped()` to the WBM loop.

---
## Verification Plan

| Finding | Action | Priority |
|---------|--------|----------|
| H1: Double-scheduling | Write reproducer test + fix | MUST FIX |
| M1: Default value | Decision needed (author preference) | SHOULD DISCUSS |
| M2: Flush-all-CFs | Accept with TODO (already documented) | ACCEPTABLE |
| M4: Java JNI | Follow-up PR | OPTIONAL |
| L1: Lambda fix | Trivial fix | NICE TO HAVE |
**.claude/review-8bcfa9c5d-enforce-wbm-recovery/context.md** (141 additions & 0 deletions, new file)
# Codebase Context for Review: enforce_write_buffer_manager_during_recovery

## Feature Summary

This change adds a new `enforce_write_buffer_manager_during_recovery` option (default: `true`) that enables WBM (WriteBufferManager) memory limit enforcement during WAL recovery. Without it, a recovering RocksDB instance sharing a WBM with other instances could consume unbounded memory during WAL replay, potentially causing OOM.

## Commits Under Review

1. `183493d07` - Core implementation: option definition, recovery logic, initial test
2. `5357dfef0` - Remove MarkFlushScheduled() check from the WBM recovery code
3. `f51556c9e` - Add multiple-DB scenario in test
4. `f239e5f71` - Fix test
5. `388e46269` - Address comments
6. `67ffb6d82` - Add stress test integration
7. `8bcfa9c5d` - Address feedback

## Files Changed

- `db/db_impl/db_impl_open.cc` - Core logic (22 lines added in InsertLogRecordToMemtable)
- `db/db_write_buffer_manager_test.cc` - Unit tests (191 lines)
- `db_stress_tool/db_stress_common.h` - Flag declaration
- `db_stress_tool/db_stress_gflags.cc` - Flag definition
- `db_stress_tool/db_stress_test_base.cc` - Stress test integration
- `include/rocksdb/options.h` - Public API (21 lines of documentation)
- `options/db_options.cc` - Option plumbing (10 lines)
- `options/db_options.h` - Internal struct field
- `options/options_helper.cc` - BuildDBOptions
- `options/options_settable_test.cc` - Settable test
- `tools/db_crashtest.py` - Crash test randomization
- `unreleased_history/new_features/enforce_wbm_during_recovery.md` - Release note
## How the Recovery Subsystem Works

### WAL Recovery Flow (db_impl_open.cc)

```
DB::Open → Recover → RecoverLogFiles
  → for each WAL:
    → ProcessLogRecord (per WAL record)
      → InsertLogRecordToMemtable
        → WriteBatchInternal::InsertInto (inserts to memtable)
          → CheckMemtableFull() — may schedule flush via flush_scheduler_
            → calls ShouldScheduleFlush() && MarkFlushScheduled()
        → NEW: WBM check — may schedule ALL non-empty CFs via flush_scheduler_
      → MaybeWriteLevel0TableForRecovery
        → while (flush_scheduler_.TakeNextColumnFamily()):
          → WriteLevel0TableForRecovery (writes memtable to SST)
          → cfd->CreateNewMemtable()
  → MaybeFlushFinalMemtableOrRestoreActiveLogFiles
    → if (flushed || !avoid_flush_during_recovery):
      → flush ALL remaining non-empty memtables
    → else: RestoreAliveLogFiles (keep WALs for future replay)
```
### Critical Interaction: flush_scheduler_ Double-Scheduling

`WriteBatchInternal::InsertInto` (line 1583) can schedule a CF via `CheckMemtableFull()` when the per-CF memtable size limit is reached. This uses `MarkFlushScheduled()` as dedup — a CAS on the memtable's `flush_state_` from `FLUSH_REQUESTED` to `FLUSH_SCHEDULED`.

The NEW WBM code (lines 1599-1607) iterates ALL CFs and schedules those with `GetFirstSequenceNumber() != 0`. It does NOT check `MarkFlushScheduled()` or `flush_state_`.

**Potential double-scheduling scenario:**

1. `InsertInto` processes a batch; the memtable for CF1 hits its per-CF limit
2. `CheckMemtableFull()` calls `MarkFlushScheduled()` → true, schedules CF1
3. After `InsertInto` returns, WBM `ShouldFlush()` returns true
4. The WBM code checks CF1: `GetFirstSequenceNumber() != 0` → true, schedules CF1 AGAIN
5. `FlushScheduler::ScheduleWork` has a debug assertion `assert(checking_set_.count(cfd) == 0)` → **CRASH in debug builds**

In release builds, CF1 would be consumed twice by `MaybeWriteLevel0TableForRecovery`:

- First: flush the memtable, create a new empty memtable
- Second: flush the empty memtable → writes an empty SST file
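The CAS-based dedup guard can be sketched as a standalone model (simplified, hypothetical types, not the real RocksDB classes): when both paths go through `MarkFlushScheduled()`, only the first caller wins the CAS and reaches `ScheduleWork`, so the `checking_set_` assertion cannot fire.

```cpp
#include <atomic>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-ins modeling the dedup guard (fix option 2 from the
// consensus doc). Names mirror RocksDB but this is a sketch, not the real code.
enum class FlushState { kNotRequested, kRequested, kScheduled };

struct MemTableModel {
  std::atomic<FlushState> flush_state_{FlushState::kRequested};
  // CAS from FLUSH_REQUESTED to FLUSH_SCHEDULED; true for the first caller only.
  bool MarkFlushScheduled() {
    FlushState expected = FlushState::kRequested;
    return flush_state_.compare_exchange_strong(expected,
                                                FlushState::kScheduled);
  }
};

struct CfdModel {
  MemTableModel mem;
};

struct FlushSchedulerModel {
  std::set<CfdModel*> checking_set_;
  std::vector<CfdModel*> queue_;
  void ScheduleWork(CfdModel* cfd) {
    assert(checking_set_.count(cfd) == 0);  // the debug assertion H1 trips
    checking_set_.insert(cfd);
    queue_.push_back(cfd);
  }
};

// If BOTH the per-CF path and the WBM loop call through this guard, the same
// CF is enqueued at most once per consumption cycle; the second call is a no-op.
void ScheduleGuarded(FlushSchedulerModel& fs, CfdModel* cfd) {
  if (cfd->mem.MarkFlushScheduled()) {
    fs.ScheduleWork(cfd);
  }
}
```

Calling `ScheduleGuarded` twice for the same CF (once from the per-CF limit path, once from the WBM loop) enqueues it exactly once; the unguarded WBM code in the change under review would instead call `ScheduleWork` directly the second time and trip the assertion.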
### WriteBufferManager::ShouldFlush() Semantics

```cpp
bool ShouldFlush() const {
  if (enabled()) {
    // mutable_limit_ is 7/8 of buffer_size
    if (mutable_memtable_memory_usage() > mutable_limit_) return true;
    if (memory_usage() >= buffer_size &&
        mutable_memtable_memory_usage() >= buffer_size / 2) {
      return true;
    }
  }
  return false;
}
```

- `mutable_limit_` = 7/8 of `buffer_size_`
- Primary trigger: active memtable memory > 7/8 of `buffer_size`
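As a sanity check on these thresholds, here is a self-contained model of the logic quoted above (a hypothetical `WbmModel`, not the real `WriteBufferManager`), including the `buffer_size = 0` case, where `enabled()` is false and the WBM never requests a flush:

```cpp
#include <cassert>
#include <cstddef>

// Standalone sketch of the ShouldFlush() logic above; plain fields stand in
// for the real accessors.
struct WbmModel {
  std::size_t buffer_size = 0;    // 0 disables the WBM
  std::size_t memory_usage = 0;   // mutable + immutable memtable bytes
  std::size_t mutable_usage = 0;  // active (mutable) memtable bytes only

  bool enabled() const { return buffer_size != 0; }
  std::size_t mutable_limit() const { return buffer_size / 8 * 7; }  // 7/8

  bool ShouldFlush() const {
    if (enabled()) {
      if (mutable_usage > mutable_limit()) return true;  // primary trigger
      if (memory_usage >= buffer_size && mutable_usage >= buffer_size / 2)
        return true;  // secondary trigger: total at limit, mutable at half
    }
    return false;
  }
};
```

With `buffer_size = 800`, the primary trigger fires once mutable usage exceeds 700 (7/8 of 800); the secondary trigger needs total usage at the limit AND mutable usage at half; with `buffer_size = 0`, `ShouldFlush()` is always false, matching finding L4.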
### avoid_flush_during_recovery Interaction

At `MaybeFlushFinalMemtableOrRestoreActiveLogFiles` (line 1832):

```cpp
if (flushed || !immutable_db_options_.avoid_flush_during_recovery) {
  // flush final memtable
}
```

- If `flushed` is true (any mid-recovery flush happened), ALL remaining memtables get flushed regardless of `avoid_flush_during_recovery`
- The new WBM enforcement triggers mid-recovery flushes via `flush_scheduler_`, setting `flushed = true`
- This means: once WBM triggers a flush during recovery, `avoid_flush_during_recovery` is effectively overridden for the remainder
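The override above reduces to a single predicate (a sketch of the condition, not RocksDB code); its truth table makes the cascade explicit:

```cpp
#include <cassert>

// Sketch of the end-of-recovery decision: flush the remaining memtables if
// anything was already flushed mid-recovery, or if the user did not ask to
// avoid recovery-time flushes.
bool FlushRemainingMemtables(bool flushed, bool avoid_flush_during_recovery) {
  return flushed || !avoid_flush_during_recovery;
}
```

Only the `flushed = false, avoid = true` case skips the final flush, which is exactly why a single WBM-triggered flush during recovery cascades into flushing everything.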
### ColumnFamilyData::CreateNewMemtable (column_family.cc:1233)

```cpp
void ColumnFamilyData::CreateNewMemtable(SequenceNumber earliest_seq) {
  if (mem_ != nullptr) {
    delete mem_->Unref();  // Frees old memtable; WBM memory is decremented
  }
  SetMemtable(ConstructNewMemtable(GetLatestMutableCFOptions(), earliest_seq));
  mem_->Ref();
}
```

The old memtable is immediately deleted (not moved to the immutable list), which decrements WBM memory accounting.

### Normal Write Path WBM Handling (for comparison)

In the normal write path (`PreprocessWrite` → `HandleWriteBufferManagerFlush`):

- Uses `HandleWriteBufferManagerFlush()`, which calls `SwitchMemtable()` for the CURRENT CF only
- Then schedules a flush job via the background thread pool
- Much more sophisticated: considers compaction pressure, write stalls, etc.

The recovery path is simpler: it schedules ALL non-empty CFs for flush (like the `atomic_flush=false` path) and writes L0 files synchronously.
## Key Invariants

1. **FlushScheduler dedup**: Each CFD must be scheduled at most once on `flush_scheduler_` between consumption cycles. Enforced by a debug assertion in `ScheduleWork`.
2. **avoid_flush_during_recovery override**: Once `flushed = true`, all remaining memtables are flushed at the end of recovery. The new option's docs correctly describe this behavior.
3. **Memory accounting**: `CreateNewMemtable` deletes the old memtable immediately, decrementing WBM's memory. This is important — if the old memtable weren't freed, WBM would still report high usage.
4. **Single-threaded recovery**: Recovery is single-threaded (no concurrent writes), so there are no mutex contention issues. The `flush_scheduler_` is used without locks during recovery.
5. **2PC interaction**: When `allow_2pc` is true, `avoid_flush_during_recovery` is forced to false (line 152). The new option has no similar override, meaning WBM enforcement could still apply even with 2PC.

## Existing Conventions for New Options

- Options use `ImmutableDBOptions` (not `MutableDBOptions`) for recovery-time settings
- Must be plumbed through: the `DBOptions` struct, `ImmutableDBOptions` struct/constructor, `BuildDBOptions`, the option type map, the settable test, and `Dump()`
- Should have stress test coverage via `db_crashtest.py`
- The default value should be carefully considered for backwards compatibility
## Potential Pitfalls

1. **Double-scheduling bug**: As described above, if the per-CF memtable limit AND the WBM global limit trigger simultaneously, the same CF could be scheduled twice on `flush_scheduler_`, causing a debug assertion failure.
2. **Default value = true**: This changes behavior for ALL existing users on upgrade. Any user with a WBM configured will now get flush enforcement during recovery, which could produce more L0 files. This is a behavioral change that could surprise users.
3. **Empty SST files**: If double-scheduling occurs in release builds, empty SST files could be created.
4. **WBM with buffer_size=0**: The `ShouldFlush()` comment says "ShouldFlush() will always return true" when `buffer_size=0`. If a user creates a WBM with size 0, the new code would try to flush on every single WAL record during recovery.
5. **No CF selection heuristic**: The TODO in the code acknowledges that flushing ALL CFs is suboptimal. CFs with very little data will get unnecessary L0 files.
6. **Interaction with atomic_flush**: The new code doesn't check `immutable_db_options_.atomic_flush`. The normal write path handles atomic flush differently. During recovery this might not matter since recovery is single-threaded, but it is worth examining.