
Conversation

@georgeee (Member)

This PR implements a benchmark for DB usage.

It independently measures read and write performance of every DB implementation. This allows making informed decisions for the various flows of working with data, keeping the measurement of pure DB performance separate from the performance of other subsystems (including serialization).

See db_benchmark/README.md for more details about the benchmark.

Explain how you tested your changes:

  • Executed the benchmark successfully

Checklist:

  • Dependency versions are unchanged
    • Notify Velocity team if dependencies must change in CI
  • Modified the current draft of release notes with details on what is completed or incomplete within this project
  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? None

@georgeee georgeee added the oom label Nov 15, 2025

georgeee commented Nov 15, 2025

Result of the run with default parameters

| Name | Time/Run | mWd/Run | mjWd/Run | Prom/Run | Percentage |
|------|----------|---------|----------|----------|------------|
| rocksdb_write | 705_955.63us | 158_922.00w | 482.99w | 482.99w | 26.18% |
| rocksdb_read | 125.97us | 1_935.00w | 16_399.56w | 13.56w | |
| lmdb_write | 2_696_045.27us | 10_047.00w | 12.96w | 12.96w | 100.00% |
| lmdb_read | 96.56us | 1_217.00w | 16_387.19w | 1.19w | |
| single_file_write | 70_465.27us | 19_047.38w | 324.69w | 324.69w | 2.61% |
| single_file_read | 362.82us | 1_273.00w | 73_738.82w | 2.82w | 0.01% |
| multi_file_write | 83_925.40us | 559.00w | 2_048_384.01w | 382.01w | 3.11% |
| multi_file_read | 184.54us | 1_269.00w | 32_772.38w | 0.38w | |

📊 Benchmark Analysis: Write vs Read Performance

Test Configuration:

  • 📦 Keys per block: 125
  • 💾 Value size: 131,072 bytes (128 KB)
  • 🔢 Blocks in DB: 800
  • Total data: ~100,000 keys, ~12.8 GB total

✍️ Write Performance Comparison

Speed (Time/Run - lower is better):

  1. 🥇 single_file_write: ~70ms - fastest option
  2. 🥈 multi_file_write: ~84ms - very close second
  3. ⚠️ rocksdb_write: ~706ms - 10x slower than single file
  4. 🔴 lmdb_write: ~2,696ms - significantly slower (38x slower than single file)

Memory Allocation:

  • 💚 multi_file_write: 559w - minimal minor heap allocation
  • ⚠️ rocksdb_write: 158,922w - high memory allocation
  • 🔴 multi_file_write (mjWd): 2,048,384w - very high major heap pressure from file operations

📖 Read Performance Comparison

Speed (Time/Run - all very fast):

  1. 🥇 lmdb_read: ~97μs - fastest
  2. 🥈 rocksdb_read: ~126μs - a close second
  3. multi_file_read: ~185μs - still excellent
  4. single_file_read: ~363μs - slowest but still sub-millisecond

All read operations are extremely fast (microsecond range vs millisecond writes).

🎯 Key Takeaways

  • LMDB: terrible write performance (~2.7s per operation) but excellent read speed
  • Simple file I/O: best write performance by far - ideal for large-value storage
  • RocksDB: a balanced middle ground, but high memory usage on writes
  • Large values (128 KB): simple file approaches dominate for write throughput

💡 Recommendation: For large-value workloads like this (128 KB per value):

  • Write-heavy → single_file_write is the clear winner
  • Read-heavy → LMDB or RocksDB provide faster lookups

Update after optimization of multi-file writing

📊 multi_file_write Benchmark Results

| Name | Time/Run | mWd/Run | mjWd/Run | Prom/Run | Percentage |
|------|----------|---------|----------|----------|------------|
| multi_file_write | 45.50ms | 553.00w | 24.03w | 24.03w | 100.00% |

📈 Before vs After Comparison

Performance Gains:

  • ⏱️ Time: 83,925μs → 45,500μs (45.5ms)
  • 📉 Speedup: 1.84x faster 🎉
  • 💾 Memory (mWd): 559w → 553w (essentially unchanged)
  • 🔄 Major heap (mjWd): 2,048,384w → 24.03w
  • Major heap reduction: ~99.999%! 🔥

🏆 Updated Write Performance Rankings

  1. 🥇 multi_file_write (new): ~45.5ms - NEW CHAMPION
  2. 🥈 single_file_write: ~70ms (1.54x slower)
  3. ⚠️ rocksdb_write: ~706ms (15.5x slower)
  4. 🔴 lmdb_write: ~2,696ms (59x slower)

💡 What Changed?

The massive mjWd reduction (from 2M+ to 24w) suggests you eliminated file system churn or excessive allocations. This is a textbook example of optimization - you kept the speed advantage while making it vastly more GC-friendly.
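The thread doesn't show the diff, but a plausible shape for such an optimization is streaming each value straight to the channel instead of concatenating them into one block-sized string first; the large intermediate string is exactly the kind of allocation that lands on the major heap. A minimal stdlib sketch (function names are illustrative, not the PR's):

```ocaml
(* Allocation-heavy variant: builds one block-sized string in memory
   before writing, pressuring the major heap for large blocks. *)
let write_block_concat path values =
  let data = String.concat "" values in
  let oc = open_out_bin path in
  output_string oc data ;
  close_out oc

(* Streaming variant: per-value writes only, no concatenation. *)
let write_block_streaming path values =
  let oc = open_out_bin path in
  List.iter (output_string oc) values ;
  close_out oc
```

Both produce byte-identical files; only the allocation behavior differs.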

New recommendation: For large-value (128 KB) write workloads, multi_file_write is now the clear winner - fastest write speed AND minimal heap pressure. 🎯

@georgeee (Member, Author)

🚀 Smaller Values, Different Story

📊 Full Benchmark Results (New Parameters)

Test Configuration:

  • 📦 Keys per block: 32
  • 💾 Value size: 9,000 bytes (8.8 KB)
  • 🔢 Warmup blocks: 1,000
  • Total warmup: 32,000 keys

| Name | Time/Run | mWd/Run | mjWd/Run | Prom/Run | Percentage |
|------|----------|---------|----------|----------|------------|
| rocksdb_write | 4,033.82us | 40,708.00w | 20.57w | 20.57w | 0.42% |
| rocksdb_read | 40.49us | 1,924.00w | 1,140.63w | 13.63w | - |
| lmdb_write | 957,662.23us | 2,596.00w | 0.65w | 0.65w | 100.00% |
| lmdb_read | 35.66us | 1,206.00w | 1,127.31w | 0.31w | - |
| single_file_write | 1,957.77us | 4,900.00w | 38.59w | 38.59w | 0.20% |
| single_file_read | 70.29us | 1,262.00w | 9,321.10w | 0.10w | - |
| multi_file_write | 501.43us | 269.00w | 36,014.88w | 12.88w | 0.05% |
| multi_file_read | 55.97us | 1,258.00w | 2,254.03w | - | - |

Before vs After (multi_file_write, original vs optimized):

  • ⏱️ Time: 501.43us → 839.77us (1.67x slower) ⚠️
  • 💾 Memory (mWd): 269w → 274w (essentially unchanged)
  • 🔄 Major heap (mjWd): 36,014.88w → 6.49w
  • Major heap reduction: ~99.98%! 🔥

🏆 Write Performance Rankings (8.8 KB values)

  1. 🥇 multi_file_write (original): ~501us - fastest
  2. 🥈 multi_file_write (optimized): ~840us - better GC behavior
  3. 🥉 single_file_write: ~1,958us
  4. ⚠️ rocksdb_write: ~4,034us
  5. 🔴 lmdb_write: ~957,662us - still very slow

📖 Read Performance Rankings

  1. 🥇 lmdb_read: ~36us - fastest
  2. 🥈 rocksdb_read: ~40us
  3. multi_file_read: ~56us
  4. single_file_read: ~70us

💡 Key Observations

Compared to 128 KB value test:

  • 📉 All operations are significantly faster with smaller values (8.8 KB vs 128 KB)
  • 🔄 Trade-off emerged: Optimization reduced mjWd by 99.98% but slowed writes by 1.67x
  • 🎯 RocksDB becomes competitive at smaller value sizes (~4ms vs 706ms previously)
  • LMDB still struggles with writes but dominates reads

Optimization trade-off: The optimized version trades some speed for much better GC behavior. Depending on workload (GC pressure vs raw throughput), either version could be preferable.

@georgeee georgeee mentioned this pull request Nov 17, 2025
Sys.getenv "WARMUP_BLOCKS" |> Option.value_map ~default:800 ~f:int_of_string

(* Fixed seed for reproducibility *)
let random_seed = 42

It would be nice to make this random when not user-provided, and have the test print the seed it's using on every run.
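A minimal sketch of that suggestion, using the OCaml stdlib rather than Core so it is self-contained; the `RANDOM_SEED` variable name is an assumption, not an existing knob of the benchmark:

```ocaml
(* Use the user-provided seed if present, otherwise pick one at random,
   and always print it so a run can be reproduced.
   RANDOM_SEED is a hypothetical env var name. *)
let random_seed =
  match Sys.getenv_opt "RANDOM_SEED" with
  | Some s -> int_of_string s
  | None ->
      Random.self_init () ;
      Random.int 1_000_000

let () = Printf.printf "random seed: %d\n%!" random_seed
```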

let cached_value = lazy (generate_value ())

(* Get the cached value *)
let get_value () = Lazy.force cached_value

We're using the same value for all blocks. I wonder whether these backends have optimizations that improve performance in that case; it would be better to use distinct values for distinct keys.
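One way to address this while keeping runs reproducible is to derive each value from the fixed seed and the key. A sketch, where `generate_value_for` and `value_size` are hypothetical names, not the benchmark's API:

```ocaml
(* Hypothetical stand-in for the benchmark's configured value size. *)
let value_size = 9_000

(* Deterministically derive a distinct value per key: seeding a fresh
   generator from (seed, key) gives a reproducible but key-specific
   byte stream, so backends cannot benefit from a single repeated value. *)
let generate_value_for ~seed ~key =
  let st = Random.State.make [| seed; key |] in
  String.init value_size (fun _ -> Char.chr (Random.State.int st 256))
```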

min_key + Random.State.int random_state (max_key - min_key + 1)

(* Database interface that all implementations must satisfy *)
module type Database = sig

It might be worth replacing this string-storing implementation with one that stores bytes, to avoid any wrapping done at the bindings end, so that we don't waste time on serialization/deserialization.
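A sketch of what a bytes-based interface could look like; the module type below is illustrative and intentionally smaller than the PR's actual `Database` signature, with a trivial in-memory instance only to show the shape:

```ocaml
(* Illustrative interface storing bytes instead of string, so bindings
   can hand buffers through without conversion. *)
module type Database = sig
  type t

  val create : unit -> t

  val set : t -> key:int -> value:bytes -> unit

  val get : t -> key:int -> bytes option
end

(* Minimal in-memory instance of the signature. *)
module Mem : Database = struct
  type t = (int, bytes) Hashtbl.t

  let create () = Hashtbl.create 16

  let set t ~key ~value = Hashtbl.replace t key value

  let get t ~key = Hashtbl.find_opt t key
end
```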

let start_key = block_num * Common.keys_per_block in
List.iteri values ~f:(fun i value ->
    let key = start_key + i in
    Rw.set ~env:t.env t.db key value )

We're paying the cost of a commit on each key in the block. Is this intended?
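The fix the comment asks for is one transaction per block rather than a commit per key. A hedged sketch of that shape, where `with_txn` and `set_in_txn` are hypothetical stand-ins for the real binding's API and the counter only makes the commit count observable:

```ocaml
(* Observable commit counter, for illustration only. *)
let commits = ref 0

(* Run f inside a single transaction; commit once at the end. *)
let with_txn f =
  let result = f () in
  incr commits ;
  result

(* Hypothetical per-key write inside an open transaction. *)
let set_in_txn (_key : int) (_value : string) = ()

(* All keys of a block go through one transaction, so the whole block
   costs one commit instead of keys_per_block commits. *)
let set_block ~block_num ~keys_per_block values =
  let start_key = block_num * keys_per_block in
  with_txn (fun () ->
      List.iteri (fun i value -> set_in_txn (start_key + i) value) values)
```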

let path = block_path t block_num in
(* Write all values directly to file without in-memory concatenation *)
Out_channel.with_file path ~binary:true ~f:(fun oc ->
    List.iter values ~f:(Out_channel.output_string oc) )

This is not a fair comparison: the files are written through the same handle on the same channel, without close/open between values.

But with LMDB, say, we're creating a txn per write.

let set_block db ~block_num values =
  let start_key = block_num * Common.keys_per_block in
  List.iteri values ~f:(fun i value ->
      let key = start_key + i in

This is also not a batch operation.

@glyh glyh (Member) left a comment:

Please consider using batch operations on the DB, at the very least.

