
Conversation

@mattisonchao
Member

Motivation

In the current implementation, iterator-related operations perform very poorly when there are many small DeleteRange ranges.
FYI: https://rocksdb.org/blog/2018/11/21/delete-range.html

It makes our putWithSequence method execute very slowly (given the current memtable size).

The reason is that Pebble (like RocksDB) creates fragmented range-deletion tombstones for range deletions. Pebble does cache the fragmented tombstones, but the cache is invalidated whenever a new delete-range operation arrives, so the deletion iterator has to be rebuilt after every new DeleteRange. This gets worse as the number of DeleteRange operations grows.
FYI: https://github.com/cockroachdb/pebble/blob/dbc1c128682f7efcdb76352432249780b12447f7/mem_table.go#L248-L251

The RocksDB blog post linked above also mentions this in its Future Work section.
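
To make the pattern concrete, here is a minimal standalone sketch (not the Oxia benchmark) of the kind of workload that triggers the slowdown. It assumes a recent Pebble version where db.NewIter returns (iterator, error):

package main

import (
	"fmt"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	// In-memory Pebble instance so everything stays in the memtable.
	db, err := pebble.Open("demo", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	for i := 0; i < 10_000; i++ {
		key := fmt.Sprintf("key-%06d", i)
		_ = db.Set([]byte(key), []byte("v"), pebble.NoSync)
	}

	// Many small ranges, each covering only a couple of keys. Every call
	// invalidates the cached fragmented tombstones in the memtable.
	for i := 0; i < 10_000; i += 10 {
		start := []byte(fmt.Sprintf("key-%06d", i))
		end := []byte(fmt.Sprintf("key-%06d", i+2))
		_ = db.DeleteRange(start, end, pebble.NoSync)

		// Each new iterator has to re-fragment the tombstones, which is
		// where the time goes as the number of ranges grows.
		iter, err := db.NewIter(nil)
		if err != nil {
			panic(err)
		}
		for iter.First(); iter.Valid(); iter.Next() {
		}
		_ = iter.Close()
	}
}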

Modification

  • To keep the operation atomic from the user's point of view, we fall back from DeleteRange to normal point deletes when the number of affected keys is below DeleteRangeThreshold (default 100); a sketch of this fallback follows.
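
Here is a minimal sketch of this fallback; DeleteRangeThreshold, WriteBatch and applyDeleteRange are illustrative names for this sketch, not the actual Oxia code:

package kv

// Hypothetical sketch of the fallback described above: if a range contains
// fewer keys than DeleteRangeThreshold, issue point deletes inside the same
// write batch (still atomic for the caller); otherwise keep the single
// range tombstone.
const DeleteRangeThreshold = 100

// WriteBatch is an illustrative stand-in for the real batch type.
type WriteBatch interface {
	Delete(key []byte) error
	DeleteRange(start, end []byte) error
}

func applyDeleteRange(batch WriteBatch, keysInRange [][]byte, start, end []byte) error {
	if len(keysInRange) <= DeleteRangeThreshold {
		// Few keys: point deletes avoid adding yet another range tombstone
		// that would invalidate the memtable's fragmented-tombstone cache.
		for _, k := range keysInRange {
			if err := batch.Delete(k); err != nil {
				return err
			}
		}
		return nil
	}
	// Many keys: one range tombstone is cheaper than thousands of point deletes.
	return batch.DeleteRange(start, end)
}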

Next

  1. Introduce policies to make DeleteRangeThreshold configurable.
  2. Introduce policies to configure the memtable size, since a smaller memtable triggers flushes sooner, which compacts the deletion tombstones away (see the sketch after this list).
  3. Other small improvements.
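
For item 2, a rough idea of what such a policy could look like through Pebble's Options (MemTableSize is a real Options field; the value and helper name are purely illustrative):

package kv

import "github.com/cockroachdb/pebble"

// Hypothetical helper: a smaller memtable makes Pebble flush sooner, and the
// flush/compaction folds range-deletion tombstones into SSTs instead of
// keeping them all in the memtable.
func openWithSmallMemtable(dir string) (*pebble.DB, error) {
	opts := &pebble.Options{
		MemTableSize: 4 << 20, // 4 MiB; purely illustrative, not a recommendation
	}
	return pebble.Open(dir, opts)
}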

"k8s.io/utils/pointer"
)

func BenchmarkGenerate100(b *testing.B) {
Member Author

@mattisonchao Mar 20, 2025


This is the benchmark test. If you want to understand more details, you can check out commit 8996085 and look at tests/deleterange_test.go, which uses the actual data.

Collaborator


Do you have the before vs after results?

Member Author

@mattisonchao Mar 21, 2025


From the benchmark test:

main branch

goos: darwin
goarch: arm64
pkg: github.com/streamnative/oxia/server/kv
cpu: Apple M1 Pro
BenchmarkDeleteRange
BenchmarkDeleteRange-10    	    8863	  34814860 ns/op
PASS

this branch

goos: darwin
goarch: arm64
pkg: github.com/streamnative/oxia/server/kv
cpu: Apple M1 Pro
BenchmarkDeleteRange
BenchmarkDeleteRange-10    	   10000	    848611 ns/op
PASS


@hangc0276

@mattisonchao In BookKeeper, we also use deleteRange to delete keys, but we found that the RocksDB background compaction job will skip compacting SST files whose keys are deleted by a deleteRange operation. apache/bookkeeper#4555

Would you please double-check whether the SST files are compacted?

@mattisonchao
Member Author

mattisonchao commented Mar 21, 2025

In BookKeeper, we also use deleteRange to delete keys, but we found that the RocksDB background compaction job will skip compacting SST files whose keys are deleted by a deleteRange operation. apache/bookkeeper#4555

Would you please double-check whether the SST files are compacted?

You are describing the disk-size issue; the current problem is a read-performance issue.

@hangc0276

You are describing the disk-size issue; the current problem is a read-performance issue.

@mattisonchao It will also impact the read performance

@mattisonchao
Member Author

mattisonchao commented Mar 21, 2025

We are using an in-memory implementation for benchmarking, and the data didn't exceed the memtable size, so this has nothing to do with SSTs.

It will also impact the read performance

But I would like to understand why it would also impact the read performance.

@merlimat
Collaborator

But I would like to understand why it would also impact the read performance.

In an LSM you start from the top and go down the tree of SST files. A point get() request is usually fine: if you find a delete-range tombstone during the tree exploration, you can say that the key does not exist anymore.

The problem is typically around handling iterators, because you have to remember the tombstones in the levels above the one you're currently exploring.

@mattisonchao
Member Author

mattisonchao commented Mar 21, 2025

In an LSM you start from the top and go down the tree of SST files. A point get() request is usually fine: if you find a delete-range tombstone during the tree exploration, you can say that the key does not exist anymore.

Yes.

The problem is typically around handling iterators, because you have to remember the tombstones in the levels above the one you're currently exploring.

Yes, and it constructs a skyline of the tombstones to improve read performance, but that skyline is expensive to build (Pebble may cache it for SSTs). It also seems that Pebble (RocksDB) will improve this by merging ranges when flushing the memtable to an SST, so there will be fewer range deletions in the SSTs.

But for the current case, the root cause is that we issue many small DeleteRange operations against the memtable, and the memtable iterator has to rebuild the fragmented tombstones each time (this is where the time is wasted).


if err := batch.DeleteRange(delReq.StartInclusive, delReq.EndExclusive); err != nil {
	return nil, errors.Wrap(err, "oxia db: failed to delete range")
}
var validKeys []string
Collaborator


I'd try to find a way to avoid building this key array, since it could be very big.

One strategy could be:

  1. Start doing point delete operations.
  2. When we cross the threshold, continue with the delete range (see the sketch below).
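
A rough sketch of that strategy; rangeBatch, keyIterator and deleteRangeAdaptive are illustrative names, not the actual Oxia code:

package kv

// Hypothetical sketch of the suggested strategy: point-delete keys while
// scanning the range, and only fall back to a single DeleteRange once the
// threshold is crossed, so the full key list is never materialized.
const deleteRangeThreshold = 100

type rangeBatch interface {
	Delete(key []byte) error
	DeleteRange(start, end []byte) error
}

// keyIterator is an illustrative iterator over the keys inside [start, end).
type keyIterator interface {
	Next() (key []byte, ok bool)
}

func deleteRangeAdaptive(batch rangeBatch, it keyIterator, start, end []byte) error {
	deleted := 0
	for {
		key, ok := it.Next()
		if !ok {
			// Fewer keys than the threshold: point deletes were enough.
			return nil
		}
		if deleted >= deleteRangeThreshold {
			// Too many keys: cover the whole range with one range tombstone.
			// The point deletes already issued are redundant but harmless.
			return batch.DeleteRange(start, end)
		}
		if err := batch.Delete(key); err != nil {
			return err
		}
		deleted++
	}
}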

@merlimat merged commit 272f42d into main Mar 21, 2025
7 checks passed
@merlimat deleted the improve.deleterange branch March 21, 2025 22:50