This repository was archived by the owner on Jan 21, 2026. It is now read-only.

Commit 20a3be1

Merge branch 'topling:memtable_as_log_index' into memtable_as_log_index
2 parents 147271f + 6f01397 commit 20a3be1

49 files changed, 1552 insertions(+), 247 deletions(-)

README-zh_cn.md

Lines changed: 18 additions & 26 deletions
@@ -1,13 +1,31 @@
 ## ToplingDB: A Persistent Key-Value Storage Engine for External Storage
 ToplingDB is developed and maintained by [Topling Inc](https://topling.cn). It is forked from [RocksDB](https://github.com/facebook/rocksdb); for details see the [ToplingDB Branch Name Convention](https://github.com/topling/toplingdb/wiki/ToplingDB-Branch-Name-Convention)

+## Quick Start
+ToplingDB requires C++17; gcc 8.3 or newer is recommended, and clang also works.
+
+ToplingDB is much faster than RocksDB, and you can quickly verify this yourself:
+### Compile & run db_bench
+```bash
+sudo yum -y install git libaio-devel gcc-c++ gflags-devel zlib-devel bzip2-devel libcurl-devel liburing-devel snappy-devel jemalloc-devel
+#sudo apt-get update -y && sudo apt-get install -y libjemalloc-dev libaio-dev libgflags-dev zlib1g-dev libbz2-dev libcurl4-gnutls-dev liburing-dev libsnappy-dev libbz2-dev liblz4-dev libzstd-dev
+git clone https://github.com/topling/toplingdb
+cd toplingdb
+make -j`nproc` db_bench DEBUG_LEVEL=0
+sudo make install PREFIX=/some/path # default is /usr/local
+```
+
+After the build commands above finish, run [db_bench.sh](db_bench.sh) (requires [port 2011](https://github.com/topling/rockside/blob/master/sample-conf/db_bench_enterprise.yaml#L4 "the embedded http web server uses port 2011")), then use ToplingDB from [native C++](https://github.com/topling/rockside/wiki/101 "typically when migrating from rocksdb"); [Java](https://github.com/topling/rockside/wiki/SidePlugin-Java-Binding "bundled in this github repository") and [Rust](https://github.com/topling/rust-toplingdb "a separate, dedicated github repository") are also supported
+
+## Introduction
 ToplingDB's submodule **[rockside](https://github.com/topling/rockside)** is the entry point of ToplingDB; for details see the **[SidePlugin wiki](https://github.com/topling/rockside/wiki)**

 While staying compatible with the RocksDB API, ToplingDB adds many important features and improvements:
 1. [SidePlugin](https://github.com/topling/rockside/wiki) lets users define DB configuration through json/yaml files
 1. [Embedded Http](https://github.com/topling/rockside/wiki/WebView) lets users view almost all DB information on the web; this is a sub-feature of [SidePlugin](https://github.com/topling/rockside/wiki)
 1. [Embedded Http](https://github.com/topling/rockside/wiki/WebView) lets users [change online](https://github.com/topling/rockside/wiki/Online-Change-Options) all kinds of db/cf options without restarting the process, including DB meta objects (such as MemTabFactory, TableFactory, WriteBufferManager ...)
 1. Many refactorings and improvements made for performance and extensibility, such as the MemTable refactoring
+1. MemTable can serve as the index of the WAL, eliminating the Flush from MemTable to L0 SST and reducing write amplification, which is friendly to large MemTables
 1. Improvements to transaction processing, especially Lock management in TransactionDB; hotspot code is more than 5x faster
 1. MultiGet uses fiber/coroutine + io_uring for concurrent IO, which is both faster and simpler than RocksDB's own async MultiGet, with more than 100x less code
 1. [De-virtualization](https://github.com/topling/rockside/wiki/Devirtualization-And-Key-Prefix-Cache-Principle) eliminates virtual function calls in hotspot code (mainly Comparator) and adds a key prefix cache; see the corresponding [benchmarks](https://github.com/topling/rockside/wiki/Devirtualization-And-Key-Prefix-Cache-Benchmark)
@@ -52,30 +70,6 @@ toplingdb

 To simplify the build, ToplingDB's Makefile automatically clones the github repository of each component. Community users can clone the public repositories, but cloning private ones (such as topling-rocks) fails, so a ToplingDB built by community users cannot create Topling**Zip**Table but can still read Topling**Zip**Table.

-## Run db_bench
-ToplingDB requires C++17; gcc 8.3 or newer is recommended, and clang also works.
-
-Even without Topling**Zip**Table, ToplingDB is much faster than RocksDB; you can verify the performance by running db_bench:
-```bash
-sudo yum -y install git libaio-devel gcc-c++ gflags-devel zlib-devel bzip2-devel libcurl-devel liburing-devel
-#sudo apt-get update -y && sudo apt-get install -y libjemalloc-dev libaio-dev libgflags-dev zlib1g-dev libbz2-dev libcurl4-gnutls-dev liburing-dev libsnappy-dev libbz2-dev liblz4-dev libzstd-dev
-git clone https://github.com/topling/toplingdb
-cd toplingdb
-make -j`nproc` db_bench DEBUG_LEVEL=0
-cp sideplugin/rockside/src/topling/web/{style.css,index.html} ${/path/to/dbdir}
-cp sideplugin/rockside/sample-conf/db_bench_*.yaml .
-export LD_LIBRARY_PATH=`find sideplugin -name lib_shared`
-# change db_bench_community.yaml as your needs
-# 1. use default path(/dev/shm) if you have no fast disk(such as a cloud server)
-# 2. change max_background_compactions to your cpu core num
-# 3. if you have github repo topling-rocks permissions, you can use db_bench_enterprise.yaml
-# 4. use db_bench_community.yaml is faster than upstream RocksDB
-# 5. use db_bench_enterprise.yaml is much faster than db_bench_community.yaml
-# command option -json can accept json and yaml files, here use yaml file for more human readable
-./db_bench -json=db_bench_community.yaml -num=10000000 -disable_wal=true -value_size=20 -benchmarks=fillrandom,readrandom -batch_size=10
-# you can access http://127.0.0.1:2011 to see webview
-# you can see this db_bench is much faster than RocksDB
-```
 ## Configurable features
 For performance and simplicity, ToplingDB disables some RocksDB features by default:

@@ -86,8 +80,6 @@ export LD_LIBRARY_PATH=`find sideplugin -name lib_shared`
 Wide columns | TOPLINGDB_WITH_WIDE_COLUMNS
 Flashy but impractical features | TOPLINGDB_WITH_FABRICATED_COMPLEXITY

-**Note**: SidePlugin does not yet support dynamic creation of ColumnFamily; when SidePlugin is mixed with dynamically created ColumnFamilies, the dynamically created ones cannot be shown in the Web view
-
 To enable these features, explicitly add `EXTRA_CXXFLAGS="-D${MACRO_1} -D${MACRO_2} ..."` to the make command, for example to build rocksdbjava with dynamic ColumnFamily creation:
 ```
 make -j`nproc` EXTRA_CXXFLAGS='-DROCKSDB_DYNAMIC_CREATE_CF' rocksdbjava

README.md

Lines changed: 28 additions & 36 deletions
@@ -1,21 +1,39 @@
 ## [中文版](README-zh_cn.md)
 ## ToplingDB: A Persistent Key-Value Store for External Storage
-ToplingDB is developed and maintained by [Topling Inc](https://topling.cn). It is built with [RocksDB](https://github.com/facebook/rocksdb). See [ToplingDB Branch Name Convention](https://github.com/topling/toplingdb/wiki/ToplingDB-Branch-Name-Convention).
+ToplingDB is developed and maintained by [Topling Inc](https://topling.cn). See [ToplingDB Branch Name Convention](https://github.com/topling/toplingdb/wiki/ToplingDB-Branch-Name-Convention).

-ToplingDB's submodule **[rockside](https://github.com/topling/rockside)** is the entry point of ToplingDB, see **[SidePlugin wiki](https://github.com/topling/rockside/wiki)**.
+## Quick Start
+ToplingDB requires C++17; gcc 8.3 or newer is recommended, and clang also works.
+
+ToplingDB is forked from [RocksDB](https://github.com/facebook/rocksdb) and is much faster than RocksDB; try it yourself:
+### Compile & run db_bench
+```bash
+sudo yum -y install git libaio-devel gcc-c++ gflags-devel zlib-devel bzip2-devel libcurl-devel liburing-devel snappy-devel jemalloc-devel
+#sudo apt-get update -y && sudo apt-get install -y libjemalloc-dev libaio-dev libgflags-dev zlib1g-dev libbz2-dev libcurl4-gnutls-dev liburing-dev libsnappy-dev libbz2-dev liblz4-dev libzstd-dev
+git clone https://github.com/topling/toplingdb
+cd toplingdb
+make -j`nproc` db_bench DEBUG_LEVEL=0
+sudo make install PREFIX=/some/path # default is /usr/local
+```
+
+After compiling, you can run the bundled [db_bench.sh](db_bench.sh) (needs [port 2011](https://github.com/topling/rockside/blob/master/sample-conf/db_bench_enterprise.yaml#L4 "the embedded http server uses port 2011")), then use ToplingDB [in C++](https://github.com/topling/sideplugin-wiki-en/wiki/101 "maybe migrating from rocksdb"), or in [Java](https://github.com/topling/sideplugin-wiki-en/wiki/SidePlugin-Java-Binding "Bundled in this repo") or [Rust](https://github.com/topling/rust-toplingdb "A separate repo").
+
+## Introduction
+ToplingDB's submodule **[rockside](https://github.com/topling/rockside)** is the entry point of ToplingDB, see **[SidePlugin wiki](https://github.com/topling/sideplugin-wiki-en/wiki)**.

 ToplingDB has many more key features than RocksDB:
-1. [SidePlugin](https://github.com/topling/rockside/wiki) enables users to write a json(or yaml) to define DB configs
-1. [Embedded Http Server](https://github.com/topling/rockside/wiki/WebView) enables users to view almost all DB info on web, this is a component of [SidePlugin](https://github.com/topling/rockside/wiki)
-1. [Embedded Http Server](https://github.com/topling/rockside/wiki/WebView) enables users to [online change](https://github.com/topling/rockside/wiki/Online-Change-Options) db/cf options and all db meta objects(such as MemTabFactory, TableFactory, WriteBufferManager ...) without restarting the running process
+1. [SidePlugin](https://github.com/topling/sideplugin-wiki-en/wiki) enables users to write a json(or yaml) to define DB configs
+1. [Embedded Http Server](https://github.com/topling/sideplugin-wiki-en/wiki/WebView) enables users to view almost all DB info on web, this is a component of [SidePlugin](https://github.com/topling/sideplugin-wiki-en/wiki)
+1. [Embedded Http Server](https://github.com/topling/sideplugin-wiki-en/wiki/WebView) enables users to [online change](https://github.com/topling/sideplugin-wiki-en/wiki/Online-Change-Options) db/cf options and all db meta objects(such as MemTabFactory, TableFactory, WriteBufferManager ...) without restarting the running process
 1. Many improvements and refactorings on RocksDB, aimed at performance and extensibility
+1. MemTable as WAL log index: omits the MemTable Flush to L0 and reduces write amplification, which is friendly to large MemTables
 1. Topling transaction lock management, 5x faster than rocksdb
 1. MultiGet with concurrent IO by fiber/coroutine + io_uring, much faster than RocksDB's async MultiGet
-1. Topling [de-virtualization](https://github.com/topling/rockside/wiki/Devirtualization-And-Key-Prefix-Cache-Principle), de-virtualize hotspot (virtual) functions, and key prefix caches, [benchmarks](https://github.com/topling/rockside/wiki/Devirtualization-And-Key-Prefix-Cache-Benchmark)
+1. Topling [de-virtualization](https://github.com/topling/sideplugin-wiki-en/wiki/Devirtualization-And-Key-Prefix-Cache-Principle), de-virtualize hotspot (virtual) functions, and key prefix caches, [benchmarks](https://github.com/topling/sideplugin-wiki-en/wiki/Devirtualization-And-Key-Prefix-Cache-Benchmark)
 1. Topling zero copy for point search(Get/MultiGet) and Iterator
 1. Topling memtable as log index, omit memtable flush to L0
 1. Builtin SidePlugin**s** for existing RocksDB components(Cache, Comparator, TableFactory, MemTableFactory...)
-1. Builtin Prometheus metrics support, this is based on [Embedded Http Server](https://github.com/topling/rockside/wiki/WebView)
+1. Builtin Prometheus metrics support, this is based on [Embedded Http Server](https://github.com/topling/sideplugin-wiki-en/wiki/WebView)
 1. Many bugfixes for RocksDB, a small part of such fixes was [Pull Requested](https://github.com/facebook/rocksdb/pulls?q=is%3Apr+author%3Arockeet) to [upstream RocksDB](https://github.com/facebook/rocksdb)

 ## ToplingDB cloud native DB services
@@ -48,38 +66,14 @@ toplingdb
 [ToplingDB](https://github.com/topling/toplingdb) | public | Top repository, forked from [RocksDB](https://github.com/facebook/rocksdb) with our fixes, refactorings and enhancements
 [rockside](https://github.com/topling/rockside) | public | This is a submodule, contains:<ul><li>SidePlugin framework and Builtin SidePlugin**s**</li><li>Embedded Http Server and Prometheus metrics</li></ul>
 [cspp-wbwi<br>(**W**rite**B**atch**W**ith**I**ndex)](https://github.com/topling/cspp-wbwi) | public | With CSPP and careful coding, **CSPP_WBWI** is 20x faster than rocksdb SkipList based WBWI
-[cspp-memtable](https://github.com/topling/cspp-memtable) | public | (**CSPP** is **C**rash **S**afe **P**arallel **P**atricia trie) MemTab, which outperforms SkipList on all aspects: 3x lower memory usage, 7x single thread performance, perfect multi-thread scaling)
-[topling-sst](https://github.com/topling/topling-sst) | public | 1. [SingleFastTable](https://github.com/topling/rockside/wiki/SingleFastTable)(designed for L0 and L1)<br/> 2. VecAutoSortTable(designed for MyTopling bulk_load).<br/> 3. Deprecated [ToplingFastTable](https://github.com/topling/rockside/wiki/ToplingFastTable), CSPPAutoSortTable
+[cspp-memtable](https://github.com/topling/cspp-memtable/blob/memtable_as_log_index/README_EN.md) | public | (**CSPP** is **C**rash **S**afe **P**arallel **P**atricia trie) MemTab, which outperforms SkipList on all aspects: 3x lower memory usage, 7x single thread performance, perfect multi-thread scaling)
+[topling-sst](https://github.com/topling/topling-sst) | public | 1. [SingleFastTable](https://github.com/topling/sideplugin-wiki-en/wiki/SingleFastTable)(designed for L0 and L1)<br/> 2. VecAutoSortTable(designed for MyTopling bulk_load).<br/> 3. Deprecated [ToplingFastTable](https://github.com/topling/sideplugin-wiki-en/wiki/ToplingFastTable), CSPPAutoSortTable
 [topling-dcompact](https://github.com/topling/topling-dcompact) | public | Distributed Compaction with general dcompact_worker application, offload compactions to elastic computing clusters, much more powerful than RocksDB's Remote Compaction
-[topling-rocks](https://github.com/topling/topling-rocks) | **private** | For building [Topling**Zip**Table](https://github.com/topling/rockside/wiki/ToplingZipTable), an SST implementation optimized for RAM and SSD space, aimed for L2+ level compaction, which uses topling dedicated searchable in-memory data compression algorithms
+[topling-rocks](https://github.com/topling/topling-rocks) | **private** | For building [Topling**Zip**Table](https://github.com/topling/sideplugin-wiki-en/wiki/ToplingZipTable), an SST implementation optimized for RAM and SSD space, aimed for L2+ level compaction, which uses topling dedicated searchable in-memory data compression algorithms
 [topling-zip_table_reader](https://github.com/topling/topling-zip_table_reader) | public | For reading Topling**Zip**Table by community users, builder of Topling**Zip**Table is in [topling-rocks](https://github.com/topling/topling-rocks)

 To simplify compiling, repo**s** are auto cloned by ToplingDB's Makefile. Community users will auto clone the public repos successfully but fail to auto clone the **private** repo, so ToplingDB is built without **private** components; this is the so-called **community** version.

-## Run db_bench
-ToplingDB requires C++17; gcc 8.3 or newer is recommended, and clang also works.
-
-Even without ToplingZipTable, ToplingDB is much faster than upstream RocksDB:
-```bash
-sudo yum -y install git libaio-devel gcc-c++ gflags-devel zlib-devel bzip2-devel libcurl-devel liburing-devel snappy-devel jemalloc-devel
-#sudo apt-get update -y && sudo apt-get install -y libjemalloc-dev libaio-dev libgflags-dev zlib1g-dev libbz2-dev libcurl4-gnutls-dev liburing-dev libsnappy-dev libbz2-dev liblz4-dev libzstd-dev
-git clone https://github.com/topling/toplingdb
-cd toplingdb
-make -j`nproc` db_bench DEBUG_LEVEL=0
-cp sideplugin/rockside/src/topling/web/{style.css,index.html} ${/path/to/dbdir}
-cp sideplugin/rockside/sample-conf/db_bench_*.yaml .
-export LD_LIBRARY_PATH=`find sideplugin -name lib_shared`
-# change db_bench_community.yaml as your needs
-# 1. use default path(/dev/shm) if you have no fast disk(such as a cloud server)
-# 2. change max_background_compactions to your cpu core num
-# 3. if you have github repo topling-rocks permissions, you can use db_bench_enterprise.yaml
-# 4. use db_bench_community.yaml is faster than upstream RocksDB
-# 5. use db_bench_enterprise.yaml is much faster than db_bench_community.yaml
-# command option -json can accept json and yaml files, here use yaml file for more human readable
-./db_bench -json=db_bench_community.yaml -num=10000000 -disable_wal=true -value_size=20 -benchmarks=fillrandom,readrandom -batch_size=10
-# you can access http://127.0.0.1:2011 to see webview
-# you can see this db_bench is much faster than RocksDB
-```
 ## Configurable features
 For performance and simplicity, ToplingDB disables some RocksDB features by default:

@@ -90,8 +84,6 @@ User level timestamp on key | TOPLINGDB_WITH_TIMESTAMP
 Wide Columns | TOPLINGDB_WITH_WIDE_COLUMNS
 fabricated features for read | TOPLINGDB_WITH_FABRICATED_COMPLEXITY

-**Note**: Dynamic creation of ColumnFamily is not supported by SidePlugin
-
 To enable these features, add `-D${MACRO_NAME}` to the variable `EXTRA_CXXFLAGS`, for example to build ToplingDB for Java with dynamic ColumnFamily:
 ```
 make -j`nproc` EXTRA_CXXFLAGS='-DROCKSDB_DYNAMIC_CREATE_CF' rocksdbjava

db/arena_wrapped_db_iter.cc

Lines changed: 32 additions & 0 deletions
@@ -33,6 +33,38 @@ Status Iterator::RefreshKeepSnapshot(bool keep_iter_pos) {
   return Refresh(reinterpret_cast<Snapshot*>(KEEP_SNAPSHOT), keep_iter_pos);
 }

+Slice Iterator::NextWithKey() { return IterNextWithKeyImpl(this); }
+Slice Iterator::PrevWithKey() { return IterPrevWithKeyImpl(this); }
+
+Slice Iterator::SeekToFirstWithKey() {
+  SeekToFirst();
+  if (Valid())
+    return key();
+  else
+    return Slice(nullptr, 0);
+}
+Slice Iterator::SeekToLastWithKey() {
+  SeekToLast();
+  if (Valid())
+    return key();
+  else
+    return Slice(nullptr, 0);
+}
+Slice Iterator::SeekWithKey(const Slice& target) {
+  Seek(target);
+  if (Valid())
+    return key();
+  else
+    return Slice(nullptr, 0);
+}
+Slice Iterator::SeekForPrevWithKey(const Slice& target) {
+  SeekForPrev(target);
+  if (Valid())
+    return key();
+  else
+    return Slice(nullptr, 0);
+}
+
 ArenaWrappedDBIter::ArenaWrappedDBIter() {
   // do nothing
 }
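
As a reading aid for the `*WithKey` methods added in this file: each one appears to fold the cursor move, the `Valid()` check and the `key()` fetch into a single call, returning `Slice(nullptr, 0)` once the iterator is no longer valid. The sketch below is not ToplingDB code. `MiniSlice`, `MiniIter` and `KeysFrom` are hypothetical stand-ins over a `std::map`, and it assumes `NextWithKey()` follows the same null-slice convention as the `Seek*WithKey` bodies in the hunk, since `IterNextWithKeyImpl` is not part of this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for rocksdb::Slice: a non-owning (pointer, length) view.
struct MiniSlice {
  const char* data = nullptr;
  std::size_t size = 0;
  MiniSlice() = default;
  MiniSlice(const char* d, std::size_t n) : data(d), size(n) {}
};

// Hypothetical in-memory iterator mirroring the *WithKey contract in the diff:
// each call moves the cursor and returns the new key, or a slice with a null
// data pointer when the iterator is not valid after the move.
class MiniIter {
 public:
  explicit MiniIter(const std::map<std::string, std::string>& m)
      : map_(m), pos_(m.end()) {}

  bool Valid() const { return pos_ != map_.end(); }
  MiniSlice key() const {
    return MiniSlice(pos_->first.data(), pos_->first.size());
  }

  MiniSlice SeekToFirstWithKey() {
    pos_ = map_.begin();
    return KeyOrNull();
  }
  MiniSlice SeekWithKey(const std::string& target) {
    pos_ = map_.lower_bound(target);  // first key >= target
    return KeyOrNull();
  }
  MiniSlice NextWithKey() {
    assert(Valid());  // same precondition as a plain Next()
    ++pos_;
    return KeyOrNull();
  }

 private:
  MiniSlice KeyOrNull() const { return Valid() ? key() : MiniSlice(); }

  const std::map<std::string, std::string>& map_;
  std::map<std::string, std::string>::const_iterator pos_;
};

// Collects every key >= start with one call per step instead of the usual
// Next() + Valid() + key() triple.
std::vector<std::string> KeysFrom(MiniIter& it, const std::string& start) {
  std::vector<std::string> out;
  for (MiniSlice k = it.SeekWithKey(start); k.data != nullptr;
       k = it.NextWithKey()) {
    out.emplace_back(k.data, k.size);
  }
  return out;
}
```

In this sketch the loop tests `data` rather than `size`, because an empty key still carries a non-null pointer while the end-of-iteration marker does not.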

db/arena_wrapped_db_iter.h

Lines changed: 3 additions & 0 deletions
@@ -71,6 +71,9 @@ class ArenaWrappedDBIter final : public Iterator {
   }
   void Next() override { db_iter_->Next(); }
   void Prev() override { db_iter_->Prev(); }
+  Slice NextWithKey() override { return db_iter_->NextWithKey(); }
+  Slice PrevWithKey() override { return db_iter_->PrevWithKey(); }
+
   ROCKSDB_FLATTEN
   Slice key() const override { return db_iter_->key(); }
   ROCKSDB_FLATTEN

db/blob/blob_fetcher.h

Lines changed: 2 additions & 1 deletion
@@ -40,7 +40,8 @@ class BlobFetcherCopyReadOptions : public BlobFetcher {
   const ReadOptions read_options_copy_;
  public:
   BlobFetcherCopyReadOptions(const Version* v, const ReadOptions& ro)
-      : BlobFetcher(v, read_options_copy_), read_options_copy_(ro) {}
+      : BlobFetcher(v, read_options_copy_),
+        read_options_copy_(ro, ReadOptions::BooleanDontCopyTrue()) {}
 };

 }  // namespace ROCKSDB_NAMESPACE
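
The constructor in this hunk hands the base class a reference to `read_options_copy_` before that member has been initialized, because base classes are constructed first. This is well-defined as long as the base constructor only binds the reference and does not read through it, which is presumably what `BlobFetcher` does (its definition is not shown in this diff). A minimal sketch of the same shape follows, using hypothetical `Opts`, `Fetcher` and `FetcherWithCopy` types. None of them are RocksDB names, and the sketch makes no claim about what `ReadOptions::BooleanDontCopyTrue()` does.

```cpp
#include <cassert>

// Hypothetical stand-in for ReadOptions.
struct Opts {
  int readahead = 0;
};

// The base keeps only a reference. Its constructor must not read through the
// reference, because the referenced object may not be initialized yet.
class Fetcher {
 public:
  explicit Fetcher(const Opts& o) : opts_(o) {}
  int readahead() const { return opts_.readahead; }

 private:
  const Opts& opts_;
};

// The derived class owns a private copy of the options. The base is built
// first and receives a reference to copy_; copy_ itself is initialized right
// after, before any member function can observe it.
class FetcherWithCopy : public Fetcher {
  const Opts copy_;

 public:
  explicit FetcherWithCopy(const Opts& o) : Fetcher(copy_), copy_(o) {}
};

// Shows that the fetcher is insulated from later changes made by the caller.
int ReadaheadAfterCallerChange() {
  Opts caller;
  caller.readahead = 8;
  FetcherWithCopy f(caller);
  assert(f.readahead() == 8);  // value captured at construction time
  caller.readahead = 99;       // does not reach the private copy
  return f.readahead();
}
```

The point of the pattern is lifetime: the fetcher can outlive the caller's options object without holding a dangling reference.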

db/c.cc

Lines changed: 31 additions & 0 deletions
@@ -1556,6 +1556,37 @@ void rocksdb_batched_multi_get_cf(rocksdb_t* db,
   delete[] statuses;
 }

+ROCKSDB_LIBRARY_API
+void rocksdb_batched_multi_get_cf_fast(rocksdb_t* db,
+    const rocksdb_readoptions_t* options,
+    rocksdb_column_family_handle_t* column_family,
+    size_t num_keys, const rocksdb_slice_t* keys_list,
+    rocksdb_pinnableslice_t** values, char** errs,
+    const bool sorted_input) {
+  PinnableSlice* value_slices = new PinnableSlice[num_keys];
+  Status* statuses = new Status[num_keys];
+
+  db->rep->MultiGet(options->rep, column_family->rep, num_keys, keys_list,
+                    value_slices, statuses, sorted_input);
+
+  for (size_t i = 0; i < num_keys; ++i) {
+    if (statuses[i].ok()) {
+      values[i] = new (rocksdb_pinnableslice_t);
+      values[i]->rep = std::move(value_slices[i]);
+      errs[i] = nullptr;
+    } else {
+      values[i] = nullptr;
+      if (!statuses[i].IsNotFound()) {
+        errs[i] = strdup(statuses[i].ToString().c_str());
+      } else {
+        errs[i] = nullptr;
+      }
+    }
+  }
+  delete[] value_slices;
+  delete[] statuses;
+}
+
 unsigned char rocksdb_key_may_exist(rocksdb_t* db,
                                     const rocksdb_readoptions_t* options,
                                     const char* key, size_t key_len,
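
The new `rocksdb_batched_multi_get_cf_fast` reports each key through two parallel output arrays. Reading the loop in the hunk, a slot ends in one of three states: `values[i]` set (found), both entries null (not found), or `errs[i]` set to a `strdup`'d message (any other failure). The sketch below only models that convention with plain pointers so it can run without linking ToplingDB. `Outcome`, `Classify`, `Summary` and `SummarizeAndFreeErrors` are hypothetical names, not part of the C API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

enum class Outcome { kFound, kNotFound, kError };

// Classifies one slot using the convention visible in the diff:
//   value != nullptr                  -> key found
//   value == nullptr, err == nullptr  -> key not found
//   value == nullptr, err != nullptr  -> read failed, err holds a message
Outcome Classify(const void* value, const char* err) {
  if (value != nullptr) return Outcome::kFound;
  return err == nullptr ? Outcome::kNotFound : Outcome::kError;
}

struct Summary {
  std::size_t found = 0;
  std::size_t not_found = 0;
  std::size_t errors = 0;
};

// Hypothetical helper: walks the parallel arrays, counts the outcomes, and
// frees the malloc-family error strings so they do not leak.
Summary SummarizeAndFreeErrors(std::size_t n, void* const* values,
                               char** errs) {
  assert(n == 0 || (values != nullptr && errs != nullptr));
  Summary s;
  for (std::size_t i = 0; i < n; ++i) {
    switch (Classify(values[i], errs[i])) {
      case Outcome::kFound:
        ++s.found;
        break;
      case Outcome::kNotFound:
        ++s.not_found;
        break;
      case Outcome::kError:
        ++s.errors;
        std::free(errs[i]);  // strdup() allocates with malloc
        errs[i] = nullptr;
        break;
    }
  }
  return s;
}
```

With the real API, each non-null `values[i]` would also have to be released by the caller, presumably with `rocksdb_pinnableslice_destroy` as for the existing batched call.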
