Commit 161ab2e

Working version
1 parent 4537541 commit 161ab2e

File tree

9 files changed

+375
-24
lines changed

README.md

Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
1+
# Sqlite Index Blaster for Python
2+
3+
This library provides an API for creating huge Sqlite indexes at breakneck speed. It handles millions of records much faster than the official SQLite library by leaving out crash recovery.
4+
5+
This repo exploits a [lesser known feature of the Sqlite database file format](https://www.sqlite.org/withoutrowid.html) to store records as key-value pairs or documents or regular tuples.
6+
7+
This repo is a `pybind11` wrapper for the C++ lib at https://github.com/siara-cc/sqlite_blaster
8+
9+
# Statement of need
10+
11+
There are a number of choices available for fast insertion of records, such as RocksDB, LMDB and MongoDB, but even they are slow because of the overhead of the logs or journals they use to provide durability. These overheads are significant when indexing huge datasets.
12+
13+
This library was created for inserting/updating billions of entries for arriving at word/phrase frequencies for building dictionaries for the [Unishox](https://github.com/siara-cc/Unishox) project using publicly available texts and conversations.
14+
15+
Furthermore, the other choices do not have as many IDEs, or the querying abilities, that are available for the widely used Sqlite data format.
16+
17+
# Applications
18+
19+
- Lightning fast index creation for huge datasets
20+
- Fast database indexing for embedded systems
21+
- Fast data set creation and loading for Data Science and Machine Learning
22+
23+
# Performance
24+
25+
The performance of this repo was compared with the Sqlite official library, LMDB and RocksDB under similar conditions of CPU, RAM and NVMe disk and the results are shown below:
26+
27+
![Performance](misc/performance.png?raw=true)
28+
29+
RocksDB performs much better than the other choices and performs consistently for over a billion entries, but it is quite slow initially.
30+
31+
The chart data can be found [here](https://github.com/siara-cc/sqlite_blaster/blob/master/SqliteBlasterPerformanceLineChart.xlsx?raw=true).
32+
33+
# Building and running tests
34+
35+
Clone this repo and run:
36+
37+
```sh
38+
python3 setup.py test
39+
```
40+
41+
Note: This only builds the module. To run tests, install `pytest` and run:
42+
43+
```sh
44+
pip3 install pytest
45+
pytest
46+
```
47+
48+
To install the module, run:
49+
50+
```sh
51+
mkdir build
52+
cd build
53+
cmake ..
54+
make
55+
pip3 install ./sqlite_blaster_python
56+
```
57+
58+
# Getting started
59+
60+
Essentially, the library provides 4 methods `put_string()`, `get_string()`, `put_rec()`, `get_rec()` for inserting and retrieving records. Shown below are examples of how this library can be used to create a key-value store, or a document store or a regular table.
61+
62+
Note: The cache size is set to 40 KB in these examples, but in practice 32 MB or 64 MB would be ideal. The higher this number, the better the performance.
63+
64+
## Creating a Key-Value store
65+
66+
In this mode, a table is created with just 2 columns, `key` and `value` as shown below:
67+
68+
```python
69+
import sqlite_blaster_python
70+
71+
col_names = ["key", "value"]
72+
sqib = sqlite_blaster_python.sqlite_index_blaster(2, 1, col_names, "kv_index", 4096, 40, "kv_idx.db")
73+
sqib.put_string("hello", "world")
74+
sqib.close()
75+
```
76+
77+
A file `kv_idx.db` is created and can be verified by opening it using `sqlite3` official console program:
78+
79+
```sh
80+
sqlite3 kv_idx.db ".dump"
81+
```
82+
83+
and the output would be:
84+
85+
```sql
86+
PRAGMA foreign_keys=OFF;
87+
BEGIN TRANSACTION;
88+
CREATE TABLE kv_index (key, value, PRIMARY KEY (key)) WITHOUT ROWID;
89+
INSERT INTO kv_index VALUES('hello','world');
90+
COMMIT;
91+
```
92+
93+
To retrieve the inserted values, use the `get_string` method as shown below:
94+
95+
```python
96+
import sqlite_blaster_python
97+
98+
col_names = ["key", "value"]
99+
sqib = sqlite_blaster_python.sqlite_index_blaster(2, 1, col_names, "kv_index", 4096, 40, "kv_idx.db")
100+
sqib.put_string("hello", "world")
101+
print("Value of hello is", sqib.get_string("hello", "not_found"))
102+
sqib.close()
103+
```
104+
105+
The second parameter to `get_string` is for specifying what value is to be returned when the 1st parameter could not be found in the database index.
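
Since the index is written as an ordinary Sqlite file, the same lookup can presumably also be done with Python's standard `sqlite3` module. The sketch below is illustrative only: it builds an in-memory stand-in for `kv_idx.db` with the schema shown in the dump above, so that it runs without the extension installed, and `read_value` is a hypothetical helper, not part of this library.

```python
import sqlite3

# Stand-in for kv_idx.db: an in-memory database with the schema from the dump above.
# To inspect a real file, use sqlite3.connect("kv_idx.db") instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv_index (key, value, PRIMARY KEY (key)) WITHOUT ROWID")
conn.execute("INSERT INTO kv_index VALUES ('hello', 'world')")


def read_value(conn, key, default):
    """Hypothetical helper: return the value stored for key, or default if absent."""
    row = conn.execute(
        "SELECT value FROM kv_index WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row is not None else default


print("Value of hello is", read_value(conn, "hello", "not_found"))
print("Value of missing is", read_value(conn, "missing", "not_found"))
```

Reading through `sqlite3` is only sensible after `close()` has been called on the writer, since the library leaves out crash recovery.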
106+
107+
## Creating a Document store
108+
109+
In this mode, a table is created with just 2 columns, `key` and `doc` as shown below:
110+
111+
```python
112+
import sqlite_blaster_python
113+
114+
json1 = '{"name": "Alice", "age": 25, "email": "[email protected]"}'
115+
json2 = '{"name": "George", "age": 32, "email": "[email protected]"}'
116+
117+
col_names = ["key", "doc"]
118+
sqib = sqlite_blaster_python.sqlite_index_blaster(2, 1, col_names, "doc_index", 4096, 40, "doc_store.db")
119+
sqib.put_string("primary_contact", json1)
120+
sqib.put_string("secondary_contact", json2)
121+
sqib.close()
122+
```
123+
124+
The index is created as `doc_store.db` and the json values can be queried using `sqlite3` console as shown below:
125+
126+
```sql
127+
SELECT json_extract(doc, '$.email') AS email
128+
FROM doc_index
129+
WHERE key = 'primary_contact';
130+
```
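
The same query can be sketched from Python. The example below is an assumption-laden illustration: it recreates a `doc_index` table in memory with the standard `sqlite3` module (a stand-in for `doc_store.db`, with a schema modeled on the key-value dump shown earlier), decodes the document with the `json` module rather than relying on `json_extract` being compiled in, and uses a placeholder email address. `get_field` is a hypothetical helper.

```python
import json
import sqlite3

# In-memory stand-in for doc_store.db; the schema is assumed, not taken from the library.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_index (key, doc, PRIMARY KEY (key)) WITHOUT ROWID")

# Placeholder document; the email address is made up for illustration.
doc = json.dumps({"name": "Alice", "age": 25, "email": "alice@example.com"})
conn.execute("INSERT INTO doc_index VALUES (?, ?)", ("primary_contact", doc))


def get_field(conn, key, field):
    """Hypothetical helper: return one field of the JSON document stored under key."""
    row = conn.execute("SELECT doc FROM doc_index WHERE key = ?", (key,)).fetchone()
    if row is None:
        return None
    return json.loads(row[0]).get(field)


print(get_field(conn, "primary_contact", "email"))
```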
131+
132+
## Creating a regular table
133+
134+
This repo can be used to create regular tables with primary key(s) as shown below:
135+
136+
```python
137+
import sqlite_blaster_python
138+
139+
col_names = ["student_name", "age", "maths_marks", "physics_marks", "chemistry_marks", "average_marks"]
140+
sqib = sqlite_blaster_python.sqlite_index_blaster(6, 2, col_names, "student_marks", 4096, 40, "student_marks.db")
141+
142+
sqib.put_rec(["Robert", 19, 80, 69, 98, round((80+69+98)/3, 2)])
143+
sqib.put_rec(["Barry", 20, 82, 99, 83, round((82+99+83)/3, 2)])
144+
sqib.put_rec(["Elizabeth", 23, 84, 89, 74, round((84+89+74)/3, 2)])
145+
146+
print(sqib.get_rec(["Elizabeth", 23]))
147+
148+
sqib.close()
149+
```
150+
151+
The index is created as `student_marks.db` and the data can be queried using `sqlite3` console as shown below:
152+
153+
```sh
154+
sqlite3 student_marks.db "select * from student_marks"
155+
Barry|20|82|99|83|88.0
156+
Elizabeth|23|84|89|74|82.33
157+
Robert|19|80|69|98|82.33
158+
```
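
The rows come back sorted by name rather than in insertion order, which is what one would expect if the table is stored as a clustered index on its key columns. The sketch below illustrates this with the standard `sqlite3` module only. The `CREATE TABLE` statement is an assumption modeled on the key-value dump shown earlier, with the first two columns as the composite primary key (matching `pk_col_count` of 2), and is not taken from the library's output.

```python
import sqlite3

# In-memory stand-in for student_marks.db; the schema below is assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_marks (student_name, age, maths_marks, physics_marks, "
    "chemistry_marks, average_marks, PRIMARY KEY (student_name, age)) WITHOUT ROWID"
)

rows = [
    ("Robert", 19, 80, 69, 98, round((80 + 69 + 98) / 3, 2)),
    ("Barry", 20, 82, 99, 83, round((82 + 99 + 83) / 3, 2)),
    ("Elizabeth", 23, 84, 89, 74, round((84 + 89 + 74) / 3, 2)),
]
conn.executemany("INSERT INTO student_marks VALUES (?, ?, ?, ?, ?, ?)", rows)

# List the names in primary-key order.
names = [
    r[0]
    for r in conn.execute(
        "SELECT student_name FROM student_marks ORDER BY student_name, age"
    )
]

# Look up one record by its full composite key, similar in spirit to get_rec().
elizabeth = conn.execute(
    "SELECT * FROM student_marks WHERE student_name = ? AND age = ?",
    ("Elizabeth", 23),
).fetchone()

print(names)
print(elizabeth)
```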
159+
160+
## Constructor parameters of sqlite_index_blaster class
161+
162+
1. `total_col_count` - Total column count in the index
163+
2. `pk_col_count` - Number of columns to use as key. These columns have to be positioned at the beginning
164+
3. `col_names` - Column names to create the table
165+
4. `tbl_name` - Table (clustered index) name
166+
5. `block_sz` - Page size (must be one of 512, 1024, 2048, 4096, 8192, 16384, 32768 or 65536)
167+
6. `cache_sz` - Size of LRU cache in kilobytes. 32 MB or 64 MB (that is, 32768 or 65536) would be ideal. Higher values lead to better performance
168+
7. `fname` - Name of the Sqlite database file
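
The parameter list above can be turned into a small pre-flight check. The following is a minimal sketch, not part of the library: `check_blaster_args` and `mb_to_kb` are hypothetical helpers, and the rules that `col_names` must have exactly `total_col_count` entries and that at least one column must be a key are assumptions inferred from the examples rather than documented behaviour.

```python
VALID_PAGE_SIZES = (512, 1024, 2048, 4096, 8192, 16384, 32768, 65536)


def mb_to_kb(megabytes):
    """Convert a cache size in MB to the kilobyte value expected by cache_sz."""
    return megabytes * 1024


def check_blaster_args(total_col_count, pk_col_count, col_names, tbl_name,
                       block_sz, cache_sz, fname):
    """Hypothetical check; returns the arguments as a tuple if they look consistent."""
    if len(col_names) != total_col_count:  # assumed rule
        raise ValueError("col_names must have total_col_count entries")
    if not 1 <= pk_col_count <= total_col_count:  # assumed rule
        raise ValueError("pk_col_count must be between 1 and total_col_count")
    if block_sz not in VALID_PAGE_SIZES:
        raise ValueError("block_sz must be one of %s" % (VALID_PAGE_SIZES,))
    if cache_sz <= 0:
        raise ValueError("cache_sz must be a positive number of kilobytes")
    return (total_col_count, pk_col_count, col_names, tbl_name,
            block_sz, cache_sz, fname)


args = check_blaster_args(2, 1, ["key", "value"], "kv_index",
                          4096, mb_to_kb(32), "kv_idx.db")
print(args)
# The tuple could then be passed on as:
# sqlite_blaster_python.sqlite_index_blaster(*args)
```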
169+
170+
# Limitations
171+
172+
- No crash recovery. If the insertion process is interrupted, the database would be unusable.
173+
- The record length cannot change on update. Updating a record with a shorter or longer one is not implemented yet.
174+
- Deletes are not implemented yet. This library is intended primarily for fast inserts.
175+
- Support for concurrent inserts is not implemented yet.
176+
- The regular ROWID table of Sqlite is not implemented.
177+
- Key lengths are limited depending on page size as shown in the table below. This is just because the source code does not implement support for longer keys. However, this is considered sufficient for most practical purposes.
178+
179+
| **Page Size** | **Max Key Length** |
180+
| ------------- | ------------------ |
181+
| 512 | 35 |
182+
| 1024 | 99 |
183+
| 2048 | 227 |
184+
| 4096 | 484 |
185+
| 8192 | 998 |
186+
| 16384 | 2026 |
187+
| 32768 | 4082 |
188+
| 65536 | 8194 |
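
The table can be used to check keys before inserting them. The sketch below is a hypothetical helper, not part of the library. It assumes the limit is counted in bytes of the UTF-8 encoded key, which the text above does not state, and it does not attempt to model how multi-column keys are measured.

```python
# Maximum key length per page size, copied from the table above.
MAX_KEY_LEN = {
    512: 35,
    1024: 99,
    2048: 227,
    4096: 484,
    8192: 998,
    16384: 2026,
    32768: 4082,
    65536: 8194,
}


def key_fits(key, page_size):
    """Return True if key is within the limit for page_size (assumed to be in bytes)."""
    return len(key.encode("utf-8")) <= MAX_KEY_LEN[page_size]


print(key_fits("hello", 4096))
print(key_fits("x" * 36, 512))
```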
189+
190+
# License
191+
192+
Sqlite Index Blaster and its command line tools are dual-licensed under the MIT license and the AGPL-3.0. Users may choose one of the above.
193+
194+
- The MIT License
195+
- The GNU Affero General Public License v3 (AGPL-3.0)
196+
197+
# Credits
198+
199+
- The template for developing this Python binding was taken from the `pybind` repo https://github.com/pybind/cmake_example
200+
- `ChatGPT` was used in quickly figuring out the intricacies of `pybind11`
201+
202+
# Support
203+
204+
If you face any problem, create an issue in this repository, or write to the author (Arundale Ramanathan) at [email protected].

kv_idx.db

8 KB
Binary file not shown.

src/main.cpp

Lines changed: 139 additions & 3 deletions
@@ -7,6 +7,114 @@
77

88
namespace py = pybind11;
99

10+
#define SQIB_UNHANDLED_TYPE_PASSED 1
11+
#define SQIB_MALFORMED_REC 2
12+
13+
static const uint8_t int_type_from_len[] = {0, 1, 2, 3, 4, 0, 5, 0, 6};
14+
int fill_col_arr(py::list py_args, void **col_arr, size_t *col_lens, uint8_t *col_types) {
15+
int col_idx = 0;
16+
int est_len = 0;
17+
for (auto item : py_args) {
18+
if (py::isinstance<py::str>(item)) {
19+
std::string s = py::cast<std::string>(item);
20+
col_arr[col_idx] = (void *) malloc(s.length() + 1);
21+
memcpy(col_arr[col_idx], s.c_str(), s.length());
22+
col_lens[col_idx] = s.length();
23+
col_types[col_idx] = 13;
24+
//col_arr[col_idx][s.length()] = '\0';
25+
est_len += s.length();
26+
} else if (py::isinstance<py::int_>(item)) {
27+
int i = py::cast<int>(item);
28+
col_arr[col_idx] = (void *) malloc(sizeof(int));
29+
memcpy(col_arr[col_idx], &i, sizeof(int));
30+
col_lens[col_idx] = sizeof(int);
31+
col_types[col_idx] = int_type_from_len[sizeof(int)];
32+
est_len += sizeof(int);
33+
} else if (py::isinstance<py::float_>(item)) {
34+
double d = py::cast<double>(item);
35+
col_arr[col_idx] = (void *) malloc(sizeof(double));
36+
memcpy(col_arr[col_idx], &d, sizeof(double));
37+
col_lens[col_idx] = sizeof(double);
38+
col_types[col_idx] = 7;
39+
est_len += sizeof(double);
40+
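} else throw SQIB_UNHANDLED_TYPE_PASSED; // unsupported type: fail here, because col_idx is incremented for every item and the size check below can never fire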
41+
col_idx++;
42+
}
43+
if (col_idx != py_args.size())
44+
throw SQIB_UNHANDLED_TYPE_PASSED;
45+
return est_len;
46+
}
47+
48+
py::list get_values(sqlite_index_blaster& self, uint8_t *rec, int rec_len) {
49+
py::list result;
50+
int8_t vlen;
51+
int col_type_or_len, col_len, col_type;
52+
int hdr_len = self.read_vint32(rec, &vlen);
53+
int hdr_pos = vlen;
54+
uint8_t *data_ptr = rec + hdr_len;
55+
col_len = vlen = 0;
56+
do {
57+
data_ptr += col_len;
58+
hdr_pos += vlen;
59+
if (data_ptr - rec > rec_len)
60+
break;
61+
if (hdr_pos >= hdr_len)
62+
break;
63+
col_type_or_len = self.read_vint32(rec + hdr_pos, &vlen);
64+
col_len = self.derive_data_len(col_type_or_len);
65+
col_type = self.derive_col_type(col_type_or_len);
66+
switch (col_type) {
67+
case SQLT_TYPE_NULL:
68+
case SQLT_TYPE_BLOB:
69+
case SQLT_TYPE_TEXT: {
70+
std::string str_val((const char *) data_ptr, col_len);
71+
result.append(py::str(str_val));
72+
}
73+
break;
74+
case SQLT_TYPE_INT0:
75+
result.append(py::int_(0));
76+
break;
77+
case SQLT_TYPE_INT1:
78+
result.append(py::int_(1));
79+
break;
80+
case SQLT_TYPE_INT8:
81+
result.append(py::int_(*data_ptr));
82+
break;
83+
case SQLT_TYPE_INT16: {
84+
int int_val = self.read_uint16(data_ptr);
85+
result.append(py::int_(int_val));
86+
}
87+
break;
88+
case SQLT_TYPE_INT24: {
89+
int32_t int_val = self.read_uint24(data_ptr);
90+
result.append(py::int_(int_val));
91+
}
92+
break;
93+
case SQLT_TYPE_INT32: {
94+
int32_t int_val = self.read_uint32(data_ptr);
95+
result.append(py::int_(int_val));
96+
}
97+
break;
98+
case SQLT_TYPE_INT48: {
99+
int int_val = self.read_int48(data_ptr);
100+
result.append(py::int_(int_val));
101+
}
102+
break;
103+
case SQLT_TYPE_INT64: {
104+
int64_t int_val = self.read_uint64(data_ptr);
105+
result.append(py::int_(int_val));
106+
}
107+
break;
108+
case SQLT_TYPE_REAL: {
109+
double dbl_val = self.read_double(data_ptr);
110+
result.append(PyFloat_FromDouble(dbl_val));
111+
}
112+
break;
113+
}
114+
} while (hdr_pos < hdr_len);
115+
return result;
116+
}
117+
10118
PYBIND11_MODULE(sqlite_blaster_python, m) {
11119
m.doc() = R"pbdoc(
12120
Pybind11 bindings for sqlite_blaster_python
@@ -22,12 +130,40 @@ PYBIND11_MODULE(sqlite_blaster_python, m) {
22130
)pbdoc";
23131
py::class_<sqlite_index_blaster>(m, "sqlite_index_blaster")
24132
.def(py::init<int, int,
25-
vector<string>, const char *,
133+
std::vector<std::string>, const char *,
26134
int, int,
27135
const char *>())
28-
.def("get_string", &sqlite_index_blaster::get_string)
136+
.def("close", &sqlite_index_blaster::close)
29137
.def("put_string", &sqlite_index_blaster::put_string)
30-
.def("close", &sqlite_index_blaster::close);
138+
.def("get_string", &sqlite_index_blaster::get_string)
139+
.def("put_rec", [](sqlite_index_blaster& self, py::list py_args) {
140+
int col_count = py_args.size();
141+
void *col_arr[col_count];
142+
size_t col_lens[col_count];
143+
uint8_t col_types[col_count];
144+
int est_len = fill_col_arr(py_args, col_arr, col_lens, col_types);
145+
uint8_t rec[est_len + col_count * 3 + 3];
146+
int rec_len = self.make_new_rec(rec, col_count, (const void **) col_arr, col_lens, col_types);
147+
return self.put(rec, -rec_len, NULL, 0);
148+
})
149+
.def("get_rec", [](sqlite_index_blaster& self, py::list py_args) {
150+
int col_count = py_args.size();
151+
void *col_arr[col_count];
152+
size_t col_lens[col_count];
153+
uint8_t col_types[col_count];
154+
int est_len = fill_col_arr(py_args, col_arr, col_lens, col_types);
155+
uint8_t rec[est_len + col_count * 3 + 3];
156+
int rec_len = self.make_new_rec(rec, col_count, (const void **) col_arr, col_lens, col_types);
157+
int out_len;
158+
py::list result;
159+
if (self.get(rec, -rec_len, &out_len)) {
160+
uint8_t *val = (uint8_t *) malloc(out_len);
161+
self.copy_value(val, &out_len);
162+
result = get_values(self, val, out_len);
163+
free(val);
164+
}
165+
return result;
166+
});
31167

32168
#ifdef VERSION_INFO
33169
m.attr("__version__") = MACRO_STRINGIFY(VERSION_INFO);

student_marks

8 KB
Binary file not shown.

student_marks.db

8 KB
Binary file not shown.

test_sqib.py

Lines changed: 0 additions & 13 deletions
This file was deleted.

tests/test_basic.py

Lines changed: 0 additions & 7 deletions
This file was deleted.
