|
| 1 | +# Sqlite Index Blaster for Python |
| 2 | + |
| 3 | +This library provides an API for creating huge Sqlite indexes over millions of records, much faster than the official SQLite library. It achieves this speed by leaving out crash recovery.
| 4 | + |
| 5 | +This repo exploits a [lesser known feature of the Sqlite database file format](https://www.sqlite.org/withoutrowid.html) to store records as key-value pairs or documents or regular tuples. |
| 6 | + |
| 7 | +This repo is a `pybind11` wrapper for the C++ lib at https://github.com/siara-cc/sqlite_blaster |
| 8 | + |
| 9 | +# Statement of need |
| 10 | + |
| 11 | +There are a number of choices available for fast insertion of records, such as RocksDB, LMDB and MongoDB, but even they are slowed down by the logs or journals they maintain to provide durability. These overheads are significant when indexing huge datasets.
| 12 | + |
| 13 | +This library was created for inserting/updating billions of entries for arriving at word/phrase frequencies for building dictionaries for the [Unishox](https://github.com/siara-cc/Unishox) project using publicly available texts and conversations. |
| 14 | + |
| 15 | +Furthermore, the other choices are not supported by as many IDEs, and do not offer the same querying abilities, as the highly popular Sqlite data format.
| 16 | + |
| 17 | +# Applications |
| 18 | + |
| 19 | +- Lightning fast index creation for huge datasets |
| 20 | +- Fast database indexing for embedded systems |
| 21 | +- Fast data set creation and loading for Data Science and Machine Learning |
| 22 | + |
| 23 | +# Performance |
| 24 | + |
| 25 | +The performance of this repo was compared with the Sqlite official library, LMDB and RocksDB under similar conditions of CPU, RAM and NVMe disk and the results are shown below: |
| 26 | + |
| 27 | + |
| 28 | + |
| 29 | +RocksDB performs much better than the other choices and performs consistently for over a billion entries, but it is quite slow initially.
| 30 | + |
| 31 | +The chart data can be found [here](https://github.com/siara-cc/sqlite_blaster/blob/master/SqliteBlasterPerformanceLineChart.xlsx?raw=true). |
| 32 | + |
| 33 | +# Building and running tests |
| 34 | + |
| 35 | +Clone this repo and run: |
| 36 | + |
| 37 | +```sh |
| 38 | +python3 setup.py test |
| 39 | +``` |
| 40 | + |
| 41 | +Note: This only builds the module. To run tests, install `pytest` and run: |
| 42 | + |
| 43 | +```sh |
| 44 | +pip3 install pytest |
| 45 | +pytest |
| 46 | +``` |
| 47 | + |
| 48 | +To install the module, run: |
| 49 | + |
| 50 | +```sh |
| 51 | +mkdir build |
| 52 | +cd build |
| 53 | +cmake .. |
| 54 | +make |
| 55 | +pip3 install ./sqlite_blaster_python |
| 56 | +``` |
| 57 | + |
| 58 | +# Getting started |
| 59 | + |
| 60 | +Essentially, the library provides four methods for inserting and retrieving records: `put_string()`, `get_string()`, `put_rec()` and `get_rec()`. The examples below show how this library can be used to create a key-value store, a document store or a regular table.
| 61 | +
| 62 | +Note: The cache size is set to 40 KB in these examples, but in real life 32 MB or 64 MB would be ideal. The higher this number, the better the performance.
| 63 | + |
| 64 | +## Creating a Key-Value store |
| 65 | + |
| 66 | +In this mode, a table is created with just 2 columns, `key` and `value` as shown below: |
| 67 | + |
| 68 | +```python |
| 69 | +import sqlite_blaster_python |
| 70 | + |
| 71 | +col_names = ["key", "value"] |
| 72 | +sqib = sqlite_blaster_python.sqlite_index_blaster(2, 1, col_names, "kv_index", 4096, 40, "kv_idx.db")
| 73 | +sqib.put_string("hello", "world") |
| 74 | +sqib.close() |
| 75 | +``` |
| 76 | + |
| 77 | +A file `kv_idx.db` is created and can be verified by opening it with the official `sqlite3` console program:
| 78 | + |
| 79 | +```sh |
| 80 | +sqlite3 kv_idx.db ".dump" |
| 81 | +``` |
| 82 | + |
| 83 | +and the output would be: |
| 84 | + |
| 85 | +```sql |
| 86 | +PRAGMA foreign_keys=OFF; |
| 87 | +BEGIN TRANSACTION; |
| 88 | +CREATE TABLE kv_index (key, value, PRIMARY KEY (key)) WITHOUT ROWID; |
| 89 | +INSERT INTO kv_index VALUES('hello','world'); |
| 90 | +COMMIT; |
| 91 | +``` |
| 92 | + |
| 93 | +To retrieve the inserted values, use the `get_string` method as shown below:
| 94 | + |
| 95 | +```python |
| 96 | +import sqlite_blaster_python |
| 97 | + |
| 98 | +col_names = ["key", "value"] |
| 99 | +sqib = sqlite_blaster_python.sqlite_index_blaster(2, 1, col_names, "kv_index", 4096, 40, "kv_idx.db")
| 100 | +sqib.put_string("hello", "world") |
| 101 | +print("Value of hello is", sqib.get_string("hello", "not_found")) |
| 102 | +sqib.close() |
| 103 | +``` |
| 104 | + |
| 105 | +The second parameter to `get_string` is for specifying what value is to be returned when the 1st parameter could not be found in the database index. |
| 106 | + |
| 107 | +## Creating a Document store |
| 108 | + |
| 109 | +In this mode, a table is created with just 2 columns, `key` and `doc` as shown below: |
| 110 | + |
| 111 | +```python |
| 112 | +import sqlite_blaster_python |
| 113 | + |
| 114 | +json1 = '{"name": "Alice", "age": 25, "email": "[email protected]"}' |
| 115 | +json2 = '{"name": "George", "age": 32, "email": "[email protected]"}' |
| 116 | + |
| 117 | +col_names = ["key", "doc"] |
| 118 | +sqib = sqlite_blaster_python.sqlite_index_blaster(2, 1, col_names, "doc_index", 4096, 40, "doc_store.db") |
| 119 | +sqib.put_string("primary_contact", json1) |
| 120 | +sqib.put_string("secondary_contact", json2) |
| 121 | +sqib.close() |
| 122 | +``` |
| 123 | + |
| 124 | +The index is created as `doc_store.db` and the JSON values can be queried using the `sqlite3` console as shown below:
| 125 | + |
| 126 | +```sql |
| 127 | +SELECT json_extract(doc, '$.email') AS email |
| 128 | +FROM doc_index |
| 129 | +WHERE key = 'primary_contact'; |
| 130 | +``` |
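The same lookup can also be done from Python with the standard `sqlite3` module. The following is a minimal, self-contained sketch and not part of this library: instead of opening `doc_store.db`, it builds an in-memory table whose shape is assumed to match `doc_index` (modelled on the `.dump` output of the key-value example). The `get_doc` helper is hypothetical, and the email address is a placeholder.

```python
import json
import sqlite3

# In-memory stand-in for doc_store.db. The schema is an assumption modelled
# on the kv_index dump shown earlier (WITHOUT ROWID, key as primary key).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_index (key, doc, PRIMARY KEY (key)) WITHOUT ROWID")

# Placeholder document; the email address is made up for illustration.
contact = {"name": "Alice", "age": 25, "email": "alice@example.com"}
conn.execute(
    "INSERT INTO doc_index VALUES (?, ?)",
    ("primary_contact", json.dumps(contact)),
)


def get_doc(connection, key):
    """Hypothetical helper: fetch a document by key and parse it as JSON."""
    row = connection.execute(
        "SELECT doc FROM doc_index WHERE key = ?", (key,)
    ).fetchone()
    return json.loads(row[0]) if row is not None else None


doc = get_doc(conn, "primary_contact")
missing = get_doc(conn, "no_such_key")
print(doc["email"])
conn.close()
```

Parsing the document in Python with `json.loads` avoids relying on the JSON functions being compiled into the local SQLite build.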
| 131 | + |
| 132 | +## Creating a regular table |
| 133 | + |
| 134 | +This repo can be used to create regular tables with primary key(s) as shown below: |
| 135 | + |
| 136 | +```python |
| 137 | +import sqlite_blaster_python |
| 138 | + |
| 139 | +col_names = ["student_name", "age", "maths_marks", "physics_marks", "chemistry_marks", "average_marks"] |
| 140 | +sqib = sqlite_blaster_python.sqlite_index_blaster(6, 2, col_names, "student_marks", 4096, 40, "student_marks.db") |
| 141 | + |
| 142 | +sqib.put_rec(["Robert", 19, 80, 69, 98, round((80+69+98)/3, 2)]) |
| 143 | +sqib.put_rec(["Barry", 20, 82, 99, 83, round((82+99+83)/3, 2)]) |
| 144 | +sqib.put_rec(["Elizabeth", 23, 84, 89, 74, round((84+89+74)/3, 2)]) |
| 145 | + |
| 146 | +sqib.get_rec(["Elizabeth", 23]) |
| 147 | + |
| 148 | +sqib.close() |
| 149 | +``` |
| 150 | + |
| 151 | +The index is created as `student_marks.db` and the data can be queried using `sqlite3` console as shown below: |
| 152 | + |
| 153 | +```sh
| 154 | +sqlite3 student_marks.db "select * from student_marks" |
| 155 | +Barry|20|82|99|83|88.0 |
| 156 | +Elizabeth|23|84|89|74|82.33 |
| 157 | +Robert|19|80|69|98|82.33 |
| 158 | +``` |
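To illustrate how the two-column primary key behaves, here is a self-contained sketch using Python's standard `sqlite3` module. It does not use this library: the table definition is an assumption about what a six-column index with two key columns looks like, and the data is the same as in the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: the first two columns form the primary key.
conn.execute(
    "CREATE TABLE student_marks (student_name, age, maths_marks, physics_marks, "
    "chemistry_marks, average_marks, PRIMARY KEY (student_name, age)) WITHOUT ROWID"
)

students = [
    ("Robert", 19, 80, 69, 98),
    ("Barry", 20, 82, 99, 83),
    ("Elizabeth", 23, 84, 89, 74),
]
for name, age, maths, physics, chemistry in students:
    average = round((maths + physics + chemistry) / 3, 2)
    conn.execute(
        "INSERT INTO student_marks VALUES (?, ?, ?, ?, ?, ?)",
        (name, age, maths, physics, chemistry, average),
    )

# Rows listed in primary key order, regardless of insertion order.
names = [
    row[0]
    for row in conn.execute(
        "SELECT student_name FROM student_marks ORDER BY student_name, age"
    )
]
# Lookup by the full composite key.
elizabeth = conn.execute(
    "SELECT * FROM student_marks WHERE student_name = ? AND age = ?",
    ("Elizabeth", 23),
).fetchone()
print(names)
print(elizabeth)
conn.close()
```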
| 159 | + |
| 160 | +## Constructor parameters of sqlite_index_blaster class |
| 161 | + |
| 162 | +1. `total_col_count` - Total column count in the index |
| 163 | +2. `pk_col_count` - Number of columns to use as key. These columns have to be positioned at the beginning |
| 164 | +3. `col_names` - Column names to create the table |
| 165 | +4. `tbl_name` - Table (clustered index) name |
| 166 | +5. `block_sz` - Page size (must be one of 512, 1024, 2048, 4096, 8192, 16384, 32768 or 65536) |
| 167 | +6. `cache_sz` - Size of the LRU cache in kilobytes. 32 MB or 64 MB would be ideal. Higher values lead to better performance
| 168 | +7. `fname` - Name of the Sqlite database file |
| 169 | + |
| 170 | +# Limitations |
| 171 | + |
| 172 | +- No crash recovery. If the insertion process is interrupted, the database would be unusable.
| 173 | +- The record length cannot change on update. Updating with a shorter or longer record is not implemented yet.
| 174 | +- Deletes are not implemented yet. This library is intended primarily for fast inserts.
| 175 | +- Support for concurrent inserts is not implemented yet.
| 176 | +- The regular ROWID table of Sqlite is not implemented. |
| 177 | +- Key lengths are limited depending on page size as shown in the table below. This is just because the source code does not implement support for longer keys. However, this is considered sufficient for most practical purposes. |
| 178 | + |
| 179 | + | **Page Size** | **Max Key Length** | |
| 180 | + | ------------- | ------------------ | |
| 181 | + | 512 | 35 | |
| 182 | + | 1024 | 99 | |
| 183 | + | 2048 | 227 | |
| 184 | + | 4096 | 484 | |
| 185 | + | 8192 | 998 | |
| 186 | + | 16384 | 2026 | |
| 187 | + | 32768 | 4082 | |
| 188 | + | 65536 | 8194 | |
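Keys can be checked against these limits before insertion. The sketch below is a hypothetical helper, not part of this library's API. It copies the limits from the table above and assumes they are measured in bytes of the UTF-8 encoded key, which the table does not state.

```python
# Limits copied from the table above: page size -> maximum key length.
MAX_KEY_LENGTH = {
    512: 35,
    1024: 99,
    2048: 227,
    4096: 484,
    8192: 998,
    16384: 2026,
    32768: 4082,
    65536: 8194,
}


def key_fits(key: str, page_size: int) -> bool:
    """Return True if the key is within the limit for the given page size.

    Assumption: the limit counts bytes of the UTF-8 encoded key.
    """
    if page_size not in MAX_KEY_LENGTH:
        raise ValueError(f"unsupported page size: {page_size}")
    return len(key.encode("utf-8")) <= MAX_KEY_LENGTH[page_size]


print(key_fits("hello", 4096))    # 5 bytes, within the 484 limit
print(key_fits("x" * 500, 4096))  # 500 bytes, over the 484 limit
```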
| 189 | + |
| 190 | +# License |
| 191 | + |
| 192 | +Sqlite Index Blaster and its command line tools are dual-licensed under the MIT license and the AGPL-3.0. Users may choose one of the above. |
| 193 | + |
| 194 | +- The MIT License |
| 195 | +- The GNU Affero General Public License v3 (AGPL-3.0) |
| 196 | + |
| 197 | +# Credits |
| 198 | + |
| 199 | +- The template for developing this Python binding was taken from the `pybind` repo https://github.com/pybind/cmake_example |
| 200 | +- `ChatGPT` was used in quickly figuring out the intricacies of `pybind11` |
| 201 | + |
| 202 | +# Support |
| 203 | + |
| 204 | +If you face any problems, create an issue in this repository, or write to the author (Arundale Ramanathan) at [email protected].