Commit e7fe789

docs & testing for v0.8.0 (#15)
1 parent b59983e commit e7fe789

File tree

14 files changed: +1366 -1195 lines changed


CMakeLists.txt

Lines changed: 1 addition & 1 deletion
```diff
@@ -18,7 +18,7 @@ endif()
 FetchContent_Declare(
   sqlite_zstd_vfs
   GIT_REPOSITORY https://github.com/mlin/sqlite_zstd_vfs.git
-  GIT_TAG 048232c
+  GIT_TAG 29029e8
 )
 FetchContent_MakeAvailable(sqlite_zstd_vfs)
 FetchContent_MakeAvailable(sqlitecpp)
```

README.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -5,8 +5,9 @@
 This [SQLite3 loadable extension](https://www.sqlite.org/loadext.html) adds features to the [ubiquitous](https://www.sqlite.org/mostdeployed.html) embedded RDBMS supporting applications in genome bioinformatics:
 
 * genomic range indexing for overlap queries & joins
-* streaming storage compression (also available [standalone](https://github.com/mlin/sqlite_zstd_vfs))
 * in-SQL utility functions, e.g. reverse-complement DNA, parse "chr1:2,345-6,789"
+* automatic streaming storage compression (also available [standalone](https://github.com/mlin/sqlite_zstd_vfs))
+* reading directly from HTTP(S) URLs (also available [standalone](https://github.com/mlin/sqlite_web_vfs))
 * pre-tuned settings for "big data"
 
 This October 2020 poster discusses the context and long-run ambitions:
```

bindings/python/genomicsqlite/__init__.py

Lines changed: 27 additions & 7 deletions
```diff
@@ -342,18 +342,41 @@ def _compact(dbfilename, argv):
     )
     parser.add_argument("-q", dest="quiet", action="store_true", help="suppress progress messages")
     args = parser.parse_args(argv)
+    cfg = {"page_cache_MiB": 64}
+    # reduced page_cache_MiB: memory usage goes unexpectedly high otherwise. Theory: SQLite
+    # miscalculates the size of the page cache for the new database file when we use VACUUM INTO,
+    # maybe due to some confusion involving the cache_size & page_size of the two database files.
+    for k in ("zstd_level", "inner_page_KiB", "outer_page_KiB", "threads"):
+        cfg[k] = vars(args)[k]
 
-    # open db (sniffing whether it's currently compressed or not)
+    # sniff whether input db is currently compressed with zstd_vfs or not
     con = None
     web = dbfilename.startswith("http:") or dbfilename.startswith("https:")
     if not web:
         con = sqlite3.connect(f"file:{urllib.parse.quote(dbfilename)}?mode=ro", uri=True)
-        if next(con.execute("PRAGMA application_id"))[0] == 0x7A737464:
+        if next(con.execute("PRAGMA application_id"))[0] != 0x7A737464:
+            # db is uncompressed: apply tuning PRAGMAs
+            if not args.quiet:
+                print(f"Opened uncompressed database {dbfilename}", file=sys.stderr)
+                sys.stderr.flush()
+            con.executescript(
+                _execute1(
+                    con,
+                    "SELECT genomicsqlite_tuning_sql(?)",
+                    (json.dumps(cfg),),
+                )
+            )
+        else:
+            # db is zstd_vfs outer db; proceed to open using our own connect()
+            con.close()
             con = None
     if not con:
-        con = connect(dbfilename, read_only=True)
+        if not args.quiet:
+            print(f"Opening compressed database {dbfilename} ...", file=sys.stderr)
+            sys.stderr.flush()
+        con = connect(dbfilename, read_only=True, **cfg)
 
-    # VACUUM INTO to recompress
+    # VACUUM INTO to [re]compress
     destfilename = args.out_filename
     if not destfilename:
         destfilename = dbfilename
@@ -371,9 +394,6 @@ def _compact(dbfilename, argv):
         sys.stderr.flush()
     if args.force and os.path.isfile(destfilename):
         os.unlink(destfilename)
-    cfg = {}
-    for k in ("zstd_level", "inner_page_KiB", "outer_page_KiB", "threads"):
-        cfg[k] = vars(args)[k]
     con.executescript(vacuum_into_sql(con, destfilename, **cfg))
     con.close()
```

docs/bindings.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,4 +1,4 @@
-# Language Bindings Guide
+# Writing Language Bindings
 
 Thank you for considering a contribution to the language bindings available for the Genomics Extension!
 
```
docs/guide.md

Lines changed: 0 additions & 1159 deletions
This file was deleted.

docs/guide_db.md

Lines changed: 242 additions & 0 deletions
This file was added; its full content follows.

# Programming Guide - Opening Compressed Databases

The Genomics Extension integrates with your programming language's existing SQLite3 bindings to provide a familiar experience wherever possible.

* Python: [sqlite3](https://docs.python.org/3/library/sqlite3.html)
* Java/JVM: [sqlite-jdbc](https://github.com/xerial/sqlite-jdbc)
* Rust: [rusqlite](https://github.com/rusqlite/rusqlite)
* C++: [SQLiteCpp](https://github.com/SRombauts/SQLiteCpp) (optional, recommended) or directly using...
* C: [SQLite C/C++ API](https://www.sqlite.org/cintro.html)

First complete the [installation instructions](index.md).

## Loading the extension

=== "Python"
    ``` python3
    import sqlite3
    import genomicsqlite
    ```

=== "Java"
    ```java
    import java.sql.*;
    import net.mlin.genomicsqlite.GenomicSQLite;
    ```

=== "Rust"
    ```rust
    use genomicsqlite::ConnectionMethods;
    use rusqlite::{Connection, OpenFlags, params, NO_PARAMS};
    ```

    The `genomicsqlite::ConnectionMethods` trait makes available GenomicSQLite-specific methods for `rusqlite::Connection` (and `rusqlite::Transaction`). See [rustdoc](https://docs.rs/genomicsqlite) for some extra details.

=== "C++"
    ``` c++
    #include <sqlite3.h>
    #include "SQLiteCpp/SQLiteCpp.h" // optional
    #include "genomicsqlite.h"

    int main() {
        try {
            GENOMICSQLITE_CXX_INIT();
        } catch (std::runtime_error& exn) {
            // report exn.what()
        }
        ...
    }
    ```

    Link the program to the `sqlite3` and `genomicsqlite` libraries. Optionally, include [SQLiteCpp](https://github.com/SRombauts/SQLiteCpp) headers *before* `genomicsqlite.h` to use its more-convenient API; but *don't* link it, as the `genomicsqlite` library has it built in.

    GNU/Linux: to link the prebuilt `libgenomicsqlite.so` distributed from our GitHub Releases, you may have to compile your source with `CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0`, because the library is built against an old libstdc++ version to improve runtime compatibility. The function of this flag is explained in the libstdc++ docs on [Dual ABI](https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html). If you build `libgenomicsqlite.so` from source, the flag will not be needed.

    General note: GenomicSQLite C++ routines are liable to throw exceptions.

=== "C"
    ``` c
    #include <sqlite3.h>
    #include "genomicsqlite.h"

    int main() {
        char *zErrMsg = 0;
        int rc = GENOMICSQLITE_C_INIT(&zErrMsg);
        if (rc != SQLITE_OK) {
            /* report zErrMsg */
            sqlite3_free(zErrMsg);
        }
        ...
    }
    ```

    Link the program to the `sqlite3` and `genomicsqlite` libraries.

    All GenomicSQLite C routines returning a `char*` string use the following convention: if the operation succeeds, it's a nonempty, null-terminated string; otherwise, it points to a null byte followed immediately by a nonempty, null-terminated error message. *In either case,* the caller must free the string with `sqlite3_free()`. NULL is returned only if out of memory.

## Opening a compressed database

**↪ GenomicSQLite Open:** create or open a compressed database, returning a connection object with various settings pre-tuned for large datasets.

=== "Python"
    ``` python3
    dbconn = genomicsqlite.connect(
        db_filename,
        read_only=False,
        **kwargs  # genomicsqlite + sqlite3.connect() arguments
    )
    assert isinstance(dbconn, sqlite3.Connection)
    ```

=== "Java"
    ```java
    java.util.Properties config = new java.util.Properties();
    config.setProperty("genomicsqlite.config_json", "{}");
    // Properties may originate from org.sqlite.SQLiteConfig.toProperties()
    // with genomicsqlite.config_json added in.

    Connection dbconn = DriverManager.getConnection(
        "jdbc:genomicsqlite:" + dbfileName,
        config
    );
    ```

=== "Rust"
    ```rust
    let dbconn: Connection = genomicsqlite::open(
        db_filename,
        OpenFlags::SQLITE_OPEN_CREATE | OpenFlags::SQLITE_OPEN_READ_WRITE,
        &json::object::Object::new()  // tuning options
    )?;
    ```

=== "SQLiteCpp"
    ``` c++
    std::unique_ptr<SQLite::Database> GenomicSQLiteOpen(
        const std::string &db_filename,
        int flags = 0,
        const std::string &config_json = "{}"
    );
    ```

=== "C++"
    ``` c++
    int GenomicSQLiteOpen(
        const std::string &db_filename,
        sqlite3 **ppDb,
        std::string &errmsg_out,
        int flags = 0, // as sqlite3_open_v2() e.g. SQLITE_OPEN_READONLY
        const std::string &config_json = "{}"
    ) noexcept; // returns sqlite3_open_v2() code
    ```

=== "C"
    ``` c
    int genomicsqlite_open(
        const char *db_filename,
        sqlite3 **ppDb,
        char **pzErrMsg,         /* if nonnull and an error occurs, set to error message
                                  * which caller should sqlite3_free() */
        int flags,               /* as sqlite3_open_v2() e.g. SQLITE_OPEN_READONLY */
        const char *config_json  /* JSON text (may be null) */
    );                           /* returns sqlite3_open_v2() code */
    ```

Afterwards, all the usual SQLite3 API operations are available through the returned connection object, which should finally be closed in the usual way. The [storage compression layer](https://github.com/mlin/sqlite_zstd_vfs) operates transparently underneath.

**❗ GenomicSQLite databases should *only* be opened using this routine.** If a program opens an existing GenomicSQLite database using a generic SQLite3 API, it will find a valid database whose schema is that of the compression layer instead of the intended application's. Writing into that schema might effectively corrupt the database!
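
For example, here's a minimal Python sketch of the full open-query-close cycle; the `my.gsql` filename and `variants` table are hypothetical, but the returned object behaves like any other `sqlite3.Connection`:

``` python3
import sqlite3
import genomicsqlite

# open (or create) a compressed database; yields a standard sqlite3.Connection
dbconn = genomicsqlite.connect("my.gsql")
dbconn.execute(
    "CREATE TABLE IF NOT EXISTS variants(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
)
with dbconn:  # commits the transaction on exit
    dbconn.execute("INSERT INTO variants VALUES(?,?,?,?)", ("chr1", 12345, "A", "G"))
for row in dbconn.execute("SELECT * FROM variants"):
    print(row)
dbconn.close()  # close in the usual way
```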

### Tuning options

The aforementioned tuned settings can be further adjusted. Some bindings (e.g. C/C++) receive these options as the text of a JSON object with keys and values, while others admit individual arguments to the Open routine.

* **threads = -1**: thread budget for compression, sort, and prefetching/decompression operations; -1 to match up to 8 host processors. Set 1 to disable all background processing.
* **inner_page_KiB = 16**: [SQLite page size](https://www.sqlite.org/pragma.html#pragma_page_size) for new databases, any of {1, 2, 4, 8, 16, 32, 64}. Larger pages are more compressible, but increase random I/O cost.
* **outer_page_KiB = 32**: compression layer page size for new databases, any of {1, 2, 4, 8, 16, 32, 64}. <br/>
The default configuration (inner_page_KiB, outer_page_KiB) = (16,32) balances random access speed and compression. Try setting them to (8,16) to prioritize random access, or (64,2) to prioritize compression <small>(if the compressed database will be <4TB)</small>.
* **zstd_level = 6**: Zstandard compression level for newly written data (-7 to 22)
* **unsafe_load = false**: set true to disable write transaction safety (see advice on bulk-loading below). <br/>
**❗ A database written to unsafely is liable to be corrupted if the application crashes, or if there's a concurrent attempt to modify it.**
* **page_cache_MiB = 1024**: database cache size. Use a large cache to avoid repeated decompression in successive and complex queries.
* **immutable = false**: set true to slightly reduce overhead reading from a database file that's guaranteed not to be modified by this or any concurrent program.
* **force_prefetch = false**: set true to enable background prefetching/decompression even if inner_page_KiB &lt; 16 (it's enabled by default only at sizes &ge; 16 KiB, as prefetching can be counterproductive with smaller pages; YMMV)

The connection's potential memory usage can usually be budgeted as roughly the page cache size, plus the size of any uncommitted write transaction (unless unsafe_load), plus some safety factor. ❗ However, this can *multiply by (threads+1)* during queries whose results are at least that large and must be re-sorted. That includes index creation, when the indexed columns total such size.
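
For illustration, here's how a few of these options might be passed in Python, where they're accepted as keyword arguments to the Open routine (the filename and the particular values below are hypothetical):

``` python3
import genomicsqlite

dbconn = genomicsqlite.connect(
    "my.gsql",            # hypothetical filename
    inner_page_KiB=64,    # prioritize compression over random access...
    outer_page_KiB=2,     # ...per the (64,2) suggestion above
    zstd_level=12,        # spend more effort compressing newly written data
    threads=4,            # cap background compression/prefetch threads
    page_cache_MiB=2048,  # larger cache to reduce repeated decompression
)
```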

## genomicsqlite interactive shell

The Python package includes a `genomicsqlite` script that enters the [`sqlite3` interactive shell](https://sqlite.org/cli.html) on an existing compressed database. This is a convenient way to inspect and explore the data with *ad hoc* SQL queries, as one might use `grep` or `awk` on text files. With the Python package installed (`pip3 install genomicsqlite` or `conda install genomicsqlite`):

```
$ genomicsqlite DB_FILENAME [--readonly]
```

to enter the SQL prompt with the database open. Or, append an SQL statement (in quotes) to execute it and exit. If you've installed the Python package but the script isn't found, set your `PATH` to include the `bin` directory with Python console scripts.

**Database compaction.** The utility has a subcommand to compress and defragment an existing database file (compressed or uncompressed), which can increase its compression level and optimize access to it.

```
$ genomicsqlite DB_FILENAME --compact
```

generates `DB_FILENAME.compact`; see its `--help` for additional options. In particular, `--level`, `--inner-page-KiB`, and `--outer-page-KiB` affect the output file size as discussed above.

Due to decompression overhead, the compaction procedure may be impractically slow if the database has big tables that weren't initially written in their primary key order. To avoid this, see *Optimizing storage layout* below.

## Reading databases over the web

The **GenomicSQLite Open** routine and the `genomicsqlite` shell also accept http: and https: URLs instead of local filenames, creating a connection to read the compressed file over the web directly (a Python sketch follows the notes below). The database connection must be opened read-only in the appropriate manner for your language bindings (such as the flag `SQLITE_OPEN_READONLY`). The URL server must support [HTTP GET range](https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests) requests, and the content must not change for the lifetime of the connection.

Under the hood, the extension uses [libcurl](https://curl.se/libcurl/) to send web requests for necessary portions of the database file as queries proceed, with adaptive batching & prefetching to balance the number and size of these requests. This works well for point lookups and queries that scan largely-contiguous slices of tables and indexes (a modest number thereof). It's less suitable for big multi-way joins and other aggressively random access patterns; in such cases, it'd be better to download the database file upfront and open it locally.

* The above-described `genomicsqlite DB_FILENAME --compact` tool can optimize a file's suitability for web access.
* When reading large databases over the web, budget an additional ~600MiB of memory for HTTP prefetch buffers.
* To disable TLS certificate and hostname verification, set web_insecure = true in the GenomicSQLite configuration, or SQLITE_WEB_INSECURE=1 in the environment.
* The HTTP driver writes log messages to standard error when requests fail or have to be retried; disable them by setting configuration web_log = 0 or environment SQLITE_WEB_LOG=0, or increase the level up to 5 to log every request and other details.
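
A minimal Python sketch of web reading (the URL below is a placeholder):

``` python3
import genomicsqlite

# hypothetical URL; the connection must be opened read-only
dbconn = genomicsqlite.connect(
    "https://example.com/databases/variants.gsql",
    read_only=True,
)
# queries now fetch only the needed portions of the remote file
print(next(dbconn.execute("SELECT count(*) FROM sqlite_master")))
dbconn.close()
```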

## Advice for big data

### Writing large databases quickly

1. `sqlite3_config(SQLITE_CONFIG_MEMSTATUS, 0)` if available, to reduce overhead in SQLite3's allocation routines.
1. Open the database with unsafe_load = true to reduce transaction processing overhead (at the aforementioned risk) for the connection's lifetime.
1. Also open with the flag `SQLITE_OPEN_NOMUTEX`, if your application naturally serializes operations on the connection.
1. Perform all of the following steps within one big SQLite transaction, committed at the end.
1. Insert data rows reusing prepared, parameterized SQL statements.
1. Process the rows in primary key order, if feasible (otherwise, see *Optimizing storage layout* below).
1. Consider preparing data in producer thread(s), with a consumer thread executing insertion statements in a tight loop.
1. Bind text/blob parameters using [`SQLITE_STATIC`](https://www.sqlite.org/c3ref/bind_blob.html) if suitable.
1. Create secondary indexes, including genomic range indexes, only after loading all row data. Use [partial indexes](https://www.sqlite.org/partialindex.html) when they suffice. (A sketch combining several of these steps follows this list.)
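
Here's a minimal Python sketch combining the steps that apply at the Python level; the `features` table and filename are hypothetical, and the C-API items (`SQLITE_CONFIG_MEMSTATUS`, `SQLITE_OPEN_NOMUTEX`, `SQLITE_STATIC`) don't arise here:

``` python3
import genomicsqlite

# unsafe_load trades crash safety for load speed (see warning above)
dbconn = genomicsqlite.connect("features.gsql", unsafe_load=True)
dbconn.execute(
    "CREATE TABLE features(rid INTEGER PRIMARY KEY, chrom TEXT, start INTEGER, stop INTEGER)"
)
# rows generated in primary key order
rows = ((i, "chr1", 100 * i, 100 * i + 50) for i in range(1000000))
with dbconn:  # one big transaction, committed at the end
    # executemany reuses one prepared, parameterized statement
    dbconn.executemany("INSERT INTO features VALUES(?,?,?,?)", rows)
    # secondary indexes only after loading all row data
    dbconn.execute("CREATE INDEX features_locus ON features(chrom, start)")
dbconn.close()
```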

### Optimizing storage layout

For multiple reasons mentioned so far, large tables should have their rows initially inserted in primary key order (or whatever order will promote access locality), ensuring they'll be stored as such in the database file; and tables should be written one at a time. If it's inconvenient to process the input data in this way, the following procedure (sketched below) can help:

1. Create [*temporary* table(s)](https://sqlite.org/lang_createtable.html) with the same schema as the destination table(s), but omitting any PRIMARY KEY specifiers, UNIQUE constraints, or other indexes.
2. Stream all the data into these temporary tables, which are fast to write and read, in whatever order is convenient.
3. `INSERT INTO permanent_table SELECT * FROM temp_table ORDER BY colA, colB, ...` using the primary key (or other desired sort order) for each table.

The Genomics Extension automatically enables SQLite's [parallel, external merge-sorter](https://sqlite.org/src/file/src/vdbesort.c) to execute the last step efficiently. Ensure it's [configured](https://www.sqlite.org/tempfiles.html) to use a suitable storage subsystem for big temporary files.
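
In Python, the procedure might look like this (table and column names are hypothetical):

``` python3
import genomicsqlite

dbconn = genomicsqlite.connect("sorted.gsql")
dbconn.executescript(
    """
    CREATE TABLE features(chrom TEXT, start INTEGER, stop INTEGER,
                          PRIMARY KEY(chrom, start, stop));
    CREATE TEMP TABLE features_staging(chrom TEXT, start INTEGER, stop INTEGER);
    """
)
# steps 1 & 2: stream rows into the temp table in any convenient order
dbconn.executemany(
    "INSERT INTO features_staging VALUES(?,?,?)",
    [("chr2", 500, 600), ("chr1", 100, 200), ("chr1", 50, 75)],
)
# step 3: copy into the permanent table in primary key order,
# exercising the parallel external merge-sorter
with dbconn:
    dbconn.execute(
        "INSERT INTO features SELECT * FROM features_staging ORDER BY chrom, start, stop"
    )
dbconn.close()
```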

### Compression guidelines

The [Zstandard](https://facebook.github.io/zstd/)-based [compression layer](https://github.com/mlin/sqlite_zstd_vfs) is effective at capturing the high compressibility of bioinformatics data. But one should expect a general-purpose database to use extra space to keep everything organized, compared to a file format dedicated to one read-only schema. To set a rough expectation, the maintainers feel fairly satisfied if the database file size isn't more than double that of a bespoke compression format, especially if it includes useful indexes (which, if well-designed, should be relatively incompressible).

The aforementioned zstd_level, threads, and page-size options all affect the compression time-space tradeoff, while enlarging the page cache can reduce decompression overhead (workload-dependent).

If you plan to delete or overwrite a significant amount of data in an existing database, issue [`PRAGMA secure_delete=ON`](https://www.sqlite.org/pragma.html#pragma_secure_delete) beforehand to keep the compressed file as small as possible. This works by causing SQLite to overwrite unused database pages with all zeroes, which the compression layer can then reduce to a negligible size.
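
For example, in Python (filename and table name hypothetical):

``` python3
import genomicsqlite

dbconn = genomicsqlite.connect("features.gsql")
dbconn.execute("PRAGMA secure_delete=ON")  # issue before deleting/overwriting
with dbconn:
    dbconn.execute("DELETE FROM features WHERE chrom = 'chrM'")
# freed pages are zeroed, so the compression layer shrinks them to negligible size
dbconn.close()
```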

With SQLite's row-major table [storage format](https://www.sqlite.org/fileformat.html), the first read of a lone cell usually entails decompressing at least its whole row, and there aren't any special column encodings for deltas, run lengths, etc. The "last mile" of optimization may therefore involve certain schema compromises, such as storing infrequently-accessed columns in a separate table to join when needed, or using application-layer encodings with [BLOB I/O](https://www.sqlite.org/c3ref/blob_open.html).
