# Programming Guide - Opening Compressed Databases

The Genomics Extension integrates with your programming language's existing SQLite3 bindings to provide a familiar experience wherever possible.

* Python: [sqlite3](https://docs.python.org/3/library/sqlite3.html)
* Java/JVM: [sqlite-jdbc](https://github.com/xerial/sqlite-jdbc)
* Rust: [rusqlite](https://github.com/rusqlite/rusqlite)
* C++: [SQLiteCpp](https://github.com/SRombauts/SQLiteCpp) (optional, recommended) or directly using the C API below
* C: [SQLite C/C++ API](https://www.sqlite.org/cintro.html)

First complete the [installation instructions](index.md).

## Loading the extension

=== "Python"
    ``` python3
    import sqlite3
    import genomicsqlite
    ```

=== "Java"
    ```java
    import java.sql.*;
    import net.mlin.genomicsqlite.GenomicSQLite;
    ```

=== "Rust"
    ```rust
    use genomicsqlite::ConnectionMethods;
    use rusqlite::{Connection, OpenFlags, params, NO_PARAMS};
    ```

    The `genomicsqlite::ConnectionMethods` trait provides the GenomicSQLite-specific methods on
    `rusqlite::Connection` (and `rusqlite::Transaction`). See the [rustdoc](https://docs.rs/genomicsqlite)
    for further details.

=== "C++"
    ``` c++
    #include <sqlite3.h>
    #include "SQLiteCpp/SQLiteCpp.h" // optional
    #include "genomicsqlite.h"

    int main() {
      try {
        GENOMICSQLITE_CXX_INIT();
      } catch (std::runtime_error& exn) {
        // report exn.what()
      }
      ...
    }
    ```

    Link the program to the `sqlite3` and `genomicsqlite` libraries. Optionally, include
    [SQLiteCpp](https://github.com/SRombauts/SQLiteCpp) headers *before* `genomicsqlite.h` to use
    its more-convenient API; but *don't* link it, as the `genomicsqlite` library has it built-in.

    GNU/Linux: to link the prebuilt `libgenomicsqlite.so` distributed from our GitHub Releases, you
    may have to compile your source with `CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0`. This is because the
    library is built against an old libstdc++ version to improve runtime compatibility. The
    function of this flag is explained in the libstdc++ docs on
    [Dual ABI](https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html). If you build
    `libgenomicsqlite.so` from source, then the flag will not be needed.

    General note: GenomicSQLite C++ routines are liable to throw exceptions.

=== "C"
    ``` c
    #include <sqlite3.h>
    #include "genomicsqlite.h"

    int main() {
      char *zErrMsg = 0;
      int rc = GENOMICSQLITE_C_INIT(&zErrMsg);
      if (rc != SQLITE_OK) {
        /* report zErrMsg */
        sqlite3_free(zErrMsg);
      }
      ...
    }
    ```

    Link the program to the `sqlite3` and `genomicsqlite` libraries.

    All GenomicSQLite C routines returning a `char*` string use the following convention. If the
    operation succeeds, then it's a nonempty, null-terminated string. Otherwise, it points to a
    null byte followed immediately by a nonempty, null-terminated error message. *In either case,*
    the caller must free the string with `sqlite3_free()`. NULL is returned only if out of memory.

## Opening a compressed database

**↪ GenomicSQLite Open:** create or open a compressed database, returning a connection object with various settings pre-tuned for large datasets.

=== "Python"
    ``` python3
    dbconn = genomicsqlite.connect(
        db_filename,
        read_only=False,
        **kwargs  # genomicsqlite + sqlite3.connect() arguments
    )
    assert isinstance(dbconn, sqlite3.Connection)
    ```

=== "Java"
    ```java
    java.util.Properties config = new java.util.Properties();
    config.setProperty("genomicsqlite.config_json", "{}");
    // Properties may originate from org.sqlite.SQLiteConfig.toProperties()
    // with genomicsqlite.config_json added in.

    Connection dbconn = DriverManager.getConnection(
        "jdbc:genomicsqlite:" + dbfileName,
        config
    );
    ```

=== "Rust"
    ```rust
    let dbconn: Connection = genomicsqlite::open(
        db_filename,
        OpenFlags::SQLITE_OPEN_CREATE | OpenFlags::SQLITE_OPEN_READ_WRITE,
        &json::object::Object::new()  // tuning options
    )?;
    ```

=== "SQLiteCpp"
    ``` c++
    std::unique_ptr<SQLite::Database> GenomicSQLiteOpen(
        const std::string &db_filename,
        int flags = 0,
        const std::string &config_json = "{}"
    );
    ```

=== "C++"
    ``` c++
    int GenomicSQLiteOpen(
        const std::string &db_filename,
        sqlite3 **ppDb,
        std::string &errmsg_out,
        int flags = 0,  // as sqlite3_open_v2() e.g. SQLITE_OPEN_READONLY
        const std::string &config_json = "{}"
    ) noexcept;  // returns sqlite3_open_v2() code
    ```

=== "C"
    ``` c
    int genomicsqlite_open(
        const char *db_filename,
        sqlite3 **ppDb,
        char **pzErrMsg,        /* if nonnull and an error occurs, set to error message
                                 * which caller should sqlite3_free() */
        int flags,              /* as sqlite3_open_v2() e.g. SQLITE_OPEN_READONLY */
        const char *config_json /* JSON text (may be null) */
    );                          /* returns sqlite3_open_v2() code */
    ```

Afterwards, all the usual SQLite3 API operations are available through the returned connection object, which should finally be closed in the usual way. The [storage compression layer](https://github.com/mlin/sqlite_zstd_vfs) operates transparently underneath.
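
For instance, in Python the returned object is a standard `sqlite3.Connection`, so the usual API applies. A minimal sketch (the database filename is hypothetical; `sqlite_master` is SQLite's built-in schema table):

``` python3
import sqlite3
import genomicsqlite

dbconn = genomicsqlite.connect("variants.gsql", read_only=True)  # hypothetical filename
# ordinary sqlite3 usage from here on:
for row in dbconn.execute("SELECT name, type FROM sqlite_master"):
    print(row)
dbconn.close()  # close in the usual way
```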

**❗ GenomicSQLite databases should *only* be opened using this routine.** If a program opens an existing GenomicSQLite database using a generic SQLite3 API, it will find a valid database whose schema is that of the compression layer instead of the intended application's. Writing into that schema might effectively corrupt the database!

### Tuning options

The aforementioned tuned settings can be further adjusted. Some bindings (e.g. C/C++) receive these options as the text of a JSON object with keys and values, while others admit individual arguments to the Open routine.

* **threads = -1**: thread budget for compression, sort, and prefetching/decompression operations; -1 to match up to 8 host processors. Set 1 to disable all background processing.
* **inner_page_KiB = 16**: [SQLite page size](https://www.sqlite.org/pragma.html#pragma_page_size) for new databases, any of {1, 2, 4, 8, 16, 32, 64}. Larger pages are more compressible, but increase random I/O cost.
* **outer_page_KiB = 32**: compression layer page size for new databases, any of {1, 2, 4, 8, 16, 32, 64}. <br/>
The default configuration (inner_page_KiB, outer_page_KiB) = (16,32) balances random access speed and compression. Try setting them to (8,16) to prioritize random access, or (64,2) to prioritize compression <small>(if the compressed database will be <4TB)</small>.
* **zstd_level = 6**: Zstandard compression level for newly written data (-7 to 22)
* **unsafe_load = false**: set true to disable write transaction safety (see advice on bulk-loading below). <br/>
  **❗ A database written to unsafely is liable to be corrupted if the application crashes, or if there's a concurrent attempt to modify it.**
* **page_cache_MiB = 1024**: database page cache size, in MiB. Use a large cache to avoid repeated decompression in successive and complex queries.
* **immutable = false**: set true to slightly reduce overhead when reading from a database file that's guaranteed not to be modified by this or any concurrent program.
* **force_prefetch = false**: set true to enable background prefetching/decompression even if inner_page_KiB < 16. (Prefetching is enabled by default only at 16 KiB and above, as it can be counterproductive with smaller pages; YMMV.)

The connection's potential memory usage can usually be budgeted as roughly the page cache size, plus the size of any uncommitted write transaction (unless unsafe_load), plus some safety factor. ❗ However, this can *multiply by (threads+1)* during queries whose results are at least that large and must be re-sorted. That includes index creation, when the indexed columns total such size.
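
For example, the Python binding admits these options as keyword arguments to its Open routine, per its signature shown above. A sketch (the filename is hypothetical); in the C/C++ bindings, the same settings would instead be passed as the JSON text `{"threads": 8, "page_cache_MiB": 2048, "zstd_level": 10}`:

``` python3
import genomicsqlite

dbconn = genomicsqlite.connect(
    "big_dataset.gsql",   # hypothetical filename
    threads=8,            # explicit thread budget instead of -1 autodetection
    page_cache_MiB=2048,  # larger cache for repeated complex queries
    zstd_level=10,        # trade write speed for better compression
)
```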

## genomicsqlite interactive shell

The Python package includes a `genomicsqlite` script that enters the [`sqlite3` interactive shell](https://sqlite.org/cli.html) on an existing compressed database. This is a convenient way to inspect and explore the data with *ad hoc* SQL queries, as one might use `grep` or `awk` on text files. With the Python package installed (`pip3 install genomicsqlite` or `conda install genomicsqlite`):

```
$ genomicsqlite DB_FILENAME [--readonly]
```

to enter the SQL prompt with the database open. Or, add an SQL statement (in quotes) to perform it and exit, as shown below. If you've installed the Python package but the script isn't found, set your `PATH` to include the `bin` directory with Python console scripts.
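
For example, to run one query and exit (the filename is hypothetical; `sqlite_master` is the built-in schema table):

```
$ genomicsqlite my_data.gsql "SELECT name, type FROM sqlite_master"
```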

**Database compaction.** The utility has a subcommand to compress and defragment an existing database file (compressed or uncompressed), which can increase its compression level and optimize access to it.

```
$ genomicsqlite DB_FILENAME --compact
```

generates `DB_FILENAME.compact`; see its `--help` for additional options. In particular, `--level`, `--inner-page-KiB`, and `--outer-page-KiB` affect the output file size as discussed above.

Due to decompression overhead, the compaction procedure may be impractically slow if the database has big tables that weren't initially written in their primary key order. To avoid this, see *Optimizing storage layout* below.

## Reading databases over the web

The **GenomicSQLite Open** routine and the `genomicsqlite` shell also accept http: and https: URLs instead of local filenames, creating a connection to read the compressed file over the web directly. The database connection must be opened read-only in the appropriate manner for your language bindings (such as the flag `SQLITE_OPEN_READONLY`). The URL server must support [HTTP GET range](https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests) requests, and the content must not change for the lifetime of the connection.
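
For example, in Python (a sketch; the URL is hypothetical, and `read_only=True` supplies the required read-only mode):

``` python3
import genomicsqlite

dbconn = genomicsqlite.connect(
    "https://example.org/datasets/variants.gsql",  # hypothetical URL
    read_only=True,  # web connections must be opened read-only
)
```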

Under the hood, the extension uses [libcurl](https://curl.se/libcurl/) to send web requests for necessary portions of the database file as queries proceed, with adaptive batching & prefetching to balance the number and size of these requests. This works well for point lookups and queries that scan largely-contiguous slices of tables and indexes (a modest number thereof). It's less suitable for big multi-way joins and other aggressively random access patterns; in such cases, it'd be better to download the database file upfront to open locally.

* The above-described `genomicsqlite DB_FILENAME --compact` tool can optimize a file's suitability for web access.
* When reading large databases over the web, budget an additional ~600MiB of memory for HTTP prefetch buffers.
* To disable TLS certificate and hostname verification, set `web_insecure = true` in the GenomicSQLite configuration, or `SQLITE_WEB_INSECURE=1` in the environment.
* The HTTP driver writes log messages to standard error when requests fail or had to be retried. This logging can be disabled by setting the configuration option `web_log = 0` or the environment variable `SQLITE_WEB_LOG=0`, or increased up to 5 to log every request and other details.

## Advice for big data

### Writing large databases quickly

1. `sqlite3_config(SQLITE_CONFIG_MEMSTATUS, 0)` if available, to reduce overhead in SQLite3's allocation routines.
1. Open the database with `unsafe_load = true` to reduce transaction processing overhead (at the aforementioned risk) for the connection's lifetime.
1. Also open with the flag `SQLITE_OPEN_NOMUTEX`, if your application naturally serializes operations on the connection.
1. Perform all of the following steps within one big SQLite transaction, committed at the end.
1. Insert data rows reusing prepared, parameterized SQL statements (see the sketch after this list).
    1. Process the rows in primary key order, if feasible (otherwise, see *Optimizing storage layout* below).
    1. Consider preparing data in producer thread(s), with a consumer thread executing insertion statements in a tight loop.
    1. Bind text/blob parameters using [`SQLITE_STATIC`](https://www.sqlite.org/c3ref/bind_blob.html) if suitable.
1. Create secondary indexes, including genomic range indexes, only after loading all row data. Use [partial indexes](https://www.sqlite.org/partialindex.html) when they suffice.
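
A minimal Python sketch of this recipe (the filename and schema are hypothetical; steps 1 and 3 concern the C API and don't arise with the `sqlite3` module):

``` python3
import genomicsqlite

dbconn = genomicsqlite.connect("big_load.gsql", unsafe_load=True)  # hypothetical filename
dbconn.execute("CREATE TABLE variant(chrom TEXT, pos INTEGER, alt TEXT)")
# one big transaction: executemany() reuses one prepared, parameterized statement
rows = [("chr1", 10177, "AC"), ("chr1", 10352, "TA")]  # stand-in for a sorted producer
dbconn.executemany("INSERT INTO variant(chrom, pos, alt) VALUES(?,?,?)", rows)
# secondary indexes only after loading all the row data
dbconn.execute("CREATE INDEX variant_chrom_pos ON variant(chrom, pos)")
dbconn.commit()
dbconn.close()
```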

### Optimizing storage layout

For multiple reasons mentioned so far, large tables should have their rows initially inserted in primary key order (or whatever order will promote access locality), ensuring they'll be stored as such in the database file; and tables should be written one at a time. If it's inconvenient to process the input data in this way, the following procedure can help:

1. Create [*temporary* table(s)](https://sqlite.org/lang_createtable.html) with the same schema as the destination table(s), but omitting any PRIMARY KEY specifiers, UNIQUE constraints, or other indexes.
2. Stream all the data into these temporary tables, which are fast to write and read, in whatever order is convenient.
3. `INSERT INTO permanent_table SELECT * FROM temp_table ORDER BY colA, colB, ...` using the primary key (or other desired sort order) for each table.

The Genomics Extension automatically enables SQLite's [parallel, external merge-sorter](https://sqlite.org/src/file/src/vdbesort.c) to execute the last step efficiently. Ensure it's [configured](https://www.sqlite.org/tempfiles.html) to use a suitable storage subsystem for big temporary files.
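
A Python sketch of this procedure (the filename, table, and column names are hypothetical):

``` python3
import genomicsqlite

dbconn = genomicsqlite.connect("reads.gsql", unsafe_load=True)  # hypothetical filename
dbconn.execute("CREATE TABLE read_rec(qname TEXT PRIMARY KEY, chrom TEXT, pos INTEGER)")
# 1. temporary staging table: same columns, but no PRIMARY KEY or other indexes
dbconn.execute("CREATE TEMP TABLE read_rec_stage(qname TEXT, chrom TEXT, pos INTEGER)")
# 2. stream rows in whatever order is convenient
stage_rows = [("r2", "chr1", 200), ("r1", "chr1", 100)]  # stand-in for the real input
dbconn.executemany("INSERT INTO read_rec_stage VALUES(?,?,?)", stage_rows)
# 3. sorted copy into the permanent table, exercising the external merge-sorter
dbconn.execute("INSERT INTO read_rec SELECT * FROM read_rec_stage ORDER BY qname")
dbconn.commit()
dbconn.close()
```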

### Compression guidelines

The [Zstandard](https://facebook.github.io/zstd/)-based [compression layer](https://github.com/mlin/sqlite_zstd_vfs) is effective at capturing the high compressibility of bioinformatics data. But, one should expect a general-purpose database to use extra space to keep everything organized, compared to a file format dedicated to one read-only schema. To set a rough expectation, the maintainers feel fairly satisfied if the database file size isn't more than double that of a bespoke compression format — especially if it includes useful indexes (which, if well-designed, should be relatively incompressible).

The aforementioned zstd_level, threads, inner_page_KiB, and outer_page_KiB options all affect the compression time-space tradeoff, while enlarging the page cache can reduce decompression overhead (workload-dependent).

If you plan to delete or overwrite a significant amount of data in an existing database, issue [`PRAGMA secure_delete=ON`](https://www.sqlite.org/pragma.html#pragma_secure_delete) beforehand to keep the compressed file as small as possible. This works by causing SQLite to overwrite unused database pages with all zeroes, which the compression layer can then reduce to a negligible size.
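
For example, continuing the Python sketches above (the table is hypothetical):

``` python3
dbconn.execute("PRAGMA secure_delete=ON")  # zero out freed pages from here on
dbconn.execute("DELETE FROM variant WHERE chrom = 'chrM'")  # hypothetical cleanup
dbconn.commit()
```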

With SQLite's row-major table [storage format](https://www.sqlite.org/fileformat.html), the first read of a lone cell usually entails decompressing at least its whole row, and there aren't any special column encodings for deltas, run lengths, etc. The "last mile" of optimization may therefore involve certain schema compromises, such as storing infrequently-accessed columns in a separate table to join when needed, or using application-layer encodings with [BLOB I/O](https://www.sqlite.org/c3ref/blob_open.html).
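
As a sketch of the first idea (hypothetical schema): a bulky, infrequently-accessed column moves into a side table sharing the main table's primary key, so that typical row reads decompress less data.

``` python3
# main table stays lean; bulky annotations live in a side table keyed the same way
dbconn.execute("CREATE TABLE variant2(vid INTEGER PRIMARY KEY, chrom TEXT, pos INTEGER)")
dbconn.execute("CREATE TABLE variant2_info(vid INTEGER PRIMARY KEY, info_json TEXT)")
# join only in the queries that actually need the bulky column
query = (
    "SELECT v.chrom, v.pos, i.info_json"
    " FROM variant2 v JOIN variant2_info i USING (vid)"
)
for row in dbconn.execute(query):
    print(row)
```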