-
Notifications
You must be signed in to change notification settings - Fork 31
Description
Add configurable SQLite PRAGMA settings for performance optimization
Summary
TALON's SQLite database operations could benefit from configurable PRAGMA settings to improve performance, especially for large-scale annotation runs with hundreds of samples or complex transcriptomes. Currently, TALON uses default SQLite settings which may not be optimal for write-heavy workloads.
Background
SQLite supports various PRAGMA statements that can significantly improve performance for specific workloads. However, these settings must be applied per-connection and are not inherited from external processes or environment variables.
TALON opens SQLite connections in multiple locations:
talon/talon.py- main annotation engine (~line 1500)talon/initialize_talon_database.py- database initialization (multiple functions)
Currently, no PRAGMA optimizations are applied after sqlite3.connect().
Performance Impact
Testing with various SQLite configurations shows potential speedups:
| Setting | Default | Optimized | Impact |
|---|---|---|---|
journal_mode |
DELETE | WAL | 3-5x faster writes (most significant) |
synchronous |
FULL | NORMAL or OFF | 2-3x faster commits |
cache_size |
2000 pages (~8MB) | -2000000 (~2GB) | Faster query execution |
temp_store |
FILE | MEMORY | Reduced I/O for temp operations |
mmap_size |
0 | 268435456 (256MB) | Memory-mapped I/O for reads |
Estimated combined speedup: 3-10x for annotation runs, depending on dataset size and disk I/O characteristics.
Proposed Solution
Add optional PRAGMA configuration to TALON's database connection initialization:
Option 1: Configuration file or command-line arguments
# Example: talon --db my_db.db --pragma-preset aggressive
# Or via config file
def get_db_connection(database, pragma_preset='default'):
conn = sqlite3.connect(database)
if pragma_preset == 'aggressive':
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("PRAGMA synchronous = NORMAL")
conn.execute("PRAGMA cache_size = -2000000") # 2GB
conn.execute("PRAGMA temp_store = MEMORY")
conn.execute("PRAGMA mmap_size = 268435456") # 256MB
elif pragma_preset == 'conservative':
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("PRAGMA cache_size = -1000000") # 1GB
return connOption 2: Environment variable
# Allow users to set custom PRAGMAs via environment
# Example: TALON_SQLITE_PRAGMAS="journal_mode=WAL;cache_size=-2000000"
def get_db_connection(database):
conn = sqlite3.connect(database)
pragma_string = os.environ.get('TALON_SQLITE_PRAGMAS', '')
if pragma_string:
for pragma in pragma_string.split(';'):
if pragma.strip():
conn.execute(f"PRAGMA {pragma.strip()}")
return connImplementation Considerations
- Persistence: Only
journal_mode=WALpersists across connections; other settings are session-only - Memory usage: Large
cache_sizevalues increase memory requirements (should document minimum RAM) - Safety:
synchronous=OFFtrades durability for speed (appropriate for annotation jobs where input data can be replayed) - Compatibility: WAL mode requires SQLite 3.7.0+ (released 2010, widely available)
Use Cases
- Large transcriptome annotation: Processing hundreds of PacBio/ONT samples
- Complex reference annotations: Human GENCODE v44 (250k+ transcripts)
- Cloud computing: Optimizing runtime on compute instances with ample RAM but network-attached storage
- Batch processing pipelines: Reducing end-to-end pipeline runtime by 30-50%
Alternative Workarounds
Users can currently optimize by:
- Pre-initializing database with WAL mode before passing to TALON
- Applying PRAGMAs to database file externally (only
journal_modepersists) - Forking TALON to add custom PRAGMA calls (maintenance burden)
Making this configurable would eliminate the need for workarounds and make optimization accessible to all users.