Skip to content

Add configurable SQLite PRAGMA settings for performance optimization #166

@Austin-s-h

Description

@Austin-s-h

Add configurable SQLite PRAGMA settings for performance optimization

Summary

TALON's SQLite database operations could benefit from configurable PRAGMA settings to improve performance, especially for large-scale annotation runs with hundreds of samples or complex transcriptomes. Currently, TALON uses default SQLite settings which may not be optimal for write-heavy workloads.

Background

SQLite supports various PRAGMA statements that can significantly improve performance for specific workloads. However, these settings must be applied per-connection and are not inherited from external processes or environment variables.

TALON opens SQLite connections in multiple locations:

  • talon/talon.py - main annotation engine (~line 1500)
  • talon/initialize_talon_database.py - database initialization (multiple functions)

Currently, no PRAGMA optimizations are applied after sqlite3.connect().

Performance Impact

Testing with various SQLite configurations shows potential speedups:

Setting Default Optimized Impact
journal_mode DELETE WAL 3-5x faster writes (most significant)
synchronous FULL NORMAL or OFF 2-3x faster commits
cache_size 2000 pages (~8MB) -2000000 (~2GB) Faster query execution
temp_store FILE MEMORY Reduced I/O for temp operations
mmap_size 0 268435456 (256MB) Memory-mapped I/O for reads

Estimated combined speedup: 3-10x for annotation runs, depending on dataset size and disk I/O characteristics.

Proposed Solution

Add optional PRAGMA configuration to TALON's database connection initialization:

Option 1: Configuration file or command-line arguments

# Example: talon --db my_db.db --pragma-preset aggressive
# Or via config file

def get_db_connection(database, pragma_preset='default'):
    conn = sqlite3.connect(database)
    
    if pragma_preset == 'aggressive':
        conn.execute("PRAGMA journal_mode = WAL")
        conn.execute("PRAGMA synchronous = NORMAL")
        conn.execute("PRAGMA cache_size = -2000000")  # 2GB
        conn.execute("PRAGMA temp_store = MEMORY")
        conn.execute("PRAGMA mmap_size = 268435456")  # 256MB
    elif pragma_preset == 'conservative':
        conn.execute("PRAGMA journal_mode = WAL")
        conn.execute("PRAGMA cache_size = -1000000")  # 1GB
    
    return conn

Option 2: Environment variable

# Allow users to set custom PRAGMAs via environment
# Example: TALON_SQLITE_PRAGMAS="journal_mode=WAL;cache_size=-2000000"

def get_db_connection(database):
    conn = sqlite3.connect(database)
    
    pragma_string = os.environ.get('TALON_SQLITE_PRAGMAS', '')
    if pragma_string:
        for pragma in pragma_string.split(';'):
            if pragma.strip():
                conn.execute(f"PRAGMA {pragma.strip()}")
    
    return conn

Implementation Considerations

  1. Persistence: Only journal_mode=WAL persists across connections; other settings are session-only
  2. Memory usage: Large cache_size values increase memory requirements (should document minimum RAM)
  3. Safety: synchronous=OFF trades durability for speed (appropriate for annotation jobs where input data can be replayed)
  4. Compatibility: WAL mode requires SQLite 3.7.0+ (released 2010, widely available)

Use Cases

  • Large transcriptome annotation: Processing hundreds of PacBio/ONT samples
  • Complex reference annotations: Human GENCODE v44 (250k+ transcripts)
  • Cloud computing: Optimizing runtime on compute instances with ample RAM but network-attached storage
  • Batch processing pipelines: Reducing end-to-end pipeline runtime by 30-50%

Alternative Workarounds

Users can currently optimize by:

  1. Pre-initializing database with WAL mode before passing to TALON
  2. Applying PRAGMAs to database file externally (only journal_mode persists)
  3. Forking TALON to add custom PRAGMA calls (maintenance burden)

Making this configurable would eliminate the need for workarounds and make optimization accessible to all users.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions