Compression algorithm analysis on real genomic data (HG00096, chr1, 4.7GB SAM) #34

@swetank18

Motivation

Following the suggestion to benchmark compression on large real-world genomic data,
I ran a full comparison of all supported compression algorithms on chromosome 1 of
HG00096 from the 1000 Genomes Project (4.7 GB SAM, about 11.3M reads).

Environment

  • OS: Ubuntu 24.04
  • ROOT: 6.36.04
  • GCC: 13.3
  • Dataset: HG00096.bam (1000 Genomes Project, chr1 extracted)
  • SAM size: 4,776 MB | BAM size: 15 GB | Reads: 11,277,558

Conversion Benchmark

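| Algorithm      | Output Size | Compression Ratio vs SAM | Compression Ratio vs BAM | Time |
|----------------|-------------|--------------------------|--------------------------|------|
| BAM (baseline) | 15,000 MB   | 0.32x                    | 1x                       | -    |
| SAM (baseline) | 4,776 MB    | 1x                       | 3.14x                    | -    |
| ZLIB (101)     | 1,021 MB    | 4.68x                    | 14.7x                    | 144s |
| LZ4 (404)      | 1,284 MB    | 3.72x                    | 11.7x                    | 213s |
| LZMA (505)     | 909 MB      | 5.26x                    | 16.9x                    | 192s |
| ZSTD (606)     | 923 MB      | 5.18x                    | 16.6x                    | 171s |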

Region Query Benchmark (1:1000000-2000000, 45,296 records)

| Format         | Time  | Records |
|----------------|-------|---------|
| samtools (BAM) | 0.07s | 45,491  |
| RAMTools ZLIB  | 0.86s | 45,296  |
| RAMTools LZ4   | 0.90s | 45,296  |
| RAMTools LZMA  | 0.92s | 45,296  |
| RAMTools ZSTD  | 0.92s | 45,296  |

Key Findings

1. ZSTD is the best overall algorithm

ZSTD achieves near-LZMA compression (5.18x vs 5.26x) while converting about 12%
faster (171s vs 192s). ZLIB was faster still (144s) but compresses noticeably
less (4.68x). ZSTD should be considered as the default instead of LZMA.
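
The trade-off can be recomputed directly from the table figures. This is only a small sketch: the helper names are made up for illustration, and since the table sizes are rounded to whole MB, recomputed ratios can differ from the reported ones in the last digit.

```python
# Sizes (MB) and conversion times (s) copied from the Conversion Benchmark table.
SAM_MB = 4776
RESULTS = {
    "ZLIB": (1021, 144),
    "LZ4": (1284, 213),
    "LZMA": (909, 192),
    "ZSTD": (923, 171),
}


def ratio_vs_sam(size_mb):
    """Compression ratio relative to the uncompressed SAM input."""
    return SAM_MB / size_mb


def throughput_mb_s(seconds):
    """SAM megabytes converted per second of wall time."""
    return SAM_MB / seconds


def relative_change(new, old):
    """Fractional change of `new` relative to `old` (negative means smaller)."""
    return (new - old) / old


for name, (size_mb, seconds) in RESULTS.items():
    print(f"{name:5s} ratio {ratio_vs_sam(size_mb):.2f}x  "
          f"throughput {throughput_mb_s(seconds):.1f} MB/s")

zstd_size, zstd_time = RESULTS["ZSTD"]
lzma_size, lzma_time = RESULTS["LZMA"]
print(f"ZSTD output size vs LZMA: {relative_change(zstd_size, lzma_size):+.1%}")
print(f"ZSTD conversion time vs LZMA: {relative_change(zstd_time, lzma_time):+.1%}")
```

On these numbers ZSTD gives up a small amount of size for a larger saving in conversion time, which is the basis for suggesting it as the default.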

2. LZ4 underperforms expectations

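LZ4 is normally the fastest of these algorithms, but here it is both the slowest
(213s) AND has the worst compression ratio (3.72x). The cause is not yet clear:
LZ4 is designed for fast compression as well as fast decompression, so the
slowdown may come from the compression level used (404) or from writing the
larger output rather than from the codec itself. Either way, for write-once
genomic archives, ZSTD or LZMA are better choices on this data.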

3. Region query speed gap vs samtools

RAMTools takes ~0.9s vs 0.07s for samtools on the same region, a gap of roughly 12-13x.
Possible causes:

  • RNTuple page size is not optimized for genomic access patterns
  • The index granularity (one entry per ~1000 reads) may be too coarse
  • samtools uses a highly optimized BAI index with bin-based lookup
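
To illustrate why index granularity could matter, here is a toy sketch of a sparse position index. It is not the RAMTools index (whose layout is not described here): the function names are hypothetical, records are matched by start position only, and the point is just to count how many records must be scanned for the same query at different strides.

```python
from bisect import bisect_left


def build_sparse_index(positions, stride):
    """Index every `stride`-th record of a position-sorted list.

    Returns two parallel lists: the indexed positions and their record offsets.
    """
    offsets = list(range(0, len(positions), stride))
    keys = [positions[i] for i in offsets]
    return keys, offsets


def query(positions, index, start, end):
    """Return (matching record offsets, number of records scanned)."""
    keys, offsets = index
    # Last index entry strictly before `start`, so equal positions are not skipped.
    slot = bisect_left(keys, start) - 1
    first = offsets[slot] if slot >= 0 else 0
    hits, scanned = [], 0
    for i in range(first, len(positions)):
        scanned += 1
        if positions[i] > end:
            break  # sorted input: nothing further can match
        if positions[i] >= start:
            hits.append(i)
    return hits, scanned


# Synthetic data: 10,000 records, one every 10 bases.
positions = list(range(0, 100_000, 10))
coarse = build_sparse_index(positions, stride=1000)
fine = build_sparse_index(positions, stride=10)

hits_coarse, scanned_coarse = query(positions, coarse, 54_005, 54_105)
hits_fine, scanned_fine = query(positions, fine, 54_005, 54_105)
print(len(hits_coarse), scanned_coarse)  # same hits, many more records scanned
print(len(hits_fine), scanned_fine)
```

Both strides return the same records, but the coarse index scans far more of them before reaching the region. Whether this accounts for the measured gap would need profiling.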

4. Record count discrepancy

RAMTools returns 45,296 records vs samtools 45,491 for the same region.
This 195-read difference needs investigation; the most likely cause is that
secondary or supplementary reads are being filtered differently.
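
One way to narrow this down is to tally the records each tool returns by category. The sketch below parses plain SAM lines and uses the standard FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function names are hypothetical, the input is assumed to be already restricted to one chromosome, and the "starts before region" category is my own guess at a second possible cause (reads that overlap the region without starting inside it).

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_span(cigar):
    """Number of reference bases covered by a CIGAR string (at least 1)."""
    span = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return max(span, 1)


def classify(sam_line, start, end):
    """Label one alignment line, or return None if it misses [start, end]."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, pos, cigar = int(fields[1]), int(fields[3]), fields[5]
    last = pos + reference_span(cigar) - 1
    if last < start or pos > end:
        return None
    if flag & 0x4:
        return "unmapped"
    if flag & 0x100:
        return "secondary"
    if flag & 0x800:
        return "supplementary"
    if pos < start:
        return "primary_starts_before_region"
    return "primary_starts_in_region"


def tally(sam_lines, start, end):
    """Count overlapping records per category, skipping header and blank lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        label = classify(line, start, end)
        if label is not None:
            counts[label] += 1
    return counts
```

Running `tally` over the records returned by each tool for 1:1000000-2000000 and comparing the two counters should show which category accounts for the 195 reads.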

5. Quality score compression opportunity

Quality scores are the largest field in a SAM file (~30% of data).
Current implementation stores them as encoded strings. Opportunities:

  • Illumina binning — reduce quality alphabet from 40 values to 8 bins
  • Reference-based quality compression — similar to CRAM
  • Quality score dropping — for applications that don't need per-base quality
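
As a rough sketch of what 8-level binning could look like, the snippet below maps Phred scores onto a small set of representative values. The bin boundaries follow the commonly cited Illumina scheme as I understand it and should be checked against Illumina's own documentation before use; the function names are hypothetical and Phred+33 encoding is assumed.

```python
# (low, high, representative) bins. Boundaries are an assumption, see the note above.
BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
]


def bin_quality(q):
    """Map one Phred score to its bin representative."""
    if q < 2:
        return q  # no-call scores are left unchanged
    for low, high, representative in BINS:
        if low <= q <= high:
            return representative
    return 40  # everything at or above 40 shares one bin


def bin_phred33(qual):
    """Apply binning to a Phred+33 encoded quality string."""
    return "".join(chr(bin_quality(ord(c) - 33) + 33) for c in qual)


print(bin_phred33("IIIIHHGF5#"))
```

Binning is lossy, so it only helps applications that tolerate reduced quality resolution. The gain comes from the smaller alphabet, which the general-purpose compressor can then exploit.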

6. Field-specific compression

Different fields have very different compressibility:

  • seq — 2-bit encoding already applied, good
  • qual — high entropy, hardest to compress, biggest opportunity
  • cigar — highly repetitive, compresses well
  • qname — structured patterns, could use delta encoding
  • pos — sorted integers, ideal for delta encoding
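
The delta-encoding idea for pos can be checked with a quick experiment. This is a sketch on synthetic sorted positions, not on the HG00096 data: the gap distribution, the 32-bit packing and the use of zlib are all assumptions chosen to keep the example self-contained.

```python
import random
import struct
import zlib


def delta_encode(positions):
    """Replace each position with its difference from the previous one."""
    deltas, previous = [], 0
    for p in positions:
        deltas.append(p - previous)
        previous = p
    return deltas


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    positions, running = [], 0
    for d in deltas:
        running += d
        positions.append(running)
    return positions


def compressed_size(values):
    """Bytes after packing as little-endian uint32 and compressing with zlib."""
    raw = struct.pack(f"<{len(values)}I", *values)
    return len(zlib.compress(raw, 6))


def synthetic_positions(n, max_gap=40, seed=0):
    """Sorted positions with small random gaps, loosely mimicking sorted reads."""
    rng = random.Random(seed)
    pos, out = 10_000, []
    for _ in range(n):
        pos += rng.randint(0, max_gap)
        out.append(pos)
    return out


positions = synthetic_positions(100_000)
deltas = delta_encode(positions)
assert delta_decode(deltas) == positions  # lossless round trip

print("raw positions:", compressed_size(positions), "bytes")
print("delta encoded:", compressed_size(deltas), "bytes")
```

The deltas are small and repetitive while the absolute positions are not, so the same compressor does better on them. Real gains depend on coverage and on how the column is stored.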

Proposed Optimizations

  1. Switch default to ZSTD — better speed/ratio tradeoff than LZMA
  2. Per-field compression — apply different algorithms per column
    (RNTuple supports this natively via field-level compression settings)
  3. Delta encoding for pos — sorted positions compress much better with delta
  4. Investigate index granularity — finer index could close the query speed gap
  5. Quality score binning — implement Illumina 8-bin compression as default option
  6. Investigate record count discrepancy — align filtering logic with samtools

References

  • CRAM compression paper: Hsi-Yang Fritz et al. (2011)
  • ZSTD vs LZMA benchmarks: Facebook engineering blog
  • RNTuple per-field compression: ROOT documentation
