Motivation
Following the suggestion to benchmark compression on large real-world genomic data,
I ran a full comparison of all supported compression algorithms on chromosome 1 of
HG00096 from the 1000 Genomes Project (4,776 MB SAM, about 11.3M reads).
Environment
- OS: Ubuntu 24.04
- ROOT: 6.36.04
- GCC: 13.3
- Dataset: HG00096.bam (1000 Genomes Project, chr1 extracted)
- SAM size: 4,776 MB | BAM size: 15 GB | Reads: 11,277,558
Conversion Benchmark
| Algorithm | Output Size | Compression Ratio vs SAM | Compression Ratio vs BAM | Time |
|---|---|---|---|---|
| BAM (baseline) | 15,000 MB | 0.32x | 1x | n/a |
| SAM (baseline) | 4,776 MB | 1x | 3.14x | n/a |
| ZLIB (101) | 1,021 MB | 4.68x | 14.7x | 144s |
| LZ4 (404) | 1,284 MB | 3.72x | 11.7x | 213s |
| LZMA (505) | 909 MB | 5.26x | 16.5x | 192s |
| ZSTD (606) | 923 MB | 5.18x | 16.3x | 171s |
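The ratios against SAM can be recomputed from the sizes and times in the table. The sketch below does this and also derives input throughput (SAM megabytes processed per second), which is not reported in the table. The function name `summarize` is illustrative only. Because the table lists sizes rounded to whole megabytes, the last digit of a recomputed ratio may differ slightly from the listed value.

```python
# Sizes in MB and wall-clock times in seconds, copied from the conversion table.
SAM_MB = 4776
RESULTS = {
    "ZLIB": (1021, 144),
    "LZ4": (1284, 213),
    "LZMA": (909, 192),
    "ZSTD": (923, 171),
}


def summarize(sam_mb, results):
    """Return {name: (ratio_vs_sam, sam_mb_per_second)} for each algorithm."""
    summary = {}
    for name, (out_mb, seconds) in results.items():
        ratio = sam_mb / out_mb        # how many times smaller than the SAM input
        throughput = sam_mb / seconds  # SAM megabytes consumed per second
        summary[name] = (ratio, throughput)
    return summary


if __name__ == "__main__":
    for name, (ratio, mbps) in summarize(SAM_MB, RESULTS).items():
        print(f"{name:5s} ratio={ratio:.2f}x  throughput={mbps:.1f} MB/s")
```

Comparing throughput alongside ratio makes the speed/size tradeoff between algorithms easier to see than wall-clock time alone.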
Region Query Benchmark (1:1000000-2000000, 45,296 records)
| Format | Time | Records |
|---|---|---|
| samtools (BAM) | 0.07s | 45,491 |
| RAMTools ZLIB | 0.86s | 45,296 |
| RAMTools LZ4 | 0.90s | 45,296 |
| RAMTools LZMA | 0.92s | 45,296 |
| RAMTools ZSTD | 0.92s | 45,296 |
Key Findings
1. ZSTD is the best overall algorithm
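ZSTD compresses nearly as well as LZMA (5.18x vs 5.26x relative to SAM) and finishes
the conversion in 171s instead of 192s, about 11% less wall-clock time. ZLIB was the
fastest at 144s, but its output is about 11% larger than ZSTD's.
ZSTD should be considered as the default instead of LZMA.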
2. LZ4 underperforms expectations
LZ4 is typically the fastest algorithm, but in this run it is both the slowest (213s)
and has the worst compression ratio (3.72x). A possible explanation is that LZ4 is
optimized for decompression speed rather than compression speed; this has not been
verified yet. For write-once genomic archives, ZSTD or LZMA are better choices.
3. Region query speed gap vs samtools
RAMTools takes 0.86s to 0.92s vs 0.07s for samtools on the same region, a gap of roughly 12x to 13x.
Possible causes:
- RNTuple page size is not optimized for genomic access patterns
- The index granularity (one entry per ~1000 reads) may be too coarse
- samtools uses a highly optimized BAI index with bin-based lookup
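For context on the last point, the bin-based lookup used by BAI can be sketched as follows. This is a Python transcription of the `reg2bin` and `reg2bins` routines given in the SAM specification, which operate on 0-based half-open intervals. It is shown only to illustrate why a region query touches few index entries, not as a description of how RAMTools indexes data.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest bin that fully contains the 0-based half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kb bins start at 4681
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kb bins start at 585
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mb bins start at 73
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mb bins start at 9
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mb bins start at 1
    return 0                                       # bin 0 spans 512 Mb


def reg2bins(beg: int, end: int) -> list[int]:
    """All bins that may contain reads overlapping [beg, end)."""
    end -= 1
    bins = [0]
    for shift, offset in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        bins.extend(range(offset + (beg >> shift), offset + (end >> shift) + 1))
    return bins


if __name__ == "__main__":
    # Region 1:1000000-2000000 (1-based inclusive) as a 0-based half-open interval.
    candidate_bins = reg2bins(999_999, 2_000_000)
    print(len(candidate_bins), "bins to inspect")
```

A reader only has to inspect the chunks listed for these candidate bins, which is one reason a BAI lookup can stay fast regardless of file size.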
4. Record count discrepancy
RAMTools returns 45,296 records vs 45,491 from samtools for the same region.
This 195-read difference needs investigation. The likely cause is that
secondary/supplementary reads are being filtered differently.
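One way to narrow this down is to count the same reads under several selection rules and see which rule reproduces each tool's number. The sketch below uses the flag bits from the SAM specification (0x100 secondary, 0x800 supplementary). The rules themselves are assumptions about where the two tools might differ, including the possibility (not raised above) that one tool selects reads overlapping the region while the other selects reads starting inside it. The record layout `(pos, ref_len, flag)` and the function names are hypothetical.

```python
SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def overlaps(pos, ref_len, start, end):
    """True if a read covering [pos, pos + ref_len - 1] overlaps [start, end] (1-based, inclusive)."""
    return pos <= end and pos + max(ref_len, 1) - 1 >= start


def count_by_rule(reads, start, end):
    """Count reads under three candidate selection rules.

    reads: iterable of (pos, ref_len, flag) tuples (hypothetical layout).
    """
    counts = {"overlap_all": 0, "overlap_primary": 0, "start_in_region": 0}
    for pos, ref_len, flag in reads:
        if overlaps(pos, ref_len, start, end):
            counts["overlap_all"] += 1
            if not flag & (SECONDARY | SUPPLEMENTARY):
                counts["overlap_primary"] += 1
        if start <= pos <= end:
            counts["start_in_region"] += 1
    return counts


if __name__ == "__main__":
    toy_reads = [
        (999_950, 100, 0),        # starts before the region but overlaps it
        (1_000_500, 100, 0),      # primary read inside the region
        (1_500_000, 100, 0x100),  # secondary alignment
        (1_999_990, 100, 0x800),  # supplementary alignment
        (2_000_001, 100, 0),      # starts after the region
    ]
    print(count_by_rule(toy_reads, 1_000_000, 2_000_000))
```

Running the same counts over the real records from both tools would show whether the 195 reads are explained by flag filtering, by boundary handling, or by neither.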
5. Quality score compression opportunity
Quality scores are the largest field in a SAM file (~30% of data).
The current implementation stores them as encoded strings. Opportunities:
- Illumina binning: reduce the quality alphabet from 40 values to 8 bins (lossy)
- Reference-based quality compression: similar to CRAM
- Quality score dropping: for applications that don't need per-base quality
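The binning option can be sketched as below. The bin boundaries and representative values are an assumption based on the commonly cited Illumina 8-bin scheme and should be checked against Illumina's own documentation before use. Scores below 2 are passed through unchanged here as a stand-in for the no-call bin. The function names are illustrative, and the Phred+33 offset is the standard SAM quality encoding.

```python
# Assumed bin table: (lowest score, highest score, representative score).
ASSUMED_BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
]


def bin_phred(q: int) -> int:
    """Map one Phred score to its bin representative (lossy)."""
    if q >= 40:
        return 40
    for low, high, representative in ASSUMED_BINS:
        if low <= q <= high:
            return representative
    return q  # scores 0 and 1 are left untouched in this sketch


def bin_quality_string(qual: str, offset: int = 33) -> str:
    """Apply binning to a Phred+33 quality string as stored in SAM."""
    return "".join(chr(bin_phred(ord(c) - offset) + offset) for c in qual)


if __name__ == "__main__":
    original = "IIIIHGF@5#"
    print(original, "->", bin_quality_string(original))
```

Fewer distinct symbols lowers the entropy of the quality column, which is what lets a general-purpose compressor do better on it. The transformation is not reversible, so it would need to be opt-in for users who require the original scores.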
6. Field-specific compression
Different fields have very different compressibility:
- seq: 2-bit encoding already applied, good
- qual: high entropy, hardest to compress, biggest opportunity
- cigar: highly repetitive, compresses well
- qname: structured patterns, could use delta encoding
- pos: sorted integers, ideal for delta encoding
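To illustrate the delta-encoding point for pos, the sketch below compares the compressed size of raw sorted positions with that of their deltas. It uses synthetic positions and Python's `zlib` as a stand-in compressor; neither reflects the RAMTools storage layer, and the helper names are hypothetical.

```python
import random
import struct
import zlib


def delta_encode(positions):
    """Replace each position with its difference from the previous one."""
    deltas, previous = [], 0
    for p in positions:
        deltas.append(p - previous)
        previous = p
    return deltas


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    positions, current = [], 0
    for d in deltas:
        current += d
        positions.append(current)
    return positions


def compressed_size(values):
    """Pack values as little-endian uint32 and return the zlib-compressed size in bytes."""
    raw = struct.pack(f"<{len(values)}I", *values)
    return len(zlib.compress(raw, 6))


def synthetic_positions(n, seed=0, max_gap=50):
    """Sorted positions with small random gaps, loosely mimicking coordinate-sorted reads."""
    rng = random.Random(seed)
    positions, current = [], 1_000_000
    for _ in range(n):
        current += rng.randint(0, max_gap)
        positions.append(current)
    return positions


if __name__ == "__main__":
    pos = synthetic_positions(100_000)
    deltas = delta_encode(pos)
    assert delta_decode(deltas) == pos  # encoding is lossless
    print("raw:  ", compressed_size(pos), "bytes")
    print("delta:", compressed_size(deltas), "bytes")
```

The deltas are small numbers with mostly zero high bytes, so the compressor sees far more repetition than in the raw positions. The same idea may carry over to the numeric parts of qname.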
Proposed Optimizations
- Switch the default to ZSTD: better speed/ratio tradeoff than LZMA
- Per-field compression: apply different algorithms per column (RNTuple supports this natively via field-level compression settings)
- Delta encoding for pos: sorted positions compress much better as deltas
- Investigate index granularity: a finer index could close the query speed gap
- Quality score binning: implement Illumina 8-bin compression as a default option
- Investigate the record count discrepancy: align filtering logic with samtools
References
- CRAM compression paper: Hsi-Yang Fritz et al. (2011)
- ZSTD vs LZMA benchmarks: Facebook engineering blog
- RNTuple per-field compression: ROOT documentation