Compression algorithm analysis on real genomic data (HG00096, chr1, 4.7GB SAM)

## Motivation
Following the suggestion to benchmark compression on large real-world genomic data,
I compared all supported compression algorithms on chromosome 1 of
HG00096 from the 1000 Genomes Project (4.7 GB SAM, 11.3M reads).

## Environment
- OS: Ubuntu 24.04
- ROOT: 6.36.04
- GCC: 13.3
- Dataset: HG00096.bam (1000 Genomes Project, chr1 extracted)
- SAM size: 4,776 MB | BAM size: 15 GB | Reads: 11,277,558

## Conversion Benchmark

| Algorithm | Output Size | Compression Ratio vs SAM | Compression Ratio vs BAM | Time |
|-----------|-------------|--------------------------|--------------------------|------|
| BAM (baseline) | 15,000 MB | 0.32x | 1x | n/a |
| SAM (baseline) | 4,776 MB | 1x | 3.14x | n/a |
| ZLIB (101) | 1,021 MB | 4.68x | 14.7x | **144s** |
| LZ4 (404) | 1,284 MB | 3.72x | 11.7x | 213s |
| LZMA (505) | 909 MB | **5.26x** | **16.9x** | 192s |
| ZSTD (606) | 923 MB | 5.18x | 16.6x | 171s |
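
The "vs SAM" ratios and the relative conversion times can be recomputed from the table. This is a minimal sketch of that arithmetic; the helper names are illustrative, and because the sizes are rounded to whole MB the last digit may differ slightly from the table.

```python
# Sizes (MB) and conversion times (s) copied from the table above.
SAM_MB = 4776
results = {
    "ZLIB": (1021, 144),
    "LZ4": (1284, 213),
    "LZMA": (909, 192),
    "ZSTD": (923, 171),
}


def compression_ratio(baseline_mb, output_mb):
    """How many times smaller the output is than the baseline."""
    return baseline_mb / output_mb


def time_saved(slower_s, faster_s):
    """Fraction of the slower run's time that the faster run saves."""
    return (slower_s - faster_s) / slower_s


for name, (size_mb, seconds) in results.items():
    print(f"{name}: {compression_ratio(SAM_MB, size_mb):.2f}x vs SAM in {seconds}s")

lzma_s, zstd_s = results["LZMA"][1], results["ZSTD"][1]
print(f"ZSTD needs {time_saved(lzma_s, zstd_s):.1%} less time than LZMA")
```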

## Region Query Benchmark (1:1000000-2000000, 45,296 records)

| Format | Time | Records |
|--------|------|---------|
| samtools (BAM) | **0.07s** | 45,491 |
| RAMTools ZLIB | 0.86s | 45,296 |
| RAMTools LZ4 | 0.90s | 45,296 |
| RAMTools LZMA | 0.92s | 45,296 |
| RAMTools ZSTD | 0.92s | 45,296 |

## Key Findings

### 1. ZSTD is the best overall algorithm
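ZSTD achieves near-LZMA compression (5.18x vs 5.26x) while finishing about 11% sooner (171s vs 192s).
ZLIB is faster still (144s) but compresses noticeably worse (4.68x), so ZSTD offers the best
balance of size and speed. It should be considered as the default instead of LZMA.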

### 2. LZ4 underperforms expectations
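LZ4 is typically the fastest algorithm, but here it is both the slowest (213s) and has
the worst compression ratio (3.72x vs SAM). This is likely because LZ4 is optimized for
decompression speed, not compression speed. The region query timings show no read-side
advantage for LZ4 either (0.90s vs 0.86-0.92s for the others), so for write-once genomic
archives ZSTD or LZMA are better choices.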
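
The numbers in parentheses in the conversion table look like ROOT compression settings, which pack algorithm and level into one integer as `algorithm * 100 + level`. The sketch below decodes them using the algorithm numbering from ROOT's `RCompressionSetting::EAlgorithm` (1 = ZLIB, 2 = LZMA, 4 = LZ4, 5 = ZSTD). Whether RAMTools passes these values to ROOT unchanged is an assumption. If it does, 505 would select ZSTD rather than LZMA and 606 would not map to a defined algorithm, which is worth double-checking before drawing conclusions from the table.

```python
# Algorithm numbering as defined in ROOT's RCompressionSetting::EAlgorithm.
ROOT_ALGORITHMS = {
    1: "ZLIB",
    2: "LZMA",
    3: "old ROOT algorithm",
    4: "LZ4",
    5: "ZSTD",
}


def decode_setting(setting):
    """Split a ROOT compression setting into (algorithm name, level)."""
    algorithm, level = divmod(setting, 100)
    return ROOT_ALGORITHMS.get(algorithm, "undefined"), level


# Labels and codes exactly as they appear in the conversion table.
for label, setting in [("ZLIB", 101), ("LZ4", 404), ("LZMA", 505), ("ZSTD", 606)]:
    name, level = decode_setting(setting)
    print(f"table label {label} ({setting}) -> {name}, level {level}")
```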

### 3. Region query speed gap vs samtools
RAMTools takes ~0.9s vs 0.07s for samtools on the same region, a gap of roughly 12-13x.
Possible causes:
- RNTuple page size is not optimized for genomic access patterns
- The index granularity (one entry per ~1000 reads) may be too coarse
- samtools uses a highly optimized BAI index with bin-based lookup
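
For comparison, the bin-based lookup mentioned in the last bullet can be sketched with the `reg2bin` and `reg2bins` calculations from the SAM specification, ported here to Python. `reg2bin` gives the smallest bin that fully contains an interval, and `reg2bins` lists every bin that may hold overlapping reads, so a query only inspects a bounded set of bins. This illustrates the BAI scheme only; it says nothing about how the RAMTools index is implemented.

```python
def reg2bin(beg, end):
    """Smallest BAI bin containing the 0-based half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


def reg2bins(beg, end):
    """All BAI bins that may contain reads overlapping [beg, end)."""
    end -= 1
    bins = [0]
    # (offset of the first bin on a level, shift for that level's bin width)
    for offset, shift in [(1, 26), (9, 23), (73, 20), (585, 17), (4681, 14)]:
        bins.extend(range(offset + (beg >> shift), offset + (end >> shift) + 1))
    return bins


# The query region 1:1000000-2000000 as a 0-based half-open interval.
region = (999_999, 2_000_000)
print("containing bin:", reg2bin(*region))
print("bins to inspect:", len(reg2bins(*region)))
```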

### 4. Record count discrepancy
RAMTools returns 45,296 records vs 45,491 from samtools for the same region.
This 195-record difference needs investigation. The most likely cause is that
secondary/supplementary reads are being filtered differently.
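
One way to test the secondary/supplementary hypothesis is to break the samtools output for the region down by FLAG bits. The sketch below tallies primary, secondary (0x100) and supplementary (0x800) records from SAM text lines, such as the output of `samtools view` for the region. The function name and the sample records are illustrative, not taken from HG00096. If secondary plus supplementary adds up to 195 for this region, the filtering explanation holds.

```python
from collections import Counter

SECONDARY = 0x100      # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment


def classify_flags(sam_lines):
    """Count records per alignment class from an iterable of SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & SUPPLEMENTARY:
            counts["supplementary"] += 1
        elif flag & SECONDARY:
            counts["secondary"] += 1
        else:
            counts["primary"] += 1
    return counts


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\t1\t1000010\t60\t100M\t=\t1000200\t290\t*\t*",
    "r2\t355\t1\t1000020\t0\t100M\t=\t1000300\t380\t*\t*",
    "r3\t2064\t1\t1000030\t60\t40M60S\t*\t0\t0\t*\t*",
]
print(dict(classify_flags(example)))
```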

### 5. Quality score compression opportunity
Quality scores are the largest field in a SAM file (~30% of data).
Current implementation stores them as encoded strings. Opportunities:
- **Illumina binning** — reduce quality alphabet from 40 values to 8 bins
- **Reference-based quality compression** — similar to CRAM
- **Quality score dropping** — for applications that don't need per-base quality
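
The binning option can be sketched as follows. The bin boundaries and representative scores are the commonly cited Illumina 8-level mapping (2-9 to 6, 10-19 to 15, 20-24 to 22, 25-29 to 27, 30-34 to 33, 35-39 to 37, 40 and above to 40) and should be checked against Illumina's documentation before use. Passing scores below 2 through unchanged and assuming Phred+33 encoding are choices made for this sketch, and the function names are hypothetical.

```python
# (lowest score in the bin, representative score), highest bin first.
# Assumption: commonly cited Illumina 8-level mapping, to be verified.
BINS = [(40, 40), (35, 37), (30, 33), (25, 27), (20, 22), (10, 15), (2, 6)]


def bin_phred(q):
    """Map one Phred score to its bin's representative score."""
    for low, representative in BINS:
        if q >= low:
            return representative
    return q  # assumption: scores below 2 (no-call) pass through unchanged


def bin_quality_string(qual, offset=33):
    """Bin every score in a Phred+33 quality string."""
    return "".join(chr(bin_phred(ord(c) - offset) + offset) for c in qual)


original = "IIHG5510#"
binned = bin_quality_string(original)
print(original, "->", binned)
print("distinct symbols:", len(set(original)), "->", len(set(binned)))
```

Binning is lossy, so it changes the stored data; whether that is acceptable depends on the downstream analysis.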

### 6. Field-specific compression
Different fields have very different compressibility:
- `seq` — 2-bit encoding already applied, good
- `qual` — high entropy, hardest to compress, biggest opportunity
- `cigar` — highly repetitive, compresses well
- `qname` — structured patterns, could use delta encoding
- `pos` — sorted integers, ideal for delta encoding
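
To illustrate why sorted `pos` values suit delta encoding, the sketch below compares the compressed size of raw positions and of their deltas. The positions are synthetic (random gaps of 0 to 200 bases), and `zlib` from the Python standard library stands in for whatever codec the writer would use, so the numbers only show the direction of the effect, not what RAMTools would achieve.

```python
import random
import struct
import zlib


def delta_encode(positions):
    """Replace each position with its distance from the previous one."""
    deltas, previous = [], 0
    for p in positions:
        deltas.append(p - previous)
        previous = p
    return deltas


def delta_decode(deltas):
    """Rebuild the original positions with a running sum."""
    positions, total = [], 0
    for d in deltas:
        total += d
        positions.append(total)
    return positions


def compressed_size(values):
    """Bytes after packing as little-endian uint32 and compressing with zlib."""
    raw = struct.pack(f"<{len(values)}I", *values)
    return len(zlib.compress(raw))


# Synthetic, coordinate-sorted positions (not real HG00096 data).
random.seed(0)
positions, pos = [], 1_000_000
for _ in range(10_000):
    pos += random.randint(0, 200)
    positions.append(pos)

deltas = delta_encode(positions)
print("raw positions:", compressed_size(positions), "bytes")
print("delta encoded:", compressed_size(deltas), "bytes")
```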

## Proposed Optimizations

1. **Switch default to ZSTD** — better speed/ratio tradeoff than LZMA
2. **Per-field compression** — apply different algorithms per column
   (assuming RNTuple exposes field-level compression settings; to be confirmed against the ROOT documentation)
3. **Delta encoding for pos** — sorted positions compress much better with delta
4. **Investigate index granularity** — finer index could close the query speed gap
5. **Quality score binning** — implement Illumina 8-bin compression as default option
6. **Investigate record count discrepancy** — align filtering logic with samtools

## References
- CRAM compression paper: Hsi-Yang Fritz et al. (2011)
- ZSTD vs LZMA benchmarks: Facebook engineering blog
- RNTuple per-field compression: ROOT documentation
