Skip to content

TheJacksonLaboratory/umi-bulk-rna-simulator

Repository files navigation

UMI RNA Simulator

Coverage Python PyPI License

A comprehensive Python package for simulating bulk RNA-seq data with Unique Molecular Identifiers (UMIs), including PCR amplification and sequencing errors.

Features

Gene Expression Simulation: Uses negative binomial distribution to model realistic gene expression levels with biological overdispersion
UMI Assignment: Randomly assigns unique molecular identifiers (8-12 bp) to each cDNA molecule
PCR Amplification: Models PCR cycles with configurable efficiency to simulate technical duplicates
Sequencing Errors: Introduces realistic substitution errors in both read sequences and UMIs
Target Read Count: Back-calculate initial molecules needed to achieve a specific number of final reads
Real Gene Annotations: Support for Ensembl GTF files to use real gene IDs and metadata
Real Genomic Sequences: Extract actual sequences from reference genome FASTA files
Memory Optimized: Efficient handling of large simulations with automatic cleanup
Paired-End Reads: Support for paired-end sequencing with configurable insert sizes
FASTQ Output: Generates standard gzipped FASTQ files with UMI tags
Pipeline Validation: Compare pipeline results against ground truth with statistical metrics and plots
Detailed Statistics: Provides comprehensive statistics including duplication rates and per-gene metrics
Full Test Suite: Comprehensive pytest test suite with 94% code coverage

Quick Start

Installation

pip install umi-rna-simulator

Or install from source:

git clone https://github.com/lloydm/umi-rna-simulator.git
cd umi-rna-simulator
pip install -e .

Basic Usage

# Command line
umi-simulator --genes 1000 --molecules 50000 -o output/sim

# Or as Python module
python -m umi_simulator --genes 1000 --molecules 50000 -o output/sim

Python API

from umi_simulator import BulkRNAUMISimulator

sim = BulkRNAUMISimulator(
    n_genes=1000,
    total_molecules=50000,
    pcr_cycles=10,
    random_seed=42
)

sim.run_simulation(output_prefix="output/my_sim")

Requirements

  • Python 3.10 or later
  • NumPy >= 1.20.0
  • SciPy >= 1.7.0

Documentation

Usage Examples

Paired-End Mode

Generate paired-end reads with configurable fragment length and insert size:

umi-simulator \
    --paired-end \
    --fragment-length 300 \
    --read-length 75 \
    --molecules 50000 \
    -o paired_sim

Creates:

  • paired_sim_R1.fastq.gz - Forward reads
  • paired_sim_R2.fastq.gz - Reverse reads

Using Real Genomic Data

umi-simulator \
    --gtf genes.gtf \
    --fasta genome.fa \
    --use-real-sequences \
    --molecules 100000 \
    -o real_genome

Target Read Count

# Simulator calculates required initial molecules
umi-simulator \
    --target-reads 1000000 \
    --pcr-cycles 10 \
    -o target_sim

Key Parameters

Parameter Default Description
--genes 100 Number of genes to simulate
--molecules 1000 Initial cDNA molecules
--target-reads None Target final read count (alternative to --molecules)
--umi-length 10 UMI barcode length (bp)
--read-length 100 Read length (bp)
--pcr-cycles 10 PCR amplification cycles
--pcr-efficiency 0.7 PCR efficiency (0-1)
--seq-error-rate 0.001 Read sequencing error rate
--umi-error-rate 0.002 UMI sequencing error rate
--paired-end False Generate paired-end reads
--fragment-length 300 Fragment length for PE (bp)
--umi-in-sequence False Prepend UMI to sequence (vs header only)
--gtf None GTF annotation file
--fasta None Genome FASTA file
--use-real-sequences False Extract real sequences from FASTA
-o, --output output Output file prefix
-s, --seed 42 Random seed

Output Files

Each simulation generates:

  1. FASTQ file(s): output.fastq.gz (or _R1.fastq.gz / _R2.fastq.gz for paired-end)
  2. Statistics: output_stats.txt - Simulation parameters and metrics
  3. Ground Truth: output_truth.txt - Per-gene molecule counts for validation

1. FASTQ File (<prefix>.fastq.gz)

Standard gzipped FASTQ format with UMI in the header:

@SIM:0:ENSG00000139618:BRCA2:UMI:ACGTACGTAC
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG...
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...

Header format:

  • Without GTF: @SIM:<read_number>:GENE<gene_id>:UMI:<umi_sequence>
  • With GTF: @SIM:<read_number>:<ensembl_id>:<gene_name>:UMI:<umi_sequence>

Paired-end mode: Creates two files <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz

2. Statistics File (<prefix>_stats.txt)

Contains detailed simulation statistics:

  • Total reads and unique UMIs
  • UMI duplication statistics
  • Amplification metrics
  • Top expressed genes
  • Duplication rate

3. Ground Truth File (<prefix>_truth.txt)

Tab-delimited file with true expression counts for all simulated genes:

gene_id	true_molecules	unique_umis	final_reads
0	12	12	6186
1	10	10	5136
2	6	6	3220

Columns:

  • gene_id: Gene identifier (numeric ID or Ensembl ID if using GTF)
  • true_molecules: Number of original cDNA molecules before PCR (ground truth)
  • unique_umis: Number of unique UMIs assigned to this gene
  • final_reads: Total reads after PCR amplification (with PCR duplicates; i.e., raw total reads)

Use cases:

  • Benchmarking deduplication algorithms
  • Evaluating expression quantification accuracy
  • Testing differential expression detection
  • Validating UMI-based counting methods

Simulation Pipeline

Step 1: Generate Molecule Counts

Uses negative binomial distribution to simulate realistic gene expression levels with overdispersion similar to real RNA-seq data.

Step 2: Assign UMIs

Each molecule receives a random UMI sequence. This establishes the "ground truth" for deduplication testing.

Step 3: Simulate PCR Amplification

Models PCR cycles where molecules are duplicated based on efficiency:

  • Each molecule has a probability (efficiency) of being amplified per cycle
  • Creates technical duplicates that share the same UMI

Step 4: Add Sequencing Errors

Introduces substitution errors to:

  • Read sequences: Simulates Illumina sequencing errors
  • UMI sequences: Tests robustness of UMI-based deduplication

Step 5: Generate Output

Creates FASTQ files with UMIs embedded in read headers, compatible with standard UMI-tools pipelines.

Example Workflow

1. Generate Simulated Data

# Generate simulation with real gene IDs
umi-simulator \
    --gtf genes.gtf \
    --fasta genome.fa \
    --use-real-sequences \
    --genes 5000 \
    --molecules 200000 \
    --output test_data

2. Process with UMI-tools

# Extract UMIs from header
umi_tools extract \
    -I test_data.fastq.gz \
    --bc-pattern=NNNNNNNNNN \
    --stdout test_data_extracted.fastq.gz

# Align reads with minimap2
minimap2 -ax map-ont \
    genome.fa \
    test_data_extracted.fastq.gz | \
    samtools view -bS | \
    samtools sort -o test_data_aligned.bam

samtools index test_data_aligned.bam

# Deduplicate
umi_tools dedup \
    -I test_data_aligned.bam \
    --output-stats=deduplicated \
    -S test_data_deduped.bam

# Quantify gene expression (e.g., featureCounts, RSEM, etc.)
featureCounts \
    -a genes.gtf \
    -o test_data_counts.txt \
    test_data_deduped.bam

Validating Pipeline Results

The package includes a comparison utility to validate your UMI deduplication pipeline against simulation ground truth:

# Compare featureCounts or RSEM output with simulation truth
umi-compare simulation_truth.txt pipeline_counts.txt

# Generate detailed per-gene report and/or generate scatter plot with regression line (requires matplotlib)
umi-compare simulation_truth.txt pipeline_counts.txt --report comparison.txt --plot comparison.png

Supported input formats:

  • featureCounts output (tab-delimited with gene ID in column 1 and counts in last column)
  • RSEM .genes.results files (expected_count column)
  • Auto-detects format based on file structure

Output metrics:

  • Pearson correlation coefficient
  • Mean absolute error (MAE)
  • Mean absolute percent error (MAPE)
  • Per-gene comparison (with --report option)
  • Scatter plot with regression line (with --plot option)

Example workflow:

# 1. Run simulation
umi-simulator --genes 1000 --molecules 50000 -o sim/data

# 2. Process with your pipeline (e.g., STAR + UMI-tools)
# ... your pipeline commands ...

# 3. Validate results
umi-compare sim/data_truth.txt results/featureCounts.txt --report validation.txt
  • --report writes a tab-delimited file with columns: Gene, Truth, Predicted, Diff, PctDiff for all genes in the truth set.
  • --plot saves a scatter plot (truth vs predicted) with a regression line and formula ($y = mx + b$, $R^2$) inside the plot. Example output:

Comparison Scatter Plot

Additional Details

Error Model

  • Simple substitution errors (no indels)
  • Independent error probability per base
  • Higher error rate for UMIs reflects lower quality in index reads
  • Errors applied after PCR to simulate sequencing stage

Expression Model

  • Negative binomial distribution (n=10, p=0.3) for realistic overdispersion
  • Normalized to specified total molecule count
  • Reflects the biological variability seen in real bulk RNA-seq
  • SPerformance and Scalability

Benchmarks

  • Small simulation (10K genes, 100K molecules): ~10 seconds
  • Medium simulation (20K genes, 1M molecules): ~60 seconds
  • Large simulation (20K genes, 5M molecules): ~5 minutes

Memory Usage

  • Memory-efficient design with automatic cleanup
  • ~1-2 GB RAM for typical simulations (1M molecules)
  • Scales linearly with final read count (after PCR)

Tips for Large Simulations

  • Use --target-reads to control final output size
  • Reduce --pcr-cycles or --pcr-efficiency to limit amplification
  • Enable real sequences only when needed (slower but more realistic)
  • Use smaller gene counts for initial testing

Customization

The BulkRNAUMISimulator class can be easily modified or extended to:

  • Use different expression distributions (Poisson, log-normal)
  • Add strand-specific information to reads
  • Include paired-end read simulation
  • Model GC-bias or length-dependent bias
  • Add quality-dependent error models
  • Implement custom PCR efficiency models
  • Add fragment size selection

Example programmatic usage:

from simulate_bulk_rna_umi import BulkRNAUMISimulator

# Create simulator instance
simulator = BulkRNAUMISimulator(
    n_genes=5000,
    target_reads=1000000,  # Back-calculate molecules
    umi_length=12,
    pcr_cycles=10,
    pcr_efficiency=0.75,
    random_seed=42
)

# Run simulation
simulator.run_simulation(output_prefix="custom_sim")

# Access internal data for analysis
print(f"Total molecules: {len(simulator.molecules)}")
print(f"Genes simulated: {len(simulator.gene_info)}")

Testing

Run the test suite:

# Install dev dependencies
pip install -r requirements-dev.txt

# Run all tests
pytest tests/ -v

# With coverage report
pytest tests/ -v --cov=umi_simulator --cov-report=html

# Run integration tests
cd tests/integration
bash test_full_workflow_bacterial.sh    # Bacterial genome
bash test_full_workflow_mouse_se.sh     # Mouse chr19 single-end
bash test_full_workflow_mouse_pe.sh     # Mouse chr19 paired-end

License

MIT License - Feel free to modify and distribute

Citation

If you use this simulator in your research, please cite appropriately and mention the simulation parameters used.

About

No description, website, or topics provided.

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors