UMI RNA Simulator

A comprehensive Python package for simulating bulk RNA-seq data with Unique Molecular Identifiers (UMIs), including PCR amplification and sequencing errors.

Features

✅ Gene Expression Simulation: Uses negative binomial distribution to model realistic gene expression levels with biological overdispersion
✅ UMI Assignment: Randomly assigns unique molecular identifiers (8-12 bp) to each cDNA molecule
✅ PCR Amplification: Models PCR cycles with configurable efficiency to simulate technical duplicates
✅ Sequencing Errors: Introduces realistic substitution errors in both read sequences and UMIs
✅ Target Read Count: Back-calculate initial molecules needed to achieve a specific number of final reads
✅ Real Gene Annotations: Support for Ensembl GTF files to use real gene IDs and metadata
✅ Real Genomic Sequences: Extract actual sequences from reference genome FASTA files
✅ Memory Optimized: Efficient handling of large simulations with automatic cleanup
✅ Paired-End Reads: Support for paired-end sequencing with configurable insert sizes
✅ FASTQ Output: Generates standard gzipped FASTQ files with UMI tags
✅ Pipeline Validation: Compare pipeline results against ground truth with statistical metrics and plots
✅ Detailed Statistics: Provides comprehensive statistics including duplication rates and per-gene metrics
✅ Full Test Suite: Comprehensive pytest test suite with 94% code coverage

Quick Start

Installation

pip install umi-rna-simulator

Or install from source:

git clone https://github.com/lloydm/umi-rna-simulator.git
cd umi-rna-simulator
pip install -e .

Basic Usage

# Command line
umi-simulator --genes 1000 --molecules 50000 -o output/sim

# Or as Python module
python -m umi_simulator --genes 1000 --molecules 50000 -o output/sim

Python API

from umi_simulator import BulkRNAUMISimulator

sim = BulkRNAUMISimulator(
    n_genes=1000,
    total_molecules=50000,
    pcr_cycles=10,
    random_seed=42
)

sim.run_simulation(output_prefix="output/my_sim")

Requirements

Python 3.10 or later
NumPy >= 1.20.0
SciPy >= 1.7.0

Documentation

Quick Start Guide - Get started in minutes
Installation - Detailed installation instructions
Examples - Example scripts for common use cases
Contributing - Guidelines for contributors
Changelog - Version history

Usage Examples

Paired-End Mode

Generate paired-end reads with configurable fragment length and insert size:

umi-simulator \
    --paired-end \
    --fragment-length 300 \
    --read-length 75 \
    --molecules 50000 \
    -o paired_sim

Creates:

paired_sim_R1.fastq.gz - Forward reads
paired_sim_R2.fastq.gz - Reverse reads

Using Real Genomic Data

umi-simulator \
    --gtf genes.gtf \
    --fasta genome.fa \
    --use-real-sequences \
    --molecules 100000 \
    -o real_genome

Target Read Count

# Simulator calculates required initial molecules
umi-simulator \
    --target-reads 1000000 \
    --pcr-cycles 10 \
    -o target_sim

Key Parameters

Parameter	Default	Description
`--genes`	100	Number of genes to simulate
`--molecules`	1000	Initial cDNA molecules
`--target-reads`	None	Target final read count (alternative to --molecules)
`--umi-length`	10	UMI barcode length (bp)
`--read-length`	100	Read length (bp)
`--pcr-cycles`	10	PCR amplification cycles
`--pcr-efficiency`	0.7	PCR efficiency (0-1)
`--seq-error-rate`	0.001	Read sequencing error rate
`--umi-error-rate`	0.002	UMI sequencing error rate
`--paired-end`	False	Generate paired-end reads
`--fragment-length`	300	Fragment length for PE (bp)
`--umi-in-sequence`	False	Prepend UMI to sequence (vs header only)
`--gtf`	None	GTF annotation file
`--fasta`	None	Genome FASTA file
`--use-real-sequences`	False	Extract real sequences from FASTA
`-o, --output`	output	Output file prefix
`-s, --seed`	42	Random seed

Output Files

Each simulation generates:

FASTQ file(s): output.fastq.gz (or _R1.fastq.gz / _R2.fastq.gz for paired-end)
Statistics: output_stats.txt - Simulation parameters and metrics
Ground Truth: output_truth.txt - Per-gene molecule counts for validation

1. FASTQ File (`<prefix>.fastq.gz`)

Standard gzipped FASTQ format with UMI in the header:

@SIM:0:ENSG00000139618:BRCA2:UMI:ACGTACGTAC
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG...
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...

Header format:

Without GTF: @SIM:<read_number>:GENE<gene_id>:UMI:<umi_sequence>
With GTF: @SIM:<read_number>:<ensembl_id>:<gene_name>:UMI:<umi_sequence>

Paired-end mode: Creates two files <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz

2. Statistics File (`<prefix>_stats.txt`)

Contains detailed simulation statistics:

Total reads and unique UMIs
UMI duplication statistics
Amplification metrics
Top expressed genes
Duplication rate

3. Ground Truth File (`<prefix>_truth.txt`)

Tab-delimited file with true expression counts for all simulated genes:

gene_id	true_molecules	unique_umis	final_reads
0	12	12	6186
1	10	10	5136
2	6	6	3220

Columns:

gene_id: Gene identifier (numeric ID or Ensembl ID if using GTF)
true_molecules: Number of original cDNA molecules before PCR (ground truth)
unique_umis: Number of unique UMIs assigned to this gene
final_reads: Total reads after PCR amplification (with PCR duplicates; i.e., raw total reads)

Use cases:

Benchmarking deduplication algorithms
Evaluating expression quantification accuracy
Testing differential expression detection
Validating UMI-based counting methods

Simulation Pipeline

Step 1: Generate Molecule Counts

Uses negative binomial distribution to simulate realistic gene expression levels with overdispersion similar to real RNA-seq data.

Step 2: Assign UMIs

Each molecule receives a random UMI sequence. This establishes the "ground truth" for deduplication testing.

Step 3: Simulate PCR Amplification

Models PCR cycles where molecules are duplicated based on efficiency:

Each molecule has a probability (efficiency) of being amplified per cycle
Creates technical duplicates that share the same UMI

Step 4: Add Sequencing Errors

Introduces substitution errors to:

Read sequences: Simulates Illumina sequencing errors
UMI sequences: Tests robustness of UMI-based deduplication

Step 5: Generate Output

Creates FASTQ files with UMIs embedded in read headers, compatible with standard UMI-tools pipelines.

Example Workflow

1. Generate Simulated Data

# Generate simulation with real gene IDs
umi-simulator \
    --gtf genes.gtf \
    --fasta genome.fa \
    --use-real-sequences \
    --genes 5000 \
    --molecules 200000 \
    --output test_data

2. Process with UMI-tools

# Extract UMIs from header
umi_tools extract \
    -I test_data.fastq.gz \
    --bc-pattern=NNNNNNNNNN \
    --stdout test_data_extracted.fastq.gz

# Align reads with minimap2
minimap2 -ax map-ont \
    genome.fa \
    test_data_extracted.fastq.gz | \
    samtools view -bS | \
    samtools sort -o test_data_aligned.bam

samtools index test_data_aligned.bam

# Deduplicate
umi_tools dedup \
    -I test_data_aligned.bam \
    --output-stats=deduplicated \
    -S test_data_deduped.bam

# Quantify gene expression (e.g., featureCounts, RSEM, etc.)
featureCounts \
    -a genes.gtf \
    -o test_data_counts.txt \
    test_data_deduped.bam

Validating Pipeline Results

The package includes a comparison utility to validate your UMI deduplication pipeline against simulation ground truth:

# Compare featureCounts or RSEM output with simulation truth
umi-compare simulation_truth.txt pipeline_counts.txt

# Generate detailed per-gene report and/or generate scatter plot with regression line (requires matplotlib)
umi-compare simulation_truth.txt pipeline_counts.txt --report comparison.txt --plot comparison.png

Supported input formats:

featureCounts output (tab-delimited with gene ID in column 1 and counts in last column)
RSEM .genes.results files (expected_count column)
Auto-detects format based on file structure

Output metrics:

Pearson correlation coefficient
Mean absolute error (MAE)
Mean absolute percent error (MAPE)
Per-gene comparison (with --report option)
Scatter plot with regression line (with --plot option)

Example workflow:

# 1. Run simulation
umi-simulator --genes 1000 --molecules 50000 -o sim/data

# 2. Process with your pipeline (e.g., STAR + UMI-tools)
# ... your pipeline commands ...

# 3. Validate results
umi-compare sim/data_truth.txt results/featureCounts.txt --report validation.txt

--report writes a tab-delimited file with columns: Gene, Truth, Predicted, Diff, PctDiff for all genes in the truth set.
--plot saves a scatter plot (truth vs predicted) with a regression line and formula ($y = mx + b$, $R^2$) inside the plot. Example output:

Additional Details

Error Model

Simple substitution errors (no indels)
Independent error probability per base
Higher error rate for UMIs reflects lower quality in index reads
Errors applied after PCR to simulate sequencing stage

Expression Model

Negative binomial distribution (n=10, p=0.3) for realistic overdispersion
Normalized to specified total molecule count
Reflects the biological variability seen in real bulk RNA-seq
SPerformance and Scalability

Benchmarks

Small simulation (10K genes, 100K molecules): ~10 seconds
Medium simulation (20K genes, 1M molecules): ~60 seconds
Large simulation (20K genes, 5M molecules): ~5 minutes

Memory Usage

Memory-efficient design with automatic cleanup
~1-2 GB RAM for typical simulations (1M molecules)
Scales linearly with final read count (after PCR)

Tips for Large Simulations

Use --target-reads to control final output size
Reduce --pcr-cycles or --pcr-efficiency to limit amplification
Enable real sequences only when needed (slower but more realistic)
Use smaller gene counts for initial testing

Customization

The BulkRNAUMISimulator class can be easily modified or extended to:

Use different expression distributions (Poisson, log-normal)
Add strand-specific information to reads
Include paired-end read simulation
Model GC-bias or length-dependent bias
Add quality-dependent error models
Implement custom PCR efficiency models
Add fragment size selection

Example programmatic usage:

from simulate_bulk_rna_umi import BulkRNAUMISimulator

# Create simulator instance
simulator = BulkRNAUMISimulator(
    n_genes=5000,
    target_reads=1000000,  # Back-calculate molecules
    umi_length=12,
    pcr_cycles=10,
    pcr_efficiency=0.75,
    random_seed=42
)

# Run simulation
simulator.run_simulation(output_prefix="custom_sim")

# Access internal data for analysis
print(f"Total molecules: {len(simulator.molecules)}")
print(f"Genes simulated: {len(simulator.gene_info)}")

Testing

Run the test suite:

# Install dev dependencies
pip install -r requirements-dev.txt

# Run all tests
pytest tests/ -v

# With coverage report
pytest tests/ -v --cov=umi_simulator --cov-report=html

# Run integration tests
cd tests/integration
bash test_full_workflow_bacterial.sh    # Bacterial genome
bash test_full_workflow_mouse_se.sh     # Mouse chr19 single-end
bash test_full_workflow_mouse_pe.sh     # Mouse chr19 paired-end

License

MIT License - Feel free to modify and distribute

Citation

If you use this simulator in your research, please cite appropriately and mention the simulation parameters used.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
scripts		scripts
tests		tests
umi_simulator		umi_simulator
.coveragerc		.coveragerc
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

UMI RNA Simulator

Features

Quick Start

Installation

Basic Usage

Python API

Requirements

Documentation

Usage Examples

Paired-End Mode

Using Real Genomic Data

Target Read Count

Key Parameters

Output Files

1. FASTQ File (<prefix>.fastq.gz)

2. Statistics File (<prefix>_stats.txt)

3. Ground Truth File (<prefix>_truth.txt)

Simulation Pipeline

Step 1: Generate Molecule Counts

Step 2: Assign UMIs

Step 3: Simulate PCR Amplification

Step 4: Add Sequencing Errors

Step 5: Generate Output

Example Workflow

1. Generate Simulated Data

2. Process with UMI-tools

Validating Pipeline Results

Additional Details

Error Model

Expression Model

Benchmarks

Memory Usage

Tips for Large Simulations

Customization

Testing

License

Citation

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. FASTQ File (`<prefix>.fastq.gz`)

2. Statistics File (`<prefix>_stats.txt`)

3. Ground Truth File (`<prefix>_truth.txt`)

Packages