A comprehensive Python package for simulating bulk RNA-seq data with Unique Molecular Identifiers (UMIs), including PCR amplification and sequencing errors.
✅ Gene Expression Simulation: Uses negative binomial distribution to model realistic gene expression levels with biological overdispersion
✅ UMI Assignment: Randomly assigns unique molecular identifiers (8-12 bp) to each cDNA molecule
✅ PCR Amplification: Models PCR cycles with configurable efficiency to simulate technical duplicates
✅ Sequencing Errors: Introduces realistic substitution errors in both read sequences and UMIs
✅ Target Read Count: Back-calculate initial molecules needed to achieve a specific number of final reads
✅ Real Gene Annotations: Support for Ensembl GTF files to use real gene IDs and metadata
✅ Real Genomic Sequences: Extract actual sequences from reference genome FASTA files
✅ Memory Optimized: Efficient handling of large simulations with automatic cleanup
✅ Paired-End Reads: Support for paired-end sequencing with configurable insert sizes
✅ FASTQ Output: Generates standard gzipped FASTQ files with UMI tags
✅ Pipeline Validation: Compare pipeline results against ground truth with statistical metrics and plots
✅ Detailed Statistics: Provides comprehensive statistics including duplication rates and per-gene metrics
✅ Full Test Suite: Comprehensive pytest test suite with 94% code coverage
pip install umi-rna-simulatorOr install from source:
git clone https://github.com/lloydm/umi-rna-simulator.git
cd umi-rna-simulator
pip install -e .# Command line
umi-simulator --genes 1000 --molecules 50000 -o output/sim
# Or as Python module
python -m umi_simulator --genes 1000 --molecules 50000 -o output/simfrom umi_simulator import BulkRNAUMISimulator
sim = BulkRNAUMISimulator(
n_genes=1000,
total_molecules=50000,
pcr_cycles=10,
random_seed=42
)
sim.run_simulation(output_prefix="output/my_sim")- Python 3.10 or later
- NumPy >= 1.20.0
- SciPy >= 1.7.0
- Quick Start Guide - Get started in minutes
- Installation - Detailed installation instructions
- Examples - Example scripts for common use cases
- Contributing - Guidelines for contributors
- Changelog - Version history
Generate paired-end reads with configurable fragment length and insert size:
umi-simulator \
--paired-end \
--fragment-length 300 \
--read-length 75 \
--molecules 50000 \
-o paired_simCreates:
paired_sim_R1.fastq.gz- Forward readspaired_sim_R2.fastq.gz- Reverse reads
umi-simulator \
--gtf genes.gtf \
--fasta genome.fa \
--use-real-sequences \
--molecules 100000 \
-o real_genome# Simulator calculates required initial molecules
umi-simulator \
--target-reads 1000000 \
--pcr-cycles 10 \
-o target_sim| Parameter | Default | Description |
|---|---|---|
--genes |
100 | Number of genes to simulate |
--molecules |
1000 | Initial cDNA molecules |
--target-reads |
None | Target final read count (alternative to --molecules) |
--umi-length |
10 | UMI barcode length (bp) |
--read-length |
100 | Read length (bp) |
--pcr-cycles |
10 | PCR amplification cycles |
--pcr-efficiency |
0.7 | PCR efficiency (0-1) |
--seq-error-rate |
0.001 | Read sequencing error rate |
--umi-error-rate |
0.002 | UMI sequencing error rate |
--paired-end |
False | Generate paired-end reads |
--fragment-length |
300 | Fragment length for PE (bp) |
--umi-in-sequence |
False | Prepend UMI to sequence (vs header only) |
--gtf |
None | GTF annotation file |
--fasta |
None | Genome FASTA file |
--use-real-sequences |
False | Extract real sequences from FASTA |
-o, --output |
output | Output file prefix |
-s, --seed |
42 | Random seed |
Each simulation generates:
- FASTQ file(s):
output.fastq.gz(or_R1.fastq.gz/_R2.fastq.gzfor paired-end) - Statistics:
output_stats.txt- Simulation parameters and metrics - Ground Truth:
output_truth.txt- Per-gene molecule counts for validation
Standard gzipped FASTQ format with UMI in the header:
@SIM:0:ENSG00000139618:BRCA2:UMI:ACGTACGTAC
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG...
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...
Header format:
- Without GTF:
@SIM:<read_number>:GENE<gene_id>:UMI:<umi_sequence> - With GTF:
@SIM:<read_number>:<ensembl_id>:<gene_name>:UMI:<umi_sequence>
Paired-end mode: Creates two files <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz
Contains detailed simulation statistics:
- Total reads and unique UMIs
- UMI duplication statistics
- Amplification metrics
- Top expressed genes
- Duplication rate
Tab-delimited file with true expression counts for all simulated genes:
gene_id true_molecules unique_umis final_reads
0 12 12 6186
1 10 10 5136
2 6 6 3220
Columns:
gene_id: Gene identifier (numeric ID or Ensembl ID if using GTF)true_molecules: Number of original cDNA molecules before PCR (ground truth)unique_umis: Number of unique UMIs assigned to this genefinal_reads: Total reads after PCR amplification (with PCR duplicates; i.e., raw total reads)
Use cases:
- Benchmarking deduplication algorithms
- Evaluating expression quantification accuracy
- Testing differential expression detection
- Validating UMI-based counting methods
Uses negative binomial distribution to simulate realistic gene expression levels with overdispersion similar to real RNA-seq data.
Each molecule receives a random UMI sequence. This establishes the "ground truth" for deduplication testing.
Models PCR cycles where molecules are duplicated based on efficiency:
- Each molecule has a probability (efficiency) of being amplified per cycle
- Creates technical duplicates that share the same UMI
Introduces substitution errors to:
- Read sequences: Simulates Illumina sequencing errors
- UMI sequences: Tests robustness of UMI-based deduplication
Creates FASTQ files with UMIs embedded in read headers, compatible with standard UMI-tools pipelines.
# Generate simulation with real gene IDs
umi-simulator \
--gtf genes.gtf \
--fasta genome.fa \
--use-real-sequences \
--genes 5000 \
--molecules 200000 \
--output test_data# Extract UMIs from header
umi_tools extract \
-I test_data.fastq.gz \
--bc-pattern=NNNNNNNNNN \
--stdout test_data_extracted.fastq.gz
# Align reads with minimap2
minimap2 -ax map-ont \
genome.fa \
test_data_extracted.fastq.gz | \
samtools view -bS | \
samtools sort -o test_data_aligned.bam
samtools index test_data_aligned.bam
# Deduplicate
umi_tools dedup \
-I test_data_aligned.bam \
--output-stats=deduplicated \
-S test_data_deduped.bam
# Quantify gene expression (e.g., featureCounts, RSEM, etc.)
featureCounts \
-a genes.gtf \
-o test_data_counts.txt \
test_data_deduped.bamThe package includes a comparison utility to validate your UMI deduplication pipeline against simulation ground truth:
# Compare featureCounts or RSEM output with simulation truth
umi-compare simulation_truth.txt pipeline_counts.txt
# Generate detailed per-gene report and/or generate scatter plot with regression line (requires matplotlib)
umi-compare simulation_truth.txt pipeline_counts.txt --report comparison.txt --plot comparison.pngSupported input formats:
- featureCounts output (tab-delimited with gene ID in column 1 and counts in last column)
- RSEM
.genes.resultsfiles (expected_count column) - Auto-detects format based on file structure
Output metrics:
- Pearson correlation coefficient
- Mean absolute error (MAE)
- Mean absolute percent error (MAPE)
- Per-gene comparison (with
--reportoption) - Scatter plot with regression line (with
--plotoption)
Example workflow:
# 1. Run simulation
umi-simulator --genes 1000 --molecules 50000 -o sim/data
# 2. Process with your pipeline (e.g., STAR + UMI-tools)
# ... your pipeline commands ...
# 3. Validate results
umi-compare sim/data_truth.txt results/featureCounts.txt --report validation.txt-
--reportwrites a tab-delimited file with columns: Gene, Truth, Predicted, Diff, PctDiff for all genes in the truth set. -
--plotsaves a scatter plot (truth vs predicted) with a regression line and formula ($y = mx + b$ ,$R^2$ ) inside the plot. Example output:
- Simple substitution errors (no indels)
- Independent error probability per base
- Higher error rate for UMIs reflects lower quality in index reads
- Errors applied after PCR to simulate sequencing stage
- Negative binomial distribution (n=10, p=0.3) for realistic overdispersion
- Normalized to specified total molecule count
- Reflects the biological variability seen in real bulk RNA-seq
- SPerformance and Scalability
- Small simulation (10K genes, 100K molecules): ~10 seconds
- Medium simulation (20K genes, 1M molecules): ~60 seconds
- Large simulation (20K genes, 5M molecules): ~5 minutes
- Memory-efficient design with automatic cleanup
- ~1-2 GB RAM for typical simulations (1M molecules)
- Scales linearly with final read count (after PCR)
- Use
--target-readsto control final output size - Reduce
--pcr-cyclesor--pcr-efficiencyto limit amplification - Enable real sequences only when needed (slower but more realistic)
- Use smaller gene counts for initial testing
The BulkRNAUMISimulator class can be easily modified or extended to:
- Use different expression distributions (Poisson, log-normal)
- Add strand-specific information to reads
- Include paired-end read simulation
- Model GC-bias or length-dependent bias
- Add quality-dependent error models
- Implement custom PCR efficiency models
- Add fragment size selection
Example programmatic usage:
from simulate_bulk_rna_umi import BulkRNAUMISimulator
# Create simulator instance
simulator = BulkRNAUMISimulator(
n_genes=5000,
target_reads=1000000, # Back-calculate molecules
umi_length=12,
pcr_cycles=10,
pcr_efficiency=0.75,
random_seed=42
)
# Run simulation
simulator.run_simulation(output_prefix="custom_sim")
# Access internal data for analysis
print(f"Total molecules: {len(simulator.molecules)}")
print(f"Genes simulated: {len(simulator.gene_info)}")Run the test suite:
# Install dev dependencies
pip install -r requirements-dev.txt
# Run all tests
pytest tests/ -v
# With coverage report
pytest tests/ -v --cov=umi_simulator --cov-report=html
# Run integration tests
cd tests/integration
bash test_full_workflow_bacterial.sh # Bacterial genome
bash test_full_workflow_mouse_se.sh # Mouse chr19 single-end
bash test_full_workflow_mouse_pe.sh # Mouse chr19 paired-endMIT License - Feel free to modify and distribute
If you use this simulator in your research, please cite appropriately and mention the simulation parameters used.
