SNP-Level Inspection of Parental Transcripts
Nextflow DSL2 pipeline for allele-specific RNA-seq analysis using STAR, SNPsplit, and featureCounts.
Ultra-minimalist: only two files (main.nf + nextflow.config). Designed for solo bioinformaticians.
%%{init: {'theme': 'base', 'themeVariables': {'background': '#ffffff', 'primaryTextColor': '#231f20', 'lineColor': '#999999', 'textColor': '#231f20', 'mainBkg': '#ffffff', 'nodeBorder': '#999999'}}}%%
flowchart TD
subgraph INPUT ["Input (one of)"]
SRA["SRR / ERR / DRR"] --> SRA_DL["SRA_DOWNLOAD"]
GEO["GSE / GSM"] --> RESOLVE["RESOLVE_GEO"] --> SRA_DL
FQ_DIR["FASTQ directory"]
CSV["CSV samplesheet"]
end
SRA_DL --> FASTQS(("FASTQs"))
FQ_DIR --> FASTQS
CSV --> FASTQS
DOWNLOAD["DOWNLOAD_REFERENCES"] --> GPREP["SNPSPLIT_GENOME_PREP"]
GPREP --> IDX1["STAR_INDEX (N-masked)"]
DOWNLOAD --> IDX2["STAR_INDEX (reference)"]
IDX1 --> A1["STAR_ALIGN (N-masked)"]
IDX2 --> A2["STAR_ALIGN (reference)"]
FASTQS --> A1
FASTQS --> A2
A1 --> S1["SORT_DEDUP"] --> SNP["SNPSPLIT"]
A2 --> S2["SORT_DEDUP "] --> FC3["FEATURECOUNTS (reference)"]
SNP -->|"genome1"| FC1["FEATURECOUNTS (genome1)"]
SNP -->|"genome2"| FC2["FEATURECOUNTS (genome2)"]
FC1 --> O1["genome1 counts"]
FC2 --> O2["genome2 counts"]
FC3 --> O3["reference counts"]
FC1 & FC2 & FC3 --> MQC["MULTIQC"] --> O4["MultiQC report"]
classDef input fill:#0570b0,stroke:#0570b0,color:#fff
classDef process fill:#238b45,stroke:#238b45,color:#fff
classDef key fill:#cb181d,stroke:#cb181d,color:#fff,stroke-width:3px
classDef output fill:#6a51a3,stroke:#6a51a3,color:#fff
classDef data fill:#e6550d,stroke:#e6550d,color:#fff
classDef mqc fill:#41ab5d,stroke:#41ab5d,color:#fff
class SRA,GEO,FQ_DIR,CSV input
class SRA_DL,RESOLVE,DOWNLOAD,GPREP,IDX1,IDX2,A1,S1,A2,S2,FC1,FC2,FC3 process
class SNP key
class O1,O2,O3,O4 output
class FASTQS data
class MQC mqc
# From a FASTQ directory (auto-detects PE/SE)
nextflow run IPNP-BIPN/SPLIT --fastq_dir /path/to/fastqs --outdir results -resume
# From SRA accessions
nextflow run IPNP-BIPN/SPLIT --sra_ids "SRR1234567,SRR1234568" --outdir results -resume
# From a GEO dataset (auto-resolves GSE → SRR)
nextflow run IPNP-BIPN/SPLIT --sra_ids GSE80810 --outdir results -resume
# From a samplesheet CSV
nextflow run IPNP-BIPN/SPLIT --input samplesheet.csv --outdir results -resume
# Custom strains
nextflow run IPNP-BIPN/SPLIT \
--fastq_dir /path/to/fastqs \
--strain1 CAST_EiJ \
--strain2 C57BL_6NJ \
--dedup true \
--outdir results \
-resume

sample,fastq_1,fastq_2
sampleA,/path/to/sampleA_R1_001.fastq.gz,/path/to/sampleA_R2_001.fastq.gz
sampleB,/path/to/sampleB.fastq.gz,

Leave `fastq_2` empty for single-end reads.
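
A samplesheet in this layout can be generated from a directory of FASTQs. The sketch below is a hypothetical helper, not part of the pipeline. It assumes paired files follow the `_R1_001` / `_R2_001` naming shown in the example rows and treats any other `*.fastq.gz` file as single-end; `build_rows` and `write_samplesheet` are illustrative names.

```python
import csv
import re
from pathlib import Path

# Assumption: paired-end files are named <sample>_R1_001.fastq.gz and
# <sample>_R2_001.fastq.gz, as in the example samplesheet rows.
R1_PATTERN = re.compile(r"^(?P<sample>.+)_R1_001\.fastq\.gz$")
R2_PATTERN = re.compile(r"^.+_R2_001\.fastq\.gz$")


def build_rows(fastq_dir):
    """Return (sample, fastq_1, fastq_2) tuples for a FASTQ directory."""
    rows = []
    for fq in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        if R2_PATTERN.match(fq.name):
            continue  # R2 files are attached to their R1 mate below
        match = R1_PATTERN.match(fq.name)
        if match:
            sample = match["sample"]
            mate = fq.with_name(f"{sample}_R2_001.fastq.gz")
            # Leave fastq_2 empty when no R2 mate exists (single-end)
            rows.append((sample, str(fq), str(mate) if mate.exists() else ""))
        else:
            # Single-end file: sample name is the file name minus the extension
            rows.append((fq.name[: -len(".fastq.gz")], str(fq), ""))
    return rows


def write_samplesheet(fastq_dir, out_csv):
    """Write the rows to a CSV with the sample,fastq_1,fastq_2 header."""
    rows = build_rows(fastq_dir)
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq_1", "fastq_2"])
        writer.writerows(rows)
    return rows
```

Calling `write_samplesheet("/path/to/fastqs", "samplesheet.csv")` would produce a file that can be passed to `--input`. An R2 file without a matching R1 is silently skipped in this sketch.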
| Parameter | Default | Description |
|---|---|---|
| `--input` | `null` | Samplesheet CSV (`sample,fastq_1,fastq_2`) |
| `--fastq_dir` | `null` | Directory of FASTQs (`*.fastq.gz`) |
| `--sra_ids` | `null` | SRA/GEO accessions (comma-separated or file) |
| `--outdir` | `results` | Output directory |
| `--strain1` | `CAST_EiJ` | First strain in VCF → genome1 |
| `--strain2` | `C57BL_6NJ` | Second strain in VCF → genome2 |
| `--dedup` | `true` | Remove PCR duplicates (`samtools markdup -r`) |
| `--strandedness` | `0` | featureCounts strandedness (0/1/2) |
| `--force_se` | `false` | Force single-end counting in featureCounts |
| `--genome_url` | Ensembl GRCm39 | Genome FASTA URL |
| `--gtf_url` | Ensembl 2023_04 | GTF annotation URL |
| `--vcf_url` | MGP REL2021 v8 | VCF SNPs URL |
| `--star_limit_genome_ram` | `60000000000` | STAR `--limitGenomeGenerateRAM` |
| `--max_cpus` | auto | Maximum number of CPUs |
| `--max_memory` | `64 GB` | Maximum memory (scales all process labels) |

| Parameter | Description |
|---|---|
| `--genome_fa` | Pre-downloaded genome FASTA |
| `--gtf` | Pre-downloaded GTF |
| `--vcf` | Pre-downloaded VCF |
| `--star_index_nmask` | Pre-built STAR index (N-masked) |
| `--star_index_ref` | Pre-built STAR index (reference) |
| `--snp_file` | Pre-existing SNPsplit SNP annotation file |

results/
├── 00_sra_fastq/ # Downloaded FASTQs (if SRA input)
├── 04_aln_nmask/ # N-mask aligned BAMs (sorted, deduped)
├── 05_aln_ref/ # Reference aligned BAMs (sorted, deduped)
├── 06_snpsplit/ # SNPsplit output (genome1, genome2, unassigned)
├── 07_counts/
│ ├── counts_genome1_CAST_EiJ.txt # Allele-specific counts (strain1)
│ ├── counts_genome2_C57BL_6NJ.txt # Allele-specific counts (strain2)
│ └── counts_reference.txt # Standard reference counts
├── 08_multiqc/ # Aggregated QC report
├── reference/ # Downloaded + cached references
│ ├── genome.fa # GRCm39 soft-masked
│ ├── genes.gtf # Ensembl annotation
│ ├── snps.vcf.gz # MGP REL2021 SNPs
│ ├── snpsplit_prep/ # N-masked genome + SNP file
│ ├── star_nmask/ # STAR index (N-masked)
│ └── star_ref/ # STAR index (reference)
└── pipeline_info/ # Nextflow timeline, trace, DAG, report
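
The two allele-specific tables in `07_counts/` can be combined into a per-gene allelic ratio. The following is a minimal sketch, not part of the pipeline. It assumes the standard featureCounts layout (a leading `#` comment line, a header row, six annotation columns, then one column per BAM) and that sample columns appear in the same order in both tables. The function names, the toy tables, and the `min_total=10` coverage threshold are illustrative choices only.

```python
import csv
import io

N_ANNOTATION_COLS = 6  # Geneid, Chr, Start, End, Strand, Length


def read_featurecounts(handle):
    """Parse a featureCounts table into {gene: [count per sample column]}."""
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t")
    next(reader)  # skip the header row
    return {
        row[0]: [float(c) for c in row[N_ANNOTATION_COLS:]]
        for row in reader
        if row
    }


def allelic_ratios(genome1, genome2, min_total=10):
    """genome1 / (genome1 + genome2) per gene and sample; None if coverage is low."""
    ratios = {}
    for gene, g1_counts in genome1.items():
        g2_counts = genome2.get(gene)
        if g2_counts is None:
            continue  # gene missing from the second table
        ratios[gene] = [
            g1 / (g1 + g2) if (g1 + g2) > 0 and (g1 + g2) >= min_total else None
            for g1, g2 in zip(g1_counts, g2_counts)
        ]
    return ratios


# Toy tables standing in for the genome1 and genome2 count files (made-up numbers)
HEADER = (
    "# Program:featureCounts (toy example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA\tsampleB\n"
)
GENOME1 = HEADER + "Gapdh\t6\t1\t1000\t-\t1000\t30\t45\nActb\t5\t1\t1000\t-\t1000\t2\t50\n"
GENOME2 = HEADER + "Gapdh\t6\t1\t1000\t-\t1000\t10\t15\nActb\t5\t1\t1000\t-\t1000\t3\t50\n"

ratios = allelic_ratios(
    read_featurecounts(io.StringIO(GENOME1)),
    read_featurecounts(io.StringIO(GENOME2)),
)
print(ratios)
```

On real output, replace the `io.StringIO` objects with file handles opened on `counts_genome1_CAST_EiJ.txt` and `counts_genome2_C57BL_6NJ.txt`. A ratio near 0.5 suggests balanced expression; values near 0 or 1 suggest allelic bias toward strain2 or strain1 respectively.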
Core (always required):
STAR, samtools, SNPsplit, featureCounts (subread), multiqc, wget

Optional (for SRA download):
sra-tools, bgzip (htslib/tabix)

Nextflow ≥ 23.04

- References are automatically downloaded from Ensembl (GRCm39 genome + GTF) and MGP (SNP VCF). They are cached via `storeDir`, so they are downloaded only once.
- SNPsplit genome preparation creates an N-masked genome where strain-discriminating SNP positions are replaced by N. This prevents alignment bias toward the reference allele.
- Two parallel alignment tracks:
  - N-masked track: alignments used for allele-specific analysis (SNPsplit)
  - Reference track: standard alignments for total gene expression
- SNPsplit assigns each read from the N-masked track to genome1 (strain1), genome2 (strain2), or unassigned, based on informative SNP positions.
- featureCounts produces three count tables using the `gene_name` attribute for human-readable gene symbols (e.g., Gapdh instead of ENSMUSG00000057666).

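
The N-masking and read-assignment steps above can be illustrated with a toy model. This is a conceptual sketch only, not SNPsplit's implementation: it assumes ungapped reads, 0-based positions, and made-up SNPs, and it ignores base qualities and CIGAR handling. The `conflicting` label for reads carrying alleles from both strains is a simplification chosen for this example.

```python
def n_mask(sequence, snp_positions):
    """Replace the base at each 0-based SNP position with 'N'."""
    masked = list(sequence)
    for pos in snp_positions:
        masked[pos] = "N"
    return "".join(masked)


def assign_read(read_seq, read_start, snps):
    """Assign an ungapped read to an allele.

    snps maps a 0-based genome position to (genome1 base, genome2 base).
    """
    hits1 = hits2 = 0
    for pos, (base1, base2) in snps.items():
        offset = pos - read_start
        if 0 <= offset < len(read_seq):  # the read overlaps this SNP
            if read_seq[offset] == base1:
                hits1 += 1
            elif read_seq[offset] == base2:
                hits2 += 1
    if hits1 and hits2:
        return "conflicting"
    if hits1:
        return "genome1"
    if hits2:
        return "genome2"
    return "unassigned"  # no informative SNP covered


reference = "ACGTACGTAC"
snps = {3: ("T", "C"), 7: ("T", "G")}  # made-up positions and alleles

masked = n_mask(reference, snps)
print(masked)                          # both SNP positions become N
print(assign_read("GTAC", 2, snps))    # carries the genome1 allele at position 3
print(assign_read("GCAC", 2, snps))    # carries the genome2 allele at position 3
print(assign_read("ACG", 4, snps))     # overlaps no SNP
```

Because the aligner sees `N` at every discriminating position, reads from either strain mismatch the masked genome equally there, which is why alignment happens against the masked sequence and allele assignment is done afterwards.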
The pipeline natively leverages Nextflow's cache (-resume). Already completed steps are automatically skipped. References (genome, GTF, VCF, STAR indexes, N-masked genome) are persisted via storeDir and reused across runs.
# Re-run after a crash: picks up exactly where it left off
nextflow run main.nf --fastq_dir fastqs --outdir results -resume

License: MIT