A comprehensive Snakemake workflow for processing CUT&Tag sequencing data, from raw FASTQ files to peak calling, quality control, and reproducibility analysis.
This pipeline processes CUT&Tag (Cleavage Under Targets and Tagmentation) sequencing data through the following stages:
- Preprocessing: Adapter trimming and quality control (optional)
- Alignment: Read alignment to reference genome (Bowtie2 or BWA-MEM2)
- Post-alignment: Duplicate marking, indexing, fragment length analysis
- Peak Calling: MACS2 peak calling with marker-specific parameters
- Quality Control: FRiP, library complexity, reproducibility metrics
- Reporting: MultiQC integration with custom statistics
- Flexible alignment: Supports Bowtie2 and BWA-MEM2 aligners
- Marker-specific peak calling: Configurable q-values and broad/narrow peak parameters per histone mark
- Comprehensive QC: FRiP, library complexity (Preseq), genomic coverage, reproducibility analysis
- MultiQC integration: Custom statistics including peak counts, genomic coverage, and Preseq metrics
- Reproducibility assessment: Jaccard similarity and correlation analysis for replicates
- Blacklist filtering: Optional blacklist region filtering
- Conda/Singularity support: Reproducible environments
- Snakemake ≥ 8.20.1
- Python ≥ 3.7
- Conda or Singularity (for environment management)
- SLURM (optional, for cluster execution)
- Reference genome FASTA file
- Genome index (Bowtie2 or BWA-MEM2)
- Chromosome sizes file
- Gene annotation BED file (optional)
- Adapter sequences FASTA (if trimming)
- Blacklist regions BED (optional)
- Clone the repository:
git clone <repository-url>
cd cuttag_sm- Install Snakemake (if not already installed):
conda install -c bioconda -c conda-forge snakemake- Configure your environment:
- Edit
config/config.ymlwith your project-specific settings - Prepare your sample sheet (see Configuration section)
- Edit
The main configuration file contains project information, pipeline settings, and reference paths:
project_info:
name: "your_project_name"
description: "Project description"
genome: "hg38" # or "mm10", etc.
samples:
table: "config/your_samplesheet.csv"
pep_config: "config/pep/project_config.yaml"
outputs:
root: "{process_dir}/{project_name}/"
pipeline_steps:
alignment:
aligner: "bowtie2" # or "bwa-mem2"
peaks:
use_igg: false
igg_token: "IgG"
n_intersects: 2
peak_qval:
CTCF: 0.0001
H3K4me3: 0.001
H3K27ac: 0.001
# ... marker-specific q-values
broad_marks:
- H3K27me3
- H3K9me3
broad_params:
slocal: 10000
max-gap: 1000
broad-cutoff: 0.01
reference:
genome_index: "{database_dir}/hg38/indices_for_Bowtie2/hg38"
genome_fasta: "{database_dir}/hg38/hg38.fa"
chrom_sizes: "genes/hg38.chrom.sizes"
adapter_fasta: "genes/adapter_seqs.fa"
blacklist: "" # Optional: path to blacklist BED fileThe sample sheet should be a CSV file with the following columns:
sample: Sample identifierrun: Run/replicate numberR1: Path to R1 FASTQ fileR2: Path to R2 FASTQ filemark: Histone mark or protein (e.g., "H3K27ac", "CTCF")condition: Experimental conditionigg: IgG control sample (if applicable)replicate_sample_name: Base sample name for grouping replicates
Example:
sample,run,R1,R2,mark,condition,igg,gopeaks,replicate_sample_name
Sample1,1,/path/to/Sample1_R1.fastq.gz,/path/to/Sample1_R2.fastq.gz,H3K27ac,Condition1,Sample1_IgG,-,Sample1
Sample1,2,/path/to/Sample1_rep2_R1.fastq.gz,/path/to/Sample1_rep2_R2.fastq.gz,H3K27ac,Condition1,Sample1_IgG,-,Sample1Run the pipeline with Snakemake:
snakemake --configfile config/config.yml -c <number_of_cores>For cluster environments, use a profile:
snakemake --configfile config/config.yml --profile slurmTest the workflow without executing:
snakemake --configfile config/config.yml -nRun specific outputs:
snakemake --configfile config/config.yml multiqc_report.html- Adapter trimming: Optional trimming using fastp
- Quality control: FastQC-style reports
- Read alignment: Bowtie2 or BWA-MEM2
- Sorting: Coordinate-sorted BAM files
- Duplicate marking: Sambamba markdup
- Indexing: BAM index generation
- Fragment length analysis: Calculate fragment size distributions
- BigWig track generation: Normalized coverage tracks
- MACS2 peak calling:
- Broad peaks: For marks like H3K27me3, H3K9me3
- Narrow peaks: For other marks (CTCF, H3K4me3, etc.)
- Marker-specific parameters:
- Q-values configured per mark
- Broad peak parameters (slocal, max-gap, broad-cutoff)
- Blacklist filtering: Optional removal of blacklisted regions
- Peak processing: Conversion to BED format
- FRiP (Fraction of Reads in Peaks): DeepTools plotEnrichment
- Library complexity: Preseq lc_extrap
- Fingerprinting: DeepTools plotFingerprint
- MultiQC report: Aggregated QC metrics
- BAM correlation: DeepTools multiBamSummary and plotCorrelation
- Peak similarity: bedtools jaccard for replicate comparison
- Peak counting: Per-sample peak counts
- Genomic coverage: Percentage of genome covered by peaks
{output_base_dir}/
├── Important_processed/
│ ├── Bam/
│ │ ├── {sample}.sorted.markd.bam
│ │ └── {sample}.sorted.markd.bam.bai
│ ├── Track/
│ │ └── tracks/{sample}.bw
│ └── Peaks/
│ └── callpeaks/
│ ├── macs2_broad_{sample}_peaks.broadPeak
│ ├── macs2_narrow_{sample}_peaks.narrowPeak
│ └── {sample}_peaks.bed
├── Report/
│ ├── multiqc/
│ │ └── multiqc_report.html
│ ├── plotEnrichment/
│ │ ├── frip_{sample}.tsv
│ │ └── frip.html
│ ├── preseq/
│ │ └── lcextrap_{sample}.txt
│ ├── peak_stat/
│ │ ├── peakcount/
│ │ │ └── {sample}_peakcount.txt
│ │ └── coverage_report.tsv
│ ├── bamReproducibility/
│ │ ├── {sample_rep}_global_rep_cor.txt
│ │ └── {sample_rep}_global_rep_heatmap.pdf
│ └── bedtools_jaccard/
│ └── {sample_rep}_jaccard.txt
└── logs/
└── {rule}_{sample}.log
The pipeline generates comprehensive QC metrics:
- FRiP (Fraction of Reads in Peaks): Percentage of reads overlapping called peaks
- Library Complexity: Expected distinct reads at various sequencing depths (Preseq)
- Genomic Coverage: Percentage of genome covered by peaks
- Peak Counts: Number of peaks per sample
- Reproducibility:
- Spearman correlation between replicates (BAM level)
- Jaccard similarity between peak sets
All metrics are integrated into the MultiQC report for easy visualization.
Configure q-values and broad peak parameters in config/config.yml:
peak_qval:
CTCF: 0.0001
H3K4me3: 0.001
H3K27ac: 0.001
broad_marks:
- H3K27me3
- H3K9me3
broad_params:
slocal: 10000
max-gap: 1000
broad-cutoff: 0.01Switch between aligners in the config:
pipeline_steps:
alignment:
aligner: "bowtie2" # or "bwa-mem2"Enable/disable adapter trimming:
TRIM_ADAPTERS: true # or false- Missing input files: Ensure sample sheet paths are correct and files exist
- Memory errors: Adjust
mem_mbin rule resources or use cluster execution - Wildcard errors: Check that sample names are consistent across config and sample sheet
- Conda environment issues: Rebuild environments with
--use-conda
Check rule-specific logs in {output_base_dir}/logs/ for detailed error messages.
If you use this pipeline, please cite:
- CUT&Tag method: Kaya-Okur et al., 2019
- Snakemake: Mölder et al., 2021
- MACS2: Zhang et al., 2008
MIT License - see LICENSE file for details.
For issues, questions, or contributions, please open an issue on the repository.