CUT&Tag Snakemake Pipeline

A comprehensive Snakemake workflow for processing CUT&Tag sequencing data, from raw FASTQ files to peak calling, quality control, and reproducibility analysis.

Overview

This pipeline processes CUT&Tag (Cleavage Under Targets and Tagmentation) sequencing data through the following stages:

Preprocessing: Adapter trimming and quality control (optional)
Alignment: Read alignment to reference genome (Bowtie2 or BWA-MEM2)
Post-alignment: Duplicate marking, indexing, fragment length analysis
Peak Calling: MACS2 peak calling with marker-specific parameters
Quality Control: FRiP, library complexity, reproducibility metrics
Reporting: MultiQC integration with custom statistics

Features

Flexible alignment: Supports Bowtie2 and BWA-MEM2 aligners
Marker-specific peak calling: Configurable q-values and broad/narrow peak parameters per histone mark
Comprehensive QC: FRiP, library complexity (Preseq), genomic coverage, reproducibility analysis
MultiQC integration: Custom statistics including peak counts, genomic coverage, and Preseq metrics
Reproducibility assessment: Jaccard similarity and correlation analysis for replicates
Blacklist filtering: Optional blacklist region filtering
Conda/Singularity support: Reproducible environments

Requirements

Software Dependencies

Snakemake ≥ 8.20.1
Python ≥ 3.7
Conda or Singularity (for environment management)
SLURM (optional, for cluster execution)

Reference Data

Reference genome FASTA file
Genome index (Bowtie2 or BWA-MEM2)
Chromosome sizes file
Gene annotation BED file (optional)
Adapter sequences FASTA (if trimming)
Blacklist regions BED (optional)

Installation

Clone the repository:

git clone <repository-url>
cd cuttag_sm

Install Snakemake (if not already installed):

conda install -c bioconda -c conda-forge snakemake

Configure your environment:
- Edit config/config.yml with your project-specific settings
- Prepare your sample sheet (see Configuration section)

Configuration

Main Configuration File (`config/config.yml`)

The main configuration file contains project information, pipeline settings, and reference paths:

project_info:
  name: "your_project_name"
  description: "Project description"
  genome: "hg38"  # or "mm10", etc.
  samples:
    table: "config/your_samplesheet.csv"
    pep_config: "config/pep/project_config.yaml"
  outputs:
    root: "{process_dir}/{project_name}/"

pipeline_steps:
  alignment:
    aligner: "bowtie2"  # or "bwa-mem2"
  peaks:
    use_igg: false
    igg_token: "IgG"
    n_intersects: 2
  peak_qval:
    CTCF: 0.0001
    H3K4me3: 0.001
    H3K27ac: 0.001
    # ... marker-specific q-values
  broad_marks:
    - H3K27me3
    - H3K9me3
  broad_params:
    slocal: 10000
    max-gap: 1000
    broad-cutoff: 0.01

reference:
  genome_index: "{database_dir}/hg38/indices_for_Bowtie2/hg38"
  genome_fasta: "{database_dir}/hg38/hg38.fa"
  chrom_sizes: "genes/hg38.chrom.sizes"
  adapter_fasta: "genes/adapter_seqs.fa"
  blacklist: ""  # Optional: path to blacklist BED file

Sample Sheet Format

The sample sheet should be a CSV file with the following columns:

sample: Sample identifier
run: Run/replicate number
R1: Path to R1 FASTQ file
R2: Path to R2 FASTQ file
mark: Histone mark or protein (e.g., "H3K27ac", "CTCF")
condition: Experimental condition
igg: IgG control sample (if applicable)
replicate_sample_name: Base sample name for grouping replicates

Example:

sample,run,R1,R2,mark,condition,igg,gopeaks,replicate_sample_name
Sample1,1,/path/to/Sample1_R1.fastq.gz,/path/to/Sample1_R2.fastq.gz,H3K27ac,Condition1,Sample1_IgG,-,Sample1
Sample1,2,/path/to/Sample1_rep2_R1.fastq.gz,/path/to/Sample1_rep2_R2.fastq.gz,H3K27ac,Condition1,Sample1_IgG,-,Sample1

Usage

Basic Execution

Run the pipeline with Snakemake:

snakemake --configfile config/config.yml -c <number_of_cores>

Cluster Execution (SLURM)

For cluster environments, use a profile:

snakemake --configfile config/config.yml --profile slurm

Dry Run

Test the workflow without executing:

snakemake --configfile config/config.yml -n

Specific Targets

Run specific outputs:

snakemake --configfile config/config.yml multiqc_report.html

Pipeline Steps

1. Preprocessing (`rules/preprocess.smk`)

Adapter trimming: Optional trimming using fastp
Quality control: FastQC-style reports

2. Alignment (`rules/align.smk`)

Read alignment: Bowtie2 or BWA-MEM2
Sorting: Coordinate-sorted BAM files
Duplicate marking: Sambamba markdup
Indexing: BAM index generation

3. Post-alignment (`rules/postalign.smk`)

Fragment length analysis: Calculate fragment size distributions
BigWig track generation: Normalized coverage tracks

4. Peak Calling (`rules/peaks.smk`)

MACS2 peak calling:
- Broad peaks: For marks like H3K27me3, H3K9me3
- Narrow peaks: For other marks (CTCF, H3K4me3, etc.)
Marker-specific parameters:
- Q-values configured per mark
- Broad peak parameters (slocal, max-gap, broad-cutoff)
Blacklist filtering: Optional removal of blacklisted regions
Peak processing: Conversion to BED format

5. Quality Control (`rules/qc.smk`)

FRiP (Fraction of Reads in Peaks): DeepTools plotEnrichment
Library complexity: Preseq lc_extrap
Fingerprinting: DeepTools plotFingerprint
MultiQC report: Aggregated QC metrics

6. Reproducibility (`rules/reproducibility.smk`)

BAM correlation: DeepTools multiBamSummary and plotCorrelation
Peak similarity: bedtools jaccard for replicate comparison

7. Statistics

Peak counting: Per-sample peak counts
Genomic coverage: Percentage of genome covered by peaks

Output Structure

{output_base_dir}/
├── Important_processed/
│   ├── Bam/
│   │   ├── {sample}.sorted.markd.bam
│   │   └── {sample}.sorted.markd.bam.bai
│   ├── Track/
│   │   └── tracks/{sample}.bw
│   └── Peaks/
│       └── callpeaks/
│           ├── macs2_broad_{sample}_peaks.broadPeak
│           ├── macs2_narrow_{sample}_peaks.narrowPeak
│           └── {sample}_peaks.bed
├── Report/
│   ├── multiqc/
│   │   └── multiqc_report.html
│   ├── plotEnrichment/
│   │   ├── frip_{sample}.tsv
│   │   └── frip.html
│   ├── preseq/
│   │   └── lcextrap_{sample}.txt
│   ├── peak_stat/
│   │   ├── peakcount/
│   │   │   └── {sample}_peakcount.txt
│   │   └── coverage_report.tsv
│   ├── bamReproducibility/
│   │   ├── {sample_rep}_global_rep_cor.txt
│   │   └── {sample_rep}_global_rep_heatmap.pdf
│   └── bedtools_jaccard/
│       └── {sample_rep}_jaccard.txt
└── logs/
    └── {rule}_{sample}.log

Quality Control Metrics

The pipeline generates comprehensive QC metrics:

FRiP (Fraction of Reads in Peaks): Percentage of reads overlapping called peaks
Library Complexity: Expected distinct reads at various sequencing depths (Preseq)
Genomic Coverage: Percentage of genome covered by peaks
Peak Counts: Number of peaks per sample
Reproducibility:
- Spearman correlation between replicates (BAM level)
- Jaccard similarity between peak sets

All metrics are integrated into the MultiQC report for easy visualization.

Customization

Marker-Specific Peak Calling

Configure q-values and broad peak parameters in config/config.yml:

peak_qval:
  CTCF: 0.0001
  H3K4me3: 0.001
  H3K27ac: 0.001

broad_marks:
  - H3K27me3
  - H3K9me3

broad_params:
  slocal: 10000
  max-gap: 1000
  broad-cutoff: 0.01

Aligner Selection

Switch between aligners in the config:

pipeline_steps:
  alignment:
    aligner: "bowtie2"  # or "bwa-mem2"

Adapter Trimming

Enable/disable adapter trimming:

TRIM_ADAPTERS: true  # or false

Troubleshooting

Common Issues

Missing input files: Ensure sample sheet paths are correct and files exist
Memory errors: Adjust mem_mb in rule resources or use cluster execution
Wildcard errors: Check that sample names are consistent across config and sample sheet
Conda environment issues: Rebuild environments with --use-conda

Log Files

Check rule-specific logs in {output_base_dir}/logs/ for detailed error messages.

Citation

If you use this pipeline, please cite:

License

MIT License - see LICENSE file for details.

Support

For issues, questions, or contributions, please open an issue on the repository.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
blacklists		blacklists
config		config
genes		genes
schemas		schemas
workflow		workflow
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

CUT&Tag Snakemake Pipeline

Overview

Features

Requirements

Software Dependencies

Reference Data

Installation

Configuration

Main Configuration File (config/config.yml)

Sample Sheet Format

Usage

Basic Execution

Cluster Execution (SLURM)

Dry Run

Specific Targets

Pipeline Steps

1. Preprocessing (rules/preprocess.smk)

2. Alignment (rules/align.smk)

3. Post-alignment (rules/postalign.smk)

4. Peak Calling (rules/peaks.smk)

5. Quality Control (rules/qc.smk)

6. Reproducibility (rules/reproducibility.smk)

7. Statistics

Output Structure

Quality Control Metrics

Customization

Marker-Specific Peak Calling

Aligner Selection

Adapter Trimming

Troubleshooting

Common Issues

Log Files

Citation

License

Support

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Main Configuration File (`config/config.yml`)

1. Preprocessing (`rules/preprocess.smk`)

2. Alignment (`rules/align.smk`)

3. Post-alignment (`rules/postalign.smk`)

4. Peak Calling (`rules/peaks.smk`)

5. Quality Control (`rules/qc.smk`)

6. Reproducibility (`rules/reproducibility.smk`)

Packages