Skip to content

Commit 919cfde

Browse files
committed
make duplicate handling configurable
#47
1 parent 26e0e72 commit 919cfde

File tree

6 files changed

+65
-36
lines changed

6 files changed

+65
-36
lines changed

CITATION.cff

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,9 @@ authors:
2424
family-names: Folkman
2525
orcid: 'https://orcid.org/0000-0002-5811-8875'
2626
affiliation: CeMM Research Center for Molecular Medicine
27+
- given-names: Fangwen
28+
family-names: Zhao
29+
affiliation: CeMM Research Center for Molecular Medicine
2730
- given-names: Lina
2831
family-names: Dobnikar
2932
affiliation: CeMM Research Center for Molecular Medicine

README.md

Lines changed: 31 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,7 @@ A [Snakemake 8](https://snakemake.readthedocs.io/en/stable/) workflow implementa
2424
- [Bekir Ergüner](https://github.com/berguner)
2525
- [Daniele Barreca](https://github.com/DanieleBarreca)
2626
- [Lukas Folkman](https://github.com/lukas-folkman)
27+
- [Fangwen Zhao](https://github.com/fwzhao)
2728
- [Lina Dobnikar](https://github.com/ld401)
2829
- [Christoph Bock](https://github.com/chrbock)
2930

@@ -51,7 +52,7 @@ This project wouldn't be possible without the following software and their depen
5152
This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (`workflow/envs/*.yaml file`) or post-execution in the result directory (`atacseq_pipeline/envs/*.yaml`). Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g., [X].
5253

5354
**Processing.**
54-
Sequencing adapters were removed using the software fastp (ver) [ref]. Bowtie2 (ver) [ref] was used for the alignment of the short reads (representing locations of transposition events) to the [GRCh38 (hg38)/GRCm38 (mm10)] assembly of the [human/mouse] genome using the “--very-sensitive” parameter. PCR duplicates were marked using samblaster (ver) [ref]. Aligned BAM files were then sorted, filtered using ENCODE blacklisted regions [ref], and indexed using samtools (ver) [ref]. To detect the open chromatin regions, peak calling was performed using MACS2 (ver) [ref] using the “--nomodel”, “--keep-dup auto” and “--extsize 147” options on each sample. HOMER (ver) [ref] function findMotifs was used for motif enrichment analysis of the detected open chromatin regions. Quality control metrics were aggregated and reported using MultiQC (ver) [ref], [X] sample(s) needed to be removed.
55+
Sequencing adapters were removed using the software fastp (ver) [ref]. Bowtie2 (ver) [ref] was used for the alignment of the short reads (representing locations of transposition events) to the [GRCh38 (hg38)/GRCm38 (mm10)] assembly of the [human/mouse] genome using the “--very-sensitive” parameter. PCR duplicates were marked using samblaster (ver) [ref]. Aligned BAM files were then sorted, filtered using ENCODE blacklisted regions [ref], samtools view flags [SAM_flag], and indexed using samtools (ver) [ref]. To detect the open chromatin regions, peak calling was performed using MACS2 (ver) [ref] using the “--nomodel”, “--keep-dup [macs2_keep_dup]” and “--extsize 147” options on each sample. HOMER (ver) [ref] function findMotifs was used for motif enrichment analysis of the detected open chromatin regions. Quality control metrics were aggregated and reported using MultiQC (ver) [ref], [X] sample(s) needed to be removed.
5556

5657
**Quantification.**
5758
A consensus region set, comprising of [X] genomic regions, was generated, by merging the identified peak summits, extended by [slop_extension]bp on both sides using the slop function from bedtools (ver) [ref] and pybedtools (ver) [ref], across all samples while again discarding peaks overlapping blacklisted features as defined by the ENCODE project [ref]. The consensus region set was used to quantify the chromatin accessibility in each sample by summing the number of reads overlapping each consensus region. The consensus region set, and sample-wise quantification of accessibility was performed using bedtools (ver) [ref] and pybedtools (ver) [ref]. Furthermore, the consensus region set was used to quantify the peak support per sample and each region was mapped to its closest TSS according to the HOMER annotation within proximal TSS up and downstream distances [proximal_size_up/down] yielding a gene/TSS-based quantification. Complementary, all promoter regions, defined by the same parameters, were quantified for each sample and aggregated to yield a gene/promoter-based quantification. Finally, all sample-wise enriched known motifs according to HOMER were aggregated.
@@ -62,30 +63,43 @@ Consensus regions were annotated using annotatePeaks function from HOMER (ver) [
6263
The processing and quantification described here was performed using a publicly available Snakemake [ver] (ref) workflow [[10.5281/zenodo.6323634](https://doi.org/10.5281/zenodo.6323634)].
6364

6465
# 🚀 Features
65-
- Processing (results/)
66+
- Processing (`results/`)
6667
- Alignment of both single-end and paired-end reads in raw/unaligned BAM format with Bowtie2.
67-
- Peak calling with MAC2.
68-
- Note: even though the peak support of a region in a certain sample is 0, does not mean that there are no reads counted in the count matrix, it just states that there was no peak called.
69-
- Note: the peak support can be >1 for certain samples in case of a consensus region spanning more than one peak within the respective sample.
68+
- Filtering using `samtools view` can be configured using [SAM Flags](https://broadinstitute.github.io/picard/explain-flags.html) (`SAM_flag`).
69+
- Peak calling with `MACS2`.
70+
- Duplicate handling can be configured using `macs2_keep_dup`.
71+
- Even though the peak support of a region in a certain sample is 0, does not mean that there are no reads counted in the count matrix, it just states that there was no peak called.
72+
- The peak support can be >1 for certain samples in case of a consensus region spanning more than one peak within the respective sample.
7073
- Peak annotation and motif analysis HOMER.
71-
- Quantification of TSS coverage (results/).
72-
- Reporting (report/)
74+
- Quantification of TSS coverage.
75+
- Reporting (`report/`)
7376
- MultiQC report generation using MultiQC, extended with an in-house developed plugin [atacseq_report](./workflow/scripts/multiqc_atacseq).
74-
- Quantification (counts/)
75-
- Consensus region set generation across all called peaks (consensus_regions.bed).
76-
- Read count quantification of the consensus regions across samples, yielding a count matrix with dimensions consensus regions X samples (consensus_regions.bed, consensus_counts.csv).
77-
- Peak support quantification of the consensus regions across samples, yielding a count matrix with dimensions consensus regions X samples (support_counts.csv).
78-
- Consensus regions mapped to closest gene TSS according to HOMER (Distance to TSS) within proximal TSS up and downstream distances (TSS_regions.bed, TSS_counts.csv, TSS_annotation.csv).
79-
- Read count quantification of promoter regions based on provided proximal TSS up and downstream distances (promoter_regions.bed, promoter_counts.csv, promoter_annotation.csv).
80-
- [Pseudoautosomal regions in human](https://www.ensembl.org/info/genome/genebuild/human_PARS.html) chromosome Y are skipped.
81-
- Aggregation of all sample-wise HOMER known motif enrichment results into one CSV in long-format (HOMER_knownMotifs.csv).
82-
- Annotation (counts/)
77+
- Quantification (`counts/`)
78+
- Consensus region set generation across all called peaks (`consensus_regions.bed`).
79+
- Read count quantification of the consensus regions across samples, yielding a count matrix with dimensions consensus regions X samples (`consensus_counts.csv`).
80+
- Peak support quantification of the consensus regions across samples, yielding a count matrix with dimensions consensus regions X samples (`support_counts.csv`).
81+
- Consensus regions mapped to closest gene TSS according to HOMER (Distance to TSS) within proximal TSS up and downstream distances (`TSS_regions.bed`, `TSS_counts.csv`, `TSS_annotation.csv`).
82+
- Read count quantification of promoter regions based on provided proximal TSS up and downstream distances (`promoter_regions.bed`, `promoter_counts.csv`, `promoter_annotation.csv`).
83+
- [Pseudoautosomal regions in human](https://www.ensembl.org/info/genome/genebuild/human_PARS.html) chromosome `Y` are skipped.
84+
- Aggregation of all sample-wise HOMER known motif enrichment results into one CSV in long-format (`HOMER_knownMotifs.csv`).
85+
- Annotation (`counts/`)
8386
- Sample annotation file based on MultiQC general stats and provided annotations for downstream analysis (`sample_annotation.csv`).
8487
- Consensus region set annotation using (`consensus_annotation.csv`)
8588
- UROPA with regulatory build and gencode as references, configurable here: `workflow/resources/UROPA/*.txt`.
8689
- HOMER with `annotatePeaks.pl`
8790
- bedtools for nucleotide counts/content (e.g., % of GC)
8891

92+
> [!IMPORTANT]
93+
> **Duplciate reads** can be filtered during the alignment step by `samtools` and/or ignored during peak calling by `MACS2`.
94+
>
95+
> The inclusion of duplicates should be intentional, and may lead to a large number of consensus regions.
96+
>
97+
> The removal of duplicates should be intentional, might remove real biological signal.
98+
>
99+
> The decision depends on your downstream analysis steps e.g., rigorous filtering (e.g., using `edgeR::filterByExpr`) and/or accounting for PCR bias by normalization conditional on genomic region length and GC content (e.g., [CQN](https://academic.oup.com/biostatistics/article/13/2/204/1746212)) and goals (e.g., differential accessibility analysis).
100+
>
101+
> We recommend reading this ChIP-seq tutorial's section on ["Removing redundancy"](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html).
102+
89103
# 🛠️ Usage
90104
These steps are the recommended usage for this workflow:
91105

@@ -170,6 +184,7 @@ Finally, a previous PhD student in our lab, [André Rendeiro](https://orcid.org/
170184
- [Unsupervised Analysis](https://github.com/epigen/unsupervised_analysis) to understand and visualize similarities and variations between cells/samples, including dimensionality reduction and cluster analysis. Useful for all tabular data including single-cell and bulk sequencing data.
171185
- [Differential Analysis with limma](https://github.com/epigen/dea_limma) to identify and visualize statistically significantly different features (e.g., genes or genomic regions) between sample groups.
172186
- [Enrichment Analysis](https://github.com/epigen/enrichment_analysis) for biomedical interpretation of (differential) analysis results using prior knowledge.
187+
- [Introduction to ChIP-seq using high performance computing](https://hbctraining.github.io/Intro-to-ChIPseq/)
173188

174189
# 📑 Publications
175190
The following publications successfully used this module for their analyses.

config/config.yaml

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -23,19 +23,19 @@ sequencing_center: CeMM_BSF
2323
annot_columns: ['pass_qc','read_type','organism'] # (optional) can be empty [""]. must be columns in the annotation sheet
2424

2525
##### QUANTIFICATION #####
26+
27+
# samtools view filtering SAM flags
28+
# https://broadinstitute.github.io/picard/explain-flags.html
29+
SAM_flag: 2316 # int; keep duplictaes 2316 or filter duplicates 3340
30+
2631
# coverage calculation with bedtools for QC report
2732
tss_slop: 2000
2833
noise_lower: 100
2934

30-
# specify if duplicates should be kept during filtering bam files to define samtools view filtering flags
31-
# filtered reads used as input for macs2 peak calling and counts
32-
# warning: inclusion of duplicates should be intentional, and may lead to a large number of consensus regions
33-
remove_dup: True # bool: True by default
34-
35-
# specify how duplicate reads should be handled by macs2
36-
# warning: inclusion of duplicates should be intentional, and may lead to a large number of consensus regions
37-
# see documentation for parameter: https://manpages.ubuntu.com/manpages/mantic/man1/macs2_callpeak.1.html
38-
keep_dup: 1 # int: default = 1, int for kept reads at given genomic coordinates. or "auto"
35+
# MACS2 --keep-dup paramater for duplicate read handling during peak calling
36+
# https://manpages.ubuntu.com/manpages/mantic/man1/macs2_callpeak.1.html
37+
# https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html
38+
macs2_keep_dup: "auto" # int: default = 1, int for kept reads at given genomic coordinates; or "auto"
3939

4040
# determination of consensus regions using (py)bedtools
4141
slop_extension: 250

test/hg38test/config/hg38test_atacseq_pipeline_config.yaml

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,6 @@
22
##### RESOURCES #####
33
mem: '32000'
44
threads: 2
5-
partition: 'shortq'
65

76
##### GENERAL #####
87
project_name: hg38test #MyATACproject # name of the project/dataset
@@ -24,10 +23,20 @@ sequencing_center: CeMM_BSF
2423
annot_columns: ['pass_qc','read_type','organism'] # (optional) can be empty [""]. must be columns in the annotation sheet
2524

2625
##### QUANTIFICATION #####
26+
27+
# samtools view filtering SAM flags
28+
# https://broadinstitute.github.io/picard/explain-flags.html
29+
SAM_flag: 2316 # int; keep duplictaes 2316 or filter duplicates 3340
30+
2731
# coverage calculation with bedtools for QC report
2832
tss_slop: 2000
2933
noise_lower: 100
3034

35+
# MACS2 --keep-dup paramater for duplicate read handling during peak calling
36+
# https://manpages.ubuntu.com/manpages/mantic/man1/macs2_callpeak.1.html
37+
# https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html
38+
macs2_keep_dup: "auto" # int: default = 1, int for kept reads at given genomic coordinates; or "auto"
39+
3140
# determination of consensus regions using (py)bedtools
3241
slop_extension: 250
3342

test/mm10test/config/mm10test_atacseq_pipeline_config.yaml

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,6 @@
22
##### RESOURCES #####
33
mem: '32000'
44
threads: 2
5-
partition: 'shortq'
65

76
##### GENERAL #####
87
project_name: mm10test #MyATACproject # name of the project/dataset
@@ -24,10 +23,20 @@ sequencing_center: CeMM_BSF
2423
annot_columns: ['pass_qc','read_type','organism'] # (optional) can be empty [""]. must be columns in the annotation sheet
2524

2625
##### QUANTIFICATION #####
26+
27+
# samtools view filtering SAM flags
28+
# https://broadinstitute.github.io/picard/explain-flags.html
29+
SAM_flag: 2316 # int; keep duplictaes 2316 or filter duplicates 3340
30+
2731
# coverage calculation with bedtools for QC report
2832
tss_slop: 2000
2933
noise_lower: 100
3034

35+
# MACS2 --keep-dup paramater for duplicate read handling during peak calling
36+
# https://manpages.ubuntu.com/manpages/mantic/man1/macs2_callpeak.1.html
37+
# https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html
38+
macs2_keep_dup: "auto" # int: default = 1, int for kept reads at given genomic coordinates; or "auto"
39+
3140
# determination of consensus regions using (py)bedtools
3241
slop_extension: 250
3342

workflow/rules/processing.smk

Lines changed: 2 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -17,14 +17,12 @@ rule align:
1717
samtools_flagstat_log = os.path.join(result_path, 'results', "{sample}", 'mapped', '{sample}.samtools_flagstat.log'),
1818
stats = os.path.join(result_path, 'results', "{sample}", '{sample}.align.stats.tsv'),
1919
params:
20-
# alignment parameters
2120
interleaved_in = lambda w: "--interleaved_in" if samples["{}".format(w.sample)]["read_type"] == "paired" else " ",
2221
interleaved = lambda w: "--interleaved" if samples["{}".format(w.sample)]["read_type"] == "paired" else " ",
23-
filtering = lambda w: "-q 30 -F {flag} -f 2 -L {whitelist}".format(flag=3340 if config['remove_dup'] else 2316, whitelist=config["whitelisted_regions"]) if samples["{}".format(w.sample)]["read_type"] == "paired" else "-q 30 -F {flag} -L {whitelist}".format(flag=3340 if config['remove_dup'] else 2316, whitelist=config["whitelisted_regions"]),
22+
filtering = lambda w: "-q 30 -F {flag} -f 2 -L {whitelist}".format(flag=config['SAM_flag'], whitelist=config["whitelisted_regions"]) if samples["{}".format(w.sample)]["read_type"] == "paired" else "-q 30 -F {flag} -L {whitelist}".format(flag=config['SAM_flag'], whitelist=config["whitelisted_regions"]),
2423
add_mate_tags = lambda w: "--addMateTags" if samples["{}".format(w.sample)]["read_type"] == "paired" else " ",
2524
adapter_sequence = "-a " + config["adapter_sequence"] if config["adapter_sequence"] !="" else " ",
2625
adapter_fasta = "--adapter_fasta " + config["adapter_fasta"] if config["adapter_fasta"] !="" else " ",
27-
# configs
2826
sequencing_platform = config["sequencing_platform"],
2927
sequencing_center = config["sequencing_center"],
3028
mitochondria_name = config["mitochondria_name"],
@@ -64,10 +62,8 @@ rule tss_coverage:
6462
output:
6563
tss_hist = os.path.join(result_path,"results","{sample}","{sample}.tss_histogram.csv"),
6664
params:
67-
# parameters for coverage
6865
noise_upper = ( config["tss_slop"] * 2 ) - config["noise_lower"],
6966
double_slop = ( config["tss_slop"] * 2 ),
70-
# configs
7167
genome_size = config["genome_size"],
7268
tss_slop = config["tss_slop"],
7369
unique_tss = config["unique_tss"],
@@ -106,17 +102,14 @@ rule peak_calling:
106102
macs2_log = os.path.join(result_path, 'results', "{sample}", 'peaks', '{sample}.macs2.log'),
107103
stats = os.path.join(result_path, 'results', "{sample}", '{sample}.peak.stats.tsv'),
108104
params:
109-
# paths
110105
peaks_dir = os.path.join(result_path,"results","{sample}","peaks"),
111106
homer_dir = os.path.join(result_path,"results","{sample}","homer"),
112107
homer_bin = os.path.join(HOMER_path,"bin"),
113-
# peak calling parameters
114108
formating = lambda w: '--format BAMPE' if samples["{}".format(w.sample)]["read_type"] == "paired" else '--format BAM',
115-
# configs
116109
genome_size = config["genome_size"],
117110
genome = config["genome"],
118111
regulatory_regions = config["regulatory_regions"],
119-
keep_dup = config['keep_dup'],
112+
keep_dup = config['macs2_keep_dup'],
120113
resources:
121114
mem_mb=config.get("mem", "16000"),
122115
threads: 4*config.get("threads", 2)

0 commit comments

Comments
 (0)