- Quality Control on Raw Reads
- Adapter Trimming
- Quality Control on Trimmed Reads
- Alignment
- Post-Alignment Filtering
- Peak Calling
- Peak Count Matrix
- Differential Peak Analyses
- MOTIF Enrichment
- File Conversion Wig/bigWig
- Overall TSS Accessibility
For questions, comments, suggestions, anything, feel free to contact via git or through the following.
- Email: jackfreeland01@gmail.com
- LinkedIn: @JackFreeland
- Twitter: @JackFreelandLab
This repository provides an example pipeline for processing bulk ATAC-seq data (Assay for Transposase-Accessible Chromatin sequencing). Starting with raw fastq files, the pipeline calls individual peaks and can generate consensus peak count matrices, followed by differential peak accessibility, overall TSS accessibility, and MOTIF enrichment analyses. This walkthrough will cover each step in detail, including the rationale, example code, and explanations of parameters/arguments when appropriate. An excellent resource to first get familiarized with ATAC-seq analysis can be found in Yan et al. (2020).
A script is available to perform the pre-processing steps, file conversion, and peak calling in a single workflow (ATACSeq_preprocessing_JF.sh). Additionally, individual scripts for the remaining post-processing steps are provided for flexibility and modular execution. A conda environment is available to aid in getting the code running (Conda_Env_Main_ATACSeq_JF.yml).
Before initiating any formal processing steps, it is advisable to assess the overall quality of the raw sequencing files. This pipeline uses FastQC to generate FastQC reports that provide information on sequence quality, GC content, length distribution, duplicate sequences, overrepresented sequences, K-mer content and adapter contamination. For paired-end sequencing, QC should be performed separately on both the forward and reverse read files. In ATAC-seq FastQC reports, we expect to see Nextera transposase sequencing adapters over-represented and a decrease in overall sequence quality near the 3' end. If interested, alternative or similar software to FastQC includes FastP and BBDuk.
# -t number of threads
# -o output directory
fastqc -o <output_dir> <sample_ID>_R1.fastq.gz -t <#_of_threads>
fastqc -o <output_dir> <sample_ID>_R2.fastq.gz -t <#_of_threads>
Figure 1: Example FastQC Output of Adapter Content
Figure 2: Example FastQC Output of Sequence Quality
FastQC often identifies adapter sequences contaminating the reads, which must be trimmed prior to alignment. These adapter sequences are artificial sequences introduced during library preparation. If not removed, they can falsely align to the reference genome, leading to incorrect mapping results and increased noise. This pipeline uses CutAdapt. If interested, alternative or similar software to CutAdapt includes FastP, ApadpterRemoval and Trimmomatic.
Common adapter sequences are the following:
- Illumina: AGATCGGAAGAGC
- Small RNA: TGGAATTCTCGG
- Nextera: CTGTCTCTTATA
# -a adapter sequence on the forward read (R1)
# -A adapter sequence on the reverse read (R2)
# -j number of threads
# -q phred quality score threshold (q of 20 keeps bases with 99% accuracy)
# -O minimum overlap between the adapter sequence and the read required for the adapter to be trimmed
# -m minimum read length rquired to be kept after trimming
# -o output file for the trimmed forward reads (R1)
# -p output file for the trimmed reverse reads (R2)
python3 -m cutadapt
-a CTGTCTCTTATA \ # Nextera example
-A CTGTCTCTTATA \
-j <#_of_threadss> \
-q 20 \
-O 6 \
-m 35 \
-o <sample_ID>_R1.trim.fastq.gz \
-p <sample_ID>_R2.trim.fastq.gz \
<sample_ID>_R1.fastq.gz \
<sample_ID>_R2.fastq.gz
After running CutAdapt, it is good practice to confirm the adapter sequences are no longer present by running FastQC again.
# -t number of threads
# -o output directory
fastqc -o <output_dir> <sample_ID>_R1.trim.fastq.gz -t <#_of_threads>
fastqc -o <output_dir> <sample_ID>_R2.trim.fastq.gz -t <#_of_threads>
Figure 3: Example FastQC Output of Adapter Content Post CutAdapt
After confirming that adapter sequences have been trimmed, the next step is to align the reads to a reference genome. This pipeline uses Bowtie 2 to align reads to the human genome (GRCh38). Bowtie 2 references can be found here. In this example, GRCh38_noalt_decoy_as is used.
The GRCh38 reference genome includes alternate haplotypes, which can cause reads to map equally to multiple regions, potentially resulting in lower quality scores. To address this, it is advantageous to remove alternate haplotypes and retain only the primary assembly ('noalt'). Additionally, including decoy sequences in your analysis can enhance accuracy. If a read aligns to a decoy sequence better than anywhere in the primary reference, it prevents false positives from misalignments within the primary genome. Both steps also decrease processing overhead and runtime. If interested, alternative or similar software to Bowtie 2 is BWA.
# --non-deterministic default, resolves ties randomly
# --mm enables memory mapping
# --phred33 default, FASTQ files follow Phred+33 scale
# --very-sensitive slower but most sensitive alignment mode
# -p number of threads
# -x reference genome
# -1 input file for the trimmed forward reads (R1)
# -2 input file for the trimmed reverse reads (R2)
# -S output file in SAM format
bowtie2 --non-deterministic --mm --phred33 --very-sensitive \
-p <#_of_threads> \
-x <GRCh38_noalt_decoy_as/GRCh38_noalt_decoy_as> \
-1 <sample_ID>_R1.trim.fastq.gz \
-2 <sample_ID>_R2.trim.fastq.gz \
-S <sample_ID>.sam
After aligning the reads, the data can be prepared for downstream analyses by converting the file type and applying QC/filtering measures:
- Converting from SAM to Bam
- Sort Reads and Index
- Adjust for Transposase Binding Offset
- Filter Unaligned Reads
- Filter ChrM, DNA scaffold, and 'Decoy' Reads
- Filter PCR Duplicates
- Filter for Quality Reads
- Filter of ENCODE blacklists
Before applying filtering or QC, the SAM file (a text based format) generated by Bowtie 2 is converted to a BAM file (a binary format) using Samtools. This conversion significantly reduces storage requirements and processing time, as most tools are optimized to handle BAM files.
samtools view -@ <#_of_threads> \
-bS <sample_ID>.sam > <sample_ID>.bam
Downstream software requires the BAM file to be sorted and indexed. Sorting organizes the alignments by their genomic coordinates, while indexing creates a companion index file that enables efficient random access to specific regions of the genome within the BAM files. Both steps are performed using Samtools.
# -@ number of threads
# -o output bam file
samtools sort \
-@ <#_of_threads> \
-o <sample_ID>_V1.bam \
<sample_ID>.bam
samtools index <sample_ID>_V1.bam
Reads must be adjusted to account for the transposase binding offset to ensure accurate peak summit identification. Tn5 transposase binds to DNA and inserts sequencing adapters at staggered positions. It cuts the DNA with a 9-bp overhang between the strands, meaning the positions reported in sequencing data are not the exact sites of chromatin accessibility but are slightly shifted versions. Without adjustment, peak edges appear shifted, leading to imprecise peak summits. This offset can be corrected using deepTools alignmentSieve.
# --bam input BAM file
# -o output BAM file
alignmentSieve --bam <sample_ID>_V1.bam \
-o <sample_ID>_V2.bam --ATACshift \
--numberOfProcessors <#_of_threads>
Reads that failed to align to the reference genome should be removed as they do not provide any information about chromatin accessibility. Removing these reads also decreases file size and computational overhead. Samtools can be used to remove reads with the SAM flag 4, which indicates that the read is unmapped.
# -b output results in BAM format
# -F exclude reads with given flag
samtools view -@ <#_of_threads> -b -F 4 \
-o <sample_ID>_V3.bam <sample_ID>_V2.bam
Mitochondrial reads should be removed as they, like unaligned reads, do not provide any information about chromatin accessibility (chrM). Unplaced and random DNA scaffolds should also be removed, as they are unlikely to reflect the chromatin accessibility of known, structured genomic loci. Decoy reads should be removed as decoy sequences are added to a reference genome to mitigate ambiguous read alignment. Decoy sequences largely do not correspond to function genomic loci and often represent technical artifacts or poorly characterized regions. Samtools is used again.
# view extracts BAM file and outputs as SAM (human-readable) to allow us to filter
# -@ number of threads
# -h includes header (important for downstream)
# egrep -v excludes lines containing the provided patern
# samtools view -b -o conerts SAM back to BAM and writes file
samtools view -@ <#_of_threads> -h <sample_ID>_V3.bam | \
egrep -v "chrM|Un|random|decoy" | \
samtools view -b -o <sample_ID>_V4.bam
PCR duplicates occur during library preparation. As the limited amount of DNA you start with in ATAC-seq requires PCR amplification to generate sufficient material for sequencing, multiple copies of the same DNA fragment are produced. This leads to identical (duplicated)reads that should be removed. To achieve this, the BAM file is first sorted again with Samtools and then filtered using Picard MarkDuplicates.
samtools sort \
-@ <#_of_threads> \
-o <sample_ID>_V5.bam <sample_ID>_V4.bam
# -Xmx3G sets max memory allocation for the java virtual machine (JVM)
# -XX:ParallelGCThreads sets number of threads for java's garbage collection process
# -jar picard.jar run picard.jar which contains Picard tools
# -M writes duplication metrics to file
# -I input file
# -O output file
# -REMOVE_DUPLICATES if set to TRUE, duplicates are removed instead of just flagged
java -Xmx3G -XX:ParallelGCThreads=<#_of_threads> \
-jar picard.jar MarkDuplicates \
-M <sample_ID>_m.txt" \
-I <sample_ID>_V5.bam \
-O <sample_ID>_V6.bam \
-REMOVE_DUPLICATES true
Low-quality reads are then filtered out using Samtools to ensure downstream analyses are based on reliable and biologically meaningful data.
# -q MAPQ score threshold (higher = more strict)
samtools view -b -q 30 -@ <#_of_threads> \
<sample_ID>_V6.bam > <sample_ID>_V7.bam
Finally, the ENCODE blacklist contains a list of genomic regions that are critical to remove when analyzing functional genomic data. Reads that map to these regions are typically not due to true biological signal but rather technical artifacts (e.g., misalignment, PCR amplification biases). Keeping these reads can introduce noise and false positives. Bedtools can be used to filter out these regions. The final BAM file will also be indexed to aid in downstream analyses.
# subtract subcommand used to subtract regions from one file based on overlaps with another
# -a BAM file to be filtered
# -b ENCODE blacklist (BED file)
# -A only remove reads if they completely overlap
bedtools subtract -a <sample_ID>_V7.bam -b <hg38-blacklist.bed> \
-A > <sample_ID>_V8.bam
After filtering the BAM files, peaks can be called. Peaks represent genomic regions with high read enrichment, reflecting areas of open chromatin or accessible DNA. These regions are typically associated with regulatory elements, such as promoters, enhancers, transcription factor binding sites, or other DNA elements where the chromatin is less compact, allowing transcriptional machinery and regulatory proteins to bind. This pipeline uses MACS3 callpeak to identify peaks.
# -f Specify format of input file (BAMPE = paired-end BAM files)
# -g Specify gemoe size (hs = human genome, mm = mouse)
# -q False discovery rate (FDR) threshold for peak calling
# -t Path to input BAM file
# -n Specifies a name prefix for output files
macs3 callpeak \
-f BAMPE \
-g hs \
-q 0.01 \
-t <sample_ID>_V8.bam \
-n <sample_ID> \
--outdir <output_directory>
After calling peaks, a peak count matrix can be generated for downstream analyses. As a peak count matrix is a tabular representation of read counts for identified genomic regions (peaks) across multiple samples, it is essential to first generate a set of consensus peaks across all samples. This is because any specific region or peak representing the same biological element across samples may vary by a few bases due to differences in MACS3 peak calling or biology. These variations can significantly complicate downstream quantification, analysis, and interpretation of results.
#!/bin/bash
### Find all narrowPeak files from MACS3 and concatenate them into one file
# $peak_dir Directory containing all MACS3 output files
find "$peak_dir" -name "*.narrowPeak" -exec cat {} + > "$peak_dir/all_concatenate.narrowPeak.bed"
### Sort the combined file
sort -k1,1 -k2,2n "$peak_dir/all_concatenate.narrowPeak.bed" > "$peak_dir/all_concatenate_sorted.narrowPeak.bed"
### Merge the peaks
bedtools merge -i "$peak_dir/all_concatenate_sorted.narrowPeak.bed" > "$peak_dir/all_concatenate_sorted_merged.narrowPeak.bed"
#!/bin/bash
### Create an array of all BAM files
# $bam_dir Directory containing all processed BAM files
bam_files=($bam_dir/*V8.bam)
### Create header with BAM filenames for the multicov output, ensuring tab separation
header="chr\tstart\tend"
for bam_file in "${bam_files[@]}"; do
sample_name=$(basename "$bam_file" | sed 's/_V8.bam//g')
header="$header\t$sample_name"
done
### Output the header to the file
echo -e "$header" > "$peak_dir/multicov_merged_counts.txt"
### Run bedtools multicov to count reads in merged peaks across all BAM files
bedtools multicov -bams "${bam_files[@]}" -bed "$peak_dir/all_concatenate_sorted_merged.narrowPeak.bed" >> "$peak_dir/multicov_merged_counts.txt"
To perform differential peak analyses, workflows very similar to those used for differential gene expression analyses can be applied, including packages such as DESeq2, edgeR, and limma.
Of note, the 'Peak Count Matrix' protocol will generate a count file with the chromosome number, starting base pair position, and ending base pair position in three separate columns. These columns will need to be combined into a single 'name' column before running any of the above packages. For example, by running DESeq2, you will generate a table as follows:
baseMean log2FC lfcSE stat pvalue padj
chr1_808161_808284 30.10598 -0.380062 0.417783 -0.909712 0.362974 0.999999
chr1_817195_817520 84.26785 0.3425648 0.254628 1.3453500 0.178512 0.999999
... ... ... ... ... ... ...
After performing differential peak analysis, motif enrichment analysis allows you to identify transcription factor binding sites or regulatory elements that are enriched within the differentially accessible regions. This provides insights into the potential regulatory mechanisms driving changes in chromatin accessibility and gene expression, helping to connect observed epigenetic changes with underlying biological processes or pathways. By uncovering enriched motifs, you can prioritize key transcription factors or regulatory networks for further functional validation. This pipeline uses HOMER to perform the enrichment.
findMotifsGenome.pl <significant_peaks_input_file> hg38 <output_dir> -size given -p <#_of_thread
s>
Figure 4: Example HOMER findMotifsGenome.pl Output
Many packages which visualize genomic data (such as ATAC) requires BAM files to be converted to either WIG or BigWig files. Here, deepTools bamCoverage is used to convert from BAM to BigWig and UCSC Genome Broswer bigWigtoWig to convert from BigWig to Wig.
# -b Input BAM file
# --normalizeUsing Normalization methods to account for sequencing depth, genome size, or scaling factors.
# --effectiveGenomeSize Specify effective genome size (portion of the genome mappable by reads)
# -p Number of threads for parallel processing
# -o Output file path
# --binSize Size of the bins (in base pairs) used to compute coverage
bamCoverage -b <sample_ID>_V8.bam \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \ (for homo sapiens)
-p <#_of_threads> \
-o <sample_ID>.bw \
--binSize 10
bigWigToWig <sample_ID>.bw <GRCH38_noalt_decoy_as.chrom.sizes> <sample_ID>.wig
# In this example, the reference file <GRCH38_noalt_decoy_as.chrom.sizes> is structured as follows:
# chr1 248956422
# chr2 242193529
# chr3 198295559
# ... ...
Aside from analyzing individual peaks, groups of peaks, or motifs, it can also be highly informative to examine the overall accessibility of transcription start sites (TSSs). This approach provides a global view of the openness or closeness of the chromatin across the transcriptome, offering insights into overall regulatory dynamics.
To achieve this, computeMatrix from the DeepTools suite is used to generate a matrix of scores around specific genomic regions, such as TSSs or other features of interest. The matrix contains binned signal values (e.g., read coverage or signal intensity) for defined regions, allowing for detailed analysis of patterns around these sites.
Once the matrix is generated, it can be visualized using plotHeatmap. This tool creates an intuitive heatmap that highlights the accessibility patterns of the selected regions, often complemented by an aggregate profile plot. The heatmap helps identify trends in chromatin accessibility, such as regions with consistently high or low accessibility, and can reveal differences between conditions or sample groups.
To group biological replicates or conditions prior to plotting, WiggleTools can be used to calculate the average of multiple WIG files before converting them to BigWig.
### Average WIG files (if needed)
wiggletools mean <sample_ID_1>.wig <sample_ID_2>.wig <sample_ID_3>.wig > <sample_ID_123>.wig
### Prepping matrix
# --referencePoint Specifies the mode of matrix computation. This calculates scores relative to a fixed point.
# --referencePoint Defines the reference point within the regions specified in the BED file. Here, TSS is used as the reference point.
# --regionsFileName Input file containing genomic regions of interest in BED format.
# --scoreFileName Input BigWig file containing the scores (e.g., read coverage, signal intensity)
# --beforeRegionStartLength The distance upstream (5' direction) of the reference point (TSS) to include in the analysis.
# --afterRegionStartLength The distance downstream (3' direction) of the reference point (TSS) to include in the analysis.
# --binSize Defines the size (in base pairs) of bins used for averaging scores. Smaller bin results in finer resolution.
# --outFileName Specifies output file name for the resulting matrix to be used in downstream visualization tools.
# --numberOfProcessors Specifies number of threads to use
# --missingDataAsZero Treats missing data (e.g., regions with no signal in the BigWig file) as zero instead of excluding them from the matrix
# --skipZeros Excludes bins with a score of zero from the output matrix
## For one BigWig file
computeMatrix reference-point \
--referencePoint TSS \
--regionsFileName TSS.bed \
--scoreFileName <sample_ID>.bw \
--beforeRegionStartLength 3000 \
--afterRegionStartLength 3000 \
--binSize 10 \
--outFileName <sample_ID>_ComputeMatrix_RefPnt.gz \
--numberOfProcessors max \
--missingDataAsZero \
--skipZeros
## For all BigWig files in a given directory
computeMatrix reference-point \
--referencePoint TSS \
--regionsFileName TSS.bed \
--scoreFileName "$bigwig_dir"/* \
--beforeRegionStartLength 3000 \
--afterRegionStartLength 3000 \
--binSize 10 \
--outFileName <sample_ID>_ComputeMatrix_RefPnt.gz \
--numberOfProcessors max \
--missingDataAsZero \
--skipZeros
### Plot matrix
plotHeatmap -m <sample_ID>_ComputeMatrix_RefPnt.gz \
-out <sample_ID>_TSS_heatmap.png
Figure 5: Example plotHeatmap Output




