Skip to content

Commit d6101d6

Browse files
committed
add sample annotation visualization
1 parent 4b79a65 commit d6101d6

File tree

7 files changed

+255
-5
lines changed

7 files changed

+255
-5
lines changed

README.md

Lines changed: 8 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -75,6 +75,7 @@ The processing and quantification described here was performed using a publicly
7575
- Quantification of TSS coverage.
7676
- Reporting (`report/`)
7777
- MultiQC report generation using MultiQC, extended with an in-house developed plugin [atacseq_report](./workflow/scripts/multiqc_atacseq).
78+
- Sample annotation is visualized as a hierarchically-clustered QC heatmap with matching metadata annotation, exported both as a `PNG` and an interactive `HTML` with metadata as tooltips (`sample_annotation.{png|html}`).
7879
- Quantification (`counts/`)
7980
- Consensus region set generation across all called peaks (`consensus_regions.bed`).
8081
- Read count quantification of the consensus regions across samples, yielding a count matrix with dimensions consensus regions X samples (`consensus_counts.csv`).
@@ -84,13 +85,13 @@ The processing and quantification described here was performed using a publicly
8485
- [Pseudoautosomal regions in human](https://www.ensembl.org/info/genome/genebuild/human_PARS.html) chromosome `Y` are skipped.
8586
- Aggregation of all sample-wise HOMER known motif enrichment results into one CSV in long-format (`HOMER_knownMotifs.csv`).
8687
- Annotation (`counts/`)
87-
- Sample annotation file based on MultiQC general stats and provided annotations for downstream analysis (`sample_annotation.csv`).
88+
- Sample annotation file based on `MultiQC` general stats and provided annotations for downstream analysis (`sample_annotation.csv`).
8889
- Consensus region set annotation using (`consensus_annotation.csv`)
8990
- `UROPA` with regulatory build and gencode as references, configurable here: `workflow/resources/UROPA/*.txt`.
9091
- `HOMER` with `annotatePeaks.pl`. NB: We have empirically found, that some human sex genes, e.g., the well established protein coding genes UTY and STS, are not annotated.
9192
- `bedtools` for nucleotide counts/content (e.g., % of GC).
9293

93-
> [!IMPORTANT]
94+
> [!IMPORTANT]
9495
> **Duplciate reads** can be filtered during the alignment step by `samtools` and/or ignored during peak calling by `MACS2`.
9596
> **The inclusion of duplicates** should be intentional, and may lead to a large number of consensus regions.
9697
> **The removal of duplicates** should be intentional, might remove real biological signal.
@@ -106,16 +107,19 @@ These steps are the recommended usage for this workflow:
106107
3. Fill out the mandatory quality control column (pass_qc) in the annotation file accordingly (everything >0 will be included in the downstream steps).
107108
4. Finally, execute the remaining downstream quantification and annotation steps by running the workflow. Thereby only the samples that passed quality control will be included in the consensus region set generation (i.e., the feature space) and all downstream steps.
108109

110+
> [!NOTE]
111+
> Although inputs and parameters may be identical, **MACS2 peak calling can yield slightly varying results** (± a few peaks) due to stochastic elements in its algorithm (e.g., duplicate handling).This minor variability in peak calls sohuld have no impact on downstream analyses or the overall robustness of results.
112+
109113
This workflow is written with Snakemake and its usage is described in the [Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog?usage=epigen/atacseq_pipeline).
110114

111115
# ⚙️ Configuration
112116
Detailed specifications can be found here [./config/README.md](./config/README.md)
113117

114118
# 📖 Examples
115-
Explore a detailed example showcasing module usage and downstream analysis in our comprehensive end-to-end [MrBiomics Recipe](https://github.com/epigen/MrBiomics?tab=readme-ov-file#-recipes) for [ATACseq Analysis](https://github.com/epigen/MrBiomics/wiki/ATAC%E2%80%90seq-Analysis-Recipe), including data, configuration, annotation and results.
119+
Explore a detailed example showcasing module usage and downstream analysis in our comprehensive end-to-end [MrBiomics Recipe](https://github.com/epigen/MrBiomics?tab=readme-ov-file#-recipes) for [ATAC-seq Analysis](https://github.com/epigen/MrBiomics/wiki/ATAC%E2%80%90seq-Analysis-Recipe), including data, configuration, annotation and results.
116120

117121
# 🔍 Quality Control
118-
Below are some guidelines for the manual quality control of each sample, but keep in mind that every experiment/dataset is different.
122+
Below are some guidelines for the manual quality control of each sample using the generated `MultiQC` report and visualized (interactive) sample annotation, but keep in mind that every experiment/dataset is different. Thresholds are general suggestions and may vary based on experiment type, organism, and library prep.
119123

120124
1. Reads Mapped ~ $30\cdot 10^{6}$ ($>20\cdot 10^{6}$ at least)
121125
2. % Aligned >90%

workflow/Snakefile

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@ min_version("8.20.1")
1919
report: os.path.join("report", "workflow.rst")
2020

2121
# list of names of the used environment specifications in workflow/envs/{env_name}.yaml
22-
envs = ["bowtie2","macs2_homer","multiqc","pybedtools","uropa", "datamash"]
22+
envs = ["bowtie2","macs2_homer","multiqc","pybedtools","uropa","datamash","ggplot"]
2323

2424
##### load config and sample annotation sheets #####
2525
configfile: os.path.join("config","config.yaml")
@@ -45,6 +45,7 @@ rule all:
4545
multiqc_report = os.path.join(result_path,"report","multiqc_report.html"),
4646
# QUANTIFICATION
4747
sample_annotation = os.path.join(result_path, "counts", "sample_annotation.csv") if len(samples_quantify)>0 else [],
48+
sample_annotation_plot = os.path.join(result_path, "report", "sample_annotation.png") if len(samples_quantify)>0 else [],
4849
support_counts = os.path.join(result_path,"counts","support_counts.csv") if len(samples_quantify)>0 else [],
4950
consensus_counts = os.path.join(result_path,"counts","consensus_counts.csv") if len(samples_quantify)>0 else [],
5051
promoter_counts = os.path.join(result_path,"counts","promoter_counts.csv") if len(samples_quantify)>0 else [],

workflow/envs/ggplot.yaml

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
channels:
2+
- conda-forge
3+
- bioconda
4+
- nodefaults
5+
dependencies:
6+
- r-tidyverse=2.0.0
7+
- r-data.table=1.17.6
8+
- r-patchwork=1.3.0
9+
- r-ggplot2=3.5.2
10+
- r-ggnewscale=0.5.1
11+
- r-plotly=4.11.0
12+
- r-htmlwidgets=1.6.4
13+
- r-stringr=1.5.1
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
Interactive visualization of hierarchically clustered sample QC metrics and annotation.

workflow/rules/quantification.smk

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -46,6 +46,8 @@ rule get_consensus_regions:
4646
chromosome_sizes = config["chromosome_sizes"],
4747
output:
4848
consensus_regions = os.path.join(result_path,"counts","consensus_regions.bed"),
49+
params:
50+
slop_extension = config["slop_extension"],
4951
resources:
5052
mem_mb=config.get("mem", "16000"),
5153
threads: config.get("threads", 2)

workflow/rules/report.smk

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -65,3 +65,27 @@ rule multiqc:
6565
"""
6666
multiqc {params.result_path}/report --force --verbose --outdir {params.result_path}/report --filename multiqc_report.html --cl-config "{params.multiqc_configs}"
6767
"""
68+
69+
# visualize sample annotation (including QC metrics)
70+
rule plot_sample_annotation:
71+
input:
72+
sample_annotation = config["annotation"],
73+
sample_annotation_w_QC = os.path.join(result_path, "counts", "sample_annotation.csv"),
74+
output:
75+
sample_annotation_plot = os.path.join(result_path,"report","sample_annotation.png"),
76+
sample_annotation_html = report(os.path.join(result_path,"report","sample_annotation.html"),
77+
caption="../report/sample_annotation.rst",
78+
category="{}_{}".format(config["project_name"], module_name),
79+
subcategory="QC",
80+
labels={
81+
"name": "Sample annotation",
82+
"type": "HTML",
83+
}),
84+
log:
85+
"logs/rules/plot_sample_annotation.log",
86+
resources:
87+
mem_mb="4000",
88+
conda:
89+
"../envs/ggplot.yaml"
90+
script:
91+
"../scripts/plot_sample_annotation.R"
Lines changed: 205 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,205 @@
1+
#### libraries ####
2+
library(data.table)
3+
library(tidyverse)
4+
library(patchwork)
5+
library(ggplot2)
6+
library(ggnewscale)
7+
library(stringr)
8+
# for interactive plotting
9+
library(plotly)
10+
library(htmlwidgets)
11+
# set plot base size
12+
theme_set(theme_minimal(base_size = 6))
13+
14+
#### configs ####
15+
# input
16+
sample_annotation_path <- snakemake@input[["sample_annotation"]]
17+
sample_annotation_w_QC_path <- snakemake@input[["sample_annotation_w_QC"]]
18+
19+
# output
20+
sample_annotation_plot_path <- snakemake@output[["sample_annotation_plot"]]
21+
sample_annotation_html_path <- snakemake@output[["sample_annotation_html"]]
22+
23+
#### load & prepare data ####
24+
# load data
25+
sample_annotation <- data.table::fread(file.path(sample_annotation_path), header = TRUE)
26+
sample_annotation <- data.frame(sample_annotation[!duplicated(sample_annotation[[1]]), ], row.names = 1, check.names = FALSE)
27+
28+
anno <- data.frame(fread(file.path(sample_annotation_w_QC_path), header=TRUE), row.names=1)
29+
30+
# determine QC (pipeline provided) columns
31+
names(sample_annotation) <- gsub(" +", "_", names(sample_annotation)) # replace empty space ` ` with underscore `_`
32+
qc_cols <- setdiff(names(anno), names(sample_annotation))
33+
34+
# determine metadata (user provided) columns by removing non-numeric columns that are unique for each row (e.g., bam_file)
35+
sample_annotation <- sample_annotation %>% select(where(\(.x){ n <- dplyr::n_distinct(na.omit(.x)); (is.numeric(.x) || n < nrow(sample_annotation)) }))
36+
meta_cols <- names(sample_annotation)
37+
38+
# drop columns with no variation (e.g., read_type)
39+
has_var <- function(x) length(unique(na.omit(x))) > 1
40+
qc_cols <- keep(qc_cols, ~ has_var(anno[[.x]]))
41+
meta_cols <- keep(meta_cols, ~ has_var(anno[[.x]]))
42+
43+
# collapse "duplicates" (ie redundant columns) only among non-numeric
44+
num_cols <- meta_cols[vapply(anno[meta_cols], is.numeric, logical(1))]
45+
cat_cols <- setdiff(meta_cols, num_cols)
46+
sig <- vapply(cat_cols,
47+
\(col) paste(match(anno[[col]], unique(anno[[col]])), collapse = "|"),
48+
character(1))
49+
meta_cols <- c(cat_cols[!duplicated(sig)], num_cols)
50+
51+
#### Z-score & cluster QC data ####
52+
qc_mat <- anno |> select(all_of(qc_cols)) |> scale() |> as.matrix()
53+
54+
row_ord <- hclust(dist(qc_mat))$order
55+
col_ord <- hclust(dist(t(qc_mat)))$order
56+
57+
qc_long <- as_tibble(qc_mat[row_ord, col_ord], rownames = "sample") |>
58+
pivot_longer(-sample, names_to = "metric", values_to = "z")
59+
60+
#### prepare metadata for plotting ####
61+
meta_long <- anno[row_ord, meta_cols] %>%
62+
mutate(sample = rownames(anno)[row_ord]) %>%
63+
mutate(across(-sample, as.character)) %>%
64+
pivot_longer(-sample, names_to = "meta", values_to = "value") %>%
65+
group_by(meta) %>%
66+
mutate(num_val = suppressWarnings(as.numeric(value)),
67+
type = if (all(!is.na(num_val))) "numeric" else "factor",
68+
col = if (type[1] == "numeric") {
69+
scales::col_numeric("plasma",
70+
domain = range(num_val, na.rm = TRUE))(num_val)
71+
} else {
72+
pal <- scales::hue_pal(l = 65)(n_distinct(value))
73+
setNames(pal, sort(unique(value)))[value]
74+
}
75+
) %>%
76+
ungroup() %>%
77+
select(-num_val)
78+
79+
#### embed ALL metadata into a tooltip string of interactive plot ####
80+
meta_txt <- anno[row_ord, meta_cols] %>% mutate(across(everything(), as.character))
81+
meta_txt <- pmap_chr(meta_txt, \(...) {
82+
vals <- c(...)
83+
paste(paste(names(vals), vals, sep = ": "), collapse = "<br>")
84+
})
85+
names(meta_txt) <- rownames(anno)[row_ord]
86+
87+
#### plot heatmaps ####
88+
89+
#### QC heatmap ####
90+
91+
# add (un-scaled) metric values (raw) for tooltip of interactive plot
92+
qc_raw_long <- anno[row_ord, qc_cols] %>%
93+
mutate(sample = rownames(anno)[row_ord]) %>%
94+
pivot_longer(-sample, names_to = "metric", values_to = "raw")
95+
96+
qc_long <- qc_long %>% left_join(qc_raw_long, by = c("sample", "metric"))
97+
98+
# keep the clustered order in the plot and add hover-tooltip
99+
qc_long <- qc_long %>%
100+
mutate(sample = factor(sample, levels = rownames(qc_mat)[row_ord]), # row-order
101+
metric = factor(metric, levels = colnames(qc_mat)[col_ord]), # col-order
102+
hover = paste0("Sample: ", sample,
103+
"<br>Metric: ", metric,
104+
"<br>Value: ", signif(raw, 4),
105+
"<br>", meta_txt[as.character(sample)])
106+
)
107+
108+
# plot
109+
p_qc <- ggplot(qc_long, aes(x = metric,
110+
y = sample,
111+
fill = z,
112+
text = hover)) +
113+
geom_tile() +
114+
scale_x_discrete(limits = colnames(qc_mat)[col_ord]) + # enforce col order
115+
scale_y_discrete(limits = rownames(qc_mat)[row_ord]) + # enforce row order
116+
scale_fill_gradient2(low = "blue", mid = "white", high = "red", name = "z-score",
117+
guide = guide_colourbar(barheight = 2, # thinner
118+
barwidth = 0.15)) +
119+
labs(x = NULL, y = NULL, title = "QC metrics (scaled)") +
120+
theme(axis.text.x = element_text(angle = 45, hjust = 1),
121+
panel.grid = element_blank())
122+
123+
#### metadata heatmap as "annotation" ####
124+
p_meta <- NULL
125+
if(length(meta_cols) > 0){
126+
# order columns (x) exactly like the QC heatmap
127+
meta_long <- meta_long %>% mutate(sample = factor(sample, levels = rownames(qc_mat)[row_ord])) # row-order
128+
meta_levels <- unique(meta_long$meta)
129+
meta_long <- meta_long %>% mutate(meta = factor(meta, levels = meta_levels))
130+
131+
p_meta <- ggplot() +
132+
scale_y_discrete(limits = levels(qc_long$sample)) +
133+
scale_x_discrete(limits = meta_levels) +
134+
labs(x = NULL, y = NULL, title = "Metadata") +
135+
theme(axis.text.x = element_text(angle = 45, hjust = 1),
136+
axis.text.y = element_blank(),
137+
axis.title = element_blank(),
138+
panel.grid = element_blank())
139+
140+
for (v in meta_levels) {
141+
dat <- dplyr::filter(meta_long, meta == v)
142+
143+
p_meta <- p_meta + ggnewscale::new_scale_fill() # reset “fill” for this column
144+
145+
if (dat$type[1] == "numeric") { # continuous legend
146+
p_meta <- p_meta +
147+
geom_tile(data = dat, aes(x = meta, y = sample, fill = as.numeric(value)), colour = "grey60", linewidth = 0.1) +
148+
scale_fill_viridis_c(name = v, option = "plasma",
149+
guide = guide_colourbar(barheight = 2, # thinner
150+
barwidth = 0.15))
151+
} else { # categorical legend
152+
pal <- setNames(dat$col, dat$value)
153+
154+
# reduce legend in case of more than 10 levels
155+
max_items <- min(10, length(unique(dat$value)))
156+
all_levels <- unique(names(pal))
157+
show_levels <- all_levels[1:max_items]
158+
159+
p_meta <- p_meta +
160+
geom_tile(data = dat, aes(x = meta, y = sample, fill = value), colour = "grey60", linewidth = 0.1) +
161+
scale_fill_manual(values = pal,
162+
# name = v,
163+
breaks = show_levels,
164+
guide = guide_legend(keywidth = 0.25,
165+
keyheight = 0.4,
166+
ncol=1,
167+
byrow = TRUE,
168+
title = ifelse(
169+
length(all_levels) <= max_items,
170+
v,
171+
paste0(v, " (showing ", max_items, "/", length(all_levels), ")")
172+
)
173+
)
174+
)
175+
}
176+
}
177+
}
178+
179+
#### combine and save plots ####
180+
p_combined <- if (is.null(p_meta)) p_qc else (p_qc | p_meta) + plot_layout(widths = c(length(qc_cols), length(meta_cols)), guides = "collect") & theme(legend.position = "right")
181+
182+
# determine sizes
183+
n_rows <- nrow(qc_mat)
184+
n_cols <- length(qc_cols) + length(meta_cols)
185+
max_row_label <- max(nchar(rownames(anno)))
186+
max_col_label <- max(nchar(c(qc_cols, meta_cols)))
187+
188+
height_in <- n_rows * 0.08 + max_col_label * 0.05 + 1
189+
width_in <- n_cols * 0.10 + max_row_label * 0.05 + 2
190+
191+
# options(repr.plot.width = width_in, repr.plot.height = height_in)
192+
# p_combined
193+
194+
ggsave(sample_annotation_plot_path, plot = p_combined, width = width_in, height = height_in, units = "in", dpi = 300)
195+
196+
#### interactive plot ####
197+
# determine sizes in pixels
198+
width_px <- round((length(qc_cols) * 0.10 + max_row_label * 0.05 + 2) * 96)
199+
height_px <- round(height_in * 96)
200+
201+
p_qc_interactive <- plotly::ggplotly(p_qc, tooltip = "text", width = width_px, height = height_px)
202+
203+
# p_qc_interactive
204+
205+
htmlwidgets::saveWidget(p_qc_interactive, sample_annotation_html_path, selfcontained = TRUE, title = "Sample annotation")

0 commit comments

Comments
 (0)