add sample annotation visualization

sreichl · sreichl · commit d6101d696d79 · 2025-07-07T17:23:27.000+02:00
diff --git a/README.md b/README.md
@@ -75,6 +75,7 @@ The processing and quantification described here was performed using a publicly
     - Quantification of TSS coverage.
 - Reporting (`report/`)
     - MultiQC report generation using MultiQC, extended with an in-house developed plugin [atacseq_report](./workflow/scripts/multiqc_atacseq).
+    - Sample annotation is visualized as a hierarchically-clustered QC heatmap with matching metadata annotation, exported both as a `PNG` and an interactive `HTML` with metadata as tooltips (`sample_annotation.{png|html}`).
 - Quantification (`counts/`)
     - Consensus region set generation across all called peaks (`consensus_regions.bed`).
     - Read count quantification of the consensus regions across samples, yielding a count matrix with dimensions consensus regions X samples (`consensus_counts.csv`).
@@ -84,13 +85,13 @@ The processing and quantification described here was performed using a publicly
       - [Pseudoautosomal regions in human](https://www.ensembl.org/info/genome/genebuild/human_PARS.html) chromosome `Y` are skipped.
     - Aggregation of all sample-wise HOMER known motif enrichment results into one CSV in long-format (`HOMER_knownMotifs.csv`).
 - Annotation (`counts/`)
-    - Sample annotation file based on MultiQC general stats and provided annotations for downstream analysis (`sample_annotation.csv`).
+    - Sample annotation file based on `MultiQC` general stats and provided annotations for downstream analysis (`sample_annotation.csv`).
     - Consensus region set annotation using (`consensus_annotation.csv`)
       - `UROPA` with regulatory build and gencode as references, configurable here: `workflow/resources/UROPA/*.txt`.
       - `HOMER` with `annotatePeaks.pl`. NB: We have empirically found, that some human sex genes, e.g., the well established protein coding genes UTY and STS, are not annotated.
       - `bedtools` for nucleotide counts/content (e.g., % of GC).
 
-> [!IMPORTANT]  
+> [!IMPORTANT] 
 > **Duplciate reads** can be filtered during the alignment step by `samtools` and/or ignored during peak calling by `MACS2`.
 > **The inclusion of duplicates** should be intentional, and may lead to a large number of consensus regions.
 > **The removal of duplicates** should be intentional, might remove real biological signal.
@@ -106,16 +107,19 @@ These steps are the recommended usage for this workflow:
 3. Fill out the mandatory quality control column (pass_qc) in the annotation file accordingly (everything >0 will be included in the downstream steps).
 4. Finally, execute the remaining downstream quantification and annotation steps by running the workflow. Thereby only the samples that passed quality control will be included in the consensus region set generation (i.e., the feature space) and all downstream steps.
 
+> [!NOTE]
+> Although inputs and parameters may be identical, **MACS2 peak calling can yield slightly varying results** (± a few peaks) due to stochastic elements in its algorithm (e.g., duplicate handling). This minor variability in peak calls should have no impact on downstream analyses or the overall robustness of results.
+
 This workflow is written with Snakemake and its usage is described in the [Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog?usage=epigen/atacseq_pipeline).
 
 # ⚙️ Configuration
 Detailed specifications can be found here [./config/README.md](./config/README.md)
 
 # 📖 Examples
-Explore a detailed example showcasing module usage and downstream analysis in our comprehensive end-to-end [MrBiomics Recipe](https://github.com/epigen/MrBiomics?tab=readme-ov-file#-recipes) for [ATACseq Analysis](https://github.com/epigen/MrBiomics/wiki/ATAC%E2%80%90seq-Analysis-Recipe), including data, configuration, annotation and results.
+Explore a detailed example showcasing module usage and downstream analysis in our comprehensive end-to-end [MrBiomics Recipe](https://github.com/epigen/MrBiomics?tab=readme-ov-file#-recipes) for [ATAC-seq Analysis](https://github.com/epigen/MrBiomics/wiki/ATAC%E2%80%90seq-Analysis-Recipe), including data, configuration, annotation and results.
 
 # 🔍 Quality Control
-Below are some guidelines for the manual quality control of each sample, but keep in mind that every experiment/dataset is different.
+Below are some guidelines for the manual quality control of each sample using the generated `MultiQC` report and visualized (interactive) sample annotation, but keep in mind that every experiment/dataset is different. Thresholds are general suggestions and may vary based on experiment type, organism, and library prep.
 
 1. Reads Mapped ~ $30\cdot 10^{6}$ ($>20\cdot 10^{6}$ at least)
 2. % Aligned >90%
diff --git a/workflow/Snakefile b/workflow/Snakefile
@@ -19,7 +19,7 @@ min_version("8.20.1")
 report: os.path.join("report", "workflow.rst")
     
 # list of names of the used environment specifications in workflow/envs/{env_name}.yaml
-envs = ["bowtie2","macs2_homer","multiqc","pybedtools","uropa", "datamash"]
+envs = ["bowtie2","macs2_homer","multiqc","pybedtools","uropa","datamash","ggplot"]
 
 ##### load config and sample annotation sheets #####
 configfile: os.path.join("config","config.yaml")
@@ -45,6 +45,7 @@ rule all:
         multiqc_report = os.path.join(result_path,"report","multiqc_report.html"),
         # QUANTIFICATION
         sample_annotation = os.path.join(result_path, "counts", "sample_annotation.csv") if len(samples_quantify)>0 else [],
+        sample_annotation_plot = os.path.join(result_path, "report", "sample_annotation.png") if len(samples_quantify)>0 else [],
         support_counts = os.path.join(result_path,"counts","support_counts.csv") if len(samples_quantify)>0 else [],
         consensus_counts = os.path.join(result_path,"counts","consensus_counts.csv") if len(samples_quantify)>0 else [],
         promoter_counts = os.path.join(result_path,"counts","promoter_counts.csv") if len(samples_quantify)>0 else [],
diff --git a/workflow/envs/ggplot.yaml b/workflow/envs/ggplot.yaml
@@ -0,0 +1,13 @@
+channels:
+  - conda-forge
+  - bioconda
+  - nodefaults
+dependencies:
+  - r-tidyverse=2.0.0
+  - r-data.table=1.17.6
+  - r-patchwork=1.3.0
+  - r-ggplot2=3.5.2
+  - r-ggnewscale=0.5.1
+  - r-plotly=4.11.0
+  - r-htmlwidgets=1.6.4
+  - r-stringr=1.5.1
diff --git a/workflow/report/sample_annotation.rst b/workflow/report/sample_annotation.rst
@@ -0,0 +1 @@
+Interactive visualization of hierarchically clustered sample QC metrics and annotation.
diff --git a/workflow/rules/quantification.smk b/workflow/rules/quantification.smk
@@ -46,6 +46,8 @@ rule get_consensus_regions:
         chromosome_sizes = config["chromosome_sizes"],
     output:
         consensus_regions = os.path.join(result_path,"counts","consensus_regions.bed"),
+    params:
+        slop_extension = config["slop_extension"],
     resources:
         mem_mb=config.get("mem", "16000"),
     threads: config.get("threads", 2)
diff --git a/workflow/rules/report.smk b/workflow/rules/report.smk
@@ -65,3 +65,27 @@ rule multiqc:
         """
         multiqc {params.result_path}/report --force --verbose --outdir {params.result_path}/report --filename multiqc_report.html --cl-config "{params.multiqc_configs}"
         """
+
+# visualize sample annotation (including QC metrics)
+rule plot_sample_annotation:
+    input:
+        sample_annotation = config["annotation"],
+        sample_annotation_w_QC = os.path.join(result_path, "counts", "sample_annotation.csv"),
+    output:
+        sample_annotation_plot = os.path.join(result_path,"report","sample_annotation.png"),
+        sample_annotation_html = report(os.path.join(result_path,"report","sample_annotation.html"),
+                       caption="../report/sample_annotation.rst",
+                       category="{}_{}".format(config["project_name"], module_name),
+                       subcategory="QC",
+                       labels={
+                           "name": "Sample annotation",
+                           "type": "HTML",
+                           }),
+    log:
+        "logs/rules/plot_sample_annotation.log",
+    resources:
+        mem_mb="4000",
+    conda:
+        "../envs/ggplot.yaml"
+    script:
+        "../scripts/plot_sample_annotation.R"
diff --git a/workflow/scripts/plot_sample_annotation.R b/workflow/scripts/plot_sample_annotation.R
@@ -0,0 +1,205 @@
+#### libraries ####
+library(data.table)
+library(tidyverse)
+library(patchwork)
+library(ggplot2)
+library(ggnewscale)
+library(stringr)
+# for interactive plotting
+library(plotly)
+library(htmlwidgets)
+# set plot base size
+theme_set(theme_minimal(base_size = 6))
+
+#### configs ####
+# input
+sample_annotation_path <- snakemake@input[["sample_annotation"]]
+sample_annotation_w_QC_path <- snakemake@input[["sample_annotation_w_QC"]]
+
+# output
+sample_annotation_plot_path <- snakemake@output[["sample_annotation_plot"]]
+sample_annotation_html_path <- snakemake@output[["sample_annotation_html"]]
+
+#### load & prepare data ####
+# load data
+sample_annotation <- data.table::fread(file.path(sample_annotation_path), header = TRUE)
+sample_annotation <- data.frame(sample_annotation[!duplicated(sample_annotation[[1]]), ], row.names = 1, check.names = FALSE)
+
+anno <- data.frame(fread(file.path(sample_annotation_w_QC_path), header=TRUE), row.names=1)
+
+# determine QC (pipeline provided) columns
+names(sample_annotation) <- gsub(" +", "_", names(sample_annotation)) # replace empty space ` ` with underscore `_`
+qc_cols <- setdiff(names(anno), names(sample_annotation))
+
+# determine metadata (user provided) columns by removing non-numeric columns that are unique for each row (e.g., bam_file)
+sample_annotation <- sample_annotation %>% select(where(\(.x){ n <- dplyr::n_distinct(na.omit(.x)); (is.numeric(.x) || n < nrow(sample_annotation)) }))
+meta_cols <- names(sample_annotation)
+
+# drop columns with no variation (e.g., read_type)
+has_var <- function(x) length(unique(na.omit(x))) > 1
+qc_cols  <- keep(qc_cols,  ~ has_var(anno[[.x]]))
+meta_cols <- keep(meta_cols, ~ has_var(anno[[.x]]))
+
+# collapse "duplicates" (i.e., redundant columns that partition the samples identically) only among non-numeric columns
+num_cols  <- meta_cols[vapply(anno[meta_cols], is.numeric, logical(1))]
+cat_cols  <- setdiff(meta_cols, num_cols)
+sig <- vapply(cat_cols,
+              \(col) paste(match(anno[[col]], unique(anno[[col]])), collapse = "|"),
+              character(1))
+meta_cols <- c(cat_cols[!duplicated(sig)], num_cols)
+
+#### Z-score & cluster QC data ####
+qc_mat <- anno |> select(all_of(qc_cols)) |> scale() |> as.matrix()
+
+row_ord <- hclust(dist(qc_mat))$order
+col_ord <- hclust(dist(t(qc_mat)))$order
+
+qc_long <- as_tibble(qc_mat[row_ord, col_ord], rownames = "sample") |>
+           pivot_longer(-sample, names_to = "metric", values_to = "z")
+
+#### prepare metadata for plotting ####
+meta_long <- anno[row_ord, meta_cols]                                     %>% 
+  mutate(sample = rownames(anno)[row_ord])                                %>% 
+  mutate(across(-sample, as.character))                                   %>% 
+  pivot_longer(-sample, names_to = "meta", values_to = "value")           %>% 
+  group_by(meta)                                                          %>% 
+  mutate(num_val = suppressWarnings(as.numeric(value)),
+         type    = if (all(!is.na(num_val))) "numeric" else "factor",
+         col     = if (type[1] == "numeric") {
+                      scales::col_numeric("plasma",
+                                          domain = range(num_val, na.rm = TRUE))(num_val)
+                    } else {
+                      pal <- scales::hue_pal(l = 65)(n_distinct(value))
+                      setNames(pal, sort(unique(value)))[value]
+                    }
+         )                                                   %>% 
+  ungroup()                                                               %>% 
+  select(-num_val)
+
+#### embed ALL metadata into a tooltip string of interactive plot ####
+meta_txt <- anno[row_ord, meta_cols] %>% mutate(across(everything(), as.character))
+meta_txt <- pmap_chr(meta_txt, \(...) {
+               vals <- c(...)
+               paste(paste(names(vals), vals, sep = ": "), collapse = "<br>")
+           })
+names(meta_txt) <- rownames(anno)[row_ord]
+
+#### plot heatmaps ####
+
+#### QC heatmap ####
+
+# add (un-scaled) metric values (raw) for tooltip of interactive plot
+qc_raw_long <- anno[row_ord, qc_cols] %>%
+               mutate(sample = rownames(anno)[row_ord]) %>% 
+               pivot_longer(-sample, names_to = "metric", values_to = "raw")
+
+qc_long <- qc_long %>% left_join(qc_raw_long, by = c("sample", "metric"))
+
+# keep the clustered order in the plot and add hover-tooltip
+qc_long <- qc_long %>% 
+  mutate(sample = factor(sample,  levels = rownames(qc_mat)[row_ord]), # row-order
+         metric = factor(metric,  levels = colnames(qc_mat)[col_ord]), # col-order
+         hover  = paste0("Sample: ", sample,
+                         "<br>Metric: ", metric,
+                         "<br>Value: ",   signif(raw, 4),
+                         "<br>", meta_txt[as.character(sample)])
+        )
+
+# plot
+p_qc <- ggplot(qc_long, aes(x = metric,
+                            y = sample,
+                            fill = z,
+                            text = hover)) +
+        geom_tile() +
+        scale_x_discrete(limits = colnames(qc_mat)[col_ord]) +           # enforce col order
+        scale_y_discrete(limits = rownames(qc_mat)[row_ord]) +           # enforce row order
+        scale_fill_gradient2(low = "blue", mid = "white", high = "red", name = "z-score",
+                     guide = guide_colourbar(barheight = 2,  # thinner
+                                             barwidth  = 0.15)) +
+        labs(x = NULL, y = NULL, title = "QC metrics (scaled)") +
+        theme(axis.text.x = element_text(angle = 45, hjust = 1),
+              panel.grid = element_blank())
+
+#### metadata heatmap as "annotation" ####
+p_meta <- NULL
+if(length(meta_cols) > 0){
+    # order columns (x) exactly like the QC heatmap
+    meta_long <- meta_long %>% mutate(sample = factor(sample,  levels = rownames(qc_mat)[row_ord]))   # row-order
+    meta_levels <- unique(meta_long$meta)
+    meta_long   <- meta_long %>% mutate(meta = factor(meta, levels = meta_levels))
+    
+    p_meta <- ggplot() +
+              scale_y_discrete(limits = levels(qc_long$sample)) +
+              scale_x_discrete(limits = meta_levels) +
+                labs(x = NULL, y = NULL, title = "Metadata") +
+              theme(axis.text.x = element_text(angle = 45, hjust = 1),
+                    axis.text.y = element_blank(),
+                    axis.title   = element_blank(),
+                    panel.grid   = element_blank())
+    
+    for (v in meta_levels) {
+        dat <- dplyr::filter(meta_long, meta == v)
+    
+        p_meta <- p_meta + ggnewscale::new_scale_fill()   # reset “fill” for this column
+    
+        if (dat$type[1] == "numeric") {                  # continuous legend
+            p_meta <- p_meta +
+                geom_tile(data = dat, aes(x = meta, y = sample, fill = as.numeric(value)), colour = "grey60", linewidth = 0.1) +
+                scale_fill_viridis_c(name = v, option = "plasma",
+                     guide = guide_colourbar(barheight = 2,  # thinner
+                                             barwidth  = 0.15)) 
+        } else {                                         # categorical legend
+            pal <- setNames(dat$col, dat$value)
+
+            # reduce legend in case of more than 10 levels
+            max_items   <- min(10, length(unique(dat$value)))
+            all_levels  <- unique(names(pal))
+            show_levels <- all_levels[1:max_items]
+            
+            p_meta <- p_meta +
+                geom_tile(data = dat, aes(x = meta, y = sample, fill = value), colour = "grey60", linewidth = 0.1) +
+                scale_fill_manual(values = pal, 
+                                  # name = v,
+                                  breaks = show_levels,
+                                  guide = guide_legend(keywidth  = 0.25,
+                                                       keyheight = 0.4,
+                                                       ncol=1,
+                                                       byrow = TRUE,
+                                                       title = ifelse(
+                                                           length(all_levels) <= max_items,
+                                                           v,
+                                                           paste0(v, " (showing ", max_items, "/", length(all_levels), ")")
+                                                           )
+                                                      )
+                                 )
+        }
+    }
+}
+
+#### combine and save plots ####
+p_combined <- if (is.null(p_meta)) p_qc else (p_qc | p_meta) + plot_layout(widths = c(length(qc_cols), length(meta_cols)), guides = "collect") & theme(legend.position = "right")
+
+# determine sizes
+n_rows <- nrow(qc_mat)
+n_cols <- length(qc_cols) + length(meta_cols)
+max_row_label <- max(nchar(rownames(anno)))
+max_col_label <- max(nchar(c(qc_cols, meta_cols)))
+
+height_in <- n_rows * 0.08 + max_col_label * 0.05 + 1
+width_in  <- n_cols * 0.10 + max_row_label * 0.05 + 2
+
+# options(repr.plot.width = width_in, repr.plot.height = height_in)
+# p_combined
+
+ggsave(sample_annotation_plot_path, plot = p_combined, width = width_in, height = height_in, units = "in", dpi = 300)
+
+#### interactive plot ####
+# determine sizes in pixels
+width_px  <- round((length(qc_cols) * 0.10 + max_row_label * 0.05 + 2) * 96)
+height_px <- round(height_in * 96)
+
+p_qc_interactive  <- plotly::ggplotly(p_qc,  tooltip = "text", width = width_px, height = height_px)
+
+# p_qc_interactive
+
+htmlwidgets::saveWidget(p_qc_interactive, sample_annotation_html_path, selfcontained = TRUE, title = "Sample annotation")
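
Reviewer note: the column-selection and scaling logic in `plot_sample_annotation.R` can be hard to follow from the diff alone. Below is a minimal Python sketch of the same idea, not the pipeline's code: it drops columns without variation, collapses categorical columns that partition the samples identically (the `sig` step), and z-scores a metric with the sample standard deviation as R's `scale()` does. The function names, the dict-of-lists table layout, and the example values are hypothetical, and missing values are assumed to be `None`.

```python
from statistics import mean, stdev


def has_var(values):
    """True if the column has more than one distinct non-missing value."""
    return len({v for v in values if v is not None}) > 1


def signature(values):
    """Encode each value by the order of its first occurrence.

    Two columns that are mere relabelings of each other (same partition
    of the samples) receive the same signature string.
    """
    first_seen = {}
    codes = []
    for v in values:
        if v not in first_seen:
            first_seen[v] = len(first_seen) + 1
        codes.append(first_seen[v])
    return "|".join(str(c) for c in codes)


def is_numeric(values):
    """True if all non-missing values are int or float."""
    return all(isinstance(v, (int, float)) for v in values if v is not None)


def select_meta_columns(table):
    """Return column names to plot: varying columns, with redundant
    categorical columns collapsed; numeric columns are never collapsed."""
    varying = [c for c, vals in table.items() if has_var(vals)]
    num_cols = [c for c in varying if is_numeric(table[c])]
    cat_cols = [c for c in varying if c not in num_cols]
    seen, kept = set(), []
    for c in cat_cols:
        sig = signature(table[c])
        if sig not in seen:  # keep only the first column per signature
            seen.add(sig)
            kept.append(c)
    return kept + num_cols


def zscore(values):
    """Center and scale by the sample standard deviation (n - 1).

    Assumes no missing values and a non-constant column.
    """
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical metadata for four samples.
meta = {
    "condition": ["ctrl", "ctrl", "treat", "treat"],
    "group": ["A", "A", "B", "B"],  # same partition as condition
    "read_type": ["paired"] * 4,    # no variation
    "batch": ["b1", "b2", "b1", "b2"],
    "age": [3, 5, 3, 8],
}
selected = select_meta_columns(meta)
z = zscore([10.0, 20.0, 30.0])
print(selected, z)
```

In this example `group` is dropped as redundant with `condition` and `read_type` is dropped for lack of variation, which mirrors what the R script is intended to do before plotting the metadata annotation.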
