Conversation
|
I've pushed significant progress and the code is starting to grow. I believe commits can be reviewed incrementally for your convenience. I would like to find a way to pass enrichment result files to make_app_from_files, but CLI arguments are not that convenient: The make_app_from_files tool accepts one assay with a list of differential expression tables, one per contrast. For each contrast, there is a list of gene set types. For each gene set type there are up to two files associated (up and down are reported separately in gsea). For 4 contrasts and 3 gene set types I would need to report 24 tsv files. This is not convenient in the command line. My suggestion is to add functional enrichment support to that script by passing a list of gene set types separated by commas in an argument, and passing a file name template. The tool will assume that file names from the enrichment tool follow a predictable naming based on the template, and will substitute contrast information and the gene set to find all the tsv files. This solution avoids bloating the command line arguments and it is flexible enough to work well with the differential abundance nfcore pipeline. I'm happy to get feedback about this idea |
|
Still needs more testing, tests and documentation, but the core work is already here. |
There was a problem hiding this comment.
Pull Request Overview
This PR adds support for GSEA (Gene Set Enrichment Analysis) output files to shinyngs. The implementation includes a new gene_set_analyses_tool slot that allows users to specify the enrichment tool used (GSEA or ROAST), with automatic detection as the default. The PR updates the gene set enrichment analysis and barcode plot modules to correctly filter tables based on user-specified p-value and FDR thresholds, applying filters to the appropriate columns depending on the enrichment tool used.
Key changes:
- New
gene_set_analyses_toolslot in ExploratorySummarizedExperiment class for tool-specific column handling - Enhanced support for reading GSEA results split into separate up/down regulation files
- Auto-detection heuristic to identify enrichment tool format from column names
- Updated filtering logic in gene set analysis modules to use tool-specific column names
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| R/ExploratorySummarizedExperiment-class.R | Adds gene_set_analyses_tool slot and validation function, modernizes class definition syntax |
| R/ExploratorySummarizedExperimentList-class.R | Modernizes class definition from representation to slots, adds warning suppression for numeric type checking |
| R/accessory.R | Implements GSEA file reading logic, adds get_gst_columns for tool-specific column detection and remove_nulls helper |
| R/genesetanalysistable.R | Updates filtering to use tool-specific column names for p-values, FDR, and direction |
| R/genesetbarcodeplot.R | Updates to retrieve FDR and direction using tool-specific column names |
| exec/make_app_from_files.R | Adds command-line options for enrichment analysis and implements filename template parsing for GSEA results |
| man/*.Rd | Updates documentation for new parameters and slots |
| vignettes/shinyngs.Rmd | Adds explicit library load for better code clarity |
| DESCRIPTION | Updates R version requirement and RoxygenNote version |
| R/shinyngs.R | Modernizes package documentation syntax |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # structure. gene_set_analyses_tool should have a single string | ||
|
|
||
| # Create default structure if necessary: | ||
| out <- list() |
There was a problem hiding this comment.
Unused variable out is created but never used. This line can be removed.
| out <- list() |
| check_gene_set_analyses_tool_consistency <- function(gene_set_analyses, gene_set_analyses_tool) { | ||
| # gene_set_analyses and gene_set_analyses_tool should have the same list of lists | ||
| # structure. gene_set_analyses_tool should have a single string | ||
|
|
||
| # Create default structure if necessary: | ||
| out <- list() | ||
|
|
||
| if (is.null(gene_set_analyses_tool)) { | ||
| gene_set_analyses_tool <- list() | ||
| } | ||
| for (assay_name in names(gene_set_analyses)) { | ||
| if (!assay_name %in% names(gene_set_analyses_tool)) { | ||
| gene_set_analyses_tool[[assay_name]] <- list() | ||
| } | ||
| for (gs_type in names(gene_set_analyses[[assay_name]])) { | ||
| if (! gs_type %in% names(gene_set_analyses_tool[[assay_name]])) { | ||
| gene_set_analyses_tool[[assay_name]][[gs_type]] <- list() | ||
| } | ||
| for (contrast_name in names(gene_set_analyses[[assay_name]][[gs_type]])) { | ||
| if (! contrast_name %in% names(gene_set_analyses_tool[[assay_name]][[gs_type]])) { | ||
| gene_set_analyses_tool[[assay_name]][[gs_type]][[contrast_name]] <- "auto" | ||
| } else { | ||
| tool_name <- gene_set_analyses_tool[[assay_name]][[gs_type]][[contrast_name]] | ||
| if (!is.character(tool_name) || length(tool_name) > 1) { | ||
| stop(paste0("Invalid gene_set_analyses_tool. gene_set_analyses_tool for ", | ||
| gs_type, " and ", contrast_name, " should be one of 'auto', 'gsea' or 'roast'. Found ", | ||
| paste0(tool_name, collapse=","))) | ||
| } | ||
| if (! tool_name %in% c("auto", "gsea", "roast")) { | ||
| stop(paste0("Invalid gene_set_analyses_tool. gene_set_analyses_tool for ", | ||
| gs_type, " and ", contrast_name, " should be one of 'auto', 'gsea' or 'roast'. Found ", | ||
| tool_name)) | ||
| } | ||
| } | ||
| } | ||
| # In case gene_set_analyses_tool has other entries, just keep the ones matching gene_set_analyses | ||
| contrasts_ordered <- names(gene_set_analyses[[assay_name]][[gs_type]]) | ||
| gene_set_analyses_tool[[assay_name]][[gs_type]] <- gene_set_analyses_tool[[assay_name]][[gs_type]][contrasts_ordered] | ||
| } | ||
| gene_set_type_names_ordered <- names(gene_set_analyses[[assay_name]]) | ||
| gene_set_analyses_tool[[assay_name]] <- gene_set_analyses_tool[[assay_name]][gene_set_type_names_ordered] | ||
| } | ||
| analysis_names_ordered <- names(gene_set_analyses_tool) | ||
| gene_set_analyses_tool <- gene_set_analyses_tool[analysis_names_ordered] | ||
| gene_set_analyses_tool | ||
| } |
There was a problem hiding this comment.
| check_gene_set_analyses_tool_consistency <- function(gene_set_analyses, gene_set_analyses_tool) { | |
| # gene_set_analyses and gene_set_analyses_tool should have the same list of lists | |
| # structure. gene_set_analyses_tool should have a single string | |
| # Create default structure if necessary: | |
| out <- list() | |
| if (is.null(gene_set_analyses_tool)) { | |
| gene_set_analyses_tool <- list() | |
| } | |
| for (assay_name in names(gene_set_analyses)) { | |
| if (!assay_name %in% names(gene_set_analyses_tool)) { | |
| gene_set_analyses_tool[[assay_name]] <- list() | |
| } | |
| for (gs_type in names(gene_set_analyses[[assay_name]])) { | |
| if (! gs_type %in% names(gene_set_analyses_tool[[assay_name]])) { | |
| gene_set_analyses_tool[[assay_name]][[gs_type]] <- list() | |
| } | |
| for (contrast_name in names(gene_set_analyses[[assay_name]][[gs_type]])) { | |
| if (! contrast_name %in% names(gene_set_analyses_tool[[assay_name]][[gs_type]])) { | |
| gene_set_analyses_tool[[assay_name]][[gs_type]][[contrast_name]] <- "auto" | |
| } else { | |
| tool_name <- gene_set_analyses_tool[[assay_name]][[gs_type]][[contrast_name]] | |
| if (!is.character(tool_name) || length(tool_name) > 1) { | |
| stop(paste0("Invalid gene_set_analyses_tool. gene_set_analyses_tool for ", | |
| gs_type, " and ", contrast_name, " should be one of 'auto', 'gsea' or 'roast'. Found ", | |
| paste0(tool_name, collapse=","))) | |
| } | |
| if (! tool_name %in% c("auto", "gsea", "roast")) { | |
| stop(paste0("Invalid gene_set_analyses_tool. gene_set_analyses_tool for ", | |
| gs_type, " and ", contrast_name, " should be one of 'auto', 'gsea' or 'roast'. Found ", | |
| tool_name)) | |
| } | |
| } | |
| } | |
| # In case gene_set_analyses_tool has other entries, just keep the ones matching gene_set_analyses | |
| contrasts_ordered <- names(gene_set_analyses[[assay_name]][[gs_type]]) | |
| gene_set_analyses_tool[[assay_name]][[gs_type]] <- gene_set_analyses_tool[[assay_name]][[gs_type]][contrasts_ordered] | |
| } | |
| gene_set_type_names_ordered <- names(gene_set_analyses[[assay_name]]) | |
| gene_set_analyses_tool[[assay_name]] <- gene_set_analyses_tool[[assay_name]][gene_set_type_names_ordered] | |
| } | |
| analysis_names_ordered <- names(gene_set_analyses_tool) | |
| gene_set_analyses_tool <- gene_set_analyses_tool[analysis_names_ordered] | |
| gene_set_analyses_tool | |
| } | |
| check_gene_set_analyses_tool_consistency <- function(gene_set_analyses, gene_set_analyses_tool) { | |
| if (is.null(gene_set_analyses_tool)) { | |
| gene_set_analyses_tool <- list() | |
| } | |
| valid_tools <- c("auto", "gsea", "roast") | |
| # Recursive function to mirror structure and validate | |
| mirror_and_validate <- function(gsa_node, gst_node, path = "") { | |
| if (is.data.frame(gsa_node)) { | |
| # Leaf node - should have a tool string | |
| tool <- if (is.null(gst_node)) "auto" else gst_node | |
| if (!is.character(tool) || length(tool) != 1 || !tool %in% valid_tools) { | |
| stop(sprintf("Invalid gene_set_analyses_tool at %s: expected one of %s, got %s", | |
| path, paste(valid_tools, collapse=", "), paste(tool, collapse=","))) | |
| } | |
| return(tool) | |
| } | |
| # Recurse through structure | |
| lapply(setNames(names(gsa_node), names(gsa_node)), function(name) { | |
| mirror_and_validate(gsa_node[[name]], gst_node[[name]], paste0(path, "/", name)) | |
| }) | |
| } | |
| mirror_and_validate(gene_set_analyses, gene_set_analyses_tool) | |
| } |
I found this function quite hard to understand- my AI friend suggested this, could you try it?
There was a problem hiding this comment.
Your AI friend gave me a good starting point, although its suggestion was a bit flawed. I believe I have now implemented something easier to read
| # Basic list to pass to object creation | ||
| exp$gene_set_analyses_tool <- check_gene_set_analyses_tool_consistency(exp$gene_set_analyses, exp$gene_set_analyses_tool) | ||
|
|
||
| ese_list$gene_set_analyses <- lapply(exp$gene_set_analyses, function(assay) { | ||
| lapply(assay, function(gene_set_type) { | ||
| lapply(gene_set_type, function(contrast) { | ||
| read.csv(contrast, check.names = FALSE, stringsAsFactors = FALSE, row.names = 1) | ||
| # contrast may be one file name or two file names (up and down), or NULL | ||
| if (is.null(contrast) || length(contrast) == 0) { | ||
| NULL | ||
| } else if (length(contrast) == 1) { | ||
| read.csv(contrast, sep=getSeparator(contrast), | ||
| check.names = FALSE, stringsAsFactors = FALSE, row.names = 1) | ||
| } else if (length(contrast) == 2) { | ||
| # This is useful for GSEA output, that splits up and down in two tsv files. | ||
| # We read both files and set the direction | ||
| up <- read.csv(contrast[["up"]], sep=getSeparator(contrast[["up"]]), | ||
| check.names = FALSE, stringsAsFactors = FALSE, row.names = 1) | ||
| up$Direction <- rep("Up", nrow(up)) | ||
| down <- read.csv(contrast[["down"]], sep=getSeparator(contrast[["down"]]), | ||
| check.names = FALSE, stringsAsFactors = FALSE, row.names = 1) | ||
| down$Direction <- rep("Down", nrow(down)) | ||
| rbind(up, down) | ||
| } else { | ||
| stop("gene_set_analyses should have zero, one or two contrast files per gene_set_type") | ||
| } | ||
| }) | ||
| }) | ||
| }) | ||
|
|
||
| ese_list$gene_set_analyses <- remove_nulls(ese_list$gene_set_analyses) |
There was a problem hiding this comment.
Maybe a helper function here?
read_enrichment_file <- function(contrast_spec) {
if (is.null(contrast_spec) || length(contrast_spec) == 0) {
return(NULL)
}
read_one <- function(path) {
read.csv(path, sep = getSeparator(path), check.names = FALSE,
stringsAsFactors = FALSE, row.names = 1)
}
if (length(contrast_spec) == 1) {
return(read_one(contrast_spec))
}
if (length(contrast_spec) == 2) {
up <- read_one(contrast_spec[["up"]])
up$Direction <- "Up"
down <- read_one(contrast_spec[["down"]])
down$Direction <- "Down"
return(rbind(up, down))
}
stop("gene_set_analyses should have 0, 1, or 2 files per contrast")
}
# Then simplify the nested lapply:
ese_list$gene_set_analyses <- lapply(exp$gene_set_analyses, function(assay) {
lapply(assay, function(gene_set_type) {
lapply(gene_set_type, read_enrichment_file)
})
})
R/accessory.R
Outdated
| # Gene set enrichment format and filtering columns | ||
| # gst: A data frame coming from the gene enrichment tool (roast, gsea) | ||
| # gs_tool: A string, the tool used to get gst ("auto", "gsea", "roast") | ||
| get_gst_columns <- function(gst, gs_tool) { |
There was a problem hiding this comment.
This would probably be easier to maintain with some more separation of concerns. Rather than having the function do lots of things and unpack:
detect_enrichment_tool <- function(gst) {
if ("NOM p-val" %in% colnames(gst)) return("gsea")
if (any(c("p value", "PValue") %in% colnames(gst))) return("roast")
stop("Could not detect enrichment tool from column names")
}
get_column_mapping <- function(gs_tool) {
mappings <- list(
roast = list(pvalue = "PValue", fdr = "FDR", direction = "Direction"),
gsea = list(pvalue = "NOM p-val", fdr = "FDR q-val", direction = "Direction")
)
# Handle roast variant with lowercase "p value"
if (gs_tool == "roast") {
mappings$roast$pvalue <- "PValue" # Check and override below
}
mappings[[gs_tool]]
}
clean_gsea_columns <- function(gst) {
cols_to_remove <- c("GS<br> follow link to MSigDB", "GS DETAILS")
gst[, !(names(gst) %in% cols_to_remove), drop = FALSE]
}
| # "gsea_for_{contrast_name}_using_{geneset_type}_results.tsv" | ||
| # becomes: | ||
| # "gsea_for_disease_vs_healthy_using_geo_bp_results.tsv" | ||
| if (grepl("target\\|reference", opt$enrichment_filename_template)) { |
There was a problem hiding this comment.
We should centralise the path building logic so we only have to maintain it in one place:
build_enrichment_path <- function(template, contrast_info, geneset_type, direction = NULL) {
path <- template
path <- gsub("{contrast_name}", contrast_info$id, path, fixed = TRUE)
path <- gsub("{geneset_type}", geneset_type, path, fixed = TRUE)
if (!is.null(direction)) {
target_val <- if (direction == "up") contrast_info$target else contrast_info$reference
path <- gsub("{target|reference}", target_val, path, fixed = TRUE)
}
path
}
# Then use:
if (grepl("\\{target\\|reference\\}", template)) {
up_file <- build_enrichment_path(template, ctrst_info, geneset_type, "up")
down_file <- build_enrichment_path(template, ctrst_info, geneset_type, "down")
# ... rest of logic
} else {
enrichment_file <- build_enrichment_path(template, ctrst_info, geneset_type)
# ... rest of logic
}
Function will probably need to go in accessory.R
pinin4fjords
left a comment
There was a problem hiding this comment.
Thanks again for the work! I know you're still working so I won't comment too much on lack of tests / docs (though the AI has done a bit of that).
In general it would be good if as much of the new logic as possible is in functions where it can have tests, rather than e.g. in large tranches of logic in the exec script.
|
I agree with your suggestions and I'll follow up implementing them on the following days. Thank you very much for your review! |
We know, and we don't care in this case
The SummarizedExperiment class was removed from GenomicRanges (after having moved it to the SummarizedExperiment package) back in Bioconductor 3.3 [1], released in May 2016 and targetting R 3.3. This commit removes code targeting support for Bioconductor versions older than Bioconductor 3.3. It slightly bumps the R version dependency to explicitly break with that old Bioconductor version. [1] https://bioconductor.org/news/bioc_3_3_release/
Back in Sept 2016, the representation= argument of setClass was documented as old, and the slots= argument was suggested instead [1]. [1] wch/r-source@edadf29 In 2016 it was already old, so now it is considered even older. This commit just updates the syntax to ease maintenance.
Look at gene_set_analyses being used here: https://github.com/pinin4fjords/shinyngs/blob/d2ca950934070c699c839aea45e77053bc461722/R/genesetanalysistable.R#L239 gst <- ese@gene_set_analyses[[assay]][[gene_set_types]][[as.numeric(selected_contrasts)]] It's indexed by assay, geneset type and contrast. Fix the docs so the explanation matches reality.
The gene_set_analyses_tool has a list-based structure, just like gene_set_analyses. gene_set_analyses_tool[[assay]][[gene_set_type]][[contrast]] contains a string, either "auto", "gsea", or "roast". This string can be used by the geneset-related modules to parse and interact with the gene set enrichment tables.
Ensure each given gene_set_analysis has a tool set, and set it to "auto" otherwise.
Just a helper function
Previous library(shinyngs) statements are on eval=FALSE chunks.
When reading csv files from each assay, gene_set type and contrast, allow to not have enrichment results for all combinations.
These are clear documentations and fixes from code review by Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
So it is easier to read
The names are used in the enrichment file name template subsitution, so it is better to avoid character mangling.
|
It took me a long while to find some time to finish this (definitely longer than expected). I implemented all the suggested changes, added unit tests to the accessory functions, improved docstrings and rebased. Happy to get your feedback and hopefully be able to merge this eventually :-) |
|
Looking pretty good now! Do you have some example outputs to show? |
|
I'll find a public dataset and upload some plots as soon as I find some more time. |
This is a draft pull request that will add support for GSEA output files to shinyngs.
Still work in progress.
Status/Roadmap
ExploratorySummarizedExperimentthere is a new slot for thegene_set_analyses_tool. This slot allows the user to specify the tool used for gene enrichment. Depending on the tool, the package processes the enrichment files, applying the pvalue/FDR filters from the user to the corresponding columns. By default a simple auto detect heuristic is implemented.Closes #83