Skip to content

GSEA support#84

Open
zeehio wants to merge 31 commits intopinin4fjords:developfrom
zeehio:gsea-support
Open

GSEA support#84
zeehio wants to merge 31 commits intopinin4fjords:developfrom
zeehio:gsea-support

Conversation

@zeehio
Copy link
Contributor

@zeehio zeehio commented Oct 29, 2025

This is a draft pull request that will add support for GSEA output files to shinyngs.

Still work in progress.

Status/Roadmap

  • Able to read enrichment results when two csv files are given separately (up/down)
  • In the ExploratorySummarizedExperiment there is a new slot for the gene_set_analyses_tool. This slot allows the user to specify the tool used for gene enrichment. Depending on the tool, the package processes the enrichment files, applying the pvalue/FDR filters from the user to the corresponding columns. By default a simple auto detect heuristic is implemented.
  • The gene set enrichment analysis module and the gene set barcode module are updated so they filter the table based on the user given pvalue/FDR values, applying filters to the correct columns.
  • make app from files script
  • documentation and tests

Closes #83

@pinin4fjords pinin4fjords self-requested a review October 31, 2025 12:37
@zeehio
Copy link
Contributor Author

zeehio commented Nov 8, 2025

I've pushed significant progress and the code is starting to grow. I believe commits can be reviewed incrementally for your convenience.

I would like to find a way to pass enrichment result files to make_app_from_files, but CLI arguments are not that convenient: The make_app_from_files tool accepts one assay with a list of differential expression tables, one per contrast.

For each contrast, there is a list of gene set types. For each gene set type there are up to two files associated (up and down are reported separately in gsea). For 4 contrasts and 3 gene set types I would need to report 24 tsv files. This is not convenient in the command line.

My suggestion is to add functional enrichment support to that script by passing a list of gene set types separated by commas in an argument, and passing a file name template. The tool will assume that file names from the enrichment tool follow a predictable naming based on the template, and will substitute contrast information and the gene set to find all the tsv files.

This solution avoids bloating the command line arguments and it is flexible enough to work well with the differential abundance nfcore pipeline.

I'm happy to get feedback about this idea

@zeehio
Copy link
Contributor Author

zeehio commented Nov 9, 2025

Still needs more testing, tests and documentation, but the core work is already here.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for GSEA (Gene Set Enrichment Analysis) output files to shinyngs. The implementation includes a new gene_set_analyses_tool slot that allows users to specify the enrichment tool used (GSEA or ROAST), with automatic detection as the default. The PR updates the gene set enrichment analysis and barcode plot modules to correctly filter tables based on user-specified p-value and FDR thresholds, applying filters to the appropriate columns depending on the enrichment tool used.

Key changes:

  • New gene_set_analyses_tool slot in ExploratorySummarizedExperiment class for tool-specific column handling
  • Enhanced support for reading GSEA results split into separate up/down regulation files
  • Auto-detection heuristic to identify enrichment tool format from column names
  • Updated filtering logic in gene set analysis modules to use tool-specific column names

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
R/ExploratorySummarizedExperiment-class.R Adds gene_set_analyses_tool slot and validation function, modernizes class definition syntax
R/ExploratorySummarizedExperimentList-class.R Modernizes class definition from representation to slots, adds warning suppression for numeric type checking
R/accessory.R Implements GSEA file reading logic, adds get_gst_columns for tool-specific column detection and remove_nulls helper
R/genesetanalysistable.R Updates filtering to use tool-specific column names for p-values, FDR, and direction
R/genesetbarcodeplot.R Updates to retrieve FDR and direction using tool-specific column names
exec/make_app_from_files.R Adds command-line options for enrichment analysis and implements filename template parsing for GSEA results
man/*.Rd Updates documentation for new parameters and slots
vignettes/shinyngs.Rmd Adds explicit library load for better code clarity
DESCRIPTION Updates R version requirement and RoxygenNote version
R/shinyngs.R Modernizes package documentation syntax

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# structure. gene_set_analyses_tool should have a single string

# Create default structure if necessary:
out <- list()
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused variable out is created but never used. This line can be removed.

Suggested change
out <- list()

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +130 to +175
check_gene_set_analyses_tool_consistency <- function(gene_set_analyses, gene_set_analyses_tool) {
# gene_set_analyses and gene_set_analyses_tool should have the same list of lists
# structure. gene_set_analyses_tool should have a single string

# Create default structure if necessary:
out <- list()

if (is.null(gene_set_analyses_tool)) {
gene_set_analyses_tool <- list()
}
for (assay_name in names(gene_set_analyses)) {
if (!assay_name %in% names(gene_set_analyses_tool)) {
gene_set_analyses_tool[[assay_name]] <- list()
}
for (gs_type in names(gene_set_analyses[[assay_name]])) {
if (! gs_type %in% names(gene_set_analyses_tool[[assay_name]])) {
gene_set_analyses_tool[[assay_name]][[gs_type]] <- list()
}
for (contrast_name in names(gene_set_analyses[[assay_name]][[gs_type]])) {
if (! contrast_name %in% names(gene_set_analyses_tool[[assay_name]][[gs_type]])) {
gene_set_analyses_tool[[assay_name]][[gs_type]][[contrast_name]] <- "auto"
} else {
tool_name <- gene_set_analyses_tool[[assay_name]][[gs_type]][[contrast_name]]
if (!is.character(tool_name) || length(tool_name) > 1) {
stop(paste0("Invalid gene_set_analyses_tool. gene_set_analyses_tool for ",
gs_type, " and ", contrast_name, " should be one of 'auto', 'gsea' or 'roast'. Found ",
paste0(tool_name, collapse=",")))
}
if (! tool_name %in% c("auto", "gsea", "roast")) {
stop(paste0("Invalid gene_set_analyses_tool. gene_set_analyses_tool for ",
gs_type, " and ", contrast_name, " should be one of 'auto', 'gsea' or 'roast'. Found ",
tool_name))
}
}
}
# In case gene_set_analyses_tool has other entries, just keep the ones matching gene_set_analyses
contrasts_ordered <- names(gene_set_analyses[[assay_name]][[gs_type]])
gene_set_analyses_tool[[assay_name]][[gs_type]] <- gene_set_analyses_tool[[assay_name]][[gs_type]][contrasts_ordered]
}
gene_set_type_names_ordered <- names(gene_set_analyses[[assay_name]])
gene_set_analyses_tool[[assay_name]] <- gene_set_analyses_tool[[assay_name]][gene_set_type_names_ordered]
}
analysis_names_ordered <- names(gene_set_analyses_tool)
gene_set_analyses_tool <- gene_set_analyses_tool[analysis_names_ordered]
gene_set_analyses_tool
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
check_gene_set_analyses_tool_consistency <- function(gene_set_analyses, gene_set_analyses_tool) {
# gene_set_analyses and gene_set_analyses_tool should have the same list of lists
# structure. gene_set_analyses_tool should have a single string
# Create default structure if necessary:
out <- list()
if (is.null(gene_set_analyses_tool)) {
gene_set_analyses_tool <- list()
}
for (assay_name in names(gene_set_analyses)) {
if (!assay_name %in% names(gene_set_analyses_tool)) {
gene_set_analyses_tool[[assay_name]] <- list()
}
for (gs_type in names(gene_set_analyses[[assay_name]])) {
if (! gs_type %in% names(gene_set_analyses_tool[[assay_name]])) {
gene_set_analyses_tool[[assay_name]][[gs_type]] <- list()
}
for (contrast_name in names(gene_set_analyses[[assay_name]][[gs_type]])) {
if (! contrast_name %in% names(gene_set_analyses_tool[[assay_name]][[gs_type]])) {
gene_set_analyses_tool[[assay_name]][[gs_type]][[contrast_name]] <- "auto"
} else {
tool_name <- gene_set_analyses_tool[[assay_name]][[gs_type]][[contrast_name]]
if (!is.character(tool_name) || length(tool_name) > 1) {
stop(paste0("Invalid gene_set_analyses_tool. gene_set_analyses_tool for ",
gs_type, " and ", contrast_name, " should be one of 'auto', 'gsea' or 'roast'. Found ",
paste0(tool_name, collapse=",")))
}
if (! tool_name %in% c("auto", "gsea", "roast")) {
stop(paste0("Invalid gene_set_analyses_tool. gene_set_analyses_tool for ",
gs_type, " and ", contrast_name, " should be one of 'auto', 'gsea' or 'roast'. Found ",
tool_name))
}
}
}
# In case gene_set_analyses_tool has other entries, just keep the ones matching gene_set_analyses
contrasts_ordered <- names(gene_set_analyses[[assay_name]][[gs_type]])
gene_set_analyses_tool[[assay_name]][[gs_type]] <- gene_set_analyses_tool[[assay_name]][[gs_type]][contrasts_ordered]
}
gene_set_type_names_ordered <- names(gene_set_analyses[[assay_name]])
gene_set_analyses_tool[[assay_name]] <- gene_set_analyses_tool[[assay_name]][gene_set_type_names_ordered]
}
analysis_names_ordered <- names(gene_set_analyses_tool)
gene_set_analyses_tool <- gene_set_analyses_tool[analysis_names_ordered]
gene_set_analyses_tool
}
check_gene_set_analyses_tool_consistency <- function(gene_set_analyses, gene_set_analyses_tool) {
if (is.null(gene_set_analyses_tool)) {
gene_set_analyses_tool <- list()
}
valid_tools <- c("auto", "gsea", "roast")
# Recursive function to mirror structure and validate
mirror_and_validate <- function(gsa_node, gst_node, path = "") {
if (is.data.frame(gsa_node)) {
# Leaf node - should have a tool string
tool <- if (is.null(gst_node)) "auto" else gst_node
if (!is.character(tool) || length(tool) != 1 || !tool %in% valid_tools) {
stop(sprintf("Invalid gene_set_analyses_tool at %s: expected one of %s, got %s",
path, paste(valid_tools, collapse=", "), paste(tool, collapse=",")))
}
return(tool)
}
# Recurse through structure
lapply(setNames(names(gsa_node), names(gsa_node)), function(name) {
mirror_and_validate(gsa_node[[name]], gst_node[[name]], paste0(path, "/", name))
})
}
mirror_and_validate(gene_set_analyses, gene_set_analyses_tool)
}

I found this function quite hard to understand- my AI friend suggested this, could you try it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your AI friend gave me a good starting point, although its suggestion was a bit flawed. I believe I have now implemented something easier to read

Comment on lines +685 to +714
# Basic list to pass to object creation
exp$gene_set_analyses_tool <- check_gene_set_analyses_tool_consistency(exp$gene_set_analyses, exp$gene_set_analyses_tool)

ese_list$gene_set_analyses <- lapply(exp$gene_set_analyses, function(assay) {
lapply(assay, function(gene_set_type) {
lapply(gene_set_type, function(contrast) {
read.csv(contrast, check.names = FALSE, stringsAsFactors = FALSE, row.names = 1)
# contrast may be one file name or two file names (up and down), or NULL
if (is.null(contrast) || length(contrast) == 0) {
NULL
} else if (length(contrast) == 1) {
read.csv(contrast, sep=getSeparator(contrast),
check.names = FALSE, stringsAsFactors = FALSE, row.names = 1)
} else if (length(contrast) == 2) {
# This is useful for GSEA output, that splits up and down in two tsv files.
# We read both files and set the direction
up <- read.csv(contrast[["up"]], sep=getSeparator(contrast[["up"]]),
check.names = FALSE, stringsAsFactors = FALSE, row.names = 1)
up$Direction <- rep("Up", nrow(up))
down <- read.csv(contrast[["down"]], sep=getSeparator(contrast[["down"]]),
check.names = FALSE, stringsAsFactors = FALSE, row.names = 1)
down$Direction <- rep("Down", nrow(down))
rbind(up, down)
} else {
stop("gene_set_analyses should have zero, one or two contrast files per gene_set_type")
}
})
})
})

ese_list$gene_set_analyses <- remove_nulls(ese_list$gene_set_analyses)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a helper function here?

read_enrichment_file <- function(contrast_spec) {
  if (is.null(contrast_spec) || length(contrast_spec) == 0) {
    return(NULL)
  }
  
  read_one <- function(path) {
    read.csv(path, sep = getSeparator(path), check.names = FALSE, 
             stringsAsFactors = FALSE, row.names = 1)
  }
  
  if (length(contrast_spec) == 1) {
    return(read_one(contrast_spec))
  }
  
  if (length(contrast_spec) == 2) {
    up <- read_one(contrast_spec[["up"]])
    up$Direction <- "Up"
    down <- read_one(contrast_spec[["down"]])
    down$Direction <- "Down"
    return(rbind(up, down))
  }
  
  stop("gene_set_analyses should have 0, 1, or 2 files per contrast")
}

# Then simplify the nested lapply:
ese_list$gene_set_analyses <- lapply(exp$gene_set_analyses, function(assay) {
  lapply(assay, function(gene_set_type) {
    lapply(gene_set_type, read_enrichment_file)
  })
})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :-)

R/accessory.R Outdated
# Gene set enrichment format and filtering columns
# gst: A data frame coming from the gene enrichment tool (roast, gsea)
# gs_tool: A string, the tool used to get gst ("auto", "gsea", "roast")
get_gst_columns <- function(gst, gs_tool) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would probably be easier to maintain with some more separation of concerns. Rather than having the function do lots of things and unpack:

detect_enrichment_tool <- function(gst) {
  if ("NOM p-val" %in% colnames(gst)) return("gsea")
  if (any(c("p value", "PValue") %in% colnames(gst))) return("roast")
  stop("Could not detect enrichment tool from column names")
}

get_column_mapping <- function(gs_tool) {
  mappings <- list(
    roast = list(pvalue = "PValue", fdr = "FDR", direction = "Direction"),
    gsea = list(pvalue = "NOM p-val", fdr = "FDR q-val", direction = "Direction")
  )
  
  # Handle roast variant with lowercase "p value"
  if (gs_tool == "roast") {
    mappings$roast$pvalue <- "PValue"  # Check and override below
  }
  
  mappings[[gs_tool]]
}

clean_gsea_columns <- function(gst) {
  cols_to_remove <- c("GS<br> follow link to MSigDB", "GS DETAILS")
  gst[, !(names(gst) %in% cols_to_remove), drop = FALSE]
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

# "gsea_for_{contrast_name}_using_{geneset_type}_results.tsv"
# becomes:
# "gsea_for_disease_vs_healthy_using_geo_bp_results.tsv"
if (grepl("target\\|reference", opt$enrichment_filename_template)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should centralise the path building logic so we only have to maintain it in one place:

build_enrichment_path <- function(template, contrast_info, geneset_type, direction = NULL) {
  path <- template
  path <- gsub("{contrast_name}", contrast_info$id, path, fixed = TRUE)
  path <- gsub("{geneset_type}", geneset_type, path, fixed = TRUE)
  if (!is.null(direction)) {
    target_val <- if (direction == "up") contrast_info$target else contrast_info$reference
    path <- gsub("{target|reference}", target_val, path, fixed = TRUE)
  }
  path
}

# Then use:
if (grepl("\\{target\\|reference\\}", template)) {
  up_file <- build_enrichment_path(template, ctrst_info, geneset_type, "up")
  down_file <- build_enrichment_path(template, ctrst_info, geneset_type, "down")
  # ... rest of logic
} else {
  enrichment_file <- build_enrichment_path(template, ctrst_info, geneset_type)
  # ... rest of logic
}

Function will probably need to go in accessory.R

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Owner

@pinin4fjords pinin4fjords left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the work! I know you're still working so I won't comment too much on lack of tests / docs (though the AI has done a bit of that).

In general it would be good if as much of the new logic as possible is in functions where it can have tests, rather than e.g. in large tranches of logic in the exec script.

@zeehio
Copy link
Contributor Author

zeehio commented Nov 12, 2025

I agree with your suggestions and I'll follow up implementing them on the following days. Thank you very much for your review!

@zeehio zeehio marked this pull request as ready for review January 25, 2026 18:42
zeehio added 16 commits January 25, 2026 19:44
We know, and we don't care in this case
The SummarizedExperiment class was removed from GenomicRanges (after having moved it to the SummarizedExperiment package) back in Bioconductor 3.3 [1], released in May 2016 and targetting R 3.3.

This commit removes code targeting support for Bioconductor versions older than Bioconductor 3.3.

It slightly bumps the R version dependency to explicitly break with that old Bioconductor version.

[1] https://bioconductor.org/news/bioc_3_3_release/
Back in Sept 2016, the representation= argument of setClass was documented as old, and the slots= argument was suggested instead [1].

[1] wch/r-source@edadf29

In 2016 it was already old, so now it is considered even older.

This commit just updates the syntax to ease maintenance.
Look at gene_set_analyses being used here:

https://github.com/pinin4fjords/shinyngs/blob/d2ca950934070c699c839aea45e77053bc461722/R/genesetanalysistable.R#L239

gst <- ese@gene_set_analyses[[assay]][[gene_set_types]][[as.numeric(selected_contrasts)]]

It's indexed by assay, geneset type and contrast.

Fix the docs so the explanation matches reality.
The gene_set_analyses_tool has a list-based structure, just like gene_set_analyses.

gene_set_analyses_tool[[assay]][[gene_set_type]][[contrast]] contains a string, either "auto", "gsea", or "roast".

This string can be used by the geneset-related modules to parse and interact with the gene set enrichment tables.
Ensure each given gene_set_analysis has a tool set, and set it to "auto" otherwise.
Previous library(shinyngs) statements are on
eval=FALSE chunks.
When reading csv files from each assay, gene_set type and contrast, allow to not have enrichment results for all combinations.
@zeehio
Copy link
Contributor Author

zeehio commented Jan 25, 2026

It took me a long while to find some time to finish this (definitely longer than expected).

I implemented all the suggested changes, added unit tests to the accessory functions, improved docstrings and rebased.

Happy to get your feedback and hopefully be able to merge this eventually :-)

@zeehio zeehio changed the title GSEA support (WIP) GSEA support Jan 25, 2026
@pinin4fjords
Copy link
Owner

Looking pretty good now! Do you have some example outputs to show?

@zeehio
Copy link
Contributor Author

zeehio commented Jan 26, 2026

I'll find a public dataset and upload some plots as soon as I find some more time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GSEA output in shinyngs

3 participants