diff --git a/Microarray/Affymetrix/Array_Annotations/Affymetrix_array_annotations.csv b/Microarray/Affymetrix/Array_Annotations/Affymetrix_array_annotations.csv new file mode 100644 index 00000000..03ea9b68 --- /dev/null +++ b/Microarray/Affymetrix/Array_Annotations/Affymetrix_array_annotations.csv @@ -0,0 +1,3 @@ +array_design,annot_type,annot_filename,download_link,download_date +AFFY E coli Genome 2 0,3prime-IVT,E_coli_2.na36.annot.csv,https://www.thermofisher.com/order/catalog/product/sec/assets?url=TFS-Assets/LSG/Support-Files/E_coli_2-na36-annot-csv.zip,2024-06-15 +AFFY GeneChip P. aeruginosa Genome,3prime-IVT,Pae_G1a.na36.annot.csv,https://www.thermofisher.com/order/catalog/product/sec/assets?url=TFS-Assets/LSG/Support-Files/Pae_G1a-na36-annot-csv.zip,2024-06-15 diff --git a/Microarray/Affymetrix/Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md b/Microarray/Affymetrix/Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md new file mode 100644 index 00000000..d3e5393a --- /dev/null +++ b/Microarray/Affymetrix/Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md @@ -0,0 +1,1616 @@ +# GeneLab bioinformatics processing pipeline for Affymetrix microarray data + +> **This page holds an overview and instructions for how GeneLab processes Affymetrix microarray datasets. Exact processing commands and GL-DPPD-7114 version used for specific GeneLab datasets (GLDS) are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo).** +> +> \* The pipeline detailed below is currently used for animal and *Arabidopsis thaliana* studies only, it will be updated soon for processing microbe microarray data and other plant data. + +--- + +**Date:** May XX, 2025 +**Revision:** -A +**Document Number:** GL-DPPD-7114-A + +**Submitted by:** +Crystal Han (GeneLab Data Processing Team) + +**Approved by:** +Samrawit Gebre (OSDR Project Manager) +Danielle Lopez (OSDR Deputy Project Manager) +Jonathan Galazka (OSDR Project Scientist) +Amanda Saravia-Butler (GeneLab Science Lead) +Barbara Novak (GeneLab Data Processing Lead) + +--- + +## Updates from previous version + +Updated [Ensembl Reference Files](https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) to the following releases: +- Animals: Ensembl release 112 +- Plants: Ensembl plants release 59 +- Bacteria: Ensembl bacteria release 59 + +Software Updates: + +| Program | Previous Version | New Version | +|:--------|:-----------------|:---------------| +|R|4.1.3|4.4.2| +|DT|0.26|0.33| +|dplyr|1.0.10|1.1.4| +|tibble|3.1.8|3.2.1| +|stringr|1.5.0|1.5.1| +|R.utils|2.12.2|2.12.3| +|oligo|1.58.0|1.70.0| +|limma|3.50.3|3.62.2| +|glue|1.6.2|1.8.0| +|biomaRt|2.50.0|2.62.0| +|matrixStats|0.63.0|1.5.0| +|dp_tools|1.3.4|1.3.5| +|Quarto|1.2.313|1.6.40| +|Bioconductor|3.14|3.20| +|tidyverse|1.3.1|2.0.0| + +MA Plots + +- Added support for plotting HTAFeatureSet data + +Custom Annotations + +- Added ability to use custom gene annotations when annotations are not available in Biomart or Ensembl FTP, see [Step 8](#8-probeset-annotations) + +--- + +# Table of contents + +- [Software used](#software-used) +- [General processing overview with example commands](#general-processing-overview-with-example-commands) + - [1. Create Sample RunSheet](#1-create-sample-runsheet) + - [2. Load Data](#2-load-data) + - [2a. Load Libraries and Define Input Parameters](#2a-load-libraries-and-define-input-parameters) + - [2b. Define Custom Functions](#2b-define-custom-functions) + - [2c. Load Metadata and Raw Data](#2c-load-metadata-and-raw-data) + - [2d. Load Annotation Metadata](#2d-load-annotation-metadata) + - [3. Raw Data Quality Assessment](#3-raw-data-quality-assessment) + - [3a. Density Plot](#3a-density-plot) + - [3b. Pseudo Image Plots](#3b-pseudo-image-plots) + - [3c. MA Plots](#3c-ma-plots) + - [3d. Boxplots](#3d-boxplots) + - [4. Background Correction](#4-background-correction) + - [5. Between Array Normalization](#5-between-array-normalization) + - [6. Normalized Data Quality Assessment](#6-normalized-data-quality-assessment) + - [6a. Density Plot](#6a-density-plot) + - [6b. Pseudo Image Plots](#6b-pseudo-image-plots) + - [6c. MA Plots](#6c-ma-plots) + - [6d. Boxplots](#6d-boxplots) + - [7. Probeset Summarization](#7-probeset-summarization) + - [8. Probeset Annotations](#8-probeset-annotations) + - [8a. Get Probeset Annotations](#8a-get-probeset-annotations) + - [8b. Summarize Gene Mapping](#8b-summarize-gene-mapping) + - [8c. Generate Annotated Raw and Normalized Expression Tables](#8c-generate-annotated-raw-and-normalized-expression-tables) + - [9. Perform Probeset Differential Expression (DE)](#9-perform-probeset-differential-expression-de) + - [9a. Generate Design Matrix](#9a-generate-design-matrix) + - [9b. Perform Individual Probeset Level DE](#9b-perform-individual-probeset-level-de) + - [9c. Add Annotation and Stats Columns and Format DE Table](#9c-add-annotation-and-stats-columns-and-format-de-table) + +--- + +# Software used + +|Program|Version|Relevant Links| +|:------|:------:|:-------------| +|R|4.4.2|[https://www.r-project.org/](https://www.r-project.org/)| +|Bioconductor|3.20|[https://bioconductor.org](https://bioconductor.org)| +|tidyverse|2.0.0|[https://www.tidyverse.org](https://www.tidyverse.org) +|DT|0.33|[https://github.com/rstudio/DT](https://github.com/rstudio/DT)| +|dplyr|1.1.4|[https://dplyr.tidyverse.org](https://dplyr.tidyverse.org)| +|tibble|3.2.1|[https://tibble.tidyverse.org](https://tibble.tidyverse.org)| +|stringr|1.5.1|[https://stringr.tidyverse.org](https://stringr.tidyverse.org)| +|R.utils|2.12.3|[https://github.com/HenrikBengtsson/R.utils](https://github.com/HenrikBengtsson/R.utils)| +|oligo|1.70.0|[https://bioconductor.org/packages/3.20/bioc/html/oligo.html](https://bioconductor.org/packages/3.20/bioc/html/oligo.html)| +|limma|3.62.2|[https://bioconductor.org/packages/3.20/bioc/html/limma.html](https://bioconductor.org/packages/3.20/bioc/html/limma.html)| +|glue|1.8.0|[https://glue.tidyverse.org](https://glue.tidyverse.org)| +|biomaRt|2.62.0|[https://bioconductor.org/packages/3.20/bioc/html/biomaRt.html](https://bioconductor.org/packages/3.20/bioc/html/biomaRt.html)| +|matrixStats|1.5.0|[https://github.com/HenrikBengtsson/matrixStats](https://github.com/HenrikBengtsson/matrixStats)| +|statmod|1.5.0|[https://github.com/cran/statmod](https://github.com/cran/statmod)| +|dp_tools|1.3.5|[https://github.com/torres-alexis/dp_tools](https://github.com/torres-alexis/dp_tools)| +|Quarto|1.6.40|[https://quarto.org](https://quarto.org)| +--- + +# General processing overview with example commands + +> Exact processing commands and output files listed in **bold** below are included with each Microarray processed dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). + +--- + +## 1. Create Sample RunSheet + +> Notes: +> - Rather than running the commands below to create the runsheet needed for processing, the runsheet may also be created manually by following the [file specification](../Workflow_Documentation/NF_MAAffymetrix/examples/runsheet/README.md). +> +> - These command line tools are part of the [dp_tools](https://github.com/J-81/dp_tools) program. + +```bash +### Download the *ISA.zip file from the GeneLab Repository ### + +dpt-get-isa-archive \ + --accession OSD-### + +### Parse the metadata from the *ISA.zip file to create a sample runsheet ### + +dpt-isa-to-runsheet --accession GLDS-### \ + --plugin-dir /path/to/dp_tools__microarray_plugin \ + --isa-archive *ISA.zip +``` + +**Parameter Definitions:** + +- `--accession OSD-###` - OSD accession ID (replace ### with the OSD number being processed), used to retrieve the urls for the ISA archive and raw expression files hosted on the GeneLab Repository +- `--plugin-dir /path/to/dp_tools__microarray_plugin` - specifies the path to the plugin directory defining the dp-tools configuration for the desired assay type. A plugin for both the Affymetrix microarray assay is provided in the [Workflow_Documentation](../Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/) folder +- `--isa-archive` - Specifies the *ISA.zip file for the respective GLDS dataset, downloaded in the `dpt-get-isa-archive` command + + +**Input Data:** + +- No input data required but the OSD accession ID needs to be indicated, which is used to download the respective ISA archive + +**Output Data:** + +- *ISA.zip (compressed ISA directory containing Investigation, Study, and Assay (ISA) metadata files for the respective OSD dataset, used to define sample groups - the *ISA.zip file is located in the [OSD repository](https://osdr.nasa.gov/bio/repo/search?q=&data_source=cgene,alsda&data_type=study) under 'Study Files' -> 'metadata') + +- **{OSD-Accession-ID}_microarray_v{version}_runsheet.csv** (table containing metadata required for processing, version denotes the dp_tools schema used to specify the metadata to extract from the ISA archive) + +
+ +--- + +## 2. Load Data + +> Note: Steps 2 - 9 are done in R + +
+ +### 2a. Load Libraries and Define Input Parameters + +```R +### Install R packages if not already installed ### + +install.packages("tidyverse") +install.packages("R.utils") +install.packages("glue") +install.packages("matrixStats") +install.packages("statmod") +if (!require("BiocManager", quietly = TRUE)) + install.packages("BiocManager") +BiocManager::install(version = "3.20") +BiocManager::install("limma") +BiocManager::install("biomaRt") +BiocManager::install("oligo") + + +## Note: Only dplyr is explicitly loaded. Other library functions are called with explicit namespace (e.g. LIBRARYNAME::FUNCTION) +library(dplyr) # Ensure infix operator is available, methods should still reference dplyr namespace otherwise +options(dplyr.summarise.inform = FALSE) # Don't print out '`summarise()` has grouped output by 'group'. You can override using the `.groups` argument.' + +# Define path to runsheet +runsheet <- "/path/to/runsheet/{OSD-Accession-ID}_microarray_v{version}_runsheet.csv" + +## Set up output structure + +# Output Constants +DIR_RAW_DATA <- "00-RawData" +DIR_NORMALIZED_EXPRESSION <- "01-oligo_NormExp" +DIR_DGE <- "02-limma_DGE" + +dir.create(DIR_RAW_DATA) +dir.create(DIR_NORMALIZED_EXPRESSION) +dir.create(DIR_DGE) + +## Save original par settings +## Par may be temporarily changed for plotting purposes and reset once the plotting is done + +original_par <- par() +options(preferRaster=TRUE) # use Raster when possible to avoid antialiasing artifacts in images + +options(timeout=1000) # ensure enough time for data downloads +``` + +
+ +### 2b. Define Custom Functions + +#### retry_with_delay() +
+ utility function to improve robustness of function calls; used to remedy intermittent internet issues during runtime + + ```R + retry_with_delay <- function(func, ...) { + max_attempts = 5 + initial_delay = 10 + delay_increase = 30 + attempt <- 1 + current_delay <- initial_delay + while (attempt <= max_attempts) { + result <- tryCatch( + expr = func(...), + error = function(e) e + ) + + if (!inherits(result, "error")) { + return(result) + } else { + if (attempt < max_attempts) { + message(paste("Retry attempt", attempt, "failed for function with name <", deparse(substitute(func)) ,">. Retrying in", current_delay, "second(s)...")) + Sys.sleep(current_delay) + current_delay <- current_delay + delay_increase + } else { + stop(paste("Max retry attempts reached. Last error:", result$message)) + } + } + + attempt <- attempt + 1 + } + } + ``` + + **Function Parameter Definitions:** + - `func=` - specifies the function to wrap + - `...` - other arguments passed on to the `func` + + **Returns:** the output of the wrapped function +
+ +#### all_true() +
+ wraps R base::all() function; overrides default behavior for empty input vector + + ```R + all_true <- function(i_vector) { + if ( length(i_vector) == 0 ) { + stop(paste("Input vector is length zero")) + } + all(i_vector) + } + ``` + + **Function Parameter Definitions:** + - `i_vector=` - a vector of logical values + + **Returns:** a logical vector of length 1; `TRUE` if all values are true, `FALSE` otherwise; stops and returns an error if input vector is empty +
+ +#### runsheet_paths_are_URIs() +
+ tests if paths provided in runsheet dataframe are URIs + + ```R + runsheet_paths_are_URIs <- function(df_runsheet) { + all_true(stringr::str_starts(df_runsheet$`Array Data File Path`, "https")) + } + ``` + + **Custom Functions Used:** + - [all_true()](#all_true) + + **Function Parameter Definitions:** + - `df_runsheet=` - a dataframe containing the sample runsheet information + + **Returns:** a logical vector of length 1; `TRUE` if all values in the `Array Data File Path` of the runsheet start with "https", `FALSE` otherwise; stops and returns an error if input vector is empty +
+ +#### download_files_from_runsheet() +
+ downloads the raw data files + + ```R + download_files_from_runsheet <- function(df_runsheet) { + urls <- df_runsheet$`Array Data File Path` + destinationFiles <- df_runsheet$`Array Data File Name` + + mapply(function(url, destinationFile) { + print(paste0("Downloading from '", url, "' TO '", destinationFile, "'")) + if ( file.exists(destinationFile ) ) { + warning(paste( "Using Existing File:", destinationFile )) + } else { + download.file(url, destinationFile) + } + }, urls, destinationFiles) + + destinationFiles # Return these paths + } + ``` + + **Function Parameter Definitions:** + - `df_runsheet=` - a dataframe containing the sample runsheet information + + **Returns:** a list of filenames that were downloaded; same as the `Array Data File Name` in the sample runsheet +
+ +#### fetch_organism_specific_annotation_table() +
+ determines the organism specific annotation file to use based on the provided organism name + + ```R + fetch_organism_specific_annotation_table <- function(organism, annotation_table_link) { + # Uses the latest GeneLab annotations table to find the organism specific annotation file path and ensembl version + # Raises an exception if the organism does not have an associated annotation file or ensembl version yet + + all_organism_table <- read.csv(annotation_table_link) + + annotation_table <- all_organism_table %>% dplyr::filter(species == organism) + + # Guard clause: Ensure annotation_table populated + # Else: raise exception for unsupported organism + if (nrow(annotation_table) == 0 || annotation_table$genelab_annots_link == "" || is.na(annotation_table$ensemblVersion)) { + stop(glue::glue("Organism supplied '{organism}' is not supported. See the following url for supported organisms: {annotation_table_link}. Supported organisms will correspond to a row based on the 'species' column and include a url in the 'genelab_annots_link' column of that row and a version number in the 'ensemblVersion' column.")) + } + + return(annotation_table) + } + ``` + + **Function Parameter Definitions:** + - `organism=` - a string containing the name of the organism (as found in the species column of the GeneLab annotation table) + - `annotation_table_link=` - a string specifying the URL or path to latest GeneLab Annotations file, see [GL-DPPD-7110-A_annotations.csv](https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + + **Returns:** a dataframe containing all rows in the GeneLab annotations file that match the specified organism +
+ +#### shortened_organism_name() +
+ shortens organism names, for example 'Homo Sapiens' to 'hsapiens' + + ```R + shortened_organism_name <- function(long_name) { + tokens <- long_name %>% stringr::str_split(" ", simplify = TRUE) + genus_name <- tokens[1] + + species_name <- tokens[2] + + short_name <- stringr::str_to_lower(paste0(substr(genus_name, start = 1, stop = 1), species_name)) + + return(short_name) + } + ``` + + **Function Parameter Definitions:** + - `long_name=` - a string containing the long name of the organism + + **Returns:** a string containing the short name of the organism +
+ +#### get_biomart_attribute() +
+ retrieves resolved biomart attribute source from runsheet dataframe + + ```R + get_biomart_attribute <- function(df_rs) { + # check if runsheet has 'biomart_attribute' column + if ( !is.null(df_rs$`biomart_attribute`) ) { + print("Using attribute name sourced from runsheet") + # Format according to biomart needs + formatted_value <- unique(df_rs$`biomart_attribute`) %>% + stringr::str_replace_all(" ","_") %>% # Replace all spaces with underscore + stringr::str_to_lower() # Lower casing only + return(formatted_value) + } else { + stop("ERROR: Could not find 'biomart_attribute' in runsheet") + } + } + ``` + + **Function Parameter Definitions:** + - `df_rs=` - a dataframe containing the sample runsheet information + + **Returns:** a string containing the formatted value from the `biomart_attribute` column of the runsheet, with all spaces converted to underscores and uppercase converted to lowercase; if no `biomart_attribute` exists in the runsheet, stop and return an error +
+ +#### get_ensembl_genomes_mappings_from_ftp() +
+ obtains mapping table directly from ftp; useful when biomart live service no longer exists for desired version + + ```R + get_ensembl_genomes_mappings_from_ftp <- function(organism, ensembl_genomes_portal, ensembl_genomes_version, biomart_attribute) { + request_url <- glue::glue("https://ftp.ebi.ac.uk/ensemblgenomes/pub/{ensembl_genomes_portal}/release-{ensembl_genomes_version}/mysql/{ensembl_genomes_portal}_mart_{ensembl_genomes_version}/{organism}_eg_gene__efg_{biomart_attribute}__dm.txt.gz") + + print(glue::glue("Mappings file URL: {request_url}")) + + # Create a temporary file name + temp_file <- tempfile(fileext = ".gz") + + # Download the gzipped table file using the download.file function + download.file(url = request_url, destfile = temp_file, method = "libcurl") # Use 'libcurl' to support ftps + + # Uncompress the file + uncompressed_temp_file <- tempfile() + gzcon <- gzfile(temp_file, "rt") + content <- readLines(gzcon) + writeLines(content, uncompressed_temp_file) + close(gzcon) + + + # Load the data into a dataframe + mapping <- read.table(uncompressed_temp_file, # Read the uncompressed file + # Add column names as follows: MAPID, TAIR, PROBESETID + col.names = c("MAPID", "ensembl_gene_id", biomart_attribute), + header = FALSE, # No header in original table + sep = "\t") # Tab separated + + # Clean up temporary files + unlink(temp_file) + unlink(uncompressed_temp_file) + + return(mapping) + } + ``` + + **Function Parameter Definitions:** + - `organism=` - a string containing the name of the organism (formatted using `shortened_organism_name()`) + - `ensembl_genomes_portal=` - a string containing the name of the genomes portal, for example 'plants' + - `ensembl_genomes_version=` - a string containing the version of Ensembl to use + - `biomart_attribute=` - a string containing the biomart attribute (formatted using `get_biomart_attribute()`) + + **Returns:** a dataframe containing the mapping between Ensembl ID and probeset ID, as obtained via FTP +
+ +#### list_to_unique_piped_string() +
+ converts character vector into string denoting unique elements separated by '|' characters + + ```R + list_to_unique_piped_string <- function(str_list) { + #! Convert vector of multi-mapped genes to string separated by '|' characters + #! e.g. c("GO1","GO2","GO2","G03") -> "GO1|GO2|GO3" + return(toString(unique(str_list)) %>% stringr::str_replace_all(pattern = stringr::fixed(", "), replacement = "|")) + } + ``` + + **Function Parameter Definitions:** + - `str_list=` - vector of character elements + + **Returns:** a string containing the unique elements from `str_list` concatenated together, separated by '|' characters +
+ +#### runsheet_to_design_matrix() +
+ loads the GeneLab runsheet into a list of dataframes + + ```R + runsheet_to_design_matrix <- function(runsheet_path) { + # Pull all factors for each sample in the study from the runsheet created in Step 1 + df <- read.csv(runsheet, check.names = FALSE) %>% + dplyr::mutate_all(function(x) iconv(x, "latin1", "ASCII", sub="")) # Convert all characters to ascii, when not possible, remove the character # get only Factor Value columns + factors = as.data.frame(df[,grep("Factor.Value", colnames(df), ignore.case=TRUE)]) + colnames(factors) = paste("factor",1:dim(factors)[2], sep= "_") + + # Load metadata from runsheet csv file + compare_csv = data.frame(sample_id = df[,c("Sample Name")], factors) + + # Create data frame containing all samples and respective factors + study <- as.data.frame(compare_csv[,2:dim(compare_csv)[2]]) + colnames(study) <- colnames(compare_csv)[2:dim(compare_csv)[2]] + rownames(study) <- compare_csv[,1] + + # Format groups and indicate the group that each sample belongs to + if (dim(study)[2] >= 2){ + group<-apply(study,1,paste,collapse = " & ") # concatenate multiple factors into one condition per sample + } else{ + group<-study[,1] + } + group_names <- paste0("(",group,")",sep = "") # human readable group names + group <- sub("^BLOCKER_", "", make.names(paste0("BLOCKER_", group))) # group naming compatible with R models, this maintains the default behaviour of make.names with the exception that 'X' is never prepended to group namesnames(group) <- group_names + names(group) <- group_names + + # Format contrasts table, defining pairwise comparisons for all groups + contrast.names <- combn(levels(factor(names(group))),2) # generate matrix of pairwise group combinations for comparison + contrasts <- apply(contrast.names, MARGIN=2, function(col) sub("^BLOCKER_", "", make.names(paste0("BLOCKER_", stringr::str_sub(col, 2, -2))))) + contrast.names <- c(paste(contrast.names[1,],contrast.names[2,],sep = "v"),paste(contrast.names[2,],contrast.names[1,],sep = "v")) # format combinations for output table files names + contrasts <- cbind(contrasts,contrasts[c(2,1),]) + colnames(contrasts) <- contrast.names + sampleTable <- data.frame(condition=factor(group)) + rownames(sampleTable) <- df[,c("Sample Name")] + + condition <- sampleTable[,'condition'] + names_mapping <- as.data.frame(cbind(safe_name = as.character(condition), original_name = group_names)) + + design <- model.matrix(~ 0 + condition) + design_data <- list( matrix = design, mapping = names_mapping, groups = as.data.frame( cbind(sample = df[,c("Sample Name")], group = group_names) ), contrasts = contrasts ) + return(design_data) + } + ``` + + **Function Parameter Definitions:** + - `runsheet_path=` - a string containing the path to the runsheet generated in [Step 1](#1-create-sample-runsheet) + + **Returns:** a list of R objects containing the sample information and metadata + - `design_data$matrix` - a design (or model) matrix describing the conditions in the dataset + - `design_data$mapping` - a dataframe mapping the human-readable group names to the names of the conditions modified for use in R + - `design_data$groups` - a dataframe of group names and contrasts for each sample + - `design_data$contrasts` - a matrix of all pairwise comparisons of the groups +
+ +#### lm_fit_pairwise() +
+ performs all pairwise comparisons using limma::lmFit() + + ```R + lm_fit_pairwise <- function(norm_data, design) { + # Approach based on limma manual section 17.4 (version 3.52.4) + fit <- limma::lmFit(norm_data, design) + + # Create Contrast Model + fit.groups <- colnames(fit$design)[which(fit$assign == 1)] + combos <- combn(fit.groups,2) + contrasts<-c(paste(combos[1,],combos[2,],sep = "-"),paste(combos[2,],combos[1,],sep = "-")) # format combinations for limma:makeContrasts + cont.matrix <- limma::makeContrasts(contrasts=contrasts,levels=design) + contrast.fit <- limma::contrasts.fit(fit, cont.matrix) + + contrast.fit <- limma::eBayes(contrast.fit,trend=TRUE,robust=TRUE) + return(contrast.fit) + } + ``` + + **Function Parameter Definitions:** + - `norm_data=` - an R object containing log-ratios or log-expression values for a series of arrays, with rows corresponding to genes and columns to samples + - `design=` - the design matrix of the microarray experiment, with rows corresponding to samples and columns to coefficients to be estimated + + **Returns:** an R object of class `MArrayLM` +
+ +#### reformat_names() +
+ reformats column names for consistency across DE analyses tables within GeneLab + + ```R + reformat_names <- function(colname, group_name_mapping) { + new_colname <- colname %>% + stringr::str_replace(pattern = "^P.value.adj.condition", replacement = "Adj.p.value_") %>% + stringr::str_replace(pattern = "^P.value.condition", replacement = "P.value_") %>% + stringr::str_replace(pattern = "^Coef.condition", replacement = "Log2fc_") %>% # This is the Log2FC as per: https://rdrr.io/bioc/limma/man/writefit.html + stringr::str_replace(pattern = "^t.condition", replacement = "T.stat_") %>% + stringr::str_replace(pattern = ".condition", replacement = "v") + + # remap to group names before make.names was applied + unique_group_name_mapping <- unique(group_name_mapping) %>% arrange(-nchar(safe_name)) + for ( i in seq(nrow(unique_group_name_mapping)) ) { + safe_name <- unique_group_name_mapping[i,]$safe_name + original_name <- unique_group_name_mapping[i,]$original_name + new_colname <- new_colname %>% stringr::str_replace(pattern = stringr::fixed(safe_name), replacement = original_name) + } + + return(new_colname) + } + ``` + + **Function Parameter Definitions:** + - `colnames=` - a character vector containing the column names to reformat + - `group_name_mapping=` - a dataframe mapping the original human-readable group names to the R modified safe names + + **Returns:** a character vector containing the formatted column names +
+ +#### generate_prefixed_column_order() +
+ creates a vector of column names based on subject and given prefixes; used for both contrasts and groups column name generation + + ```R + generate_prefixed_column_order <- function(subjects, prefixes) { + # Track order of columns + final_order = c() + + # For each contrast + for (subject in subjects) { + # Generate column names for each prefix and append to final_order + for (prefix in prefixes) { + final_order <- append(final_order, glue::glue("{prefix}{subject}")) + } + } + return(final_order) + } + ``` + + **Function Parameter Definitions:** + - `subjects` - a character vector containing subject strings to add prefixes to + - `prefixes` - a character vector of prefixes to add to the beginning of each subject string + + **Returns:** a character vector with all possible combinations of prefix + subject +
+ +
+ +### 2c. Load Metadata and Raw Data + +```R +df_rs <- read.csv(runsheet, check.names = FALSE) %>% + dplyr::mutate_all(function(x) iconv(x, "latin1", "ASCII", sub="")) # Convert all characters to ascii, when not possible, remove the character + +if ( runsheet_paths_are_URIs(df_rs) ) { + print("Determined Raw Data Locations are URIS") + local_paths <- retry_with_delay(download_files_from_runsheet, df_rs) +} else { + print("Or Determined Raw Data Locations are local paths") + local_paths <- df_rs$`Array Data File Path` +} + +# uncompress files if needed +if ( all_true(stringr::str_ends(local_paths, ".gz")) ) { + print("Determined these files are gzip compressed... uncompressing now") + # This does the uncompression + lapply(local_paths, R.utils::gunzip, remove = FALSE, overwrite = TRUE) + # This removes the .gz extension to get the uncompressed filenames + local_paths <- vapply(local_paths, + stringr::str_replace, # Run this function against each item in 'local_paths' + FUN.VALUE = character(1), # Execpt an character vector as a return + USE.NAMES = FALSE, # Don't use the input to assign names for the returned list + pattern = ".gz$", # first argument for applied function + replacement = "" # second argument for applied function + ) +} + +df_local_paths <- data.frame(`Sample Name` = df_rs$`Sample Name`, `Local Paths` = local_paths, check.names = FALSE) + +# Load raw data into R object +# Retry with delay here to accomodate oligo's automatic loading of annotation packages and occasional internet related failures to load +raw_data <- retry_with_delay( + oligo::read.celfiles, + df_local_paths$`Local Paths`, + sampleNames = df_local_paths$`Sample Name`# Map column names as Sample Names (instead of default filenames) + ) + +# Summarize raw data +print(paste0("Number of Arrays: ", dim(raw_data)[2])) +print(paste0("Number of Probes: ", dim(raw_data)[1])) +``` + +**Custom Functions Used:** + +- [retry_with_delay()](#retry_with_delay) +- [all_true()](#all_true) +- [runsheet_paths_are_URIs()](#runsheet_paths_are_uris) +- [download_files_from_runsheet()](#download_files_from_runsheet) + +**Input Data:** + +- `runsheet` (Path to runsheet, output from [Step 1](#1-create-sample-runsheet)) + +**Output Data:** + +- `df_rs` (R dataframe containing information from the runsheet) +- `raw_data` (R object containing raw microarray data) + + > Note: The raw data R object will be used to generate quality assessment (QA) plots in the next step. + +
+ +### 2d. Load Annotation Metadata + +```R +# If using custom annotation, local_annotation_dir is path to directory containing annotation file and annotation_config_path is path/url to config file +local_annotation_dir <- NULL # +annotation_config_path <- NULL # + +annotation_table_link <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/GL_RefAnnotTable-A_1.1.0/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" + +annotation_table <- retry_with_delay(fetch_organism_specific_annotation_table, unique(df_rs$organism), annotation_table_link) + +annotation_file_path <- annotation_table$genelab_annots_link +ensembl_version <- as.character(annotation_table$ensemblVersion) +``` + +**Custom Functions Used:** + +- [retry_with_delay()](#retry_with_delay) +- [fetch_organism_specific_annotation_table()](#fetch_organism_specific_annotation_table) + +**Input Data:** + +- `local_annotation_dir` (Path to local annotation directory if using custom annotations, see [Step 8a](#8a-get-probeset-annotations)) + + > Note: If not using custom annotations, leave `local_annotation_dir` as `NULL`. + +- `annotation_config_path` (URL or path to annotation config file if using custom annotations, see [Step 8a](#8a-get-probeset-annotations)) + + > Note: If not using custom annotations, leave `annotation_config_path` as `NULL`. + +- `df_rs$organism` (organism specified in the runsheet created in [Step 1](#1-create-sample-runsheet)) +- `annotation_table_link` (URL or path to latest GeneLab Annotations file, see [GL-DPPD-7110-A_annotations.csv](https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)) + +**Output Data:** + +- `annotation_file_path` (reference organism annotation file url indicated in the 'genelab_annots_link' column of the GeneLab Annotations file provided in `annotation_table_link`) +- `ensembl_version` (reference organism Ensembl version indicated in the 'ensemblVersion' column of the GeneLab Annotations file provided in `annotation_table_link`) + +
+ +--- + +## 3. Raw Data Quality Assessment + +
+ +### 3a. Density Plot + +```R +# Plot settings +par( + xpd = TRUE # Ensure legend can extend past plot area +) + +number_of_sets = ceiling(dim(raw_data)[2] / 30) # Set of 30 samples, used to scale plot +scale_factor = 0.2 # Default scale factor + +if (max(nchar(colnames(raw_data@assayData$exprs))) > 35 & number_of_sets > 1) { # Scale more if sample names are long + scale_factor = if_else(number_of_sets == 2, 0.4, 0.25) +} + +oligo::hist(raw_data, + transfo=log2, # Log2 transform raw intensity values + which=c("all"), + nsample=10000, # Number of probes to plot + main = "Density of raw intensities for multiple arrays") +legend("topright", legend = colnames(raw_data@assayData$exprs), + lty = c(1,2,3,4,5), # Seems like oligo::hist cycles through these first five line types + col = oligo::darkColors(n = ncol(raw_data)), # Ensure legend color is in sync with plot + ncol = number_of_sets, # Set number of columns by number of sets + cex = max(0.35, 1 + scale_factor - (number_of_sets*scale_factor)) # Reduce for each column beyond 1 with minimum of 35% + ) + +# Reset par +par(original_par) +``` + +**Input Data:** + +- `raw_data` (raw data R object created in [Step 2c](#2c-load-metadata-and-raw-data) above) + +**Output Data:** + +- Plot containing the density of raw intensities for each array (lack of overlap indicates a need for normalization) + +
+ +### 3b. Pseudo Image Plots + +```R +for ( i in seq_along(1:ncol(raw_data))) { + oligo::image(raw_data[,i], + transfo = log2, + main = colnames(raw_data)[i] + ) +} +``` + +**Input Data:** + +- `raw_data` (raw data R object created in [Step 2c](#2c-load-metadata-and-raw-data) above) + +**Output Data:** + +- Pseudo images of each array before background correction and normalization + +
+ +### 3c. MA Plots + +```R +if (inherits(raw_data, "GeneFeatureSet")) { + print("Raw data is a GeneFeatureSet, using exprs() to access expression values and adding 0.0001 to avoid log(0)") +} else if (inherits(raw_data, "ExpressionSet") || inherits(raw_data, "ExpressionFeatureSet") || inherits(raw_data, "HTAFeatureSet")) { + print(paste0("Raw data is ", class(raw_data), ". Using default approach for this class for MA Plot")) +} + +if (inherits(raw_data, "GeneFeatureSet")) { + MA_plot <- oligo::MAplot( + exprs(raw_data) + 0.0001, + transfo=log2, + ylim=c(-2, 4), + main="" # This function uses 'main' as a suffix to the sample name. Here we want just the sample name, thus here main is an empty string + ) +} else if (inherits(raw_data, "ExpressionSet") || inherits(raw_data, "ExpressionFeatureSet") || inherits(raw_data, "HTAFeatureSet")) { + MA_plot <- oligo::MAplot( + raw_data, + ylim=c(-2, 4), + main="" # This function uses 'main' as a suffix to the sample name. Here we want just the sample name, thus here main is an empty string + ) +} else { + stop(glue::glue("No strategy for MA plots for {class(raw_data)}")) +} +``` + +**Input Data:** + +- `raw_data` (raw data R object created in [Step 2c](#2c-load-metadata-and-raw-data) above) + +**Output Data:** + +- `MA_plot` (M (log ratio of the subject array vs a pseudo-reference, the median of all other arrays) vs. A (average log expression) plot for each array before background correction and normalization) + +
+ + +### 3d. Boxplots + +```R +max_samplename_length <- max(nchar(colnames(raw_data))) +dynamic_lefthand_margin <- max(max_samplename_length * 0.7, 10) +par( + mar = c(8, dynamic_lefthand_margin, 8, 2) + 0.1, # mar is the margin around the plot. c(bottom, left, top, right) + xpd = TRUE + ) +boxplot <- oligo::boxplot(raw_data[, rev(colnames(raw_data))], # Here we reverse column order to ensure descending order for samples in horizontal boxplot + transfo=log2, # Log2 transform raw intensity values + which=c("all"), + nsample=10000, # Number of probes to plot + las = 1, # las specifies the orientation of the axis labels. 1 = always horizontal + ylab="", + xlab="log2 Intensity", + main = "Boxplot of raw intensities \nfor perfect match and mismatch probes", + horizontal = TRUE + ) +title(ylab = "Sample Name", mgp = c(dynamic_lefthand_margin-2, 1, 0)) +# Reset par +par(original_par) +``` + +**Input Data:** + +- `raw_data` (raw data R object created in [Step 2c](#2c-load-metadata-and-raw-data) above) + +**Output Data:** + +- `boxplot` (Boxplot of raw expression data for each array before background correction and normalization) + +
+ +--- + +## 4. Background Correction + +```R +background_corrected_data <- raw_data %>% oligo::backgroundCorrect(method="rma") +``` + +**Input Data:** + +- `raw_data` (raw data R object created in [Step 2c](#2c-load-metadata-and-raw-data) above) + +**Output Data:** + +- `background_corrected_data` (R object containing background-corrected microarray data) + + > + > Note: Background correction was performed using the oligo `rma` method, specifically "Convolution Background Correction" + +
+ +--- + +## 5. Between Array Normalization + +```R +# Normalize background-corrected data using the quantile method +norm_data <- oligo::normalize(background_corrected_data, + method = "quantile", + target = "core" # Use oligo default: core metaprobeset mappings + ) + +# Summarize background-corrected and normalized data +print(paste0("Number of Arrays: ", dim(norm_data)[2])) +print(paste0("Number of Probes: ", dim(norm_data)[1])) +``` + +**Input Data:** + +- `background_corrected_data` (R object containing background-corrected microarray data created in [Step 4](#4-background-correction) above) + +**Output Data:** + +- `norm_data` (R object containing background-corrected and normalized microarray data) + + > + > Note: Normalization was performed using the `quantile` method, which forces the entire empirical distribution of all arrays to be identical + +
+ +--- + +## 6. Normalized Data Quality Assessment + +
+ +### 6a. Density Plot + +```R +# Plot settings +par( + xpd = TRUE # Ensure legend can extend past plot area +) + +number_of_sets = ceiling(dim(norm_data)[2] / 30) # Set of 30 samples, used to scale plot + +oligo::hist(norm_data, + transfo=log2, # Log2 transform normalized intensity values + which=c("all"), + nsample=10000, # Number of probes to plot + main = "Density of normalized intensities for multiple arrays") +legend("topright", legend = colnames(norm_data@assayData$exprs), + lty = c(1,2,3,4,5), # Seems like oligo::hist cycles through these first five line types + col = oligo::darkColors(n = ncol(norm_data)), # Ensure legend color is in sync with plot + ncol = number_of_sets, # Set number of columns by number of sets + cex = max(0.35, 1 + scale_factor - (number_of_sets*scale_factor)) # Reduce for each column beyond 1 with minimum of 35% + ) + +# Reset par +par(original_par) +``` + +**Input Data:** + +- `norm_data` (R object containing background-corrected and normalized microarray data created in [Step 5](#5-between-array-normalization) above) + +**Output Data:** + +- Plot containing the density of background-corrected and normalized intensities for each array (near complete overlap is expected after normalization) + +
+ +### 6b. Pseudo Image Plots + +```R +for ( i in seq_along(1:ncol(norm_data))) { + oligo::image(norm_data[,i], + transfo = log2, + main = colnames(norm_data)[i] + ) +} +``` + +**Input Data:** + +- `norm_data` (R object containing background-corrected and normalized microarray data created in [Step 5](#5-between-array-normalization) above) + +**Output Data:** + +- Pseudo images of each array after background correction and normalization + +
+ +### 6c. MA Plots + +```R +MA_plot <- oligo::MAplot( + norm_data, + ylim=c(-2, 4), + main="" # This function uses 'main' as a suffix to the sample name. Here we want just the sample name, thus here main is an empty string +) +``` + +**Input Data:** + +- `norm_data` (R object containing background-corrected and normalized microarray data created in [Step 5](#5-between-array-normalization) above) + +**Output Data:** + +- `MA_plot` (M (log ratio of the subject array vs a pseudo-reference, the median of all other arrays) vs. A (average log expression) plot for each array after background correction and normalization) + +
+ +### 6d. Boxplots + +```R +max_samplename_length <- max(nchar(colnames(norm_data))) +dynamic_lefthand_margin <- max(max_samplename_length * 0.7, 10) +par( + mar = c(8, dynamic_lefthand_margin, 8, 2) + 0.1, # mar is the margin around the plot. c(bottom, left, top, right) + xpd = TRUE + ) +boxplot <- oligo::boxplot(norm_data[, rev(colnames(norm_data))], # Here we reverse column order to ensure descending order for samples in horizontal boxplot + transfo=log2, # Log2 transform normalized intensity values + which=c("all"), + nsample=10000, # Number of probes to plot + las = 1, # las specifies the orientation of the axis labels. 1 = always horizontal + ylab="", + xlab="log2 Intensity", + main = "Boxplot of normalized intensities \nfor perfect match and mismatch probes", + horizontal = TRUE + ) +title(ylab = "Sample Name", mgp = c(dynamic_lefthand_margin-2, 1, 0)) +# Reset par +par(original_par) +``` + +**Input Data:** + +- `norm_data` (R object containing background-corrected and normalized microarray data created in [Step 5](#5-between-array-normalization) above) + +**Output Data:** + +- `boxplot` (Boxplot of expression data for each array after background correction and normalization) + +
+ +--- + +## 7. Probeset Summarization + +```R +probeset_level_data <- oligo::rma(norm_data, + normalize=FALSE, + background=FALSE + ) + +# Summarize background-corrected and normalized data +print("Summarized Probeset Level Data Below") +print(paste0("Number of Arrays: ", dim(probeset_level_data)[2])) +print(paste0("Total Number of Probes Assigned To A Probeset: ", dim(oligo::getProbeInfo(probeset_level_data, target="core")['man_fsetid'])[1])) # man_fsetid means 'Manufacturer Probeset ID'. Ref: https://support.bioconductor.org/p/57191/ +print(paste0("Number of Probesets: ", dim(unique(oligo::getProbeInfo(probeset_level_data, target="core")['man_fsetid']))[1])) # man_fsetid means 'Manufacturer Probeset ID'. Ref: https://support.bioconductor.org/p/57191/ +``` + +**Input Data:** + +- `norm_data` (R object containing background-corrected and normalized microarray data created in [Step 5](#5-between-array-normalization) above) + +**Output Data:** + +- `probeset_level_data` (R object containing probeset level expression values after summarization of normalized probeset level data) + +
+ +--- + +## 8. Probeset Annotations + +
+ +### 8a. Get Probeset Annotations + +```R +organism <- shortened_organism_name(unique(df_rs$organism)) +annot_key <- ifelse(organism %in% c("athaliana"), 'TAIR', 'ENSEMBL') + +if (organism %in% c("athaliana")) { + ENSEMBL_VERSION = ensembl_version + ensembl_genomes_portal = "plants" + print(glue::glue("Using ensembl genomes ftp to get specific version of probeset id mapping table. Ensembl genomes portal: {ensembl_genomes_portal}, version: {ENSEMBL_VERSION}")) + expected_attribute_name <- get_biomart_attribute(df_rs) + df_mapping <- retry_with_delay( + get_ensembl_genomes_mappings_from_ftp, + organism = organism, + ensembl_genomes_portal = ensembl_genomes_portal, + ensembl_genomes_version = ENSEMBL_VERSION, + biomart_attribute = expected_attribute_name + ) + + # TAIR from the mapping tables tend to be in the format 'AT1G01010.1' but the raw data has 'AT1G01010' + # So here we remove the '.NNN' from the mapping table where .NNN is any number + df_mapping$ensembl_gene_id <- stringr::str_replace_all(df_mapping$ensembl_gene_id, "\\.\\d+$", "") + + use_custom_annot <- FALSE +} else { + # Use biomart from main Ensembl website which archives keep each release on the live service + # locate dataset + expected_dataset_name <- shortened_organism_name(unique(df_rs$organism)) %>% stringr::str_c("_gene_ensembl") + print(paste0("Expected dataset name: '", expected_dataset_name, "'")) + + expected_attribute_name <- get_biomart_attribute(df_rs) + print(paste0("Expected attribute name: '", expected_attribute_name, "'")) + + # Specify Ensembl version used in current GeneLab reference annotations + ENSEMBL_VERSION <- ensembl_version + + print(glue::glue("Using Ensembl biomart to get specific version of mapping table. Ensembl version: {ENSEMBL_VERSION}")) + + # Check if organism/array design is supported in biomart + use_custom_annot <- TRUE + + ensembl <- biomaRt::useEnsembl(biomart = "genes", version = ENSEMBL_VERSION) + ensembl_datasets <- biomaRt::listDatasets(ensembl) + if (expected_dataset_name %in% ensembl_datasets$dataset) { + ensembl <- biomaRt::useEnsembl(biomart = "genes", dataset = expected_dataset_name, version = ENSEMBL_VERSION) + ensembl_attributes <- biomaRt::listAttributes(ensembl) + if (expected_attribute_name %in% ensembl_attributes$name) { + use_custom_annot <- FALSE + } + } + + if (use_custom_annot) { + unloadNamespace("biomaRt") + } else { + + ensembl <- biomaRt::useEnsembl(biomart = "genes", + dataset = expected_dataset_name, + version = ENSEMBL_VERSION) + print(ensembl) + + # Some probe_ids for affy_hta_2_0 may end in .hg.1 instead of .hg (how it is in biomaRt), leading to 0 results returned + if (expected_attribute_name == 'affy_hta_2_0') { + rownames(probeset_level_data) <- stringr::str_replace(rownames(probeset_level_data), '\\.hg\\.1$', '.hg') + } + + probe_ids <- rownames(probeset_level_data) + + # Create probe map + # Run Biomart Queries in chunks to prevent request timeouts + # Note: If timeout is occuring (possibly due to larger load on biomart), reduce chunk size + CHUNK_SIZE= 1500 + probe_id_chunks <- split(probe_ids, ceiling(seq_along(probe_ids) / CHUNK_SIZE)) + df_mapping <- data.frame() + for (i in seq_along(probe_id_chunks)) { + probe_id_chunk <- probe_id_chunks[[i]] + print(glue::glue("Running biomart query chunk {i} of {length(probe_id_chunks)}. Total probes IDS in query ({length(probe_id_chunk)})")) + chunk_results <- biomaRt::getBM( + attributes = c( + expected_attribute_name, + "ensembl_gene_id" + ), + filters = expected_attribute_name, + values = probe_id_chunk, + mart = ensembl) + + if (nrow(chunk_results) > 0) { + df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results) + } + + Sys.sleep(10) # Slight break between requests to prevent back-to-back requests + } + + } +} + +# At this point, we have df_mapping from either the biomart live service or the ensembl genomes ftp archive depending on the organism +# If no df_mapping obtained (e.g., organism not supported in biomart), use custom annotations; otherwise, merge in-house annotations to df_mapping + +if (use_custom_annot) { + expected_attribute_name <- 'ProbesetID' + + annot_type <- 'NO_CUSTOM_ANNOT' + if (!is.null(local_annotation_dir) && !is.null(annotation_config_path)) { + config_df <- read.csv(annotation_config_path, row.names=1) + if (unique(df_rs$`biomart_attribute`) %in% row.names(config_df)) { + annot_config <- config_df[unique(df_rs$`biomart_attribute`), ] + annot_type <- annot_config$annot_type[[1]] + } else { + warning(paste0("No entry for '", unique(df_rs$`biomart_attribute`), "' in provided config file: ", annotation_config_path)) + } + } else { + warning("Need to provide both local_annotation_dir and annotation_config_path to use custom annotation.") + } + + if (annot_type == '3prime-IVT') { + unique_probe_ids <- read.csv( + file.path(local_annotation_dir, annot_config$annot_filename[[1]]), + skip = 13, header = TRUE, na.strings = c('NA', '---') + )[c('Probe.Set.ID', 'Entrez.Gene', 'Gene.Symbol', 'Gene.Title', 'Ensembl', 'RefSeq.Transcript.ID', 'RefSeq.Protein.ID', 'Gene.Ontology.Biological.Process', 'Gene.Ontology.Cellular.Component', 'Gene.Ontology.Molecular.Function')] + + # Clean columns + unique_probe_ids$Gene.Symbol <- purrr::map_chr(stringr::str_split(unique_probe_ids$Gene.Symbol, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) + unique_probe_ids$Gene.Title <- purrr::map_chr(stringr::str_split(unique_probe_ids$Gene.Title, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) + unique_probe_ids$Entrez.Gene <- purrr::map_chr(stringr::str_split(unique_probe_ids$Entrez.Gene, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) + unique_probe_ids$Ensembl <- purrr::map_chr(stringr::str_split(unique_probe_ids$Ensembl, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) + + unique_probe_ids$RefSeq <- paste(unique_probe_ids$RefSeq.Transcript.ID, unique_probe_ids$RefSeq.Protein.ID) + unique_probe_ids$RefSeq <- purrr::map_chr(stringr::str_extract_all(unique_probe_ids$RefSeq, '[A-Z]+_[\\d.]+'), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('^$', NA_character_) + + unique_probe_ids$GO <- paste(unique_probe_ids$Gene.Ontology.Biological.Process, unique_probe_ids$Gene.Ontology.Cellular.Component, unique_probe_ids$Gene.Ontology.Molecular.Function) + unique_probe_ids$GO <- purrr::map_chr(stringr::str_extract_all(unique_probe_ids$GO, '\\d{7}'), ~paste0('GO:', unique(.), collapse = "|")) %>% stringr::str_replace('^GO:$', NA_character_) + + unique_probe_ids <- unique_probe_ids[c('Probe.Set.ID', 'Entrez.Gene', 'Gene.Symbol', 'Gene.Title', 'Ensembl', 'RefSeq', 'GO')] + names(unique_probe_ids) <- c('ProbesetID', 'ENTREZID', 'SYMBOL', 'GENENAME', 'ENSEMBL', 'REFSEQ', 'GOSLIM_IDS') + + unique_probe_ids$STRING_id <- NA_character_ + + gene_col <- 'ENSEMBL' + if (sum(!is.na(unique_probe_ids$ENTREZID)) > sum(!is.na(unique_probe_ids$ENSEMBL))) { + gene_col <- 'ENTREZID' + } + if (sum(!is.na(unique_probe_ids$SYMBOL)) > max(sum(!is.na(unique_probe_ids$ENTREZID)), sum(!is.na(unique_probe_ids$ENSEMBL)))) { + gene_col <- 'SYMBOL' + } + + unique_probe_ids <- unique_probe_ids %>% + dplyr::mutate( + count_gene_mappings = 1 + stringr::str_count(get(gene_col), stringr::fixed("|")), + gene_mapping_source = gene_col + ) + } else if (annot_type == 'custom') { + unique_probe_ids <- read.csv( + file.path(local_annotation_dir, annot_config$annot_filename[[1]]), + header = TRUE, na.strings = c('NA', '') + ) + } else { + annot_cols <- c('ProbesetID', 'ENTREZID', 'SYMBOL', 'GENENAME', 'ENSEMBL', 'REFSEQ', 'GOSLIM_IDS', 'STRING_id', 'count_gene_mappings', 'gene_mapping_source') + unique_probe_ids <- setNames(data.frame(matrix(NA_character_, nrow = 1, ncol = length(annot_cols))), annot_cols) + } +} else { + annot <- read.table( + as.character(annotation_file_path), + sep = "\t", + header = TRUE, + quote = "", + comment.char = "" + ) + + unique_probe_ids <- df_mapping %>% + dplyr::mutate(dplyr::across(!!sym(expected_attribute_name), as.character)) %>% # Ensure probeset ids treated as character type + dplyr::group_by(!!sym(expected_attribute_name)) %>% + dplyr::summarise( + ENSEMBL = list_to_unique_piped_string(ensembl_gene_id) + ) %>% + # Count number of ensembl IDS mapped + dplyr::mutate( + count_gene_mappings = 1 + stringr::str_count(ENSEMBL, stringr::fixed("|")), + gene_mapping_source = annot_key + ) %>% + dplyr::left_join(annot, by = c("ENSEMBL" = annot_key)) +} + +probeset_expression_matrix <- oligo::exprs(probeset_level_data) + +probeset_expression_matrix.gene_mapped <- probeset_expression_matrix %>% + as.data.frame() %>% + tibble::rownames_to_column(var = "ProbesetID") %>% # Ensure rownames (probeset IDs) can be used as join key + dplyr::left_join(unique_probe_ids, by = c("ProbesetID" = expected_attribute_name ) ) %>% + dplyr::mutate( count_gene_mappings := ifelse(is.na(count_gene_mappings), 0, count_gene_mappings) ) %>% + dplyr::mutate( gene_mapping_source := unique(unique_probe_ids$gene_mapping_source) ) +``` + +**Custom Functions Used:** + +- [retry_with_delay()](#retry_with_delay) +- [shortened_organism_name()](#shortened_organism_name) +- [get_biomart_attribute()](#get_biomart_attribute) +- [get_ensembl_genomes_mappings_from_ftp()](#get_ensembl_genomes_mappings_from_ftp) +- [list_to_unique_piped_string()](#list_to_unique_piped_string) + +**Input Data:** + +- `df_rs$organism` (organism specified in the runsheet created in [Step 1](#1-create-sample-runsheet)) +- `df_rs$biomart_attribute` (array design biomart identifier specified in the runsheet created in [Step 1](#1-create-sample-runsheet)) +- `annotation_file_path` (reference organism annotation file url indicated in the 'genelab_annots_link' column of the GeneLab Annotations file provided in `annotation_table_link`, output from [Step 2d](#2d-load-annotation-metadata)) +- `ensembl_version` (reference organism Ensembl version indicated in the 'ensemblVersion' column of the GeneLab Annotations file provided in `annotation_table_link`, output from [Step 2d](#2d-load-annotation-metadata)) +- `annot_key` (keytype to join annotation table and microarray probes, dependent on organism, e.g. mus musculus uses 'ENSEMBL') +- `local_annotation_dir` (path to local annotation directory if using custom annotations, output from [Step 2d](#2d-load-annotation-metadata)) +- `annotation_config_path` (URL or path to annotation config file if using custom annotations, output from [Step 2d](#2d-load-annotation-metadata)) + + > Note: See [Affymetrix_array_annotations.csv](https://github.com/nasa/GeneLab_Data_Processing/tree/master/Microarray/Affymetrix/Array_Annotations/Affymetrix_array_annotations.csv) for the latest config file used at GeneLab. This file can also be created manually by following the [file specification](../Workflow_Documentation/NF_MAAffymetrix/examples/annotations/README.md). + +- `probeset_level_data` (R object containing probeset level expression values after summarization of normalized probeset level data, output from [Step 7](#7-probeset-summarization)) + +**Output Data:** + +- `unique_probe_ids` (R object containing probeset ID to gene annotation mappings) +- `probeset_expression_matrix.gene_mapped` (R object containing probeset level expression values after summarization of normalized probeset level data combined with gene annotations specified by [Biomart](https://bioconductor.org/packages/3.20/bioc/html/biomaRt.html) or custom annotations) + +
+ +### 8b. Summarize Gene Mapping + +```R +# Pie Chart with Percentages +slices <- c( + 'Unique Mapping' = nrow(probeset_expression_matrix.gene_mapped %>% dplyr::filter(count_gene_mappings == 1) %>% dplyr::distinct(ProbesetID)), + 'Multi Mapping' = nrow(probeset_expression_matrix.gene_mapped %>% dplyr::filter(count_gene_mappings > 1) %>% dplyr::distinct(ProbesetID)), + 'No Mapping' = nrow(probeset_expression_matrix.gene_mapped %>% dplyr::filter(count_gene_mappings == 0) %>% dplyr::distinct(ProbesetID)) +) +pct <- round(slices/sum(slices)*100) +chart_names <- names(slices) +chart_names <- glue::glue("{names(slices)} ({slices})") # add count to labels +chart_names <- paste(chart_names, pct) # add percents to labels +chart_names <- paste(chart_names,"%",sep="") # ad % to labels +pie(slices,labels = chart_names, col=rainbow(length(slices)), + main=glue::glue("Mapping to Primary Keytype\n {nrow(probeset_expression_matrix.gene_mapped %>% dplyr::distinct(ProbesetID))} Total Unique Probesets") + ) + +print(glue::glue("Unique Mapping Count: {slices[['Unique Mapping']]}")) +``` + +**Input Data:** + +- `probeset_expression_matrix.gene_mapped` (R object containing probeset level expression values after summarization of normalized probeset level data combined with gene annotations specified by [Biomart](https://bioconductor.org/packages/3.20/bioc/html/biomaRt.html) or custom annotations, output from [Step 8a](#8a-get-probeset-annotations) above) + +**Output Data:** + +- A pie chart denoting the gene mapping rates for each unique probeset ID +- A printout denoting the count of unique mappings for gene mapping + +
+ +### 8c. Generate Annotated Raw and Normalized Expression Tables + +```R +## Reorder columns before saving to file +ANNOTATIONS_COLUMN_ORDER = c( + annot_key, + "SYMBOL", + "GENENAME", + "REFSEQ", + "ENTREZID", + "STRING_id", + "GOSLIM_IDS" +) + +SAMPLE_COLUMN_ORDER <- df_rs$`Sample Name` + +probeset_expression_matrix.gene_mapped <- probeset_expression_matrix.gene_mapped %>% dplyr::rename( !!annot_key := ENSEMBL ) + +## Output column subset file with just normalized probeset level expression values +write.csv( + probeset_expression_matrix.gene_mapped[c( + ANNOTATIONS_COLUMN_ORDER, + "ProbesetID", + "count_gene_mappings", + "gene_mapping_source", + SAMPLE_COLUMN_ORDER) + ], file.path(DIR_NORMALIZED_EXPRESSION, "normalized_expression_probeset_GLmicroarray.csv"), row.names = FALSE) + +## Determine column order for probe level tables + +PROBE_INFO_COLUMN_ORDER = c( + "ProbesetID", + "ProbeID", + "count_gene_mappings", + "gene_mapping_source" +) + +FINAL_COLUMN_ORDER <- c( + ANNOTATIONS_COLUMN_ORDER, + PROBE_INFO_COLUMN_ORDER, + SAMPLE_COLUMN_ORDER +) + +## Generate raw intensity matrix that includes annotations + +background_corrected_data_annotated <- oligo::exprs(background_corrected_data) %>% + as.data.frame() %>% + tibble::rownames_to_column(var = "fid") %>% # Ensure rownames (probeset IDs) can be used as join key + dplyr::mutate(dplyr::across(fid, as.integer)) %>% # Ensure fid is integer type, consistent with getProbeInfo typing + dplyr::right_join(oligo::getProbeInfo(background_corrected_data), by = "fid") %>% # Add 'man_fsetid' via mapping based on fid + dplyr::rename( ProbesetID = man_fsetid ) %>% # Rename from getProbeInfo name to ProbesetID + dplyr::rename( ProbeID = fid ) %>% # Rename from getProbeInfo name to ProbeID + dplyr::left_join(unique_probe_ids, by = c("ProbesetID" = expected_attribute_name ) ) %>% # Join with biomaRt ENSEMBL mappings + dplyr::mutate( count_gene_mappings := ifelse(is.na(count_gene_mappings), 0, count_gene_mappings) ) %>% # Convert NA mapping to 0 + dplyr::mutate( gene_mapping_source := unique(unique_probe_ids$gene_mapping_source) ) %>% + dplyr::rename( !!annot_key := ENSEMBL ) + +## Perform reordering +background_corrected_data_annotated <- background_corrected_data_annotated %>% + dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER)) + +write.csv(background_corrected_data_annotated, file.path(DIR_RAW_DATA, "raw_intensities_probe_GLmicroarray.csv"), row.names = FALSE) + +## Generate normalized expression matrix that includes annotations +norm_data_matrix_annotated <- oligo::exprs(norm_data) %>% + as.data.frame() %>% + tibble::rownames_to_column(var = "fid") %>% # Ensure rownames (probeset IDs) can be used as join key + dplyr::mutate(dplyr::across(fid, as.integer)) %>% # Ensure fid is integer type, consistent with getProbeInfo typing + dplyr::right_join(oligo::getProbeInfo(norm_data), by = "fid") %>% # Add 'man_fsetid' via mapping based on fid + dplyr::rename( ProbesetID = man_fsetid ) %>% # Rename from getProbeInfo name to ProbesetID + dplyr::rename( ProbeID = fid ) %>% # Rename from getProbeInfo name to ProbeID + dplyr::left_join(unique_probe_ids, by = c("ProbesetID" = expected_attribute_name ) ) %>% + dplyr::mutate( count_gene_mappings := ifelse(is.na(count_gene_mappings), 0, count_gene_mappings) ) %>% # Convert NA mapping to 0 + dplyr::mutate( gene_mapping_source := unique(unique_probe_ids$gene_mapping_source) ) %>% + dplyr::rename( !!annot_key := ENSEMBL ) + +norm_data_matrix_annotated <- norm_data_matrix_annotated %>% + dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER)) + +write.csv(norm_data_matrix_annotated, file.path(DIR_NORMALIZED_EXPRESSION, "normalized_intensities_probe_GLmicroarray.csv"), row.names = FALSE) +``` + +**Input Data:** + +- `df_rs` (R dataframe containing information from the runsheet, output from [Step 2c](#2c-load-metadata-and-raw-data)) +- `annot_key` (keytype to join annotation table and microarray probes, dependent on organism, e.g. mus musculus uses 'ENSEMBL', defined in [Step 8a](#8a-get-probeset-annotations)) +- `probeset_expression_matrix.gene_mapped` (R object containing probeset level expression values after summarization of normalized probeset level data combined with gene annotations specified by [Biomart](https://bioconductor.org/packages/3.20/bioc/html/biomaRt.html) or custom annotations, output from [Step 8a](#8a-get-probeset-annotations) above) +- `background_corrected_data` (R object containing background-corrected microarray data, output from [Step 4](#4-background-correction)) +- `norm_data` (R object containing background-corrected and normalized microarray data, output from [Step 5](#5-between-array-normalization)) +- `unique_probe_ids` (R object containing probeset ID to gene annotation mappings, output from [Step 8a](#8a-get-probeset-annotations)) + +**Output Data:** + +- **normalized_expression_probeset_GLmicroarray.csv** (table containing the background corrected, normalized probeset expression values for each sample. The ProbesetID is the unique index column.) +- **raw_intensities_probe_GLmicroarray.csv** (table containing the background corrected, unnormalized probe intensity values for each sample including gene annotations. The ProbeID is the unique index column.) +- **normalized_intensities_probe_GLmicroarray.csv** (table containing the background corrected, normalized probe intensity values for each sample including gene annotations. The ProbeID is the unique index column.) + +## 9. Perform Probeset Differential Expression (DE) + +> Note: Run differential expression analysis only if there are at least 2 replicates per factor group. + +
+ +### 9a. Generate Design Matrix + +```R +# Loading metadata from runsheet csv file +design_data <- runsheet_to_design_matrix(runsheet) +design <- design_data$matrix + +# Write SampleTable.csv and contrasts.csv file +write.csv(design_data$groups, file.path(DIR_DGE, "SampleTable_GLmicroarray.csv"), row.names = FALSE) +write.csv(design_data$contrasts, file.path(DIR_DGE, "contrasts_GLmicroarray.csv")) +``` + +**Custom Functions Used:** + +- [runsheet_to_design_matrix()](#runsheet_to_design_matrix) + +**Input Data:** + +- `runsheet` (path to runsheet, output from [Step 1](#1-create-sample-runsheet)) + +**Output Data:** + +- `design_data` (a list of R objects containing the sample information and metadata + - `design_data$matrix` - the limma study design matrix, indicating the group that each sample belongs to + - `design_data$mapping` - a dataframe of conditions and group names + - `design_data$groups` - a dataframe of samples and group names + - `design_data$contrasts` - a matrix of all pairwise comparisons of the groups) +- `design` (R object containing the limma study design matrix, indicating the group that each sample belongs to) +- **SampleTable_GLmicroarray.csv** (table containing samples and their respective groups) +- **contrasts_GLmicroarray.csv** (table containing all pairwise comparisons) + +
+ +### 9b. Perform Individual Probeset Level DE + +```R +# Calculate results +res <- lm_fit_pairwise(probeset_level_data, design) + +# Print DE table, without filtering +limma::write.fit(res, adjust = 'BH', + file = "INTERIM.csv", + row.names = FALSE, + quote = TRUE, + sep = ",") +``` + +**Custom Functions Used:** + +- [lm_fit_pairwise()](#lm_fit_pairwise) + +**Input Data:** + +- `probeset_level_data` (R object containing probeset level expression values after summarization of normalized probeset level data, output from [Step 7](#7-probeset-summarization)) +- `design` (R object containing the limma study design matrix, indicating the group that each sample belongs to, output from [Step 9a](#9a-generate-design-matrix) above) + +**Output Data:** + +- INTERIM.csv (statistical values from individual probeset level DE analysis, including: + - Log2fc between all pairwise comparisons + - T statistic for all pairwise comparison tests + - P value for all pairwise comparison tests + - Adjusted P value for all pairwise comparison tests) + +
+ +### 9c. Add Annotation and Stats Columns and Format DE Table + +```R +## Reformat Table for consistency across DE analyses tables within GeneLab ## + +# Read in DE table +df_interim <- read.csv("INTERIM.csv") + +# Bind columns from gene mapped expression table +df_interim <- df_interim %>% + dplyr::bind_cols(probeset_expression_matrix.gene_mapped) + +df_interim <- df_interim %>% dplyr::rename_with(reformat_names, .cols = matches('\\.condition'), group_name_mapping = design_data$mapping) + + +## Add Group Wise Statistics ## + +# Group mean and standard deviations for normalized expression values are computed and added to the table + +unique_groups <- unique(design_data$group$group) +for ( i in seq_along(unique_groups) ) { + current_group <- unique_groups[i] + current_samples <- design_data$group %>% + dplyr::group_by(group) %>% + dplyr::summarize( + samples = sort(unique(sample)) + ) %>% + dplyr::filter( + group == current_group + ) %>% + dplyr::pull() + + print(glue::glue("Computing mean and standard deviation for Group {i} of {length(unique_groups)}")) + print(glue::glue("Group: {current_group}")) + print(glue::glue("Samples in Group: '{toString(current_samples)}'")) + + df_interim <- df_interim %>% + dplyr::mutate( + "Group.Mean_{current_group}" := rowMeans(dplyr::select(., all_of(current_samples))), + "Group.Stdev_{current_group}" := matrixStats::rowSds(as.matrix(dplyr::select(., all_of(current_samples)))), + ) %>% + dplyr::ungroup() %>% + as.data.frame() +} + +df_interim <- df_interim %>% + dplyr::mutate( + "All.mean" := rowMeans(dplyr::select(., all_of(SAMPLE_COLUMN_ORDER))), + "All.stdev" := matrixStats::rowSds(as.matrix(dplyr::select(., all_of(SAMPLE_COLUMN_ORDER)))), + ) %>% + dplyr::ungroup() %>% + as.data.frame() + +print("Remove extra columns from final table") + +# These columns are data mapped to column PROBEID as per the original Manufacturer and can be linked as needed +colnames_to_remove = c( + "AveExpr" # Replaced by 'All.mean' column +) + +df_interim <- df_interim %>% dplyr::select(-any_of(colnames_to_remove)) + +PROBE_INFO_COLUMN_ORDER = c( + "ProbesetID", + "count_gene_mappings", + "gene_mapping_source" +) + +STAT_COLUMNS_ORDER <- generate_prefixed_column_order( + subjects = colnames(design_data$contrasts), + prefixes = c( + "Log2fc_", + "T.stat_", + "P.value_", + "Adj.p.value_" + ) + ) +ALL_SAMPLE_STATS_COLUMNS_ORDER <- c( + "All.mean", + "All.stdev", + "F", + "F.p.value" +) + +GROUP_MEAN_STDEV_COLUMNS_ORDER <- generate_prefixed_column_order( + subjects = unique(design_data$groups$group), + prefixes = c( + "Group.Mean_", + "Group.Stdev_" + ) +) + +FINAL_COLUMN_ORDER <- c( + ANNOTATIONS_COLUMN_ORDER, + PROBE_INFO_COLUMN_ORDER, + SAMPLE_COLUMN_ORDER, + STAT_COLUMNS_ORDER, + ALL_SAMPLE_STATS_COLUMNS_ORDER, + GROUP_MEAN_STDEV_COLUMNS_ORDER +) + +## Assert final column order includes all columns from original table +if (!setequal(FINAL_COLUMN_ORDER, colnames(df_interim))) { + write.csv(FINAL_COLUMN_ORDER, "FINAL_COLUMN_ORDER.csv") + NOT_IN_DF_INTERIM <- paste(setdiff(FINAL_COLUMN_ORDER, colnames(df_interim)), collapse = ":::") + NOT_IN_FINAL_COLUMN_ORDER <- paste(setdiff(colnames(df_interim), FINAL_COLUMN_ORDER), collapse = ":::") + stop(glue::glue("Column reordering attempt resulted in different sets of columns than original. Names unique to 'df_interim': {NOT_IN_FINAL_COLUMN_ORDER}. Names unique to 'FINAL_COLUMN_ORDER': {NOT_IN_DF_INTERIM}.")) +} + +## Perform reordering +df_interim <- df_interim %>% dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER)) + +# Save to file +write.csv(df_interim, file.path(DIR_DGE, "differential_expression_GLmicroarray.csv"), row.names = FALSE) +``` + +**Custom Functions Used:** + +- [reformat_names()](#reformat_names) +- [generate_prefixed_column_order()](#generate_prefixed_column_order) + +**Input Data:** + +- `design_data` (a list of R objects containing the sample information and metadata, output from [Step 9a](#9a-generate-design-matrix) above) +- INTERIM.csv (statistical values from individual probeset level DE analysis, output from [Step 9b](#9b-perform-individual-probeset-level-de) above) +- `probeset_expression_matrix.gene_mapped` (R object containing probeset level expression values after summarization of normalized probeset level data combined with gene annotations specified by [Biomart](https://bioconductor.org/packages/3.20/bioc/html/biomaRt.html) or custom annotations, output from [Step 8a](#8a-get-probeset-annotations) above) + +**Output Data:** + +- **differential_expression_GLmicroarray.csv** (table containing normalized probeset expression values for each sample, group statistics, Limma probeset DE results for each pairwise comparison, and gene annotations. The ProbesetID is the unique index column.) + +> All steps of the Microarray pipeline are performed using R markdown and the completed R markdown is rendered (via Quarto) as an html file (**NF_MAAffymetrix_v\*_GLmicroarray.html**) and published in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/) for the respective dataset. diff --git a/Microarray/Affymetrix/README.md b/Microarray/Affymetrix/README.md index 6743ea3a..5758b9f2 100644 --- a/Microarray/Affymetrix/README.md +++ b/Microarray/Affymetrix/README.md @@ -1,7 +1,7 @@ # GeneLab bioinformatics processing pipeline for Affymetrix microarray data -> **The document [`GL-DPPD-7114.md`](Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md) holds an overview and example commands for how GeneLab processes Affymetrix microarray datasets. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and processing code is provided for each GLDS dataset along with the processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).** +> **The document [`GL-DPPD-7114-A.md`](Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md) holds an overview and example commands for how GeneLab processes Affymetrix microarray datasets. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and processing code is provided for each GLDS dataset along with the processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).** --- @@ -20,6 +20,10 @@ - Contains instructions for installing and running the GeneLab NF_MAAffymetrix workflow +* [**Array_Annotations**](Array_Annotations) + + - Contains the custom annotations table used in the GeneLab NF_MAAffymetrix + --- **Developed by:** Jonathan Oribello diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/CHANGELOG.md b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/CHANGELOG.md index de658e22..5414e570 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/CHANGELOG.md +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/CHANGELOG.md @@ -5,11 +5,11 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). -## [1.0.5](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAffymetrix_1.0.5/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix) - 2024-08-30 +## [1.0.5](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAffymetrix_1.0.5/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix) - 2025-03-03 ### Added -- Add support for bacteria annotations using manufacturer annotations ([#113](https://github.com/nasa/GeneLab_Data_Processing/issues/113)) +- Support for custom annotations, see [specification](examples/annotations/README.md) ([#113](https://github.com/nasa/GeneLab_Data_Processing/issues/113)) - Add option to skip differential expression analysis (`--skipDE`) ([#104](https://github.com/nasa/GeneLab_Data_Processing/issues/104)) ### Changed diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/README.md b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/README.md index f21d727a..2354426e 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/README.md +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/README.md @@ -4,7 +4,7 @@ ### Implementation Tools -The current GeneLab Affymetrix Microarray consensus processing pipeline (NF_MAAffymetrix), [GL-DPPD-7114](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md), is implemented as a [Nextflow](https://nextflow.io/) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in containers. This workflow (NF_MAAffymetrix) is run using the command line interface (CLI) of any unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, [the Nextflow documentation](https://nextflow.io/docs/latest/index.html) is a useful resource for users who want to modify and/or extend this workflow. +The current GeneLab Affymetrix Microarray consensus processing pipeline (NF_MAAffymetrix), [GL-DPPD-7114-A](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md), is implemented as a [Nextflow](https://nextflow.io/) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in containers. This workflow (NF_MAAffymetrix) is run using the command line interface (CLI) of any unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, [the Nextflow documentation](https://nextflow.io/docs/latest/index.html) is a useful resource for users who want to modify and/or extend this workflow. ### Workflow & Subworkflows @@ -14,8 +14,8 @@ The current GeneLab Affymetrix Microarray consensus processing pipeline (NF_MAAf --- The NF_MAAffymetrix workflow is composed of three subworkflows as shown in the image above. -Below is a description of each subworkflow and the additional output files generated that are not already indicated in the [GL-DPPD-7114 pipeline -document](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md): +Below is a description of each subworkflow and the additional output files generated that are not already indicated in the [GL-DPPD-7114-A pipeline +document](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md): 1. **Analysis Staging Subworkflow** @@ -26,7 +26,7 @@ document](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md): 2. **Affymetrix Microarray Processing Subworkflow** - Description: - - This subworkflow uses the staged raw data and metadata parameters from the Analysis Staging Subworkflow to generate processed data using the [GL-DPPD-7114 pipeline](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md). + - This subworkflow uses the staged raw data and metadata parameters from the Analysis Staging Subworkflow to generate processed data using the [GL-DPPD-7114-A pipeline](../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md). 1. **V&V Pipeline Subworkflow** @@ -200,7 +200,7 @@ All R code steps and output are rendered within a Quarto document yielding the f The outputs from the Analysis Staging and V&V Pipeline Subworkflows are described below: -> Note: The outputs from the Affymetrix Microarray Processing Subworkflow are documented in the [GL-DPPD-7114.md](../../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md) processing protocol. +> Note: The outputs from the Affymetrix Microarray Processing Subworkflow are documented in the [GL-DPPD-7114-A.md](../../../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md) processing protocol. **Analysis Staging Subworkflow** diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/annotations/README.md b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/annotations/README.md new file mode 100644 index 00000000..62131ed3 --- /dev/null +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/annotations/README.md @@ -0,0 +1,30 @@ +# Custom Annotations Specification + +## Description + +* If using custom gene annotations when processing Affymetrix datasets through GeneLab's Affymetrix processing pipeline, a csv config file must be provided as specified below. +* See [Affymetrix_array_annotations.csv](https://github.com/nasa/GeneLab_Data_Processing/tree/master/Microarray/Affymetrix/Array_Annotations/Affymetrix_array_annotations.csv) + for the latest config file used at GeneLab. + + +## Example + +- [config.csv](config.csv) + + +## Required columns + +| Column Name | Type | Description | Example | +|:------------|:-----|:------------|:--------| +| array_design | string | A bioMart attribute identifier denoting the microarray probe/probeset attribute used for annotation mapping. | AFFY E coli Genome 2 0 | +| annot_type | string | Used to determine how the custom annotations are parsed before merging to the data. Currently, only the below are supported: | 3prime-IVT | +| annot_filename | string | Name of the custom annotations file. | E_coli_2.na36.annot.csv | + +## Optional columns +If the file was downloaded from a website, provide the download link used and date +downloaded in additional columns after the required column for traceability. + +| Column Name | Type | Description | Example | +|:------------|:-----|:------------|:--------| +| download_link | string | The URL used to retrieve the annotation file. | https://www.thermofisher.com/order/catalog/product/sec/assets?url=TFS-Assets/LSG/Support-Files/E_coli_2-na36-annot-csv.zip | +| download_date | date string | The date the file was retrieved in YYYY-MM-DD format. | 2024-06-15 | diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/annotations/config.csv b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/annotations/config.csv new file mode 100644 index 00000000..5c8dff73 --- /dev/null +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/annotations/config.csv @@ -0,0 +1,3 @@ +array_design,annot_type,annot_filename +AFFY E coli Genome 2 0,3prime-IVT,E_coli_2.na36.annot.csv +AFFY GeneChip P. aeruginosa Genome,3prime-IVT,Pae_G1a.na36.annot.csv diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/runsheet/OSD-213_microarray_v0_runsheet.csv b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/runsheet/OSD-213_microarray_v0_runsheet.csv new file mode 100644 index 00000000..f309c854 --- /dev/null +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/runsheet/OSD-213_microarray_v0_runsheet.csv @@ -0,0 +1,7 @@ +Sample Name,Study Assay Measurement,Study Assay Technology Type,Study Assay Technology Platform,organism,biomart_attribute,Source Name,Label,Array Data File Name,Array Data File Path,Comment[Array Data File Name],Factor Value[Spaceflight],Factor Value[Altered Gravity],Original Sample Name +Atha_Col-0_clsCC_FLT_1G_Rep1,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_1,biotin,GLDS-213_microarray_FC_front.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_FC_front.CEL.gz,GLDS-213_microarray_FC_front.CEL.gz,Space Flight,1G by centrifugation,Atha_Col-0_clsCC_FLT_1G_Rep1 +Atha_Col-0_clsCC_FLT_1G_Rep2,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_2,biotin,GLDS-213_microarray_FC_rear.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_FC_rear.CEL.gz,GLDS-213_microarray_FC_rear.CEL.gz,Space Flight,1G by centrifugation,Atha_Col-0_clsCC_FLT_1G_Rep2 +Atha_Col-0_clsCC_FLT_uG_Rep1,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_3,biotin,GLDS-213_microarray_FS_front.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_FS_front.CEL.gz,GLDS-213_microarray_FS_front.CEL.gz,Space Flight,uG,Atha_Col-0_clsCC_FLT_uG_Rep1 +Atha_Col-0_clsCC_FLT_uG_Rep2,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_4,biotin,GLDS-213_microarray_FS_rear.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_FS_rear.CEL.gz,GLDS-213_microarray_FS_rear.CEL.gz,Space Flight,uG,Atha_Col-0_clsCC_FLT_uG_Rep2 +Atha_Col-0_clsCC_GC_1G_Rep1,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_5,biotin,GLDS-213_microarray_GS_front.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_GS_front.CEL.gz,GLDS-213_microarray_GS_front.CEL.gz,Ground Control,1G on Earth,Atha_Col-0_clsCC_GC_1G_Rep1 +Atha_Col-0_clsCC_GC_1G_Rep2,transcription profiling,DNA microarray,Affymetrix,Arabidopsis thaliana,AFFY ATH1 121501,Culture cells_6,biotin,GLDS-213_microarray_GS_rear.CEL.gz,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-213/download?source=datamanager&file=GLDS-213_microarray_GS_rear.CEL.gz,GLDS-213_microarray_GS_rear.CEL.gz,Ground Control,1G on Earth,Atha_Col-0_clsCC_GC_1G_Rep2 diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/runsheet/OSD-3_microarray_v0_runsheet.csv b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/runsheet/OSD-3_microarray_v0_runsheet.csv new file mode 100644 index 00000000..fdc974de --- /dev/null +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/runsheet/OSD-3_microarray_v0_runsheet.csv @@ -0,0 +1,19 @@ +Sample Name,Study Assay Measurement,Study Assay Technology Type,Study Assay Technology Platform,organism,biomart_attribute,Source Name,Label,Array Data File Name,Array Data File Path,Comment[Array Data File Name],Factor Value[Developmental Stage],Factor Value[Spaceflight],Original Sample Name +Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep1,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588931 1,biotin,GLDS-3_microarray_GSM588948.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588948.CEL,GLDS-3_microarray_GSM588948.CEL,third instar larva stage,Space Flight,Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep1 +Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep2,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588932 1,biotin,GLDS-3_microarray_GSM588947.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588947.CEL,GLDS-3_microarray_GSM588947.CEL,third instar larva stage,Space Flight,Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep2 +Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep3,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588933 1,biotin,GLDS-3_microarray_GSM588946.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588946.CEL,GLDS-3_microarray_GSM588946.CEL,third instar larva stage,Space Flight,Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep3 +Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep4,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588934 1,biotin,GLDS-3_microarray_GSM588945.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588945.CEL,GLDS-3_microarray_GSM588945.CEL,third instar larva stage,Space Flight,Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep4 +Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep5,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588935 1,biotin,GLDS-3_microarray_GSM588944.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588944.CEL,GLDS-3_microarray_GSM588944.CEL,third instar larva stage,Space Flight,Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep5 +Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep6,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588936 1,biotin,GLDS-3_microarray_GSM588943.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588943.CEL,GLDS-3_microarray_GSM588943.CEL,third instar larva stage,Space Flight,Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep6 +Dmel_Hml-GAL4-UAS-GFP_wo_FLT_Adult_Rep1,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588937 1,biotin,GLDS-3_microarray_GSM588942.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588942.CEL,GLDS-3_microarray_GSM588942.CEL,adult,Space Flight,Dmel_Hml-GAL4-UAS-GFP_wo_FLT_Adult_Rep1 +Dmel_Hml-GAL4-UAS-GFP_wo_FLT_Adult_Rep2,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588938 1,biotin,GLDS-3_microarray_GSM588941.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588941.CEL,GLDS-3_microarray_GSM588941.CEL,adult,Space Flight,Dmel_Hml-GAL4-UAS-GFP_wo_FLT_Adult_Rep2 +Dmel_Hml-GAL4-UAS-GFP_wo_FLT_Adult_Rep3,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588939 1,biotin,GLDS-3_microarray_GSM588940.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588940.CEL,GLDS-3_microarray_GSM588940.CEL,adult,Space Flight,Dmel_Hml-GAL4-UAS-GFP_wo_FLT_Adult_Rep3 +Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep1,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588940 1,biotin,GLDS-3_microarray_GSM588939.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588939.CEL,GLDS-3_microarray_GSM588939.CEL,third instar larva stage,Ground Control,Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep1 +Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep2,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588941 1,biotin,GLDS-3_microarray_GSM588938.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588938.CEL,GLDS-3_microarray_GSM588938.CEL,third instar larva stage,Ground Control,Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep2 +Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep3,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588942 1,biotin,GLDS-3_microarray_GSM588937.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588937.CEL,GLDS-3_microarray_GSM588937.CEL,third instar larva stage,Ground Control,Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep3 +Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep4,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588943 1,biotin,GLDS-3_microarray_GSM588936.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588936.CEL,GLDS-3_microarray_GSM588936.CEL,third instar larva stage,Ground Control,Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep4 +Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep5,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588944 1,biotin,GLDS-3_microarray_GSM588935.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588935.CEL,GLDS-3_microarray_GSM588935.CEL,third instar larva stage,Ground Control,Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep5 +Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep6,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588945 1,biotin,GLDS-3_microarray_GSM588934.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588934.CEL,GLDS-3_microarray_GSM588934.CEL,third instar larva stage,Ground Control,Dmel_Hml-GAL4-UAS-GFP_wo_GC_3rd-Instar-Larva_Rep6 +Dmel_Hml-GAL4-UAS-GFP_wo_GC_Adult_Rep1,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588946 1,biotin,GLDS-3_microarray_GSM588933.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588933.CEL,GLDS-3_microarray_GSM588933.CEL,adult,Ground Control,Dmel_Hml-GAL4-UAS-GFP_wo_GC_Adult_Rep1 +Dmel_Hml-GAL4-UAS-GFP_wo_GC_Adult_Rep2,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588947 1,biotin,GLDS-3_microarray_GSM588932.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588932.CEL,GLDS-3_microarray_GSM588932.CEL,adult,Ground Control,Dmel_Hml-GAL4-UAS-GFP_wo_GC_Adult_Rep2 +Dmel_Hml-GAL4-UAS-GFP_wo_GC_Adult_Rep3,transcription profiling,DNA microarray,Affymetrix,Drosophila melanogaster,AFFY Drosophila 2,GSM588948 1,biotin,GLDS-3_microarray_GSM588931.CEL,https://genelab-data.ndc.nasa.gov/geode-py/ws/studies/OSD-3/download?source=datamanager&file=GLDS-3_microarray_GSM588931.CEL,GLDS-3_microarray_GSM588931.CEL,adult,Ground Control,Dmel_Hml-GAL4-UAS-GFP_wo_GC_Adult_Rep3 diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/runsheet/README.md b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/runsheet/README.md new file mode 100644 index 00000000..c4927a42 --- /dev/null +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/examples/runsheet/README.md @@ -0,0 +1,23 @@ +# Runsheet Specification + +## Description + +* The Runsheet is a csv file that contains the metadata required for processing Affymetrix datasets through GeneLab's Affymetrix processing pipeline. + + +## Examples + +1. [Runsheet for GLDS-3](OSD-3_microarray_v0_runsheet.csv) +2. [Runsheet for GLDS-213](OSD-213_microarray_v0_runsheet.csv) + + +## Required columns + +| Column Name | Type | Description | Example | +|:------------|:-----|:------------|:--------| +| Sample Name | string | Sample Name, added as a prefix to sample-specific processed data output files. Should not include spaces or weird characters. | Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep1 | +| biomart_attribute | string | A bioMart attribute identifier denoting the microarray probe/probeset attribute used for annotation mapping. | AFFY Drosophila 2 | +| organism | string | Species name used to map to the appropriate gene annotations file. | Drosophila melanogaster | +| Array Data File Path | string (url or local path) | Location of the raw data file for the sample. | /my/data/sample_1.CEL | +| Factor Value[] | string | A set of one or more columns specifying the experimental group the sample belongs to. In the simplest form, a column named 'Factor Value[group]' is sufficient. | Space Flight | +| Original Sample Name | string | Used to map the sample name that will be used for processing to the original sample name. This is often identical except in cases where the original name includes spaces or weird characters. | Dmel_Hml-GAL4-UAS-GFP_wo_FLT_3rd-Instar-Larva_Rep1 | diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/Affymetrix.qmd b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/Affymetrix.qmd index c5f3bec0..2f024992 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/Affymetrix.qmd +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/Affymetrix.qmd @@ -1,6 +1,6 @@ --- title: "Affymetrix Processing" -subtitle: "Workflow Version: NF_MAAffymetrix_1.0.5" +subtitle: "`r paste0('Workflow Version: NF_MAAffymetrix_', params$workflow_version)`" date: now title-block-banner: true format: @@ -14,12 +14,14 @@ format: number-sections: true params: + workflow_version: NULL id: NULL # str, used to name output files runsheet: NULL # str, path to runsheet biomart_attribute: NULL # str, used as a fallback value if 'Array Design REF' column is not found in the runsheet - annotation_file_path: NULL # str, Annotation file from 'genelab_annots_link' column of https://github.com/nasa/GeneLab_Data_Processing/blob/GL_RefAnnotTable_1.0.0/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv + annotation_file_path: NULL # str, Annotation file from 'genelab_annots_link' column of GeneLab Annotations file ensembl_version: NULL # str, Used to determine ensembl version local_annotation_dir: NULL + annotation_config_path: NULL DEBUG_limit_biomart_query: NULL # int, If supplied, only the first n probeIDs are queried run_DE: 'true' @@ -44,7 +46,11 @@ if (is.null(params$runsheet)) { stop("PARAMETERIZATION ERROR: Must supply runsheet path") } -runsheet = params$runsheet # +runsheet <- params$runsheet # + +# If using custom annotation, local_annotation_dir is path to directory containing annotation file and annotation_config_path is path/url to config file +local_annotation_dir <- params$local_annotation_dir # +annotation_config_path <- params$annotation_config_path # message(params) @@ -65,7 +71,7 @@ dir.create(DIR_DGE) original_par <- par() options(preferRaster=TRUE) # use Raster when possible to avoid antialiasing artifacts in images -options(timeout=1000) +options(timeout=1000) # ensure enough time for data downloads ``` ## Load Metadata and Raw Data @@ -114,7 +120,7 @@ DT::datatable(df_rs) print("Here are the expected comparison groups") # NON_DPPD:END print("Loading Raw Data...") # NON_DPPD -allTrue <- function(i_vector) { +all_true <- function(i_vector) { if ( length(i_vector) == 0 ) { stop(paste("Input vector is length zero")) } @@ -122,13 +128,13 @@ allTrue <- function(i_vector) { } # Define paths to raw data files -runsheetPathsAreURIs <- function(df_runsheet) { - allTrue(stringr::str_starts(df_runsheet$`Array Data File Path`, "https")) +runsheet_paths_are_URIs <- function(df_runsheet) { + all_true(stringr::str_starts(df_runsheet$`Array Data File Path`, "https")) } # Download raw data files -downloadFilesFromRunsheet <- function(df_runsheet) { +download_files_from_runsheet <- function(df_runsheet) { urls <- df_runsheet$`Array Data File Path` destinationFiles <- df_runsheet$`Array Data File Name` @@ -145,9 +151,9 @@ downloadFilesFromRunsheet <- function(df_runsheet) { } -if ( runsheetPathsAreURIs(df_rs) ) { +if ( runsheet_paths_are_URIs(df_rs) ) { print("Determined Raw Data Locations are URIS") - local_paths <- retry_with_delay(downloadFilesFromRunsheet, df_rs) + local_paths <- retry_with_delay(download_files_from_runsheet, df_rs) } else { print("Or Determined Raw Data Locations are local paths") local_paths <- df_rs$`Array Data File Path` @@ -155,7 +161,7 @@ if ( runsheetPathsAreURIs(df_rs) ) { # uncompress files if needed -if ( allTrue(stringr::str_ends(local_paths, ".gz")) ) { +if ( all_true(stringr::str_ends(local_paths, ".gz")) ) { print("Determined these files are gzip compressed... uncompressing now") # This does the uncompression lapply(local_paths, R.utils::gunzip, remove = FALSE, overwrite = TRUE) @@ -180,9 +186,9 @@ DT::datatable(df_local_paths) # Retry with delay here to accomodate oligo's automatic loading of annotation packages and occasional internet related failures to load raw_data <- retry_with_delay( oligo::read.celfiles, - df_local_paths$`Local Paths`, - sampleNames = df_local_paths$`Sample Name`# Map column names as Sample Names (instead of default filenames) - ) + df_local_paths$`Local Paths`, + sampleNames = df_local_paths$`Sample Name`# Map column names as Sample Names (instead of default filenames) + ) print(str(raw_data)) @@ -196,6 +202,9 @@ message(paste0("Number of Probes: ", dim(raw_data)[1])) # NON_DPPD DT::datatable(raw_data$targets, caption = "Sample to File Mapping") DT::datatable(head(raw_data$genes, n = 20), caption = "First 20 rows of raw data file embedded probes to genes table") # NON_DPPD:END + +annotation_file_path <- params$annotation_file_path +ensembl_version <- params$ensembl_version ``` ## QA For Raw Data @@ -461,9 +470,9 @@ par(original_par) #| message: false # Call RMA but skip normalize and background correction since those have already been applied probeset_level_data <- oligo::rma(norm_data, - normalize=FALSE, - background=FALSE, - ) + normalize=FALSE, + background=FALSE + ) # Summarize background-corrected and normalized data print("Summarized Probeset Level Data Below") # NON_DPPD @@ -479,15 +488,13 @@ DT::datatable(head(raw_data$genes, n = 20), caption = "First 20 rows of raw data # NON_DPPD:END ``` -## Perform Probeset Differential Expression and Annotation +## Probeset Annotations -### Probeset Differential Expression (DE) - -#### Add Probeset Annotations +### Get Probeset Annotations ``` {r retrieve-probeset-annotations} #| message: false -shortenedOrganismName <- function(long_name) { +shortened_organism_name <- function(long_name) { #' Convert organism names like 'Homo Sapiens' into 'hsapiens' tokens <- long_name %>% stringr::str_split(" ", simplify = TRUE) genus_name <- tokens[1] @@ -499,7 +506,7 @@ shortenedOrganismName <- function(long_name) { return(short_name) } -getBioMartAttribute <- function(df_rs) { +get_biomart_attribute <- function(df_rs) { #' Returns resolved biomart attribute source from runsheet # NON_DPPD:START #' this either comes from the runsheet or as a fall back, the parameters injected during render @@ -554,151 +561,185 @@ get_ensembl_genomes_mappings_from_ftp <- function(organism, ensembl_genomes_port return(mapping) } +# Convert list of multi-mapped genes to string +list_to_unique_piped_string <- function(str_list) { + #! convert lists into strings denoting unique elements separated by '|' characters + #! e.g. c("GO1","GO2","GO2","G03") -> "GO1|GO2|GO3" + return(toString(unique(str_list)) %>% stringr::str_replace_all(pattern = stringr::fixed(", "), replacement = "|")) +} -organism <- shortenedOrganismName(unique(df_rs$organism)) - -if (organism %in% c('ecoli', 'paeruginosa')) { - expected_attribute_name <- 'ProbesetID' - - annot_file <- c( - 'ecoli' = 'E_coli_2.na36.annot.csv', - 'paeruginosa' = 'Pae_G1a.na36.annot.csv' - ) - - df_mapping <- read.csv( - file.path(params$local_annotation_dir, annot_file[[organism]]), - skip = 13, header = TRUE, na.strings = c('NA', '---') - )[c('Probe.Set.ID', 'Entrez.Gene', 'Gene.Symbol', 'Gene.Title', 'Ensembl', 'RefSeq.Transcript.ID', 'RefSeq.Protein.ID', 'Gene.Ontology.Biological.Process', 'Gene.Ontology.Cellular.Component', 'Gene.Ontology.Molecular.Function')] - - # Clean columns - df_mapping$Gene.Symbol <- purrr::map_chr(stringr::str_split(df_mapping$Gene.Symbol, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) - df_mapping$Gene.Title <- purrr::map_chr(stringr::str_split(df_mapping$Gene.Title, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) - df_mapping$Entrez.Gene <- purrr::map_chr(stringr::str_split(df_mapping$Entrez.Gene, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) - df_mapping$Ensembl <- purrr::map_chr(stringr::str_split(df_mapping$Ensembl, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) - - df_mapping$RefSeq <- paste(df_mapping$RefSeq.Transcript.ID, df_mapping$RefSeq.Protein.ID) - df_mapping$RefSeq <- purrr::map_chr(stringr::str_extract_all(df_mapping$RefSeq, '[A-Z]+_[\\d.]+'), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('^$', NA_character_) - - df_mapping$GO <- paste(df_mapping$Gene.Ontology.Biological.Process, df_mapping$Gene.Ontology.Cellular.Component, df_mapping$Gene.Ontology.Molecular.Function) - df_mapping$GO <- purrr::map_chr(stringr::str_extract_all(df_mapping$GO, '\\d{7}'), ~paste0('GO:', unique(.), collapse = "|")) %>% stringr::str_replace('^GO:$', NA_character_) - df_mapping <- df_mapping[c('Probe.Set.ID', 'Entrez.Gene', 'Gene.Symbol', 'Gene.Title', 'Ensembl', 'RefSeq', 'GO')] - names(df_mapping) <- c('ProbesetID', 'ENTREZID', 'SYMBOL', 'GENENAME', 'ENSEMBL', 'REFSEQ', 'GOSLIM_IDS') +organism <- shortened_organism_name(unique(df_rs$organism)) +annot_key <- ifelse(organism %in% c("athaliana"), 'TAIR', 'ENSEMBL') - df_mapping$STRING_id <- NA_character_ -} else if (organism %in% c("athaliana")) { - ensembl_genomes_version = params$ensembl_version +if (organism %in% c("athaliana")) { + ENSEMBL_VERSION = ensembl_version ensembl_genomes_portal = "plants" - print(glue::glue("Using ensembl genomes ftp to get specific version of probeset id mapping table. Ensembl genomes portal: {ensembl_genomes_portal}, version: {ensembl_genomes_version}")) - expected_attribute_name <- getBioMartAttribute(df_rs) + print(glue::glue("Using ensembl genomes ftp to get specific version of probeset id mapping table. Ensembl genomes portal: {ensembl_genomes_portal}, version: {ENSEMBL_VERSION}")) + expected_attribute_name <- get_biomart_attribute(df_rs) df_mapping <- retry_with_delay( - get_ensembl_genomes_mappings_from_ftp, + get_ensembl_genomes_mappings_from_ftp, organism = organism, ensembl_genomes_portal = ensembl_genomes_portal, - ensembl_genomes_version = ensembl_genomes_version, + ensembl_genomes_version = ENSEMBL_VERSION, biomart_attribute = expected_attribute_name ) # TAIR from the mapping tables tend to be in the format 'AT1G01010.1' but the raw data has 'AT1G01010' # So here we remove the '.NNN' from the mapping table where .NNN is any number df_mapping$ensembl_gene_id <- stringr::str_replace_all(df_mapping$ensembl_gene_id, "\\.\\d+$", "") + + use_custom_annot <- FALSE } else { # Use biomart from main Ensembl website which archives keep each release on the live service # locate dataset - expected_dataset_name <- shortenedOrganismName(unique(df_rs$organism)) %>% stringr::str_c("_gene_ensembl") + expected_dataset_name <- shortened_organism_name(unique(df_rs$organism)) %>% stringr::str_c("_gene_ensembl") print(paste0("Expected dataset name: '", expected_dataset_name, "'")) message(paste0("Expected dataset name: '", expected_dataset_name, "'")) # NON_DPPD + expected_attribute_name <- get_biomart_attribute(df_rs) + print(paste0("Expected attribute name: '", expected_attribute_name, "'")) + message(paste0("Expected attribute name: '", expected_attribute_name, "'")) # NON_DPPD # Specify Ensembl version used in current GeneLab reference annotations - ENSEMBL_VERSION <- params$ensembl_version + ENSEMBL_VERSION <- ensembl_version print(paste0("Searching for Ensembl Version: ", ENSEMBL_VERSION)) # NON_DPPD print(glue::glue("Using Ensembl biomart to get specific version of mapping table. Ensembl version: {ENSEMBL_VERSION}")) - ensembl <- biomaRt::useEnsembl(biomart = "genes", - dataset = expected_dataset_name, - version = ENSEMBL_VERSION) - print(ensembl) - - expected_attribute_name <- getBioMartAttribute(df_rs) - print(paste0("Expected attribute name: '", expected_attribute_name, "'")) - message(paste0("Expected attribute name: '", expected_attribute_name, "'")) # NON_DPPD + # Check if organism/array design is supported in biomart + use_custom_annot <- TRUE - # Some probe_ids for affy_hta_2_0 may end in .hg.1 instead of .hg (how it is in biomaRt), leading to 0 results returned - if (expected_attribute_name == 'affy_hta_2_0') { - rownames(probeset_level_data) <- stringr::str_replace(rownames(probeset_level_data), '\\.hg\\.1$', '.hg') + ensembl <- biomaRt::useEnsembl(biomart = "genes", version = ENSEMBL_VERSION) + ensembl_datasets <- biomaRt::listDatasets(ensembl) + if (expected_dataset_name %in% ensembl_datasets$dataset) { + ensembl <- biomaRt::useEnsembl(biomart = "genes", dataset = expected_dataset_name, version = ENSEMBL_VERSION) + ensembl_attributes <- biomaRt::listAttributes(ensembl) + if (expected_attribute_name %in% ensembl_attributes$name) { + use_custom_annot <- FALSE + } } - probe_ids <- rownames(probeset_level_data) + if (use_custom_annot) { + unloadNamespace("biomaRt") + } else { - # DEBUG:START - if ( is.integer(params$DEBUG_limit_biomart_query) ) { - warning(paste("DEBUG MODE: Limiting query to", params$DEBUG_limit_biomart_query, "entries")) - message(paste("DEBUG MODE: Limiting query to", params$DEBUG_limit_biomart_query, "entries")) - probe_ids <- probe_ids[1:params$DEBUG_limit_biomart_query] - } - # DEBUG:END - - # Create probe map - # Run Biomart Queries in chunks to prevent request timeouts - # Note: If timeout is occuring (possibly due to larger load on biomart), reduce chunk size - CHUNK_SIZE= 1500 - probe_id_chunks <- split(probe_ids, ceiling(seq_along(probe_ids) / CHUNK_SIZE)) - df_mapping <- data.frame() - for (i in seq_along(probe_id_chunks)) { - probe_id_chunk <- probe_id_chunks[[i]] - print(glue::glue("Running biomart query chunk {i} of {length(probe_id_chunks)}. Total probes IDS in query ({length(probe_id_chunk)})")) - message(glue::glue("Running biomart query chunk {i} of {length(probe_id_chunks)}. Total probes IDS in query ({length(probe_id_chunk)})")) # NON_DPPD - chunk_results <- biomaRt::getBM( - attributes = c( - expected_attribute_name, - "ensembl_gene_id" - ), - filters = expected_attribute_name, - values = probe_id_chunk, - mart = ensembl) - - if (nrow(chunk_results) > 0) { - df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results) + ensembl <- biomaRt::useEnsembl(biomart = "genes", + dataset = expected_dataset_name, + version = ENSEMBL_VERSION) + print(ensembl) + + # Some probe_ids for affy_hta_2_0 may end in .hg.1 instead of .hg (how it is in biomaRt), leading to 0 results returned + if (expected_attribute_name == 'affy_hta_2_0') { + rownames(probeset_level_data) <- stringr::str_replace(rownames(probeset_level_data), '\\.hg\\.1$', '.hg') } - - Sys.sleep(10) # Slight break between requests to prevent back-to-back requests + + probe_ids <- rownames(probeset_level_data) + + # DEBUG:START + if ( is.integer(params$DEBUG_limit_biomart_query) ) { + warning(paste("DEBUG MODE: Limiting query to", params$DEBUG_limit_biomart_query, "entries")) + message(paste("DEBUG MODE: Limiting query to", params$DEBUG_limit_biomart_query, "entries")) + probe_ids <- probe_ids[1:params$DEBUG_limit_biomart_query] + } + # DEBUG:END + + # Create probe map + # Run Biomart Queries in chunks to prevent request timeouts + # Note: If timeout is occuring (possibly due to larger load on biomart), reduce chunk size + CHUNK_SIZE= 1500 + probe_id_chunks <- split(probe_ids, ceiling(seq_along(probe_ids) / CHUNK_SIZE)) + df_mapping <- data.frame() + for (i in seq_along(probe_id_chunks)) { + probe_id_chunk <- probe_id_chunks[[i]] + print(glue::glue("Running biomart query chunk {i} of {length(probe_id_chunks)}. Total probes IDS in query ({length(probe_id_chunk)})")) + message(glue::glue("Running biomart query chunk {i} of {length(probe_id_chunks)}. Total probes IDS in query ({length(probe_id_chunk)})")) # NON_DPPD + chunk_results <- biomaRt::getBM( + attributes = c( + expected_attribute_name, + "ensembl_gene_id" + ), + filters = expected_attribute_name, + values = probe_id_chunk, + mart = ensembl) + + if (nrow(chunk_results) > 0) { + df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results) + } + + Sys.sleep(10) # Slight break between requests to prevent back-to-back requests + } + } } # At this point, we have df_mapping from either the biomart live service or the ensembl genomes ftp archive depending on the organism -``` +# If no df_mapping obtained (e.g., organism not supported in biomart), use custom annotations; otherwise, merge in-house annotations to df_mapping -``` {r reformat-merge-probe-annotations} -# Convert list of multi-mapped genes to string -listToUniquePipedString <- function(str_list) { - #! convert lists into strings denoting unique elements separated by '|' characters - #! e.g. c("GO1","GO2","GO2","G03") -> "GO1|GO2|GO3" - return(toString(unique(str_list)) %>% stringr::str_replace_all(pattern = stringr::fixed(", "), replacement = "|")) -} +if (use_custom_annot) { + expected_attribute_name <- 'ProbesetID' + + annot_type <- 'NO_CUSTOM_ANNOT' + if (!is.null(local_annotation_dir) && !is.null(annotation_config_path)) { + config_df <- read.csv(annotation_config_path, row.names=1) + if (unique(df_rs$`biomart_attribute`) %in% row.names(config_df)) { + annot_config <- config_df[unique(df_rs$`biomart_attribute`), ] + annot_type <- annot_config$annot_type[[1]] + } else { + warning(paste0("No entry for '", unique(df_rs$`biomart_attribute`), "' in provided config file: ", annotation_config_path)) + } + } else { + warning("Need to provide both local_annotation_dir and annotation_config_path to use custom annotation.") + } -annot_key = ifelse(organism %in% c("athaliana"), 'TAIR', 'ENSEMBL') + if (annot_type == '3prime-IVT') { + unique_probe_ids <- read.csv( + file.path(local_annotation_dir, annot_config$annot_filename[[1]]), + skip = 13, header = TRUE, na.strings = c('NA', '---') + )[c('Probe.Set.ID', 'Entrez.Gene', 'Gene.Symbol', 'Gene.Title', 'Ensembl', 'RefSeq.Transcript.ID', 'RefSeq.Protein.ID', 'Gene.Ontology.Biological.Process', 'Gene.Ontology.Cellular.Component', 'Gene.Ontology.Molecular.Function')] -if (organism %in% c('ecoli')) { - unique_probe_ids <- df_mapping %>% - dplyr::mutate( - count_ENTREZID_mappings = 1 + stringr::str_count(ENTREZID, stringr::fixed("|")) - ) + # Clean columns + unique_probe_ids$Gene.Symbol <- purrr::map_chr(stringr::str_split(unique_probe_ids$Gene.Symbol, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) + unique_probe_ids$Gene.Title <- purrr::map_chr(stringr::str_split(unique_probe_ids$Gene.Title, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) + unique_probe_ids$Entrez.Gene <- purrr::map_chr(stringr::str_split(unique_probe_ids$Entrez.Gene, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) + unique_probe_ids$Ensembl <- purrr::map_chr(stringr::str_split(unique_probe_ids$Ensembl, stringr::fixed(' /// ')), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('NA', NA_character_) + + unique_probe_ids$RefSeq <- paste(unique_probe_ids$RefSeq.Transcript.ID, unique_probe_ids$RefSeq.Protein.ID) + unique_probe_ids$RefSeq <- purrr::map_chr(stringr::str_extract_all(unique_probe_ids$RefSeq, '[A-Z]+_[\\d.]+'), ~paste0(unique(.), collapse = "|")) %>% stringr::str_replace('^$', NA_character_) - primary_key <- 'ENTREZID' - primary_key_count <- 'count_ENTREZID_mappings' -} else if (organism %in% c('paeruginosa')) { - unique_probe_ids <- df_mapping %>% + unique_probe_ids$GO <- paste(unique_probe_ids$Gene.Ontology.Biological.Process, unique_probe_ids$Gene.Ontology.Cellular.Component, unique_probe_ids$Gene.Ontology.Molecular.Function) + unique_probe_ids$GO <- purrr::map_chr(stringr::str_extract_all(unique_probe_ids$GO, '\\d{7}'), ~paste0('GO:', unique(.), collapse = "|")) %>% stringr::str_replace('^GO:$', NA_character_) + + unique_probe_ids <- unique_probe_ids[c('Probe.Set.ID', 'Entrez.Gene', 'Gene.Symbol', 'Gene.Title', 'Ensembl', 'RefSeq', 'GO')] + names(unique_probe_ids) <- c('ProbesetID', 'ENTREZID', 'SYMBOL', 'GENENAME', 'ENSEMBL', 'REFSEQ', 'GOSLIM_IDS') + + unique_probe_ids$STRING_id <- NA_character_ + + gene_col <- 'ENSEMBL' + if (sum(!is.na(unique_probe_ids$ENTREZID)) > sum(!is.na(unique_probe_ids$ENSEMBL))) { + gene_col <- 'ENTREZID' + } + if (sum(!is.na(unique_probe_ids$SYMBOL)) > max(sum(!is.na(unique_probe_ids$ENTREZID)), sum(!is.na(unique_probe_ids$ENSEMBL)))) { + gene_col <- 'SYMBOL' + } + + unique_probe_ids <- unique_probe_ids %>% dplyr::mutate( - count_SYMBOL_mappings = 1 + stringr::str_count(SYMBOL, stringr::fixed("|")) + count_gene_mappings = 1 + stringr::str_count(get(gene_col), stringr::fixed("|")), + gene_mapping_source = gene_col ) - - primary_key <- 'SYMBOL' - primary_key_count <- 'count_SYMBOL_mappings' + } else if (annot_type == 'custom') { + unique_probe_ids <- read.csv( + file.path(local_annotation_dir, annot_config$annot_filename[[1]]), + header = TRUE, na.strings = c('NA', '') + ) + } else { + annot_cols <- c('ProbesetID', 'ENTREZID', 'SYMBOL', 'GENENAME', 'ENSEMBL', 'REFSEQ', 'GOSLIM_IDS', 'STRING_id', 'count_gene_mappings', 'gene_mapping_source') + unique_probe_ids <- setNames(data.frame(matrix(NA_character_, nrow = 1, ncol = length(annot_cols))), annot_cols) + } } else { annot <- read.table( - as.character(params$annotation_file_path), + as.character(annotation_file_path), sep = "\t", header = TRUE, quote = "", @@ -710,37 +751,37 @@ if (organism %in% c('ecoli')) { dplyr::mutate(dplyr::across(!!sym(expected_attribute_name), as.character)) %>% # Ensure probeset ids treated as character type dplyr::group_by(!!sym(expected_attribute_name)) %>% dplyr::summarise( - ENSEMBL = listToUniquePipedString(ensembl_gene_id) + ENSEMBL = list_to_unique_piped_string(ensembl_gene_id) ) %>% # Count number of ensembl IDS mapped dplyr::mutate( - count_ENSEMBL_mappings = 1 + stringr::str_count(ENSEMBL, stringr::fixed("|")) + count_gene_mappings = 1 + stringr::str_count(ENSEMBL, stringr::fixed("|")), + gene_mapping_source = annot_key ) %>% dplyr::left_join(annot, by = c("ENSEMBL" = annot_key)) - - - primary_key <- 'ENSEMBL' - primary_key_count <- 'count_ENSEMBL_mappings' } +``` +``` {r reformat-merge-probe-annotations} probeset_expression_matrix <- oligo::exprs(probeset_level_data) -probeset_expression_matrix.biomart_mapped <- probeset_expression_matrix %>% +probeset_expression_matrix.gene_mapped <- probeset_expression_matrix %>% as.data.frame() %>% tibble::rownames_to_column(var = "ProbesetID") %>% # Ensure rownames (probeset IDs) can be used as join key dplyr::left_join(unique_probe_ids, by = c("ProbesetID" = expected_attribute_name ) ) %>% - dplyr::mutate( !!primary_key_count := ifelse(is.na(get(primary_key)), 0, get(primary_key_count)) ) + dplyr::mutate( count_gene_mappings := ifelse(is.na(count_gene_mappings), 0, count_gene_mappings) ) %>% + dplyr::mutate( gene_mapping_source := unique(unique_probe_ids$gene_mapping_source) ) ``` -### Summarize Probeset Mapping +### Summarize Gene Mapping ``` {r summarize-remapping-vs-original-mapping} #| message: false # Pie Chart with Percentages slices <- c( - 'Unique Mapping' = nrow(probeset_expression_matrix.biomart_mapped %>% dplyr::filter(get(primary_key_count) == 1) %>% dplyr::distinct(ProbesetID)), - 'Multi Mapping' = nrow(probeset_expression_matrix.biomart_mapped %>% dplyr::filter(get(primary_key_count) > 1) %>% dplyr::distinct(ProbesetID)), - 'No Mapping' = nrow(probeset_expression_matrix.biomart_mapped %>% dplyr::filter(get(primary_key_count) == 0) %>% dplyr::distinct(ProbesetID)) + 'Unique Mapping' = nrow(probeset_expression_matrix.gene_mapped %>% dplyr::filter(count_gene_mappings == 1) %>% dplyr::distinct(ProbesetID)), + 'Multi Mapping' = nrow(probeset_expression_matrix.gene_mapped %>% dplyr::filter(count_gene_mappings > 1) %>% dplyr::distinct(ProbesetID)), + 'No Mapping' = nrow(probeset_expression_matrix.gene_mapped %>% dplyr::filter(count_gene_mappings == 0) %>% dplyr::distinct(ProbesetID)) ) pct <- round(slices/sum(slices)*100) chart_names <- names(slices) @@ -748,17 +789,104 @@ chart_names <- glue::glue("{names(slices)} ({slices})") # add count to labels chart_names <- paste(chart_names, pct) # add percents to labels chart_names <- paste(chart_names,"%",sep="") # ad % to labels pie(slices,labels = chart_names, col=rainbow(length(slices)), - main=glue::glue("Mapping to Primary Keytype\n {nrow(probeset_expression_matrix.biomart_mapped %>% dplyr::distinct(ProbesetID))} Total Unique Probesets") + main=glue::glue("Mapping to Primary Keytype\n {nrow(probeset_expression_matrix.gene_mapped %>% dplyr::distinct(ProbesetID))} Total Unique Probesets") ) print(glue::glue("Unique Mapping Count: {slices[['Unique Mapping']]}")) message(glue::glue("Unique Mapping Count: {slices[['Unique Mapping']]}")) # NON_DPPD ``` +### Generate Annotated Raw and Normalized Expression Tables + +```{r save-tables} +## Reorder columns before saving to file +ANNOTATIONS_COLUMN_ORDER = c( + annot_key, + "SYMBOL", + "GENENAME", + "REFSEQ", + "ENTREZID", + "STRING_id", + "GOSLIM_IDS" +) + +SAMPLE_COLUMN_ORDER <- df_rs$`Sample Name` + +probeset_expression_matrix.gene_mapped <- probeset_expression_matrix.gene_mapped %>% dplyr::rename( !!annot_key := ENSEMBL ) + +## Output column subset file with just normalized probeset level expression values +write.csv( + probeset_expression_matrix.gene_mapped[c( + ANNOTATIONS_COLUMN_ORDER, + "ProbesetID", + "count_gene_mappings", + "gene_mapping_source", + SAMPLE_COLUMN_ORDER) + ], file.path(DIR_NORMALIZED_EXPRESSION, "normalized_expression_probeset_GLmicroarray.csv"), row.names = FALSE) + +## Determine column order for probe level tables + +PROBE_INFO_COLUMN_ORDER = c( + "ProbesetID", + "ProbeID", + "count_gene_mappings", + "gene_mapping_source" +) + +FINAL_COLUMN_ORDER <- c( + ANNOTATIONS_COLUMN_ORDER, + PROBE_INFO_COLUMN_ORDER, + SAMPLE_COLUMN_ORDER +) + +## Generate raw intensity matrix that includes annotations + +background_corrected_data_annotated <- oligo::exprs(background_corrected_data) %>% + as.data.frame() %>% + tibble::rownames_to_column(var = "fid") %>% # Ensure rownames (probeset IDs) can be used as join key + dplyr::mutate(dplyr::across(fid, as.integer)) %>% # Ensure fid is integer type, consistent with getProbeInfo typing + dplyr::right_join(oligo::getProbeInfo(background_corrected_data), by = "fid") %>% # Add 'man_fsetid' via mapping based on fid + dplyr::rename( ProbesetID = man_fsetid ) %>% # Rename from getProbeInfo name to ProbesetID + dplyr::rename( ProbeID = fid ) %>% # Rename from getProbeInfo name to ProbeID + dplyr::left_join(unique_probe_ids, by = c("ProbesetID" = expected_attribute_name ) ) %>% # Join with biomaRt ENSEMBL mappings + dplyr::mutate( count_gene_mappings := ifelse(is.na(count_gene_mappings), 0, count_gene_mappings) ) %>% # Convert NA mapping to 0 + dplyr::mutate( gene_mapping_source := unique(unique_probe_ids$gene_mapping_source) ) %>% + dplyr::rename( !!annot_key := ENSEMBL ) + +## Perform reordering +background_corrected_data_annotated <- background_corrected_data_annotated %>% + dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER)) + +write.csv(background_corrected_data_annotated, file.path(DIR_RAW_DATA, "raw_intensities_probe_GLmicroarray.csv"), row.names = FALSE) + +## Generate normalized expression matrix that includes annotations +norm_data_matrix_annotated <- oligo::exprs(norm_data) %>% + as.data.frame() %>% + tibble::rownames_to_column(var = "fid") %>% # Ensure rownames (probeset IDs) can be used as join key + dplyr::mutate(dplyr::across(fid, as.integer)) %>% # Ensure fid is integer type, consistent with getProbeInfo typing + dplyr::right_join(oligo::getProbeInfo(norm_data), by = "fid") %>% # Add 'man_fsetid' via mapping based on fid + dplyr::rename( ProbesetID = man_fsetid ) %>% # Rename from getProbeInfo name to ProbesetID + dplyr::rename( ProbeID = fid ) %>% # Rename from getProbeInfo name to ProbeID + dplyr::left_join(unique_probe_ids, by = c("ProbesetID" = expected_attribute_name ) ) %>% + dplyr::mutate( count_gene_mappings := ifelse(is.na(count_gene_mappings), 0, count_gene_mappings) ) %>% # Convert NA mapping to 0 + dplyr::mutate( gene_mapping_source := unique(unique_probe_ids$gene_mapping_source) ) %>% + dplyr::rename( !!annot_key := ENSEMBL ) + +norm_data_matrix_annotated <- norm_data_matrix_annotated %>% + dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER)) + +write.csv(norm_data_matrix_annotated, file.path(DIR_NORMALIZED_EXPRESSION, "normalized_intensities_probe_GLmicroarray.csv"), row.names = FALSE) +``` + +## Perform Probeset Differential Expression (DE) + ### Generate Design Matrix ``` {r generate-design-matrix} -runsheetToDesignMatrix <- function(runsheet_path) { +#| include: !expr params$run_DE +#| eval: !expr params$run_DE + +runsheet_to_design_matrix <- function(runsheet_path) { df <- read.csv(runsheet, check.names = FALSE) %>% dplyr::mutate_all(function(x) iconv(x, "latin1", "ASCII", sub="")) # Convert all characters to ascii, when not possible, remove the character # get only Factor Value columns factors = as.data.frame(df[,grep("Factor.Value", colnames(df), ignore.case=TRUE)]) @@ -801,14 +929,12 @@ runsheetToDesignMatrix <- function(runsheet_path) { # Loading metadata from runsheet csv file -design_data <- runsheetToDesignMatrix(runsheet) +design_data <- runsheet_to_design_matrix(runsheet) design <- design_data$matrix -if (params$run_DE) { - # Write SampleTable.csv and contrasts.csv file - write.csv(design_data$groups, file.path(DIR_DGE, "SampleTable_GLmicroarray.csv"), row.names = FALSE) - write.csv(design_data$contrasts, file.path(DIR_DGE, "contrasts_GLmicroarray.csv")) -} +# Write SampleTable.csv and contrasts.csv file +write.csv(design_data$groups, file.path(DIR_DGE, "SampleTable_GLmicroarray.csv"), row.names = FALSE) +write.csv(design_data$contrasts, file.path(DIR_DGE, "contrasts_GLmicroarray.csv")) ``` ### Perform Individual Probeset Level DE @@ -817,7 +943,7 @@ if (params$run_DE) { #| include: !expr params$run_DE #| eval: !expr params$run_DE -lmFitPairwise <- function(norm_data, design) { +lm_fit_pairwise <- function(norm_data, design) { #' Perform all pairwise comparisons #' Approach based on limma manual section 17.4 (version 3.52.4) @@ -836,7 +962,7 @@ lmFitPairwise <- function(norm_data, design) { } # Calculate results -res <- lmFitPairwise(probeset_level_data, design) +res <- lm_fit_pairwise(probeset_level_data, design) DT::datatable(limma::topTable(res)) # NON_DPPD # Print DE table, without filtering @@ -847,89 +973,7 @@ limma::write.fit(res, adjust = 'BH', sep = ",") ``` -### Add Additional Columns and Format DE Table - -```{r save-tables} -## Reorder columns before saving to file -ANNOTATIONS_COLUMN_ORDER = c( - annot_key, - "SYMBOL", - "GENENAME", - "REFSEQ", - "ENTREZID", - "STRING_id", - "GOSLIM_IDS" -) - -SAMPLE_COLUMN_ORDER <- design_data$group %>% dplyr::pull(sample) - -probeset_expression_matrix.biomart_mapped <- probeset_expression_matrix.biomart_mapped %>% dplyr::rename( !!annot_key := ENSEMBL ) - -## Output column subset file with just normalized probeset level expression values -write.csv( - probeset_expression_matrix.biomart_mapped[c( - ANNOTATIONS_COLUMN_ORDER, - "ProbesetID", - primary_key_count, - SAMPLE_COLUMN_ORDER) - ], file.path(DIR_NORMALIZED_EXPRESSION, "normalized_expression_probeset_GLmicroarray.csv"), row.names = FALSE) - -if (params$run_DE) { - ### Generate and export PCA table for GeneLab visualization plots - PCA_raw <- prcomp(t(exprs(probeset_level_data)), scale = FALSE) # Note: expression at the Probeset level is already log2 transformed - write.csv(PCA_raw$x, file.path(DIR_DGE, "visualization_PCA_table_GLmicroarray.csv")) -} - -## Determine column order for probe level tables - -PROBE_INFO_COLUMN_ORDER = c( - "ProbesetID", - "ProbeID", - primary_key_count -) - -FINAL_COLUMN_ORDER <- c( - ANNOTATIONS_COLUMN_ORDER, - PROBE_INFO_COLUMN_ORDER, - SAMPLE_COLUMN_ORDER - ) - -## Generate raw intensity matrix that includes annotations - -background_corrected_data_annotated <- oligo::exprs(background_corrected_data) %>% - as.data.frame() %>% - tibble::rownames_to_column(var = "fid") %>% # Ensure rownames (probeset IDs) can be used as join key - dplyr::mutate(dplyr::across(fid, as.integer)) %>% # Ensure fid is integer type, consistent with getProbeInfo typing - dplyr::right_join(oligo::getProbeInfo(background_corrected_data), by = "fid") %>% # Add 'man_fsetid' via mapping based on fid - dplyr::rename( ProbesetID = man_fsetid ) %>% # Rename from getProbeInfo name to ProbesetID - dplyr::rename( ProbeID = fid ) %>% # Rename from getProbeInfo name to ProbeID - dplyr::left_join(unique_probe_ids, by = c("ProbesetID" = expected_attribute_name ) ) %>% # Join with biomaRt ENSEMBL mappings - dplyr::mutate( !!primary_key_count := ifelse(is.na(get(primary_key)), 0, get(primary_key_count)) ) %>% # Convert NA mapping to 0 - dplyr::rename( !!annot_key := ENSEMBL ) - -## Perform reordering -background_corrected_data_annotated <- background_corrected_data_annotated %>% - dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER)) - -write.csv(background_corrected_data_annotated, file.path(DIR_RAW_DATA, "raw_intensities_probe_GLmicroarray.csv"), row.names = FALSE) - -## Generate normalized expression matrix that includes annotations -norm_data_matrix_annotated <- oligo::exprs(norm_data) %>% - as.data.frame() %>% - tibble::rownames_to_column(var = "fid") %>% # Ensure rownames (probeset IDs) can be used as join key - dplyr::mutate(dplyr::across(fid, as.integer)) %>% # Ensure fid is integer type, consistent with getProbeInfo typing - dplyr::right_join(oligo::getProbeInfo(norm_data), by = "fid") %>% # Add 'man_fsetid' via mapping based on fid - dplyr::rename( ProbesetID = man_fsetid ) %>% # Rename from getProbeInfo name to ProbesetID - dplyr::rename( ProbeID = fid ) %>% # Rename from getProbeInfo name to ProbeID - dplyr::left_join(unique_probe_ids, by = c("ProbesetID" = expected_attribute_name ) ) %>% - dplyr::mutate( !!primary_key_count := ifelse(is.na(get(primary_key)), 0, get(primary_key_count)) ) %>% # Convert NA mapping to 0 - dplyr::rename( !!annot_key := ENSEMBL ) - -norm_data_matrix_annotated <- norm_data_matrix_annotated %>% - dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER)) - -write.csv(norm_data_matrix_annotated, file.path(DIR_NORMALIZED_EXPRESSION, "normalized_intensities_probe_GLmicroarray.csv"), row.names = FALSE) -``` +### Add Annotation and Stats Columns and Format DE Table ``` {r save-de-table} #| message: false @@ -940,9 +984,9 @@ write.csv(norm_data_matrix_annotated, file.path(DIR_NORMALIZED_EXPRESSION, "norm # Read in DE table df_interim <- read.csv("INTERIM.csv") -# Bind columns from biomart mapped expression table +# Bind columns from gene mapped expression table df_interim <- df_interim %>% - dplyr::bind_cols(probeset_expression_matrix.biomart_mapped) + dplyr::bind_cols(probeset_expression_matrix.gene_mapped) # Reformat column names reformat_names <- function(colname, group_name_mapping) { @@ -1036,7 +1080,8 @@ df_interim <- df_interim %>% dplyr::select(-any_of(colnames_to_remove)) PROBE_INFO_COLUMN_ORDER = c( "ProbesetID", - primary_key_count + "count_gene_mappings", + "gene_mapping_source" ) generate_prefixed_column_order <- function(subjects, prefixes) { @@ -1071,27 +1116,22 @@ ALL_SAMPLE_STATS_COLUMNS_ORDER <- c( "F.p.value" ) -GROUP_MEAN_COLUMNS_ORDER <- generate_prefixed_column_order( - subjects = unique(design_data$groups$group), - prefixes = c( - "Group.Mean_" - ) - ) -GROUP_STDEV_COLUMNS_ORDER <- generate_prefixed_column_order( +GROUP_MEAN_STDEV_COLUMNS_ORDER <- generate_prefixed_column_order( subjects = unique(design_data$groups$group), prefixes = c( + "Group.Mean_", "Group.Stdev_" - ) ) +) + FINAL_COLUMN_ORDER <- c( ANNOTATIONS_COLUMN_ORDER, PROBE_INFO_COLUMN_ORDER, SAMPLE_COLUMN_ORDER, STAT_COLUMNS_ORDER, ALL_SAMPLE_STATS_COLUMNS_ORDER, - GROUP_MEAN_COLUMNS_ORDER, - GROUP_STDEV_COLUMNS_ORDER - ) + GROUP_MEAN_STDEV_COLUMNS_ORDER +) ## Assert final column order includes all columns from original table if (!setequal(FINAL_COLUMN_ORDER, colnames(df_interim))) { @@ -1171,19 +1211,11 @@ get_versions <- function() { ## Note Libraries that were NOT used during processing versions_buffer <- get_versions() -if (organism %in% c("athaliana")) { - versions_buffer <- glue::glue_collapse(c( - versions_buffer, - glue::glue("- name: biomaRt"), - glue::glue(" version: (Not used for plant datasets)"), - glue::glue(" homepage: https://bioconductor.org/packages/3.14/bioc/html/biomaRt.html"), - glue::glue(" workflow task: PROCESS_AFFYMETRIX") - ), sep = "\n") -} else if (organism %in% c('ecoli', 'paeruginosa')) { +if (organism %in% c("athaliana") || use_custom_annot) { versions_buffer <- glue::glue_collapse(c( versions_buffer, glue::glue("- name: biomaRt"), - glue::glue(" version: (Not used for bacteria datasets)"), + glue::glue(" version: (Not used for this dataset)"), glue::glue(" homepage: https://bioconductor.org/packages/3.14/bioc/html/biomaRt.html"), glue::glue(" workflow task: PROCESS_AFFYMETRIX") ), sep = "\n") diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/checks.py b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/checks.py index fef5126e..aca43c6b 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/checks.py +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/checks.py @@ -319,8 +319,8 @@ def utils_common_constraints_on_dataframe( col_constraints = col_constraints.copy() # limit to only columns of interest - query_df = df[col_set] - for (colname, colseries) in query_df.iteritems(): + query_df = df[list(col_set)] + for (colname, colseries) in query_df.items(): # check non null constraint if col_constraints.pop("nonNull", False) and nonNull(colseries) == False: issues["Failed non null constraint"].append(colname) @@ -398,7 +398,7 @@ def check_dge_table_sample_columns_constraints( ) -> FlagEntry: MINIMUM_COUNT = 0 # data specific preprocess - df_dge = pd.read_csv(dge_table)[samples] + df_dge = pd.read_csv(dge_table)[list(samples)] schema = pa.DataFrameSchema({ sample: pa.Column(float) diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/config.yaml b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/config.yaml index 071eff4f..e3e593fa 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/config.yaml +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/config.yaml @@ -75,7 +75,10 @@ Staging: Sample name is used as a unique sample identifier during processing Example: Atha_Col-0_Root_WT_Ctrl_45min_Rep1_GSM502538 - - ISA Field Name: Label + - ISA Field Name: + - Label + - Parameter Value[label] + - Parameter Value[Label] ISA Table Source: Sample Runsheet Column Name: Label Processing Usage: >- @@ -263,16 +266,6 @@ data assets: resource categories: *DGEAnalysisData - viz PCA table: - processed location: - - *DGEDataDir - - "visualization_PCA_table_GLmicroarray.csv" - - tags: - - processed - - resource categories: *neverPublished - data asset sets: # These assets are not generated in the workflow, but are generated after the workflow PUTATIVE: [] diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/protocol.py b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/protocol.py index e17c9a1d..d55f4f54 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/protocol.py +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix/protocol.py @@ -365,33 +365,6 @@ def validate( """) ) - with vp.component_start( - name="Viz Tables", - description="Extended from the dge tables", - ): - with vp.payload( - payloads=[ - { - "samples": lambda: set(dataset.samples), - "pca_table": lambda: dataset.data_assets[ - "viz PCA table" - ].path, - } - ] - ): - vp.add( - bulkRNASeq.checks.check_viz_pca_table_index_and_columns_exist, - full_description=textwrap.dedent(f""" - - Check: Ensure all samples (row-indices) present and columns PC1, PC2 and PC3 are present - - Reason: - - PCA table should include all samples and PC1, PC2, PC3 (for 3D PCA viz) - - Potential Source of Problems: - - Bug in processing script - - Flag Condition: - - Green: All samples and all columns present - - Halt: At least one sample or column is missing - """) - ) with vp.component_start( name="Processing Report", description="", diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix_skipDE/checks.py b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix_skipDE/checks.py index fef5126e..aca43c6b 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix_skipDE/checks.py +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix_skipDE/checks.py @@ -319,8 +319,8 @@ def utils_common_constraints_on_dataframe( col_constraints = col_constraints.copy() # limit to only columns of interest - query_df = df[col_set] - for (colname, colseries) in query_df.iteritems(): + query_df = df[list(col_set)] + for (colname, colseries) in query_df.items(): # check non null constraint if col_constraints.pop("nonNull", False) and nonNull(colseries) == False: issues["Failed non null constraint"].append(colname) @@ -398,7 +398,7 @@ def check_dge_table_sample_columns_constraints( ) -> FlagEntry: MINIMUM_COUNT = 0 # data specific preprocess - df_dge = pd.read_csv(dge_table)[samples] + df_dge = pd.read_csv(dge_table)[list(samples)] schema = pa.DataFrameSchema({ sample: pa.Column(float) diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix_skipDE/config.yaml b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix_skipDE/config.yaml index e30bf80a..543a4fe6 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix_skipDE/config.yaml +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/bin/dp_tools__affymetrix_skipDE/config.yaml @@ -75,7 +75,10 @@ Staging: Sample name is used as a unique sample identifier during processing Example: Atha_Col-0_Root_WT_Ctrl_45min_Rep1_GSM502538 - - ISA Field Name: Label + - ISA Field Name: + - Label + - Parameter Value[label] + - Parameter Value[Label] ISA Table Source: Sample Runsheet Column Name: Label Processing Usage: >- diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/config/default.config b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/config/default.config index b63c8772..70dba72d 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/config/default.config +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/config/default.config @@ -41,7 +41,8 @@ params { Parameters that SHOULD NOT be overwritten */ // For now, this particular is good for all organisms listed on the file. - annotation_file_path = "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/GL_RefAnnotTable_1.0.0/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv" + annotation_file_path = "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/GL_RefAnnotTable-A_1.1.0/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" + annotation_config_path = "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/Microarray/Affymetrix/Array_Annotations/Affymetrix_array_annotations.csv" /* DEBUG related parameters, not likely useful in production diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/config/software/by_docker_image.config b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/config/software/by_docker_image.config index 6744407f..df4735a4 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/config/software/by_docker_image.config +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/config/software/by_docker_image.config @@ -1,8 +1,8 @@ process { withName: 'PROCESS_AFFYMETRIX' { - container = "quay.io/j_81/gl_images:NF_AffyMP-A_1.0.0-RC7" + container = "quay.io/nasa_genelab/gl-microarray:1.0.0" } withName: 'RUNSHEET_FROM_GLDS|RUNSHEET_FROM_ISA|VV_AFFYMETRIX|GENERATE_MD5SUMS|UPDATE_ISA_TABLES|GENERATE_SOFTWARE_TABLE' { - container = "quay.io/j_81/dp_tools:1.3.4" + container = "quay.io/nasa_genelab/dp_tools:1.3.5" } } diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/PARSE_ANNOTATION_TABLE.nf b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/PARSE_ANNOTATION_TABLE.nf index a6a39281..74addb06 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/PARSE_ANNOTATION_TABLE.nf +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/PARSE_ANNOTATION_TABLE.nf @@ -1,6 +1,6 @@ process PARSE_ANNOTATION_TABLE { // Extracts data from this kind of table: - // https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv + // https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv input: val(annotations_csv_url_string) @@ -22,7 +22,7 @@ process PARSE_ANNOTATION_TABLE { organism_key = organism_sci.capitalize().replace("_"," ") // fasta_url = organisms[organism_key][5] // gtf_url = organisms[organism_key][6] - annotations_db_url = organisms[organism_key][9] + annotations_db_url = organisms[organism_key][10] ensemblVersion = organisms[organism_key][3] ensemblSource = organisms[organism_key][4] diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/POST_PROCESSING/GENERATE_PROTOCOL/resources/usr/bin/generate_protocol.sh b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/POST_PROCESSING/GENERATE_PROTOCOL/resources/usr/bin/generate_protocol.sh index 5b6f2357..1748a3a2 100755 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/POST_PROCESSING/GENERATE_PROTOCOL/resources/usr/bin/generate_protocol.sh +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/POST_PROCESSING/GENERATE_PROTOCOL/resources/usr/bin/generate_protocol.sh @@ -34,9 +34,9 @@ organism_list=("Homo sapiens" "Mus musculus" "Rattus norvegicus" "Drosophila mel # Check the value of 'organism' variable and set 'GENE_MAPPING_STEP' accordingly if [[ $organism == "Arabidopsis thaliana" ]]; then GENE_MAPPING_STEP="Ensembl gene ID mappings were retrieved for each probeset using the Plants Ensembl database ftp server (plants.ensembl.org, release 54)." -elif [[ $organism == "Escherichia coli" ]]; then +elif [[ $biomart_attribute == "AFFY E coli Genome 2 0" ]]; then GENE_MAPPING_STEP="Gene annotations were retrieved for each probeset from ThermoFisher (https://www.thermofisher.com/order/catalog/product/sec/assets?url=TFS-Assets/LSG/Support-Files/E_coli_2-na36-annot-csv.zip, created March 2016, accessed June 2024)." -elif [[ $organism == "Pseudomonas aeruginosa" ]]; then +elif [[ $biomart_attribute == "AFFY GeneChip P. aeruginosa Genome" ]]; then GENE_MAPPING_STEP="Gene annotations were retrieved for each probeset from ThermoFisher (https://www.thermofisher.com/order/catalog/product/sec/assets?url=TFS-Assets/LSG/Support-Files/Pae_G1a-na36-annot-csv.zip, created March 2016, accessed June 2024)." elif [[ " ${organism_list[*]} " == *"${organism//\"/}"* ]]; then GENE_MAPPING_STEP="Ensembl gene ID mappings were retrieved for each probeset using biomaRt (version ${biomaRt_VERSION}), Ensembl database (ensembl.org, release 107)." @@ -80,7 +80,7 @@ else fi # Read the template file -template="Data were processed as described in GL-DPPD-7114 (https://github.com/nasa/GeneLab_Data_Processing/blob/master/Microarray/Affymetrix/Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md) using NF_MAAffymetrix version $1 (https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAffymetrix_$1/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix). In short, a RunSheet containing raw data file location and processing metadata from the study's *ISA.zip file was generated using dp_tools (version ${dp_tools_VERSION}). The raw array data files were loaded into R (version ${R_VERSION}) using oligo (version ${oligo_VERSION}). Raw data quality assurance density plot, pseudo images, MA plots, and boxplots were generated using oligo (version ${oligo_VERSION}). The raw probe level intensity data was background corrected and normalized across arrays via the oligo (version ${oligo_VERSION}) quantile method. Normalized probe level data quality assurance density plot, pseudo images, MA plots, and boxplots were generated using oligo (version ${oligo_VERSION}). Normalized probe level data was summarized to the probeset level using the oligo (version ${oligo_VERSION}) RMA method. ${GENE_MAPPING_STEP} ${DE_STEP} ${ANNOT_STEP}" +template="Data were processed as described in GL-DPPD-7114-A (https://github.com/nasa/GeneLab_Data_Processing/blob/master/Microarray/Affymetrix/Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md) using NF_MAAffymetrix version $1 (https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAffymetrix_$1/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix). In short, a RunSheet containing raw data file location and processing metadata from the study's *ISA.zip file was generated using dp_tools (version ${dp_tools_VERSION}). The raw array data files were loaded into R (version ${R_VERSION}) using oligo (version ${oligo_VERSION}). Raw data quality assurance density plot, pseudo images, MA plots, and boxplots were generated using oligo (version ${oligo_VERSION}). The raw probe level intensity data was background corrected and normalized across arrays via the oligo (version ${oligo_VERSION}) quantile method. Normalized probe level data quality assurance density plot, pseudo images, MA plots, and boxplots were generated using oligo (version ${oligo_VERSION}). Normalized probe level data was summarized to the probeset level using the oligo (version ${oligo_VERSION}) RMA method. ${GENE_MAPPING_STEP} ${DE_STEP} ${ANNOT_STEP}" # Output the filled template echo "$template" > PROTOCOL_GLmicroarray.txt \ No newline at end of file diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/PROCESS_AFFYMETRIX.nf b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/PROCESS_AFFYMETRIX.nf index 6f6e3301..32d104a6 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/PROCESS_AFFYMETRIX.nf +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/modules/PROCESS_AFFYMETRIX.nf @@ -28,10 +28,12 @@ process PROCESS_AFFYMETRIX { export HOME=\$PWD; quarto render \$PWD/${qmd} \ + -P 'workflow_version:${workflow.manifest.version}' \ -P 'runsheet:${runsheet_csv}' \ -P 'annotation_file_path:${annotation_file_path}' \ -P 'ensembl_version:${ensemblVersion}' \ -P 'local_annotation_dir:${params.referenceStorePath}' \ + -P 'annotation_config_path:${params.annotation_config_path}' \ ${limit_biomart_query_parameter} \ ${run_DE} diff --git a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/nextflow.config b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/nextflow.config index df8f884e..fcd6ebbe 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/nextflow.config +++ b/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/workflow_code/nextflow.config @@ -41,10 +41,10 @@ profiles { manifest { homePage = 'https://github.com/nasa/GeneLab_Data_Processing/tree/master/Microarray/Affymetrix' - description = 'Affymetrix Microarray Workflow for Document GL-DPPD-7114' + description = 'Affymetrix Microarray Workflow for Document GL-DPPD-7114-A' mainScript = 'main.nf' defaultBranch = 'main' - nextflowVersion = '>=23.10.1' + nextflowVersion = '>=24.10.5' version = '1.0.5' } diff --git a/Microarray/Affymetrix/Workflow_Documentation/README.md b/Microarray/Affymetrix/Workflow_Documentation/README.md index 6c28bc73..ce2148b1 100644 --- a/Microarray/Affymetrix/Workflow_Documentation/README.md +++ b/Microarray/Affymetrix/Workflow_Documentation/README.md @@ -1,14 +1,14 @@ # GeneLab RNAseq Workflow Information > ** For the processing pipeline for Affymetrix microarray data, -[`GL-DPPD-7114.md`](../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md), +[`GL-DPPD-7114-A.md`](../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md), GeneLab has wrapped each step of the pipeline into a workflow with validation and verification of output files built in after each step. The table below lists (and links to) each NF_MAAffymetrix version and the corresponding workflow subdirectory, the current NF_MAAffymetrix/workflow implementation is indicated. Each workflow subdirectory contains information about the workflow along with instructions for installation and usage.** ## NF_MAAffymetrix Version and Corresponding Workflow |Pipeline Version|Current Workflow Version (for respective pipeline version)|Nextflow Version| |:---------------|:---------------------------------------------------------|:---------------| -|*[GL-DPPD-7114.md](../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md)|[1.0.4](NF_MAAffymetrix)|23.10.1| +|*[GL-DPPD-7114-A.md](../Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114-A.md)|[NF_MAAffymetrix_1.0.5](NF_MAAffymetrix)|24.10.5| *Current GeneLab Pipeline/Workflow Implementation