Commit 6f2c77b

Uncertainties section added
1 parent 7d8c2d3 commit 6f2c77b

2 files changed

+77
-3
lines changed

06-web-tools.Rmd

Lines changed: 77 additions & 3 deletions
@@ -49,7 +49,7 @@ gProfiler is known for its integration of numerous species and databases. It sup
4949
The Gene Ontology (GO) context is divided into three main categories: Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). The analysis identifies which GO terms are significantly enriched, offering insights into the broader biological implications of the gene set. This helps in pinpointing processes such as cellular responses, metabolic pathways, and molecular interactions.
5050

5151
- **Query Info**:
52-
This section includes specifics about the input data, including the total number of queried genes and any identifiers not recognized or mapped. It also details the statistical background used, the chosen organism, and other analysis settings, ensuring transparency and reproducibility of the results.
52+
This section includes specifics about the input data, including the total number of queried genes and any identifiers not recognised or mapped. It also details the statistical background used, the chosen organism, and other analysis settings, ensuring transparency and reproducibility of the results.
5353

5454

5555
#### {-}
@@ -154,7 +154,7 @@ The first number indicates how many proteins in your network are annotated with
154154
Log10(observed / expected). This measure describes how large the enrichment effect is. It’s the ratio between i) the number of proteins in your network that are annotated with a term and ii) the number of proteins that we expect to be annotated with this term in a random network of the same size.
155155

156156
<span style="color:orange;">- Signal:</span>
157-
The signal is defined as a weighted harmonic mean between the observed/expected ratio and -log(FDR). FDR tends to emphasize larger terms due to their potential for achieving lower p-values, while the observed/expected ratio highlights smaller terms, which have a high foreground to background ratio but cannot achieve low FDR values due to their size. The signal measure seeks to balance both metrics for a more intuittive ordering of enriched terms.
157+
The signal is defined as a weighted harmonic mean between the observed/expected ratio and -log(FDR). FDR tends to emphasise larger terms due to their potential for achieving lower p-values, while the observed/expected ratio highlights smaller terms, which have a high foreground to background ratio but cannot achieve low FDR values due to their size. The signal measure seeks to balance both metrics for a more intuitive ordering of enriched terms.
158158

159159
<span style="color:orange;">- False Discovery Rate:</span>
160160
This measure describes how significant the enrichment is. Shown are p-values corrected for multiple testing within each category using the Benjamini–Hochberg procedure.
@@ -491,5 +491,79 @@ When running FEA in Reactome, how do you prefer the analysis methods?
491491
#### {-}
492492

493493

494-
<!-- ## Evaluation Metrics for FEA Methods -->
494+
## Uncertainties of a functional enrichment analysis
495495

496+
This section summarises the paper by [Wünsch et al., 2023](https://wires.onlinelibrary.wiley.com/doi/full/10.1002/wics.1643), which addresses the uncertainties in a typical functional enrichment analysis.
497+
498+
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="From RNA sequencing measurements to the final results: A practical guide to navigating the choices and uncertainties of gene set analysis"}
499+
knitr::include_graphics("images/Wünsch_et_al_2023.png")
500+
```
501+
502+
503+
### Types of FEA
504+
505+
Functional enrichment analysis (FEA) typically uses one of three approaches: over-representation analysis (ORA), functional class scoring (FCS), of which gene set enrichment analysis (GSEA) is the best-known example, and pathway topology (PT).
506+
507+
1. ORA
508+
509+
\- ORA methods are the least complex among the three approaches of FEA.
510+
511+
\- ORA methods require a list of differentially expressed genes, obtained beforehand from a differential expression analysis.
512+
513+
\- The background population (the universe) can be a general set of genes, such as all genes in the human genome, or a more specific set, such as the genes observed in the experiment.
514+
515+
\- A contingency table is created, and the null distribution is modelled using the hypergeometric distribution.
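The contingency table and hypergeometric test described above can be sketched in base R. This is a minimal illustration, not the procedure of any particular tool: all counts are hypothetical, and the variable names (`universe_size`, `set_size`, `de_size`, `overlap`) are assumptions introduced here.

```r
# Hypothetical counts, chosen only for illustration
universe_size <- 20000  # genes in the background (universe)
set_size      <- 150    # universe genes annotated to the gene set
de_size       <- 400    # differentially expressed (DE) genes
overlap       <- 12     # DE genes that are also in the gene set

# 2x2 contingency table: rows = membership in the gene set, columns = DE status
tab <- matrix(
  c(overlap,           set_size - overlap,
    de_size - overlap, universe_size - set_size - de_size + overlap),
  nrow = 2, byrow = TRUE,
  dimnames = list(gene_set = c("in", "out"), de = c("yes", "no"))
)

# Upper tail of the hypergeometric distribution: P(X >= overlap)
p_hyper <- phyper(overlap - 1, set_size, universe_size - set_size, de_size,
                  lower.tail = FALSE)

# The one-sided Fisher's exact test on the same table gives the same p-value
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(tab)
print(c(hypergeometric = p_hyper, fisher = p_fisher))
```

Changing `universe_size` while keeping the other counts fixed changes the p-value, which is why the choice of background matters.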
516+
517+
2. FCS
518+
519+
\- FCS methods aim to aggregate the gene-level statistics (ranks) into a gene set-level statistic (enrichment score, ES).
520+
521+
\- FCS methods can be classified as FCS I, which take the expression data as input, or FCS II, which take a pre-ranked list of genes as input. With FCS II, the information about the conditions (phenotypes) of the samples is lost, so phenotype permutation cannot be performed and gene set permutation is the only remaining way to generate the null distribution.
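To make the FCS II case concrete, the sketch below computes a simplified, unweighted running-sum enrichment score on a pre-ranked list and assesses it by gene set permutation. It is an illustration under stated assumptions, not the GSEA implementation: the data are simulated, and `gene_stats`, `gene_set`, `enrichment_score()` and `n_perm` are hypothetical names.

```r
set.seed(1)  # fixed up front so the permutation result is reproducible

# Hypothetical pre-ranked input: gene-level statistics in decreasing order
gene_stats <- sort(setNames(rnorm(1000), paste0("gene", 1:1000)),
                   decreasing = TRUE)
gene_set <- names(gene_stats)[c(1:20, 500:510)]  # hypothetical gene set

# Unweighted running sum: step up at set members, step down otherwise,
# and report the largest deviation from zero
enrichment_score <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit  <- sum(in_set)
  n_miss <- length(ranked_genes) - n_hit
  running <- cumsum(ifelse(in_set, 1 / n_hit, -1 / n_miss))
  running[which.max(abs(running))]
}

es_obs <- enrichment_score(names(gene_stats), gene_set)

# Gene set permutation: random gene sets of the same size form the null
n_perm <- 1000
es_null <- replicate(n_perm, {
  random_set <- sample(names(gene_stats), length(gene_set))
  enrichment_score(names(gene_stats), random_set)
})

# Two-sided empirical p-value with a +1 correction to avoid p = 0
p_value <- (sum(abs(es_null) >= abs(es_obs)) + 1) / (n_perm + 1)
print(c(ES = es_obs, p = p_value))
```

Only the ranked gene names are used here, so no sample labels are available to permute, which is the limitation described above.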
522+
523+
3. PT
524+
525+
\- PT methods additionally model the interactions between genes. In terms of popularity in the reference database, this approach generally scores considerably lower than ORA and FCS.
526+
527+
### Key considerations
528+
529+
\- Pre-filter expression data: Exclude lowly expressed genes to improve statistical power.
530+
531+
\- Handle gene IDs carefully: Convert gene IDs to the required format and remove any duplicates.
532+
533+
\- Normalise expression data: Address sample-specific biases to enable fair comparisons between samples.
534+
535+
\- Use appropriate methods for differential expression analysis: Recommended methods include limma (voom), DESeq2, and edgeR.
536+
537+
\- Select suitable gene-level statistics: For FCS II, choose a metric such as the moderated t-statistic to rank genes meaningfully.
538+
539+
\- Adjust for multiple testing: Ensure your analysis includes a correction for multiple hypothesis testing. Some methods require manual adjustments.
540+
541+
\- Choose gene set databases based on biological context: Ensure that the database aligns with the research question and the experimental system.
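A few of these considerations can be illustrated with base R alone. The sketch below uses a small made-up count matrix; the filtering threshold (at least 10 counts in at least 2 samples) and the rule of keeping the first occurrence of a duplicated ID are assumptions for illustration, not recommendations from the paper.

```r
# Hypothetical count matrix: 6 rows (gene IDs) x 4 samples, ID "B" duplicated
counts <- matrix(
  c(  0,   1,   0,   0,
     50,  60,  55,  48,
    100,  90, 120, 110,
      3,   2,   4,   1,
    200, 210, 190, 205,
     52,  61,  50,  47),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D", "E", "B"), paste0("s", 1:4))
)

# Pre-filter: keep genes with >= 10 counts in >= 2 samples (assumed threshold)
keep <- rowSums(counts >= 10) >= 2
filtered <- counts[keep, , drop = FALSE]

# Remove duplicated gene IDs by keeping the first occurrence (one possible rule)
dedup <- filtered[!duplicated(rownames(filtered)), , drop = FALSE]

# Adjust hypothetical raw p-values for multiple testing (Benjamini-Hochberg)
p_raw <- c(0.001, 0.02, 0.03, 0.5)
p_bh  <- p.adjust(p_raw, method = "BH")

print(rownames(dedup))
print(p_bh)
```

Normalisation and the differential expression step itself are left out here; in practice they would be handled by packages such as limma, DESeq2 or edgeR, as listed above.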
542+
543+
### Recommendations
544+
545+
- Awareness of Uncertainties:
546+
547+
\- Recognise uncertainties in methods, parameter choices, and data preprocessing when conducting Gene Set Analysis (GSA).
548+
549+
\- Understand that the method's name alone does not capture the full analysis pipeline.
550+
551+
- Clearly document all analysis choices, including methods, parameters, and preprocessing steps.
552+
553+
- Select methods, parameters, and preprocessing steps before starting the analysis to minimise bias.
554+
555+
- Set Technical Parameters:
556+
557+
\- Fix technical parameters like the random seed and number of permutations before running the analysis to ensure reproducibility.
558+
559+
\- Avoid adjusting these parameters to obtain favourable results.
560+
561+
- Avoid Cherry-picking:
562+
563+
\- Refrain from selectively reporting results based on favourable outcomes, as this can lead to over-optimistic and non-reproducible findings.
564+
565+
\- Avoid excessive tweaking of the analysis strategy to fit the data post hoc.
566+
567+
- Use different pipelines or parameter configurations as part of sensitivity analysis to check the consistency of results.
568+
569+
- Share complete analysis workflows, including code and documentation, to allow others to replicate the findings accurately.
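As a rough sketch of how some of these recommendations might look in practice, the example below fixes the technical parameters in one place before any results are computed, repeats a single ORA test under two background definitions as a small sensitivity analysis, and records the software environment. All values and names (`params`, `universes`) are hypothetical.

```r
# Analysis choices fixed and recorded before looking at any results
params <- list(seed = 2023, n_perm = 1000, p_adjust_method = "BH", alpha = 0.05)
set.seed(params$seed)

# Sensitivity analysis: the same ORA test under two hypothetical universes
# (for simplicity the gene set size is held fixed, which real data may not allow)
universes <- c(whole_genome = 20000, expressed_only = 12000)
p_by_universe <- sapply(universes, function(n_universe) {
  phyper(12 - 1, 150, n_universe - 150, 400, lower.tail = FALSE)
})

print(p_by_universe)

# Document the parameters and the software versions with the results
str(params)
sessionInfo()
```

Reporting both p-values, rather than only the smaller one, is the point of the exercise.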

images/Wünsch_et_al_2023.png

939 KB