Skip to content

#43 Improvements to PCA rule#44

Merged
owenwilkins merged 21 commits intomasterfrom
dev
Mar 4, 2026
Merged

#43 Improvements to PCA rule#44
owenwilkins merged 21 commits intomasterfrom
dev

Conversation

@mikemartinez99
Copy link
Contributor

@mikemartinez99 mikemartinez99 commented Feb 19, 2026

PCA Variance detection enhancement

TL;DR:

  • PCA is calculated using a dynamic number of HVGs (specified by asymptote detection) as well as using ALL genes
  • Switched plotting over to plotnine
  • Added fast clustering
  • Added log file output that informs how many HVGs were used for PCA calculation
  • Added new conda environment

DETAILS

  • Previously, PCA was calculated on 500 top most variable genes, which is not ideal for datasets with a wider variance profile. This script adds scipy.optimize curve_fit to find the horizontal asymptote (i.e., the number of HVGs until variance starts to plateau) as a way to dynamically calculate PCs on each dataset, retaining maximum informative genes. PCA is also calculated using all genes. Output files have the prefix "PCA_top_" and "PCA_all_" respectively. Edge case handling is built in. If a curve is unable to be fit, the script defaults to using all genes.
    This asymptote method is a reasonable heuristic for dynamically calculating PCs. Additionally, a log file is output specifying how many HVGs were used in calculation so we can report this in client results etc...

  • Previous script did not scale data matrix and since we are only using median of ratios normalization followed by log2 + 1, we were not stabilizing the mean-variance relationship. Scaling helps balance this a little better, although VST or rlog should be the preferred method. This script should only serve as a rough diagnostic.
    (Note: I retested all of these normalizations with and without scaling in both this python script and in R and got comparable results with slight numerical differences due to floating point calculations.)

  • Updated plotting to use plotnine for easier ggplot2-style syntax in python (easier debugging)

  • Added fastclusters to improve heatmap clustering speed.

  • Outputs of this script include:

  1. PCA_all_PC1_vs_PC2.png
  2. PCA_all_PC2_vs_PC3.png
  3. PCA_all_PC3_vs_PC4.png
  4. PCA_top_PC1_vs_PC2.png
  5. PCA_top_PC2_vs_PC3.png
  6. PCA_top_PC3_vs_PC4.png
  7. Gene_Variance_Plot.png (shows where the asymptote begins, i.e., # of HVGs used for PCA_top)
  8. PCA_all_PCA_variance_bar.png
  9. PCA_top_PCA_variance_bar.png
  10. Top_Genes_Heatmap.png
  11. pca_hvg.log.txt (How many genes total and how many were used for top PCA calculation, and what was the variance at that index)

Tested interactively and in pipeline context.

Refactor PCA plotting script to use plotnine and improve variance threshold detection. Add faster clustering.
Updated environment configuration for PCA plot library.
Removed TODO comments related to data processing tasks.
@mikemartinez99 mikemartinez99 added the enhancement New feature or request label Feb 19, 2026
owenwilkins and others added 15 commits February 19, 2026 09:37
Test data only has 8 genes in feature counts file. Added contraint in PCA function to not request more components than allowed based on number of samples and number of genes. This was what was breaking the test.
Fix NaN dissimilarity error in heatmap clustermap

Filter out zero-variance rows before Z-score normalization to prevent
division by zero, which produced NaN values that crashed fastcluster
linkage computation. Also drop residual NaN values and zero-variance
columns after scaling. Add fallback to simple unclustered heatmap when
insufficient data remains for hierarchical clustering (e.g., small
test datasets).
add setup tools to multiqc yaml. Old snakemake GitHub actions are triggering this error but it is easier to update the yaml than update the actions which will probably introduce more breaking changes.
Pin to multiqc 1.24.1 to fix bug when using -p for plot exporting.
add kaleido and chromium for static image generation in GitHub.
@mikemartinez99
Copy link
Contributor Author

All tests passing now.
Multiqc conda environment needed to be updated with some additional dependencies required by multiqc for static plot rendering and interactive plot rendering.

@owenwilkins
Copy link
Contributor

owenwilkins commented Mar 2, 2026 via email

@mikemartinez99
Copy link
Contributor Author

I think it was an actions issue. static plot rendering doesnt seem to be supported anymore. I believe Kaleido is called by plotly which is called by the newer version of multiqc.

The other option for a fix was to turn off plot exporting from multiqc. I personally never use those plots in the output but didn't know if others on the team liked having them.

@owenwilkins owenwilkins merged commit 05ad29f into master Mar 4, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants