#43 Improvements to PCA rule by mikemartinez99 · Pull Request #44 · Dartmouth-Data-Analytics-Core/DAC-RNAseq-pipeline

mikemartinez99 · 2026-02-19T13:49:28Z

PCA Variance detection enhancement

TL;DR:

PCA is calculated using a dynamic number of HVGs (specified by asymptote detection) as well as using ALL genes
Switched plotting over to plotnine
Added fast clustering
Added log file output that informs how many HVGs were used for PCA calculation
Added new conda environment

DETAILS

Previously, PCA was calculated on the 500 most variable genes, which is not ideal for datasets with a wider variance profile. This script uses scipy.optimize.curve_fit to find the horizontal asymptote of the gene variance curve (i.e., the number of HVGs at which variance starts to plateau), so the number of genes used for PCA is chosen per dataset and the most informative genes are retained. PCA is also calculated using all genes. Output files have the prefixes "PCA_top_" and "PCA_all_", respectively. Edge cases are handled: if a curve cannot be fit, the script defaults to using all genes.
This asymptote method is a reasonable heuristic for choosing how many genes go into the PC calculation. Additionally, a log file is output specifying how many HVGs were used in the calculation, so we can report this in client results, etc.
The previous script did not scale the data matrix, and since we only use median-of-ratios normalization followed by log2(x + 1), we were not stabilizing the mean-variance relationship. Scaling helps balance this a little better, although VST or rlog should be the preferred method. This script should serve only as a rough diagnostic.
(Note: I retested all of these normalizations with and without scaling in both this Python script and in R and got comparable results, with slight numerical differences due to floating-point calculations.)
Updated plotting to use plotnine for easier ggplot2-style syntax in Python (easier debugging).
Added fastcluster to improve heatmap clustering speed.
Outputs of this script include:

PCA_all_PC1_vs_PC2.png
PCA_all_PC2_vs_PC3.png
PCA_all_PC3_vs_PC4.png
PCA_top_PC1_vs_PC2.png
PCA_top_PC2_vs_PC3.png
PCA_top_PC3_vs_PC4.png
Gene_Variance_Plot.png (shows where the asymptote begins, i.e., # of HVGs used for PCA_top)
PCA_all_PCA_variance_bar.png
PCA_top_PCA_variance_bar.png
Top_Genes_Heatmap.png
pca_hvg.log.txt (How many genes total and how many were used for top PCA calculation, and what was the variance at that index)

Tested interactively and in pipeline context.
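
For readers who want a concrete picture of the asymptote heuristic, here is a minimal sketch. It is not the code in pca_plotting.py: the exponential-decay model, the 1% tolerance, and the function names `exp_decay` and `estimate_n_hvgs` are assumptions made for illustration. It fits the per-gene variances, sorted from highest to lowest, with scipy.optimize.curve_fit and returns the rank at which the fitted curve has essentially reached its horizontal asymptote, falling back to all genes when no curve can be fit.

```python
import numpy as np
from scipy.optimize import curve_fit


def exp_decay(rank, amplitude, rate, asymptote):
    """Assumed model: variance decays exponentially toward a horizontal asymptote."""
    return amplitude * np.exp(-rate * rank) + asymptote


def estimate_n_hvgs(gene_variances, tolerance=0.01):
    """Estimate how many top-variance genes to keep (hypothetical helper).

    Returns the total number of genes when the curve cannot be fit.
    """
    sorted_var = np.sort(np.asarray(gene_variances, dtype=float))[::-1]
    n_genes = sorted_var.size
    if n_genes < 4:
        # Too few points to fit a three-parameter curve: use all genes.
        return n_genes

    ranks = np.arange(n_genes, dtype=float)

    # Starting guess: the rank at which variance has fallen halfway to its
    # minimum gives a rough decay rate of ln(2) / half_rank.
    span = sorted_var[0] - sorted_var[-1]
    half_level = sorted_var[-1] + 0.5 * span
    half_rank = max(int(np.argmax(sorted_var <= half_level)), 1)
    p0 = (span, np.log(2.0) / half_rank, sorted_var[-1])

    try:
        (amplitude, rate, _), _ = curve_fit(
            exp_decay, ranks, sorted_var, p0=p0, maxfev=10000
        )
    except (RuntimeError, ValueError, TypeError):
        return n_genes  # fit failed: fall back to all genes

    if not np.isfinite(rate) or rate <= 0 or amplitude <= 0:
        return n_genes  # no decaying plateau detected

    # Rank at which the decaying term has shrunk to `tolerance` of its start.
    plateau_rank = int(np.ceil(np.log(1.0 / tolerance) / rate))
    return int(min(max(plateau_rank, 1), n_genes))
```

Calling `estimate_n_hvgs(variances)` on a vector of per-gene variances returns a gene count that could be logged and used to subset the matrix before the "PCA_top_" run. A different curve model or tolerance would move the cutoff, so the value is best read as a heuristic rather than a hard threshold.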

Refactor PCA plotting script to use plotnine and improve variance threshold detection. Add faster clustering.

Updated environment configuration for PCA plot library.

Removed TODO comments related to data processing tasks.

The test data has only 8 genes in the feature counts file. Added a constraint in the PCA function so that it does not request more components than the number of samples and the number of genes allow. This was what was breaking the test.
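
The component cap can be sketched as below. This is an illustration under assumptions, not the pipeline's PCA function: it uses a plain NumPy SVD in place of whatever PCA implementation pca_plotting.py calls, and the name `pca_scores` and its parameters are hypothetical. The point it shows is that the requested number of components is clamped to min(samples, genes) before anything is computed.

```python
import numpy as np


def pca_scores(log_counts, n_components=4):
    """PCA on a samples x genes matrix with a capped component count.

    Hypothetical helper; returns (scores, explained_variance_ratio).
    """
    x = np.asarray(log_counts, dtype=float)
    n_samples, n_genes = x.shape

    # A PCA cannot return more components than min(samples, genes).
    k = min(n_components, n_samples, n_genes)

    # Center each gene, then take the SVD of the centered matrix.
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)

    scores = u[:, :k] * s[:k]
    total = np.sum(s ** 2)
    explained = (s[:k] ** 2) / total if total > 0 else np.zeros(k)
    return scores, explained
```

With a 3-sample, 8-gene test matrix and `n_components=4`, this returns three components instead of raising an error, which is the behavior the fix above is after.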

Fix NaN dissimilarity error in heatmap clustermap

Filter out zero-variance rows before Z-score normalization to prevent division by zero, which produced NaN values that crashed fastcluster linkage computation. Also drop residual NaN values and zero-variance columns after scaling. Add fallback to simple unclustered heatmap when insufficient data remains for hierarchical clustering (e.g., small test datasets).
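
A rough sketch of that filtering order follows. The helper name `zscore_rows`, the `min_rows` and `min_cols` thresholds, and the returned `can_cluster` flag are assumptions for illustration; the actual script may structure this differently. Rows with zero variance are removed before dividing by the standard deviation, residual non-finite rows and flat columns are dropped afterwards, and the flag tells the caller whether enough data remains for hierarchical clustering or whether a plain heatmap should be drawn instead.

```python
import numpy as np


def zscore_rows(matrix, min_rows=2, min_cols=2):
    """Row-wise Z-score that removes zero-variance rows first.

    Hypothetical helper; returns (scaled, can_cluster).
    """
    x = np.asarray(matrix, dtype=float)

    # Drop rows whose standard deviation is zero (or undefined) so the
    # division below never produces NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        row_sd = x.std(axis=1, ddof=1)
    keep_rows = np.isfinite(row_sd) & (row_sd > 0)
    x = x[keep_rows]
    row_sd = row_sd[keep_rows]

    scaled = (x - x.mean(axis=1, keepdims=True)) / row_sd[:, None]

    # Drop rows that still contain NaN or inf, then columns with no spread.
    scaled = scaled[np.isfinite(scaled).all(axis=1)]
    if scaled.shape[0] > 0:
        scaled = scaled[:, scaled.std(axis=0) > 0]

    # Clustering needs at least a few rows and columns to compute distances.
    can_cluster = scaled.shape[0] >= min_rows and scaled.shape[1] >= min_cols
    return scaled, can_cluster
```

When `can_cluster` is false, the caller would skip the linkage step and draw the unclustered heatmap.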

Add setuptools to the multiqc YAML. Old Snakemake GitHub Actions are triggering this error, but it is easier to update the YAML than to update the actions, which would probably introduce more breaking changes.

Pin to multiqc 1.24.1 to fix bug when using -p for plot exporting.

add kaleido and chromium for static image generation in GitHub.

mikemartinez99 · 2026-03-01T20:19:58Z

All tests are passing now.
The MultiQC conda environment needed to be updated with some additional dependencies that MultiQC requires for static and interactive plot rendering.

owenwilkins · 2026-03-02T20:58:43Z

Nice, thanks. Do you know what induced the new dependencies? I.e., was this an update on the GitHub Actions side?

mikemartinez99 · 2026-03-02T21:01:32Z

I think it was an Actions issue. Static plot rendering doesn't seem to be supported anymore. I believe Kaleido is called by plotly, which is called by the newer version of MultiQC.

The other option for a fix was to turn off plot exporting from MultiQC. I personally never use those plots in the output, but I didn't know if others on the team liked having them.

mikemartinez99 added 4 commits February 19, 2026 08:04

Refactor PCA script with plotnine and enhancements

3b97625

Rename environment and update dependencies

9d8f943

Update PCA plots rule

9ca493a

Clean up TODO comments in Snakefile

528ce27

mikemartinez99 requested a review from owenwilkins February 19, 2026 13:49

mikemartinez99 added the enhancement label Feb 19, 2026

owenwilkins and others added 15 commits February 19, 2026 09:37

Merge branch 'master' into dev

94aa763

Correct memory limit syntax in Snakefile resources

ec22108

Fix memory format in resource specifications

121a493

Updated pca_plotting.py to handle edge case for no plateau detection …

7b906f4

…in gene variance

Update pca_plotting.py

b10f674

Update multiqc.yaml

4a6137e

Update multiqc.yaml

6667a2b

revert multiqc back

f189d4e

Update multiqc.yaml

966be3f

remove chromium from multiqc

32df86b

Update multiqc.yaml

403856b

Update multiqc.yaml

2d05eaa

Update multiqc.yaml

ef64aa2

Update multiqc.yaml

a24b446

mikemartinez99 added 2 commits March 3, 2026 10:55

add slurm_logs output folder to cluster profile

1ccc762

Update job.script.sh to have slurm logs dir creation

eb81311

owenwilkins merged commit 05ad29f into master Mar 4, 2026
9 checks passed

owenwilkins mentioned this pull request Mar 4, 2026

Revert "#43 Improvements to PCA rule" #45

Merged

owenwilkins merged 21 commits into master from dev
