Merged
Conversation
Refactor PCA plotting script to use plotnine and improve variance threshold detection. Add faster clustering.
Updated environment configuration for PCA plot library.
Removed TODO comments related to data processing tasks.
Test data only has 8 genes in the feature counts file. Added a constraint in the PCA function so it does not request more components than the number of samples and number of genes allow. This was what was breaking the test.
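For reviewers, the component cap described in this commit can be sketched roughly as below. This is a minimal NumPy illustration, not the code in the PR: the names `cap_n_components` and `pca_scores` are hypothetical, and the samples-in-rows orientation is an assumption.

```python
import numpy as np


def cap_n_components(requested, n_samples, n_genes):
    """Return the number of PCs that can actually be requested.

    A PCA on an (n_samples x n_genes) matrix cannot yield more than
    min(n_samples, n_genes) components.
    """
    limit = min(n_samples, n_genes)
    if limit < 1:
        raise ValueError("matrix must have at least one sample and one gene")
    return min(requested, limit)


def pca_scores(matrix, requested=10):
    """Compute PC scores via SVD; samples in rows, genes in columns (assumed)."""
    X = np.asarray(matrix, dtype=float)
    k = cap_n_components(requested, *X.shape)
    centered = X - X.mean(axis=0)
    # Thin SVD returns at most min(n_samples, n_genes) singular vectors.
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :k] * S[:k]
```

With the 8-gene test data, a request for 10 components would be reduced to 8 (or fewer if there are fewer samples) instead of raising an error.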
Fix NaN dissimilarity error in heatmap clustermap. Filter out zero-variance rows before Z-score normalization to prevent division by zero, which produced NaN values that crashed fastcluster linkage computation. Also drop residual NaN values and zero-variance columns after scaling. Add fallback to simple unclustered heatmap when insufficient data remains for hierarchical clustering (e.g., small test datasets).
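The filtering order in this commit can be illustrated with the sketch below. It is an assumption-laden outline rather than the PR's implementation: the function name `scale_for_clustermap` and the `min_rows`/`min_cols` thresholds are hypothetical, and genes are assumed to be in rows.

```python
import numpy as np


def scale_for_clustermap(matrix, min_rows=2, min_cols=2):
    """Z-score rows without producing NaN; report whether clustering is viable.

    Returns (scaled_matrix, can_cluster). The thresholds are illustrative.
    """
    X = np.asarray(matrix, dtype=float)
    if X.ndim != 2 or X.shape[1] < 2:
        return X, False  # a sample standard deviation needs >= 2 columns

    # 1. Drop rows whose standard deviation is zero or undefined, so the
    #    Z-score division below can never divide by zero.
    sd = X.std(axis=1, ddof=1, keepdims=True)
    keep = np.isfinite(sd[:, 0]) & (sd[:, 0] > 0)
    X, sd = X[keep], sd[keep]

    # 2. Z-score each remaining row.
    Z = (X - X.mean(axis=1, keepdims=True)) / sd

    # 3. Drop any residual non-finite rows, then zero-variance columns.
    Z = Z[np.isfinite(Z).all(axis=1)]
    if Z.shape[0] >= 2:
        Z = Z[:, Z.std(axis=0) > 0]

    can_cluster = Z.shape[0] >= min_rows and Z.shape[1] >= min_cols
    return Z, can_cluster
```

A caller would pass the scaled matrix to the clustered heatmap when `can_cluster` is true and draw a plain, unclustered heatmap otherwise.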
Add setuptools to multiqc yaml. Old Snakemake GitHub Actions are triggering this error, but it is easier to update the yaml than to update the actions, which would probably introduce more breaking changes.
Pin to multiqc 1.24.1 to fix bug when using -p for plot exporting.
Add kaleido and chromium for static image generation in GitHub.
Contributor
Author
All tests passing now.
Contributor
Nice, thanks. Do you know what induced the new dependencies? I.e., was this an update on the GitHub Actions side?
From: Mike Martinez ***@***.***>
Date: Sunday, March 1, 2026 at 3:20 PM
To: Dartmouth-Data-Analytics-Core/DAC-RNAseq-pipeline ***@***.***>
Cc: Owen Michael Wilkins ***@***.***>, Review requested ***@***.***>
Subject: Re: [Dartmouth-Data-Analytics-Core/DAC-RNAseq-pipeline] #43 Improvements to PCA rule (PR #44)
mikemartinez99 left a comment (Dartmouth-Data-Analytics-Core/DAC-RNAseq-pipeline#44)
All tests passing now.
Multiqc conda environment needed to be updated with some additional dependencies required by multiqc for static plot rendering and interactive plot rendering.
Contributor
Author
I think it was an Actions issue. Static plot rendering doesn't seem to be supported anymore. I believe Kaleido is called by plotly, which is called by the newer version of multiqc. The other option for a fix was to turn off plot exporting from multiqc. I personally never use those plots in the output but didn't know if others on the team liked having them.
PCA Variance detection enhancement
TL;DR:
DETAILS
Previously, PCA was calculated on the 500 most variable genes, which is not ideal for datasets with a wider variance profile. This script uses scipy.optimize.curve_fit to find the horizontal asymptote (i.e., the number of HVGs at which variance starts to plateau) so that PCs are calculated dynamically for each dataset while retaining as many informative genes as possible. PCA is also calculated using all genes. Output files have the prefixes "PCA_top_" and "PCA_all_", respectively. Edge-case handling is built in: if a curve cannot be fit, the script defaults to using all genes.
This asymptote method is a reasonable heuristic for dynamically calculating PCs. Additionally, a log file is output specifying how many HVGs were used in the calculation, so we can report this in client results.
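As a rough illustration of the plateau idea, the sketch below fits a saturating curve to the cumulative share of variance and returns the gene count at which the fitted curve reaches a chosen fraction of its asymptote. The model form, the 0.95 cutoff, and the names `saturating` and `n_genes_at_plateau` are all assumptions made for this example; the script in the PR may fit a different curve or use a different stopping rule. Only `scipy.optimize.curve_fit` and the fall-back-to-all-genes behavior come from the description above.

```python
import numpy as np
from scipy.optimize import curve_fit


def saturating(x, a, b):
    """Assumed model: rises toward the horizontal asymptote a at rate b."""
    return a * (1.0 - np.exp(-b * x))


def n_genes_at_plateau(gene_variances, fraction_of_asymptote=0.95):
    """Estimate how many top-variance genes to keep; fall back to all genes."""
    v = np.sort(np.asarray(gene_variances, dtype=float))[::-1]
    n = v.size
    total = v.sum()
    if n < 3 or not np.isfinite(total) or total <= 0:
        return n  # nothing sensible to fit

    x = np.arange(1, n + 1, dtype=float)
    y = np.cumsum(v) / total  # cumulative share of variance, ends at 1.0

    # Starting guess: asymptote near 1, rate from where y crosses one half.
    x_half = int(np.searchsorted(y, 0.5)) + 1
    p0 = (1.0, np.log(2.0) / x_half)
    try:
        (a, b), _ = curve_fit(saturating, x, y, p0=p0, maxfev=10000)
    except (RuntimeError, ValueError):
        return n  # fit failed: use all genes
    if not (np.isfinite(a) and np.isfinite(b)) or a <= 0 or b <= 0:
        return n

    # Smallest x where the fitted curve reaches the chosen share of a.
    cutoff = int(np.ceil(-np.log(1.0 - fraction_of_asymptote) / b))
    return int(min(max(cutoff, 1), n))
```

The returned count is the kind of value that could be written to the log file so the number of HVGs used can be reported.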
The previous script did not scale the data matrix, and since we only use median-of-ratios normalization followed by log2 + 1, we were not stabilizing the mean-variance relationship. Scaling helps balance this a little better, although VST or rlog should be the preferred method. This script should only serve as a rough diagnostic.
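For context, the normalization chain described here (median of ratios, then log2 with a pseudocount of 1, then per-gene scaling) can be sketched as follows. This is a NumPy approximation for illustration only: the function names are hypothetical, genes are assumed to be in rows, and it is not a substitute for VST or rlog.

```python
import numpy as np


def median_of_ratios_size_factors(counts):
    """Per-sample size factors; counts has genes in rows, samples in columns."""
    counts = np.asarray(counts, dtype=float)
    # Use only genes with a nonzero count in every sample, so the
    # geometric mean is defined.
    expressed = (counts > 0).all(axis=1)
    if not expressed.any():
        raise ValueError("no gene has nonzero counts in every sample")
    log_counts = np.log(counts[expressed])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Median across genes of each sample's log ratio to the geometric mean.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))


def normalize_log_scale(counts):
    """Median-of-ratios normalize, apply log2(x + 1), then Z-score each gene."""
    counts = np.asarray(counts, dtype=float)
    logged = np.log2(counts / median_of_ratios_size_factors(counts) + 1.0)
    sd = logged.std(axis=1, ddof=1, keepdims=True)
    keep = sd[:, 0] > 0  # genes with no variance cannot be scaled
    return (logged[keep] - logged[keep].mean(axis=1, keepdims=True)) / sd[keep]
```

After scaling, every retained gene has mean 0 and unit variance across samples, so highly expressed genes no longer dominate the PCs purely through their magnitude.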
(Note: I retested all of these normalizations with and without scaling in both this Python script and in R and got comparable results, with slight numerical differences due to floating-point calculations.)
Updated plotting to use plotnine for easier ggplot2-style syntax in Python (easier debugging).
Added fastcluster to improve heatmap clustering speed.
Outputs of this script include:
Tested interactively and in pipeline context.