
Add module for running SCimilarity #175

Merged
allyhawkins merged 24 commits into main from
allyhawkins/170-scimilarity
Aug 28, 2025

Conversation

@allyhawkins
Member

Closes #170

Here I'm adding the module to run SCimilarity on all of the samples. There's only one step, which runs SCimilarity and outputs the annotations as a TSV file. I copied the script that does this from OpenScPCA-analysis without any modifications, so the main code to review here is the addition to Nextflow.

  • There are two new parameters, one for the model itself and one for the ontology map file that we created in OpenScPCA-analysis. Since the model file is quite big, I added an empty folder for stub testing and added that path to the stub profile.
  • This module runs on all processed h5ad files for RNA only, so I added a step that should filter out any ADT files.
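The RNA-only filter described above lives in the Nextflow code, which is not shown in this conversation. As a rough illustration of the idea only, here is a minimal Python sketch that assumes RNA files end in `_rna.h5ad` (the suffix used in the module's command) and that ADT files end in `_adt.h5ad` (an assumption). The function name and the example file names are hypothetical.

```python
from pathlib import PurePosixPath

# Suffix taken from the module's command; the ADT suffix in the
# example file names below is an assumption for illustration.
RNA_SUFFIX = "_rna.h5ad"


def select_rna_h5ad(paths):
    """Keep only processed RNA AnnData files, dropping ADT and anything else."""
    return [p for p in paths if PurePosixPath(p).name.endswith(RNA_SUFFIX)]


# Hypothetical file names, not real library IDs.
files = [
    "results/library01_processed_rna.h5ad",
    "results/library01_processed_adt.h5ad",
    "results/library02_processed_rna.h5ad",
]
rna_files = select_rna_h5ad(files)
print(rna_files)
```

Matching on the file name rather than the full path avoids accidentally keeping a file just because a parent directory happens to contain the suffix.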

I am filing this as a draft because I'm having trouble getting Nextflow to run the script inside the container's conda environment. Because of how Nextflow launches and runs the image, the installed conda environment isn't used by default, so the script can't find the packages it needs. I think I found a solution: when we build the environment, set the default Python to the one in the conda environment so that it works with Nextflow. I'll file that as a PR in OpenScPCA-analysis.
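One way to confirm which interpreter a container process actually picks up is a small diagnostic like the sketch below. It is not part of this PR. The package names `anndata` and `scimilarity` are assumptions about what the script imports, and `report_environment` is a hypothetical helper.

```python
import importlib.util
import sys


def report_environment(required=("anndata", "scimilarity")):
    """Return the interpreter path and any required packages that cannot be found.

    The default package names are assumptions about the script's imports.
    """
    missing = [name for name in required if importlib.util.find_spec(name) is None]
    return {"python": sys.executable, "missing": missing}


status = report_environment()
print("interpreter:", status["python"])
print("missing packages:", status["missing"])
```

If the interpreter path points outside the conda environment, or the list of missing packages is not empty, the process is likely running with the image's system Python rather than the environment's.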

Once I'm able to confirm this runs, then I'll request formal review.

@allyhawkins
Member Author

I ran through the whole workflow with the simulated data successfully, so this is now ready for review.

@allyhawkins allyhawkins requested a review from sjspielman August 28, 2025 14:37
Member

@sjspielman sjspielman left a comment


This looks good to me! Not much to say about the nextflow code for a pretty standard "one process to do the thing" situation, and you got this memo ✅ #178 (comment)

FYI, I didn't carefully review the Python code since I'm assuming it was well-reviewed next door in the analysis repo. Let me know if you want me to have a closer look anywhere.

So in the end, my main comment is to restore the workflow bits you commented out during testing.


// cell type scimilarity
cell_type_scimilarity_model = 's3://scpca-references/celltype/scimilarity_references/model_v1.1'
cell_type_scimilarity_ontology_ref_file = 'https://raw.githubusercontent.com/AlexsLemonade/OpenScPCA-analysis/refs/heads/main/analyses/cell-type-scimilarity/references/scimilarity-mapped-ontologies.tsv'
Member


Noting that we'll want to update this one too with a tagged link, same as my NB URLs above.

main.nf Outdated

  // Run the merge workflow
- merge_sce(sample_ch)
+ //merge_sce(sample_ch)
Member


Ah, workflow testing 😬

--processed_h5ad_file \$file \
--ontology_map_file ${ontology_map_file} \
--predictions_tsv \$(basename \${file%_rna.h5ad}_scimilarity-celltype-assignments.tsv.gz) \
--seed 2025
Member


This is assigned in the file so you probably don't need it, but doesn't
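For readers less familiar with shell parameter expansion, the `--predictions_tsv` argument in the command above strips the `_rna.h5ad` suffix from the input file name and appends `_scimilarity-celltype-assignments.tsv.gz`. A minimal Python sketch of the same naming rule follows. The function name and the example path are hypothetical, and unlike the shell expansion, which leaves a non-matching name unchanged, this version raises an error.

```python
from pathlib import PurePosixPath

INPUT_SUFFIX = "_rna.h5ad"
OUTPUT_SUFFIX = "_scimilarity-celltype-assignments.tsv.gz"


def predictions_name(h5ad_path):
    """Derive the predictions file name from a processed RNA h5ad path."""
    name = PurePosixPath(h5ad_path).name  # same role as `basename`
    if not name.endswith(INPUT_SUFFIX):
        # Stricter than `${file%_rna.h5ad}`, which would silently keep the name.
        raise ValueError(f"expected a file ending in {INPUT_SUFFIX}: {name}")
    return name[: -len(INPUT_SUFFIX)] + OUTPUT_SUFFIX


# Hypothetical input path for illustration.
output_name = predictions_name("work/ab/library01_processed_rna.h5ad")
print(output_name)
```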

{
"barcode": processed_anndata.obs_names.to_list(),
"scimilarity_celltype_annotation": predictions.values,
"min_dist": nn_stats["min_dist"],
Member


Do we need this column in this workflow?

Member Author


Yes! This is a stat we are going to use to measure confidence, as recommended by SCimilarity docs, so we want to output it so we can use it for exploratory analysis.
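To make the output shape concrete, here is a minimal sketch of writing the three columns visible in the snippet above (`barcode`, `scimilarity_celltype_annotation`, `min_dist`) to a gzipped TSV. This is not the script's actual code. It uses only the standard library, the barcodes, labels, and distances are made-up placeholders, and the real output may contain additional columns.

```python
import csv
import gzip
import os
import tempfile

COLUMNS = ["barcode", "scimilarity_celltype_annotation", "min_dist"]


def write_predictions(path, barcodes, annotations, min_dists):
    """Write one row per cell to a gzip-compressed, tab-separated file."""
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(COLUMNS)
        writer.writerows(zip(barcodes, annotations, min_dists))


# Placeholder values for illustration only; not real results.
out_path = os.path.join(
    tempfile.mkdtemp(), "example_scimilarity-celltype-assignments.tsv.gz"
)
write_predictions(out_path, ["AAAC", "AAAG"], ["T cell", "B cell"], [0.01, 0.03])

# Read the file back to check the layout.
with gzip.open(out_path, "rt", newline="") as handle:
    rows = list(csv.reader(handle, delimiter="\t"))
print(rows)
```

Keeping `min_dist` alongside each annotation means the distance can be explored later without rerunning the model.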

allyhawkins and others added 3 commits August 28, 2025 13:20
Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>
@allyhawkins
Member Author

allyhawkins commented Aug 28, 2025

@sjspielman I ran this through on the real data and all samples completed successfully. I also checked and the results files are now present in the staging bucket.

I restored running the other modules and added a TODO about updating to use the tagged link. This should be ready for another look.

Edit: I meant to say that you do not need to review the Python script since it is copied exactly from the script that was reviewed in the analysis repo.

@allyhawkins allyhawkins requested a review from sjspielman August 28, 2025 18:27
Member

@sjspielman sjspielman left a comment


LGTM!

@allyhawkins allyhawkins merged commit 24ee52b into main Aug 28, 2025
3 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/170-scimilarity branch August 28, 2025 18:40
