Include SCimilarity when assigning consensus cell types #184

Merged
allyhawkins merged 27 commits into main from allyhawkins/scimilarity-to-consensus on Sep 12, 2025
Conversation

@allyhawkins
Member

Closes #181

Here I'm updating the cell-type-consensus module to pass in the output of cell-type-scimilarity. This was mostly straightforward, but I did have to account for the scimilarity output file being optional. We don't have them for multiplexed samples, so I join using remainder: true, and if the files list is empty, I pass in a dummy file name. Within the script itself, SCimilarity only gets used if the annotations file is found.

For the Rscript, I copied over the updates from 04-assign-consensus-celltypes.R. The script is exactly the same except I updated the output columns to now have both the new and old consensus cell types.

  • the new consensus cell types can now be found in consensus_annotation and consensus_ontology, which are the same column names we were using in this module previously
  • the old consensus cell types were present in the processed objects in consensus_celltype_annotation and consensus_celltype_ontology. If they are present, I rename those columns to singler_cellassign_celltype_annotation and singler_cellassign_celltype_ontology. If they aren't present, I just fill them in with NA.

I also added in the new marker gene file for all consensus cell types as a parameter and argument to the script.

One other note is that I'm currently using permalinks for all the references, but I'll file an issue to update to tagged links once we have the new release of OpenScPCA-analysis.

@allyhawkins
Member Author

Just noting that I was able to run this through on the simulated data successfully:
https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/1NrdtwlYFlqsIr/v2/tasks

Member

@jashapiro jashapiro left a comment

Overall this looks good to me.

I had a couple questions, the main one being if we want to change the output of the script here to use the same consensus_celltype_annotation as in the final SCE files instead of consensus_annotation. I know this might require other downstream changes, but the inconsistency kind of bothers me, and took me some time to figure out.

The other is a little side question about passing empty arrays instead of NO_FILE files. I am happy to let that one pass and come back to it though!

for library_id in ${library_ids.join(" ")}; do
# find files that have the appropriate library id in file name
sce_file=\$(ls ${library_files} | grep "\${library_id}")
scimilarity_file=\$(ls ${scimilarity_files} | grep "\${library_id}")
Member

If scimilarity_files is just NO_FILE, then this will be ""; just checking that this will work... (But also noting my suggestion to maybe not pass NO_FILE)

Member Author

I ran this through with simulated data for all libraries so just wanted to note that this does work.
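The lookup discussed in this thread can be sketched in plain shell. This is only an illustration under stated assumptions, not the module's script: match_library_file is a hypothetical helper, the file names and library IDs are made up, and printf stands in for ls so the sketch does not touch the filesystem. It shows that the placeholder NO_FILE yields an empty string, which the caller can then test before using SCimilarity annotations.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the PR): print the entry of a space-separated
# file list whose name contains the library ID, or nothing if none matches.
match_library_file() {
  library_id="$1"
  file_list="$2"
  # Unquoted ${file_list} is split on whitespace so each name lands on its own line.
  # `|| true` stops a non-matching grep from aborting a script run with `set -e`.
  printf '%s\n' ${file_list} | grep "${library_id}" || true
}

# Made-up file names for illustration only
found=$(match_library_file "SCPCL000001" "SCPCL000001_scimilarity.tsv SCPCL000002_scimilarity.tsv")
missing=$(match_library_file "SCPCL000001" "NO_FILE")

# An empty result means there are no SCimilarity annotations for this library
if [ -n "${missing}" ]; then
  scimilarity_status="use"
else
  scimilarity_status="skip"
fi

echo "found='${found}' missing='${missing}' status=${scimilarity_status}"
```

The `|| true` guard is a design choice of this sketch: grep exits non-zero when nothing matches, which matters if the surrounding script treats any failing command as fatal.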

Comment on lines 219 to +227
dplyr::select(
panglao_ontology,
original_panglao_name,
blueprint_ontology,
consensus_annotation,
consensus_ontology
)
cellassign_celltype_annotation = original_panglao_name,
singler_celltype_ontology = blueprint_ontology,
scimilarity_celltype_ontology = scimilarity_ontology,
starts_with(consensus_column_prefix)
) |>
# now just filter to join columns and get unique combinations
dplyr::select(all_of(join_columns), starts_with(consensus_column_prefix)) |>
Member

I'm confused by the two select statements in a row. I assume this is because you need to rename? If so, maybe change the first one to just dplyr::rename().

  # use unknown for NA annotation but keep ontology ID as NA
  # if the sample type is cell line, keep as NA
- dplyr::mutate(consensus_annotation = dplyr::if_else(is.na(consensus_annotation) & (!stringr::str_detect(sample_type, "cell line")), "Unknown", consensus_annotation))
+ dplyr::mutate(consensus_annotation = dplyr::if_else(is.na(consensus_annotation) & (sample_type != "cell line"), "Unknown", consensus_annotation))
Member

Is there a reason you changed this to require an exact match?

Member Author

Ah good catch! This is because I copied the script from the analysis repo, which had the old code that did not handle multiple sample types. I'll revert this change and also make sure it's up to date in the analysis repo version of this script.
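To make the difference between the two conditions concrete, here is a small shell sketch of exact versus substring matching on the sample type. It is not the R code from the PR: label_exact and label_substring are hypothetical helpers, the compound sample type is a made-up value, and the string "NA" stands in for R's missing value.

```shell
#!/usr/bin/env bash

# Exact match: only the literal string "cell line" counts as a cell line.
label_exact() {
  if [ "$1" = "cell line" ]; then echo "NA"; else echo "Unknown"; fi
}

# Substring match: any sample type that mentions "cell line" counts.
label_substring() {
  case "$1" in
    *"cell line"*) echo "NA" ;;
    *) echo "Unknown" ;;
  esac
}

# Made-up compound sample type, for illustration only
compound_type="cell line, patient-derived xenograft"

exact_result=$(label_exact "${compound_type}")
substring_result=$(label_substring "${compound_type}")
echo "exact=${exact_result} substring=${substring_result}"
```

With a compound value like this one, the exact comparison labels the cell "Unknown" while the substring check keeps it as NA, which is the behavior the str_detect version preserves for samples with multiple sample types.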

Comment on lines +253 to +268
# rename old consensus cell type columns if they are present
if ("consensus_celltype_annotation" %in% colnames(all_assignments_df)) {
all_assignments_df <- all_assignments_df |>
# rename old consensus columns to avoid confusion
dplyr::rename(
singler_cellassign_consensus_annotation = consensus_celltype_annotation,
singler_cellassign_consensus_ontology = consensus_celltype_ontology
)
} else {
# if no consensus from the object, set to NA
all_assignments_df <- all_assignments_df |>
dplyr::mutate(
singler_cellassign_consensus_annotation = NA,
singler_cellassign_consensus_ontology = NA
)
}
Member

Do we want to do this at the start instead? Part of this is me wondering if we want to make the output column from this script consensus_celltype_annotation for consistency.

Member

Part of this is me wondering if we want to make the output column from this script consensus_celltype_annotation for consistency.

I just want to note here, this change would break a lot of other spots in OpenScPCA-analysis, so we might want to get a sense of how much first. This is just from a real quick 'n dirty search for consensus_annotation, aka some of these hits may not be parsing cell-type-consensus output, but it's a starting point! https://github.com/search?q=repo%3AAlexsLemonade%2FOpenScPCA-analysis+consensus_annotation&type=code

Member Author

I had thought about this when working on it, but I was also concerned about breaking things downstream that use it. I do think we should prioritize how often these columns are used in scripts that get run in CI vs just exploratory notebooks if we update it though.

I'm honestly 50/50 on if we should update it or not, so if others have strong opinions I'm fine with that. I agree it's annoying, but it's also helpful to distinguish which column is from the processed objects vs. which column is from the module, and it prevents any clashes when running this module on processed objects with existing consensus cell types.

I also want to note that eventually the column will be updated in the processed objects that get read in so that it contains the consensus from all three methods. So we should probably update the column names here to be existing_consensus_celltype or something more generic that just indicates it's the consensus cell types from the object.

Member

That makes a lot of sense to me. Absolutely fine to leave consensus_annotation as the output column name here. I would still probably rename at the start though.

@allyhawkins
Member Author

@jashapiro I made the small updates to the script you recommended, including moving the renaming of the older consensus cell type columns. As part of that, I renamed them to existing_celltype_annotation and existing_celltype_ontology to future-proof for when the objects contain all three methods. I don't want to get confused down the line if we name it singler_cellassign and it doesn't match up with what's actually in the column.

I did not yet change the name from consensus_annotation to consensus_celltype_annotation. I didn't want to do this just yet, because this is the name of the column in the reference files and everything in the existing cell-type-consensus module in OpenScPCA-nf. It also helps distinguish from what's in the processed objects. So unless you feel really strongly about changing it, I think we should keep it.

The other thing I'm going to try here is just providing the [] instead of an empty file. I had run this through with the simulated data thinking that what I had here worked, but I realized the simulated data actually has _processed_rna.h5ad for multiplexed samples. So the only way to test that this works is to use the real data. I'm going to test that change with the full scpca data and see what happens.

@allyhawkins
Member Author

@jashapiro I was able to successfully run this through with the real data, so this is now ready for re-review.

I did have to make a small change to the scimilarity module because we hadn't run it since changing to use the path(*_scimilarity.tsv) formatting for specifying output. Previously, it was running on multiplexed samples and just not creating an output file, so there was no error since we define the output file. It turns out we need to filter out any libraries that return an empty list of files before trying to pass them through the process.

Comment on lines +200 to +201
# by default use the lca between cellassign and singler as the consensus cell type
consensus_column_prefix <- "cellassign_singler_pair"
Member

You changed this at my suggestion then changed it back, so I am confused about what exactly is going on here? I would assume that you would want to keep the existing consensus if there were modifications to be made? Or is this coming from somewhere else?

Member Author

@allyhawkins allyhawkins Sep 12, 2025

So this prefix is used to specify which column to grab from the consensus cell type reference. If no scimilarity is present, then the consensus cell type is from the cellassign_singler_pair column in the reference file. If scimilarity is present, then the consensus cell type is from the main consensus_annotation column. So this is separate from naming the output columns.
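The selection described here can be sketched in shell as a minimal illustration. choose_consensus_prefix is a hypothetical helper, and the "consensus" prefix used when annotations exist is an assumption inferred from the consensus_annotation column name; only the cellassign_singler_pair default comes from the diff above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: pick the reference column prefix based on whether a
# SCimilarity annotations file was actually provided and exists on disk.
choose_consensus_prefix() {
  scimilarity_file="$1"
  if [ -n "${scimilarity_file}" ] && [ -f "${scimilarity_file}" ]; then
    # Assumed prefix matching consensus_annotation / consensus_ontology
    echo "consensus"
  else
    # Default from the diff: consensus between CellAssign and SingleR only
    echo "cellassign_singler_pair"
  fi
}

# Temporary file standing in for a real annotations file
annotations_file=$(mktemp)
prefix_with=$(choose_consensus_prefix "${annotations_file}")
prefix_without=$(choose_consensus_prefix "")
rm -f "${annotations_file}"

echo "with=${prefix_with} without=${prefix_without}"
```

Keeping this choice separate from the output column names means the script can always write consensus_annotation and consensus_ontology, whichever reference columns fed them.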

Member

@jashapiro jashapiro left a comment

LGTM, with just a couple comment updates.

sample_id,
project_id,
sce_files,
scimilarity_files ?: []
Member

glad to know this works!

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@allyhawkins allyhawkins merged commit ecba127 into main Sep 12, 2025
3 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/scimilarity-to-consensus branch September 12, 2025 17:44

Development

Successfully merging this pull request may close these issues.

Update cell-type-consensus module to use the output of cell-type-scimilarity
