
Module for exporting openscpca annotations #172

Merged
allyhawkins merged 23 commits into main from
allyhawkins/export-openscpca-annotations
Sep 2, 2025

Conversation

@allyhawkins
Member

Closes #161

Here I'm adding a module for formatting and exporting annotations from OpenScPCA as JSON files to use as input to scpca-nf. The new module is named export_annotations and takes as input a channel with TSV files from annotation modules. It runs a single process, an R script that creates the JSON file from the input TSV file. The JSON file includes an array of barcodes, cell type annotations, ontology IDs if present, workflow version, release date, and the name of the original analysis module. These files are saved to the annotations bucket within the folder corresponding to that data release, following the same organization as the main workflow results output.
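For illustration only, the kind of TSV-to-JSON conversion described above could be sketched in Python as follows. This is not the module's R script. The JSON key names, the `barcodes` column name, the `ontology_column` handling, and the example values are all assumptions made for the sketch.

```python
import csv
import io
import json


def build_annotation_record(tsv_text, annotation_column, ontology_column=None,
                            module_name="", workflow_version="", release_date=""):
    """Turn an annotation TSV into a JSON-serializable dict (hypothetical layout)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    record = {
        "module_name": module_name,
        "workflow_version": workflow_version,
        "release_date": release_date,
        # Assumes the TSV stores cell barcodes in a column named "barcodes"
        "barcodes": [row["barcodes"] for row in rows],
        "openscpca_celltype_annotation": [row[annotation_column] for row in rows],
    }
    # Ontology IDs are optional, so only include them when a column is given
    if ontology_column is not None:
        record["openscpca_celltype_ontology"] = [row[ontology_column] for row in rows]
    return record


# Made-up example input using the Ewing column names mentioned in this PR
example_tsv = (
    "barcodes\tewing_annotation\tewing_ontology\n"
    "AAACCTG\tT cell\tCL:0000084\n"
    "AAACGGG\tB cell\tCL:0000236\n"
)
record = build_annotation_record(
    example_tsv,
    annotation_column="ewing_annotation",
    ontology_column="ewing_ontology",
    module_name="example-module",   # placeholder value
    workflow_version="v0.0.0",      # placeholder value
    release_date="2025-01-01",      # placeholder value
)
print(json.dumps(record, indent=2))
```

Storing each field as a parallel array keeps the file easy to read back as a data frame, since every array has one entry per cell.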

I made this into a separate module that gets run at the end using inputs from the annotation modules. This is currently just Ewings, but the idea would be to combine all the annotation outputs into a single channel to use as input to this module. Alternatively, I could run this as a separate process at the end of each annotation module and not in the main workflow, but I thought that would be harder to maintain.

I think the main thing we should figure out here is the approach to passing in the column names for where the annotations are in each TSV. Currently I have it set up so there are two arguments to the script, one for the annotation column and one for the ontology column. This is because right now, we don't have a standard naming convention for what the columns should be named when they are output in the TSV files from an annotation module. For example, the Ewing columns are ewing_annotation and ewing_ontology. I think there are a few different approaches we could take here:

  • Leave the column names as is and specify what columns to use with arguments as I've done here. This requires us to pass in the column names in the input channel to this workflow.
  • Keep arguments for the column names but make those column names parameters to the workflow rather than hardcoding them in the annotation module. This would require some updates to the R script for assigning Ewing cell typing but might be easier long term to have the names of the columns in one place.
  • Require the annotations that we want to port over to be in a specific openscpca_celltype_annotation column in the TSV files and remove the arguments altogether. With this option, additional annotation columns can be present, but whatever is getting ported over must be in those columns.
    • My only hesitation with this option is that these columns don't exist yet for the Ewing data. I would want to add these as additional columns alongside the existing Ewing output rather than overwriting the existing columns, since there is code in OpenScPCA-analysis that uses these annotation columns.
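As a rough illustration of the third option, the sketch below copies existing annotation columns into standardized columns while leaving the originals in place. It is written in Python rather than the module's R, and the `openscpca_celltype_ontology` column name, the `barcodes` column, and the example values are assumptions, not names confirmed in this PR.

```python
import csv
import io

ANNOTATION_COLUMN = "openscpca_celltype_annotation"
ONTOLOGY_COLUMN = "openscpca_celltype_ontology"  # hypothetical name for the ontology column


def add_standard_columns(tsv_text, annotation_column, ontology_column):
    """Append standardized columns to a TSV without touching the existing ones."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames) + [ANNOTATION_COLUMN, ONTOLOGY_COLUMN]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Copy values so downstream code relying on the original columns still works
        row[ANNOTATION_COLUMN] = row[annotation_column]
        row[ONTOLOGY_COLUMN] = row[ontology_column]
        writer.writerow(row)
    return out.getvalue()


# Made-up example input using the Ewing column names mentioned above
ewing_tsv = "barcodes\tewing_annotation\tewing_ontology\nAAACCTG\tT cell\tCL:0000084\n"
standardized_tsv = add_standard_columns(ewing_tsv, "ewing_annotation", "ewing_ontology")
print(standardized_tsv)
```

Because the original columns are kept, existing analysis code that reads `ewing_annotation` or `ewing_ontology` would be unaffected under this approach.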

Another thing I wanted to note is that I think we should set a standard expectation of what the output tuples and TSV files should be from cell type annotation modules so that they can be used as input to format the annotations. Once we decide what that is we should update the docs for porting modules to include that information.

I did test this and successfully made JSON files that can be easily read into R as a data frame.

@allyhawkins allyhawkins requested a review from jashapiro August 26, 2025 18:11
Member

@jashapiro jashapiro left a comment

Overall this looks good!

I like keeping flexibility in how the columns are named, so having the column names as part of the output seems reasonable to me.

My main suggestion is to wrap up the annotation metadata into a tuple to make keeping track of what goes where for the annotation columns a bit easier and more flexible.

One thought I had at the end was that you could have default values for the columns as well, implemented as something like:

--annotation_column "${annotation_metadata.annotation_column ?: "openscpca_celltype_annotation" }" \

But now maybe I don't think that adds much, since you will require something in the output anyway...

@allyhawkins
Member Author

I took the suggestion to wrap the metadata in a tuple, but chose not to add a default for the columns since I want to make sure there actually is a column that should be ported over. Otherwise it doesn't really make sense to have an empty column.
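To make the trade-off concrete, here is a small Python sketch contrasting a default fallback (in the spirit of the Groovy `?:` suggestion) with the stricter behavior of requiring the column name. The function names are hypothetical and the metadata map is modeled as a plain dictionary; only the `annotation_column` key and the `openscpca_celltype_annotation` name come from this discussion.

```python
DEFAULT_ANNOTATION_COLUMN = "openscpca_celltype_annotation"


def column_with_default(annotation_metadata):
    """Fall back to a default column name when the key is missing or empty."""
    return annotation_metadata.get("annotation_column") or DEFAULT_ANNOTATION_COLUMN


def column_required(annotation_metadata):
    """Fail loudly when no annotation column is named, instead of guessing."""
    column = annotation_metadata.get("annotation_column")
    if not column:
        raise ValueError(
            "annotation_metadata must include a non-empty 'annotation_column'"
        )
    return column


# With a default, a module that forgot to name its column is silently accepted
print(column_with_default({}))
# With a required key, the same mistake is caught before any JSON is written
try:
    column_required({})
except ValueError as error:
    print(f"error: {error}")
```

The strict version matches the reasoning above: a missing column name surfaces as an error rather than producing a JSON file with an empty annotation column.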

I also added an annotations bucket for the stub profile; otherwise, it complains when running the Ewing project.

I will file an issue to add some information to docs about how to format the output tuples for annotation modules, but otherwise this should be ready for another look.

Member

@jashapiro jashapiro left a comment

LGTM, with a small update to a comment to clarify that we are using a metadata map (dictionary) with specific keys.

Otherwise this looks good (though there is some kind of conflict that needs to be resolved now).

allyhawkins and others added 3 commits September 2, 2025 10:15
@allyhawkins allyhawkins merged commit c232c5d into main Sep 2, 2025
3 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/export-openscpca-annotations branch September 2, 2025 15:52

Development

Successfully merging this pull request may close these issues.

Add a process for exporting cell type annotations to use as input for scpca-nf

2 participants