Module for exporting openscpca annotations by allyhawkins · Pull Request #172 · AlexsLemonade/OpenScPCA-nf

allyhawkins · 2025-08-26T18:11:46Z

Closes #161

Here I'm adding a module for formatting and exporting annotations from OpenScPCA as JSON files to use as input to scpca-nf. The new module is named export_annotations and takes as input a channel of TSV files from annotation modules. It runs a single process, which uses an R script to create the JSON file from the TSV file. The JSON file includes an array of barcodes, cell type annotations, ontology IDs (if present), the workflow version, the release date, and the name of the original analysis module. These files are saved to the annotations bucket within the folder corresponding to that data release, following the same organization as the main workflow results output.
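To make the described output more concrete, here is a minimal Python sketch of the TSV-to-JSON conversion. It is only an illustration: the module's script is written in R and is not reproduced here, and the JSON key names, the `barcodes` column name, the one-array-per-field layout, and all example values are assumptions, not taken from the PR.

```python
import csv
import io
import json


def build_annotation_json(tsv_text, annotation_column, ontology_column=None,
                          module_name="", workflow_version="", release_date=""):
    """Build a JSON string from annotation TSV text.

    All output key names and the "barcodes" column name are hypothetical.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    record = {
        "module_name": module_name,
        "workflow_version": workflow_version,
        "release_date": release_date,
        "barcodes": [row["barcodes"] for row in rows],
        "openscpca_celltype_annotation": [row[annotation_column] for row in rows],
    }
    # Ontology IDs are only included when an ontology column is provided
    if ontology_column is not None:
        record["openscpca_celltype_ontology"] = [row[ontology_column] for row in rows]
    return json.dumps(record, indent=2)


# Made-up example data; barcodes, labels, and IDs are placeholders
example_tsv = (
    "barcodes\tewing_annotation\tewing_ontology\n"
    "AAAC\tlabel_a\tID:1\n"
    "TTTG\tlabel_b\tID:2\n"
)

example_json = build_annotation_json(
    example_tsv,
    annotation_column="ewing_annotation",
    ontology_column="ewing_ontology",
    module_name="cell-type-ewings",
    workflow_version="v0.0.0",  # placeholder version
    release_date="2025-01-01",  # placeholder date
)
```

Storing each field as a flat array of equal length is one way to keep the file easy to load as a data frame, but the layout used by the module may differ.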

I made this into a separate module that gets run at the end using inputs from the annotation modules. This is currently just Ewings, but the idea would be to combine all the annotation outputs into a single channel to use as input to this module. Alternatively, I could run this as a separate process at the end of each annotation module and not in the main workflow, but I thought that would be harder to maintain.

I think the main thing we should figure out here is how to pass in the names of the columns that hold the annotations in each TSV. Currently the script takes two arguments, one for the annotation column and one for the ontology column, because we don't yet have a standard naming convention for the columns that annotation modules output in their TSV files. For example, the Ewing columns are ewing_annotation and ewing_ontology. I think there are a few different approaches we could take here:

1. Leave the column names as is and specify which columns to use with arguments, as I've done here. This requires us to pass the column names in the input channel to this workflow.
2. Keep the arguments for the column names, but make those column names parameters to the workflow rather than hardcoding them in the annotation module. This would require some updates to the R script for assigning Ewing cell types, but having the column names in one place might be easier long term.
3. Require the annotations that we want to port over to be in a specific openscpca_celltype_annotation column in the TSV files and remove the arguments altogether. With this option, additional annotation columns can be present, but whatever is getting ported over must be in those columns.
   - My only hesitation with this option is that these columns don't exist yet for the Ewing data. I would want to add them as additional columns alongside the existing Ewing output rather than overwrite the existing columns, since there is code in OpenScPCA-analysis that uses those annotation columns.
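The difference between passing column names as arguments and requiring a fixed column can be sketched as follows. This is a hedged Python illustration, not code from the PR: the function name, the `barcodes` column, and the error handling are assumptions. Only the column names `ewing_annotation`, `ewing_ontology`, and `openscpca_celltype_annotation` come from the discussion.

```python
STANDARD_ANNOTATION_COLUMN = "openscpca_celltype_annotation"


def resolve_annotation_column(header, annotation_column=None):
    """Return the name of the TSV column holding the annotations to export.

    If annotation_column is given (the argument-based options), use it.
    Otherwise fall back to the fixed standard column (the last option).
    """
    column = annotation_column or STANDARD_ANNOTATION_COLUMN
    if column not in header:
        raise ValueError(f"Column '{column}' not found in TSV header: {header}")
    return column


# Header of a hypothetical Ewing TSV ("barcodes" is an assumed column name)
ewing_header = ["barcodes", "ewing_annotation", "ewing_ontology"]

# Argument-based: the caller names the column explicitly
explicit = resolve_annotation_column(ewing_header, "ewing_annotation")

# Fixed column: no argument, so the standard column must already be in the TSV
try:
    resolve_annotation_column(ewing_header)
    standard_found = True
except ValueError:
    standard_found = False  # this header lacks the standard column
```

With the fixed-column approach, the export step fails until the annotation module writes the standard column, which is the hesitation raised for the Ewing data.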

Another thing I wanted to note is that I think we should set a standard expectation of what the output tuples and TSV files should be from cell type annotation modules so that they can be used as input to format the annotations. Once we decide what that is we should update the docs for porting modules to include that information.

I did test this and successfully made JSON files that can be easily read into R as a data frame.

jashapiro

Overall this looks good!

I like keeping flexibility in how the columns are named, so having the column names as part of the output seems reasonable to me.

My main suggestion is to wrap up the annotation metadata into a tuple to make keeping track of what goes where for the annotation columns a bit easier and more flexible.

One thought I had at the end was that you could have default values for the columns as well, implemented as something like:

--annotation_column "${annotation_metadata.annotation_column ?: "openscpca_celltype_annotation" }" \

But now maybe I don't think that adds much, since you will require something in the output anyway...
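For readers less familiar with Groovy's `?:` (Elvis) operator, the fallback behavior in that snippet is roughly equivalent to the following Python sketch. The function name is hypothetical, and only the `annotation_column` key and the default column name come from the snippet.

```python
DEFAULT_ANNOTATION_COLUMN = "openscpca_celltype_annotation"


def annotation_column_argument(annotation_metadata):
    """Build the --annotation_column argument, falling back to a default."""
    # `or` falls back on None and on empty strings, similar to Groovy's `?:`
    column = annotation_metadata.get("annotation_column") or DEFAULT_ANNOTATION_COLUMN
    return ["--annotation_column", column]


with_column = annotation_column_argument({"annotation_column": "ewing_annotation"})
without_column = annotation_column_argument({})
```

The trade-off is that a missing key is silently replaced by the default, so a module that forgot to set its column name would not raise an error at this step.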

modules/cell-type-ewings/main.nf

modules/export-annotations/main.nf

modules/export-annotations/resources/usr/bin/export-celltype-json.R

modules/export-annotations/main.nf


allyhawkins · 2025-08-27T15:39:59Z

I took the suggestion to wrap the metadata in a tuple, but chose not to add a default for the columns since I want to make sure there actually is a column that should be ported over. Otherwise it doesn't really make sense to have an empty column.
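Requiring the column names, rather than defaulting them, amounts to validating the metadata before the export step runs. A minimal Python sketch of that idea follows. The set of required keys and the function name are assumptions, since the discussion only says that a metadata map with specific keys is used.

```python
# Hypothetical set of keys that every annotation module must provide
REQUIRED_KEYS = ("module_name", "annotation_column")


def validate_annotation_metadata(annotation_metadata):
    """Return the metadata unchanged, or raise if a required key is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not annotation_metadata.get(key)]
    if missing:
        raise ValueError(f"Annotation metadata is missing required keys: {missing}")
    return annotation_metadata


valid = validate_annotation_metadata(
    {"module_name": "cell-type-ewings", "annotation_column": "ewing_annotation"}
)
```

Failing early like this means a module with no column to export is caught immediately and never produces a JSON file with an empty annotation column.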

I also added an annotations bucket for the stub profile; otherwise, it complains when running the Ewing project.

I will file an issue to add some information to docs about how to format the output tuples for annotation modules, but otherwise this should be ready for another look.

jashapiro

LGTM, with a small update to a comment to clarify that we are using a metadata map (dictionary) with specific keys.

Otherwise this looks good (though there is some kind of conflict that needs to be resolved now).

main.nf


allyhawkins added 13 commits August 25, 2025 16:12

process and script for exporting json

6586dd2

add annotations bucket

e237f49

format ewing output

b770305

include exporting in main workflow

f1dbd73

add stub to export process

ec1c0ff

temp comment out other steps

c66365c

date not data

ae540fe

make sure column names are strings

2ea6a74

add readme

fde8ed9

add some messages for debugging

a7b9b95

directly export file

090d8da

add annotations bucket to schema

8c7c19a

simplify vector to remove nested lists

091d130

allyhawkins requested a review from jashapiro August 26, 2025 18:11

jashapiro reviewed Aug 26, 2025


allyhawkins and others added 7 commits August 27, 2025 10:10

Apply suggestions from code review

7e2adda

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

Merge remote-tracking branch 'origin/main' into allyhawkins/export-op…

a917201

…enscpca-annotations

update comments to use annotation metadata

e738c6f

fix tuple setup for metadata

4dba656

set stub annotations output

37521df

uncomment out other modules

8d2bd16

update comment

499ae90

allyhawkins requested a review from jashapiro August 27, 2025 15:40

allyhawkins mentioned this pull request Aug 27, 2025

Document expected output for cell type annotation modules #174

Closed

jashapiro approved these changes Sep 2, 2025


allyhawkins and others added 3 commits September 2, 2025 10:15

Update comment

8a33498

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

Merge remote-tracking branch 'origin/main' into allyhawkins/export-op…

9fb81fa

…enscpca-annotations

remove duplicate infercnv module

e8c94e8

allyhawkins merged commit c232c5d into main Sep 2, 2025
3 checks passed

allyhawkins deleted the allyhawkins/export-openscpca-annotations branch September 2, 2025 15:52

allyhawkins merged 23 commits into main from allyhawkins/export-openscpca-annotations
