Module for exporting openscpca annotations #172
jashapiro
left a comment
Overall this looks good!
I like keeping the flexibility in how columns are named, so having the column names as part of the output seems reasonable to me.
My main suggestion is to wrap up the annotation metadata into a tuple to make keeping track of what goes where for the annotation columns a bit easier and more flexible.
One thought I had at the end was that you could have default values for the columns as well, implemented as something like:
--annotation_column "${annotation_metadata.annotation_column ?: "openscpca_celltype_annotation" }" \
But now maybe I don't think that adds much, since you will require something in the output anyway...
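To make the trade-off concrete, here is a rough sketch of the two behaviors: falling back to a default column name versus requiring one. It is written in Python rather than the Nextflow/Groovy used by the workflow, purely for illustration. The `resolve_column` helper, the `ontology_column` and `module_name` keys, and the example values are hypothetical; only `annotation_column` and the `openscpca_celltype_annotation` default come from the suggestion above.

```python
# Hypothetical default, taken from the suggested `?:` fallback above.
DEFAULT_ANNOTATION_COLUMN = "openscpca_celltype_annotation"


def resolve_column(metadata, key, default=None):
    """Return metadata[key], or the default if the key is missing or empty.

    Raises ValueError when neither a value nor a default is available,
    which mirrors requiring the column instead of silently defaulting.
    """
    value = metadata.get(key) or default
    if not value:
        raise ValueError(f"annotation metadata is missing required key '{key}'")
    return value


# Illustrative metadata map for one annotation module (placeholder values).
ewing_metadata = {
    "module_name": "example-module",
    "annotation_column": "ewing_annotation",
    "ontology_column": "ewing_ontology",
}

# An explicit value always wins over the default.
explicit = resolve_column(ewing_metadata, "annotation_column", DEFAULT_ANNOTATION_COLUMN)

# With a default, a missing key falls back quietly.
fallback = resolve_column({}, "annotation_column", DEFAULT_ANNOTATION_COLUMN)
```

Calling `resolve_column({}, "annotation_column")` without a default raises, which is the stricter behavior: a module that does not name its annotation column fails early rather than producing an empty column.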
modules/export-annotations/resources/usr/bin/export-celltype-json.R
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
…enscpca-annotations
I took the suggestion to wrap the metadata in a tuple, but chose not to add a default for the columns since I want to make sure there actually is a column that should be ported over. Otherwise it doesn't really make sense to have an empty column. I also added an annotations bucket for the stub profile; otherwise it complains when running the Ewing project. I will file an issue to add some information to docs about how to format the output tuples for annotation modules, but otherwise this should be ready for another look.
jashapiro
left a comment
LGTM, with a small update to a comment to clarify that we are using a metadata map (dictionary) with specific keys.
Otherwise this looks good (though there is some kind of conflict that needs to be resolved now)
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
…enscpca-annotations
Closes #161
Here I'm adding a module to be used for formatting and exporting annotations from OpenScPCA as JSON files to use as input to `scpca-nf`. The new module is named `export_annotations` and takes as input a channel with TSV files from annotation modules. It runs a single process, an R script that creates the JSON file using the TSV file as input. The JSON file includes an array of barcodes, cell type annotations, ontology IDs if present, workflow version, release date, and the name of the original analysis module. These files are saved to the annotations bucket within the folder corresponding to that data release, following the same organization as the main workflow results output.

I made this into a separate module that gets run at the end using inputs from the annotation modules. This is currently just Ewing, but the idea would be to combine all the annotation outputs into a single channel to use as input to this module. Alternatively, I could run this as a separate process at the end of each annotation module and not in the main workflow, but I thought that would be harder to maintain.

I think the main thing we should figure out here is the approach to passing in the column names for where the annotations are in each TSV. Currently I have it set up so there are two arguments to the script, one for the annotation column and one for the ontology column. This is because right now, we don't have a standard naming convention for what the columns should be named when they are output in the TSV files from an annotation module. For example, the Ewing columns are `ewing_annotation` and `ewing_ontology`. I think there are a few different approaches we could take here:

- … `openscpca_celltype_annotation` column in the TSV files and remove the arguments altogether. With this option additional annotation columns can be present, but whatever is getting ported over must be in those columns.
- … `OpenScPCA-analysis` that uses these annotation columns.

Another thing I wanted to note is that I think we should set a standard expectation of what the output tuples and TSV files should be from cell type annotation modules so that they can be used as input to format the annotations. Once we decide what that is, we should update the docs for porting modules to include that information.
I did test this and successfully made JSON files that can be easily read into R as a data frame.
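For readers who want a feel for the TSV-to-JSON step described above, a minimal sketch follows. It is written in Python rather than the module's R script and is not the module's implementation. The JSON key names, the `barcodes` column name, the function names, and all sample values are assumptions made up for illustration.

```python
import csv
import io
import json


def build_annotation_json(tsv_text, annotation_column, ontology_column=None, *,
                          module_name, workflow_version, release_date,
                          barcode_column="barcodes"):
    """Convert annotation TSV text into a JSON string of per-cell arrays."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    columns = reader.fieldnames or []

    # The barcode and annotation columns must exist; there is no default.
    for required in (barcode_column, annotation_column):
        if required not in columns:
            raise KeyError(f"required column '{required}' not found in TSV")

    record = {
        "module_name": module_name,
        "workflow_version": workflow_version,
        "release_date": release_date,
        "barcodes": [row[barcode_column] for row in rows],
        "openscpca_celltype_annotation": [row[annotation_column] for row in rows],
    }
    # Ontology IDs are only included when the column is named and present.
    if ontology_column and ontology_column in columns:
        record["openscpca_celltype_ontology"] = [row[ontology_column] for row in rows]
    return json.dumps(record, indent=2)


def read_annotation_json(json_text):
    """Read the JSON back into one dict per cell (a simple data-frame stand-in)."""
    record = json.loads(json_text)
    per_cell = {key: value for key, value in record.items() if isinstance(value, list)}
    keys = list(per_cell)
    return [dict(zip(keys, values)) for values in zip(*per_cell.values())]


# Made-up input resembling an annotation module's TSV output.
example_tsv = (
    "barcodes\tewing_annotation\tewing_ontology\n"
    "AAACCTG\ttumor\tNA\n"
    "GGGTTAC\tT cell\tCL:0000084\n"
)
example_json = build_annotation_json(
    example_tsv, "ewing_annotation", "ewing_ontology",
    module_name="example-module", workflow_version="v0.0.0",
    release_date="2024-01-01",
)
example_rows = read_annotation_json(example_json)
```

Storing the per-cell fields as parallel arrays keeps the run-level fields (module name, version, release date) from being repeated on every row, and the arrays zip back into a table on read.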