Currently, the inputs for OpenScPCA-nf modules are channels that contain the following tuple:
[project_id, project_dir]
This makes for a pretty nice compact and universal input, and for modules that depend on nothing but "raw" data it is pretty easy to work with. It is therefore currently the module workflow's responsibility to filter the inputs down to the portion of the data that is needed for each analysis.
However, that structure may not be workable in the case that one module depends on the output of another module. We have not yet described standards for what a module workflow might emit for consumption by other module workflows, but it seems unlikely that the output would be of the form [project_id, project_dir].
There are a couple reasons for this:
- For efficiency, workflows are likely to split input projects into multiple streams, and joining them back into a single output could be difficult
  - For example, the simulate-sce workflow largely performs its simulations on a single sample at a time. While it would be possible to rejoin the results into a single directory, this would mean we would need a process that downloads the data for all samples within each project and then emits a folder containing all of them. This would result in a lot of copying without much benefit.
- The contents of an output directory may not be as predictable as inputs, and it may be better to have more fully specified input channels which contain more information about the kind of data that is needed for each workflow.
A motivating example
I already mentioned the simulate-sce module workflow, and it was the main reason I started thinking about this problem in the first place (though as I write this I may have a simpler/better solution, discussed below). The simulate-sce module workflow does not currently emit anything, but we might want it to, in order to use the results from that workflow in downstream workflows. For example, the merge-sce workflow could be used to create simulated merged data.
However, the merge-sce workflow currently expects [project_id, project_dir] as input, and as discussed, this might be a pain to produce from the simulation workflow, which works on a sample-by-sample basis.
An easier output might be something of the form [project_id, sample_id, sample_dir], but then the consuming workflow (merge-sce in this case) might have to perform some form of merging on the input channel if it wants things organized by project. This isn't necessarily terrible to do: we could use something like the groupTuple() operator.
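As a sketch of that regrouping (the channel contents and IDs here are hypothetical), groupTuple() collects sample-level tuples back into project-level ones:

```nextflow
// hypothetical sample-level channel: [project_id, sample_id, sample_dir]
sample_ch = Channel.of(
    ['SCPCP01', 'SCPCS01', file('sim/SCPCP01/SCPCS01')],
    ['SCPCP01', 'SCPCS02', file('sim/SCPCP01/SCPCS02')],
    ['SCPCP02', 'SCPCS03', file('sim/SCPCP02/SCPCS03')]
)

// groupTuple() groups by the first element (project_id) by default,
// emitting one tuple per project: [project_id, [sample_ids], [sample_dirs]]
project_ch = sample_ch.groupTuple()
```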
This assumes that we want to pass along the whole directory, which we probably don't need to do all the time. The simulate-sce workflow produces both SCE and AnnData objects: different workflows might want to consume one or the other of these, and sometimes both. The merge workflow really only needs the SCEs, and of those, only the processed files. So we could instead have a workflow emit multiple channels: one with [project_id, sample_id, sample_sce] (where we might separate out processed and earlier-stage files), and separately [project_id, sample_id, sample_anndata]. Because AnnData objects are likely to be split by RNA and other features (or because a sample may have more than one library), sample_anndata might actually be an array of files at times, though Nextflow generally handles this pretty reasonably.
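To sketch what multiple emitted channels might look like (the process and channel names here are hypothetical):

```nextflow
workflow simulate {
  take:
    sample_ch // [project_id, sample_id, sample_dir]
  main:
    simulate_sce(sample_ch)          // hypothetical process producing SCE files
    export_anndata(simulate_sce.out) // hypothetical process converting SCE to AnnData
  emit:
    // separate channels for consumers that only need one output type
    sample_sce_ch     = simulate_sce.out   // [project_id, sample_id, sample_sce]
    sample_anndata_ch = export_anndata.out // [project_id, sample_id, [anndata_files]]
}
```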
Other outputs
One can also imagine that some workflows might not output sample-level info at all, or might have multiple files per sample. Another likely scenario is that a workflow outputs some results that need to be combined with other data for a downstream analysis. An example of this might be cell typing or doublet detection: for efficiency we would not want the output of these analyses to be another whole set of annotated SCE files, but instead a table of annotations that we could use to join with the original data in a downstream step.
For outputs like these, we would want to emit them on a sample-by-sample basis, allowing a downstream workflow to use a join() or combine() operation (likely join(), assuming one output per sample, since it is more efficient than combine()).
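For instance, two sample-keyed channels could be combined with join(); a sketch with hypothetical file names:

```nextflow
// hypothetical per-sample output channels, both keyed by [sample_id, project_id]
sce_ch        = Channel.of(['SCPCS01', 'SCPCP01', file('SCPCS01_processed.rds')])
annotation_ch = Channel.of(['SCPCS01', 'SCPCP01', file('SCPCS01_annotations.tsv')])

// join() matches on the first element by default; matching on both
// identifiers avoids a duplicated project_id in the output tuple
combined_ch = sce_ch.join(annotation_ch, by: [0, 1])
// emits: [sample_id, project_id, sce_file, annotation_file]
```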
Thoughts toward a standard
We want to keep the inputs and output from module workflows relatively consistent to make interactions between them as easy as possible. So we should favor relatively short, simple tuples within the channels where we can, especially given the fact that Nextflow does not use named tuples for channel elements. (What happens within a module can get more complex as needed.)
It seems likely that most analyses will have at least some component that is run on a sample-by-sample basis, so it probably makes sense to make the sample the main unit of organization. We could then always have the first two elements of cross-workflow channels be [sample_id, project_id], in that order to facilitate join operations.
Following those two identifiers, we would have a slot for the main data, which we should probably prefer to be a file or array of files. Workflows that produce multiple kinds of outputs that are not expected to be used together often (such as SCE and AnnData output) can emit more than one channel, with different sets of files. Any emit statements should use descriptive names, such as sample_processed_sce_ch to indicate the contents of the final element(s) of the tuple.
Putting this together, we might have a workflow that looked something like the following (though a real workflow would be more complex!):
```nextflow
process do_annotation {
  input:
    tuple val(sample_id), val(project_id), path(sample_sce)
  output:
    tuple val(sample_id), val(project_id), path("annotation.tsv")
  script:
    """
    annotate.R --input ${sample_sce} --output annotation.tsv  # placeholder command
    """
}

workflow annotate {
  take:
    sample_sce_ch // [sample_id, project_id, sample_sce]
  main:
    do_annotation(sample_sce_ch)
  emit:
    sample_annotation_ch = do_annotation.out // [sample_id, project_id, annotation.tsv]
}
```
A later workflow could then combine the channels above for a later analysis that used the annotations:
```nextflow
workflow use_annotations {
  take:
    sample_sce_ch        // [sample_id, project_id, sample_sce]
    sample_annotation_ch // [sample_id, project_id, sample_annotation]
  main:
    // match on both sample_id and project_id to avoid a duplicated project_id element
    combined_ch = sample_sce_ch.join(sample_annotation_ch, by: [0, 1])
    // ...
}
```
The main limitation I see with this plan is that workflows may often also want access to the sample metadata sheet for the project, or to other project-level data like bulk sequencing results. We could pass this in as a separate channel, or include project_dir in each channel as well, with the expectation that most workflows would disregard that element or simply pass it along unmodified.
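If we went the separate-channel route, project-level data could be attached to each sample tuple with combine(by:); a sketch with hypothetical channel and file names:

```nextflow
// hypothetical project-level channel: [project_id, metadata_file]
project_meta_ch = Channel.of(['SCPCP01', file('SCPCP01/single_cell_metadata.tsv')])

// reorder sample tuples to put project_id first, then attach the
// project metadata to every sample from that project
sample_with_meta_ch = sample_sce_ch
    .map { sample_id, project_id, sce -> [project_id, sample_id, sce] }
    .combine(project_meta_ch, by: 0)
// emits: [project_id, sample_id, sce_file, metadata_file]
```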
Or just run the workflow twice
While I was originally thinking about this as a way to have a single run of the workflow that would either include or skip the simulation step before running everything else, it may be much simpler to have two workflows for that use case: one to run the simulation, and a second to run afterward, taking either the real or simulated data bucket as its input. Having written all this, I think this is the way we should probably go, but there may still be value in starting to codify some expectations about workflow outputs for the more complex cases.