Description
When serializing executions of workflows that take directory parameters, CWLProv does not create corresponding directories in the RO bundle. Instead, each file is always placed in a directory whose name consists of the first two characters of the file's SHA-1 checksum.
When converting from CWLProv we recreate the original directories, giving them a name obtained by concatenating the sorted checksums of all contained files and computing the checksum of the concatenation. This means that directories with the same contents end up being mapped to the same directory in the output RO-Crate. This is especially convenient to avoid data duplication between workflow parameters and tool parameters: for instance, when a directory is an input of the workflow and also of the first step.
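As a rough illustration of the naming scheme just described, the sketch below derives a directory name from the checksums of its files. It is not the converter's actual code: the helper names `file_checksum` and `directory_name` are hypothetical, and the use of SHA-1 for the directory checksum and plain string concatenation of hex digests are assumptions.

```python
import hashlib


def file_checksum(data: bytes) -> str:
    """SHA-1 hex digest of a file's contents."""
    return hashlib.sha1(data).hexdigest()


def directory_name(file_checksums) -> str:
    """Hypothetical helper: sort the file checksums, concatenate them,
    and hash the result (SHA-1 and plain concatenation are assumptions)."""
    joined = "".join(sorted(file_checksums))
    return hashlib.sha1(joined.encode("ascii")).hexdigest()


# Two directories at different locations but with identical contents.
foo = {"a.txt": b"dummy"}
bar = {"a.txt": b"dummy"}

name_foo = directory_name(file_checksum(d) for d in foo.values())
name_bar = directory_name(file_checksum(d) for d in bar.values())

# Same contents, so both directories collapse to a single name.
print(name_foo == name_bar)
```

Because the name depends only on the contained checksums, the original locations of the two directories play no part in it, which is why identical directories are mapped to the same entry.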
However, there are cases where we might not want to do that. For instance, suppose that a workflow takes an array of two directories as input:
cwlVersion: v1.2
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  dir_array: Directory[]
outputs: []
steps:
  date_step:
    label: Prints date of input dirs
    scatter: dir
    in:
      dir: dir_array
    out: []
    run: dirdate.cwl

Where dirdate.cwl is:
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [date, "-r"]
inputs:
  dir:
    type: Directory
    inputBinding:
      position: 1
outputs: []

Suppose the workflow is launched with the following parameters:
dir_array:
  - class: Directory
    location: foo
  - class: Directory
    location: bar

Here, foo and bar have the same contents, e.g., they both contain a text file whose content is the string "dummy". What we currently get in the RO-Crate is:
{
    "@id": "packed.cwl#main/dir_array",
    "@type": "FormalParameter",
    "additionalType": "Dataset",
    "multipleValues": "True",
    "name": "dir_array"
},
...
{
    "@id": "#pv-main/dir_array",
    "@type": "PropertyValue",
    "exampleOfWork": {
        "@id": "packed.cwl#main/dir_array"
    },
    "name": "dir_array",
    "value": [
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"
        },
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"
        }
    ]
},
...
{
    "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/",
    "@type": "Dataset",
    "alternateName": "foo",
    "exampleOfWork": {
        "@id": "packed.cwl#dirdate.cwl/dir"
    },
    "hasPart": [
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/0c8b9d6f753e8d8ec9276bfe98e993a133847642"
        }
    ]
},

Note that the duplicate id in the value of #pv-main/dir_array is a bug: the list should contain only one copy, since the duplicate makes no sense in the RO-Crate JSON-LD. Also, the Dataset has an alternateName of "foo", while "bar" does not appear in the metadata. Thus, in this case, the representation does not reflect the fact that the workflow took a list of two distinct directories as input.
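For the duplicate-id bug alone, a minimal sketch of an order-preserving deduplication of the reference list is shown here. The function name `dedup_refs` is hypothetical and is not taken from the converter's code. This only removes the repeated reference; it does not address the broader problem that the two distinct input directories are no longer distinguishable.

```python
def dedup_refs(refs):
    """Drop references whose "@id" was already seen, keeping first-seen order."""
    seen = set()
    unique = []
    for ref in refs:
        if ref["@id"] not in seen:
            seen.add(ref["@id"])
            unique.append(ref)
    return unique


# The "value" list of #pv-main/dir_array as currently generated.
value = [
    {"@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"},
    {"@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"},
]

deduped = dedup_refs(value)
print(deduped)
```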