
Handle distinct directories with same content #26

@simleo


When serializing executions of workflows that take directory parameters, CWLProv does not create corresponding directories in the RO bundle: rather, files are always placed in directories whose name consists of the first two characters of the file's sha1 checksum.
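The layout described above can be sketched as follows. This is only an illustration, not CWLProv code: `bundle_relpath` is a hypothetical helper, and using the full checksum as the file name inside the two-character directory is an assumption.

```python
import hashlib


def bundle_relpath(data: bytes) -> str:
    """Return a content-addressed relative path for a file's bytes.

    Hypothetical helper. The parent directory is named after the first
    two characters of the SHA-1 checksum, as described in the issue.
    Assumption: the file itself is named after the full checksum.
    """
    checksum = hashlib.sha1(data).hexdigest()
    return f"{checksum[:2]}/{checksum}"


# The path depends only on the content, so the original directory
# the file came from is not recorded in it.
path = bundle_relpath(b"dummy")
print(path)
```

Since the path is a function of the file content alone, two files with the same bytes coming from different input directories map to the same location.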

When converting from CWLProv we recreate the original directories, naming each one after the checksum of the concatenation of the sorted checksums of all the files it contains. This means that directories with the same contents end up being mapped to the same directory in the output RO-Crate. This is especially convenient for avoiding data duplication between workflow parameters and tool parameters: for instance, when a directory is an input of both the workflow and its first step.
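A minimal sketch of this naming scheme, under stated assumptions: `directory_id` is a hypothetical name, the checksums are hex digests joined without a separator, and SHA-1 is used for the outer checksum as well. The converter's actual code may differ in these details.

```python
import hashlib


def directory_id(file_checksums):
    """Derive a directory name from the checksums of the files it contains.

    Hypothetical helper. Assumptions: hex digests are concatenated with
    no separator, and the outer checksum is also SHA-1.
    """
    joined = "".join(sorted(file_checksums))
    return hashlib.sha1(joined.encode("ascii")).hexdigest()


# Two directories holding one file each, with identical content.
foo_checksums = [hashlib.sha1(b"dummy").hexdigest()]
bar_checksums = [hashlib.sha1(b"dummy").hexdigest()]

# Same contents give the same name, regardless of the original directory name.
assert directory_id(foo_checksums) == directory_id(bar_checksums)
```

Sorting before concatenating makes the result independent of the order in which the files are listed.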

However, there are cases where we might not want to do that. For instance, suppose that a workflow takes an array of two directories as input:

cwlVersion: v1.2
class: Workflow
requirements:
  ScatterFeatureRequirement: {}

inputs:
  dir_array: Directory[]
outputs: []

steps:
  date_step:
    label: Prints date of input dirs
    scatter: dir
    in:
      dir: dir_array
    out: []
    run: dirdate.cwl

Where dirdate.cwl is:

cwlVersion: v1.2
class: CommandLineTool
baseCommand: [date, "-r"]

inputs:
  dir:
    type: Directory
    inputBinding:
      position: 1
outputs: []

Suppose the workflow is launched with the following parameters:

dir_array:
  - class: Directory
    location: foo
  - class: Directory
    location: bar

Where foo and bar have the same contents, e.g., they both contain a text file whose content is the string "dummy". What we currently get in the RO-Crate is:

{
    "@id": "packed.cwl#main/dir_array",
    "@type": "FormalParameter",
    "additionalType": "Dataset",
    "multipleValues": "True",
    "name": "dir_array"
},
...
{
    "@id": "#pv-main/dir_array",
    "@type": "PropertyValue",
    "exampleOfWork": {
        "@id": "packed.cwl#main/dir_array"
    },
    "name": "dir_array",
    "value": [
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"
        },
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"
        }
    ]
},
...
{
    "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/",
    "@type": "Dataset",
    "alternateName": "foo",
    "exampleOfWork": {
        "@id": "packed.cwl#dirdate.cwl/dir"
    },
    "hasPart": [
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/0c8b9d6f753e8d8ec9276bfe98e993a133847642"
        }
    ]
},

Note that the duplicate id in the value of #pv-main/dir_array is a bug: the list should contain only one copy, since the duplicate makes no sense in the RO-Crate JSON-LD. Also, the Dataset has an alternateName of "foo", while "bar" does not appear in the metadata. Thus, in this case, the representation does not reflect the fact that the workflow took a list of two distinct directories as input.
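The collapse can be reproduced with a small sketch. It is not the converter's code: `directory_id` and `build_entities` are hypothetical, the hashing details are assumptions (hex digests joined without a separator, SHA-1 throughout), and "first name wins" is only a guess consistent with the output shown above.

```python
import hashlib


def directory_id(file_contents):
    """Content-derived directory id (hypothetical, see assumptions above)."""
    checksums = sorted(hashlib.sha1(c).hexdigest() for c in file_contents)
    return hashlib.sha1("".join(checksums).encode("ascii")).hexdigest() + "/"


def build_entities(input_dirs):
    """Build the parameter value list and the Dataset entities.

    input_dirs is a list of (original_name, [file contents]) pairs.
    """
    value = []
    datasets = {}
    for name, contents in input_dirs:
        dir_id = directory_id(contents)
        value.append({"@id": dir_id})
        # Entities are keyed by @id, so a second directory with the same
        # contents cannot get its own entry (assumption: first name wins).
        datasets.setdefault(
            dir_id, {"@id": dir_id, "@type": "Dataset", "alternateName": name}
        )
    return value, datasets


value, datasets = build_entities([("foo", [b"dummy"]), ("bar", [b"dummy"])])

print(value[0] == value[1])  # both references point to the same @id
print([d["alternateName"] for d in datasets.values()])  # "bar" is missing
```

Because the id is derived from content only, any scheme that keys entities by that id ends up with a single Dataset for both inputs.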


Labels: enhancement (New feature or request)
