Commit 84dcdf0

move prov docs out of README
1 parent 728bcb8 commit 84dcdf0

File tree

2 files changed

+361
-362
lines changed

CWLProv.rst

Lines changed: 361 additions & 0 deletions
@@ -0,0 +1,361 @@
1+
Provenance capture
2+
------------------
3+
4+
It is possible to capture the full provenance of a workflow execution to
5+
a folder, including intermediate values:
6+
7+
cwltool --provenance revsort-run-1/ tests/wf/revsort.cwl tests/wf/revsort-job.json
8+
9+
Who executed the workflow?
10+
^^^^^^^^^^^^^^^^^^^^^^^^^^
11+
12+
Optional parameters are available to capture information about *who* executed the workflow *where*:
13+
14+
cwltool --orcid https://orcid.org/0000-0002-1825-0097 \
15+
--full-name "Alice W Land" \
16+
--enable-user-provenance --enable-host-provenance \
17+
--provenance revsort-run-1/ \
18+
tests/wf/revsort.cwl tests/wf/revsort-job.json
19+
20+
These parameters are opt-in as they track person-identifiable information.
21+
The options ``--enable-user-provenance`` and ``--enable-host-provenance`` will
22+
pick up account/machine info from where ``cwltool`` is executed (e.g.
23+
UNIX username). This may get the full name of the user wrong, in which case
24+
``--full-name`` can be supplied.
25+
26+
For consistent tracking it is recommended to apply for
27+
an `ORCID <https://orcid.org/>`__ identifier and provide it as above,
28+
since ``--enable-user-provenance --enable-host-provenance``
29+
are only able to identify the local machine account.
30+
31+
It is possible to set the shell environment variables
32+
``ORCID`` and ``CWL_FULL_NAME`` to avoid supplying ``--orcid``
33+
or ``--full-name`` for every workflow run,
34+
for instance by augmenting the ``~/.bashrc`` or equivalent:
35+
36+
export ORCID=https://orcid.org/0000-0002-1825-0097
37+
export CWL_FULL_NAME="Stian Soiland-Reyes"
38+
39+
Care should be taken to preserve spaces when setting ``--full-name`` or ``CWL_FULL_NAME``, for instance by quoting the value as shown above.
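
When ``cwltool`` is launched from a script, the options above could be assembled programmatically. The following is a minimal sketch that only builds the argument list from the flags shown in this section; the helper name ``build_provenance_command`` is hypothetical and not part of cwltool.

```python
def build_provenance_command(prov_dir, workflow, job, orcid=None, full_name=None):
    """Assemble a cwltool argument list (hypothetical helper, not part of cwltool)."""
    cmd = ["cwltool"]
    # Identity options are only added when given explicitly; otherwise the
    # ORCID and CWL_FULL_NAME environment variables described above apply.
    if orcid:
        cmd += ["--orcid", orcid]
    if full_name:
        # Keeping the name as a single list item preserves its spaces.
        cmd += ["--full-name", full_name]
    cmd += ["--enable-user-provenance", "--enable-host-provenance"]
    cmd += ["--provenance", prov_dir, workflow, job]
    return cmd


command = build_provenance_command(
    "revsort-run-1/",
    "tests/wf/revsort.cwl",
    "tests/wf/revsort-job.json",
    orcid="https://orcid.org/0000-0002-1825-0097",
    full_name="Alice W Land",
)
print(command)
```

The resulting list could then be handed to ``subprocess.run(command, check=True)``; passing a list rather than a shell string avoids having to quote the full name.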
40+
41+
42+
CWLProv folder structure
43+
^^^^^^^^^^^^^^^^^^^^^^^^
44+
45+
The CWLProv folder structure under ``revsort-run-1/`` is a
46+
`Research Object <http://www.researchobject.org/>`__
47+
that conforms to the `RO BagIt profile <https://w3id.org/ro/bagit>`__
48+
and contains `PROV <https://www.w3.org/TR/prov-overview/>`__
49+
traces detailing the execution of the workflow and its steps.
50+
51+
52+
A rough overview of the CWLProv folder structure:
53+
54+
* ``bagit.txt`` - bag marker for `BagIt <https://tools.ietf.org/html/draft-kunze-bagit-14>`__.
55+
* ``bag-info.txt`` - minimal bag metadata. The ``External-Identifier`` key shows which `arcp <https://tools.ietf.org/id/draft-soilandreyes-arcp-03.html>`__ can be used as base URI within the folder bag.
56+
* ``manifest-*.txt`` - checksums of files under ``data/`` (algorithms subject to change)
57+
* ``tagmanifest-*.txt`` - checksums of the remaining files (algorithms subject to change)
58+
* ``metadata/manifest.json`` - `Research Object manifest <https://w3id.org/bundle/#manifest>`__ as JSON-LD. Types and relates files within bag.
59+
* ``metadata/provenance/primary.cwlprov*`` - `PROV <https://www.w3.org/TR/prov-overview/>`__ trace of main workflow execution in alternative PROV and RDF formats
60+
* ``data/`` - bag payload, workflow/step input/output data files (content-addressable)
61+
* ``data/32/327fc7aedf4f6b69a42a7c8b808dc5a7aff61376`` - a data item with checksum ``327fc7aedf4f6b69a42a7c8b808dc5a7aff61376`` (checksum algorithm is subject to change)
62+
* ``workflow/packed.cwl`` - The ``cwltool --pack`` standalone version of the executed workflow
63+
* ``workflow/primary-job.json`` - Job input for use with packed.cwl (references ``data/*``)
64+
* ``snapshot/`` - Direct copies of original files used for execution, but may have broken relative/absolute paths
65+
66+
67+
See the `CWLProv paper <https://doi.org/10.5281/zenodo.1208477>`__ for more details.
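
The content-addressable layout of ``data/`` can be illustrated with a short sketch. It assumes, based only on the example entry above, that the SHA-1 checksum is the file name and that its first two hexadecimal digits name the subfolder; the function ``data_path`` is hypothetical, and the checksum algorithm is subject to change as noted.

```python
import hashlib


def data_path(content: bytes) -> str:
    """Return the assumed bag-relative path for a payload with this content."""
    checksum = hashlib.sha1(content).hexdigest()
    # Assumption: the first two hex digits form the subfolder,
    # mirroring data/32/327fc7ae... in the overview above.
    return f"data/{checksum[:2]}/{checksum}"


print(data_path(b"hello"))
```

Because the path is derived from the content alone, identical values used at several points of a workflow would map to a single file.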
68+
69+
Research Object manifest
70+
^^^^^^^^^^^^^^^^^^^^^^^^
71+
72+
The file ``metadata/manifest.json`` follows the structure defined for `Research Object Bundles <https://w3id.org/bundle/#manifest>`__ - but
73+
note that ``.ro/`` is instead called ``metadata/`` as this conforms to the `RO BagIt profile <https://w3id.org/ro/bagit>`__.
74+
75+
Some of the keys of the CWLProv manifest are explained below::
76+
77+
"@context": [
78+
{
79+
"@base": "arcp://uuid,67f38794-d24a-435f-bd4a-0242a56a581b/metadata/"
80+
},
81+
"https://w3id.org/bundle/context"
82+
]
83+
84+
This `JSON-LD context <https://json-ld.org/>`__ enables consumers to alternatively consume the JSON file as Linked Data with absolute identifiers.
85+
The key for that is the ``@base``, which means URIs within this JSON file are relative to the ``metadata/`` folder
86+
within this Research Object bag; the remaining keys are defined by the external JSON-LD context ``https://w3id.org/bundle/context``.
87+
88+
Output from ``cwltool`` should follow the JSON structure shown below; however, interested consumers may alternatively parse it as JSON-LD with an RDF triple store like `Apache Jena <https://jena.apache.org/download/>`__ for further querying.
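
For consumers that read the manifest as plain JSON, relative URIs still need to be resolved against ``@base`` by hand. The sketch below shows one way to do this; the function ``resolve`` is hypothetical, and the path handling is written out explicitly because ``urllib.parse.urljoin`` only resolves relative references for URI schemes it knows about, which is assumed not to include ``arcp``.

```python
import posixpath
import re

# Matches an absolute URI such as urn:hash::sha1:... or https://...
ABSOLUTE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")
ARCP_BASE = re.compile(r"^(arcp://[^/]+)(/.*)?$")


def resolve(base: str, reference: str) -> str:
    """Resolve a manifest URI reference against an arcp:// base URI."""
    if ABSOLUTE.match(reference):
        return reference  # already absolute, nothing to do
    match = ARCP_BASE.match(base)
    if match is None:
        raise ValueError(f"not an arcp base URI: {base}")
    authority, base_path = match.group(1), match.group(2) or "/"
    if reference.startswith("/"):
        joined = reference
    else:
        # Relative references start from the folder of the base path.
        joined = posixpath.join(posixpath.dirname(base_path), reference)
    path = posixpath.normpath(joined)
    if reference.endswith("/") and not path.endswith("/"):
        path += "/"  # normpath drops a trailing slash; restore it
    return authority + path


BASE = "arcp://uuid,67f38794-d24a-435f-bd4a-0242a56a581b/metadata/"
print(resolve(BASE, "../workflow/packed.cwl"))
print(resolve(BASE, "provenance/primary.cwlprov.xml"))
```

This simplified resolver ignores query strings and fragments, so a full JSON-LD processor remains the safer choice when the manifest is consumed as Linked Data.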
89+
90+
The manifest lists which software version created the Research Object - we will see this UUID again later::
91+
92+
"createdBy": {
93+
"uri": "urn:uuid:7c9d9e88-666b-4977-85f4-c02da08a942d",
94+
"name": "cwltool 1.0.20180416145054"
95+
}
96+
97+
Secondly, the manifest lists the person who "authored the run" - that is, who put the workflow and inputs together with cwltool::
98+
99+
"authoredBy": {
100+
"orcid": "https://orcid.org/0000-0002-1825-0097",
101+
"name": "Stian Soiland-Reyes"
102+
}
103+
104+
Note that the author of the workflow run may differ from the author of the workflow definition.
105+
106+
The list of ``aggregates`` contains the main resources that this Research Object transports::
107+
108+
"aggregates": [
109+
{
110+
"uri": "urn:hash::sha1:53870991af88a6d678cbeed3255bb65993c52925",
111+
...
112+
},
113+
{ "uri": "provenance/primary.cwlprov.xml",
114+
...
115+
},
116+
{
117+
"uri": "../workflow/packed.cwl",
118+
"createdBy": {
119+
"uri": "urn:uuid:7c9d9e88-666b-4977-85f4-c02da08a942d",
120+
"name": "cwltool 1.0.20180416145054"
121+
},
122+
"conformsTo": "https://w3id.org/cwl/",
123+
"mediatype": "text/x+yaml; charset=\"UTF-8\"",
124+
"createdOn": "2018-04-16T18:27:09.513824"
125+
},
126+
{
127+
"uri": "../snapshot/hello-workflow.cwl",
128+
"conformsTo": "https://w3id.org/cwl/",
129+
"mediatype": "text/x+yaml; charset=\"UTF-8\"",
130+
"createdOn": "2018-04-04T13:29:55.717707"
131+
}
132+
133+
134+
Beyond being a listing of file names and identifiers, this also lists formats and light-weight provenance. We note that the
135+
CWL file is marked to conform to the https://w3id.org/cwl/ CWL specification.
136+
137+
Some of the files like ``packed.cwl`` have been created by cwltool as part of the run, while others have been created before the run "outside".
138+
Note that ``cwltool`` is currently unable to extract the original authors and contributors of the original files; this is planned for future versions.
139+
140+
Under ``annotations`` we see that the main point of this whole research object (``/`` aka ``arcp://uuid,67f38794-d24a-435f-bd4a-0242a56a581b/``)
141+
is to describe something called ``urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b``::
142+
143+
"annotations": [
144+
{
145+
"about": "urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b",
146+
"content": "/",
147+
"oa:motivatedBy": {
148+
"@id": "oa:describing"
149+
}
150+
},
151+
152+
153+
We will later see that this is the UUID for the workflow run. A workflow run is an *activity*,
154+
something that happens - it can't be directly saved to a file. However it can be *described* in
155+
different ways, in this case as CWLProv provenance::
156+
157+
158+
{
159+
"about": "urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b",
160+
"content": [
161+
"provenance/primary.cwlprov.xml",
162+
"provenance/primary.cwlprov.nt",
163+
"provenance/primary.cwlprov.ttl",
164+
"provenance/primary.cwlprov.provn",
165+
"provenance/primary.cwlprov.jsonld",
166+
"provenance/primary.cwlprov.json"
167+
],
168+
"oa:motivatedBy": {
169+
"@id": "http://www.w3.org/ns/prov#has_provenance"
170+
}
171+
172+
Finally the research object wants to highlight the workflow file::
173+
174+
{
175+
"about": "workflow/packed.cwl",
176+
"oa:motivatedBy": {
177+
"@id": "oa:highlighting"
178+
}
179+
},
180+
181+
182+
And links the run ID ``67f38794..`` to the ``primary-job.json`` and ``packed.cwl``::
183+
184+
{
185+
"about": "urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b",
186+
"content": [
187+
"workflow/packed.cwl",
188+
"workflow/primary-job.json"
189+
],
190+
"oa:motivatedBy": {
191+
"@id": "oa:linking"
192+
}
193+
}
194+
195+
Note: the ``oa:motivatedBy`` values used in CWLProv are subject to change.
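
The annotations described above can be queried with plain JSON tooling. The following sketch assumes a manifest shaped like the excerpts in this section (here abbreviated and embedded as a string); the helper ``contents_by_motivation`` is hypothetical, and the motivation identifiers it matches on may change as noted.

```python
import json

# Abbreviated manifest, assembled from the excerpts shown above.
MANIFEST_EXCERPT = """
{
  "annotations": [
    {"about": "urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b",
     "content": "/",
     "oa:motivatedBy": {"@id": "oa:describing"}},
    {"about": "urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b",
     "content": ["provenance/primary.cwlprov.xml",
                 "provenance/primary.cwlprov.provn"],
     "oa:motivatedBy": {"@id": "http://www.w3.org/ns/prov#has_provenance"}},
    {"about": "workflow/packed.cwl",
     "oa:motivatedBy": {"@id": "oa:highlighting"}}
  ]
}
"""


def contents_by_motivation(manifest, motivation):
    """Collect the "content" entries of annotations with the given motivation."""
    found = []
    for annotation in manifest.get("annotations", []):
        if annotation.get("oa:motivatedBy", {}).get("@id") != motivation:
            continue
        content = annotation.get("content", [])
        # "content" may be a single string or a list of strings.
        found.extend([content] if isinstance(content, str) else content)
    return found


manifest = json.loads(MANIFEST_EXCERPT)
prov_files = contents_by_motivation(
    manifest, "http://www.w3.org/ns/prov#has_provenance"
)
print(prov_files)
```

The returned paths are relative to ``metadata/``, so they would still need to be resolved against the manifest's ``@base`` before opening the files.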
196+
197+
198+
PROV profile
199+
^^^^^^^^^^^^
200+
201+
The underlying model and information of the `PROV <https://www.w3.org/TR/prov-overview/>`__
202+
files under ``metadata/provenance`` is the same, but is made available in multiple
203+
serialization formats:
204+
205+
* primary.cwlprov.provn -- `PROV-N <https://www.w3.org/TR/prov-n/>`__ Textual Provenance Notation
206+
* primary.cwlprov.xml -- `PROV-XML <https://www.w3.org/TR/prov-xml/>`__
207+
* primary.cwlprov.json -- `PROV-JSON <https://www.w3.org/Submission/prov-json/>`__
208+
* primary.cwlprov.jsonld -- `PROV-O <https://www.w3.org/TR/prov-o/>`__ as `JSON-LD <https://json-ld.org/>`__ (``@context`` subject to change)
209+
* primary.cwlprov.ttl -- `PROV-O <https://www.w3.org/TR/prov-o/>`__ as `RDF Turtle <https://www.w3.org/TR/turtle/>`__
210+
* primary.cwlprov.nt -- `PROV-O <https://www.w3.org/TR/prov-o/>`__ as `RDF N-Triples <https://www.w3.org/TR/n-triples/>`__
211+
212+
The extracts below use the PROV-N syntax for brevity.
213+
214+
CWLPROV namespaces
215+
^^^^^^^^^^^^^^^^^^
216+
217+
Note that the identifiers must be expanded using the defined ``prefix`` declarations when comparing across serializations.
218+
These set which vocabularies ("namespaces") are used by the CWLProv statements::
219+
220+
prefix data <urn:hash::sha1:>
221+
prefix input <arcp://uuid,0e6cb79e-fe70-4807-888c-3a61b9bf232a/workflow/primary-job.json#>
222+
prefix cwlprov <https://w3id.org/cwl/prov#>
223+
prefix wfprov <http://purl.org/wf4ever/wfprov#>
224+
prefix sha256 <nih:sha-256;>
225+
prefix schema <http://schema.org/>
226+
prefix wfdesc <http://purl.org/wf4ever/wfdesc#>
227+
prefix orcid <https://orcid.org/>
228+
prefix researchobject <arcp://uuid,0e6cb79e-fe70-4807-888c-3a61b9bf232a/>
229+
prefix id <urn:uuid:>
230+
prefix wf <arcp://uuid,0e6cb79e-fe70-4807-888c-3a61b9bf232a/workflow/packed.cwl#>
231+
prefix foaf <http://xmlns.com/foaf/0.1/>
232+
233+
Note that the `arcp <https://tools.ietf.org/id/draft-soilandreyes-arcp-03.html>`__ base URI will correspond to the UUID of each master workflow run.
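
Expanding a prefixed identifier amounts to replacing the prefix with its namespace URI. The sketch below illustrates this with a subset of the prefixes listed above; the function ``expand`` is a hypothetical helper, not part of cwltool.

```python
# Subset of the prefix declarations listed above.
PREFIXES = {
    "data": "urn:hash::sha1:",
    "id": "urn:uuid:",
    "orcid": "https://orcid.org/",
    "wf": "arcp://uuid,0e6cb79e-fe70-4807-888c-3a61b9bf232a/workflow/packed.cwl#",
}


def expand(qualified_name, prefixes=PREFIXES):
    """Expand a prefixed name such as wf:main/step0 to a full URI."""
    prefix, separator, local = qualified_name.partition(":")
    if not separator or prefix not in prefixes:
        return qualified_name  # unknown prefix: leave untouched
    return prefixes[prefix] + local


print(expand("wf:main/step0"))
print(expand("orcid:0000-0002-1825-0097"))
```

Comparing the expanded URIs rather than the prefixed names avoids mismatches when two serializations declare different prefixes for the same namespace.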
234+
235+
Account who launched cwltool
236+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
237+
238+
If ``--enable-user-provenance`` was used, the local machine account (e.g. Windows or UNIX user name) that started ``cwltool`` is tracked::
239+
240+
agent(id:855c6823-bbe7-48a5-be37-b0f07f20c495, [foaf:accountName="stain", prov:type='foaf:OnlineAccount', prov:label="stain"])
241+
242+
It is assumed that the account was under the control of the named person (in PROV terms "actedOnBehalfOf")::
243+
244+
agent(id:433df002-2584-462a-80b0-cf90b97e6e07, [prov:label="Stian Soiland-Reyes",
245+
prov:type='prov:Person', foaf:account='id:8815e39c-9711-4105-bf52-dbc016c8028f'])
246+
actedOnBehalfOf(id:8815e39c-9711-4105-bf52-dbc016c8028f, id:433df002-2584-462a-80b0-cf90b97e6e07, -)
247+
248+
However, we do not have an identifier for either the account or the person, so every ``cwltool`` run will yield new UUIDs.
249+
250+
With ``--enable-host-provenance`` it is possible to associate the account with a hostname::
251+
252+
agent(id:855c6823-bbe7-48a5-be37-b0f07f20c495, [cwlprov:hostname="biggie", prov:type='foaf:OnlineAccount', prov:location="biggie"])
253+
254+
Note that the hostname is often non-global or variable (e.g. on cloud instances or virtual machines),
255+
and thus may be unreliable when considering ``cwltool`` executions on multiple hosts.
256+
257+
If the ``--orcid`` parameter or ``ORCID`` shell variable is included, then the person associated
258+
with the local machine account is uniquely identified, no matter where the workflow was executed::
259+
260+
agent(orcid:0000-0002-1825-0097, [prov:type='prov:Person', prov:label="Stian Soiland-Reyes",
261+
foaf:account='id:855c6823-bbe7-48a5-be37-b0f07f20c495'])
262+
263+
actedOnBehalfOf(id:855c6823-bbe7-48a5-be37-b0f07f20c495, orcid:0000-0002-1825-0097, -)
264+
265+
The running of ``cwltool`` itself makes it the workflow engine. It is started by the machine account that launched ``cwltool`` (not necessarily the person behind it)::
266+
267+
agent(id:7c9d9e88-666b-4977-85f4-c02da08a942d, [prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine', prov:label="cwltool 1.0.20180416145054"])
268+
wasStartedBy(id:855c6823-bbe7-48a5-be37-b0f07f20c495, -, id:9c3d4d1f-473d-468f-a6f2-1ef4de571a7f, 2018-04-16T18:27:09.428090)
269+
270+
Starting a workflow
271+
^^^^^^^^^^^^^^^^^^^
272+
273+
The main job of the cwltool execution is to run a workflow, here the activity for ``workflow/packed.cwl#main``::
274+
275+
activity(id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.428165, -, [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])
276+
wasStartedBy(id:67f38794-d24a-435f-bd4a-0242a56a581b, -, id:7c9d9e88-666b-4977-85f4-c02da08a942d, 2018-04-16T18:27:09.428285)
277+
278+
Now what is that workflow again? Well, a tiny bit of prospective provenance is included::
279+
280+
entity(wf:main, [prov:type='prov:Plan', prov:type='wfdesc:Workflow', prov:label="Prospective provenance"])
281+
entity(wf:main, [prov:label="Prospective provenance", wfdesc:hasSubProcess='wf:main/step0'])
282+
entity(wf:main/step0, [prov:type='wfdesc:Process', prov:type='prov:Plan'])
283+
284+
But we can also expand the ``wf`` identifiers to find that we are talking about
285+
``arcp://uuid,0e6cb79e-fe70-4807-888c-3a61b9bf232a/workflow/packed.cwl#`` - that is
286+
the ``main`` workflow in the file ``workflow/packed.cwl`` of the Research Object.
287+
288+
Running workflow steps
289+
^^^^^^^^^^^^^^^^^^^^^^
290+
291+
A workflow contains steps; each execution of a step is again a nested activity::
292+
293+
activity(id:6c7c04ea-dcc8-40d2-92a4-7705f7286756, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main"])
294+
wasStartedBy(id:6c7c04ea-dcc8-40d2-92a4-7705f7286756, -, id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.430883)
295+
activity(id:a583b025-9a16-49ce-8515-f3249eb2aacf, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step0"])
296+
wasAssociatedWith(id:a583b025-9a16-49ce-8515-f3249eb2aacf, -, wf:main/step0)
297+
298+
Again we see the link back to the workflow plan, the workflow execution of ``#main/step0`` in this case.
299+
Note that, depending on scattering etc., there might
300+
be multiple activities for a single step in the workflow definition.
301+
302+
Data inputs (usage)
303+
^^^^^^^^^^^^^^^^^^^
304+
305+
This activity uses some data at the input ``message``::
306+
307+
activity(id:a583b025-9a16-49ce-8515-f3249eb2aacf, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step0"])
308+
used(id:a583b025-9a16-49ce-8515-f3249eb2aacf, data:53870991af88a6d678cbeed3255bb65993c52925, 2018-04-16T18:27:09.433743, [prov:role='wf:main/step0/message'])
309+
310+
Data files within a workflow execution are identified using ``urn:hash::sha1:`` URIs derived from their sha1 checksum (checksum algorithm and prefix subject to change)::
311+
312+
entity(data:53870991af88a6d678cbeed3255bb65993c52925, [prov:type='wfprov:Artifact', prov:value="Hei7"])
313+
314+
Small values (typically those provided on the command line) may be present as ``prov:value``. The corresponding
315+
``data/`` file within the Research Object has a content-addressable filename based on the checksum; but it is also
316+
possible to look this up independently, from the corresponding ``metadata/manifest.json`` aggregation::
317+
318+
"aggregates": [
319+
{
320+
"uri": "urn:hash::sha1:53870991af88a6d678cbeed3255bb65993c52925",
321+
"bundledAs": {
322+
"uri": "arcp://uuid,0e6cb79e-fe70-4807-888c-3a61b9bf232a/data/53/53870991af88a6d678cbeed3255bb65993c52925",
323+
"folder": "/data/53/",
324+
"filename": "53870991af88a6d678cbeed3255bb65993c52925"
325+
}
326+
},
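
The lookup from a data identifier to its file in the bag can be sketched as follows, assuming an ``aggregates`` list shaped like the excerpt above; the helper ``bundled_path`` is hypothetical.

```python
import posixpath

# One aggregate entry, copied from the manifest excerpt above.
AGGREGATES = [
    {
        "uri": "urn:hash::sha1:53870991af88a6d678cbeed3255bb65993c52925",
        "bundledAs": {
            "uri": "arcp://uuid,0e6cb79e-fe70-4807-888c-3a61b9bf232a"
                   "/data/53/53870991af88a6d678cbeed3255bb65993c52925",
            "folder": "/data/53/",
            "filename": "53870991af88a6d678cbeed3255bb65993c52925",
        },
    }
]


def bundled_path(aggregates, data_uri):
    """Return the bag-relative path of a data item, or None if not bundled."""
    for entry in aggregates:
        bundled = entry.get("bundledAs")
        if entry.get("uri") == data_uri and bundled:
            path = posixpath.join(bundled["folder"], bundled["filename"])
            return path.lstrip("/")  # make it relative to the bag root
    return None


print(bundled_path(
    AGGREGATES, "urn:hash::sha1:53870991af88a6d678cbeed3255bb65993c52925"
))
```

Going through the manifest rather than reconstructing the path from the checksum keeps a consumer working even if the folder layout or checksum algorithm changes.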
327+
328+
Data outputs (generation)
329+
^^^^^^^^^^^^^^^^^^^^^^^^^
330+
331+
Similarly a step typically generates some data, here ``response``::
332+
333+
activity(id:a583b025-9a16-49ce-8515-f3249eb2aacf, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step0"])
334+
wasGeneratedBy(data:53870991af88a6d678cbeed3255bb65993c52925, id:a583b025-9a16-49ce-8515-f3249eb2aacf, 2018-04-16T18:27:09.438236, [prov:role='wf:main/step0/response'])
335+
336+
In the hello world example this is interesting because it is the same data output as-is, but typically the outputs will each have different checksums (and thus different identifiers).
337+
338+
The step is ended::
339+
340+
wasEndedBy(id:a583b025-9a16-49ce-8515-f3249eb2aacf, -, id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.438482)
341+
342+
343+
In this case the step output is also a workflow output ``response``, so the data is also generated by the workflow activity::
344+
345+
activity(id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.428165, -, [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])
346+
wasGeneratedBy(data:53870991af88a6d678cbeed3255bb65993c52925, id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.439323, [prov:role='wf:main/response'])
347+
348+
Ending the workflow
349+
^^^^^^^^^^^^^^^^^^^
350+
351+
Finally the overall workflow ``#main`` also ends::
352+
353+
activity(id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.428165, -, [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])
354+
agent(id:7c9d9e88-666b-4977-85f4-c02da08a942d, [prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine', prov:label="cwltool 1.0.20180416145054"])
355+
wasEndedBy(id:67f38794-d24a-435f-bd4a-0242a56a581b, -, id:7c9d9e88-666b-4977-85f4-c02da08a942d, 2018-04-16T18:27:09.445785)
356+
357+
Note that the end of the outer ``cwltool`` activity is not recorded, as cwltool is still running at the point of writing out this provenance.
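
The start and end statements of the workflow run can be combined to estimate how long it took. The sketch below uses a deliberately naive regular expression that only handles the statement shapes shown in this document; it is not a general PROV-N parser, and ``run_duration`` is a hypothetical helper.

```python
import re
from datetime import datetime

# The two statements about the workflow run, copied from the extracts above.
PROVN = """\
wasStartedBy(id:67f38794-d24a-435f-bd4a-0242a56a581b, -, id:7c9d9e88-666b-4977-85f4-c02da08a942d, 2018-04-16T18:27:09.428285)
wasEndedBy(id:67f38794-d24a-435f-bd4a-0242a56a581b, -, id:7c9d9e88-666b-4977-85f4-c02da08a942d, 2018-04-16T18:27:09.445785)
"""

# relation(activity, -, agent-or-activity, timestamp)
STATEMENT = re.compile(
    r"(wasStartedBy|wasEndedBy)\(([^,]+), -, ([^,]+), ([^)\s]+)\)"
)


def run_duration(provn, activity):
    """Return end minus start for the given activity as a timedelta."""
    times = {}
    for relation, subject, _other, timestamp in STATEMENT.findall(provn):
        if subject == activity:
            times[relation] = datetime.fromisoformat(timestamp)
    return times["wasEndedBy"] - times["wasStartedBy"]


duration = run_duration(PROVN, "id:67f38794-d24a-435f-bd4a-0242a56a581b")
print(duration.total_seconds())
```

The same function could be applied to a step activity such as ``id:a583b025-9a16-49ce-8515-f3249eb2aacf``, provided both its start and end statements are present in the trace.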
358+
359+
Currently the provenance trace does not distinguish executions within nested workflows; it is planned that these will be tracked in separate files under ``metadata/provenance/``.
360+
361+
