Commit 23428e0

Blog post: On provenance
1 parent 460638b commit 23428e0

6 files changed: 231 additions, 0 deletions

@@ -0,0 +1,231 @@
---
blogpost: true
category: Blog
tags: provenance
author: Sebastiaan Huber
date: 2023-11-01
---

# On provenance
One of the defining characteristics of AiiDA is its focus on provenance.
It aims to provide its users with the necessary tools to preserve the provenance of data that is produced by automated workflows.
As a tool, AiiDA can neither enforce nor guarantee "perfect" provenance; it simply encourages and enables its users to keep provenance as complete as possible and as detailed as necessary.
What this means exactly is not so much a technical question as a philosophical one, and the answer will always be use-case specific.
In this blog post, I will discuss ideas concerning provenance that are hopefully useful to users of AiiDA when they are designing their workflows.
In the following, I assume that the reader is already familiar with the basic concept of provenance as it is implemented in AiiDA and explained [in the documentation](https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/provenance/concepts.html).

As mentioned in the introduction, the provenance that is kept by AiiDA is not "perfect".
That is to say, given a provenance graph created by AiiDA, it is not yet possible to fully reproduce the results in an automated way.
Although all the inputs to the various processes are captured, the computing environments in which those processes took place are not fully captured.
AiiDA has the concepts of `Code`s and `Computer`s, which represent the code that processed the input and the compute environment in which it ran, but these are mere *symbolic* references.
The recently added support for container technologies ([added in v2.1](https://github.com/aiidateam/aiida-core/blob/main/CHANGELOG.md#support-for-running-code-in-containers)) is a big step towards making perfect provenance possible, but for the time being, we will have to be content with a limited version.

It is important to note, though, that not having perfect provenance is not the end of the world.
Having any kind of provenance is often better than having no provenance at all.
We should be wary not to fall victim to a myopic provenance purism, and remind ourselves that tracking provenance is not a goal in and of itself.
Rather, it is a solution to a particular problem: making computational results reproducible.

Imagine that we have some data: for the purpose of this example, we will take a simple dictionary.
This dictionary can be stored in AiiDA's provenance graph by wrapping it in a `Dict` node, after which we create a second, modified `Dict` from its content:
```python
from aiida import orm

# Store the original dictionary as a node in the provenance graph.
dict1 = orm.Dict({'key': 'value'}).store()

# Create a modified copy outside of any process and store it as well.
dict2 = orm.Dict(dict1.get_dict())
dict2['key'] = 'other_value'
dict2.store()
dict2.get_dict()
# {'key': 'other_value'}
```
We can now generate the provenance of `dict2` using the following `verdi` command:
```console
verdi node graph generate <PK>
```
which generates the following image:

![image_01](../pics/2023-11-20-on-provenance/image_01.png)

The updated dictionary appears isolated in the provenance graph: the fact that it was created by modifying another dictionary was not captured.
The simplest way to capture the modification of data is by wrapping it in a [`calcfunction`](https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/calculations/concepts.html#calculation-functions):
```python
from aiida import engine, orm


@engine.calcfunction
def update_dict(dictionary):
    # Work on a plain Python copy and return the result as a new ``Dict`` node.
    updated = dictionary.get_dict()
    updated['key'] = 'other_value'
    return orm.Dict(updated)


dict1 = orm.Dict({'key': 'value'})
dict2 = update_dict(dict1)
dict2.get_dict()
# {'key': 'other_value'}
```
If we recreate the provenance graph for `dict2` in this example, we get something like the following:

![image_02](../pics/2023-11-20-on-provenance/image_02.png)

The modification of the original `Dict` into another `Dict` has now been captured and is represented by the `update_dict` node in the provenance graph.
Since the source code of the `update_dict` function is stored together with the original input, the output `Dict` can now be reproduced.
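
To make the difference between the two snippets more tangible, here is a toy sketch in plain Python, deliberately without AiiDA. The `tracked` decorator and the `PROVENANCE` list are hypothetical names invented for this illustration, and plain dictionaries stand in for `Dict` nodes. This is not how AiiDA implements `calcfunction`; it only mimics the idea that a wrapped function call leaves a record linking its inputs to its output, whereas a modification made "by hand" leaves none.

```python
import functools

# Toy in-memory "provenance graph": one record per tracked function call.
# Hypothetical stand-in for illustration only, not AiiDA's actual storage.
PROVENANCE = []


def tracked(func):
    """Record every call of ``func`` together with its inputs and its output."""

    @functools.wraps(func)
    def wrapper(**inputs):
        output = func(**inputs)
        PROVENANCE.append({'function': func.__name__, 'inputs': dict(inputs), 'output': output})
        return output

    return wrapper


@tracked
def update_dict(dictionary):
    updated = dict(dictionary)  # copy, so the input is left untouched
    updated['key'] = 'other_value'
    return updated


dict1 = {'key': 'value'}

# Untracked modification: nothing links ``dict2`` back to ``dict1``.
dict2 = dict(dict1)
dict2['key'] = 'other_value'

# Tracked modification: the call is recorded with its input and its output.
dict3 = update_dict(dictionary=dict1)
```

After running this, `dict2` and `dict3` have identical content, but only `dict3` can be traced back to `dict1` through the single record in `PROVENANCE`.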

Of course, output nodes can themselves be modified and in turn become inputs to other calculations, and in real scenarios they very often will:
```python
from aiida import engine, orm


@engine.calcfunction
def add_random_key(dictionary):
    import secrets

    # Add a random four-character hex token as both key and value.
    token = secrets.token_hex(2)
    updated = dictionary.get_dict()
    updated[token] = token
    return orm.Dict(updated)


dict1 = orm.Dict()
dict2 = add_random_key(dict1)
dict3 = add_random_key(dict2)
dict3.get_dict()
# For example: {'14b4': '14b4', '38f8': '38f8'}
```

![image_03](../pics/2023-11-20-on-provenance/image_03.png)

From this it follows that, as a general rule of thumb, tracking the provenance of _inputs_ is just as important as that of outputs.
However, in practice, there are pragmatic justifications for making an exception to this rule.
To demonstrate this, we need to consider a more complex example that more closely resembles real-world use cases.
For the following example, we imagine a workflow, implemented by a `WorkChain`, that wraps a subprocess.
The exact nature of the subprocess is irrelevant, so as a stand-in we take a very straightforward `calcfunction` that simply returns the content of its input dictionary.
The workflow exposes the inputs of the subprocess, but _also_ adds a particular parameter as an explicit input.
This is typically done to make the workflow easier to use: users don't have to know exactly where in the subprocess's input namespace the parameter is supposed to go.

```{note}
For a concrete example, see the [`only_initialization` input](https://github.com/aiidateam/aiida-quantumespresso/blob/74bbaa22b383b3323fcc3d41ad5b82fa89895c92/src/aiida_quantumespresso/workflows/ph/base.py#L35) of the `PhBaseWorkChain` of the `aiida-quantumespresso` plugin.
```

```python
from aiida import engine, orm


@engine.calcfunction
def some_subprocess(parameters: orm.Dict):
    """Example subprocess that simply returns a dict with the same content as the ``parameters`` input."""
    return orm.Dict(parameters.get_dict())


class SomeWorkChain(engine.WorkChain):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.expose_inputs(some_subprocess, namespace='sub')
        spec.input('some_parameter', valid_type=orm.Str, serializer=orm.to_aiida_type)
        spec.outline(cls.run_subprocess)
        spec.output('parameters')

    def run_subprocess(self):
        inputs = self.exposed_inputs(some_subprocess, 'sub')
        # The parameters are modified in plain Python inside the work chain step,
        # so this modification does not show up as a node in the provenance graph.
        parameters = inputs.parameters.get_dict()
        parameters['some_parameter'] = self.inputs.some_parameter.value
        inputs['parameters'] = parameters
        result = some_subprocess(**inputs)
        self.out('parameters', result)


inputs = {
    'sub': {
        'parameters': {
            'some_parameter': 'value',
        }
    },
    'some_parameter': 'other_value'
}
results, node = engine.run.get_node(SomeWorkChain, **inputs)
results['parameters'].get_dict()
# {'some_parameter': 'other_value'}
```
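
![image_04](../pics/2023-11-20-on-provenance/image_04.png)
***Fig. 4**: The provenance graph generated by running `SomeWorkChain` where the modification of the input `parameters` is done in a work chain step and so is not explicitly captured.*

To capture the modification explicitly, it can be moved into a `calcfunction`, here called `prepare_parameters` and defined as a static method of the work chain: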
```python
from aiida import engine, orm


@engine.calcfunction
def some_subprocess(parameters: orm.Dict):
    """Example subprocess that returns a dict with the same content as the ``parameters`` input."""
    return orm.Dict(parameters.get_dict())


class SomeWorkChain(engine.WorkChain):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.expose_inputs(some_subprocess, namespace='sub')
        spec.input('some_parameter', valid_type=orm.Str, serializer=orm.to_aiida_type)
        spec.outline(cls.run_subprocess)
        spec.output('parameters')

    @staticmethod
    @engine.calcfunction
    def prepare_parameters(parameters, some_parameter):
        # The modification now happens inside a calcfunction and is therefore recorded.
        parameters = parameters.get_dict()
        parameters['some_parameter'] = some_parameter.value
        return orm.Dict(parameters)

    def run_subprocess(self):
        inputs = self.exposed_inputs(some_subprocess, 'sub')
        inputs['parameters'] = self.prepare_parameters(
            inputs.pop('parameters'),
            self.inputs.some_parameter
        )
        result = some_subprocess(**inputs)
        self.out('parameters', result)


inputs = {
    'sub': {
        'parameters': {
            'some_parameter': 'value',
        }
    },
    'some_parameter': 'other_value'
}
results, node = engine.run.get_node(SomeWorkChain, **inputs)
results['parameters'].get_dict()
```

```{note}
Support for adding [process functions as class member functions](https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/processes/functions.html#as-class-member-methods) was added in AiiDA v2.3.
```

![image_05](../pics/2023-11-20-on-provenance/image_05.png)
***Fig. 5**: The provenance graph generated by running `SomeWorkChain` where the modification of the input `parameters` is explicitly captured through the `prepare_parameters` function.*

Now the modification of the original input parameters `8440037c` is properly captured by the `prepare_parameters` function.
But this approach has a downside: the `calcfunction` that was introduced _hard-codes_ the key of the parameter that needs to be updated.
What if that key needs to be different?

We could simply write another function that updates another key.
However, this runs the risk of ending up with a whole slew of trivial `calcfunction`s that make the code harder to read and that have to be maintained.
An alternative would be to try to make the `calcfunction` more flexible, for example by allowing the key to be passed as an input to the function:
```python
from aiida import engine, orm


@engine.calcfunction
def prepare_parameters(parameters, key, some_parameter):
    # ``key`` is passed as a node (e.g. ``orm.Str``), so its content is read through ``.value``.
    parameters = parameters.get_dict()
    parameters[key.value] = some_parameter.value
    return orm.Dict(parameters)
```
But what if `parameters` is a nested dictionary and the key to be replaced sits in a sub-namespace?
Or what if the value to replace is _itself_ a dictionary, and it needs to either completely replace the existing value or be recursively merged into it?
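
To make these two questions concrete, here is a minimal sketch in plain Python, again without AiiDA. The helper names `set_nested` and `merge_nested` and the example keys are hypothetical and only serve as illustration. The first helper replaces the value at a nested path, the second merges a dictionary recursively into the existing one. Both behaviours are plausible, which is exactly why a single generic `calcfunction` is hard to design.

```python
import copy


def set_nested(parameters, path, value):
    """Return a copy of ``parameters`` with the key at ``path`` (a list of keys) set to ``value``."""
    updated = copy.deepcopy(parameters)
    target = updated
    for key in path[:-1]:
        # Create intermediate namespaces that do not exist yet.
        target = target.setdefault(key, {})
    target[path[-1]] = value
    return updated


def merge_nested(parameters, overrides):
    """Return a copy of ``parameters`` with ``overrides`` recursively merged into it."""
    merged = copy.deepcopy(parameters)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: descend instead of overwriting.
            merged[key] = merge_nested(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


parameters = {'settings': {'a': 1}, 'options': {'b': 2}}

# Completely replace the ``options`` namespace ...
replaced = set_nested(parameters, ['options'], {'c': 3})
# ... or merge the new content into it.
merged = merge_nested(parameters, {'options': {'c': 3}})
```

Here `replaced['options']` only contains `c`, whereas `merged['options']` contains both `b` and `c`. Wrapping either helper in a `calcfunction` would mean also passing the path, or the choice between replacing and merging, as additional input nodes.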

We quickly see that it is not trivial to come up with a straightforward solution.
And this brings us back to the original question: why are we even doing this?
We are trying to wrap the input parameter modification in a `calcfunction` because _in general_ it is a good thing to keep provenance.
But we should remind ourselves that, as stated in the introduction, tracking provenance is a _means_ to an end, and not the goal itself.
The real goal is to capture all the necessary information to make it possible to retrace how data came into existence.

When we take a step back and look at the provenance graph of our example `SomeWorkChain`, the real output of interest is the final `Dict(297882e1)` output node.
The input parameters `Dict(717a7b67)` are also important, as they influence the output; however, exactly how that node came into existence does not really matter.
If `some_subprocess` had been called directly, without a wrapping workflow, the input `parameters` would have been defined by the user and would _also_ appear out of thin air.

That being said, even in the case where the `parameters` are updated without an explicit `prepare_parameters` calcfunction, their origin is still captured _indirectly_ through the provenance of the `WorkChain`.

The situation can be summarized as follows:

* It is _possible_ to perfectly capture the modification of the input `parameters`.
* However, depending on the exact use case, this can require many (complex) `calcfunction`s, which hurts the readability and maintainability of the code.
* Explicitly capturing the modification of the parameters does not provide any additional valuable information for understanding the origin of the final output of interest.

The conclusion in this particular example, then, is that the non-negligible cost of explicitly tracking the workflow's modification of the parameters is not worth the marginal gain of information, if there even is any.
