A plugin that enables provenance tracking in Dask
Download the repo from https://github.com/HPCI-Lab/yprov4dask and install the plugin in your Python environment by running `pip install .` from within the root folder of the repository.
Install the package in development mode using `pip install -e .` if you want modifications to the code to be reflected immediately in your environment. Without the `-e` option, you will have to reinstall the plugin every time you modify it.
To use the plugin, simply import it into your code, instantiate the plugin and register it with your Dask client.
from dask.distributed import Client
from prov_tracking import ProvTracker

# This guard can be omitted when working in Jupyter notebooks
if __name__ == '__main__':
    # Create your Dask client with any client options you want
    client = Client()

    # The plugin creation is as simple as this,
    # no parameter is required
    plugin = ProvTracker()

    # The plugin requires a reference to the scheduler,
    # but you must register it with the client before providing it
    client.register_plugin(plugin)
    plugin.start(client.scheduler)

    # Your analysis...

    # When the client is closed, the provenance document is
    # automatically generated. If you define your own custom cluster,
    # the document is generated when the cluster is closed.
    client.close()
The plugin can only track what goes through the Dask scheduler, so if your computations are not translated into Dask tasks, you won't see anything in your provenance document. For example, if you open a dataset with xarray and you want to track its provenance, always make sure that it is using a `DaskArray` under the hood. If you use `xr.open_dataset`, you can ensure that by providing some value for the `chunks` argument. Even `chunks={}` is fine, even though that may produce a really inefficient arrangement.
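One way to check this before running an analysis is to look at the `chunks` attribute of each variable, which xarray reports as `None` when the data is not backed by a Dask array. The following is a minimal sketch under that assumption. The helper names `is_dask_backed` and `untracked_variables` are hypothetical and not part of the plugin, and xarray is imported lazily so that the first helper has no dependencies.

```python
def is_dask_backed(variable) -> bool:
    """Return True if an xarray variable is chunked, i.e. stored as a Dask array.

    xarray reports `chunks` as None when the data is a plain in-memory array.
    """
    return variable.chunks is not None


def untracked_variables(path, chunks=None):
    """Open a dataset and list the data variables that would bypass Dask.

    `chunks` falls back to {} (the minimal choice); pass explicit chunk
    sizes for a more efficient layout.
    """
    import xarray as xr  # imported here so is_dask_backed stays standalone

    with xr.open_dataset(path, chunks={} if chunks is None else chunks) as dataset:
        return [
            name
            for name, var in dataset.data_vars.items()
            if not is_dask_backed(var)
        ]
```

An empty list from `untracked_variables` suggests that every data variable is chunked, so operations on them should show up as Dask tasks.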
Upon plugin initialization you can provide the following options:
- `destination: str`: folder in which the provenance document is saved. The file is always named `yprov4wfs.json`. Defaults to `./output`.
- `keep_traceback: bool`: whether the plugin should record the traceback of the exceptions raised by failed tasks. Defaults to `False`.
- `rich_types: bool`: whether the datatypes of values should be richer, e.g. for tuples, track the type of each element instead of just saying that the value is a tuple. Defaults to `False`.
- `jupyter_tracking: bool`: whether the plugin should try to record in the provenance document which cell of the notebook generated each activity. Defaults to `True`. Note that this option creates an additional thread that communicates with the Jupyter kernel.
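As an illustration of these options, the sketch below collects them in a plain dictionary and computes where the provenance document is expected to end up. The helper names `tracker_options`, `provenance_path` and `make_tracker` are hypothetical. Only the option names, their meaning and the `yprov4wfs.json` file name come from the description above.

```python
from pathlib import Path


def tracker_options(destination="./output", in_notebook=False):
    """Build keyword arguments for ProvTracker from the documented options."""
    return {
        "destination": destination,
        "keep_traceback": True,           # also record tracebacks of failed tasks
        "rich_types": True,               # track element types, e.g. inside tuples
        "jupyter_tracking": in_notebook,  # only enable cell tracking inside Jupyter
    }


def provenance_path(destination="./output"):
    """Return the expected location of the provenance document."""
    return Path(destination) / "yprov4wfs.json"


def make_tracker(**overrides):
    """Instantiate the plugin; requires yprov4dask to be installed."""
    from prov_tracking import ProvTracker  # lazy import keeps the helpers standalone

    return ProvTracker(**{**tracker_options(), **overrides})
```

Keeping the options in one place makes it easy to reuse the same `destination` both when creating the plugin and when locating the generated file afterwards.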
You can also provide all kwargs accepted by `prov.model.ProvDocument.serialize`. For instance, `indent`, if given an integer value, produces a more human-readable document with lines indented according to that value.
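To give a sense of what `indent` changes, the sketch below uses Python's standard `json` module on a small made-up record. It assumes that the provenance document is written as JSON and that `indent` ends up being forwarded to the `json` module, which is not confirmed here. The `record` content is purely illustrative and is not the plugin's actual output.

```python
import json

# Made-up record standing in for a provenance document (illustrative only)
record = {"activity": {"ex:task-1": {"ex:status": "finished"}}}

compact = json.dumps(record)           # no indent: everything on one line
pretty = json.dumps(record, indent=2)  # indent=2: nested keys on their own lines

print(compact)
print(pretty)
```

Both strings decode to the same data; only the layout differs. With the plugin, the equivalent would be passing `indent=2` when creating `ProvTracker`.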