yprov4dask

A plugin that enables provenance tracking in Dask

Environment setup

Download the repo from https://github.com/HPCI-Lab/yprov4dask and install the plugin in your Python environment by running pip install . from within the root folder of the repository.

Note for developers

Install the package in development mode using pip install -e . if you want changes to the code to be reflected immediately in your environment. Without the -e option, you have to reinstall the plugin every time you modify it.

Usage

To use the plugin, import it into your code, instantiate it, register it with your Dask client, and then start it with a reference to the client's scheduler.

Example

from dask.distributed import Client
from prov_tracking import ProvTracker

# The __main__ guard can be omitted when working in Jupyter notebooks
if __name__ == '__main__':
    # Create your Dask client with any client options you want
    client = Client()
    # The plugin creation is as simple as this,
    # no parameter is required
    plugin = ProvTracker()
    # The plugin requires a reference to the scheduler,
    # but you must register it with the client before providing it
    client.register_plugin(plugin)
    plugin.start(client.scheduler)

    # Your analysis...

    # When the client is closed, the provenance document is
    # generated automatically. If you define your own custom cluster,
    # the document is generated when the cluster is closed.
    client.close()

Note

The plugin can only track what goes through the Dask scheduler, so if your computations are not translated into Dask tasks, nothing will appear in your provenance document. For example, if you open a dataset with xarray and want to track its provenance, always make sure that it is backed by a Dask array under the hood. If you use xr.open_dataset, you can ensure that by providing a value for the chunks argument. Even chunks={} is fine, even though it may produce a very inefficient chunk arrangement.
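A quick way to confirm that an xarray object will actually produce Dask tasks is to ask Dask whether it recognizes the object as a collection. The snippet below is a minimal sketch, assuming xarray, NumPy and Dask are installed. The helper is_dask_backed is a hypothetical name introduced here, not part of the plugin, and the dataset is built in memory purely for illustration, with .chunk() standing in for the chunks argument of xr.open_dataset.

```python
import dask
import numpy as np
import xarray as xr


def is_dask_backed(obj) -> bool:
    """Return True if obj holds at least one Dask-backed array.

    Hypothetical helper for illustration; it only wraps
    dask.is_dask_collection.
    """
    return dask.is_dask_collection(obj)


# A small in-memory dataset backed by plain NumPy arrays: operations on it
# run eagerly and never reach the Dask scheduler.
eager = xr.Dataset({"temperature": (("time",), np.arange(10.0))})

# Chunking converts the variables to Dask arrays. Passing a chunks value to
# xr.open_dataset has the same effect for datasets read from disk.
lazy = eager.chunk({"time": 5})

print(is_dask_backed(eager))
print(is_dask_backed(lazy))
```

The first check should report False and the second True. Running the same check on a dataset right after opening it is a cheap way to catch untracked computations before starting a long analysis.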

Additional options

Upon plugin initialization you can provide the following options:

destination: str: folder in which the provenance document is saved. The file is always named yprov4wfs.json. Defaults to ./output.
keep_traceback: bool: whether the plugin should record the traceback of the exceptions raised by failed tasks. Defaults to False.
rich_types: bool: whether the datatypes of values should be richer, e.g. for tuples, track the type of each element instead of just saying that the value is a tuple. Defaults to False.
jupyter_tracking: bool: whether the plugin should try to record in the provenance document which notebook cell generated each activity. Defaults to True. Note that this option creates an additional thread that communicates with the Jupyter kernel.

You can also provide all kwargs accepted by prov.model.ProvDocument.serialize. For instance, passing an integer value for indent produces a more human-readable document, with lines indented according to that value.
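Since the output file name is fixed and only the folder changes with destination, locating and reading the document after a run can be sketched as follows. This is an illustration under stated assumptions: provenance_path and load_provenance are hypothetical helpers, not part of the plugin, and the loader assumes the document was written with the default JSON serialization.

```python
import json
from pathlib import Path


def provenance_path(destination="./output"):
    """Build the path of the provenance document for a destination folder.

    Hypothetical helper: the file name and the default folder are the ones
    stated in the option list above.
    """
    return Path(destination) / "yprov4wfs.json"


def load_provenance(destination="./output"):
    """Load the provenance document, or return None if it does not exist yet.

    Assumes the document was serialized as JSON.
    """
    path = provenance_path(destination)
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, after a run with ProvTracker(destination='./prov', indent=2), calling load_provenance('./prov') would return the parsed document once the client has been closed.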
