yprov4dask

A plugin that enables provenance tracking in Dask

Environment setup

Download the repo from https://github.com/HPCI-Lab/yprov4dask and install the plugin in your Python environment by running pip install . from within the root folder of the repository.

Note for developers

Install the package in development mode using pip install -e . if you want changes to the code to be reflected immediately in your environment. Without the -e option, you will have to reinstall the plugin every time you modify it.

Usage

To use the plugin, simply import it into your code, instantiate the plugin and register it with your Dask client.

Example

from dask.distributed import Client
from prov_tracking import ProvTracker

# Can be avoided if working with Jupyter notebooks
if __name__ == '__main__':
    # Create your Dask client with any client options you want
    client = Client()
    # The plugin creation is as simple as this,
    # no parameter is required
    plugin = ProvTracker()
    # The plugin requires a reference to the scheduler,
    # but you must register it with the client before providing it
    client.register_plugin(plugin)
    plugin.start(client.scheduler)

    # Your analysis...

    # When the client is closed the provenance document is
    # automatically generated. If you define your own custom cluster,
    # the document is registered when the cluster is closed.
    client.close()

Note

The plugin can only track what goes through the Dask scheduler, so if your computations are not translated into Dask tasks, you won't see anything in your provenance document. For example, if you open a dataset with xarray and you want to track its provenance, always make sure that it is using a Dask array under the hood. If you use xr.open_dataset, you can ensure that by providing some value for the chunks argument. Even chunks={} is fine, even though that may produce a really inefficient chunk arrangement.
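A quick way to check whether an object will produce Dask tasks at all is dask.is_dask_collection. The sketch below is only an illustration: the helper name is_dask_backed is hypothetical and not part of the plugin, and it uses plain NumPy and Dask arrays instead of a real dataset.

```python
import dask
import dask.array as da
import numpy as np


def is_dask_backed(obj):
    """Return True if obj is a Dask collection, i.e. it carries a task graph.

    Hypothetical helper for illustration; not part of yprov4dask.
    """
    return dask.is_dask_collection(obj)


# A plain NumPy array is computed eagerly and never reaches the scheduler.
eager = np.ones((4, 4))

# A chunked Dask array is lazy: computing it submits tasks to the scheduler.
lazy = da.ones((4, 4), chunks=(2, 2))

print(is_dask_backed(eager))
print(is_dask_backed(lazy))
```

The same check should also work on xarray objects, since a dataset opened with a chunks argument is itself a Dask collection.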

Additional options

Upon plugin initialization you can provide the following options:

  • destination: str: folder in which the provenance document is saved. The file is always named yprov4wfs.json. Defaults to ./output.
  • keep_traceback: bool: tells if the plugin should register the traceback of the exceptions generated by failed tasks. Defaults to False.
  • rich_types: bool: tells if the datatypes of values should be richer, e.g. for tuples, track the type of each element instead of just saying that the value is a tuple. Defaults to False.
  • jupyter_tracking: bool: tells if the plugin should try to record in the provenance document which cell of the notebook generated each activity. Defaults to True. Note that this option creates an additional thread that communicates with the Jupyter kernel.

You can also provide all kwargs accepted by prov.model.ProvDocument.serialize. For instance, indent, if given an integer value, produces a more human-readable document with lines indented by that many spaces.
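As a minimal sketch of how these options might be combined, the snippet below collects them in a dictionary and passes them to ProvTracker as keyword arguments. The option values are illustrative, and the helpers provenance_path and build_tracker are hypothetical names introduced here, not part of the plugin.

```python
from pathlib import Path

# Options listed above; the values are illustrative only.
tracker_options = {
    "destination": "./provenance",  # folder for the provenance document
    "keep_traceback": True,         # record tracebacks of failed tasks
    "rich_types": True,             # track element types, e.g. inside tuples
    "jupyter_tracking": False,      # avoid the extra Jupyter kernel thread
    "indent": 2,                    # forwarded to ProvDocument.serialize
}


def provenance_path(options):
    """Return the expected location of the provenance document.

    The file name is fixed to yprov4wfs.json; the folder falls back to the
    documented default when no destination is given.
    """
    return Path(options.get("destination", "./output")) / "yprov4wfs.json"


def build_tracker(options):
    """Create the plugin with the given options (hypothetical helper)."""
    # Imported here so that provenance_path can be used without the plugin.
    from prov_tracking import ProvTracker

    return ProvTracker(**options)


print(provenance_path(tracker_options))
```

The tracker returned by build_tracker(tracker_options) would then be registered and started as in the example above.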
