
Commit 88a05bb (parent e6c488f)

filled out explanation sections with subheadings

File tree: 8 files changed, +447 −281 lines
Lines changed: 34 additions & 1 deletion
Dynamic construction
====================

Pydra workflows are constructed dynamically by workflow "constructor" functions. These
functions can use any valid Python code, allowing rich and complex workflows to be
constructed based on the inputs to the workflow. For example, a workflow constructor
could include conditional branches, loops, or other control-flow structures to tailor
the workflow to the specific inputs provided.
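As a purely illustrative sketch (plain Python, not the actual Pydra API; the step names are made up), a constructor function might branch on its inputs like this:

```python
# Hypothetical sketch (not the actual Pydra API): a constructor function that
# builds a different sequence of steps depending on its inputs.
def build_workflow(image_type: str) -> list:
    steps = ["load"]
    if image_type == "dicom":      # conditional branch at construction time
        steps.append("convert_to_nifti")
    for _ in range(2):             # loops can add repeated stages
        steps.append("denoise")
    steps.append("save")
    return steps

print(build_workflow("dicom"))
# → ['load', 'convert_to_nifti', 'denoise', 'denoise', 'save']
```

The same constructor produces a shorter graph for other inputs, e.g. ``build_workflow("nifti")`` omits the conversion step.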


Lazy fields
-----------

Pydra workflows are constructed by the assignment of "lazy field" placeholders from
the outputs of upstream nodes to the inputs of downstream nodes. These placeholders,
which are instances of the :class:`pydra.engine.specs.LazyField` class, are replaced
by the actual values they represent when the workflow is run.
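The underlying idea can be sketched in plain Python (a toy model; the names ``Lazy``, ``graph``, and ``run`` are hypothetical and not part of Pydra):

```python
class Lazy:
    """Toy placeholder for the output of an upstream node."""
    def __init__(self, node: str):
        self.node = node

# node name -> (function, inputs); Lazy values stand in for upstream outputs
graph = {
    "sum": (lambda a, b: a + b, {"a": 1, "b": 2}),
    "dbl": (lambda x: 2 * x, {"x": Lazy("sum")}),
}

def run(graph):
    """Replace each Lazy placeholder with the real value once it is computed."""
    results = {}
    for name, (fn, inputs) in graph.items():  # assumes topological order
        kwargs = {k: results[v.node] if isinstance(v, Lazy) else v
                  for k, v in inputs.items()}
        results[name] = fn(**kwargs)
    return results

print(run(graph))  # → {'sum': 3, 'dbl': 6}
```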


Caching of workflow construction
--------------------------------

Workflows are constructed just before they are executed to produce a Directed Acyclic Graph
(DAG) of nodes. Tasks are generated from these nodes as upstream inputs become available
and added to the execution stack. If the workflow has been split, either at the top level,
in an upstream node or at the current node, then a separate task will be generated for
each element of the split.
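A toy sketch of task generation over a split (hypothetical helper, not Pydra's internals):

```python
def generate_tasks(inputs: dict, split_over: str) -> list:
    """Toy sketch: generate one task per element of the field the node is
    split over, holding all other inputs fixed."""
    return [dict(inputs, **{split_over: value}) for value in inputs[split_over]]

tasks = generate_tasks({"image": ["a.nii", "b.nii"], "sigma": 2.0}, "image")
print(tasks)
# → [{'image': 'a.nii', 'sigma': 2.0}, {'image': 'b.nii', 'sigma': 2.0}]
```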


Nested workflows and lazy conditionals
--------------------------------------

Since lazy fields are only evaluated at runtime, they can't be used in conditional
statements that construct the workflow. However, if a section of a workflow
needs to be conditionally included or excluded based on upstream outputs, that
section can be implemented in a nested workflow, with the upstream output connected
to the nested workflow's inputs.
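Why lazy fields cannot drive constructor-time conditionals can be illustrated with a toy placeholder whose truth value is simply undefined until runtime (illustrative only; this is not Pydra's ``LazyField`` implementation):

```python
class LazyField:
    """Toy placeholder: its value, and hence its truthiness, is unknown
    until the workflow actually runs."""
    def __init__(self, name: str):
        self.name = name

    def __bool__(self):
        raise TypeError(
            f"{self.name!r} is a lazy placeholder; put the conditional logic "
            "inside a nested workflow instead of branching on the placeholder"
        )

threshold_ok = LazyField("threshold_ok")
try:
    if threshold_ok:  # constructor-time branching on a lazy value fails
        pass
except TypeError as err:
    print(err)
```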
Lines changed: 31 additions & 2 deletions
Software environments
=====================

Pydra supports running tasks within encapsulated software environments, such as Docker_
and Singularity_ containers. This can be specified at runtime or during workflow
construction, and allows tasks to be run in environments that are isolated from the
host system and that have specific software dependencies.

The environment a task runs within is specified by the ``environment`` argument passed
to the execution call (e.g. ``my_task(plugin="cf", environment="docker")``) or in the
``workflow.add()`` call in workflow constructors.
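Conceptually, an environment wrapper rewrites a task's command line so that it executes inside the container with the task working directory mounted. A hedged sketch of the idea (not Pydra's actual implementation; the image and tool names are made up):

```python
def wrap_in_docker(cmd: list, image: str, work_dir: str) -> list:
    """Toy sketch: run a task's command inside a Docker container, bind-mounting
    the task working directory so inputs and outputs are visible on both sides."""
    return [
        "docker", "run", "--rm",
        "-v", f"{work_dir}:{work_dir}",  # expose the working directory
        "-w", work_dir,                  # and start in it
        image,
        *cmd,
    ]

print(wrap_in_docker(["my-tool", "--in", "data.txt"],
                     "example/image:latest", "/tmp/task-0"))
```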

Specifying at execution
-----------------------

Work in progress...


Specifying at workflow construction
-----------------------------------

Work in progress...


Implementing new environment types
----------------------------------

Work in progress...


.. _Docker: https://www.docker.com/
.. _Singularity: https://sylabs.io/singularity/
Lines changed: 59 additions & 2 deletions
Caches and hashes
=================

In Pydra, each task is run within its own working directory. If a task completes
successfully, its outputs are stored within this working directory. Working directories
are created within a cache directory, which is specified when the task is executed, and
are named according to the hash of the task's inputs. This means that if the same task is
executed with the same inputs, the same working directory will be used and, instead of
the task being rerun, the outputs from the previous run will be reused.

In this manner, incomplete workflows can be resumed from where they left off, and completed
workflows can be rerun without having to rerun all of their tasks. This is particularly useful
when working with datasets that are to be analysed in several different ways with
common intermediate steps, or when debugging workflows that have failed part way through.
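The scheme can be sketched in a few lines of plain Python (a toy model, not Pydra's implementation; ``run_cached`` and the JSON result file are hypothetical):

```python
import hashlib
import json
from pathlib import Path

def run_cached(task, inputs: dict, cache_root: str):
    """Toy sketch of Pydra-style caching: the working directory is named by a
    hash of the inputs; if a result already exists there, it is reused."""
    key = hashlib.blake2b(
        json.dumps(inputs, sort_keys=True).encode(), digest_size=16
    ).hexdigest()
    work_dir = Path(cache_root) / key
    result_file = work_dir / "result.json"
    if result_file.exists():                 # cache hit: reuse prior outputs
        return json.loads(result_file.read_text())
    work_dir.mkdir(parents=True, exist_ok=True)
    result = task(**inputs)                  # cache miss: actually run the task
    result_file.write_text(json.dumps(result))
    return result
```

Calling ``run_cached`` twice with the same inputs executes the task only once; the second call is served from the working directory.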


Hash calculations
-----------------

Hashes are calculated for different types of objects in different ways. For example, the
hash of a string is simply the hash of the string itself, whereas the hash of a directory
is computed from the names and contents of the files within it. Implementations for
most common types are provided in the :mod:`pydra.utils.hash` module, but custom types
can be hashed by providing a custom ``bytes_repr`` function (see
:ref:`Registering custom bytes_repr functions`).

A cache dictionary is passed to each ``bytes_repr`` call, mapping an object's id (i.e.
as returned by the built-in ``id()`` function) to its hash, to avoid infinite recursion
in the case of circular references.

The byte representation of each object is hashed using the BLAKE2b cryptographic algorithm,
and these hashes are then combined to create a hash of the entire inputs object.
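A toy sketch of the scheme (hypothetical ``bytes_repr``/``hash_object`` helpers; Pydra's real implementations live in :mod:`pydra.utils.hash`):

```python
import hashlib

def bytes_repr(obj):
    """Toy byte-representation generator for a couple of simple types."""
    if isinstance(obj, str):
        yield obj.encode()
    elif isinstance(obj, list):
        for item in obj:              # combine the hashes of the elements
            yield hash_object(item).encode()
    else:
        raise TypeError(f"no bytes_repr for {type(obj).__name__}")

def hash_object(obj) -> str:
    """Hash an object's byte representation with BLAKE2b."""
    h = hashlib.blake2b(digest_size=16)
    for chunk in bytes_repr(obj):
        h.update(chunk)
    return h.hexdigest()

print(hash_object(["a", "b"]))
```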


File hash caching by mtime
--------------------------

To avoid having to recalculate the hashes of large files between runs, file hashes themselves
are cached in a platform-specific user directory. These hashes are stored within small
files named by yet another hash, of the file-system path and mtime of the file. This means that
the contents of a file should only need to be hashed once unless it is modified.

.. note::

    Due to limitations in mtime resolution on different platforms (e.g. 1 second on Linux,
    potentially 2 seconds on Windows), it is conceivable that a file could be modified,
    hashed, and then modified again within the resolution period, causing the cached hash
    to become invalid. Therefore, cached hashes are only used once the mtime resolution
    period has elapsed since the file was last modified, and hashes may be recalculated in
    some rare cases.
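The mtime-keyed cache can be sketched as follows (a toy model under stated assumptions: the cache location, key scheme, and helper name are hypothetical, and the mtime-resolution guard described in the note above is omitted for brevity):

```python
import hashlib
from pathlib import Path

def cached_file_hash(path, hash_cache_dir) -> str:
    """Toy sketch: cache a file's content hash keyed on its path and mtime,
    so large files are only re-hashed after they are modified."""
    path = Path(path)
    key = hashlib.blake2b(
        f"{path.resolve()}:{path.stat().st_mtime_ns}".encode(), digest_size=16
    ).hexdigest()
    cache_file = Path(hash_cache_dir) / key
    if cache_file.exists():
        return cache_file.read_text()        # reuse previously computed hash
    digest = hashlib.blake2b(path.read_bytes(), digest_size=16).hexdigest()
    cache_file.write_text(digest)
    return digest
```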


Registering custom bytes_repr functions
---------------------------------------

Work in progress...


Cache misses due to unstable hashes
-----------------------------------

Work in progress...
Lines changed: 57 additions & 14 deletions
Typing and file-formats
=======================

Pydra implements strong(-ish) type-checking at workflow construction time so that some errors
can be caught before workflows are run on potentially expensive computing resources.
Input and output fields of tasks can be typed using Python annotations.
Unlike how they are typically used, in Pydra these type annotations are not just for
documentation and linting purposes, but are used to enforce the types of the inputs
and outputs of tasks and workflows at workflow construction time and at runtime.

.. note::

    With the exception of fields containing file-system paths, which should be typed as
    a FileFormats_ class, types don't need to be specified if not desired.

File formats
------------
files, by the extensible collection of file format classes. These classes can be
used to specify the format of a file in a task input or output, and can be used
to validate the format of a file at runtime.

It is important to use a FileFormats_ type, instead of a ``str`` or ``pathlib.Path``,
when defining a field that takes paths to file-system objects, because otherwise only
the file path, not the file contents, will be used in the hash used to locate the cache
(see :ref:`Caches and hashes`). In most cases, however, it is sufficient to use the
generic ``fileformats.generic.File``, ``fileformats.generic.Directory``, or the even
more generic ``fileformats.generic.FsObject`` or ``fileformats.generic.FileSet`` classes.

The only cases where it isn't sufficient to use generic classes are when there are
implicit headers or sidecar files assumed to be present adjacent to the primary file (e.g.
a NIfTI file with an associated JSON sidecar file), because in these cases the
header/sidecar file(s) would not be included in the hash calculation and may not be
included when the "file set" is moved between working directories. In these cases, you
need to use specific file format classes, such as ``fileformats.nifti.NiftiGzX``, which
will check that the header/sidecar files are present.

Coercion
--------

Pydra will attempt to coerce an input to the correct type if it is not already of that
type; for example, if a tuple is provided to a field that is typed as a list, Pydra will
convert the tuple to a list before the task is run. By default, coercions will be
automatically applied between the following types:

* ty.Sequence → ty.Sequence
* ty.Mapping → ty.Mapping
* Path → os.PathLike
* str → os.PathLike
* os.PathLike → Path
* os.PathLike → str
* ty.Any → MultiInputObj
* int → float
* field.Integer → float
* int → field.Decimal

In addition to this, ``fileformats.fields.Singular`` objects (see FileFormats_)
can be coerced to and from their primitive types, and Numpy ndarrays and primitive types
can be coerced to and from Python sequences and built-in types, respectively.
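A minimal sketch of how such a coercion table might work (a toy ``coerce`` helper covering only a few of the pairs above, not Pydra's actual type-checking machinery):

```python
from pathlib import Path

def coerce(value, target: type):
    """Toy sketch of Pydra-style coercion: apply a registered conversion when
    the value is not already of the target type."""
    coercions = {
        (str, Path): Path,     # str → os.PathLike
        (Path, str): str,      # os.PathLike → str
        (tuple, list): list,   # one Sequence type to another
        (int, float): float,   # int → float
    }
    if isinstance(value, target):
        return value
    rule = coercions.get((type(value), target))
    if rule is None:
        raise TypeError(
            f"cannot coerce {type(value).__name__} to {target.__name__}")
    return rule(value)

print(coerce((1, 2, 3), list))  # → [1, 2, 3]
```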

Superclass auto-casting
-----------------------

Pydra is designed so that strict and specific typing can be used, but it is not
unnecessarily strict if that proves too burdensome. Therefore, upstream fields that are
typed as superclasses (or as ``typing.Any``, the default) of the task input they are
connected to will be automatically cast to the subclass when the task is run.
This allows workflows and tasks to be easily connected together
regardless of how specifically typing is defined in the task definition. This includes
file format types, so a task that expects a ``fileformats.medimage.NiftiGz`` file can
be connected to a task that outputs a ``fileformats.generic.File``.
Therefore, the only cases where a typing error will be raised are when the upstream
field can't be cast or coerced to the downstream field, e.g. a ``fileformats.medimage.DicomSeries``
cannot be cast to a ``fileformats.medimage.Nifti`` file.
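The behaviour can be illustrated with a toy class hierarchy (hypothetical ``File``/``NiftiGz``/``autocast`` names standing in for the file-format hierarchy; not Pydra's implementation):

```python
class File:
    """Toy generic file class."""
    def __init__(self, path: str):
        self.path = path

class NiftiGz(File):
    """Toy specific subclass that validates the extension on construction."""
    def __init__(self, path: str):
        if not path.endswith(".nii.gz"):
            raise TypeError(f"{path!r} is not a .nii.gz file")
        super().__init__(path)

def autocast(value: File, target: type) -> File:
    """Toy sketch: a value typed as a superclass is cast (and re-validated) as
    the more specific target type when the connection is resolved at run time."""
    if isinstance(value, target):
        return value
    return target(value.path)

nifti = autocast(File("scan.nii.gz"), NiftiGz)  # upstream File → downstream NiftiGz
print(type(nifti).__name__)  # → NiftiGz
```

The cast succeeds only if the runtime value actually satisfies the more specific type; otherwise a typing error is raised, mirroring the ``DicomSeries``-to-``Nifti`` example above.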


.. _FileFormats: https://arcanaframework.github.io/fileformats

new-docs/source/howto/create-task-package.ipynb

Lines changed: 3 additions & 1 deletion
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "# Create a task package\n",
   "\n",
   "Work in progress..."
  ]
 },
 {

new-docs/source/howto/port-from-nipype.ipynb

Lines changed: 3 additions & 1 deletion
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "# Port interfaces from Nipype\n",
   "\n",
   "Work in progress..."
  ]
 },
 {

new-docs/source/index.rst

Lines changed: 5 additions & 5 deletions
The power of Pydra lies in ease of constructing workflows, containing complex
multiparameter map-reduce operations, in Python code and the use of a global cache (see
:ref:`Design philosophy` for the rationale behind its design).

**Key features**:

* Combine diverse tasks (`Python functions <./tutorial/3-python.html>`__ or `shell commands <./tutorial/4-shell.html>`__) into coherent `workflows <./tutorial/5-workflow.html>`__
* Map-reduce like semantics (see :ref:`Splitting and combining`)
* Dynamic workflow construction using Python code (see :ref:`Dynamic construction`)
* Modular backends for deployment on different execution platforms (e.g. cloud, HPC, etc.) (see `Execution options <./tutorial/2-advanced-execution.html>`__)
* Support for the execution of tasks in containerized environments (see :ref:`Software environments`)
* Global caching to reduce recomputation (see :ref:`Caches and hashes`)
* Support for strong type-checking, including file types, at workflow construction time (see :ref:`Typing and file-formats`)