
Commit 389ee99

Merge pull request #325 from djarecka/doc/userguide
[Doc] Adding a first version of a user guide
2 parents e2f0849 + 833a754 commit 389ee99

File tree

13 files changed: +595 -1 lines changed

docs/changes.rst

Lines changed: 20 additions & 0 deletions
@@ -1,6 +1,26 @@
Release Notes
=============

0.8.0
-----

* refactoring template formatting for ``input_spec``
* fixing issues with input fields with extension (and using them in templates)
* adding simple validators to input spec (using ``attr.validator``)
* adding ``create_dotfile`` for workflows, which creates graphs as dotfiles (they can be converted to other formats if ``dot`` is available)
* adding a simple user guide with an ``input_spec`` description
* expanding docstrings for ``State``, ``audit`` and ``messenger``
* updating syntax to newer Python

0.7.0
-----

* refactoring the error handling by padra: improving raised errors, removing nodes from the workflow graph that can't be run
* refactoring of the ``input_spec``: adapting better to the nipype interfaces
* switching from ``pkg_resources.declare_namespace`` to the stdlib ``pkgutil.extend_path``
* moving ``readme`` to rst format

0.6.2
-----

docs/combiner.rst

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
Grouping Task's Output
=======================

In addition to splitting the input, *Pydra* supports grouping
or combining the outputs resulting from the splits.
In order to achieve this for a *Task*, a user can specify a *combiner*.
This can be set by calling the ``combine`` method.
Note that the *combiner* only makes sense when a *splitter* is
set first. When *combiner=x*, all values are combined together within one list,
and each element of the list represents an output of the *Task* for a specific
value of the input *x*. Splitting and combining for this example can be written
as follows:

.. math::

    S = x &:& ~x=[x_1, x_2, ..., x_n] \mapsto x=x_1, x=x_2, ..., x=x_n, \\
    C = x &:& ~out(x_1), ..., out(x_n) \mapsto out_{comb}=[out(x_1), ..., out(x_n)],

where :math:`S` represents the *splitter*, :math:`C` represents the *combiner*, :math:`x` is the input field,
:math:`out(x_i)` represents the output of the *Task* for :math:`x_i`, and :math:`out_{comb}`
is the final output after applying the *combiner*.

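As a concrete illustration, the following is a minimal sketch (reusing the
``add2`` function task from the *Dataflows Components* section; the input values
are only for illustration) of splitting and combining over a single field:

.. code-block:: python

    # split the task over the two values of x, then group the outputs
    # back into a single list (one element per value of x)
    task = add2(x=[1, 5]).split("x").combine("x")
    task()                   # executes one run per value of x
    results = task.result()  # list of results, grouped by the combiner
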
In the situation where input has multiple fields and an *outer splitter* is used,
24+
there are various ways of combining the output.
25+
Taking as an example the task from the previous section,
26+
user might want to combine all the outputs for one specific value of :math:`x_i` and
27+
all the values of :math:`y`.
28+
In this situation, the combined output would be a two dimensional list, each
29+
inner list for each value of :math:`x`. This can be written as follow:
30+
31+
.. math::
32+
33+
C = y &:& ~out(x_1, y1), out(x_1, y2), ...out(x_n, y_m) \\
34+
&\longmapsto& ~[[out(x_1, y_1), ..., out(x_1, y_m)], \\
35+
&& ~..., \\
36+
&& ~[out(x_n, y_1), ..., out(x_n, y_m)]].
37+
38+
39+
40+
41+
.. figure:: images/nd_spl_3_comb1.png
42+
:figclass: h!
43+
:scale: 75%
44+
45+
46+
47+
However, for the same task the user might want to combine
48+
all values of :math:`x` for specific values of :math:`y`.
49+
One may also need to combine all the values together.
50+
This can be achieved by providing a list of fields, :math:`[x, y]` to the combiner.
51+
When a full combiner is set, i.e. all the fields from
52+
the splitter are also in the combiner, the output is a one dimensional list:
53+
54+
.. math::
55+
56+
C = [x, y] : out(x_1, y1), ...out(x_n, y_m) \longmapsto [out(x_1, y_1), ..., out(x_n, y_m)].
57+
58+
59+
.. figure:: images/nd_spl_3_comb3.png
60+
:figclass: h!
61+
:scale: 75%
62+
63+
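The following sketch shows these two options in code (``multiply`` is a
hypothetical function task with two inputs, ``x`` and ``y``, used only for
illustration):

.. code-block:: python

    # outer splitter over x and y, combined only over y:
    # the output is a two-dimensional list, one inner list per value of x
    task_y = multiply(x=[1, 2], y=[10, 100]).split(["x", "y"]).combine("y")

    # full combiner over both fields: the output is a flat, one-dimensional list
    task_xy = multiply(x=[1, 2], y=[10, 100]).split(["x", "y"]).combine(["x", "y"])
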
These are the basic examples of *Pydra*'s *splitter-combiner* concept. It
is important to note that *Pydra* allows for mixing *splitters* and *combiners*
on various levels of a dataflow. They can be set on a single *Task* or a *Workflow*,
and they can be passed from one *Task* to the following *Tasks* within the *Workflow*.

docs/components.rst

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
Dataflows Components: Task and Workflow
=======================================

A *Task* is the basic runnable component of *Pydra* and is described by the
class ``TaskBase``. A *Task* has named inputs and outputs, thus allowing
construction of dataflows. It can be hashed and executes in a specific working
directory. Any *Pydra* *Task* can be used as a function in a script, thus allowing
dual use in *Pydra*'s *Workflows* and in standalone scripts. There are several
classes that inherit from ``TaskBase`` and each has a different application:


Function Tasks
--------------

* ``FunctionTask`` is a *Task* that executes Python functions. Most Python functions
  declared in an existing library, package, or interactively in a terminal can
  be converted to a ``FunctionTask`` by using *Pydra*'s decorator - ``mark.task``.

  .. code-block:: python

     import numpy as np
     from pydra import mark
     fft = mark.annotate({'a': np.ndarray,
                          'return': float})(np.fft.fft)
     fft_task = mark.task(fft)()
     result = fft_task(a=np.random.rand(512))

  ``fft_task`` is now a *Pydra* *Task* and ``result`` will contain a *Pydra* ``Result`` object.
  In addition, the user can use Python's function annotations or another *Pydra*
  decorator --- ``mark.annotate`` --- in order to specify the output. In the
  following example, we decorate an arbitrary Python function to create named
  outputs:

  .. code-block:: python

     @mark.task
     @mark.annotate(
         {"return": {"mean": float, "std": float}}
     )
     def mean_dev(my_data):
         import statistics as st
         return st.mean(my_data), st.stdev(my_data)

     result = mean_dev(my_data=[...])()

  When the *Task* is executed, ``result.output`` will contain two attributes: ``mean``
  and ``std``. Named attributes facilitate passing different outputs to
  different downstream nodes in a dataflow.

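  For example (a minimal sketch; the input list is just an illustrative value),
  the named outputs can be accessed on the returned ``Result`` object:

  .. code-block:: python

     result = mean_dev(my_data=[2.0, 4.0, 6.0])()
     print(result.output.mean, result.output.std)
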
.. _shell_command_task:

Shell Command Tasks
-------------------

* ``ShellCommandTask`` is a *Task* used to run shell commands and executables.
  It can be used with a simple command without any arguments, or with a specific
  set of arguments and flags, e.g.:

  .. code-block:: python

     ShellCommandTask(executable="pwd")

     ShellCommandTask(executable="ls", args="my_dir")

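  As a quick sketch of what execution looks like (the default output
  specification of a ``ShellCommandTask`` captures the command's standard
  output, among other fields):

  .. code-block:: python

     task = ShellCommandTask(executable="pwd")
     result = task()
     print(result.output.stdout)
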
  The *Task* can accommodate more complex shell commands by allowing the user to
  customize inputs and outputs of the commands.
  One can generate an input
  specification to specify names of inputs, positions in the command, types of
  the inputs, and other metadata.
  As a specific example, FSL's BET command (Brain
  Extraction Tool) can be called on the command line as:

  .. code-block:: bash

     bet input_file output_file -m

  Each of the command arguments can be treated as a named input to the
  ``ShellCommandTask``, and can be included in the input specification.
  As shown next, even an output is specified by constructing
  the *out_file* field from a template:

  .. code-block:: python

     bet_input_spec = SpecInfo(
         name="Input",
         fields=[
             ("in_file", File,
              {"help_string": "input file ...",
               "position": 1,
               "mandatory": True}),
             ("out_file", str,
              {"help_string": "name of output ...",
               "position": 2,
               "output_file_template": "{in_file}_br"}),
             ("mask", bool,
              {"help_string": "create binary mask",
               "argstr": "-m"})],
         bases=(ShellSpec,))

     ShellCommandTask(executable="bet",
                      input_spec=bet_input_spec)

  More details are in the :ref:`Input Specification section`.

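  As a rough sketch of how such a customized task might be used (the input file
  name here is purely hypothetical), the generated command line can be inspected
  before running the task:

  .. code-block:: python

     bet_task = ShellCommandTask(executable="bet",
                                 input_spec=bet_input_spec,
                                 in_file="sub-01_T1w.nii.gz",
                                 mask=True)
     # prints the full shell command that Pydra would run,
     # with out_file filled in from its template
     print(bet_task.cmdline)
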
Container Tasks
---------------

* ``ContainerTask`` class is a child class of ``ShellCommandTask`` and serves as
  a parent class for ``DockerTask`` and ``SingularityTask``. Both *Container Tasks*
  run shell commands or executables within containers with specific user-defined
  environments, using Docker_ and Singularity_ software respectively.
  This might be extremely useful for users and projects that require environment
  encapsulation and sharing.
  Using container technologies helps improve the reproducibility of scientific
  workflows, one of the key concepts behind *Pydra*.

  These *Container Tasks* can be defined by using the
  ``DockerTask`` and ``SingularityTask`` classes directly, or can be created
  automatically from a ``ShellCommandTask`` when the optional argument
  ``container_info`` is used when creating a *Shell Task*. The following two
  types of syntax are equivalent:

  .. code-block:: python

     DockerTask(executable="pwd", image="busybox")

     ShellCommandTask(executable="pwd",
                      container_info=("docker", "busybox"))

Workflows
---------

* ``Workflow`` is a subclass of *Task* that provides support for creating *Pydra*
  dataflows. As a subclass, a *Workflow* acts like a *Task* and has inputs, outputs,
  is hashable, and is treated as a single unit. Unlike *Tasks*, workflows embed
  a directed acyclic graph. Each node of the graph contains a *Task* of any type,
  including another *Workflow*, and can be added to the *Workflow* simply by calling
  the ``add`` method. The connections between *Tasks* are defined by using so-called
  *Lazy Inputs* or *Lazy Outputs*. These are special attributes that allow
  assignment of values when a *Workflow* is executed rather than at the point of
  assignment. The following example creates a *Workflow* from two *Pydra* *Tasks*.

  .. code-block:: python

     # creating a workflow with two input fields
     wf = Workflow(name="wf", input_spec=["x", "y"])
     # adding a task and connecting the task's input
     # to the workflow input
     wf.add(mult(name="mlt",
                 x=wf.lzin.x, y=wf.lzin.y))
     # adding another task and connecting
     # the task's input to the "mult" task's output
     wf.add(add2(name="add", x=wf.mlt.lzout.out))
     # setting the workflow output
     wf.set_output([("out", wf.add.lzout.out)])

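  A short sketch of how such a workflow might then be executed and its results
  retrieved (assuming ``mult`` and ``add2`` are function tasks defined with
  ``mark.task``, and using the concurrent-futures worker):

  .. code-block:: python

     from pydra import Submitter

     wf.inputs.x = 2
     wf.inputs.y = 3
     with Submitter(plugin="cf") as sub:
         sub(wf)
     result = wf.result()
     # result.output.out holds the value connected via set_output
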

Task's State
------------
All Tasks, including Workflows, can have an optional attribute representing an instance of the State class.
This attribute controls the execution of a Task over different input parameter sets.
This class is at the heart of Pydra's powerful Map-Reduce over arbitrary inputs of nested dataflows feature.
The State class formalizes how users can specify arbitrary combinations.
Its functionality is used to create and track different combinations of input parameters,
and optionally allows limited or complete recombination.
In order to specify how the inputs should be split into parameter sets, and optionally combined after
the Task execution, the user can set the splitter and combiner attributes of the State class.

.. code-block:: python

   task_with_state = add2(x=[1, 5]).split("x").combine("x")

In this example, the ``State`` class is responsible for creating a list of two
separate inputs, *[{x: 1}, {x: 5}]*; each run of the *Task* gets one
element from the list.
The results are grouped back when returning the result from the *Task*.
While this example
illustrates mapping and grouping of results over a single parameter, *Pydra*
extends this to arbitrary combinations of input fields and downstream grouping
over nested dataflows. Details of how splitters and combiners power *Pydra*'s
scalable dataflows are described in the next section.


.. _Docker: https://www.docker.com/
.. _Singularity: https://www.singularity.lbl.gov/

docs/images/nd_spl_1.png (30.2 KB)

docs/images/nd_spl_3.png (25.9 KB)

docs/images/nd_spl_3_comb1.png (26.7 KB)

docs/images/nd_spl_3_comb3.png (27.5 KB)

docs/images/nd_spl_4.png (16.5 KB)

docs/index.rst

Lines changed: 68 additions & 0 deletions
@@ -6,10 +6,78 @@
Welcome to Pydra: A simple dataflow engine with scalable semantics's documentation!
===================================================================================

Pydra is a new lightweight dataflow engine written in Python.
Pydra is developed as an open-source project in the neuroimaging community,
but it is designed as a general-purpose dataflow engine to support any scientific domain.

Scientific workflows often require sophisticated analyses that encompass a large collection
of algorithms.
These algorithms were not necessarily designed to work together
and were written by different authors.
Some may be written in Python, while others might require calling external programs.
It is a common practice to create semi-manual workflows that require the scientists
to handle the files and interact with partial results from algorithms and external tools.
This approach is conceptually simple and easy to implement, but the resulting workflow
is often time-consuming, error-prone and difficult to share with others.
Consistency, reproducibility and scalability demand scientific workflows
to be organized into fully automated pipelines.
This was the motivation behind Pydra - a new dataflow engine written in Python.

The Pydra package is a part of the second generation of the Nipype_ ecosystem
--- an open-source framework that provides a uniform interface to existing neuroimaging
software and facilitates interaction between different software components.
The Nipype project was born in the neuroimaging community, and has been helping scientists
build workflows for a decade, providing a uniform interface to such neuroimaging packages
as FSL_, ANTs_, AFNI_, FreeSurfer_ and SPM_.
This flexibility has made it an ideal basis for popular preprocessing tools,
such as fMRIPrep_ and C-PAC_.
The second generation of the Nipype ecosystem is meant to provide additional flexibility
and is being developed with reproducibility, ease of use, and scalability in mind.
Pydra itself is a standalone project and is designed as a general-purpose dataflow engine
to support any scientific domain.

The goal of Pydra is to provide a lightweight dataflow engine for computational graph construction,
manipulation, and distributed execution, as well as ensuring reproducibility of scientific pipelines.
In Pydra, a dataflow is represented as a directed acyclic graph, where each node represents a Python
function, execution of an external tool, or another reusable dataflow.
The combination of several key features makes Pydra a customizable and powerful dataflow engine:

- Composable dataflows: Any node of a dataflow graph can be another dataflow, allowing for nested
  dataflows of arbitrary depths and encouraging the creation of reusable dataflows.

- Flexible semantics for creating nested loops over input sets: Any Task or dataflow can be run
  over input parameter sets and the outputs can be recombined (a concept similar to the Map-Reduce_ model,
  but Pydra extends this to graphs with nested dataflows).

- A content-addressable global cache: Hash values are computed for each graph and each Task.
  This supports reuse of previously computed and stored dataflows and Tasks.

- Support for Python functions and external (shell) commands: Pydra can decorate and use existing
  functions in Python libraries alongside external command line tools, allowing easy integration
  of existing code and software.

- Native container execution support: Any dataflow or Task can be executed in an associated container
  (via Docker or Singularity), enabling greater consistency for reproducibility.

- Auditing and provenance tracking: Pydra provides a simple JSON-LD-based message passing mechanism
  to capture dataflow execution activities as a provenance graph. These messages track inputs
  and outputs of each task in a dataflow, and the resources consumed by the task.

.. _Nipype: https://nipype.readthedocs.io/en/latest/
.. _FSL: https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FSL
.. _ANTs: http://stnava.github.io/ANTs/
.. _AFNI: https://afni.nimh.nih.gov/
.. _FreeSurfer: https://surfer.nmr.mgh.harvard.edu/
.. _SPM: https://www.fil.ion.ucl.ac.uk/spm/
.. _fMRIPrep: https://fmriprep.org/en/stable/
.. _C-PAC: https://fcp-indi.github.io/docs/latest/index
.. _Map-Reduce: https://en.wikipedia.org/wiki/MapReduce

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   user_guide
   changes
   api