
Commit 88a05bb (parent e6c488f)

filled out explanation sections with subheadings

File tree: 8 files changed, +447 −281 lines
Lines changed: 34 additions & 1 deletion
Dynamic construction
====================

Pydra workflows are constructed dynamically by workflow "constructor" functions. These
functions can use any valid Python code, allowing rich and complex workflows to be
constructed based on the inputs to the workflow. For example, a workflow constructor
could include conditional branches, loops, or other control-flow structures to tailor
the workflow to the specific inputs provided.
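As a purely illustrative sketch (plain Python, not the actual Pydra API; the step names are made up), a constructor function might branch on its inputs like this:

```python
# Hypothetical sketch (not the actual Pydra API): a constructor function that
# builds a different sequence of steps depending on its inputs.
def build_workflow(image_type: str) -> list:
    steps = ["load"]
    if image_type == "dicom":      # conditional branch at construction time
        steps.append("convert_to_nifti")
    for _ in range(2):             # loops can add repeated stages
        steps.append("denoise")
    steps.append("save")
    return steps

print(build_workflow("dicom"))
# → ['load', 'convert_to_nifti', 'denoise', 'denoise', 'save']
```

The same constructor produces a shorter graph for other inputs, e.g. ``build_workflow("nifti")`` omits the conversion step.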


Lazy fields
-----------

Pydra workflows are constructed by the assignment of "lazy field" placeholders from
the outputs of upstream nodes to the inputs of downstream nodes. These placeholders,
which are instances of the :class:`pydra.engine.specs.LazyField` class, are replaced
by the actual values they represent when the workflow is run.
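The underlying idea can be sketched in plain Python (a toy model; the names ``Lazy``, ``graph``, and ``run`` are hypothetical and not part of Pydra):

```python
class Lazy:
    """Toy placeholder for the output of an upstream node."""
    def __init__(self, node: str):
        self.node = node

# node name -> (function, inputs); Lazy values stand in for upstream outputs
graph = {
    "sum": (lambda a, b: a + b, {"a": 1, "b": 2}),
    "dbl": (lambda x: 2 * x, {"x": Lazy("sum")}),
}

def run(graph):
    """Replace each Lazy placeholder with the real value once it is computed."""
    results = {}
    for name, (fn, inputs) in graph.items():  # assumes topological order
        kwargs = {k: results[v.node] if isinstance(v, Lazy) else v
                  for k, v in inputs.items()}
        results[name] = fn(**kwargs)
    return results

print(run(graph))  # → {'sum': 3, 'dbl': 6}
```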


Caching of workflow construction
--------------------------------

Workflows are constructed just before they are executed to produce a Directed Acyclic Graph
(DAG) of nodes. Tasks are generated from these nodes as upstream inputs become available
and added to the execution stack. If the workflow has been split, either at the top level,
in an upstream node or at the current node, then a separate task will be generated for
each element of the split.
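A toy sketch of task generation over a split (hypothetical helper, not Pydra's internals):

```python
def generate_tasks(inputs: dict, split_over: str) -> list:
    """Toy sketch: generate one task per element of the field the node is
    split over, holding all other inputs fixed."""
    return [dict(inputs, **{split_over: value}) for value in inputs[split_over]]

tasks = generate_tasks({"image": ["a.nii", "b.nii"], "sigma": 2.0}, "image")
print(tasks)
# → [{'image': 'a.nii', 'sigma': 2.0}, {'image': 'b.nii', 'sigma': 2.0}]
```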


Nested workflows and lazy conditionals
--------------------------------------

Since lazy fields are only evaluated at runtime, they can't be used in conditional
statements that construct the workflow. However, if a section of a workflow
needs to be conditionally included or excluded based on upstream outputs, that
section can be implemented in a nested workflow, with the upstream output connected
to the nested workflow's inputs.
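Why lazy fields cannot drive constructor-time conditionals can be illustrated with a toy placeholder whose truth value is simply undefined until runtime (illustrative only; this is not Pydra's ``LazyField`` implementation):

```python
class LazyField:
    """Toy placeholder: its value, and hence its truthiness, is unknown
    until the workflow actually runs."""
    def __init__(self, name: str):
        self.name = name

    def __bool__(self):
        raise TypeError(
            f"{self.name!r} is a lazy placeholder; put the conditional logic "
            "inside a nested workflow instead of branching on the placeholder"
        )

threshold_ok = LazyField("threshold_ok")
try:
    if threshold_ok:  # constructor-time branching on a lazy value fails
        pass
except TypeError as err:
    print(err)
```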
Lines changed: 31 additions & 2 deletions
Software environments
=====================

Pydra supports running tasks within encapsulated software environments, such as Docker_
and Singularity_ containers. This can be specified at runtime or during workflow
construction, and allows tasks to be run in environments that are isolated from the
host system and that have specific software dependencies.

The environment a task runs within is specified by the ``environment`` argument passed
to the execution call (e.g. ``my_task(plugin="cf", environment="docker")``) or in the
``workflow.add()`` call in workflow constructors.
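Conceptually, an environment wrapper rewrites a task's command line so that it executes inside the container with the task working directory mounted. A hedged sketch of the idea (not Pydra's actual implementation; the image and tool names are made up):

```python
def wrap_in_docker(cmd: list, image: str, work_dir: str) -> list:
    """Toy sketch: run a task's command inside a Docker container, bind-mounting
    the task working directory so inputs and outputs are visible on both sides."""
    return [
        "docker", "run", "--rm",
        "-v", f"{work_dir}:{work_dir}",  # expose the working directory
        "-w", work_dir,                  # and start in it
        image,
        *cmd,
    ]

print(wrap_in_docker(["my-tool", "--in", "data.txt"],
                     "example/image:latest", "/tmp/task-0"))
```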

Specifying at execution
-----------------------

Work in progress...


Specifying at workflow construction
-----------------------------------

Work in progress...


Implementing new environment types
----------------------------------

Work in progress...


.. _Docker: https://www.docker.com/
.. _Singularity: https://sylabs.io/singularity/
Lines changed: 59 additions & 2 deletions
Caches and hashes
=================

In Pydra, each task is run within its own working directory. If a task completes
successfully, its outputs are stored within this working directory. Working directories
are created within a cache directory, which is specified when the task is executed, and
are named according to the hash of the task's inputs. This means that if the same task is
executed with the same inputs, the same working directory will be used and, instead of
the task being rerun, the outputs from the previous run will be reused.

In this manner, incomplete workflows can be resumed from where they left off, and completed
workflows can be rerun without having to rerun all of their tasks. This is particularly useful
when working with datasets that are to be analysed in several different ways with
common intermediate steps, or when debugging workflows that have failed part way through.
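The scheme can be sketched in a few lines of plain Python (a toy model, not Pydra's implementation; ``run_cached`` and the JSON result file are hypothetical):

```python
import hashlib
import json
from pathlib import Path

def run_cached(task, inputs: dict, cache_root: str):
    """Toy sketch of Pydra-style caching: the working directory is named by a
    hash of the inputs; if a result already exists there, it is reused."""
    key = hashlib.blake2b(
        json.dumps(inputs, sort_keys=True).encode(), digest_size=16
    ).hexdigest()
    work_dir = Path(cache_root) / key
    result_file = work_dir / "result.json"
    if result_file.exists():                 # cache hit: reuse prior outputs
        return json.loads(result_file.read_text())
    work_dir.mkdir(parents=True, exist_ok=True)
    result = task(**inputs)                  # cache miss: actually run the task
    result_file.write_text(json.dumps(result))
    return result
```

Calling ``run_cached`` twice with the same inputs executes the task only once; the second call is served from the working directory.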


Hash calculations
-----------------

Hashes are calculated for different types of objects in different ways. For example, the
hash of a string is simply the hash of the string itself, whereas the hash of a directory
is computed from the names and contents of the files within it. Implementations for
most common types are provided in the :mod:`pydra.utils.hash` module, but custom types
can be hashed by providing a custom ``bytes_repr`` function (see
:ref:`Registering custom bytes_repr functions`).

A cache dictionary is passed to each ``bytes_repr`` call, mapping an object's id (i.e.
as returned by the built-in ``id()`` function) to its hash, to avoid infinite recursion
in the case of circular references.

The byte representation of each object is hashed using the BLAKE2b cryptographic algorithm,
and these hashes are then combined to create a hash of the entire inputs object.
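A toy sketch of the scheme (hypothetical ``bytes_repr``/``hash_object`` helpers; Pydra's real implementations live in :mod:`pydra.utils.hash`):

```python
import hashlib

def bytes_repr(obj):
    """Toy byte-representation generator for a couple of simple types."""
    if isinstance(obj, str):
        yield obj.encode()
    elif isinstance(obj, list):
        for item in obj:              # combine the hashes of the elements
            yield hash_object(item).encode()
    else:
        raise TypeError(f"no bytes_repr for {type(obj).__name__}")

def hash_object(obj) -> str:
    """Hash an object's byte representation with BLAKE2b."""
    h = hashlib.blake2b(digest_size=16)
    for chunk in bytes_repr(obj):
        h.update(chunk)
    return h.hexdigest()

print(hash_object(["a", "b"]))
```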


File hash caching by mtime
--------------------------

To avoid having to recalculate the hashes of large files between runs, file hashes themselves
are cached in a platform-specific user directory. These hashes are stored within small
files named by yet another hash, of the file-system path and mtime of the file. This means that
the contents of a file should only need to be hashed once unless it is modified.

.. note::

    Due to limitations in mtime resolution on different platforms (e.g. 1 second on Linux,
    potentially 2 seconds on Windows), it is conceivable that a file could be modified,
    hashed, and then modified again within the resolution period, causing the cached hash
    to become invalid. Therefore, cached hashes are only used once the mtime resolution
    period has elapsed since the file was last modified, and hashes may be recalculated in
    some rare cases.
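The mtime-keyed cache can be sketched as follows (a toy model under stated assumptions: the cache location, key scheme, and helper name are hypothetical, and the mtime-resolution guard described in the note above is omitted for brevity):

```python
import hashlib
from pathlib import Path

def cached_file_hash(path, hash_cache_dir) -> str:
    """Toy sketch: cache a file's content hash keyed on its path and mtime,
    so large files are only re-hashed after they are modified."""
    path = Path(path)
    key = hashlib.blake2b(
        f"{path.resolve()}:{path.stat().st_mtime_ns}".encode(), digest_size=16
    ).hexdigest()
    cache_file = Path(hash_cache_dir) / key
    if cache_file.exists():
        return cache_file.read_text()        # reuse previously computed hash
    digest = hashlib.blake2b(path.read_bytes(), digest_size=16).hexdigest()
    cache_file.write_text(digest)
    return digest
```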


Registering custom bytes_repr functions
---------------------------------------

Work in progress...


Cache misses due to unstable hashes
-----------------------------------

Work in progress...
Lines changed: 57 additions & 14 deletions
Typing and file-formats
=======================

Pydra implements strong(-ish) type-checking at workflow construction time so that some errors
can be caught before workflows are run on potentially expensive computing resources.
Input and output fields of tasks can be typed using Python annotations.
Unlike how they are typically used, in Pydra these type annotations are not just for
documentation and linting purposes, but are used to enforce the types of the inputs
and outputs of tasks and workflows at workflow construction time and at runtime.

.. note::

    With the exception of fields containing file-system paths, which should be typed as
    a FileFormats_ class, types don't need to be specified if not desired.

File formats
------------
files, by the extensible collection of file format classes. These classes can be
used to specify the format of a file in a task input or output, and can be used
to validate the format of a file at runtime.

It is important to use a FileFormats_ type, instead of a ``str`` or ``pathlib.Path``,
when defining a field that takes paths to file-system objects, because otherwise only
the file path, not the file contents, will be used in the hash used to locate the cache
(see :ref:`Caches and hashes`). In most cases, however, it is sufficient to use the
generic ``fileformats.generic.File``, ``fileformats.generic.Directory``, or the even
more generic ``fileformats.generic.FsObject`` or ``fileformats.generic.FileSet`` classes.

The only cases where it isn't sufficient to use generic classes are when there are
implicit headers or sidecar files assumed to be present adjacent to the primary file (e.g.
a NIfTI file with an associated JSON sidecar file), because in these cases the
header/sidecar file(s) would not be included in the hash calculation and may not be
included when the "file set" is moved between working directories. In these cases, you
need to use specific file format classes, such as ``fileformats.nifti.NiftiGzX``, which
will check that the header/sidecar files are present.

Coercion
--------

Pydra will attempt to coerce an input to the correct type if it is not already of that
type; for example, if a tuple is provided to a field that is typed as a list, Pydra will
convert the tuple to a list before the task is run. By default, coercions will be
automatically applied between the following types:

* ty.Sequence → ty.Sequence
* ty.Mapping → ty.Mapping
* Path → os.PathLike
* str → os.PathLike
* os.PathLike → Path
* os.PathLike → str
* ty.Any → MultiInputObj
* int → float
* field.Integer → float
* int → field.Decimal

In addition to this, ``fileformats.fields.Singular`` objects (see FileFormats_)
can be coerced to and from their primitive types, and Numpy ndarrays and primitive types
can be coerced to and from Python sequences and built-in types, respectively.
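A minimal sketch of how such a coercion table might work (a toy ``coerce`` helper covering only a few of the pairs above, not Pydra's actual type-checking machinery):

```python
from pathlib import Path

def coerce(value, target: type):
    """Toy sketch of Pydra-style coercion: apply a registered conversion when
    the value is not already of the target type."""
    coercions = {
        (str, Path): Path,     # str → os.PathLike
        (Path, str): str,      # os.PathLike → str
        (tuple, list): list,   # one Sequence type to another
        (int, float): float,   # int → float
    }
    if isinstance(value, target):
        return value
    rule = coercions.get((type(value), target))
    if rule is None:
        raise TypeError(
            f"cannot coerce {type(value).__name__} to {target.__name__}")
    return rule(value)

print(coerce((1, 2, 3), list))  # → [1, 2, 3]
```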

Superclass auto-casting
-----------------------

Pydra is designed so that strict and specific typing can be used, but it is not
unnecessarily strict if that proves too burdensome. Therefore, upstream fields that are
typed as superclasses (or as ``typing.Any``, the default) of the task input they are
connected to will be automatically cast to the subclass when the task is run.
This allows workflows and tasks to be easily connected together
regardless of how specifically typing is defined in the task definition. This includes
file format types, so a task that expects a ``fileformats.medimage.NiftiGz`` file can
be connected to a task that outputs a ``fileformats.generic.File``.
Therefore, the only cases where a typing error will be raised are when the upstream
field can't be cast or coerced to the downstream field, e.g. a ``fileformats.medimage.DicomSeries``
cannot be cast to a ``fileformats.medimage.Nifti`` file.
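The behaviour can be illustrated with a toy class hierarchy (hypothetical ``File``/``NiftiGz``/``autocast`` names standing in for the file-format hierarchy; not Pydra's implementation):

```python
class File:
    """Toy generic file class."""
    def __init__(self, path: str):
        self.path = path

class NiftiGz(File):
    """Toy specific subclass that validates the extension on construction."""
    def __init__(self, path: str):
        if not path.endswith(".nii.gz"):
            raise TypeError(f"{path!r} is not a .nii.gz file")
        super().__init__(path)

def autocast(value: File, target: type) -> File:
    """Toy sketch: a value typed as a superclass is cast (and re-validated) as
    the more specific target type when the connection is resolved at run time."""
    if isinstance(value, target):
        return value
    return target(value.path)

nifti = autocast(File("scan.nii.gz"), NiftiGz)  # upstream File → downstream NiftiGz
print(type(nifti).__name__)  # → NiftiGz
```

The cast succeeds only if the runtime value actually satisfies the more specific type; otherwise a typing error is raised, mirroring the ``DicomSeries``-to-``Nifti`` example above.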


.. _FileFormats: https://arcanaframework.github.io/fileformats

new-docs/source/howto/create-task-package.ipynb

Lines changed: 3 additions & 1 deletion
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "# Create a task package\n",
   "\n",
   "Work in progress..."
  ]
 },
 {

new-docs/source/howto/port-from-nipype.ipynb

Lines changed: 3 additions & 1 deletion
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "# Port interfaces from Nipype\n",
   "\n",
   "Work in progress..."
  ]
 },
 {

new-docs/source/index.rst

Lines changed: 5 additions & 5 deletions
The power of Pydra lies in ease of constructing workflows, containing complex
multiparameter map-reduce operations, in Python code and the use of a global cache (see
:ref:`Design philosophy` for the rationale behind its design).

**Key features**:

* Combine diverse tasks (`Python functions <./tutorial/3-python.html>`__ or `shell commands <./tutorial/4-shell.html>`__) into coherent `workflows <./tutorial/5-workflow.html>`__
* Map-reduce like semantics (see :ref:`Splitting and combining`)
* Dynamic workflow construction using Python code (see :ref:`Dynamic construction`)
* Modular backends for deployment on different execution platforms (e.g. cloud, HPC, etc.) (see `Execution options <./tutorial/2-advanced-execution.html>`__)
* Support for the execution of tasks in containerized environments (see :ref:`Software environments`)
* Global caching to reduce recomputation (see :ref:`Caches and hashes`)
* Support for strong type-checking, including file types, at workflow construction time (see :ref:`Typing and file-formats`)