| 1 | +# Task graph |
| 2 | + |
| 3 | +In Spider, a task graph serves as the underlying representation of a job. It is a directed acyclic |
| 4 | +graph (DAG) that captures a collection of tasks and the dependency relationships among them. |
| 5 | + |
| 6 | +This document specifies the design and semantics of the task graph. |
| 7 | + |
| 8 | +## Specification |
| 9 | + |
| 10 | +### Task |
| 11 | + |
| 12 | +A task is a vertex in the task graph. Each task contains metadata required for execution, including: |
| 13 | + |
| 14 | +* TDL package name: The identifier of the TDL package that contains the task function |
| 15 | + implementation. |
| 16 | +* Task function name: The name of the function that implements the task's logic. |
| 17 | +* Other metadata, such as the maximum number of retries allowed.
| 18 | + |
| 19 | +Tasks are classified based on their position in the graph: |
| 20 | + |
| 21 | +* Input task: A task with no parent tasks. It serves as a starting point for execution. |
| 22 | +* Output task: A task with no child tasks. It represents a terminal point in the execution. |
| 23 | +* Intermediate task: A task that has both parent and child tasks. |
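
The classification above can be sketched as a small function. This is only an illustration: the task IDs, the `(parent, child)` edge encoding, and the name `classify_tasks` are assumptions for this sketch, not part of Spider. Under the definitions above, a task with neither parents nor children satisfies both the input and the output definition, so the sketch reports both labels for it. That is one reading of the definitions, not something the specification states.

```python
from collections import defaultdict


def classify_tasks(tasks, edges):
    """Label each task by its position in the graph.

    `tasks` is an iterable of task IDs and `edges` is an iterable of
    (parent, child) pairs. Both shapes are assumptions for this sketch.
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
        parents[child].add(parent)

    labels = {}
    for task in tasks:
        roles = set()
        if not parents[task]:
            roles.add("input")  # no parent tasks
        if not children[task]:
            roles.add("output")  # no child tasks
        if not roles:
            roles.add("intermediate")  # has both parents and children
        labels[task] = roles
    return labels


# A three-task chain: a -> b -> c
labels = classify_tasks(["a", "b", "c"], [("a", "b"), ("b", "c")])
```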
| 24 | + |
| 25 | +### Task inputs |
| 26 | + |
| 27 | +Each task defines a finite, ordered list of task inputs. |
| 28 | + |
| 29 | +* Task inputs are positional and typed. |
| 30 | +* Each input position expects exactly one instance of the declared type. |
| 31 | + |
| 32 | +### Task outputs |
| 33 | + |
| 34 | +Each task defines a finite, ordered list of task outputs. |
| 35 | + |
| 36 | +* Task outputs are positional and typed. |
| 37 | +* Each output position produces at most one instance of the declared type upon successful task |
| 38 | + execution. |
| 39 | + |
| 40 | +### Task dependencies |
| 41 | + |
| 42 | +There are two types of dependencies between tasks in the task graph: data-flow dependencies and
| 43 | +control-flow dependencies.
| 44 | + |
| 45 | +#### Data and data-flow dependencies |
| 46 | + |
| 47 | +A task data-flow dependency represents a data dependency between tasks in a task graph. |
| 48 | + |
| 49 | +Conceptually, the task graph maintains a set of data objects, each of which represents the flow of a |
| 50 | +single typed value from one source to one or more destinations. |
| 51 | + |
| 52 | +##### The endpoints of a data object |
| 53 | + |
| 54 | +The source of a data object is exactly one of the following: |
| 55 | + |
| 56 | +* An external job input provided by the job creator, or |
| 57 | +* An output of a task in the task graph. |
| 58 | + |
| 59 | +The destination(s) of a data object are zero or more of the following: |
| 60 | + |
| 61 | +* Inputs of tasks in the task graph, or |
| 62 | +* The job output. |
| 63 | + |
| 64 | +A data object **may** have multiple destinations, enabling fan-out from a single source. |
| 65 | + |
| 66 | +##### Task-level data-flow dependencies |
| 67 | + |
| 68 | +Within the task graph, a data object implies zero or more task data-flow dependencies.
| 69 | + |
| 70 | +A task data-flow dependency chains a task output of a parent task (the data source) to a task input |
| 71 | +of a child task (the data destination). |
| 72 | + |
| 73 | +The type of the task output and the type of the task input **must** match exactly. |
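
As an illustration of the exact-match rule, the sketch below checks a single task data-flow dependency against the declared types of its two endpoints. `TaskSignature`, `DataFlowDependency`, and `check_dependency` are hypothetical names, and representing types as plain strings compared by equality is an assumption. The specification does not say how Spider encodes or compares types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSignature:
    """Hypothetical record of a task's positional, typed inputs and outputs."""
    input_types: tuple
    output_types: tuple


@dataclass(frozen=True)
class DataFlowDependency:
    """Chains output `output_index` of `parent` to input `input_index` of `child`."""
    parent: str
    output_index: int
    child: str
    input_index: int


def check_dependency(signatures, dep):
    """Raise if an endpoint position does not exist or the endpoint types differ."""
    out_types = signatures[dep.parent].output_types
    in_types = signatures[dep.child].input_types
    if not 0 <= dep.output_index < len(out_types):
        raise IndexError(f"{dep.parent} has no output {dep.output_index}")
    if not 0 <= dep.input_index < len(in_types):
        raise IndexError(f"{dep.child} has no input {dep.input_index}")
    out_type = out_types[dep.output_index]
    in_type = in_types[dep.input_index]
    if out_type != in_type:  # exact match only, no implicit conversion
        raise TypeError(
            f"{dep.parent}[{dep.output_index}] is {out_type}, "
            f"but {dep.child}[{dep.input_index}] expects {in_type}"
        )


# Hypothetical tasks: `parse` produces (int, str); `total` consumes (int,).
signatures = {
    "parse": TaskSignature(input_types=(), output_types=("int", "str")),
    "total": TaskSignature(input_types=("int",), output_types=("int",)),
}
check_dependency(signatures, DataFlowDependency("parse", 0, "total", 0))  # passes
```

Chaining `parse` output 1 (`str`) to `total` input 0 (`int`) would raise `TypeError` in this sketch.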
| 74 | + |
| 75 | +**Constraints on task inputs** |
| 76 | + |
| 77 | +* Input tasks: |
| 78 | + * All task inputs must be provided at job creation time. |
| 79 | + * Each task input corresponds to a data object whose source is an external job input. |
| 80 | + * Input tasks must not depend on the output of any other task. |
| 81 | +* Non-input tasks: |
| 82 | + * Every task input must be chained to exactly one output of another task. |
| 83 | + * Each task input corresponds to a data object whose source is a task output. |
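
The "exactly one output" rule for non-input tasks can be checked by counting how many dependencies target each input position. The sketch below assumes a hypothetical encoding in which each dependency is a `(parent, output_index, child, input_index)` tuple and `input_counts` maps each non-input task to its number of inputs. Neither is prescribed by this specification.

```python
from collections import Counter


def invalid_inputs(input_counts, dependencies):
    """Return (task, input_index) positions not chained to exactly one output.

    `input_counts` maps each non-input task ID to its number of inputs.
    `dependencies` holds (parent, output_index, child, input_index) tuples.
    Both encodings are assumptions made for this sketch.
    """
    chained = Counter(
        (child, input_index)
        for _parent, _output_index, child, input_index in dependencies
    )
    return sorted(
        (task, index)
        for task, count in input_counts.items()
        for index in range(count)
        if chained[(task, index)] != 1  # unchained (0) or over-chained (>1)
    )


# Task "b" declares two inputs, but only input 0 is chained.
missing = invalid_inputs({"b": 2}, [("a", 0, "b", 0)])
```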
| 84 | + |
| 85 | +**Constraints on task outputs** |
| 86 | + |
| 87 | +* A task output may be chained to one or more task inputs.
| 88 | + * In this case, the task output serves as the source of a data object consumed by one or more |
| 89 | + downstream task inputs. |
| 90 | +* A task output may be dangling. |
| 91 | + * A dangling output is the source of a data object with no task-level destinations and may |
| 92 | + optionally be designated as a job output. |
| 93 | + |
| 94 | +#### Control-flow dependencies |
| 95 | + |
| 96 | +A task control-flow dependency represents an execution ordering constraint between tasks and is |
| 97 | +derived from task data-flow dependencies. |
| 98 | + |
| 99 | +* If there exists a task data-flow dependency from an output of task **A** to an input of task |
| 100 | + **B**, then: |
| 101 | + * **A** is a *parent* of **B**, and |
| 102 | + * **B** is a *child* of **A**. |
| 103 | +* The set of all parent–child relationships defines the **directed edges** of the task graph. |
| 104 | + |
| 105 | +Control-flow dependencies are not defined independently; they are fully implied by data-flow |
| 106 | +dependencies. |
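
Because control-flow dependencies are implied by data-flow dependencies, the edge set can be computed by projecting each data-flow dependency onto its two tasks. A minimal sketch, assuming each dependency is encoded as a hypothetical `(parent, output_index, child, input_index)` tuple:

```python
def derive_edges(dependencies):
    """Return the (parent, child) edges implied by data-flow dependencies.

    Each dependency is a (parent, output_index, child, input_index) tuple,
    an encoding assumed for this sketch. Several data-flow dependencies
    between the same two tasks collapse into a single edge.
    """
    return {(parent, child) for parent, _out, child, _in in dependencies}


dependencies = [
    ("a", 0, "b", 0),
    ("a", 1, "b", 1),  # second dependency between the same pair of tasks
    ("b", 0, "c", 0),
]
edges = derive_edges(dependencies)
```

Returning a set reflects that the parent and child relationship is between tasks rather than between individual outputs and inputs.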
| 107 | + |
| 108 | +### Execution implication |
| 109 | + |
| 110 | +In Spider, a task is eligible to be scheduled for execution if and only if **at least one** of the following
| 111 | +conditions holds: |
| 112 | + |
| 113 | +* The task is an input task, or |
| 114 | +* All parent tasks of the task have completed successfully. |
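
The eligibility rule can be sketched as a single check over a task's parents. An input task has no parents, so "all parents have completed successfully" holds vacuously, and one expression covers both conditions. The `parents` mapping, the `succeeded` set, and the function names are assumptions for this sketch. Skipping tasks that have already succeeded in `eligible_tasks` is an added convenience, not part of the rule above.

```python
def is_eligible(task, parents, succeeded):
    """True if every parent of `task` has completed successfully.

    `parents` maps a task ID to the set of its parent task IDs. For an
    input task that set is empty, so `all(...)` over it is True.
    """
    return all(parent in succeeded for parent in parents[task])


def eligible_tasks(parents, succeeded):
    """Return eligible tasks, skipping those that already succeeded (an assumption)."""
    return {
        task
        for task in parents
        if task not in succeeded and is_eligible(task, parents, succeeded)
    }


# Hypothetical graph: a -> b, a -> c, b -> c
parents = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
first_wave = eligible_tasks(parents, succeeded=set())
```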
| 115 | + |
| 116 | +## Implementation requirements |
| 117 | + |
| 118 | +:::{warning} 🚧 This section is still under construction. ::: |