Commit a471faf

docs(huntsman): Add specification for task graph design. (#277)
Co-authored-by: sitao <sitao.wang@mail.utoronto.ca>
1 parent 91cf88f commit a471faf

2 files changed

+126
-0
lines changed

docs/huntsman/src/dev-docs/index.md

Lines changed: 8 additions & 0 deletions
@@ -12,10 +12,18 @@ Storage
1212
^^^
1313
Spider's storage system.
1414
:::
15+
16+
:::{grid-item-card}
17+
:link: task-graph
18+
Task graph
19+
^^^
20+
Spider's task graph design.
21+
:::
1522
::::
1623

1724
:::{toctree}
1825
:hidden:
1926

2027
storage
28+
task-graph
2129
:::
Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
1+
# Task graph
2+
3+
In Spider, a task graph serves as the underlying representation of a job. It is a directed acyclic
4+
graph (DAG) that captures a collection of tasks and the dependency relationships among them.
5+
6+
This document specifies the design and semantics of the task graph.
7+
8+
## Specification
9+
10+
### Task
11+
12+
A task is a vertex in the task graph. Each task contains metadata required for execution, including:
13+
14+
* TDL package name: The identifier of the TDL package that contains the task function
15+
implementation.
16+
* Task function name: The name of the function that implements the task's logic.
17+
* Other metadata, such as the maximum number of retries allowed.
18+
19+
Tasks are classified based on their position in the graph:
20+
21+
* Input task: A task with no parent tasks. It serves as a starting point for execution.
22+
* Output task: A task with no child tasks. It represents a terminal point in the execution.
23+
* Intermediate task: A task that has both parent and child tasks.
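
The classification above can be sketched as a small function. This is a minimal illustration, not Spider's implementation: the `classify` helper, the string labels, and the dictionary-based adjacency maps are all assumptions made for the example. Note that, by the definitions as written, a task with neither parents nor children satisfies both the input and the output definition, so the sketch returns a set of labels.

```python
def classify(task_id, parents, children):
    """Return the set of position-based labels that apply to a task.

    `parents` and `children` are hypothetical adjacency maps from a task ID
    to the set of its parent or child task IDs.
    """
    has_parents = bool(parents.get(task_id))
    has_children = bool(children.get(task_id))
    if has_parents and has_children:
        return {"intermediate"}
    labels = set()
    if not has_parents:
        labels.add("input")   # no parent tasks
    if not has_children:
        labels.add("output")  # no child tasks
    return labels


# A three-task chain a -> b -> c, plus an isolated task "solo".
parents = {"b": {"a"}, "c": {"b"}}
children = {"a": {"b"}, "b": {"c"}}
for task_id in ("a", "b", "c", "solo"):
    print(task_id, sorted(classify(task_id, parents, children)))
```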
24+
25+
### Task inputs
26+
27+
Each task defines a finite, ordered list of task inputs.
28+
29+
* Task inputs are positional and typed.
30+
* Each input position expects exactly one instance of the declared type.
31+
32+
### Task outputs
33+
34+
Each task defines a finite, ordered list of task outputs.
35+
36+
* Task outputs are positional and typed.
37+
* Each output position produces at most one instance of the declared type upon successful task
38+
execution.
39+
40+
### Task dependencies
41+
42+
There are two types of dependencies between tasks in the task graph: data-flow dependencies and
43+
control-flow dependencies.
44+
45+
#### Data and data-flow dependencies
46+
47+
A task data-flow dependency represents a data dependency between tasks in a task graph.
48+
49+
Conceptually, the task graph maintains a set of data objects, each of which represents the flow of a
50+
single typed value from one source to one or more destinations.
51+
52+
##### The endpoints of a data object
53+
54+
The source of a data object is exactly one of the following:
55+
56+
* An external job input provided by the job creator, or
57+
* An output of a task in the task graph.
58+
59+
The destination(s) of a data object are zero or more of the following:
60+
61+
* Inputs of tasks in the task graph, or
62+
* The job output.
63+
64+
A data object **may** have multiple destinations, enabling fan-out from a single source.
65+
66+
##### Task-level data-flow dependencies
67+
68+
Within the task graph, a data object implies one or more task data-flow dependencies.
69+
70+
A task data-flow dependency chains a task output of a parent task (the data source) to a task input
71+
of a child task (the data destination).
72+
73+
The type of the task output and the type of the task input **must** match exactly.
74+
75+
**Constraints on task inputs**
76+
77+
* Input tasks:
78+
* All task inputs must be provided at job creation time.
79+
* Each task input corresponds to a data object whose source is an external job input.
80+
* Input tasks must not depend on the output of any other task.
81+
* Non-input tasks:
82+
* Every task input must be chained to exactly one output of another task.
83+
* Each task input corresponds to a data object whose source is a task output.
84+
85+
**Constraints on task outputs**
86+
87+
* A task output may be chained to one or more task inputs.
88+
* In this case, the task output serves as the source of a data object consumed by one or more
89+
downstream task inputs.
90+
* A task output may be dangling.
91+
* A dangling output is the source of a data object with no task-level destinations and may
92+
optionally be designated as a job output.
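
The constraints above lend themselves to a validation pass over the list of task data-flow dependencies. The following is a sketch under stated assumptions: `TaskSpec`, `Dependency`, and `validate` are hypothetical names, types are represented as plain strings, and a task is treated as a non-input task when it appears as the child of at least one dependency. Task outputs are not checked, since a dangling output is permitted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    input_types: tuple   # declared type of each input position
    output_types: tuple  # declared type of each output position


@dataclass(frozen=True)
class Dependency:
    parent: str        # task that produces the value
    output_index: int  # position in the parent's output list
    child: str         # task that consumes the value
    input_index: int   # position in the child's input list


def validate(tasks, deps):
    """Return a list of constraint violations (empty if none were found)."""
    errors = []
    sources = Counter()  # (child, input_index) -> number of chained outputs
    for d in deps:
        out_type = tasks[d.parent].output_types[d.output_index]
        in_type = tasks[d.child].input_types[d.input_index]
        if out_type != in_type:  # types must match exactly
            errors.append(
                f"{d.parent}[{d.output_index}] -> {d.child}[{d.input_index}]: "
                f"type mismatch ({out_type} vs {in_type})"
            )
        sources[(d.child, d.input_index)] += 1
    for (child, index), count in sorted(sources.items()):
        if count > 1:  # each input is chained to exactly one output
            errors.append(f"{child} input {index} has {count} sources")
    for child in sorted({d.child for d in deps}):
        for index in range(len(tasks[child].input_types)):
            if sources[(child, index)] == 0:  # non-input task, unchained input
                errors.append(f"{child} input {index} is not chained to a task output")
    return errors


tasks = {
    "parse": TaskSpec(input_types=("str",), output_types=("int",)),
    "add": TaskSpec(input_types=("int", "int"), output_types=("int",)),
}
print(validate(tasks, [Dependency("parse", 0, "add", 0)]))
```

In this usage example, `add` is a non-input task with only its first input chained, so the sketch reports its second input as a violation.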
93+
94+
#### Control-flow dependencies
95+
96+
A task control-flow dependency represents an execution ordering constraint between tasks and is
97+
derived from task data-flow dependencies.
98+
99+
* If there exists a task data-flow dependency from an output of task **A** to an input of task
100+
**B**, then:
101+
* **A** is a *parent* of **B**, and
102+
* **B** is a *child* of **A**.
103+
* The set of all parent–child relationships defines the **directed edges** of the task graph.
104+
105+
Control-flow dependencies are not defined independently; they are fully implied by data-flow
106+
dependencies.
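
Because control-flow dependencies are implied by data-flow dependencies, the graph's edges can be computed rather than stored. The sketch below assumes a hypothetical tuple layout of `(parent, output_index, child, input_index)` for each task data-flow dependency. Since the edges are defined as a set of parent-child relationships, several data-flow dependencies between the same pair of tasks are treated here as a single edge.

```python
from collections import defaultdict


def derive_edges(deps):
    """Collapse data-flow dependencies into a set of (parent, child) edges."""
    return {(parent, child) for parent, _out, child, _in in deps}


def adjacency(edges):
    """Build parent and child lookup maps from the derived edges."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)
    return dict(parents), dict(children)


# Two outputs of "split" feed two inputs of "merge"; "merge" feeds "report".
deps = [
    ("split", 0, "merge", 0),
    ("split", 1, "merge", 1),
    ("merge", 0, "report", 0),
]
edges = derive_edges(deps)
parents, children = adjacency(edges)
print(sorted(edges))
```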
107+
108+
### Execution implication
109+
110+
In Spider, a task is eligible to be scheduled for execution if and only if **at least one** of the following
111+
conditions holds:
112+
113+
* The task is an input task, or
114+
* All parent tasks of the task have completed successfully.
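
The eligibility rule can be expressed as a single predicate, because an input task has no parents and therefore satisfies "all parents have completed successfully" vacuously. The sketch below is illustrative only: `is_eligible`, `ready_tasks`, and the `succeeded` set are hypothetical names, and excluding tasks that have already succeeded is an assumption of the example rather than something the specification states.

```python
def is_eligible(task_id, parents, succeeded):
    """True if every parent of the task has completed successfully.

    For an input task the parent set is empty, so this is trivially true.
    """
    return all(parent in succeeded for parent in parents.get(task_id, ()))


def ready_tasks(all_tasks, parents, succeeded):
    """List eligible tasks that have not themselves succeeded yet (assumption)."""
    return sorted(
        task_id
        for task_id in all_tasks
        if task_id not in succeeded and is_eligible(task_id, parents, succeeded)
    )


# Diamond-shaped graph: a -> b, a -> c, and both b and c -> d.
all_tasks = {"a", "b", "c", "d"}
parents = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(ready_tasks(all_tasks, parents, succeeded=set()))
print(ready_tasks(all_tasks, parents, succeeded={"a"}))
```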
115+
116+
## Implementation requirements
117+
118+
:::{warning} 🚧 This section is still under construction. :::
