-
Notifications
You must be signed in to change notification settings - Fork 290
Progress tracking
The main feature timely dataflow provides, in addition to moving data along dataflow edges, is the indication of progress through the stream of data.
The system alerts each recipient of timestamped data once it can determine that some previously active timestamp will never be seen again. This allows the operator to react with final messages and actions, for example sending aggregates, flushing state, or releasing held resources. The same mechanisms also alert the user as output data emerge from the dataflow graph.
We first need to develop timely dataflow's progress vocabulary.
A timely dataflow graph consists of operators, whose inputs and outputs are connected by channels. Each operator input has one associated channel, but an operator output may have many channels leading from it: multiple other operators may want to see the data it produces. Each group of records transmitted in timely dataflow have an associated timestamp.
Timely dataflow is responsible for producing information that would let each operator answer the question
Will I ever see data with this specific timestamp on this specific input of mine?
To provide this information, timely dataflow imposes some constraints on the structure of the dataflow graph and the behavior of operators.
The high-level constraint that timely dataflow imposes is that that there should be no cycles in the timely dataflow graph which permit data on some channel to eventually produce data on that channel with the same timestamp. This will let the system reason that once some timestamp has gone away, it will not be seen again.
This property can be difficult to reason about directly, so instead we will impose simpler, local constraints on the graph structure and operator behavior, and argue that these constraints imply the desired global property.
With one exception, operators may only be constructed from the full set of channels connected to their inputs, and their outputs are not available until this happens.
This constraint by itself would ensure that dataflow graphs are acyclic, as we could totally order the operators by their construction time, and edges would not be able to go backwards along this order.