
Commit bdac6cb

docs: concepts: DataFlow
Signed-off-by: John Andersen <[email protected]>
1 parent e83ce3f commit bdac6cb

File tree

5 files changed: +790 -0 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Shouldi got an operation to run npm-audit on JavaScript code
 - Docstrings and doctestable examples for `record.py` (features and evaluated)
 - Simplified model API with SimpleModel
+- Documentation on how DataFlows work conceptually.
 ### Changed
 - Restructured contributing documentation
 - Use randomly generated data for scikit tests

docs/concepts/dataflow.rst

Lines changed: 270 additions & 0 deletions

@@ -0,0 +1,270 @@
DataFlows
=========

A running DataFlow is an event loop. First we'll look at the terminology
associated with DataFlows. Then we'll go through the sequence of events that
constitute the running of a DataFlow. Lastly we'll go over the benefits of
using DataFlows.

Terminology
-----------

- :py:class:`Operation <dffml.df.types.Operation>`

  - Things that will happen when the DataFlow is running. They define inputs
    and outputs. Inputs are the data they require to run, and outputs are the
    data they produce as a result.

  - Similar to a function prototype in C, an
    :py:class:`Operation <dffml.df.types.Operation>` only contains metadata.

- :py:class:`OperationImplementation <dffml.df.base.OperationImplementation>`

  - The implementation of an :py:class:`Operation <dffml.df.types.Operation>`.
    This is the code that gets run when we talk about "running an operation".

  - A Python function can be an
    :py:class:`OperationImplementation <dffml.df.base.OperationImplementation>`
    (see the sketch after this list).

- :py:class:`Input <dffml.df.types.Input>`

  - Data that will be given to an
    :py:class:`Operation <dffml.df.types.Operation>` when it runs.

- :py:class:`DataFlow <dffml.df.types.DataFlow>`

  - Description of how :py:class:`Operations <dffml.df.types.Operation>` are
    connected.

  - Defines where :py:class:`Operations <dffml.df.types.Operation>` should get
    their inputs from.

  - Inputs can be received from the outputs of other operations, predefined
    ``seed`` values, or anywhere else.

- :py:class:`Orchestrator <dffml.df.base.BaseOrchestrator>`

  - The runner of the DataFlow. Facilitates the running of operations and
    manages input data.

  - The :py:class:`Orchestrator <dffml.df.base.BaseOrchestrator>` makes use of
    four different "Networks" and a
    :py:class:`RedundancyChecker <dffml.df.base.BaseRedundancyChecker>`.

    - The :py:class:`InputNetwork <dffml.df.base.BaseInputNetwork>` stores all
      the (:py:class:`Input <dffml.df.types.Input>`) data. It accepts incoming
      data and notifies the
      :py:class:`Orchestrator <dffml.df.base.BaseOrchestrator>` when there is
      new data.

    - The :py:class:`OperationNetwork <dffml.df.base.BaseOperationNetwork>`
      stores all :py:class:`Operations <dffml.df.types.Operation>` the
      :py:class:`Orchestrator <dffml.df.base.BaseOrchestrator>` knows about.

    - The :py:class:`OperationImplementationNetwork <dffml.df.base.BaseOperationImplementationNetwork>`
      is responsible for running an
      :py:class:`Operation <dffml.df.types.Operation>` with a set of
      :py:class:`Inputs <dffml.df.types.Input>`. A unique set of
      :py:class:`Inputs <dffml.df.types.Input>` for an
      :py:class:`Operation <dffml.df.types.Operation>` is known as a
      :py:class:`ParameterSet <dffml.df.base.BaseParameterSet>`.

    - The :py:class:`LockNetwork <dffml.df.base.BaseLockNetwork>` manages
      locking of :py:class:`Inputs <dffml.df.types.Input>`. This is used when
      the :py:class:`Definition <dffml.df.types.Definition>` of the data type
      of an :py:class:`Input <dffml.df.types.Input>` declares that it may only
      be used when locked.

    - The :py:class:`RedundancyChecker <dffml.df.base.BaseRedundancyChecker>`
      ensures that :py:class:`Operations <dffml.df.types.Operation>` don't get
      run with the same
      :py:class:`ParameterSet <dffml.df.base.BaseParameterSet>` more than
      once.

- :py:class:`Operations <dffml.df.types.Operation>` get their inputs from the
  outputs of other :py:class:`Operations <dffml.df.types.Operation>` within
  the same :py:class:`InputSetContext <dffml.df.base.BaseInputSetContext>`.
  :py:class:`InputSetContexts <dffml.df.base.BaseInputSetContext>` create
  barriers which prevent :py:class:`Inputs <dffml.df.types.Input>` within one
  context from being combined with :py:class:`Inputs <dffml.df.types.Input>`
  within another context.

  .. Not sure if we want this example here, no other bullet points have examples.

  In the :doc:`/usage/integration` example use case, there is a DataFlow
  which collects information on a Git repo. Each URL is used as a context, as
  well as an :py:class:`Input <dffml.df.types.Input>`. By using the URL as a
  context we ensure all
  :py:class:`ParameterSets <dffml.df.base.BaseParameterSet>` created only
  contain inputs associated with their URL. For example, this prevents commit
  hashes extracted from a downloaded repository from being used as an
  :py:class:`Input <dffml.df.types.Input>` in a
  :py:class:`ParameterSet <dffml.df.base.BaseParameterSet>` where the
  directory of downloaded source code contains the code downloaded from a
  different URL.
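
To make the terminology concrete, here is a minimal sketch of a Python
function serving as an
:py:class:`OperationImplementation <dffml.df.base.OperationImplementation>`.
The ``calc_hypotenuse`` function and its
:py:class:`Definitions <dffml.df.types.Definition>` are made up for this
illustration; the :py:func:`op <dffml.df.base.op>` decorator and
:py:class:`Definition <dffml.df.types.Definition>` come from DFFML.

.. code-block:: python

    from dffml.df.base import op
    from dffml.df.types import Definition

    # Hypothetical Definitions: named, typed descriptions of the data an
    # Operation consumes and produces.
    SideA = Definition(name="side_a", primitive="float")
    SideB = Definition(name="side_b", primitive="float")
    Hypotenuse = Definition(name="hypotenuse", primitive="float")

    # The @op decorator builds the Operation (metadata only: input and
    # output names mapped to Definitions) and registers this function as
    # its OperationImplementation.
    @op(
        inputs={"a": SideA, "b": SideB},
        outputs={"c": Hypotenuse},
    )
    async def calc_hypotenuse(a: float, b: float) -> dict:
        # Returned values are matched to the output Definitions by key.
        return {"c": (a ** 2 + b ** 2) ** 0.5}
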
What Happens When A DataFlow Runs
---------------------------------

When the :py:class:`Orchestrator <dffml.df.base.BaseOrchestrator>` starts
running a DataFlow, the following sequence of events takes place.

- The :py:class:`OperationImplementationNetwork <dffml.df.base.BaseOperationImplementationNetwork>`
  instantiates all of the
  :py:class:`OperationImplementations <dffml.df.base.OperationImplementation>`
  that are needed by the DataFlow.

- Our first stage is the ``Processing Stage``, where data will be generated.

- The :py:class:`Orchestrator <dffml.df.base.BaseOrchestrator>` kicks off any
  contexts that were given to the
  :py:class:`run <dffml.df.base.BaseOrchestratorContext.run>` method along
  with the inputs for each context.

- All ``seed`` :py:class:`Inputs <dffml.df.types.Input>` are added to each
  context.

- All inputs for each context are added to the
  :py:class:`InputNetwork <dffml.df.base.BaseInputNetwork>`. This is the ``New
  Inputs`` step in the flow chart below.

- The :py:class:`OperationNetwork <dffml.df.base.BaseOperationNetwork>` looks
  at what inputs just arrived. It ``determines which Operations may have new
  parameter sets``. If an :py:class:`Operation <dffml.df.types.Operation>`
  has inputs whose possible origins include the origin of one of the inputs
  which just arrived, then it may have a new
  :py:class:`ParameterSet <dffml.df.base.BaseParameterSet>`.

- We ``generate Operation parameter set pairs`` by checking if there are any
  new permutations of :py:class:`Inputs <dffml.df.types.Input>` for an
  :py:class:`Operation <dffml.df.types.Operation>`. If the
  :py:class:`RedundancyChecker <dffml.df.base.BaseRedundancyChecker>` has no
  record of that permutation being run, we create a new
  :py:class:`ParameterSet <dffml.df.base.BaseParameterSet>` composed of those
  :py:class:`Inputs <dffml.df.types.Input>`.

- We ``dispatch operations for running`` which have new
  :py:class:`ParameterSets <dffml.df.base.BaseParameterSet>`.

- The :py:class:`LockNetwork <dffml.df.base.BaseLockNetwork>` locks any
  :py:class:`Inputs <dffml.df.types.Input>` which can't have multiple
  operations use them at the same time.

- The :py:class:`OperationImplementationNetwork <dffml.df.base.BaseOperationImplementationNetwork>`
  ``runs each operation using given parameter set as inputs``.

- The outputs of the :py:class:`Operation <dffml.df.types.Operation>` are
  added to the :py:class:`InputNetwork <dffml.df.base.BaseInputNetwork>` and
  the loop repeats.

- Once there are no more :py:class:`Operation <dffml.df.types.Operation>`
  :py:class:`ParameterSet <dffml.df.base.BaseParameterSet>` pairs which the
  :py:class:`RedundancyChecker <dffml.df.base.BaseRedundancyChecker>` knows
  to be unique, the ``Cleanup Stage`` begins.

- The ``Cleanup Stage`` contains operations which will release any underlying
  resources allocated for :py:class:`Inputs <dffml.df.types.Input>` generated
  during the ``Processing Stage``.

- Finally the ``Output Stage`` runs.
  :py:class:`Operations <dffml.df.types.Operation>` running in this stage
  query the :py:class:`InputNetwork <dffml.df.base.BaseInputNetwork>` to
  organize the data within it into the user's desired output format.
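
To see this sequence from the caller's side, here is a rough sketch which
wires the hypothetical ``calc_hypotenuse`` operation from the earlier sketch
into a DataFlow and runs it with the in-memory
:py:class:`Orchestrator <dffml.df.base.BaseOrchestrator>`. ``GetSingle`` is
an ``Output Stage`` operation which queries the
:py:class:`InputNetwork <dffml.df.base.BaseInputNetwork>`; the exact method
signatures here are assumptions, not a definitive recipe.

.. code-block:: python

    import asyncio

    from dffml.df.types import DataFlow, Input
    from dffml.df.memory import MemoryOrchestrator
    from dffml.operation.output import GetSingle

    # Hypothetical DataFlow connecting calc_hypotenuse (from the earlier
    # sketch) to the GetSingle Output Stage operation.
    dataflow = DataFlow.auto(calc_hypotenuse, GetSingle)
    # A seed Input, added to every context: it tells GetSingle which
    # Definition's value to pull out of the InputNetwork.
    dataflow.seed.append(
        Input(
            value=[Hypotenuse.name],
            definition=GetSingle.op.inputs["spec"],
        )
    )

    async def main():
        async with MemoryOrchestrator.withconfig({}) as orchestrator:
            async with orchestrator(dataflow) as octx:
                # One context, keyed by the string "triangle". Its Inputs
                # are the "New Inputs" of the flow chart below.
                async for ctx, results in octx.run(
                    {
                        "triangle": [
                            Input(value=3.0, definition=SideA),
                            Input(value=4.0, definition=SideB),
                        ]
                    }
                ):
                    print(ctx, results)  # results: {'hypotenuse': 5.0}

    asyncio.run(main())
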
.. TODO Auto generate this

   graph TD

     inputs[New Inputs]
     operations[Operations]
     opimps[Operation Implementations]

     ictx[Input Network]
     opctx[Operation Network]
     opimpctx[Operation Implementation Network]
     rctx[Redundancy Checker]
     lctx[Lock Network]

     opctx_operations[Determine which Operations may have new parameter sets]
     ictx_gather_inputs[Generate Operation parameter set pairs]
     opimpctx_dispatch[Dispatch operation for running]
     opimpctx_run_operation[Run an operation using given parameter set as inputs]

     inputs --> ictx

     operations -->|Register With| opctx
     opimps -->|Register With| opimpctx

     ictx --> opctx_operations
     opctx --> opctx_operations

     opctx_operations --> ictx_gather_inputs
     ictx_gather_inputs --> rctx
     rctx --> |If operation has not been run with given parameter set before| opimpctx_dispatch

     opimpctx_dispatch --> opimpctx

     opimpctx --> lctx

     lctx --> |Lock any inputs that can't be used at the same time| opimpctx_run_operation

     opimpctx_run_operation --> |Outputs of Operation become inputs to other operations| inputs

.. image:: /images/dataflow_diagram.svg
   :alt: Flow chart showing how the DataFlow Orchestrator works
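
The ``Lock any inputs that can't be used at the same time`` edge in the chart
is driven by the :py:class:`Definition <dffml.df.types.Definition>` of each
:py:class:`Input <dffml.df.types.Input>`. A minimal sketch, assuming
:py:class:`Definition <dffml.df.types.Definition>`'s ``lock`` keyword
argument (the ``git_repo_checkout`` name is made up for illustration):

.. code-block:: python

    from dffml.df.types import Definition

    # Hypothetical Definition for a resource that must only be touched by
    # one operation at a time. Because lock=True, the LockNetwork acquires
    # a lock on any Input of this Definition before running an operation
    # that uses it.
    GitRepoCheckout = Definition(
        name="git_repo_checkout", primitive="str", lock=True
    )
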
Benefits of DataFlows
---------------------

- Modularity

  - Adding a layer of abstraction to separate the operations from their
    implementations means we focus on the logic of the application rather
    than how it's implemented.

  - Implementations are easily unit testable. They can be swapped out for
    another implementation with similar functionality. For example, if you
    had a "send email" operation you could swap the implementation from
    sending via your email server to sending via a third party service.

- Visibility

  - Inputs are tracked to understand where they came from and/or what
    sequence of operations generated them.

  - DataFlows can be visualized to understand where inputs can come from.
    What you see is what you get. Diagrams showing how your application works
    in your documentation will never get out of sync.

- Ease of use

  - Execute code concurrently with managed locking of
    :py:class:`Inputs <dffml.df.types.Input>` which require locks to be used
    safely in a concurrent environment.

    - If a resource can only be used by one operation at a time, the writer
      of the operation doesn't need to concern themselves with how to guard
      against unknown user defined operations clobbering it. The
      :py:class:`Orchestrator <dffml.df.base.BaseOrchestrator>` manages
      locking.

  - As DFFML is plugin based, this enables developers to easily write and
    publish operations without users having to worry about how various
    operations will interact with each other.

  - DataFlows can be used in many environments. They are a generic way to
    describe application logic and are not tied to any particular programming
    language (currently we only have an implementation for Python, but we
    provide multiple deployment options).

- Security

  - Clear trust boundaries via :py:class:`Input <dffml.df.types.Input>`
    origins and built in input validation enable developers to ensure that
    untrusted inputs are properly validated.

  - DataFlows are a serializable, programming language agnostic concept which
    can be validated according to any set of custom rules (see the sketch
    below).
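
As a sketch of that serializability, a DataFlow can be exported to a plain
dictionary and dumped as JSON. The ``export`` method here is an assumption
based on current DFFML, and ``dataflow`` is the hypothetical object from the
earlier sketches.

.. code-block:: python

    import json

    # Serialize the hypothetical dataflow from the earlier sketch. The
    # export contains only metadata: Operations, Definitions, and the
    # linkages between them, never the implementation code itself.
    print(json.dumps(dataflow.export(), indent=4, default=str))

The exported document can then be diffed, linted, or validated against any
set of custom rules, and (in current DFFML) commands such as ``dffml dataflow
diagram`` consume this serialized form to render flow charts like the one
above.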

docs/concepts/index.rst

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
Concepts
========

Here we explain the main concepts of DFFML: how things work, and the
philosophies behind why they work the way they do. If anything here is
unclear, or you think there's a more user friendly way to do something,
please let us know. See the :doc:`contact` page for how to reach us.

.. toctree::
    :glob:
    :maxdepth: 2

    dataflow
