Replies: 5 comments 8 replies
-
I understand your concern. I think the thing is that what you're proposing is a big enough change that it would need us to redesign a lot of things, unfortunately, and I don't feel like an overall refactor is what we are looking for at the moment.
-
I like the idea of referencing/including another dataflow file. I see two possible designs:
-
This is something that we already discussed in the context of automated checks and schema specifications. Our plan is that each node specifies some schema information for its outputs to make the output types explicit. Ideally, this would be automatically verified against the actual executable in some way. By providing schema and output information, it will become easier to reuse existing nodes, e.g. from the node hub. For this, each node should be distributed with a node declaration file similar to the one you proposed here. Making the build/run/outputs fields optional in the dataflow config file seems like the logical next step then. If the node already specifies these, there is no need to specify this info again.
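As a rough illustration (this is a hypothetical sketch; the field names and schema notation are assumptions, not a settled format), such a node declaration file with output schema information might look like:

```yaml
# node.yml — hypothetical node declaration file
build: "cargo build --release"
run: "./target/release/my-node"
outputs:
  - image
  - metadata
# possible future extension: per-output schema info,
# so a dataflow referencing this node can be checked automatically
schemas:
  image: "arrow::binary"     # illustrative type annotation
  metadata: "arrow::utf8"
```

A dataflow file that references this node could then omit build/run/outputs, since they are already declared here.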
-
I'm not sure I understand the motivation for the proposed project file. To me, the dataflow specification file already acts as a project file. With proposal 1, we could already include other dataflows, so the "Dora application rely[ing] on another Dora application" would be possible. Then we would not need a dependencies section, would we?
-
Alright, so

```yaml
# the dataflow file
include:
  - dataflow: <file path, url, or git path relative to current dataflow>
    prefix: <optional prefix for the nodes in the dataflow>
    # maybe some mechanism to pass through the env variables
```

```yaml
# the descriptor for the node
build: "<the build command>"
run: "<the run command>"
outputs:
  - output-1
  - output-2
# ...
# more fields for constraints and verifying in the future
```

```yaml
# the dataflow file
- id: node-1
  desc: <the path, url, or git of the node's descriptor>
```

are enough for the requirements and much simpler than the original idea.
-
This discussion is inspired by #1213 and adds more detail about this requirement.

Obviously, the current dataflow YAML describes the process of the whole Dora application crystal clearly. But as the application grows larger, the YAML file will keep expanding and in the end will no longer be readable. Of course, there is a `dataflow-builder` for constructing dataflows with programming languages, but I think it's still limited and not intuitive.

Is the `outputs` field redundant? Comparing with the traditional package paradigm, we find that we need to redeclare a node's outputs every time we use the same node. Could we just place some of the node's declaration in a standalone file?
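To illustrate the duplication (a hypothetical example; node names and paths are made up), the same node used in two dataflows currently has to redeclare its outputs in each file:

```yaml
# dataflow-a.yml
nodes:
  - id: camera
    path: ./camera-node
    outputs:
      - image
      - metadata
```

```yaml
# dataflow-b.yml — the same outputs must be spelled out again
nodes:
  - id: camera
    path: ./camera-node
    outputs:
      - image
      - metadata
```

With a standalone node declaration, both dataflows could simply reference it.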
This is an extension of points 1 and 2.

Currently, nodes can come from various sources: a local path, a URL, or git. Imagine a Dora application relying on another Dora application; there would be many repeated characters just for navigating to the nodes in the dependent application. Introducing a new Project (or Workspace) unit seems like it could mitigate this.

A draft idea about a project descriptor,

and in the dataflow,

The exact content of the `name` field is resolved from the project descriptor file in the ancestor directories of the current working directory, or from a user-customized project descriptor file. (The project descriptor file is, to some degree, like Cargo.toml for Rust.)

Auto-completion while writing a node's inputs field becomes possible, because the outputs of the node are already well defined in the project descriptor.
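As a rough illustration of the idea (all field names here are hypothetical assumptions, not a proposed final format), a project descriptor and its use in a dataflow might look like:

```yaml
# dora-project.yml — hypothetical project descriptor,
# resolved from ancestor directories, like Cargo.toml
name: my-app
nodes:
  - name: camera
    build: "cargo build --release"
    run: ./target/release/camera
    outputs:
      - image
dependencies:
  other-app:
    git: https://example.com/other-app.git
```

```yaml
# the dataflow file only references the node by name
nodes:
  - id: cam
    name: camera   # resolved via the project descriptor
    # inputs could then be auto-completed from the declared outputs
```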