
Conversation

@samwaseda (Member)

@samwaseda samwaseda commented Jul 15, 2025

I've started to feel that we should offer the possibility of returning dicts and data classes, simply because they are the de facto standard for input and output parsers.

Before (this works just as well):

def get_output(job_instance):
    ...
    return positions, forces, energy

After:

def get_output(job_instance):
    ...
    return {"positions": x, "forces": f, "energy": E}

And these two should have the same output keys. This would imply that we would allow something like this:

def my_workflow(some_inputs):
    ...
    output_dict = get_output(job)
    my_input = get_input_of_another_job(energy=output_dict["energy"])
    ...

In this case the parsing will be slightly more complicated, but I guess it's still a reasonable amount of work. What do you think of the overall idea? @jan-janssen @liamhuber

Note: This PR only parses get_output; my_workflow as given above would not work yet.
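To make the idea concrete, here is a hedged sketch of the kind of return-statement parsing this implies. This is not semantikon's actual `get_return_expressions`; `get_output_labels` and its rules (dict keys become labels, tuple element names become labels, anything else is a single `"output"`) are hypothetical:

```python
import ast

SOURCE = '''
def get_output(job_instance):
    x, f, E = parse(job_instance)
    return {"positions": x, "forces": f, "energy": E}
'''

def get_output_labels(source: str) -> list[str]:
    """Infer output labels from a function's return statement (sketch)."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    for node in ast.walk(func):
        if isinstance(node, ast.Return):
            v = node.value
            # dict literal with constant keys -> one label per key
            if isinstance(v, ast.Dict) and all(
                isinstance(k, ast.Constant) for k in v.keys
            ):
                return [k.value for k in v.keys]
            # tuple -> one label per element (variable name if available)
            if isinstance(v, ast.Tuple):
                return [
                    e.id if isinstance(e, ast.Name) else "output"
                    for e in v.elts
                ]
            return ["output"]
    return []

print(get_output_labels(SOURCE))  # ['positions', 'forces', 'energy']
```

The point being that the dict-returning and tuple-returning definitions of get_output can yield the same set of output labels.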

@codecov

codecov bot commented Jul 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.20%. Comparing base (d8c08f1) to head (6db698c).
⚠️ Report is 57 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
+ Coverage   96.18%   96.20%   +0.02%     
==========================================
  Files           8        8              
  Lines        1440     1450      +10     
==========================================
+ Hits         1385     1395      +10     
  Misses         55       55              


@samwaseda (Member, Author)

I think the question is important but it's not really urgent, so I'm gonna make it draft.

@samwaseda samwaseda marked this pull request as draft July 15, 2025 16:00
@liamhuber (Member)

I'm a bit confused -- what is a "job" here?

@samwaseda (Member, Author)

I'm a bit confused -- what is a "job" here?

Maybe I should have called it output_file or something. It doesn't really matter as long as the function returns a set of values.

@liamhuber (Member)

Maybe I should have called it output_file or something. It doesn't really matter as long as the function returns a set of values.

Ah, ok.

So do I understand correctly that you really want

def my_workflow(some_inputs):
    ...
    output_dict = get_output(job)
    my_input = get_input_of_another_job(energy=output_dict["energy"])
    ...

To be the right way to write things, regardless of whether get_output uses your first or second definition? If so, I really dislike that. My rationale is that currently the workflows are also just valid python; but here the definition of my_workflow above is only compatible with the dictionary-returning definition of get_output.

@samwaseda (Member, Author)

If so, I really dislike that

From the programming point of view, I don't like it either.

The problem comes rather from reality: I'm having a hard time telling people to write output parsers that return tuples for multiple outputs, because people usually use a dictionary or a data class rather than a tuple. Since tuples and dictionaries are largely equivalent in this regard, I still think it is useful to allow the user to return a dictionary to mean multiple outputs. Alternatively, I'm also fine with not allowing multiple outputs at all, as @jan-janssen prefers. I find it a rather unintuitive constraint that only tuples are allowed for multiple outputs.

@liamhuber (Member)

The problem comes rather from reality: I'm having a hard time telling people to write output parsers that return tuples for multiple outputs, because people usually use a dictionary or a data class rather than a tuple. Since tuples and dictionaries are largely equivalent in this regard, I still think it is useful to allow the user to return a dictionary to mean multiple outputs. ... I find it a rather unintuitive constraint that only tuples are allowed for multiple outputs.

I didn't think we wanted to disallow dictionaries as outputs? I find this totally fine:

def get_output(job_instance):
    ...
    return {"positions": x, "forces": f, "energy": E}

def my_workflow(some_inputs):
    ...
    output_dict = get_output(job)
    my_input = get_input_of_another_job(energy=output_dict["energy"])
    ...

It's this that I am scared of:

def get_output(job_instance):
    ...
    return positions, forces, energy

def my_workflow(some_inputs):
    ...
    output_dict = get_output(job)
    my_input = get_input_of_another_job(energy=output_dict["energy"])
    ...

I think maybe I see where the drive for this is coming from, since in pyiron_workflow I can write

import pyiron_workflow as pwf

@pwf.as_function_node
def get_output(job_instance):
    ...
    return positions, forces, energy

@pwf.as_macro_node
def my_workflow(some_inputs):
    ...
    output = get_output(job)
    my_input = get_input_of_another_job(energy=output.outputs.energy)
    ...

IMO this is fine for pyiron_workflow, since inside the workflow these are all nodes. For semantikon.workflow this is supposed to be like python, so if output_dict is actually a tuple, I don't want to have people calling __getitem__ on it.

My understanding of the semantikon.workflow recipe parsing philosophy was that what is being written should be some restricted (not "modified"!) version of python, such that everything that is written sensibly executes whether the @semantikon.workflow.workflow decorator appears anywhere or not. The decorator is there merely to let us extract a recipe from the python. In this way, output_dict = get_output(job); output_dict["energy"] is anathema if get_output returns a tuple. Maybe we could weasel our way into things if get_output was itself decorated as a @semantikon.workflow.node or something, but now we're headed back in the direction of just writing pyiron_workflow code to begin with.

@liamhuber (Member)

Changing tack a little bit, I do rather like the idea of an @node parsing returned dictionaries into individual output channels. I was at first a little nervous because I was like "but what if the user is trying to return an entire coherent dictionary?!" but then I realized pyiron_workflow has the same problem with tuples, and we already get around it by saying "if you return a tuple but your node decorator has a single output label, we treat the entire tuple as a single return". Of course we could do exactly that same thing with returned dictionaries!
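The "single output label means the whole return is one output" convention described above can be sketched like this (a hypothetical illustration only; `split_outputs` and its signature are made up for this comment, not pyiron_workflow's actual API):

```python
def split_outputs(returned, output_labels):
    """Map a returned value onto declared output labels (sketch).

    Convention being illustrated: if the decorator declares exactly one
    output label, the entire returned tuple or dict is treated as that
    single output; otherwise each key/element becomes its own channel.
    """
    if len(output_labels) == 1:
        # one label -> the whole return is one coherent output
        return {output_labels[0]: returned}
    if isinstance(returned, dict):
        # multiple labels + dict return -> one channel per key
        return {label: returned[label] for label in output_labels}
    # multiple labels + tuple return -> one channel per element
    return dict(zip(output_labels, returned))

print(split_outputs({"x": 1, "y": 2}, ["whole_dict"]))  # {'whole_dict': {'x': 1, 'y': 2}}
print(split_outputs({"x": 1, "y": 2}, ["x", "y"]))      # {'x': 1, 'y': 2}
print(split_outputs((1, 2), ["x", "y"]))                # {'x': 1, 'y': 2}
```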

So I actually rather like that these changes to get_return_expressions exist. In the medium run, I'd like for pyiron_workflow to exploit the function parsing capabilities here, and so being able to parse tuples to output channels and dictionaries to output channels would be super. I think also it would trim down some of the intermediate steps that were necessary to make pyiron_workflow interoperable with the Python Workflow Definition pythonworkflow/python-workflow-definition#111

I.e. I like the power being introduced here, I'm just deeply unconvinced of the proposed plan for exploiting it in @semantikon.workflow.workflow functions.

@liamhuber (Member)

I.e. I like the power being introduced here, I'm just deeply unconvinced of the proposed plan for exploiting it in @semantikon.workflow.workflow functions.

To this end, @samwaseda it may be helpful to me if there were some working example of your intent? The test suite extension here is useful, but doesn't touch the workflow recipe generation.

I checked the branch out and tried things for myself:

from semantikon.workflow import workflow

def outputs_dict(x, y):
    return {"x": x, "y": y}

def outputs_tuple(i, j):
    return i, j

def add(obj, other):
    sum = obj + other
    return sum

# @workflow
# def my_macro(x, y, i, j):
#     out_dict = outputs_dict(x, y)
#     out_i, out_j = outputs_tuple(i, j)
#     use_it = add(out_dict["x"], out_i)
#     return use_it

@workflow
def my_macro(x, y, i, j):
    out_dict = outputs_dict(x, y)
    out_i, out_j = outputs_tuple(i, j)
    return out_dict, out_i, out_j

my_macro._semantikon_workflow

Since the new functionality is not yet propagated to the workflows, on both main and this branch I get the entry

  'outputs_dict_0': {'inputs': {'x': {}, 'y': {}},
   'outputs': {'output': {}},
   'function': <function __main__.outputs_dict(x, y)>,
   'type': 'Function'}},

I guess the idea here is that this should have 'outputs': {'x': {}, 'y': {}}? That's something I can get behind.

From there, I would also be supportive of something like the commented-out workflow, where the tuple-based return and dict-based return both work but the syntax for leveraging them is different. However right now this unsurprisingly gives NotImplementedError: Only variable inputs supported, got: {'_type': 'Subscript', 'value': {'_type': 'Name', 'id': 'out_dict', 'ctx': {'_type': 'Load'}}, 'slice': {'_type': 'Constant', 'value': 'x', 'kind': None}, 'ctx': {'_type': 'Load'}}.

If it did work, I would imagine to see

 'edges': [
   ...
   ('outputs_dict_0.outputs.x', 'add_0.inputs.obj'),
   ('outputs_tuple_0.outputs.i', 'add_0.inputs.other'),
   ...

Now this is a little bit interesting, because it demands that WfMS interpret both tuples and dictionaries as independent outputs when parsing plain functions. I'm OK with this demand, and would like to support it in pyiron_workflow, I just think it's important to clearly note what requirements we're making.

@liamhuber (Member)

Alternatively, I'm also fine with not allowing multiple outputs at all, as @jan-janssen prefers. I find it a rather unintuitive constraint that only tuples are allowed for multiple outputs.

  1. For dataclasses this is not too bad, for dictionaries this just absolutely wrecks type hinting and ontological hinting. With a dictionary return there's no way to provide per-field hints on what goes back, so you just immediately lose this power.

  2. If we only allow single outputs, everyone downstream needs to be set up to consume the entire output. Or we have splitters to split the output (__getitem__, or a dict-to-channels). Since it is very common in workflows to use only some of the output or to split output to different downstream targets, this just seems like it introduces tonnes of unnecessary overhead for the graph compared to simply making each piece of output available individually.

  3. If we somehow disallow non-dict/non-dataclass returns, this would hamstring converting existing libraries to nodes. Many functions already return multiple values, so whatever we do we want to be able to snag and use as many of those as possible as easily as possible.

Multiple outputs per node is a beautiful thing that gives us tonnes of power.
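The splitter overhead from point 2 can be made concrete with a toy sketch (all names here are hypothetical, invented for this comment): with single-output-only nodes, every downstream consumer of one field needs an explicit splitter node in the graph.

```python
def get_output_single(job):
    # a single-output node: the whole parsed result is one coherent value
    return {"positions": [0.0], "forces": [0.0], "energy": -1.0}

def getitem(d, key):
    # an explicit splitter node (dict-to-channel); one extra graph node
    # per field that any downstream node wants to consume
    return d[key]

out = get_output_single(None)
energy = getitem(out, "energy")  # splitter node just to feed one field onward
print(energy)  # -1.0
```

With per-field output channels, `energy` would instead be available directly, with no intermediate node.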

@samwaseda (Member, Author)

samwaseda commented Jul 16, 2025

Ah, ok, now I understand the confusion, sorry :D. For tuples the behavior is the same as before, meaning with

def get_output(job_instance):
    ...
    return positions, forces, energy

you would have to do

def my_workflow(some_inputs):
    ...
    positions, forces, energy = get_output(job)
    my_input = get_input_of_another_job(energy=energy)
    ...

@liamhuber (Member)

Ah, ok, now I understand the confusion, sorry :D

Ahh, ok 😅 Then yes, I quite like the change. My note on our requirements stands, but it's something I think we should be aware of and not something that should stop us moving in this direction:

Now this is a little bit interesting, because it demands that WfMS interpret both tuples and dictionaries as independent outputs when parsing plain functions.

@samwaseda (Member, Author)

Now this is a little bit interesting, because it demands that WfMS interpret both tuples and dictionaries as independent outputs when parsing plain functions.

Yeah the more I think about it, the more I feel it's potentially a super dangerous step, because currently the number of outputs can be specified by the number of variables assigned, i.e.:

def f(x, y):
    return x, y

def my_workflow_1(x, y):
    x, y = f(x, y)
    return x, y

def my_workflow_2(x, y):
    z = f(x, y)
    return z

So in the first case there are two outputs, and in the second case one. But now if we use a dict, the assignment looks the same either way. We can tell only from the subsequent lines whether the dictionary is resolved or not, or in the worst case we might have something like:

def f(x, y):
    return {"x": x, "y": y}

def my_workflow(x, y):
    d = f(x, y)
    a = g(d["x"])
    b = h(d)
    ...

In this case both multiple outputs and single outputs must be simultaneously dealt with. I don't think such a case can be represented straightforwardly...
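The ambiguous case above is at least detectable at parse time. A hedged sketch (hypothetical helper, not semantikon's parser) that walks the AST and reports whether a variable is used both subscripted (implying split outputs) and whole (implying a single output):

```python
import ast

WORKFLOW_SRC = '''
def my_workflow(x, y):
    d = f(x, y)
    a = g(d["x"])
    b = h(d)
'''

def find_mixed_usage(source: str, name: str) -> tuple[bool, bool]:
    """Return (subscripted, whole) usage flags for `name` (sketch)."""
    tree = ast.parse(source)
    sub_value_nodes = set()
    subscripted = False
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Name)
            and node.value.id == name
        ):
            subscripted = True
            # remember this Name node so it doesn't also count as "whole"
            sub_value_nodes.add(id(node.value))
    whole = any(
        isinstance(n, ast.Name)
        and n.id == name
        and isinstance(n.ctx, ast.Load)
        and id(n) not in sub_value_nodes
        for n in ast.walk(tree)
    )
    return subscripted, whole

print(find_mixed_usage(WORKFLOW_SRC, "d"))  # (True, True) -> ambiguous
```

A parser could raise on `(True, True)` rather than trying to represent both interpretations in one graph.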

@liamhuber (Member)

liamhuber commented Jul 17, 2025

In this case both multiple outputs and single outputs must be simultaneously dealt with. I don't think such a case can be represented straightforwardly...

Indeed, that is a very good point -- if the ast->workflow recipe parser is going to allow g(d["x"]) in the case that d = f(x, y) is saying "ah yes, f returns a dictionary, so d is effectively a panel of named outputs", then we would need to disallow h(d) appearing, because it is saying "here h, process a panel of output".

  1. For dataclasses this is not too bad, for dictionaries this just absolutely wrecks type hinting and ontological hinting. With a dictionary return there's no way to provide per-field hints on what goes back, so you just immediately lose this power.

Having let my own idea sit with me longer, I want to double-down on this: do whatever you want internally, but for the user interface you should only ever take or return dictionaries if they are of a uniform typing, e.g. dict[str, float]. For parsers, I imagine it's very often more like dict[str, float | int | str | np.ndarray | tuple[int, int] | ...

So maybe we don't want to be encouraging this anyhow.

@samwaseda samwaseda closed this Aug 26, 2025
@samwaseda samwaseda deleted the dict branch August 26, 2025 13:47