@josephjclark
Collaborator

Short Description

Use an async streaming function to serialize state objects at the end of each step in the runtime

Fixes #1203

Implementation Details

We have seen a problem where huge (I mean seriously big) state objects can cause the worker to get OOMKilled by Kubernetes.

This is hard to reproduce but I'm sure it's just the blocking JSON.stringify/parse calls we use on state objects at the end of each step.

The solution here uses a non-blocking algorithm. It'll probably be slower, but it yields between chunks of work, which allows gc and alloc to run, which means the worker thread should OOMKill itself before the supervisor process steps in.
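As a rough illustration of the yielding idea (not the code in this PR, and not the json-stream-stringify API), here is a minimal sketch of a serializer that hands control back to the event loop between nested values. The names `yieldToEventLoop` and `stringifyAsync` are made up for this example. It ignores `toJSON` and does no cycle detection, and it still builds the whole string in memory, whereas a real streaming serializer would emit chunks.

```javascript
// Hypothetical helper: resolves on a later turn of the event loop, so other
// queued work gets a chance to run between chunks of serialization.
const yieldToEventLoop = () => new Promise((resolve) => setImmediate(resolve));

// Sketch only: serializes plain objects, arrays and primitives to JSON,
// yielding once per nested object or array.
// Limitations: ignores toJSON (so Dates are not handled) and does not
// detect circular references.
async function stringifyAsync(value) {
  // Primitives are cheap, so delegate to the built-in.
  // This returns undefined for undefined and functions, like JSON.stringify.
  if (value === null || typeof value !== 'object') {
    return JSON.stringify(value);
  }

  await yieldToEventLoop();

  if (Array.isArray(value)) {
    const parts = [];
    for (const item of value) {
      // JSON.stringify writes null for unserializable array items
      parts.push((await stringifyAsync(item)) ?? 'null');
    }
    return `[${parts.join(',')}]`;
  }

  const parts = [];
  for (const [key, child] of Object.entries(value)) {
    const json = await stringifyAsync(child);
    // JSON.stringify omits properties whose value is unserializable
    if (json !== undefined) {
      parts.push(`${JSON.stringify(key)}:${json}`);
    }
  }
  return `{${parts.join(',')}}`;
}
```

Calling `await stringifyAsync(state)` should produce the same string as `JSON.stringify(state)` for plain data, just spread over many event loop turns instead of one blocking call.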

Note that the serializer will now throw if an object is particularly large. Technically the state object limit at the end of each step should match the dataclip payload limit allowed to the run. I am concerned that right now, some workflows create large state objects in the middle of the workflow, and tidy up on the last step. So for a 10MB limit, maybe some middle step creates a 20MB state object. And it all works fine because that large state object never leaves the worker. But if we start strictly enforcing that limit, those workflows will fail.

So for now, I've set that limit crazily high, to 1GB. The idea is that any massive state objects will cause an OOM fail, rather than a runtime error, so it's a bit academic.
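To make the limit behaviour concrete, here is a minimal sketch of how a serializer could throw once the output passes a byte limit. The names `STATE_LIMIT_BYTES` and `collectWithLimit` are hypothetical and the real implementation may enforce the limit differently. The only detail taken from the description above is the 1GB value.

```javascript
// Assumed constant: 1GB, as described above. The name is hypothetical.
const STATE_LIMIT_BYTES = 1024 * 1024 * 1024;

// Hypothetical helper: joins serialized chunks into one string, throwing as
// soon as the running UTF-8 byte count goes over the limit.
function collectWithLimit(chunks, limitBytes = STATE_LIMIT_BYTES) {
  let total = 0;
  const parts = [];
  for (const chunk of chunks) {
    total += Buffer.byteLength(chunk, 'utf8');
    if (total > limitBytes) {
      throw new Error(`State exceeds limit of ${limitBytes} bytes`);
    }
    parts.push(chunk);
  }
  return parts.join('');
}
```

Because the check runs per chunk, an oversized state fails partway through serialization instead of after the full string has been built.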

AI Usage

Please disclose how you've used AI in this work (it's cool, we just want to know!):

  • Code generation (copilot but not intellisense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI

You can read more details in our Responsible AI Policy

@github-project-automation github-project-automation bot moved this to New Issues in v2 Jan 7, 2026
@josephjclark
Collaborator Author

josephjclark commented Jan 7, 2026

Ah that failing integration test is quite damning.

It's failing to run this job:

each($.ids,
  get(`https://jsonplaceholder.typicode.com/todos/${$.data}`).then(
    (s) => {
      s.results.push(s.data);
      return s;
    }
  )
)

Because the resulting state looks like this:

[
  { userId: 1, id: 1, title: 'delectus aut autem', completed: false },
  {
    userId: 1,
    id: 2,
    title: 'quis ut nam facilis et officia qui',
    completed: false
  },
  '[Circular]'
]

For some reason, json-stream-stringify thinks that the third JSON object is a circular reference, and has redacted it. This is clearly wrong and actually pretty concerning.

The test passes if I disable circular structure detection (which is probably why we've not seen this in the engine's events processing). Looks like a bug in json-stream-stringify to me. One which I can't get past.
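For reference, a small sketch of what a circular reference is and is not, using only the built-in `JSON.stringify`. This is not a diagnosis of what json-stream-stringify is doing internally. It only shows that an object appearing more than once is not a cycle, while an object that contains itself is. The sample objects are made up for the example.

```javascript
const todo = { id: 1, title: 'delectus aut autem' };

// The same object appears twice: a repeated reference, but not a cycle.
// The built-in serializer writes it out both times.
const shared = [todo, todo];
const sharedJson = JSON.stringify(shared);

// An object that contains itself: a genuine circular structure.
const circular = { name: 'state' };
circular.self = circular;

// The built-in serializer throws a TypeError on a genuine cycle.
let circularError = null;
try {
  JSON.stringify(circular);
} catch (e) {
  circularError = e;
}
```

A detector that only remembers every object it has ever seen would flag `shared` as circular, which is one way a false positive like the one above could arise. Whether that is what is happening here is still an open question.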

@github-project-automation github-project-automation bot moved this from New Issues to Done in v2 Jan 7, 2026