Async state serialisation #1207
Closed
Short Description
Use an async streaming function to serialise state objects at the end of each step in the runtime.
Fixes #1203
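For illustration, here's a hedged sketch of the call-site change this implies (stringifyAsync and completeStep are illustrative names, not the actual runtime API):

```ts
// Hypothetical shape of the new serialiser; see the sketch under
// Implementation Details for what it might look like inside.
declare function stringifyAsync(value: unknown): Promise<string>;

async function completeStep(state: object): Promise<string> {
  // Previously this was a blocking JSON.stringify(state) call;
  // awaiting a streaming serialiser lets the event loop breathe.
  return stringifyAsync(state);
}
```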
Implementation Details
We have seen a problem where huge - I mean seriously big - state objects can cause the worker to get OOMKilled by Kubernetes.
This is hard to reproduce, but I'm confident the cause is the blocking JSON.stringify/JSON.parse calls we run on state objects at the end of each step.
The solution here uses a non-blocking algorithm. It will probably be slower, but because it yields to the event loop, GC and allocation can run between slices of work, which means the worker thread should OOMKill itself before the supervisor process steps in.
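To make the idea concrete, here's a minimal sketch of a non-blocking serialiser of this kind, assuming a chunked recursive walk that yields via setImmediate. The names and chunk size are illustrative, not the actual implementation, and it only approximates JSON.stringify semantics (no toJSON support; undefined object values become null instead of being dropped):

```ts
// Yield to the event loop every CHUNK values so GC can run mid-serialisation
const CHUNK = 5_000;

const tick = () => new Promise<void>((resolve) => setImmediate(resolve));

export async function stringifyAsync(value: unknown): Promise<string> {
  const parts: string[] = [];
  let count = 0;

  const walk = async (v: unknown): Promise<void> => {
    // Periodically hand control back to the event loop
    if (++count % CHUNK === 0) await tick();

    if (v === null || typeof v !== 'object') {
      // Primitives go through the native stringifier;
      // undefined/functions fall back to null
      parts.push(JSON.stringify(v) ?? 'null');
      return;
    }
    if (Array.isArray(v)) {
      parts.push('[');
      for (let i = 0; i < v.length; i++) {
        if (i > 0) parts.push(',');
        await walk(v[i]);
      }
      parts.push(']');
      return;
    }
    const entries = Object.entries(v as Record<string, unknown>);
    parts.push('{');
    for (let i = 0; i < entries.length; i++) {
      if (i > 0) parts.push(',');
      parts.push(JSON.stringify(entries[i][0]), ':');
      await walk(entries[i][1]);
    }
    parts.push('}');
  };

  await walk(value);
  return parts.join('');
}
```

Because each slice of work is bounded, the event loop gets regular chances to run GC in between, at the cost of some raw serialisation speed.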
Note that the serialiser will now throw if an object is particularly large. Technically, the state-object limit at the end of each step should match the dataclip payload limit allowed for the run. My concern is that right now, some workflows create large state objects in the middle of the workflow and tidy up on the last step. With a 10mb limit, a middle step might create a 20mb state object, and that currently works fine because the large object never leaves the worker. But if we start strictly enforcing that limit, those workflows will fail.
So for now, I've set that limit crazily high, to 1gb. The idea is that any truly massive state object will cause an OOM failure rather than a runtime error, so the limit is a bit academic.
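As a sketch of how that guard might be wired in (illustrative names, not the actual code), the serialiser can count bytes as chunks are produced and throw as soon as the running total passes the limit, rather than building the full string first:

```ts
// Deliberately high limit, as described above
const PAYLOAD_LIMIT_BYTES = 1024 * 1024 * 1024; // 1gb

export function makeGuardedSink() {
  const parts: string[] = [];
  let bytes = 0;
  return {
    push(chunk: string) {
      bytes += Buffer.byteLength(chunk, 'utf8');
      if (bytes > PAYLOAD_LIMIT_BYTES) {
        // Fail fast mid-stream; the caller surfaces this as a step error
        throw new Error('Serialised state exceeds the payload limit');
      }
      parts.push(chunk);
    },
    result: () => parts.join(''),
  };
}
```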
AI Usage
Please disclose how you've used AI in this work (it's cool, we just want to know!):
You can read more details in our Responsible AI Policy