Description
When the worker claims a Run, it'll hand it off to the engine to execute.
The engine maintains a pool of child processes. When the Run is handed to it, it'll assign a child process.
The engine also maintains its own queue, so if all child processes are busy, it'll queue the Run and execute it when a child process becomes free.
This in practice should never happen, because the worker should never claim more work than it has capacity for, and it reads capacity from the engine.
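To make the intended contract concrete, here's a minimal sketch of a pool that defers work when it's full, plus a worker-side check that only claims what the pool has room for. All names here (`SketchPool`, `claimable`, `capacity`) are hypothetical and are not the engine's actual API; real child processes are replaced with a simple counter.

```typescript
// Hypothetical sketch: not the engine's real pool implementation.
type Task = { runId: string };

class SketchPool {
  private size: number;
  private busy = 0;
  private deferred: Task[] = [];
  started: string[] = [];

  constructor(size: number) {
    this.size = size;
  }

  // Free slots, as the worker would read them before claiming.
  get capacity(): number {
    return this.size - this.busy;
  }

  get deferredCount(): number {
    return this.deferred.length;
  }

  // Hand a run to the pool. Returns false if the run had to be deferred.
  submit(task: Task): boolean {
    if (this.busy < this.size) {
      this.busy += 1;
      this.started.push(task.runId);
      return true;
    }
    this.deferred.push(task);
    return false;
  }

  // Called when a child process finishes; starts deferred work if there is any.
  release(): void {
    this.busy -= 1;
    const next = this.deferred.shift();
    if (next) {
      this.busy += 1;
      this.started.push(next.runId);
    }
  }
}

// Worker side: never claim more runs than the pool has free slots.
function claimable(pool: SketchPool, available: Task[]): Task[] {
  return available.slice(0, Math.max(0, pool.capacity));
}
```

If the worker always goes through something like `claimable`, `submit` should never return false. A deferral therefore means the worker's view of capacity and the pool's actual state have drifted apart.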
However, we do occasionally see the pool defer a Run. I think this causes problems, because the worker will probably claim over capacity, and maybe there's an unregistered run somewhere out there.
Here's a GCP log where run cf14bef8-9a10-4ade-8f17-cc3879084121 seems to be deferred in the engine. But I don't actually think it executes properly, and the run goes on to be marked lost.
Maybe we should throw an error when this happens, rather than trying to pool the work, because really we're just undermining the worker here. But how do we ensure this run doesn't get lost? It needs to be queued up somewhere, at least until we have some kind of reject event which puts the run back on the queue.
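As a rough sketch of the throw-instead-of-defer idea, assuming a reject path existed: the pool refuses work when it's full, and the caller hands the run back instead of dropping it. `StrictPool`, `PoolFullError`, `handOff` and the `requeue` callback are all hypothetical names; in particular, `requeue` stands in for the reject event, which doesn't exist yet.

```typescript
// Hypothetical sketch: none of these names exist in the engine or worker today.
class PoolFullError extends Error {
  runId: string;

  constructor(runId: string) {
    super(`no free child process for run ${runId}`);
    this.name = 'PoolFullError';
    this.runId = runId;
  }
}

class StrictPool {
  private size: number;
  private busy = 0;

  constructor(size: number) {
    this.size = size;
  }

  // Start a run, or throw if every child process is busy (no internal queue).
  execute(runId: string): void {
    if (this.busy >= this.size) {
      throw new PoolFullError(runId);
    }
    this.busy += 1;
  }

  // Called when a child process finishes.
  release(): void {
    if (this.busy > 0) {
      this.busy -= 1;
    }
  }
}

// Worker side: if the pool rejects the run, hand it back rather than losing it.
// `requeue` is a stand-in for a future reject event.
function handOff(
  pool: StrictPool,
  runId: string,
  requeue: (id: string) => void
): boolean {
  try {
    pool.execute(runId);
    return true;
  } catch (e) {
    if (e instanceof Error && e.name === 'PoolFullError') {
      requeue(runId);
      return false;
    }
    throw e;
  }
}
```

The point of the sketch is that the run always ends up in exactly one place: either running in a child process or back with whoever owns the queue. It doesn't answer where the run should live before a reject event exists.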