
Worker: lost run when a task is deferred by the pool? #1201

@josephjclark


When the worker claims a Run, it'll hand it off to the engine to execute.

The engine maintains a pool of child processes. When a Run is handed to it, it'll assign the Run to a child process.

The engine also maintains its own queue, so if all child processes are busy, it'll queue the Run and execute it when a child process becomes free.

In practice this should never happen, because the worker should never claim more work than it has capacity for, and it reads its capacity from the engine.
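To make the behaviour above concrete, here's a rough sketch of a fixed-size pool with its own internal queue. This is not the real engine code: `SketchPool`, `submit`, `complete` and `freeSlots` are all made-up names, and child processes are modelled as nothing more than a set of run ids.

```typescript
// Hypothetical sketch only: none of these names come from the real engine or worker.
type Task = { runId: string };

class SketchPool {
  private size: number;
  private active = new Set<string>(); // runs currently holding a child process
  private deferred: Task[] = []; // runs waiting for a child process

  constructor(size: number) {
    this.size = size;
  }

  // The capacity a worker would read before claiming more work
  get freeSlots(): number {
    return Math.max(0, this.size - this.active.size);
  }

  get deferredCount(): number {
    return this.deferred.length;
  }

  // Start the task if a child process is free, otherwise defer it
  submit(task: Task): "started" | "deferred" {
    if (this.active.size < this.size) {
      this.active.add(task.runId);
      return "started";
    }
    this.deferred.push(task);
    return "deferred";
  }

  // Called when a run finishes: frees the slot and starts the next deferred task, if any.
  // Returns the id of the run that was picked up, or undefined if nothing was waiting.
  complete(runId: string): string | undefined {
    this.active.delete(runId);
    const next = this.deferred.shift();
    if (next) {
      this.active.add(next.runId);
      return next.runId;
    }
    return undefined;
  }
}
```

If the worker only ever claims up to `freeSlots` runs, `submit` never returns `"deferred"`. A deferral therefore means the worker's view of capacity and the pool's actual state have drifted apart.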

However, we do occasionally see the pool defer a Run. I think this causes problems, because the Worker is probably claiming over capacity and maybe there's an unregistered run somewhere out there.

Here's a GCP log where run cf14bef8-9a10-4ade-8f17-cc3879084121 seems to be deferred in the engine. But I don't think it actually executes properly, and the run goes on to be marked lost.

Maybe we should throw an error when this happens, rather than trying to pool the work, because really we're just undermining the worker here. But how do we ensure this run doesn't get lost? It needs to be queued up somewhere, at least until we have some kind of reject event which puts the run back on the queue.
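A rough sketch of what the throw-instead-of-defer idea could look like. Everything here is hypothetical: `StrictPool`, `PoolFullError` and `startOrRequeue` are illustrative names, and the plain `queue` array is only a stand-in for whatever a future reject event would put the run back onto.

```typescript
// Hypothetical sketch only: not the real engine or worker API.
class PoolFullError extends Error {
  runId: string;

  constructor(runId: string) {
    super(`No free child process for run ${runId}`);
    // Keep instanceof working even when compiled to older JS targets
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "PoolFullError";
    this.runId = runId;
  }
}

class StrictPool {
  private size: number;
  private active = new Set<string>();

  constructor(size: number) {
    this.size = size;
  }

  // Refuse the run outright rather than deferring it internally
  execute(runId: string): void {
    if (this.active.size >= this.size) {
      throw new PoolFullError(runId);
    }
    this.active.add(runId);
  }
}

// Worker side: try to start the run; if the pool rejects it, hand it back
// to the (stand-in) queue so it is never silently held by the engine.
function startOrRequeue(pool: StrictPool, runId: string, queue: string[]): boolean {
  try {
    pool.execute(runId);
    return true;
  } catch (e) {
    if (e instanceof PoolFullError) {
      queue.push(runId);
      return false;
    }
    throw e; // anything else is a real failure
  }
}
```

The point of the sketch is that a rejected run always ends up somewhere visible: either it holds a child process, or it is back on a queue the worker knows about.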
