Refactor `Task` system to retain `ProtocolDAG`, upload `ProtocolUnitResult`s and `ResultFile`s as they complete #180

Currently, when a compute service executes a `Task`, it generates a new `ProtocolDAG` locally, executes it, and pushes the (successful or failed) `ProtocolDAGResult` back to the server. This adds the serialized `ProtocolDAGResult` to the object store, and a `ProtocolDAGResultRef` to the state store. A `Task` can have any number of failed `ProtocolDAGResultRef`s, and (typically) a single successful `ProtocolDAGResultRef`.

This approach does not currently support `ResultFile` upload (files produced by `ProtocolUnit`s that should be stored permanently and made available to users on demand), nor does it allow a `ProtocolDAG` that has successfully executed some of its `ProtocolUnit`s to be resumed from where it left off on another compute service (checkpointing). Our aim is to support both in `alchemiscale`.

This proposal should accomplish both:

- instead of storing things in the object store by `ProtocolDAGResult`, we should store them by `Task`/`ProtocolDAG`
    - this fits with the idea that the same file storage system can be used to enable partial restarts
- a `Task` gets a `ProtocolDAGRef` in the state store upon creation, and a serialized `ProtocolDAG` in the object store
- as the `ProtocolDAG` is executed on a compute service, `ProtocolUnitResult`s and `ResultFile`s are shipped to the object store as they complete
- on success, a complete `ProtocolDAGResult` is shipped to the object store and a `ProtocolDAGResultRef` is added to the state store; the retrieval pattern is the same as before
- on failure, the same happens, but for a failed `ProtocolDAGResult`
- when another compute service picks up a `Task`, it checks for the existence of a `ProtocolDAGRef`; if present, it pulls the `ProtocolDAG` and its associated `ProtocolUnitResult`s from the object store
    - it then finds the `ProtocolUnit`s in the `ProtocolDAG` that have not been successfully executed (either failed or not run at all), identifies the `ProtocolUnitResult`s they depend on, grabs their `ResultFile`s if included in outputs, and proceeds with DAG execution
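
A rough sketch of the resume step described in the last two bullets, to make the intended behavior concrete. Everything here is hypothetical: `Unit`, `UnitResult`, `ObjectStore`, and `plan_resume` are simplified stand-ins rather than `gufe` or `alchemiscale` APIs, the object store is an in-memory dict keyed by `Task`, and the DAG is assumed to be stored as a list already in dependency order.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Unit:
    """Hypothetical stand-in for a ProtocolUnit."""
    key: str
    depends_on: tuple[str, ...] = ()


@dataclass
class UnitResult:
    """Hypothetical stand-in for a ProtocolUnitResult."""
    unit_key: str
    ok: bool
    result_files: tuple[str, ...] = ()


@dataclass
class ObjectStore:
    """In-memory stand-in for the object store, keyed by Task."""
    dags: dict[str, list[Unit]] = field(default_factory=dict)
    unit_results: dict[str, list[UnitResult]] = field(default_factory=dict)

    def push_unit_result(self, task_key: str, result: UnitResult) -> None:
        # Called by the compute service as each unit completes.
        self.unit_results.setdefault(task_key, []).append(result)


def plan_resume(store: ObjectStore, task_key: str):
    """Return (units still to run, ResultFiles to fetch before running)."""
    dag = store.dags[task_key]
    # Only successful results count as done; failed units are re-run.
    done = {r.unit_key: r
            for r in store.unit_results.get(task_key, []) if r.ok}
    to_run = [u for u in dag if u.key not in done]
    # Files produced by already-completed dependencies of the pending units.
    needed = sorted({
        f
        for u in to_run
        for dep in u.depends_on
        if dep in done
        for f in done[dep].result_files
    })
    return to_run, needed


store = ObjectStore()
store.dags["task-1"] = [
    Unit("setup"),
    Unit("simulate", depends_on=("setup",)),
    Unit("analyze", depends_on=("simulate",)),
]
# A first compute service finished "setup" and failed during "simulate".
store.push_unit_result("task-1", UnitResult("setup", True, ("system.xml",)))
store.push_unit_result("task-1", UnitResult("simulate", False))

to_run, needed = plan_resume(store, "task-1")
print([u.key for u in to_run], needed)
```

In this toy case a second compute service would re-run `simulate` and `analyze` only, after fetching the one file that `setup` already produced.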


This has some nice properties:
- a `Task` has exactly one `ProtocolDAGRef` for its lifetime, and may have any number of failed `ProtocolDAGResultRef`s but only one successful `ProtocolDAGResultRef`
- we don't have to do odd workarounds to utilize `gufe` storage system for `ResultFile`s (see [gufe#186](https://github.com/OpenFreeEnergy/gufe/pull/186) and [gufe#234](https://github.com/OpenFreeEnergy/gufe/pull/234) for current state as of this writing)
- we get architectural support for checkpointing for `ProtocolDAG`s, reducing waste and time to results
- still mostly the same system in terms of execution, status model, `Task` claiming, result retrieval, etc.
- gives what is needed to support `ResultFile` retrieval user-side
- gives what is needed to support `extends` on the compute side, where one or more `ResultFile`s may be needed to extend a `ProtocolDAG` from a previous `ProtocolDAGResult`
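
To illustrate the first property, here is a minimal sketch of the per-`Task` bookkeeping the state store would hold. `TaskRecord` and its fields are hypothetical names for illustration, not the `alchemiscale` data model, and the sketch assumes the stricter reading in which a second successful `ProtocolDAGResultRef` is rejected outright.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskRecord:
    """Hypothetical state-store view of a Task."""
    protocoldag_ref: str  # set once, at Task creation
    failed_result_refs: List[str] = field(default_factory=list)
    successful_result_ref: Optional[str] = None

    def add_result_ref(self, ref: str, ok: bool) -> None:
        if not ok:
            # Any number of failed attempts may accumulate.
            self.failed_result_refs.append(ref)
            return
        if self.successful_result_ref is not None:
            # Assumption: at most one successful result per Task.
            raise ValueError("Task already has a successful result")
        self.successful_result_ref = ref


record = TaskRecord(protocoldag_ref="dag-ref-1")
record.add_result_ref("result-ref-a", ok=False)
record.add_result_ref("result-ref-b", ok=False)
record.add_result_ref("result-ref-c", ok=True)
```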
