Refactor Task system to retain ProtocolDAG, upload ProtocolUnitResults and ResultFiles as they complete #180

@dotsdl

Currently, when a compute service executes a Task, it generates a new ProtocolDAG locally, executes it, and pushes the (successful or failed) ProtocolDAGResult back to the server. This adds the serialized ProtocolDAGResult to the object store, and a ProtocolDAGResultRef to the state store. A Task can have any number of failed ProtocolDAGResultRefs, and (typically) a single successful ProtocolDAGResultRef.

This approach does not support ResultFile upload (files produced by ProtocolUnits that should be kept in permanent storage and made available on demand to users later), nor does it allow a ProtocolDAG that has successfully executed some of its ProtocolUnits to be resumed from where it left off on another compute service (checkpointing). Our aim is to support both of these in alchemiscale.

This proposal should accomplish both:

  • instead of storing things in object store by ProtocolDAGResult, we should store them by Task/ProtocolDAG
    • this fits in with the idea that the same file storage system can be used to enable partial restarts
  • a Task gets a ProtocolDAGRef in the state store upon creation, and a serialized ProtocolDAG in the object store
  • as the ProtocolDAG is executed on a compute service, ProtocolUnitResults and ResultFiles are shipped to the object store
  • on success, a complete ProtocolDAGResult is shipped to the object store and a ProtocolDAGResultRef is added to the state store; same retrieval pattern as before
  • on failure, same as above but for a failed ProtocolDAGResult
  • when another compute service picks up a Task, it checks for existence of a ProtocolDAGRef; if present, pulls ProtocolDAG and its associated ProtocolUnitResults from object store
    • it then finds the ProtocolUnits in the ProtocolDAG that have not successfully been executed (either failed or not run at all), identifies their dependency ProtocolUnitResults, grabs their ResultFiles if included in outputs, and proceeds with DAG execution
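
The restart step in the last bullet could be sketched roughly as follows. This is a minimal illustration only: `Unit`, `UnitResult`, and `plan_resume` are hypothetical stand-ins invented for this example, not gufe or alchemiscale classes, and the sketch assumes the units are supplied in a valid execution order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical stand-in for a ProtocolUnit: a key plus the keys it depends on."""
    key: str
    depends_on: tuple = ()


@dataclass(frozen=True)
class UnitResult:
    """Hypothetical stand-in for a ProtocolUnitResult pulled from the object store."""
    unit_key: str
    ok: bool
    result_files: tuple = ()


def plan_resume(units, stored_results):
    """Work out what a compute service picking up a Task still has to do.

    Returns (units still to run, dependency results to reuse, files to fetch).
    """
    # Only successful results count as done; failed units are run again.
    done = {r.unit_key: r for r in stored_results if r.ok}

    # Units that failed or never ran, kept in the order they were given.
    to_run = [u for u in units if u.key not in done]

    # Successful results that the remaining units depend on directly.
    needed = {}
    for unit in to_run:
        for dep in unit.depends_on:
            if dep in done:
                needed[dep] = done[dep]

    # ResultFiles attached to those dependency results must be pulled as well.
    files = sorted(f for r in needed.values() for f in r.result_files)
    return to_run, needed, files
```

For a three-unit chain where the first unit succeeded and the second failed, this returns the second and third units to run, the first unit's result as the only dependency to reuse, and that result's files as the ones to download.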

This has some nice properties:

  • a Task only ever has a single ProtocolDAGRef, which may have any number of failed ProtocolDAGResultRefs and only one successful ProtocolDAGResultRef
  • we don't have to do odd workarounds to utilize gufe storage system for ResultFiles (see gufe#186 and gufe#234 for current state as of this writing)
  • we get architectural support for checkpointing for ProtocolDAGs, reducing waste and time to results
  • still mostly the same system in terms of execution, status model, Task claiming, result retrieval, etc.
  • gives what is needed to support ResultFile retrieval user-side
  • gives what is needed to support extends on the compute side, where one or more ResultFiles may be needed to extend a ProtocolDAG from a previous ProtocolDAGResult
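
The first property above (one ProtocolDAGRef per Task, any number of failed ProtocolDAGResultRefs, only one successful one) could be expressed as a small bookkeeping record. This is a sketch under stated assumptions: `TaskRecord` and its fields are hypothetical names for illustration and do not reflect how the alchemiscale state store actually models these references.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskRecord:
    """Hypothetical per-Task bookkeeping; refs are plain strings here."""
    dag_ref: str  # the single ProtocolDAGRef, set once at Task creation
    failed_result_refs: list = field(default_factory=list)
    successful_result_ref: Optional[str] = None

    def add_result_ref(self, ref: str, ok: bool) -> None:
        """Record a ProtocolDAGResultRef, refusing anything after a success."""
        if self.successful_result_ref is not None:
            raise ValueError("Task already has a successful ProtocolDAGResultRef")
        if ok:
            self.successful_result_ref = ref
        else:
            self.failed_result_refs.append(ref)
```

Refusing every new reference once a success is recorded is a design choice made for this sketch; a real implementation might instead stop the Task from being claimed at all after it completes.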

Status

Upcoming Sprint - Queued
