Description
Currently, when a compute service executes a Task, it generates a new ProtocolDAG locally, executes it, and pushes the (successful or failed) ProtocolDAGResult back to the server. This adds the serialized ProtocolDAGResult to the object store, and a ProtocolDAGResultRef to the state store. A Task can have any number of failed ProtocolDAGResultRefs, and (typically) a single successful ProtocolDAGResultRef.
This approach does not currently support ResultFile upload (files produced by ProtocolUnits that are desired for permanent storage, available on-demand to users later), nor does it allow for a ProtocolDAG that successfully executes some ProtocolUnits to be started again from where it left off on another compute service (checkpointing). Our aim is to support both of these in alchemiscale.
This proposal should accomplish both:
- instead of storing things in object store by ProtocolDAGResult, we should store them by Task/ProtocolDAG
  - this fits in with the idea that the same file storage system can be used to enable partial restarts
- a Task gets a ProtocolDAGRef in state store upon creation, serialized ProtocolDAG in object store
- as ProtocolDAG is executed on compute service, ProtocolUnitResults and ResultFiles shipped to object store
- on success, a complete ProtocolDAGResult shipped to object store, ProtocolDAGResultRef added to state store; same retrieval pattern as before
- on failure, same as above but for a failed ProtocolDAGResult
- when another compute service picks up a Task, it checks for existence of a ProtocolDAGRef; if present, pulls ProtocolDAG and its associated ProtocolUnitResults from object store
  - it then finds the ProtocolUnits in the ProtocolDAG that have not successfully been executed (either failed or not run at all), identifies their dependency ProtocolUnitResults, grabs their ResultFiles if included in outputs, and proceeds with DAG execution
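
As a rough illustration of the restart step described in the last bullet, here is a minimal sketch in plain Python. It does not use the gufe or alchemiscale APIs: `UnitResult`, `units_to_run`, `resume`, and the `execute` callback are hypothetical names, units are represented only by string names, and the object store is stood in for by a plain dict of previously stored results.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class UnitResult:
    """Hypothetical stand-in for a stored ProtocolUnitResult."""
    unit: str
    ok: bool
    outputs: dict = field(default_factory=dict)


def units_to_run(dependencies, stored_results):
    """Return units lacking a successful result, in a valid execution order.

    dependencies: maps each unit name to the set of unit names it depends on.
    stored_results: maps unit names to results pulled from the object store.
    """
    done = {name for name, res in stored_results.items() if res.ok}
    # static_order() yields every unit after all of its dependencies
    order = TopologicalSorter(dependencies).static_order()
    return [name for name in order if name not in done]


def resume(dependencies, stored_results, execute):
    """Run only the unfinished units, feeding each its dependency results.

    execute: hypothetical callable (unit_name, inputs) -> UnitResult.
    Returns (successful results by unit name, whether the whole DAG succeeded).
    """
    # start from the successful results already in the store
    results = {n: r for n, r in stored_results.items() if r.ok}
    for name in units_to_run(dependencies, stored_results):
        inputs = {dep: results[dep] for dep in dependencies.get(name, ())}
        result = execute(name, inputs)
        if not result.ok:
            # stop here; successful results remain available for a later attempt
            return results, False
        results[name] = result
    return results, True
```

With a three-unit chain in which the first unit succeeded and the second failed on an earlier attempt, `units_to_run` returns only the second and third units, so the first is never executed again.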
This has some nice properties:
- a Task has a single ProtocolDAGRef ever, and this may have any number of failed ProtocolDAGResultRefs and only one successful ProtocolDAGResultRef
- we don't have to do odd workarounds to utilize gufe storage system for ResultFiles (see gufe#186 and gufe#234 for current state as of this writing)
- we get architectural support for checkpointing for ProtocolDAGs, reducing waste and time to results
- still mostly the same system in terms of execution, status model, Task claiming, result retrieval, etc.
- gives what is needed to support ResultFile retrieval user-side
- gives what is needed for extends support compute-side, where one or more ResultFiles may be needed to extend a ProtocolDAG from a previous ProtocolDAGResult
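
To make the first property concrete, the relationship between a Task, its single ProtocolDAGRef, and its result refs could be modeled roughly as below. This is only a sketch: `TaskRecord`, its fields, and the `object_store_key` layout are hypothetical and are not taken from alchemiscale's actual state store or object store schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskRecord:
    """Hypothetical state-store view of a Task under this proposal."""
    task_key: str
    protocoldag_ref: str  # assigned once, at Task creation
    failed_result_refs: list = field(default_factory=list)
    successful_result_ref: Optional[str] = None

    def add_result_ref(self, ref: str, ok: bool) -> None:
        """Record a ProtocolDAGResultRef, allowing at most one success."""
        if not ok:
            self.failed_result_refs.append(ref)
            return
        if self.successful_result_ref is not None:
            raise ValueError("Task already has a successful ProtocolDAGResultRef")
        self.successful_result_ref = ref


def object_store_key(task_key: str, kind: str, name: str) -> str:
    """Hypothetical key layout that groups stored objects by Task."""
    return f"{task_key}/{kind}/{name}"
```

Grouping keys by Task means that ProtocolUnitResults, ResultFiles, and the final ProtocolDAGResult for one Task would share a prefix, which is what would let a second compute service find partial work.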