Defragment framework interfaces into context based calls #107

@hongchaodeng

Problem Description

The latest build failure was caused by the following race:

  1. Everything worked correctly at epoch 6
  2. After it finished sending data to its parent, and before the new epoch was set, a slave died. It restarted on a new node and repeated the same computation.
  3. In the regression framework,
func (t *dummySlave) ParentDataReady(parentID uint64, req string, resp []byte) {
        ...
    if len(children) != 0 {
        t.framework.FlagMetaToChild("ParamReady")
    } else {
        // On a leaf node, we can return immediately and flag the parent
        // that this node is ready.
        t.framework.FlagMetaToParent("GradientReady")
    }
}

FlagMetaToX prepends the epoch to the string and sets it in etcd. However, the current implementation does not synchronize the epoch correctly here:

func (f *framework) FlagMetaToChild(meta string) {
    value := fmt.Sprintf("%d-%s", f.epoch, meta)
    f.etcdClient.Set(ChildMetaPath, value, 0)
}

The problem is that a new epoch, 7, had already been set by the time the restarted slave flagged its meta, so the meta was flagged with epoch 7. But the slave was actually dealing with data from epoch 6!
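To make the race concrete, here is a minimal sketch that reproduces the mismatch without etcd. The `framework` type below is a hypothetical stand-in: only the `epoch` field and the `"%d-%s"` formatting come from the snippet above, and the `flags` slice is an assumption that replaces the etcd write.

```go
package main

import "fmt"

// framework is a hypothetical stand-in for the real framework. Only the
// epoch field and the "%d-%s" formatting mirror the code in this issue.
type framework struct {
	epoch uint64
	flags []string // assumption: records what would be written to etcd
}

// FlagMetaToChild reads f.epoch at call time, like the current implementation.
func (f *framework) FlagMetaToChild(meta string) {
	f.flags = append(f.flags, fmt.Sprintf("%d-%s", f.epoch, meta))
}

// simulateRace replays the failing interleaving in a single goroutine.
func simulateRace() (dataEpoch uint64, flagged string) {
	f := &framework{epoch: 6}
	// The restarted slave starts handling parent data that belongs to epoch 6.
	dataEpoch = f.epoch
	// Before the handler flags its child, the framework moves on to epoch 7.
	f.epoch = 7
	f.FlagMetaToChild("ParamReady")
	return dataEpoch, f.flags[0]
}

func main() {
	dataEpoch, flagged := simulateRace()
	fmt.Printf("data epoch: %d, flagged: %s\n", dataEpoch, flagged)
}
```

The handler works on epoch 6 data, but the flag carries whatever epoch the framework holds at the moment of the call.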

Analysis

There are several ways to resolve this. We could stop all in-flight work at the end of each epoch. But this does not scale as the code grows: every new kind of remote request would also have to be shut down consistently, which adds a heavy debugging burden. Alternatively, we could let the user pass the epoch to FlagMetaToChild. That violates our design: tracking the epoch and similar information should be done in the framework, not at the user level.

Proposed Solution

On second thought, the framework interface actually covers two types of work. One type is epoch specific: FlagMetaToX, DataRequest, and (maybe) IncEpoch. The rest of the APIs are the other type, epoch agnostic.

I propose to defragment FlagMetaToX and DataRequest into a separate interface:

type Context:
  GetEpoch, GetFromID
  FlagMetaToChild, FlagMetaToParent, DataRequest

And change the Task interface accordingly:

func (t *dummySlave) ParentDataReady(ctx Context, req string, resp []byte) {
    if len(children) != 0 {
        ctx.FlagMetaToChild("ParamReady")
    } else {
        // On a leaf node, we can return immediately and flag the parent
        // that this node is ready.
        ctx.FlagMetaToParent("GradientReady")
    }
}

This way, the context tracks the epoch for us, and it is easy for us to synchronize it under the hood of the framework.
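A rough sketch of how such a context could pin the epoch, under several assumptions: the method names come from the proposal, but every signature, the `metaStore` abstraction standing in for etcd, the `newContext` helper, and the `"child-meta"` / `"parent-meta"` keys are hypothetical. DataRequest is left out because its signature is not given here.

```go
package main

import "fmt"

// Context sketches the proposed interface. Method names follow the
// proposal; the signatures are assumptions.
type Context interface {
	GetEpoch() uint64
	GetFromID() uint64
	FlagMetaToChild(meta string)
	FlagMetaToParent(meta string)
}

// metaStore is a hypothetical abstraction over the etcd write so that the
// sketch runs without etcd.
type metaStore interface {
	Set(path, value string)
}

// memStore is an in-memory metaStore used only for this example.
type memStore map[string]string

func (m memStore) Set(path, value string) { m[path] = value }

type framework struct {
	epoch uint64
	store metaStore
}

// epochContext captures the epoch and sender ID once, when it is created.
type epochContext struct {
	epoch  uint64
	fromID uint64
	store  metaStore
}

func (c *epochContext) GetEpoch() uint64  { return c.epoch }
func (c *epochContext) GetFromID() uint64 { return c.fromID }

// The flag methods use the pinned epoch, not the framework's current one.
func (c *epochContext) FlagMetaToChild(meta string) {
	c.store.Set("child-meta", fmt.Sprintf("%d-%s", c.epoch, meta))
}

func (c *epochContext) FlagMetaToParent(meta string) {
	c.store.Set("parent-meta", fmt.Sprintf("%d-%s", c.epoch, meta))
}

// newContext snapshots the framework's current epoch for a single event.
func (f *framework) newContext(fromID uint64) Context {
	return &epochContext{epoch: f.epoch, fromID: fromID, store: f.store}
}

// flagAcrossEpochChange replays the race, this time through a context.
func flagAcrossEpochChange() string {
	store := memStore{}
	f := &framework{epoch: 6, store: store}
	ctx := f.newContext(1) // event from parent 1, created during epoch 6
	f.epoch = 7            // the framework advances before the handler runs
	ctx.FlagMetaToChild("ParamReady")
	return store["child-meta"]
}

func main() {
	fmt.Println(flagAcrossEpochChange())
}
```

Because the context holds its own copy of the epoch, a later epoch change in the framework no longer affects flags raised for the earlier event. Whether the framework should also reject or drop flags from a stale context is a separate question this sketch does not address.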
