Problem Description
The latest build failure was caused by the following race:
- Everything worked correctly at epoch 6.
- After a slave finished sending its data to the parent, but before the new epoch was set, the slave died. It restarted on a new node and repeated the same computation.
- In the regression framework, the slave handles parent data like this:
func (t *dummySlave) ParentDataReady(parentID uint64, req string, resp []byte) {
	...
	if len(children) != 0 {
		t.framework.FlagMetaToChild("ParamReady")
	} else {
		// On a leaf node, we can return immediately and flag the parent
		// that this node is ready.
		t.framework.FlagMetaToParent("GradientReady")
	}
}
FlagMetaToX prepends the epoch to the string and sets it in etcd. However, the current implementation does not synchronize the epoch correctly here:
func (f *framework) FlagMetaToChild(meta string) {
	value := fmt.Sprintf("%d-%s", f.epoch, meta)
	f.etcdClient.Set(ChildMetaPath, value, 0)
}
The problem is that a new epoch 7 had already been set, so the meta was flagged with epoch 7, even though the slave was actually dealing with data from epoch 6!
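The race can be reproduced in isolation with a minimal sketch. It assumes a hypothetical in-memory map (`fakeStore`) in place of etcd and a stripped-down `framework` struct; `replayStaleCallback` and the `childMeta` key are illustrative names, not part of the real code base.

```go
package main

import "fmt"

// fakeStore is a hypothetical in-memory stand-in for etcd: it keeps the
// last value written to each key.
type fakeStore map[string]string

// framework mirrors the real struct only as far as this example needs.
type framework struct {
	epoch uint64
	store fakeStore
}

// FlagMetaToChild reproduces the behaviour described above: it reads the
// framework's epoch at call time, not the epoch of the data being handled.
func (f *framework) FlagMetaToChild(meta string) {
	f.store["childMeta"] = fmt.Sprintf("%d-%s", f.epoch, meta)
}

// replayStaleCallback simulates the restarted slave: the framework has
// already moved on to newEpoch when a callback for epoch-6 data flags meta.
func replayStaleCallback(newEpoch uint64) string {
	f := &framework{epoch: 6, store: fakeStore{}}
	f.epoch = newEpoch              // the new epoch is set first...
	f.FlagMetaToChild("ParamReady") // ...then the epoch-6 callback runs
	return f.store["childMeta"]
}

func main() {
	// Prints "7-ParamReady" although the data belongs to epoch 6.
	fmt.Println(replayStaleCallback(7))
}
```

The value is stamped with whatever `f.epoch` holds when the call happens, so any callback that outlives its epoch mislabels its meta.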
Analysis
There are several ways to resolve this. We could stop all work in progress at the end of each epoch. But this does not scale with the code: every remote request we add would also have to be closed consistently, which puts a heavy burden on debugging. Alternatively, we could just let the user pass the epoch to FlagMetaToChild. This violates our design: keeping track of the epoch and similar info should be done in the framework, not at the user level.
Proposed Solution
On second thought about the framework interface, there are actually two types of work in it. One type is epoch-specific: FlagMetaToX, DataRequest, and (maybe) IncEpoch. The rest of the APIs are the other type, epoch-agnostic.
I propose to split FlagMetaToX and DataRequest out into a separate interface:
type Context:
	GetEpoch, GetFromID
	FlagMetaToChild, FlagMetaToParent, DataRequest
And change the Task interface accordingly:
func (t *dummySlave) ParentDataReady(ctx Context, req string, resp []byte) {
	if len(children) != 0 {
		ctx.FlagMetaToChild("ParamReady")
	} else {
		// On a leaf node, we can return immediately and flag the parent
		// that this node is ready.
		ctx.FlagMetaToParent("GradientReady")
	}
}
This way, the context tracks the epoch for us, and it is easy to keep it synchronized under the hood of the framework.
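One possible shape for such a context is sketched below. The method signatures are assumptions, since only the method names are given above, and `DataRequest` is left out because its signature is not specified. `epochContext`, `newContext`, and the map standing in for etcd are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Context is a sketch of the proposed epoch-specific interface.
// Signatures are assumed; only the method names come from the proposal.
type Context interface {
	GetEpoch() uint64
	GetFromID() uint64
	FlagMetaToChild(meta string)
	FlagMetaToParent(meta string)
}

// epochContext pins the epoch at creation time, so later changes to the
// framework's epoch do not affect it.
type epochContext struct {
	epoch  uint64
	fromID uint64
	store  map[string]string // hypothetical stand-in for etcd
}

func (c *epochContext) GetEpoch() uint64  { return c.epoch }
func (c *epochContext) GetFromID() uint64 { return c.fromID }

func (c *epochContext) FlagMetaToChild(meta string) {
	c.store["childMeta"] = fmt.Sprintf("%d-%s", c.epoch, meta)
}

func (c *epochContext) FlagMetaToParent(meta string) {
	c.store["parentMeta"] = fmt.Sprintf("%d-%s", c.epoch, meta)
}

// framework is reduced to the fields this example needs.
type framework struct {
	epoch uint64
	store map[string]string
}

// newContext is a hypothetical helper that captures the current epoch.
func (f *framework) newContext(fromID uint64) Context {
	return &epochContext{epoch: f.epoch, fromID: fromID, store: f.store}
}

// flagAfterEpochChange creates a context at epoch 6, advances the
// framework to epoch 7, and only then flags meta through the context.
func flagAfterEpochChange() string {
	f := &framework{epoch: 6, store: map[string]string{}}
	ctx := f.newContext(1)
	f.epoch = 7
	ctx.FlagMetaToChild("ParamReady")
	return f.store["childMeta"]
}

func main() {
	// Prints "6-ParamReady": the meta keeps the epoch of its data.
	fmt.Println(flagAfterEpochChange())
}
```

Because the epoch is captured when the context is created, a callback that runs after the framework has advanced still stamps its meta with the epoch of the data it is handling.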