* Make the corpus remote wrt workers
This change removes the assumption that the corpus is co-located with the workers. The new design instead assumes the workers are reachable over a high-throughput network (a datacenter setup): the trainer prefetches the modules for the next iteration and ships them to the workers as data blobs.
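A minimal sketch of that flow, assuming hypothetical `corpus` and worker interfaces (`sample`, `load_module_data`, and `compile` are illustrative names, not the actual APIs):

```python
# Hypothetical sketch of the prefetch-and-ship loop described above;
# names and signatures are illustrative, not the actual trainer code.
from concurrent import futures


def run_iteration(corpus, workers):
  # The trainer samples ModuleSpecs and loads the module bytes itself,
  # so the corpus only needs to be reachable from the trainer.
  specs = corpus.sample(k=len(workers))
  with futures.ThreadPoolExecutor() as pool:
    # Prefetch the modules for this iteration as data blobs.
    loaded = list(pool.map(corpus.load_module_data, specs))
  # Ship the blobs to the workers over the high-throughput network.
  return [w.compile(blob) for w, blob in zip(workers, loaded)]
```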
The rest of the change trickles out from this:
- the corpus internally operates on a metadata object describing each module (`ModuleSpec`)
- workers are given a `LoadedModuleSpec`, which contains the module's data (and its ThinLTO index, if needed); see the sketch after this list
- the final shape of the command line is produced on the worker side
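For concreteness, the two objects might look roughly like this (the field names are assumptions for illustration, not the actual definitions):

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ModuleSpec:
  # Corpus-side metadata only: identifies a module and carries its
  # (not yet contextualized) command line; no file contents.
  name: str
  command_line: Tuple[str, ...]


@dataclass(frozen=True)
class LoadedModuleSpec:
  # Worker-side payload: the module bytes themselves, plus the ThinLTO
  # index bytes when the corpus is a ThinLTO corpus.
  name: str
  command_line: Tuple[str, ...]
  module_data: bytes
  thinlto_index: Optional[bytes] = None
```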
This had the side-effect that responsibilities could be reassigned as follows:
- the corpus constructor prepares the `ModuleSpec`s, applying all the necessary flag modification handlers (i.e. add/delete/replace)
- the flag modification logic is generic: it knows nothing about ThinLTO, `-cc1`, etc.
- flag contextualization happens on the worker side: flags reference a context object as a string formatting value, which the worker then provides (a sketch follows this list)
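A sketch of how the generic flag handling and worker-side contextualization could compose (the helper names and the shape of the context object are hypothetical):

```python
from typing import Dict, Iterable, List, Optional, Tuple


def modify_flags(cmdline: Iterable[str],
                 add_flags: Tuple[str, ...] = (),
                 delete_flags: Tuple[str, ...] = (),
                 replace_flags: Optional[Dict[str, str]] = None) -> List[str]:
  # Generic add/delete/replace: nothing here knows about ThinLTO, -cc1,
  # or any other specific flag.
  replace_flags = replace_flags or {}
  result = []
  for flag in cmdline:
    name = flag.split('=', 1)[0]
    if name in delete_flags:
      continue
    result.append(f'{name}={replace_flags[name]}'
                  if name in replace_flags else flag)
  result.extend(add_flags)
  return result


class WorkerContext:
  # Hypothetical context object; the worker fills in machine-local values.
  def __init__(self, **kwargs):
    self.__dict__.update(kwargs)


def contextualize(cmdline: List[str], context: WorkerContext) -> List[str]:
  # Flags may embed a placeholder like '{context.scratch_dir}', which
  # str.format resolves against the worker-provided context object.
  return [flag.format(context=context) for flag in cmdline]
```

Under this scheme, a flag such as `-some-path={context.scratch_dir}` (an invented example) stays opaque to the generic corpus-side logic and is only resolved once the worker supplies its local context.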
In turn, this impacted testing.
This change affects purely local training the same way a previous change that made the policy remote did: workers redundantly copy local files. This does not appear to impact training performance, but should we want to, we can optimize the local scenario under the hood.