Distributed training experience
This page is about distributed training with TensorFlow.
It could use distributed TensorFlow (TFDistributed.py in RETURNN, issue #296) or Horovod (see the RETURNN documentation about Horovod), or a mixture of both.
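As an illustration of the Horovod route, here is a minimal RETURNN config sketch. This is only a sketch: the option names follow the RETURNN Horovod documentation, and the exact names, values, and defaults should be checked against your RETURNN version; `my_config.py` is a placeholder.

```python
# Horovod-related options in a RETURNN config (the config is a Python file).
# Option names as in the RETURNN Horovod documentation; verify for your version.
use_horovod = True                      # enable Horovod-based multi-process training
horovod_reduce_type = "grad"            # synchronous gradient all-reduce across workers
horovod_dataset_distribution = "shard"  # each worker reads its own shard of the data

# Launch with the standard Horovod CLI, e.g. 4 GPUs on a single node:
#   horovodrun -np 4 python rnn.py my_config.py
```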
It could use the new TF dataset pipeline (TFDataPipeline.py in RETURNN, issue #292) or the old data pipeline.
Either route might also require extending some of the existing implementations.
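To make the data-pipeline side concrete, here is a generic tf.data sharding sketch (plain TensorFlow, not RETURNN's TFDataPipeline.py): each worker reads a disjoint shard of the input files, which is one common way to distribute the input pipeline across workers. The file pattern, record format, and worker indices are placeholders; with Horovod they would typically come from `hvd.size()` and `hvd.rank()`.

```python
import tensorflow as tf

def make_dataset(file_pattern, num_workers, worker_index, batch_size=32):
    """Generic sketch: shard the input files so each worker sees a disjoint subset."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    files = files.shard(num_shards=num_workers, index=worker_index)
    dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
    dataset = dataset.shuffle(10000).batch(batch_size)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset
```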
We care about several settings (for the multi-node case, see the cluster-spec sketch after this list):
- single-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS)
- multi-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS)
- AWS settings
- GCP settings (GPU or also TPU)
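For the multi-node settings, plain distributed TensorFlow describes the cluster via the standard TF_CONFIG environment variable. The sketch below is generic TensorFlow usage, not RETURNN's TFDistributed.py implementation; host names, ports, and the Keras model are placeholders.

```python
import json
import os

import tensorflow as tf

# TF_CONFIG must be set before the strategy is created; one entry per worker process.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node01:2222", "node02:2222"]},  # placeholder hosts/ports
    "task": {"type": "worker", "index": 0},                  # this process is worker 0
})

# Collective all-reduce across the cluster described in TF_CONFIG.
# (Under tf.distribute.experimental in TF 2.0-2.3, moved to tf.distribute later.)
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
    model.compile(optimizer="sgd", loss="mse")
```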