
Distributed training experience

Albert Zeyer edited this page Jun 10, 2020 · 7 revisions

This page is about distributed training with TensorFlow. This could use distributed TensorFlow (`TFDistributed.py` in RETURNN, issue #296), or Horovod (see the RETURNN documentation about Horovod), or a mixture of both. It could use either the new TF dataset pipeline (`TFDataPipeline.py` in RETURNN, issue #292) or the old data pipeline. Some of the existing implementations might also need to be extended.
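For the Horovod route, a minimal RETURNN config sketch might look like the following. This is a sketch, not an authoritative reference: the option names and values shown here (`use_horovod`, `horovod_dataset_distribution`, `horovod_reduce_type`) should be checked against the RETURNN Horovod documentation for the RETURNN version in use.

```python
# Fragment of a RETURNN config enabling Horovod-based multi-GPU training.
# Sketch only; verify option names against the RETURNN Horovod documentation.
use_horovod = True
horovod_dataset_distribution = "shard"  # each worker trains on its own shard of the data
horovod_reduce_type = "grad"  # all-reduce the gradients across workers every step
```

Such a config would then typically be launched with one process per GPU, e.g. via `mpirun -np 4 python3 rnn.py my-config.py` for the single-node multi-GPU case (the exact launch command depends on the MPI setup).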

We care about several settings:

  • single-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS)
  • multi-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS)
  • AWS settings
  • GCP settings (GPU or also TPU)