
[RMP] Multi-GPU Data Parallel training for Tensorflow in Merlin Models #536

@viswa-nvidia

Description


Problem:

Single-GPU training takes significantly longer than multi-GPU training. Customers want to accelerate their training workflows by distributing training across multiple GPUs on a single node.

Goal:

Enable customers to do data-parallel training within the Merlin Models training pipeline.
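In data-parallel training, each worker holds a full replica of the model and trains on its own slice of the data. A minimal sketch of the round-robin index sharding this implies, mirroring `tf.data.Dataset.shard(num_shards, index)` semantics (the helper name is hypothetical, not part of Merlin Models):

```python
def shard_indices(num_examples, num_workers, rank):
    """Round-robin shard of example indices for one worker,
    mirroring tf.data.Dataset.shard(num_workers, rank) semantics."""
    return list(range(rank, num_examples, num_workers))
```

Across all ranks the shards are disjoint and cover the dataset, so each global step effectively processes `num_workers` local batches.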

Constraints:

  • Single node
  • Embedding tables fit within the memory of a single GPU
  • Follow NVIDIA best practices, i.e., Horovod
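Under these constraints, the standard Horovod recipe for single-node data parallelism is: one process per GPU, each pinned to its own device; gradients averaged via `DistributedOptimizer`; initial weights broadcast from rank 0; and the learning rate scaled with the number of workers. A minimal sketch using Horovod's Keras API (the `build_model`/`make_dataset` names are hypothetical placeholders, not Merlin Models APIs):

```python
def scaled_lr(base_lr, num_workers):
    # Horovod convention: scale the learning rate linearly with worker count,
    # since the effective global batch size grows by the same factor.
    return base_lr * num_workers

def train(build_model, make_dataset, base_lr=1e-3, epochs=1):
    # Hypothetical wiring; launch with e.g. `horovodrun -np 4 python train.py`.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()
    # Pin this process to exactly one GPU on the node.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Each worker trains on its own shard of the data.
    dataset = make_dataset().shard(hvd.size(), hvd.rank())

    model = build_model()
    # DistributedOptimizer averages gradients across workers each step.
    opt = hvd.DistributedOptimizer(
        tf.keras.optimizers.Adam(scaled_lr(base_lr, hvd.size()))
    )
    model.compile(optimizer=opt, loss="binary_crossentropy")

    callbacks = [
        # Start all workers from identical weights (broadcast from rank 0).
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]
    model.fit(dataset, epochs=epochs, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)
```

Because the embedding tables fit on one GPU, every replica can hold the full model; only the data is partitioned.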

Starting Point:

Example
