Problem:
Training on a single GPU takes significantly longer than on multiple GPUs. Customers would like to accelerate their training workflows by distributing training across multiple GPUs on a single node.
Goal:
Enable customers to run data-parallel training within the Merlin Models training pipeline.
Constraints:
- Single node
- Embedding tables fit within the memory of a single GPU
- Follow NVIDIA best practices; i.e., use Horovod
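
To make the data-parallel pattern concrete, here is a minimal single-process sketch (NumPy only, no Horovod dependency) of what Horovod does per step: each worker holds an identical model replica, computes gradients on its shard of the global batch, and an allreduce averages the gradients so every replica applies the same update. The linear-regression model and shard-splitting scheme here are illustrative assumptions, not part of the Merlin Models pipeline.

```python
import numpy as np

# Simulate data-parallel training on one node: each "worker" (standing in for
# one GPU) holds a replica of the model, processes a shard of the global
# batch, and gradients are averaged across workers (the allreduce step
# Horovod provides) before the weight update.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                    # global batch: 32 samples, 4 features
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=32)
w = np.zeros(4)                                 # model replica (identical on all workers)

def grad(w, Xs, ys):
    # Gradient of mean squared error over one shard.
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

n_workers = 4
shards = [(X[i::n_workers], y[i::n_workers]) for i in range(n_workers)]

# Each worker computes a local gradient on its shard...
local_grads = [grad(w, Xs, ys) for Xs, ys in shards]
# ...then an allreduce averages them so every replica sees the same gradient.
avg_grad = np.mean(local_grads, axis=0)

# With equal shard sizes, the averaged gradient equals the single-GPU
# gradient over the full batch, so convergence behavior is preserved.
assert np.allclose(avg_grad, grad(w, X, y))

w -= 0.1 * avg_grad                             # identical SGD step on every replica
```

In the real setup, Horovod launches one process per GPU (e.g. `horovodrun -np 4 python train.py`) and performs the gradient allreduce with NCCL instead of the in-process averaging shown here.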
Starting Point:
Example