Problem:
We have customers who would like to use multi-GPU Transformers4Rec but are blocked by issues with our existing support for session-based models.
Goal:
- Unblock customer use cases so they can try out T4R to give us feedback
Constraints:
- We don't yet have TorchScript support (which is out of scope for this issue)
Starting Point:
- Enable `DataParallel`/`DistributedDataParallel` training using the HF Trainer for next-item prediction
  - Next item prediction - [BUG] DataParallel training with Trainer is not using multiple GPUs Transformers4Rec#473
  - `DataParallel` works if the model is wrapped manually by the user (i.e. `model = torch.nn.DataParallel(model)`) before training, but that wrapping should happen automatically in the HF Trainer - [BUG] trainer.model.module renamed and DataParallel mode fixed Transformers4Rec#483
  - [feature] Multi-GPU DistributedDataParallel Fixed Transformers4Rec#496 (review)
  - Documentation on multi-GPU training with DataParallel and DistributedDataParallel Transformers4Rec#492
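The manual workaround from Transformers4Rec#483 can be sketched roughly as below. This is a minimal illustration, not the T4R API: the toy model is a hypothetical stand-in for a session-based model, and the conditional wrapping mirrors what the HF Trainer is expected to do automatically.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a T4R session-based model.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8)).to(device)

# Manual DataParallel wrapping, only meaningful when more than one GPU
# is visible; with 0 or 1 devices the model is used as-is.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

batch = torch.randn(4, 16, device=device)  # (batch, features)
out = model(batch)
print(out.shape)  # torch.Size([4, 8])
```

Note that `DataParallel` splits each batch across GPUs inside a single process, so no launcher changes are needed; the fix in #483 is about the Trainer applying this wrapping itself.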
- Fix the serving sections of the existing T4R notebooks
- [Task] Add multi-GPU example for Transformer4Rec PyTorch (Transformers4Rec#508)
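For the `DistributedDataParallel` side (#496, #492), the per-process wiring can be sketched as below. In a real multi-GPU run each process would be spawned by a launcher such as `torchrun`, one per GPU, each with its own rank; this single-process, CPU-only sketch (gloo backend, world size 1) only illustrates the setup and is not the T4R training code.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous settings a launcher like torchrun would normally provide.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Each process holds one replica; DDP synchronizes gradients across them.
model = DDP(nn.Linear(16, 8))

out = model(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 8])

dist.destroy_process_group()
```

Unlike `DataParallel`, DDP uses one process per GPU, which is why the fix required launcher-aware changes rather than just wrapping the model.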
Note: Multi-GPU training for the specific use cases of session-based binary classification / regression is tracked in RMP #708.