Problem:
We have customers who would like to use multi-GPU Transformers4Rec but are blocked by issues with our existing support for session-based models.
Goal:
- Unblock customer use cases so they can try out T4R to give us feedback
Constraints:
- We don't yet have TorchScript support (which is out of scope for this issue)
Starting Point:
- Enable `DataParallel`/`DistributedDataParallel` training using the HF Trainer for next-item prediction
  - Next item prediction - [BUG] DataParallel training with Trainer is not using multiple GPUs Transformers4Rec#473
  - `DataParallel` works if the model is wrapped manually by the user (i.e. `model = torch.nn.DataParallel(model)`) before training, but that wrapping should happen automatically in the HF Trainer - [BUG] trainer.model.module renamed and DataParallel mode fixed Transformers4Rec#483
  - [feature] Multi-GPU DistributedDataParallel Fixed Transformers4Rec#496 (review)
  - Documentation on multi-GPU training with DataParallel and DistributedDataParallel Transformers4Rec#492
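The manual workaround from Transformers4Rec#483 can be sketched roughly as below. This is a minimal illustration, not the T4R API: the toy model is a hypothetical stand-in for a session-based model, and the conditional wrapping mirrors what the HF Trainer is expected to do automatically.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a T4R session-based model.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8)).to(device)

# Manual DataParallel wrapping, only meaningful when more than one GPU
# is visible; with 0 or 1 devices the model is used as-is.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

batch = torch.randn(4, 16, device=device)  # (batch, features)
out = model(batch)
print(out.shape)  # torch.Size([4, 8])
```

Note that `DataParallel` splits each batch across GPUs inside a single process, so no launcher changes are needed; the fix in #483 is about the Trainer applying this wrapping itself.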
- Fix the serving sections of the existing T4R notebooks
- [Task] Add multi-GPU example for Transformer4Rec PyTorch (Transformers4Rec#508)
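For the `DistributedDataParallel` side (#496, #492), the per-process wiring can be sketched as below. In a real multi-GPU run each process would be spawned by a launcher such as `torchrun`, one per GPU, each with its own rank; this single-process, CPU-only sketch (gloo backend, world size 1) only illustrates the setup and is not the T4R training code.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous settings a launcher like torchrun would normally provide.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Each process holds one replica; DDP synchronizes gradients across them.
model = DDP(nn.Linear(16, 8))

out = model(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 8])

dist.destroy_process_group()
```

Unlike `DataParallel`, DDP uses one process per GPU, which is why the fix required launcher-aware changes rather than just wrapping the model.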
Note: Multi-GPU training for the specific use cases of session-based binary classification / regression is tracked in RMP #708.