Distributed computation is not working well; we should switch to DistributedDataParallel for better efficiency.
- Samplers should work on independent data subsets
- Checkpointing needs to be done properly
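A minimal sketch of the sampler point above, using torch's `DistributedSampler` (an assumption about the mechanism, not the final implementation). Passing `num_replicas` and `rank` explicitly lets the example run without initializing a process group; in real DDP training these come from `dist.get_world_size()` / `dist.get_rank()`, and checkpoints should be written by rank 0 only.

```python
from torch.utils.data import DistributedSampler

dataset = list(range(10))  # stand-in for a real Dataset

# Simulate two ranks: each gets a disjoint, strided subset of the data.
shards = [
    sorted(DistributedSampler(dataset, num_replicas=2, rank=r, shuffle=False))
    for r in range(2)
]
print(shards)  # → [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```

With `shuffle=True`, `sampler.set_epoch(epoch)` must be called at the start of each epoch so the per-rank subsets are reshuffled consistently across processes.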
Solve multiple backward issues:
- Backward is called within trainers (using the `no_sync` context might lead to problems if the involved parameters are not the same...)
- Micro-batching (using the `no_sync` context)
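A hypothetical sketch of the micro-batching case, not the trainers' actual code: gradients accumulate locally under `no_sync`, and only the last micro-batch's backward triggers the all-reduce. A single-process `gloo` group on CPU keeps the example runnable; the caveat above still applies if different micro-batches touch different parameters.

```python
import contextlib
import tempfile

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process rendezvous so the sketch runs standalone on CPU.
dist.init_process_group(
    "gloo",
    init_method=f"file://{tempfile.mktemp()}",  # throwaway rendezvous file
    rank=0,
    world_size=1,
)

model = DDP(torch.nn.Linear(4, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
micro_batches = [torch.randn(2, 4) for _ in range(4)]

optimizer.zero_grad()
for i, mb in enumerate(micro_batches):
    last = i == len(micro_batches) - 1
    # Suppress the gradient all-reduce for all but the last micro-batch.
    with contextlib.nullcontext() if last else model.no_sync():
        model(mb).sum().backward()  # grads accumulate locally
optimizer.step()  # one synchronized update for the whole batch
```

With `world_size > 1` the same loop performs a single all-reduce per accumulated batch instead of one per micro-batch.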
See https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
Depends on experimaestro/experimaestro-python#32 since object duplication does not work with the current config/object layout