Sort out distributed computation #32

@bpiwowar

Description

Distributed computation is not working well; we should switch to DistributedDataParallel (DDP) for better efficiency:

  • Samplers should work on independent data subsets
  • Checkpointing needs to be done properly
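The sampler point above amounts to giving each process a disjoint shard of the dataset, which is what `torch.utils.data.DistributedSampler` does. A minimal pure-Python sketch of that round-robin sharding scheme (the function name and the rank/world-size values are illustrative, not part of the codebase):

```python
import random

def shard_indices(dataset_size, rank, world_size, epoch=0, shuffle=False):
    """Return the subset of sample indices assigned to one process,
    mirroring the padding + round-robin scheme of DistributedSampler."""
    indices = list(range(dataset_size))
    if shuffle:
        # Seed with the epoch so every rank agrees on the permutation
        random.Random(epoch).shuffle(indices)
    # Pad with leading indices so all ranks get the same sample count
    padded = indices + indices[: (-dataset_size) % world_size]
    # Each rank takes every world_size-th index, starting at its rank
    return padded[rank::world_size]

# Example: 10 samples sharded across 3 processes
shards = [shard_indices(10, rank, 3) for rank in range(3)]
# Shards are equally sized and jointly cover the whole dataset
```

Equal shard sizes matter because DDP synchronizes gradients every step: if one rank ran out of batches early, the others would block waiting for its all-reduce.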

Solve several backward-related issues:

  • Backward is called within trainers (using the no_sync context might cause problems if the parameters involved are not the same...)
  • Micro-batching (using the no_sync context)
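For micro-batching, the `no_sync` pattern defers the gradient all-reduce until the last micro-batch of an accumulation window. A minimal sketch of the intended control flow, using a stand-in model class (`FakeDDPModel` and `train_step` are hypothetical names introduced only to illustrate when synchronization fires):

```python
from contextlib import contextmanager

class FakeDDPModel:
    """Stand-in for torch.nn.parallel.DistributedDataParallel that
    records whether each backward pass would have synchronized."""
    def __init__(self):
        self._sync = True
        self.sync_log = []

    @contextmanager
    def no_sync(self):
        # Inside this context, backward only accumulates local grads
        self._sync = False
        try:
            yield
        finally:
            self._sync = True

    def backward(self):
        self.sync_log.append(self._sync)

def train_step(model, micro_batches):
    # Skip the all-reduce on all but the last micro-batch
    for i, _ in enumerate(micro_batches):
        if i < len(micro_batches) - 1:
            with model.no_sync():
                model.backward()
        else:
            # Final backward triggers a single gradient all-reduce
            model.backward()

model = FakeDDPModel()
train_step(model, range(4))
# model.sync_log == [False, False, False, True]
```

This is where the concern in the first bullet shows up: if a trainer calls backward internally, the caller cannot easily wrap only the non-final micro-batches in `no_sync`, and the reduction may fire at the wrong time or over a different parameter set.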

See https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Depends on experimaestro/experimaestro-python#32 since object duplication does not work with the current config/object layout
