This could easily be considered a bug report, but it's hard to call this your bug.
What did you find confusing? Please describe.
There is no mention of torch.nn.SyncBatchNorm or its potential to (confusingly) break an SM training job.
Describe how documentation can be improved
Mention that initializing smdistributed with init_process_group prevents the use of certain other torch features, such as (and likely not limited to) torch.nn.SyncBatchNorm, presumably because torch.distributed itself is never initialized.
One alternative would be to scan the estimator source for mentions of SyncBatchNorm; another would be to somehow get torch.nn to import smdistributed.dataparallel.torch.distributed. Even better would be a new SM API replacement for torch.nn, but that seems excessive!
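For concreteness, here is a minimal sketch of the failure mode as I understand it, assuming the v1-style smdistributed.dataparallel.torch.distributed API and a GPU host (exact module paths may vary by toolkit version):

```python
import torch
import torch.distributed
import torch.nn as nn

# v1-style SMDDP import path; assumed from the public SageMaker examples
import smdistributed.dataparallel.torch.distributed as smdist

# Initializes SMDDP's own process group -- torch.distributed is untouched.
smdist.init_process_group()

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)).cuda()

# convert_sync_batchnorm swaps BatchNorm2d for SyncBatchNorm, which
# synchronizes batch statistics through torch.distributed at forward time.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# torch.distributed's default process group was never initialized, so the
# first training-mode forward pass raises instead of synchronizing.
print(torch.distributed.is_initialized())  # False
out = model(torch.randn(4, 3, 32, 32).cuda())  # fails here in training mode
```

Nothing in the docs warns that the two process groups are independent, which is why the breakage is so confusing mid-training-job.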