Status: Open
Labels: Documentation

Description
I wrote some code involving CPU offloading in a DDP context, and I called Fabric.all_reduce with arguments that were CPU tensors on each rank's process.
This fails silently: each rank's tensor is left unchanged afterwards, so no reduction actually happens. The docs are also silent about this. They should state that all_reduce only works if, for rank k, the tensor passed as the argument is on device("cuda", k); otherwise it fails silently.
When you do CPU offloading, there are valid reasons to exchange CPU-resident tensors between processes, so this is (I think) not an unreasonable thing to try. The docs should state the device requirement explicitly, and better still, an exception should be raised instead of failing silently. A minimal sketch of the setup I mean follows below.
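
For reference, here is a minimal sketch of the kind of script I am describing (illustrative only; it assumes a two-GPU machine, the DDP strategy, and a recent lightning.fabric, and the exact printed values are what I would expect, not verified output):

```python
import torch
from lightning.fabric import Fabric


def main():
    # Two processes, one per CUDA device, using the DDP strategy.
    fabric = Fabric(accelerator="cuda", devices=2, strategy="ddp")
    fabric.launch()

    # Each rank contributes a different value, so a correct sum is easy to spot.
    cpu_tensor = torch.tensor([float(fabric.global_rank + 1)])  # stays on the CPU

    # Behaviour described above: with a CPU tensor the reduction silently
    # leaves each rank's value unchanged instead of producing the cross-rank sum.
    reduced_cpu = fabric.all_reduce(cpu_tensor, reduce_op="sum")
    print(f"rank {fabric.global_rank}: CPU input  -> {reduced_cpu}")

    # Moving the tensor to this rank's CUDA device first gives the expected sum.
    cuda_tensor = cpu_tensor.to(fabric.device)
    reduced_cuda = fabric.all_reduce(cuda_tensor, reduce_op="sum")
    print(f"rank {fabric.global_rank}: CUDA input -> {reduced_cuda}")


if __name__ == "__main__":
    main()
```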