
Fabric.all_reduce silently fails when tensors are on CPU #21530

@mseeger

Description

📚 Documentation

I wrote some code involving CPU offloading in a DDP context, and I called Fabric.all_reduce with tensors that lived on the CPU in each rank's process.

This just fails silently: each rank's tensor is unchanged afterwards. The docs are also silent about this. They should state that all_reduce works only if, for rank k, the tensor passed as argument is on device("cuda", k); otherwise it fails silently.
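A minimal reproduction sketch of what I mean (hypothetical script, not part of the original report; assumes a node with at least 2 GPUs):

```python
# Hypothetical repro; launch with e.g.: fabric run --devices 2 repro.py
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=2, strategy="ddp")
fabric.launch()

# Tensor deliberately left on the CPU, as it would be after CPU offloading.
t = torch.full((4,), float(fabric.global_rank))

# Expected: every rank receives the sum over all ranks (here 0 + 1 = 1).
# Observed per this report: each rank's tensor comes back unchanged, no warning.
reduced = fabric.all_reduce(t, reduce_op="sum")
print(f"rank {fabric.global_rank}: {reduced}")
```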

When you do CPU offloading, there are valid reasons to exchange CPU-resident tensors between processes, so this is (I think) not an unreasonable thing to try. The docs should be explicit about the requirement, and even better, an exception should be thrown.
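For illustration, a guard along these lines (a sketch only, not Fabric's actual code; `checked_all_reduce` is a hypothetical helper) would turn the silent failure into a loud one:

```python
import torch
from lightning.fabric import Fabric

def checked_all_reduce(fabric: Fabric, tensor: torch.Tensor, **kwargs):
    # Hypothetical wrapper: refuse a CPU tensor when the strategy runs on CUDA,
    # instead of letting the reduction silently do nothing.
    if fabric.device.type == "cuda" and tensor.device.type != "cuda":
        raise RuntimeError(
            f"all_reduce on a CUDA strategy requires the tensor to be on "
            f"{fabric.device}, but got a tensor on {tensor.device}; "
            f"move it with tensor.to(fabric.device) first."
        )
    return fabric.all_reduce(tensor, **kwargs)
```

In the meantime, moving the tensor to the rank's device and back, e.g. `fabric.all_reduce(t.to(fabric.device)).cpu()`, appears to sidestep the problem.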

cc @lantiga @justusschock


Labels: docs (Documentation related), fabric (lightning.fabric.Fabric)
