
Adopt FakeTensorMode for FSDP #16448

@awaelchli

Description & Motivation

Currently, very large models can't be instantiated before sharding if they don't fit in CPU memory. The solution, similar to how other libraries like DeepSpeed handle it, is to create fake tensors that don't allocate memory, shard them, and only materialize them once they are sharded on each device.
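
A minimal sketch of the underlying mechanism (using the private torch._subclasses import path, which may change since FakeTensorMode is not a public API):

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

# Inside the mode, tensor constructors only record shape/dtype/device
# metadata; no storage is allocated, so even huge tensors are "free".
with FakeTensorMode():
    weight = torch.empty(100_000, 100_000)  # would be ~40 GB if real

print(weight.shape)  # torch.Size([100000, 100000]), yet no memory was allocated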

Pitch

Adopt the FakeTensorMode context manager. It is not officially documented, but we can start experimenting with it in Fabric.

The usage in PyTorch would be:

from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    model = MyModel(...)  # tensors are fake and don't allocate memory

This would translate to

with fabric.sharded_model():
    model = MyModel(...)

in Fabric.
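
Put together, a hypothetical end-to-end flow could look as follows. Note that sharded_model() is only the API proposed here, and the assumption that fabric.setup() shards the fake parameters and materializes only the local shard on each device is part of the proposal, not existing behavior (the exact FSDP strategy flag is also glossed over):

from lightning.fabric import Fabric

fabric = Fabric(strategy="fsdp", devices=4)
fabric.launch()

# Proposed: parameters created in this context are fake and take no memory
with fabric.sharded_model():
    model = MyModel(...)

# Assumed behavior: setup() shards the fake parameters and then
# materializes only the local shard on each device
model = fabric.setup(model)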

Alternatives

A similar mechanism exists in torchdistx.
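
For comparison, torchdistx exposes deferred_init and materialize_module; roughly (a sketch from memory, consult the torchdistx docs for exact signatures):

from torchdistx.deferred_init import deferred_init, materialize_module

# Record the construction of the module without allocating real storage
model = deferred_init(MyModel, ...)

# Later, e.g. after sharding decisions are made, allocate and initialize
materialize_module(model)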

Additional context

Proposed by @justusschock


cc @Borda @carmocca @justusschock @awaelchli


Labels

fabric (lightning.fabric.Fabric), feature (Is an improvement or enhancement), strategy: fsdp (Fully Sharded Data Parallel)
