Labels: repro needed (the issue is missing a reproducible example), ver: 2.4.x
Description
Bug description
When using the FSDP strategy with `HYBRID_SHARD` set, the loss behaves as if only one node is training. When the strategy is set to `FULL_SHARD` (or the other sharding strategies), the loss drops as expected when more nodes are added and the batch size is held constant. I have verified that the NCCL connections are working correctly, since everything behaves as expected with `FULL_SHARD`.
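Since no reproducer was attached, here is a minimal sketch of the setup being described, not the reporter's actual code: a toy model and dataset, with `devices` and `num_nodes` as placeholder values. The only difference between the two runs being compared is the `sharding_strategy` argument passed to `FSDPStrategy`.

```python
# Hypothetical reproducer sketch (assumed model/data; not from the report).
import torch
from torch.utils.data import DataLoader, Dataset

import lightning.pytorch as pl
from lightning.pytorch.strategies import FSDPStrategy


class RandomDataset(Dataset):
    """Toy dataset of random vectors (assumption, stands in for real data)."""

    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    """Minimal module so the loss curve can be compared across strategies."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # Swapping "HYBRID_SHARD" for "FULL_SHARD" here is the only change
    # between the two runs described in the report.
    strategy = FSDPStrategy(sharding_strategy="HYBRID_SHARD")
    trainer = pl.Trainer(
        strategy=strategy,
        accelerator="gpu",
        devices=8,    # GPUs per node (placeholder value)
        num_nodes=2,  # more than one node is needed to observe the issue
        max_epochs=1,
    )
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=8))
```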
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
More info
No response