
Unexpected behaviour when using prepare_data #21058

@bazingayu


Bug description

Hi Lightning team,

We are using prepare_data() to copy data from a shared disk to the local disk. The data itself is copied correctly (we verified the md5 checksums), but when we print the dataloader's file list on different GPUs, the lists contain the same files in a different order:

**GPU 0:** self.__files [**'ez13.npz', 'sc4.npz',** 'ez2.npz', 'mb9.npz', 'bc3.npz', 'lc14.npz', 'lc3.npz', 'ez10.npz', 'lo11.npz', 'ez3.npz', 'Read14_roto.npz', 'lo6.npz', 'sc6.npz', 'mb3.npz', 'lo10.npz', 'lo7.npz', 'lc8.npz', 'mb6.npz', 'bc7.npz', 'sc13.npz', 'se3.npz', 'ez12.npz', 'lc10.npz', 'lc7.npz', 'sc3.npz', 'bc15.npz', 'lc15.npz', 'bc11.npz', 'lc5.npz', 'se12.npz', 'se6.npz', 'bc9.npz', 'sc12.npz', 'lo12.npz', 'ez9.npz', 'mb7.npz', 'lo16.npz', 'se10.npz', 'se8.npz', 'mb10.npz', 'mb14.npz', 'bc12.npz', 'se5.npz', 'ez16.npz', 'se14.npz', 'Read15_roto.npz', 'bc4.npz', 'mb11.npz', 'lc6.npz', 'lo14.npz', 'bc14.npz', 'sc16.npz', 'se11.npz', 'lo2.npz', 'lo9.npz', 'sc8.npz', 'ez4.npz', 'se9.npz', 'sc9.npz', 'lo3.npz', 'lc2.npz', 'bc5.npz', 'se7.npz', 'mb1.npz', 'mb12.npz', 'se1.npz', 'sc2.npz', 'bc16.npz', 'ez8.npz', 'ez1.npz', 'bc10.npz', 'sc14.npz', 'se4.npz', 'lo13.npz', 'lc16.npz', 'mb2.npz', 'lo5.npz', 'lo15.npz', 'bc2.npz', 'ez6.npz', 'bc1.npz', 'lc11.npz', 'bc13.npz', 'mb8.npz', 'lo1.npz', 'sc11.npz', 'se2.npz', 'lc13.npz', 'ez7.npz', 'sc15.npz', 'lc9.npz', 'bc6.npz', 'sc7.npz', 'mb13.npz', 'lc1.npz', 'lo8.npz', 'sc5.npz', 'mb4.npz', 'mb5.npz', 'se13.npz', 'lc12.npz', 'lo4.npz', 'ez15.npz', 'ez5.npz', 'mb15.npz', 'ez11.npz', 'bc8.npz', 'lc4.npz', 'sc10.npz', 'sc1.npz', 'ez14.npz', 'mb16.npz']

**GPU1:** self.__files [**'bc11.npz', 'lo9.npz'**, 'se12.npz', 'bc9.npz', 'lc6.npz', 'se9.npz', 'sc13.npz', 'mb3.npz', 'se3.npz', 'bc16.npz', 'lc5.npz', 'bc6.npz', 'lo6.npz', 'Read14_roto.npz', 'bc7.npz', 'mb10.npz', 'bc5.npz', 'bc2.npz', 'lo2.npz', 'mb13.npz', 'lc7.npz', 'mb16.npz', 'lc16.npz', 'sc11.npz', 'mb6.npz', 'ez16.npz', 'lc8.npz', 'sc9.npz', 'bc1.npz', 'bc14.npz', 'lc3.npz', 'se2.npz', 'lc9.npz', 'lo3.npz', 'lc10.npz', 'sc3.npz', 'bc3.npz', 'lo10.npz', 'se4.npz', 'bc10.npz', 'sc14.npz', 'se11.npz', 'mb7.npz', 'ez1.npz', 'lo13.npz', 'lc14.npz', 'lo8.npz', 'bc4.npz', 'ez4.npz', 'ez8.npz', 'lc11.npz', 'ez6.npz', 'Read15_roto.npz', 'se6.npz', 'lo15.npz', 'lo12.npz', 'ez2.npz', 'sc7.npz', 'ez5.npz', 'lo5.npz', 'sc8.npz', 'se14.npz', 'bc15.npz', 'se1.npz', 'sc6.npz', 'se8.npz', 'ez15.npz', 'mb1.npz', 'lc13.npz', 'se13.npz', 'lc15.npz', 'se10.npz', 'lc2.npz', 'bc13.npz', 'lo7.npz', 'ez14.npz', 'sc5.npz', 'sc2.npz', 'ez13.npz', 'bc12.npz', 'lc1.npz', 'mb11.npz', 'ez11.npz', 'mb12.npz', 'sc15.npz', 'ez9.npz', 'mb4.npz', 'sc12.npz', 'se5.npz', 'mb15.npz', 'se7.npz', 'sc4.npz', 'lc12.npz', 'ez12.npz', 'lo4.npz', 'ez10.npz', 'ez7.npz', 'mb5.npz', 'sc16.npz', 'mb9.npz', 'mb8.npz', 'ez3.npz', 'bc8.npz', 'lo16.npz', 'sc10.npz', 'lo1.npz', 'lo14.npz', 'mb14.npz', 'lo11.npz', 'mb2.npz', 'lc4.npz', 'sc1.npz']
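For reference, here is a minimal sketch of one way two ranks can end up with the same files in different orders. It assumes the file list is built with os.listdir (an assumption about our setup() code, shown only for illustration, and LOCAL_DATA_ROOT here is a placeholder path): os.listdir makes no ordering guarantee, so without sorting, each process may see its own permutation of the same directory.

```python
import os

# Hypothetical path: wherever prepare_data() copied the training files to.
LOCAL_DATA_ROOT = "/local/data/train"

# os.listdir gives entries in arbitrary, filesystem-dependent order, so two
# ranks listing the same directory can receive the same names in different orders.
files = os.listdir(LOCAL_DATA_ROOT)

# Sorting makes the list deterministic and identical on every rank.
files = sorted(files)
```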


This issue caused problems when we used FSDP (potentially also DDP). I printed the filenames seen by the dataloader and found that some files were never processed while others were processed multiple times (a small sampler sketch below illustrates why this pattern appears):

| filename | count |
| --- | --- |
| ez13.npz | 0 |
| sc4.npz | 1 |
| ez2.npz | 2 |
| mb9.npz | 1 |
| bc3.npz | 1 |
| lc14.npz | 1 |
| lc3.npz | 1 |
| ez10.npz | 1 |
| lo11.npz | 1 |
| ez3.npz | 1 |
| Read14_roto.npz | 1 |
| lo6.npz | 2 |
| sc6.npz | 1 |
| mb3.npz | 1 |
| lo10.npz | 1 |
| lo7.npz | 1 |
| lc8.npz | 0 |
| mb6.npz | 2 |

I can tell that the copying itself ran in a single process, but there still seems to be something wrong around the prepare_data() flow.
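For context, here is a minimal, self-contained sketch (toy data, not our training code) of why a per-rank ordering difference produces exactly this missing/duplicated pattern: a DistributedSampler partitions indices, not filenames, so if index i maps to a different file on each rank, some files are loaded twice and others never. Whether Lightning inserts a DistributedSampler for our run is an assumption on my part (it does so by default for distributed strategies, as far as I know).

```python
from torch.utils.data import Dataset, DistributedSampler


class FileListDataset(Dataset):
    """Toy dataset that just returns filenames."""

    def __init__(self, files):
        self.files = files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return self.files[idx]


# Same set of files, but listed in a different order on each rank,
# as in the GPU 0 / GPU 1 printouts above.
rank0_files = ["ez13.npz", "sc4.npz", "ez2.npz", "lo6.npz"]
rank1_files = ["sc4.npz", "ez2.npz", "lo6.npz", "ez13.npz"]

seen = []
for rank, files in enumerate([rank0_files, rank1_files]):
    sampler = DistributedSampler(
        FileListDataset(files), num_replicas=2, rank=rank, shuffle=False
    )
    seen.extend(files[i] for i in sampler)

# Some files appear twice, others not at all, even though both ranks
# hold exactly the same set of files.
print(sorted(seen))
```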

What version are you seeing the problem on?

v2.4

Reproduced in studio

No response

How to reproduce the bug

import os
import shutil
from pathlib import Path

# train_paths and LOCAL_DATA_ROOT are defined elsewhere in our module.

def prepare_data(self):
    """
    Copy datasets to the local disk for faster access during training.
    This is a LightningDataModule hook, called once on the CPU before the
    dataloaders are created in setup().
    """
    for source_path in train_paths:
        # Keep the last two path components of the source under LOCAL_DATA_ROOT.
        target_path = os.path.join(LOCAL_DATA_ROOT, *Path(source_path).parts[-2:])
        if not os.path.exists(target_path):
            print(f"Copying training data from {source_path} to {target_path}...")
            shutil.copytree(source_path, target_path, dirs_exist_ok=True)
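To make the snippet above closer to runnable, here is a hypothetical minimal DataModule around it. LOCAL_DATA_ROOT, train_paths, and NpzDataset are placeholders and not our real code; the print in setup() marks where the per-rank file lists shown earlier were observed.

```python
import os

import lightning as L
from torch.utils.data import DataLoader, Dataset

# Hypothetical placeholders; the real values live in our training config.
LOCAL_DATA_ROOT = "/local/data"
train_paths = ["/shared/data/train"]


class NpzDataset(Dataset):
    """Toy stand-in for the real dataset: one item per .npz file."""

    def __init__(self, root):
        # Note: os.listdir gives no ordering guarantee across processes.
        self.files = os.listdir(root)

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return self.files[idx]


class CopyToLocalDataModule(L.LightningDataModule):
    def prepare_data(self):
        # Same copying logic as the prepare_data() shown above.
        ...

    def setup(self, stage=None):
        self.train_ds = NpzDataset(os.path.join(LOCAL_DATA_ROOT, "train"))
        # This print is where the per-rank file lists above were observed.
        print(os.environ.get("LOCAL_RANK", "0"), self.train_ds.files)

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=1)
```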

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

cc @lantiga @Borda
