
Unexpected behaviour when using prepare_data #21058

@bazingayu


Bug description

Hi Lightning team,

We are using prepare_data() to copy data from a shared disk to the local disk. The data itself is copied correctly (we verified the md5 checksums), but when we print the dataloader's file list on different GPUs, the lists contain the same files in a different order:

**GPU 0:** self.__files [**'ez13.npz', 'sc4.npz',** 'ez2.npz', 'mb9.npz', 'bc3.npz', 'lc14.npz', 'lc3.npz', 'ez10.npz', 'lo11.npz', 'ez3.npz', 'Read14_roto.npz', 'lo6.npz', 'sc6.npz', 'mb3.npz', 'lo10.npz', 'lo7.npz', 'lc8.npz', 'mb6.npz', 'bc7.npz', 'sc13.npz', 'se3.npz', 'ez12.npz', 'lc10.npz', 'lc7.npz', 'sc3.npz', 'bc15.npz', 'lc15.npz', 'bc11.npz', 'lc5.npz', 'se12.npz', 'se6.npz', 'bc9.npz', 'sc12.npz', 'lo12.npz', 'ez9.npz', 'mb7.npz', 'lo16.npz', 'se10.npz', 'se8.npz', 'mb10.npz', 'mb14.npz', 'bc12.npz', 'se5.npz', 'ez16.npz', 'se14.npz', 'Read15_roto.npz', 'bc4.npz', 'mb11.npz', 'lc6.npz', 'lo14.npz', 'bc14.npz', 'sc16.npz', 'se11.npz', 'lo2.npz', 'lo9.npz', 'sc8.npz', 'ez4.npz', 'se9.npz', 'sc9.npz', 'lo3.npz', 'lc2.npz', 'bc5.npz', 'se7.npz', 'mb1.npz', 'mb12.npz', 'se1.npz', 'sc2.npz', 'bc16.npz', 'ez8.npz', 'ez1.npz', 'bc10.npz', 'sc14.npz', 'se4.npz', 'lo13.npz', 'lc16.npz', 'mb2.npz', 'lo5.npz', 'lo15.npz', 'bc2.npz', 'ez6.npz', 'bc1.npz', 'lc11.npz', 'bc13.npz', 'mb8.npz', 'lo1.npz', 'sc11.npz', 'se2.npz', 'lc13.npz', 'ez7.npz', 'sc15.npz', 'lc9.npz', 'bc6.npz', 'sc7.npz', 'mb13.npz', 'lc1.npz', 'lo8.npz', 'sc5.npz', 'mb4.npz', 'mb5.npz', 'se13.npz', 'lc12.npz', 'lo4.npz', 'ez15.npz', 'ez5.npz', 'mb15.npz', 'ez11.npz', 'bc8.npz', 'lc4.npz', 'sc10.npz', 'sc1.npz', 'ez14.npz', 'mb16.npz']

**GPU1:** self.__files [**'bc11.npz', 'lo9.npz'**, 'se12.npz', 'bc9.npz', 'lc6.npz', 'se9.npz', 'sc13.npz', 'mb3.npz', 'se3.npz', 'bc16.npz', 'lc5.npz', 'bc6.npz', 'lo6.npz', 'Read14_roto.npz', 'bc7.npz', 'mb10.npz', 'bc5.npz', 'bc2.npz', 'lo2.npz', 'mb13.npz', 'lc7.npz', 'mb16.npz', 'lc16.npz', 'sc11.npz', 'mb6.npz', 'ez16.npz', 'lc8.npz', 'sc9.npz', 'bc1.npz', 'bc14.npz', 'lc3.npz', 'se2.npz', 'lc9.npz', 'lo3.npz', 'lc10.npz', 'sc3.npz', 'bc3.npz', 'lo10.npz', 'se4.npz', 'bc10.npz', 'sc14.npz', 'se11.npz', 'mb7.npz', 'ez1.npz', 'lo13.npz', 'lc14.npz', 'lo8.npz', 'bc4.npz', 'ez4.npz', 'ez8.npz', 'lc11.npz', 'ez6.npz', 'Read15_roto.npz', 'se6.npz', 'lo15.npz', 'lo12.npz', 'ez2.npz', 'sc7.npz', 'ez5.npz', 'lo5.npz', 'sc8.npz', 'se14.npz', 'bc15.npz', 'se1.npz', 'sc6.npz', 'se8.npz', 'ez15.npz', 'mb1.npz', 'lc13.npz', 'se13.npz', 'lc15.npz', 'se10.npz', 'lc2.npz', 'bc13.npz', 'lo7.npz', 'ez14.npz', 'sc5.npz', 'sc2.npz', 'ez13.npz', 'bc12.npz', 'lc1.npz', 'mb11.npz', 'ez11.npz', 'mb12.npz', 'sc15.npz', 'ez9.npz', 'mb4.npz', 'sc12.npz', 'se5.npz', 'mb15.npz', 'se7.npz', 'sc4.npz', 'lc12.npz', 'ez12.npz', 'lo4.npz', 'ez10.npz', 'ez7.npz', 'mb5.npz', 'sc16.npz', 'mb9.npz', 'mb8.npz', 'ez3.npz', 'bc8.npz', 'lo16.npz', 'sc10.npz', 'lo1.npz', 'lo14.npz', 'mb14.npz', 'lo11.npz', 'mb2.npz', 'lc4.npz', 'sc1.npz']
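For reference, here is a minimal sketch of one way two ranks can end up with the same files in different orders. It assumes the file list is built with os.listdir (an assumption about our setup() code, shown only for illustration, and LOCAL_DATA_ROOT here is a placeholder path): os.listdir makes no ordering guarantee, so without sorting, each process may see its own permutation of the same directory.

```python
import os

# Hypothetical path: wherever prepare_data() copied the training files to.
LOCAL_DATA_ROOT = "/local/data/train"

# os.listdir gives entries in arbitrary, filesystem-dependent order, so two
# ranks listing the same directory can receive the same names in different orders.
files = os.listdir(LOCAL_DATA_ROOT)

# Sorting makes the list deterministic and identical on every rank.
files = sorted(files)
```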


This issue caused problems when we used FSDP (potentially also DDP). I printed the filenames seen by the dataloader and found that some files were never processed while others were processed multiple times (a small sampler sketch below illustrates why this pattern appears):

| filename | count |
| --- | --- |
| ez13.npz | 0 |
| sc4.npz | 1 |
| ez2.npz | 2 |
| mb9.npz | 1 |
| bc3.npz | 1 |
| lc14.npz | 1 |
| lc3.npz | 1 |
| ez10.npz | 1 |
| lo11.npz | 1 |
| ez3.npz | 1 |
| Read14_roto.npz | 1 |
| lo6.npz | 2 |
| sc6.npz | 1 |
| mb3.npz | 1 |
| lo10.npz | 1 |
| lo7.npz | 1 |
| lc8.npz | 0 |
| mb6.npz | 2 |

I can tell that the copying itself ran in a single process, but there still seems to be something wrong around the prepare_data() flow.
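For context, here is a minimal, self-contained sketch (toy data, not our training code) of why a per-rank ordering difference produces exactly this missing/duplicated pattern: a DistributedSampler partitions indices, not filenames, so if index i maps to a different file on each rank, some files are loaded twice and others never. Whether Lightning inserts a DistributedSampler for our run is an assumption on my part (it does so by default for distributed strategies, as far as I know).

```python
from torch.utils.data import Dataset, DistributedSampler


class FileListDataset(Dataset):
    """Toy dataset that just returns filenames."""

    def __init__(self, files):
        self.files = files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return self.files[idx]


# Same set of files, but listed in a different order on each rank,
# as in the GPU 0 / GPU 1 printouts above.
rank0_files = ["ez13.npz", "sc4.npz", "ez2.npz", "lo6.npz"]
rank1_files = ["sc4.npz", "ez2.npz", "lo6.npz", "ez13.npz"]

seen = []
for rank, files in enumerate([rank0_files, rank1_files]):
    sampler = DistributedSampler(
        FileListDataset(files), num_replicas=2, rank=rank, shuffle=False
    )
    seen.extend(files[i] for i in sampler)

# Some files appear twice, others not at all, even though both ranks
# hold exactly the same set of files.
print(sorted(seen))
```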

What version are you seeing the problem on?

v2.4

Reproduced in studio

No response

How to reproduce the bug

import os
import shutil
from pathlib import Path

# train_paths and LOCAL_DATA_ROOT are defined elsewhere in our module.

def prepare_data(self):
    """
    Copy datasets to the local disk for faster access during training.
    This is a LightningDataModule hook, called once on the CPU before the
    dataloaders are created in setup().
    """
    for source_path in train_paths:
        # Keep the last two path components of the source under LOCAL_DATA_ROOT.
        target_path = os.path.join(LOCAL_DATA_ROOT, *Path(source_path).parts[-2:])
        if not os.path.exists(target_path):
            print(f"Copying training data from {source_path} to {target_path}...")
            shutil.copytree(source_path, target_path, dirs_exist_ok=True)
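To make the snippet above closer to runnable, here is a hypothetical minimal DataModule around it. LOCAL_DATA_ROOT, train_paths, and NpzDataset are placeholders and not our real code; the print in setup() marks where the per-rank file lists shown earlier were observed.

```python
import os

import lightning as L
from torch.utils.data import DataLoader, Dataset

# Hypothetical placeholders; the real values live in our training config.
LOCAL_DATA_ROOT = "/local/data"
train_paths = ["/shared/data/train"]


class NpzDataset(Dataset):
    """Toy stand-in for the real dataset: one item per .npz file."""

    def __init__(self, root):
        # Note: os.listdir gives no ordering guarantee across processes.
        self.files = os.listdir(root)

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return self.files[idx]


class CopyToLocalDataModule(L.LightningDataModule):
    def prepare_data(self):
        # Same copying logic as the prepare_data() shown above.
        ...

    def setup(self, stage=None):
        self.train_ds = NpzDataset(os.path.join(LOCAL_DATA_ROOT, "train"))
        # This print is where the per-rank file lists above were observed.
        print(os.environ.get("LOCAL_RANK", "0"), self.train_ds.files)

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=1)
```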

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

cc @lantiga @Borda
