Open
Labels: bug (Something isn't working), lightningdatamodule (pl.LightningDataModule), ver: 2.4.x
Description
Bug description
Hi Lightning team,
We are trying to use prepare_data()
to move the data from a shared disk to the local disk.
Even though the correct data is copied to the local disk (I have checked the md5 checksums), the file list printed by the dataloader is different on different GPUs:
**GPU 0:** self.__files [**'ez13.npz', 'sc4.npz',** 'ez2.npz', 'mb9.npz', 'bc3.npz', 'lc14.npz', 'lc3.npz', 'ez10.npz', 'lo11.npz', 'ez3.npz', 'Read14_roto.npz', 'lo6.npz', 'sc6.npz', 'mb3.npz', 'lo10.npz', 'lo7.npz', 'lc8.npz', 'mb6.npz', 'bc7.npz', 'sc13.npz', 'se3.npz', 'ez12.npz', 'lc10.npz', 'lc7.npz', 'sc3.npz', 'bc15.npz', 'lc15.npz', 'bc11.npz', 'lc5.npz', 'se12.npz', 'se6.npz', 'bc9.npz', 'sc12.npz', 'lo12.npz', 'ez9.npz', 'mb7.npz', 'lo16.npz', 'se10.npz', 'se8.npz', 'mb10.npz', 'mb14.npz', 'bc12.npz', 'se5.npz', 'ez16.npz', 'se14.npz', 'Read15_roto.npz', 'bc4.npz', 'mb11.npz', 'lc6.npz', 'lo14.npz', 'bc14.npz', 'sc16.npz', 'se11.npz', 'lo2.npz', 'lo9.npz', 'sc8.npz', 'ez4.npz', 'se9.npz', 'sc9.npz', 'lo3.npz', 'lc2.npz', 'bc5.npz', 'se7.npz', 'mb1.npz', 'mb12.npz', 'se1.npz', 'sc2.npz', 'bc16.npz', 'ez8.npz', 'ez1.npz', 'bc10.npz', 'sc14.npz', 'se4.npz', 'lo13.npz', 'lc16.npz', 'mb2.npz', 'lo5.npz', 'lo15.npz', 'bc2.npz', 'ez6.npz', 'bc1.npz', 'lc11.npz', 'bc13.npz', 'mb8.npz', 'lo1.npz', 'sc11.npz', 'se2.npz', 'lc13.npz', 'ez7.npz', 'sc15.npz', 'lc9.npz', 'bc6.npz', 'sc7.npz', 'mb13.npz', 'lc1.npz', 'lo8.npz', 'sc5.npz', 'mb4.npz', 'mb5.npz', 'se13.npz', 'lc12.npz', 'lo4.npz', 'ez15.npz', 'ez5.npz', 'mb15.npz', 'ez11.npz', 'bc8.npz', 'lc4.npz', 'sc10.npz', 'sc1.npz', 'ez14.npz', 'mb16.npz']
**GPU1:** self.__files [**'bc11.npz', 'lo9.npz'**, 'se12.npz', 'bc9.npz', 'lc6.npz', 'se9.npz', 'sc13.npz', 'mb3.npz', 'se3.npz', 'bc16.npz', 'lc5.npz', 'bc6.npz', 'lo6.npz', 'Read14_roto.npz', 'bc7.npz', 'mb10.npz', 'bc5.npz', 'bc2.npz', 'lo2.npz', 'mb13.npz', 'lc7.npz', 'mb16.npz', 'lc16.npz', 'sc11.npz', 'mb6.npz', 'ez16.npz', 'lc8.npz', 'sc9.npz', 'bc1.npz', 'bc14.npz', 'lc3.npz', 'se2.npz', 'lc9.npz', 'lo3.npz', 'lc10.npz', 'sc3.npz', 'bc3.npz', 'lo10.npz', 'se4.npz', 'bc10.npz', 'sc14.npz', 'se11.npz', 'mb7.npz', 'ez1.npz', 'lo13.npz', 'lc14.npz', 'lo8.npz', 'bc4.npz', 'ez4.npz', 'ez8.npz', 'lc11.npz', 'ez6.npz', 'Read15_roto.npz', 'se6.npz', 'lo15.npz', 'lo12.npz', 'ez2.npz', 'sc7.npz', 'ez5.npz', 'lo5.npz', 'sc8.npz', 'se14.npz', 'bc15.npz', 'se1.npz', 'sc6.npz', 'se8.npz', 'ez15.npz', 'mb1.npz', 'lc13.npz', 'se13.npz', 'lc15.npz', 'se10.npz', 'lc2.npz', 'bc13.npz', 'lo7.npz', 'ez14.npz', 'sc5.npz', 'sc2.npz', 'ez13.npz', 'bc12.npz', 'lc1.npz', 'mb11.npz', 'ez11.npz', 'mb12.npz', 'sc15.npz', 'ez9.npz', 'mb4.npz', 'sc12.npz', 'se5.npz', 'mb15.npz', 'se7.npz', 'sc4.npz', 'lc12.npz', 'ez12.npz', 'lo4.npz', 'ez10.npz', 'ez7.npz', 'mb5.npz', 'sc16.npz', 'mb9.npz', 'mb8.npz', 'ez3.npz', 'bc8.npz', 'lo16.npz', 'sc10.npz', 'lo1.npz', 'lo14.npz', 'mb14.npz', 'lo11.npz', 'mb2.npz', 'lc4.npz', 'sc1.npz']
This issue caused problems when we used FSDP (potentially also for DDP). I printed the filenames seen by the dataloader and found that some data was missing while other data was processed multiple times:
| filename | count |
| --- | --- |
| ez13.npz | 0 |
| sc4.npz | 1 |
| ez2.npz | 2 |
| mb9.npz | 1 |
| bc3.npz | 1 |
| lc14.npz | 1 |
| lc3.npz | 1 |
| ez10.npz | 1 |
| lo11.npz | 1 |
| ez3.npz | 1 |
| Read14_roto.npz | 1 |
| lo6.npz | 2 |
| sc6.npz | 1 |
| mb3.npz | 1 |
| lo10.npz | 1 |
| lo7.npz | 1 |
| lc8.npz | 0 |
| mb6.npz | 2 |
I can tell that the copying happened in a single thread, but there still seems to be something wrong within the prepare_data() function.
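What I suspect is happening (my own illustration, not code from our project; the file names and the simplified sampler split below are made up): DistributedSampler partitions dataset indices across ranks, so if each rank builds its file list in a different order, the same index points to a different file on each rank, and some files are read twice while others are never read. A minimal sketch:

# Minimal sketch of how a per-rank difference in file ordering leads to
# duplicated and missing samples. File names are hypothetical; only the
# index-splitting logic matters.
files_rank0 = ["a.npz", "b.npz", "c.npz", "d.npz"]
files_rank1 = ["c.npz", "a.npz", "d.npz", "b.npz"]  # same files, different order

world_size = 2
seen = []
for rank, files in enumerate([files_rank0, files_rank1]):
    # simplified stand-in for how DistributedSampler splits indices per rank
    indices = list(range(rank, len(files), world_size))
    seen += [files[i] for i in indices]

print(sorted(seen))  # ['a.npz', 'a.npz', 'b.npz', 'c.npz'] -> 'a' read twice, 'd' never read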
What version are you seeing the problem on?
v2.4
Reproduced in studio
No response
How to reproduce the bug
import os
import shutil
from pathlib import Path

def prepare_data(self):
    """
    Copy datasets to local disk for faster access during training.
    This is a LightningDataModule callback, called once on the CPU before
    dataloaders are initialized in the setup() callback.
    """
    # train_paths and LOCAL_DATA_ROOT are defined elsewhere in our module
    for source_path in train_paths:
        target_path = os.path.join(LOCAL_DATA_ROOT, *(Path(source_path).parts[-2:]))
        if not os.path.exists(target_path):
            print(f"Copying training data from {source_path} to {target_path}...")
            shutil.copytree(source_path, target_path, dirs_exist_ok=True)
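For context, the workaround we are experimenting with is to build the file list deterministically in setup(), so the ordering no longer depends on filesystem enumeration order. This is a sketch only: the sorted() call and the .npz filter are assumptions on my side, and self.__files / LOCAL_DATA_ROOT mirror the names used above.

def setup(self, stage=None):
    # sorted() makes the ordering identical on every rank; os.listdir()
    # gives no guarantee that different processes or nodes see the same order.
    self.__files = sorted(
        f for f in os.listdir(LOCAL_DATA_ROOT) if f.endswith(".npz")
    )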
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
More info
No response